Wednesday, 4 August 2021

Random sampling stratified on column A while making sure every unique value of column B is picked at least once across all strata - Pandas

I have a dataset of 3 columns and 600K rows; let's call the columns A, B and C. I need to randomly sample this dataset, stratified on column A, but I also want to make sure that every unique value of B (~22K unique values) is picked at least once across all groups of A.

Below is some mock data I created, followed by the code I currently use, to show my steps and the problem.

Mock-up Data:

import pandas as pd
import random
import string
import numpy as np

# Some random data frame generation with columns A, B, C
strata_groups=['st1', 'st2', 'st3', 'st4']
A = random.choices(strata_groups, k=10000)
B = random.choices(string.ascii_lowercase,k=10000)
C = np.random.randint(1, 6, 10000)
original_data=pd.DataFrame({'A': A, 'B': B, 'C': C})

Stratified sampling based on column A with predefined proportions:

strata_names_under_col_A = ['st1', 'st2', 'st3', 'st4']
strata_prop = [0.25, 0.40, 0.15, 0.20]  # let's assume 4 values of A we stratify
desired_sample_size = 900  # let's say we want a sample of 900 out of 10k
some_seed = 42  # placeholder value; any fixed integer makes the draw reproducible

samples = []
for name, prop in zip(strata_names_under_col_A, strata_prop):
    # Number of rows to draw from this stratum
    length_of_strata = round(desired_sample_size * prop)
    data_filtered_to_strat = original_data[original_data['A'] == name]
    data_temp = data_filtered_to_strat.sample(replace=True, n=length_of_strata, random_state=some_seed)
    samples.append(data_temp)

# Combine the per-stratum samples into one data frame
stratified_data = pd.concat(samples)

With the code above, I do achieve the desired proportions of column A on my actual data (not the mock-up). However, the original data has about 22,600 unique values of B, while the sample has only about 22,450, so roughly 150 unique values of B are missing from the sample. I do not need multi-level strata of A x B: B does not have to be stratified, each of its values just needs to occur at least once in the sample.
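To make the gap concrete, here is a minimal sketch of how the missing values of B can be listed. The helper name `missing_unique_values` and the tiny data frames are made up for illustration; they are not part of my real data.

```python
import pandas as pd

def missing_unique_values(original: pd.DataFrame, sample: pd.DataFrame, col: str) -> set:
    """Return the values of `col` that occur in `original` but not in `sample`."""
    return set(original[col].unique()) - set(sample[col].unique())

# Tiny made-up frames, only to illustrate the check
original = pd.DataFrame({"A": ["st1", "st1", "st2", "st2"],
                         "B": ["x", "y", "y", "z"]})
sample = original.iloc[[1, 2]]  # a "sample" that happens to contain only B == 'y'

missing = missing_unique_values(original, sample, "B")
print(sorted(missing))  # ['x', 'z']
```

On the real data, the length of this set would be the roughly 150 values mentioned above.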

I could not reproduce the exact problem with the mock data above, and I cannot share the actual data. How can I make sure that, while keeping the desired proportions of column A, every unique value of column B is represented in the sample? It does not matter which stratum of column A a value falls under: it can occur in every stratum or in only one. As long as it is captured in one group, that is enough for me.
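One possible direction, shown only as a rough sketch and not a verified solution, is a two-phase draw: first keep one random row per unique value of B, then top up each stratum of A until it reaches its quota. The function name `stratified_sample_covering`, its parameters, and the mock data below are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def stratified_sample_covering(df, strata_col, cover_col, proportions, n, seed=0):
    """Draw about n rows with the given proportions of `strata_col`,
    keeping every unique value of `cover_col` at least once."""
    # Phase 1: shuffle, then keep the first row per unique cover_col value.
    shuffled = df.sample(frac=1, random_state=seed)
    seed_rows = shuffled.drop_duplicates(subset=cover_col)
    parts = [seed_rows]
    # Phase 2: top up each stratum until it reaches its quota.
    for k, (name, prop) in enumerate(proportions.items()):
        quota = round(n * prop)
        already = int((seed_rows[strata_col] == name).sum())
        need = quota - already
        if need > 0:
            pool = df[df[strata_col] == name]
            # Sampling with replacement, as in the code above
            parts.append(pool.sample(n=need, replace=True, random_state=seed + k + 1))
    return pd.concat(parts, ignore_index=True)

# Mock data in the same shape as above (values are random, not real)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "A": rng.choice(["st1", "st2", "st3", "st4"], size=10000),
    "B": rng.choice(list("abcdefghijklmnopqrstuvwxyz"), size=10000),
    "C": rng.integers(1, 6, size=10000),
})
props = {"st1": 0.25, "st2": 0.40, "st3": 0.15, "st4": 0.20}
result = stratified_sample_covering(df, "A", "B", props, n=900, seed=42)
```

A caveat of this sketch: if phase 1 alone already puts more rows into a stratum than its quota allows, that stratum ends up over-represented and the total exceeds n. With ~22K unique values of B, the desired sample size would have to be comfortably larger than that for the proportions to hold.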



