Monday, March 28, 2022

Random sampling of many groups within a dataframe if group size is greater than a value

I'm a newbie. Consider the following: for any LOC_ID that has more than 3 unique SUB_IDs, I take a random sample of 3 of its unique SUB_IDs and piece together a new dataframe with pd.concat:

dataframe_in
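Since the example input referred to as dataframe_in is not reproduced here, a small hypothetical test_df along the same lines (illustrative values only) can be built to make the snippet below runnable:

import pandas as pd

# Hypothetical example data standing in for the original dataframe_in:
# several rows per LOC_ID, each carrying a SUB_ID.
test_df = pd.DataFrame({
    'LOC_ID': ['L01', 'L01', 'L01', 'L01', 'L01', 'L02', 'L02', 'L03'],
    'SUB_ID': ['S1', 'S2', 'S3', 'S4', 'S5', 'S6', 'S7', 'S8'],
})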

import numpy as np

df_concat = pd.DataFrame(columns=['LOC_ID', 'SUB_ID'])

for loc in test_df['LOC_ID'].unique():
    # Unique SUB_IDs belonging to this LOC_ID
    sub_ids = test_df[test_df['LOC_ID'] == loc]['SUB_ID'].unique()
    if len(sub_ids) > 3:
        # Reset the seed so each group's draw is reproducible
        np.random.seed(622)
        sub_ids = np.random.choice(sub_ids, size=3, replace=False)
    # Filter on LOC_ID as well as SUB_ID, so rows from other locations
    # that happen to share a SUB_ID are not pulled in
    df_sample = test_df[(test_df['LOC_ID'] == loc) & (test_df['SUB_ID'].isin(sub_ids))]
    df_concat = pd.concat([df_concat, df_sample], ignore_index=True)

dataframe_out
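For comparison, the same per-LOC_ID cap can also be sketched without the explicit loop, using drop_duplicates plus a grouped sample on the (LOC_ID, SUB_ID) pairs; this is only one possible formulation, continuing from the test_df above and assuming a single fixed random_state is acceptable:

# Sketch of a loop-free version of the cap above: one row per (LOC_ID, SUB_ID)
# pair, sample at most 3 SUB_IDs per LOC_ID, then keep only the rows of
# test_df whose pair survived the sampling.
pairs = test_df[['LOC_ID', 'SUB_ID']].drop_duplicates()
kept = (
    pairs.groupby('LOC_ID', group_keys=False)
         .apply(lambda g: g.sample(n=min(len(g), 3), random_state=622))
)
df_capped = test_df.merge(kept, on=['LOC_ID', 'SUB_ID'], how='inner')

The inner merge keeps every original column for the surviving (LOC_ID, SUB_ID) pairs.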

The following is the real case, where I also have a year component spanning 6 years. Each LOC_ID within locids_sub has more than 10 unique SUB_IDs in at least one year. Per LOC_ID, per year, I need to ensure that there are no more than 10 unique SUB_IDs:

max_subid = 10
years = df_bb['Year'].unique()

# Number of unique SUB_IDs per (LOC_ID, Year) combination
count_df = df_bb.groupby(['LOC_ID', 'Year']).nunique().reset_index()
# LOC_IDs that exceed the cap in at least one year
locids_sub = count_df[count_df['SUB_ID'] > max_subid]['LOC_ID'].unique()

# Subset df_bb by locids_sub
df_bb_sub = df_bb[df_bb['LOC_ID'].isin(locids_sub)][['Year', 'LOC_ID', 'SUB_ID']]

df_concat = pd.DataFrame(columns=df_bb.columns)  # Initializing df

for year in years:
    for loc in locids_sub:
        sub_ids = df_bb_sub[(df_bb_sub['Year'] == year) & (df_bb_sub['LOC_ID'] == loc)]['SUB_ID'].unique()
        if len(sub_ids) > max_subid:
            # Seed derived from the year and the last two characters of the
            # LOC_ID (assumes LOC_ID is a string ending in two digits)
            np.random.seed(year + int(loc[-2:]))
            sub_ids = np.random.choice(sub_ids, size=max_subid, replace=False)
        # Filter on Year and LOC_ID as well as SUB_ID, so rows for the same
        # SUB_ID in other years or locations are not pulled in
        df_sample = df_bb[(df_bb['Year'] == year)
                          & (df_bb['LOC_ID'] == loc)
                          & (df_bb['SUB_ID'].isin(sub_ids))]
        df_concat = pd.concat([df_concat, df_sample], ignore_index=True)

With 6 years, 1,460 LOC_IDs in locids_sub, and 1,828,201 rows in df_bb, this takes 30 minutes to run.
Please let me know how to make this more efficient.
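Not a full answer, but one common pandas pattern is to collect the per-group pieces in a list and call pd.concat once at the end, rather than growing df_concat on every iteration; a minimal sketch along those lines (keeping the same seeding scheme, and iterating only over the (Year, LOC_ID) combinations that actually occur) would be:

pieces = []

for (year, loc), grp in df_bb[df_bb['LOC_ID'].isin(locids_sub)].groupby(['Year', 'LOC_ID']):
    sub_ids = grp['SUB_ID'].unique()
    if len(sub_ids) > max_subid:
        # Same per-group seed derivation as above
        np.random.seed(year + int(loc[-2:]))
        keep = np.random.choice(sub_ids, size=max_subid, replace=False)
        grp = grp[grp['SUB_ID'].isin(keep)]
    pieces.append(grp)

# A single concatenation at the end avoids rebuilding df_concat on every pass
df_concat = pd.concat(pieces, ignore_index=True)

This visits only the group combinations present in the data and performs one concatenation, which is typically much faster than concatenating inside the loop; treat it as a starting point rather than a drop-in replacement.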



