Monday, March 28, 2022

Random sampling of many groups within a dataframe if group size is greater than a value

I'm a newbie. Consider the following: for any LOC_ID that has more than 3 unique SUB_IDs, I take a random sample of 3 of its unique SUB_IDs and piece together a new dataframe with pd.concat:

dataframe_in
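Since the example input referred to as dataframe_in is not reproduced here, a small hypothetical test_df along the same lines (illustrative values only) can be built to make the snippet below runnable:

import pandas as pd

# Hypothetical example data standing in for the original dataframe_in:
# several rows per LOC_ID, each carrying a SUB_ID.
test_df = pd.DataFrame({
    'LOC_ID': ['L01', 'L01', 'L01', 'L01', 'L01', 'L02', 'L02', 'L03'],
    'SUB_ID': ['S1', 'S2', 'S3', 'S4', 'S5', 'S6', 'S7', 'S8'],
})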

import numpy as np

df_concat = pd.DataFrame(columns=['LOC_ID', 'SUB_ID'])

for loc in test_df['LOC_ID'].unique():
    # Unique SUB_IDs belonging to this LOC_ID
    sub_ids = test_df[test_df['LOC_ID'] == loc]['SUB_ID'].unique()
    if len(sub_ids) > 3:
        # Reset the seed so each group's draw is reproducible
        np.random.seed(622)
        sub_ids = np.random.choice(sub_ids, size=3, replace=False)
    # Filter on LOC_ID as well as SUB_ID, so rows from other locations
    # that happen to share a SUB_ID are not pulled in
    df_sample = test_df[(test_df['LOC_ID'] == loc) & (test_df['SUB_ID'].isin(sub_ids))]
    df_concat = pd.concat([df_concat, df_sample], ignore_index=True)

dataframe_out
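For comparison, the same per-LOC_ID cap can also be sketched without the explicit loop, using drop_duplicates plus a grouped sample on the (LOC_ID, SUB_ID) pairs; this is only one possible formulation, continuing from the test_df above and assuming a single fixed random_state is acceptable:

# Sketch of a loop-free version of the cap above: one row per (LOC_ID, SUB_ID)
# pair, sample at most 3 SUB_IDs per LOC_ID, then keep only the rows of
# test_df whose pair survived the sampling.
pairs = test_df[['LOC_ID', 'SUB_ID']].drop_duplicates()
kept = (
    pairs.groupby('LOC_ID', group_keys=False)
         .apply(lambda g: g.sample(n=min(len(g), 3), random_state=622))
)
df_capped = test_df.merge(kept, on=['LOC_ID', 'SUB_ID'], how='inner')

The inner merge keeps every original column for the surviving (LOC_ID, SUB_ID) pairs.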

The following is the real case, where I also have a year component spanning 6 years. Each LOC_ID within locids_sub has more than 10 unique SUB_IDs in at least one year. Per LOC_ID, per year, I need to ensure that there are no more than 10 unique SUB_IDs:

max_subid = 10
years = df_bb['Year'].unique()

# Number of unique SUB_IDs per (LOC_ID, Year) combination
count_df = df_bb.groupby(['LOC_ID', 'Year']).nunique().reset_index()
# LOC_IDs that exceed the cap in at least one year
locids_sub = count_df[count_df['SUB_ID'] > max_subid]['LOC_ID'].unique()

# Subset df_bb by locids_sub
df_bb_sub = df_bb[df_bb['LOC_ID'].isin(locids_sub)][['Year', 'LOC_ID', 'SUB_ID']]

df_concat = pd.DataFrame(columns=df_bb.columns)  # Initializing df

for year in years:
    for loc in locids_sub:
        sub_ids = df_bb_sub[(df_bb_sub['Year'] == year) & (df_bb_sub['LOC_ID'] == loc)]['SUB_ID'].unique()
        if len(sub_ids) > max_subid:
            # Seed derived from the year and the last two characters of the
            # LOC_ID (assumes LOC_ID is a string ending in two digits)
            np.random.seed(year + int(loc[-2:]))
            sub_ids = np.random.choice(sub_ids, size=max_subid, replace=False)
        # Filter on Year and LOC_ID as well as SUB_ID, so rows for the same
        # SUB_ID in other years or locations are not pulled in
        df_sample = df_bb[(df_bb['Year'] == year)
                          & (df_bb['LOC_ID'] == loc)
                          & (df_bb['SUB_ID'].isin(sub_ids))]
        df_concat = pd.concat([df_concat, df_sample], ignore_index=True)

With 6 years, 1,460 LOC_IDs in locids_sub, and 1,828,201 rows in df_bb, this takes 30 minutes to run.
Please let me know how to make this more efficient.
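Not a full answer, but one common pandas pattern is to collect the per-group pieces in a list and call pd.concat once at the end, rather than growing df_concat on every iteration; a minimal sketch along those lines (keeping the same seeding scheme, and iterating only over the (Year, LOC_ID) combinations that actually occur) would be:

pieces = []

for (year, loc), grp in df_bb[df_bb['LOC_ID'].isin(locids_sub)].groupby(['Year', 'LOC_ID']):
    sub_ids = grp['SUB_ID'].unique()
    if len(sub_ids) > max_subid:
        # Same per-group seed derivation as above
        np.random.seed(year + int(loc[-2:]))
        keep = np.random.choice(sub_ids, size=max_subid, replace=False)
        grp = grp[grp['SUB_ID'].isin(keep)]
    pieces.append(grp)

# A single concatenation at the end avoids rebuilding df_concat on every pass
df_concat = pd.concat(pieces, ignore_index=True)

This visits only the group combinations present in the data and performs one concatenation, which is typically much faster than concatenating inside the loop; treat it as a starting point rather than a drop-in replacement.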



