I'm a newbie. Consider the following example: for any LOC_ID that has more than 3 unique SUB_IDs, I take a random sample of 3 unique SUB_IDs and piece the result together into a new dataframe with pd.concat:
import numpy as np
import pandas as pd

df_concat = pd.DataFrame(columns=['LOC_ID', 'SUB_ID'])
for loc in test_df['LOC_ID'].unique():
    sub_ids = test_df[test_df['LOC_ID'] == loc]['SUB_ID'].unique()
    if len(sub_ids) > 3:
        # More than 3 unique SUB_IDs: keep a random sample of 3
        np.random.seed(622)
        sub_ids_max = np.random.choice(sub_ids, size=3, replace=False)
        df_sample = test_df[test_df['SUB_ID'].isin(sub_ids_max)]
    else:
        # 3 or fewer unique SUB_IDs: keep them all
        df_sample = test_df[test_df['SUB_ID'].isin(sub_ids)]
    df_concat = pd.concat([df_concat, df_sample], ignore_index=True)
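For context, test_df is just a two-column frame of LOC_ID/SUB_ID pairs; a tiny made-up example (values invented purely for illustration) looks like:

import pandas as pd

# Toy data, invented for illustration only: 'A01' has 4 unique SUB_IDs
# (sampled down to 3), 'B02' has 2 (kept as is)
test_df = pd.DataFrame({
    'LOC_ID': ['A01', 'A01', 'A01', 'A01', 'B02', 'B02'],
    'SUB_ID': ['s1', 's2', 's3', 's4', 's5', 's6'],
})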
The following is my real case, where I also have a year component spanning 6 years. Each LOC_ID in locids_sub has more than 10 unique SUB_IDs in at least one year. Per LOC_ID, per year, I need to ensure that there are no more than 10 unique SUB_IDs:
max_subid = 10
years = df_bb['Year'].unique()

# Count unique SUB_IDs per (LOC_ID, Year) and find LOC_IDs that exceed the cap
count_df = df_bb.groupby(['LOC_ID', 'Year']).nunique().reset_index()
locids_sub = count_df[count_df['SUB_ID'] > max_subid]['LOC_ID'].unique()

# Subset df_bb by locids_sub
df_bb_sub = df_bb[df_bb['LOC_ID'].isin(locids_sub)][['Year', 'LOC_ID', 'SUB_ID']]

df_concat = pd.DataFrame(columns=df_bb.columns)  # Initializing df
for year in years:
    for loc in locids_sub:
        sub_ids = df_bb_sub[(df_bb_sub['Year'] == year) & (df_bb_sub['LOC_ID'] == loc)]['SUB_ID'].unique()
        if len(sub_ids) > max_subid:
            np.random.seed(year + int(loc[-2:]))
            sub_ids_max = np.random.choice(sub_ids, size=max_subid, replace=False)
            df_sample = df_bb[df_bb['SUB_ID'].isin(sub_ids_max)]
        else:
            df_sample = df_bb[df_bb['SUB_ID'].isin(sub_ids)]
        df_concat = pd.concat([df_concat, df_sample], ignore_index=True)
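As a sanity check (illustrative only, not part of the timing), something like this should confirm the cap per (LOC_ID, Year) group in the result:

# Count unique SUB_IDs per (LOC_ID, Year) in df_concat;
# if the cap worked, no group should exceed max_subid
check = df_concat.groupby(['LOC_ID', 'Year'])['SUB_ID'].nunique()
print(check.max())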
With 6 years, 1460 LOC_IDs in locids_sub, and 1828201 rows in df_bb, this takes 30 minutes to run.
Please let me know how to make this more efficient.
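For what it's worth, I was wondering whether a groupby-based version along these lines could replace the double loop entirely (an untested sketch; note that unlike my loop, it restricts the kept rows to each (LOC_ID, Year) group rather than pulling every row that shares a sampled SUB_ID):

import pandas as pd

max_subid = 10

# Unique (LOC_ID, Year, SUB_ID) combinations
pairs = df_bb[['LOC_ID', 'Year', 'SUB_ID']].drop_duplicates()

# Shuffle the combinations once, then keep at most max_subid SUB_IDs per (LOC_ID, Year)
keep = (pairs.sample(frac=1, random_state=622)
             .groupby(['LOC_ID', 'Year'])
             .head(max_subid))

# Keep only the rows of df_bb whose (LOC_ID, Year, SUB_ID) survived the sampling
df_concat = df_bb.merge(keep, on=['LOC_ID', 'Year', 'SUB_ID'], how='inner')

The single merge would replace the repeated pd.concat inside the loop, which I suspect is where most of the time goes. Is that the right idea, or is there a better way?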