Sunday, 12 January 2020

Random oversampling for multiple categories in Python

I'm trying to apply random oversampling to an imbalanced dataset so that I can predict the appropriate 'Category' for a given 'description'.

df_1['Category'].value_counts().loc[lambda x : x>1]

There are too many categories and their counts are very uneven. I want to bring them all to the same level so that the machine learning model does not always predict, say, 'iam~ki-000' just because it has the most samples.

 iam~ki-000                378
 iam~ki-002                180
 iam~ki-049                 99
 iam~ki-050                 91
 iam~ki-057                 91
                          ... 
 iam~ki-077                  2 

So far I have come up with only one solution, and it is very inefficient :(

That is to handle each category individually, appending copies of its rows to oversample the dataset. There are almost 90 categories in total. Can someone help me write a function that brings all categories to an equal count?

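import pandas as pd

# Select the rows of one category, then append four extra copies of them
ki_057_mask = df_1['Category'] == 'iam~ki-057'
df_try = df_1[ki_057_mask]
df_1 = pd.concat([df_1] + [df_try] * 4, ignore_index=True)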
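One way the per-category step above could be generalised is sketched below. This is only a minimal sketch, not a definitive implementation: the function name `oversample_to_max`, the `random_state` parameter, and the toy frame with a 'Description' column are assumptions made for illustration. It groups the rows by 'Category' and, for each group, draws the missing rows with replacement until every category matches the largest one.

```python
import pandas as pd


def oversample_to_max(df, label_col="Category", random_state=0):
    """Randomly oversample every category up to the size of the largest one.

    Hypothetical helper for illustration; not taken from the original post.
    """
    target = df[label_col].value_counts().max()
    parts = []
    for _, group in df.groupby(label_col):
        # Keep every original row of this category
        parts.append(group)
        missing = target - len(group)
        if missing > 0:
            # Draw only the missing rows, with replacement
            parts.append(
                group.sample(n=missing, replace=True, random_state=random_state)
            )
    # Shuffle so the categories are not stored in contiguous blocks
    balanced = pd.concat(parts).sample(frac=1, random_state=random_state)
    return balanced.reset_index(drop=True)


# Toy data standing in for df_1 (the values are made up)
toy = pd.DataFrame({
    "Description": ["a", "b", "c", "d", "e", "f"],
    "Category": ["iam~ki-000"] * 3 + ["iam~ki-002"] * 2 + ["iam~ki-077"],
})
balanced = oversample_to_max(toy)
print(balanced["Category"].value_counts())
```

If this is used before training a model, it is usually safer to oversample only the training split, so that duplicated rows do not leak into the test set.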


