samedi 21 décembre 2019

How to sample different number of rows from each group in DataFrame (Python)

I have a dataframe with a category column. Df has different number of rows for each category.

category number_of_rows
cat1     19189
cat2     13193
cat3     4500
cat4     1914
cat5     568
cat6     473
cat7     216
cat8     206
cat9     197
cat10    147
cat11    130
cat12    49
cat13    38
cat14    35
cat15    35
cat16    30
cat17    29
cat18    9
cat19    4
cat20    4
cat21    1
cat22    1
cat23    1

I want to select different number of rows from each category. (Instead of n fixed number of rows from each category)

Example input:
size_1 : {"cat1": 40, "cat2": 20, "cat3": 15, "cat4": 11, ...}
Example input: 
size_2 : {"cat1": 51, "cat2": 42, "cat3": 18, "cat4": 21, ...}

What I want to do is actually a stratified sampling with given number of instances corresponding to each category.

Also, it should be randomly selected. For example, I don't need the top 40 values for size_1.["cat1"], I need random 40 values.

Thanks for the help.




Aucun commentaire:

Enregistrer un commentaire