Saturday, January 18, 2020

Stratified random sampling is missing one of the values to stratify on

When I run a basic groupby to see the counts of my clusters as follows:

data.groupby('clusters').count()

my results look like so:

clusters         a         b         c
0                10000     10000     10000
1                10000     10000     10000
2                20000     20000     20000

I then want to draw a stratified sample, prorated by these counts so that each cluster keeps its share of the rows, and I use the code below:

stratify = data.sample(n=10000, weights='clusters', random_state=0)

In this fake example my dataset should shrink by a factor of 4, so if I run the same groupby on the new dataframe created by the one line above, I should get 2500 for row 0, 2500 for row 1, and 5000 for row 2. However, for some reason I cannot figure out, I get the expected output for rows 1 and 2, but row 0 just disappears:

stratify.groupby('clusters').count()

The output looks as follows:

clusters         a         b         c
1                2500      2500      2500
2                5000      5000      5000

Why in the world did my first row disappear? There seems to be nothing special about it...
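
A likely explanation, sketched under some assumptions: in `DataFrame.sample`, `weights='clusters'` does not stratify on that column. It uses the column's values as per-row sampling weights, so every row in cluster 0 gets weight 0 and can never be drawn. The snippet below rebuilds a hypothetical dataset with the same cluster counts (the column values are made up) to reproduce the effect, and then shows one way to take a proportional sample per cluster with `groupby(...).sample(frac=...)`, which assumes pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

# Hypothetical data with the cluster counts from the question;
# the values in columns a, b, c are random placeholders.
rng = np.random.default_rng(0)
n_rows = 40000
data = pd.DataFrame({
    "clusters": np.repeat([0, 1, 2], [10000, 10000, 20000]),
    "a": rng.random(n_rows),
    "b": rng.random(n_rows),
    "c": rng.random(n_rows),
})

# weights='clusters' treats the cluster labels 0, 1, 2 as row weights,
# so rows labelled 0 have zero probability of being selected.
weighted = data.sample(n=10000, weights="clusters", random_state=0)
weighted_counts = weighted["clusters"].value_counts().sort_index()

# Proportional stratified sample: take the same fraction of every cluster.
# Assumes pandas >= 1.1, where DataFrameGroupBy.sample is available.
stratified = data.groupby("clusters").sample(frac=0.25, random_state=0)
stratified_counts = stratified["clusters"].value_counts().sort_index()

print(weighted_counts)    # cluster 0 is absent
print(stratified_counts)  # 2500, 2500 and 5000 rows for clusters 0, 1 and 2
```

If the labels had been 1, 2, 3 instead of 0, 1, 2, no cluster would vanish, but the sample would still be skewed toward the higher labels rather than prorated by cluster size.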



