random: Stratified random sampling is missing one of the values to stratify on

Saturday, January 18, 2020

When I run a basic groupby to see the counts of my clusters, as follows:

data.groupby('clusters').count()

my results look like so:

clusters         a         b         c
0                10000     10000     10000
1                10000     10000     10000
2                20000     20000     20000

I then want to draw a stratified sample so that each cluster contributes a number of rows proportional to these counts, and I use the code below:

stratify = data.sample(n=10000, weights='clusters', random_state=0)

In this fake example my dataset should shrink by a factor of 4, so if I run the same groupby on the new dataframe created by the line above, I should get 2500 for row 0, 2500 for row 1, and 5000 for row 2. However, for some reason I cannot figure out, I get the correct output for rows 1 and 2, but row 0 just disappears:

stratify.groupby('clusters').count()

The output looks as follows:

clusters         a         b         c
1                2500      2500      2500
2                5000      5000      5000

Why in the world did my 1st row disappear? There looks to be nothing special about it...
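
A minimal sketch of what may be going on, using made-up data that mirrors the counts above (the random values in columns a, b and c are assumptions, not the real dataset). In pandas, passing a column name to weights makes sample use that column's values as per-row sampling weights rather than as strata, so every row with clusters equal to 0 has weight 0 and can never be drawn. The second half shows one possible way to get a proportional stratified sample instead, assuming pandas 1.1 or later, where groupby objects have their own sample method:

```python
import numpy as np
import pandas as pd

# Hypothetical data with the same cluster sizes as in the question:
# 10000 rows in cluster 0, 10000 in cluster 1, 20000 in cluster 2.
rng = np.random.default_rng(0)
clusters = np.repeat([0, 1, 2], [10000, 10000, 20000])
data = pd.DataFrame({
    "clusters": clusters,
    "a": rng.random(clusters.size),
    "b": rng.random(clusters.size),
    "c": rng.random(clusters.size),
})

# weights="clusters" treats the cluster label itself as the row weight,
# so rows labelled 0 get probability 0 and never appear in the sample.
weighted = data.sample(n=10000, weights="clusters", random_state=0)
weighted_counts = weighted["clusters"].value_counts().sort_index()
print(weighted_counts)

# Proportional stratified sample: draw the same fraction from each cluster.
stratified = data.groupby("clusters").sample(frac=0.25, random_state=0)
stratified_counts = stratified["clusters"].value_counts().sort_index()
print(stratified_counts)
```

With frac=0.25 each cluster keeps a quarter of its rows, which gives the 2500 / 2500 / 5000 split expected above. On older pandas versions, applying sample to each group inside groupby(...).apply should behave similarly.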
