Friday, March 19, 2021

How can I choose a random sample of size n from a single pandas DataFrame column, with each value occurring at most 2 times?

My dataframe looks like this:

Identifier       Strain     Other columns, etc.
1                  A
2                  C
3                  D
4                  B
5                  A
6                  C
7                  C
8                  B
9                  D
10                 A
11                 D
12                 D

I want to choose n rows at random while maintaining diversity in the strain values. For example, I want a group of 6, so I'd expect my final rows to include at least one of every type of strain with two strains appearing twice.

I've tried converting the Strain column into a numpy array and using the method random.choice but that didn't seem to run. I've also tried using .sample but it does not maximize strain diversity.

This is my latest attempt, which outputs a sample of size 7 in order (identifiers 0-7), and the strains are all the same.

randomsample = df[df.Strain == np.random.choice(df['Strain'].unique())].reset_index(drop=True)
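The line above picks a single strain at random and then keeps every row of that strain, which is why all the strains in the output match. One possible way to get a diverse sample instead is sketched below, assuming the column is named `Strain` as in the example: shuffle the rows, number each row within its strain, drop anything beyond the second occurrence, and take rows in order of that number so every strain is used once before any is used twice. The function name `diverse_sample` and its parameters `max_repeats` and `seed` are made up for illustration, and the small dataframe just rebuilds the example data.

```python
import pandas as pd

# Rebuild the example data from the question (only the two columns shown).
df = pd.DataFrame({
    "Identifier": range(1, 13),
    "Strain": list("ACDBACCBDADD"),
})


def diverse_sample(frame, n, column="Strain", max_repeats=2, seed=None):
    """Hypothetical helper: n random rows, each value of `column` at most
    `max_repeats` times, using every value once before repeating any."""
    # Shuffle all rows so the choice within each strain is random.
    shuffled = frame.sample(frac=1, random_state=seed)

    # 0 for the first shuffled row of a strain, 1 for the second, and so on.
    rank = shuffled.groupby(column).cumcount()

    # Keep at most `max_repeats` rows per strain.
    keep = rank[rank < max_repeats]
    if len(keep) < n:
        raise ValueError("not enough rows to satisfy the repeat limit")

    # A stable sort puts all first occurrences ahead of second occurrences
    # while preserving the shuffled order inside each group.
    order = keep.sort_values(kind="mergesort").index
    return shuffled.loc[order[:n]]


randomsample = diverse_sample(df, 6, seed=0)
print(randomsample)
```

With n = 6 and four strains, this should return each strain once plus two strains a second time. Passing `seed=None` gives a different sample on each call.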


