My dataframe looks like this:
Identifier Strain Other columns, etc.
1 A
2 C
3 D
4 B
5 A
6 C
7 C
8 B
9 D
10 A
11 D
12 D
I want to choose n rows at random while maintaining diversity in the strain values. For example, for a group of 6, I'd expect the final rows to include at least one of every strain, with two strains appearing twice.
I've tried converting the Strain column into a NumPy array and using NumPy's random.choice, but that didn't seem to run. I've also tried .sample, but it does not maximize strain diversity.
This is my latest attempt. It outputs a sample of size 7 in order (identifiers 0-7), and the strains are all the same:
randomsample = df[df.Strain == np.random.choice(df['Strain'].unique())].reset_index(drop=True)
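For reference, here is a minimal sketch of the behaviour I'm after; it is one possible approach, not necessarily the best one. The idea is to shuffle the rows, number each row within its strain, and take rows round-robin so that every strain contributes one row before any strain contributes a second. The function name sample_diverse and its seed parameter are made up for illustration, and the sketch assumes the Strain column has no missing values.

```python
import pandas as pd

def sample_diverse(df, n, column="Strain", seed=None):
    """Pick n rows at random, spreading them across the values of `column`.

    Hypothetical helper for illustration; assumes `column` has no NaN values.
    """
    # Shuffle all rows so the order within each strain is random.
    shuffled = df.sample(frac=1, random_state=seed)
    # Number the rows within each strain: 0 for the first one seen, 1 for the next, ...
    rank = shuffled.groupby(column).cumcount().to_numpy()
    # Stable sort by that number: one row per strain comes first,
    # then second rows, and so on, keeping the shuffled order within each round.
    positions = rank.argsort(kind="stable")
    return shuffled.iloc[positions[:n]]

# The example data from the question (only the two columns shown).
df = pd.DataFrame({
    "Identifier": range(1, 13),
    "Strain": list("ACDBACCBDADD"),
})

picked = sample_diverse(df, n=6, seed=0)
print(picked.sort_values("Identifier"))
```

With the 12 rows above and n=6, this should return one row for each of the four strains plus a second row for two of them. If n is larger than the number of rows, it simply returns every row.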