I have a list of fake DNA sequences for 50 species. Each species has 10,000 DNA sequences, and each sequence is duplicated 10 times.
In total, my file has 5,000,000 DNA sequences.
Out of these 5 million sequences, I want to pick 500,000 sequences such that my new set of sequences is normally distributed.
My current approach is to generate a normal distribution, randomly draw 500,000 values from it, and then use those values as indexes to select sequences. I am not confident this is a correct approach.
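A minimal sketch of that idea, assuming "normally distributed" means the selected positions in the file should follow a bell curve centered on the middle of the file. Rather than rounding raw normal draws into indexes (which can fall outside the file or repeat), this variant weights each position by a normal density and samples without replacement. The function name `sample_indices_normal` and the `rel_std` parameter are hypothetical, and the spread of one sixth of the file length is an arbitrary assumption.

```python
import numpy as np

def sample_indices_normal(n_total, n_pick, rel_std=1 / 6, seed=0):
    """Pick n_pick distinct positions in [0, n_total) with bell-shaped weights.

    Assumption: the normal curve is centered on the middle of the file and
    its standard deviation is rel_std * n_total.
    """
    rng = np.random.default_rng(seed)
    positions = np.arange(n_total)
    mean = (n_total - 1) / 2
    std = n_total * rel_std
    # Unnormalized normal density evaluated at every position.
    weights = np.exp(-0.5 * ((positions - mean) / std) ** 2)
    weights /= weights.sum()
    # replace=False guarantees that no position is selected twice.
    return rng.choice(n_total, size=n_pick, replace=False, p=weights)

# Small-scale usage; the real case would be n_total=5_000_000, n_pick=500_000.
indices = sample_indices_normal(n_total=5_000, n_pick=500)
```

Because the draw is without replacement, the selected positions only approximately follow the target curve: the more of the file is picked, the flatter the result becomes near the center.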
Is there a better approach to achieve this result?