The data: Using RStudio, I have created a data frame of cluster data consisting of two columns: 1) sequence numbers and 2) the cluster each sequence belongs to.
Image reference: https://i.stack.imgur.com/3tXTt.png. Apologies for not being able to post the source code as it's part of a much larger ongoing project and can't be isolated.
The data frame is 195 rows long. Column 1 runs sequentially from 1 to 195, while Column 2 holds one of 10 cluster numbers, repeated according to which cluster each sequence belongs to. For instance, in the 20-row excerpt printed below, sequences 2-12 all belong to cluster 5.
Seq Cluster
1 10
2 5
3 5
4 5
5 5
6 5
7 5
8 5
9 5
10 5
11 5
12 5
13 4
14 4
15 3
16 4
17 4
18 4
19 2
20 8
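Since I can't share the original code, here is a minimal sketch that rebuilds just the 20-row excerpt above so there is something to test against. The column names Seq and Cluster are taken from the printout and the name dfCluster is the one I use for the full data frame; the real data has 195 rows, not 20.

```r
# Rebuild the 20-row excerpt shown above (values copied from the printout).
# This is only a stand-in for the real 195-row data frame.
dfCluster <- data.frame(
  Seq     = 1:20,
  Cluster = c(10, rep(5, 11), 4, 4, 3, 4, 4, 4, 2, 8)
)

# Quick look at how many sequences fall in each cluster in the excerpt
table(dfCluster$Cluster)
```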
My aim: I would like to randomly sample one sequence from each of the 10 clusters and collect the results in a new data frame.
For instance: one randomly sampled sequence from sequences 2-12 (cluster 5), and likewise for every other cluster.
However, I am unsure how to sample randomly within each cluster separately.
By running:
nrow(unique(dfCluster))
I can get each cluster along with one non-redundant sequence that belongs to it, but that's not random; it's just the first corresponding value per cluster group.
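To make the goal concrete, here is a rough sketch of the kind of per-cluster draw I am after, written in base R. It is not a definitive solution: it assumes the data frame is called dfCluster with columns Seq and Cluster, and it uses only the 20-row excerpt as stand-in data. The names rows_by_cluster, picked and dfSample are made up for illustration.

```r
# Stand-in data: the 20-row excerpt (the real data frame has 195 rows)
dfCluster <- data.frame(
  Seq     = 1:20,
  Cluster = c(10, rep(5, 11), 4, 4, 3, 4, 4, 4, 2, 8)
)

set.seed(42)  # only so the draw is repeatable; drop it for a fresh sample each run

# Group the row indices by cluster
rows_by_cluster <- split(seq_len(nrow(dfCluster)), dfCluster$Cluster)

# Draw one row index per cluster.
# sample(idx, 1) would sample from 1:idx when idx has length one,
# so pick a position with sample.int() and index into idx instead.
picked <- vapply(
  rows_by_cluster,
  function(idx) idx[sample.int(length(idx), 1)],
  integer(1)
)

# One randomly chosen sequence per cluster, in a new data frame
dfSample <- dfCluster[picked, ]
dfSample
```

If dplyr is available, grouping with `group_by(Cluster)` and then calling `slice_sample(n = 1)` should give the same kind of result in fewer lines.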
Author note: Please let me know if I can further clarify any of these steps, and apologies for this being rather long-winded.