Tuesday, 23 October 2018

Sample from specific rows in a dataframe column [duplicate]

The data: Using RStudio, I have created a data frame of cluster data consisting of two columns: 1) sequence numbers and 2) the cluster each sequence belongs to.

Image reference: https://i.stack.imgur.com/3tXTt.png. Apologies for not being able to post the source code as it's part of a much larger ongoing project and can't be isolated.

The data frame has 195 rows. Column 1 runs sequentially from 1 to 195, while column 2 holds one of 10 cluster numbers, repeated according to which sequences belong to each cluster. For instance, in the 20-row excerpt of the data frame printed below, you can see that sequences 2-12 all belong to cluster 5.

 Seq Cluster
   1      10
   2       5
   3       5
   4       5
   5       5
   6       5
   7       5
   8       5
   9       5
  10       5
  11       5
  12       5
  13       4
  14       4
  15       3
  16       4
  17       4
  18       4
  19       2
  20       8

My aim: I would like to randomly sample one sequence from each of the 10 clusters and collect the sampled rows in a new data frame.

For instance, one randomly sampled sequence from sequences 2-12, which all belong to cluster 5.

However, I am unsure how to sample randomly within each cluster separately.

By running:

nrow(unique(dfCluster))

I can get each cluster along with one non-redundant sequence that belongs to it, but that is not random: it is just the first corresponding value per cluster group.
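For reference, here is a minimal base R sketch of the kind of per-cluster sampling I am after. Since I can't share the real data, the `dfCluster` built here is a made-up stand-in with the same shape (columns `Seq` and `Cluster`, 195 rows, 10 clusters), and the helper `pick_one` and the result name `dfSample` are just illustrative names, so treat this as one possible approach under those assumptions rather than a definitive solution.

```r
# Hypothetical stand-in for the real data: 195 sequences in 10 clusters.
set.seed(1)
dfCluster <- data.frame(
  Seq     = 1:195,
  Cluster = sample(rep(1:10, length.out = 195))
)

# Pick one element of idx at random. Indexing with sample.int() avoids the
# sample(x, 1) pitfall for a cluster with a single row, where sample(7, 1)
# would draw from 1:7 instead of returning 7.
pick_one <- function(idx) idx[sample.int(length(idx), 1)]

# Group the row indices by cluster, then draw one row index per group.
rows <- vapply(
  split(seq_len(nrow(dfCluster)), dfCluster$Cluster),
  pick_one,
  integer(1)
)

# New data frame with exactly one randomly chosen sequence per cluster.
dfSample <- dfCluster[rows, ]
```

With the real data, only the construction of `dfCluster` would be dropped; calling `set.seed()` beforehand makes the draw reproducible.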

Author note: Please let me know if I can further clarify any of these steps, and apologies for this being rather long-winded.



