Monday, 19 July 2021

Stratified random sampling with no repeated IDs

I have a dataset in which each id has multiple samples, and the samples can be stratified by a group variable. I would like to do random sampling, stratified by group, without repeating any id (i.e. each id appears only once in the output).

I have tried to modify some existing solutions; however, all of them sample the data in a way that includes multiple samples from a single id across the groups.

I tried the following, thinking that replace = FALSE might ensure that only one sample from each id is used, but it still does not do what I want:

library(dplyr)

set.seed(1)
# Data
data <- data.frame(
  id = c("A", "C", "B", "D", "E", "F", "A", "A", "B", "B", "B", "D", "D", "E", "E", "F"),
  group = c("1", "1", "2", "2", "3", "3", "2", "1", "1", "2", "3", "2", "3", "2", "1", "3"),
  length = c("54", "52", "43", "42", "60", "46", "59", "60", "51", "45", "47", "58", "48", "46", "56", "57"))

# Stratified random sampling by group
# (replace = FALSE only prevents drawing the same row twice within a group)
sample <- data %>%
  distinct() %>%
  group_by(group) %>%
  sample_n(2, replace = FALSE) %>%
  left_join(data)

sample output:

id group length
A   1   60      
C   1   52      
D   2   42      
A   2   59      
B   3   47      
E   3   60      

However, as seen above, id A is repeated in groups 1 and 2. The ideal output would look something like this, where each id appears only once and samples are stratified by group:

id group length
A   1   54      
C   1   52      
B   2   43      
D   2   42      
E   3   60      
F   3   46

Is there a way to customise the existing solutions so that, when sampling for each group, an id that has already been used for another group is excluded and not sampled again? I know I can add %>% distinct(id) to my code, but I believe the result would no longer be random, as distinct() just keeps the first row for each id. Thank you for any help!
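
To make the requirement concrete, here is a minimal sketch of one possible approach, not a definitive solution: go through the groups one at a time, drop every id that an earlier group has already used, keep one random row per remaining id, and then draw the requested number of rows. The helper name sample_unique_ids and its argument n_per_group are made up for illustration; slice_sample() assumes dplyr 1.0.0 or later.

```r
library(dplyr)

data <- data.frame(
  id = c("A", "C", "B", "D", "E", "F", "A", "A", "B", "B", "B", "D", "D", "E", "E", "F"),
  group = c("1", "1", "2", "2", "3", "3", "2", "1", "1", "2", "3", "2", "3", "2", "1", "3"),
  length = c("54", "52", "43", "42", "60", "46", "59", "60", "51", "45", "47", "58", "48", "46", "56", "57"))

# Hypothetical helper (illustration only): greedy sampling, one group at a time
sample_unique_ids <- function(df, n_per_group) {
  used <- character(0)   # ids already drawn for an earlier group
  picked <- list()
  for (g in sort(unique(df$group))) {
    # Candidate rows: this group only, ids not used yet, one random row per id
    cand <- df %>%
      filter(group == g, !id %in% used) %>%
      group_by(id) %>%
      slice_sample(n = 1) %>%
      ungroup()
    # Draw up to n_per_group of the remaining ids
    chosen <- cand %>% slice_sample(n = min(n_per_group, nrow(cand)))
    used <- c(used, chosen$id)
    picked[[g]] <- chosen
  }
  bind_rows(picked)
}

set.seed(1)
result <- sample_unique_ids(data, n_per_group = 2)
result
```

Because the pass is greedy, a later group can end up with fewer than n_per_group rows if earlier groups have already taken most of its ids (with this data, group 3 is left with only F if A and B go to group 1 and D and E to group 2). One option is to check table(result$group) and simply rerun the helper when a group is short. Note also that this does not sample uniformly over all valid selections; it only guarantees that no id appears twice.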



