I have a dataset where each id has multiple samples and can be stratified by a group variable. I would like to do random sampling, stratified by group, without repeating any id (i.e. each id appears only once in the output).
I have tried to modify some existing solutions, but all of them seem to sample the data in a way that includes multiple samples from a single id across the groups:
- random sampling - matrix
- Stratified random sampling from data frame
- Stratified random sampling in R
- Stratified random sampling from data frame
I have tried the following, thinking replace = FALSE may help to ensure that only one sample from each id is used, but this still does not do what I want.
library(dplyr)

set.seed(1)

# Data: each id has several samples spread over the groups
data <- data.frame(
  id     = c("A", "C", "B", "D", "E", "F", "A", "A", "B", "B", "B", "D", "D", "E", "E", "F"),
  group  = c("1", "1", "2", "2", "3", "3", "2", "1", "1", "2", "3", "2", "3", "2", "1", "3"),
  length = c("54", "52", "43", "42", "60", "46", "59", "60", "51", "45", "47", "58", "48", "46", "56", "57"))

# Stratified random sampling by group: 2 rows per group, without replacement
sample <- data %>%
  distinct() %>%
  group_by(group) %>%
  sample_n(2, replace = FALSE) %>%
  left_join(data)

sample
output:
id group length
A 1 60
C 1 52
D 2 42
A 2 59
B 3 47
E 3 60
However, as seen above, id = A is repeated in groups 1 and 2. The ideal output would look something like this, where each id appears only once and samples are stratified by group:
id group length
A 1 54
C 1 52
B 2 43
D 2 42
E 3 60
F 3 46
Is there a way to customise the existing solutions so that, when sampling for each group, an id that has already been used for another group is excluded and not sampled again? I know I can add %>% distinct(id) to my code, but I believe the result would no longer be random, as distinct() just keeps the first row for each id. Thank you for any help!
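To make the requested behaviour concrete, here is a minimal sketch of one possible approach in base R, not a definitive solution. It visits the groups in random order, draws n ids that have not been used yet, and keeps one random row per chosen id. The helper name sample_unique_ids and its max_tries argument are hypothetical (they do not come from dplyr or any other package). Because an early group can use up the ids a later group needs, the sketch simply retries from scratch when that happens, so it is not guaranteed to weight every valid combination equally.

```r
# Example data from the question
data <- data.frame(
  id     = c("A", "C", "B", "D", "E", "F", "A", "A", "B", "B", "B", "D", "D", "E", "E", "F"),
  group  = c("1", "1", "2", "2", "3", "3", "2", "1", "1", "2", "3", "2", "3", "2", "1", "3"),
  length = c("54", "52", "43", "42", "60", "46", "59", "60", "51", "45", "47", "58", "48", "46", "56", "57"))

# Hypothetical helper: draw n rows per group so that no id is used twice.
sample_unique_ids <- function(df, n, max_tries = 100) {
  groups <- unique(as.character(df$group))
  for (attempt in seq_len(max_tries)) {
    used   <- character(0)  # ids already assigned to a group
    picked <- integer(0)    # row indices kept so far
    ok     <- TRUE
    # Visit the groups in random order so no group is always served first
    for (g in groups[sample.int(length(groups))]) {
      rows <- which(df$group == g & !(df$id %in% used))
      ids  <- unique(as.character(df$id[rows]))
      if (length(ids) < n) {  # not enough unused ids left: retry from scratch
        ok <- FALSE
        break
      }
      chosen <- ids[sample.int(length(ids), n)]
      for (i in chosen) {
        cand   <- rows[df$id[rows] == i]
        # sample.int on the length avoids sample()'s behaviour for a single number
        picked <- c(picked, cand[sample.int(length(cand), 1)])
      }
      used <- c(used, chosen)
    }
    if (ok) {
      out <- df[picked, ]
      return(out[order(out$group, out$id), ])
    }
  }
  stop("No valid sample found in ", max_tries, " tries")
}

set.seed(1)
result <- sample_unique_ids(data, n = 2)
result
```

The returned data frame has two rows per group and no repeated id. If the groups share too few ids for that to be possible, the function stops with an error instead of returning a partial sample.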