Sunday, July 14, 2019

Random sampling from a data.table with different draws based on the categorical value in a column

I have a data.table with 22 rows, each labelled with one of two sample IDs ('A' or 'B'). I want to reduce the sample size randomly while fixing how many rows come from each ID: say, 7 random rows from 'A' and 5 from 'B', so that the table has 12 rows instead of 22. I then want to take the mean of a column I computed. Finally, I want to repeat the whole procedure 100 times, bootstrap-style, and see how much the means vary, i.e. compute statistics such as the mean and sd over the replicates.

The background is that I have a small sample set and a bigger one. I want to reduce the bigger sample set in order to evaluate the accuracy of the smaller one. I am fairly new to R and appreciate any help. Thanks.

library(data.table)

# Example data: 11 rows of 'A' and 11 rows of 'B', with random weight and height
data <- data.table(Sample = rep(c('A', 'B'), each = 11),
                   weight = rnorm(22),
                   height = rnorm(22))

# I want to randomly draw 7 values out of A and 5 values out of B, then get the mean of this new df, and repeat that whole step 100 times
# to then take the mean over all 100 replicates

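library(dplyr)
library(tidyr)
library(purrr)
library(rsample)

set.seed(4561)

# My attempt: one stratified draw (7 rows of A, 5 rows of B), then bootstraps()
new_df <- data %>%
  group_by(Sample) %>%
  nest() %>%
  ungroup() %>%                                # grouping may persist after nest() in recent tidyr
  mutate(n = c(7, 5)) %>%                      # rows to draw from A and B, in that order
  mutate(samp = map2(data, n, sample_n)) %>%   # sample_n draws without replacement by default
  select(Sample, samp) %>%
  unnest(samp) %>%
  mutate(diff.height.weight = height - weight) %>%
  mutate(means = mean(diff.height.weight)) %>%
  bootstraps(means, times = 100)               # intended: repeat the whole draw 100 times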
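
For comparison, the procedure described above (a fixed number of rows per Sample, one mean per draw, 100 repetitions) could be sketched with data.table alone. This is only a sketch under some assumptions. The names `n_draw`, `one_replicate` and `rep_means` are made up for illustration. `Sample` is assumed to be a character column. Rows are drawn without replacement, as `sample_n` does by default, so strictly speaking this is repeated subsampling rather than a bootstrap.

```r
library(data.table)

set.seed(4561)

# Example data matching the question: 11 rows of 'A' and 11 rows of 'B'
data <- data.table(Sample = rep(c("A", "B"), each = 11),
                   weight = rnorm(22),
                   height = rnorm(22))

# Rows to draw per Sample value (from the question: 7 of A, 5 of B)
n_draw <- c(A = 7, B = 5)

# One replicate: draw n_draw rows per group without replacement,
# then return the mean of height - weight over the 12 drawn rows
one_replicate <- function(dt, n_draw) {
  drawn <- dt[, .SD[sample(.N, n_draw[[Sample]])], by = Sample]
  drawn[, mean(height - weight)]
}

# Repeat the whole draw 100 times; the result is a numeric vector of 100 means
rep_means <- replicate(100, one_replicate(data, n_draw))

# Summary statistics across the 100 replicate means
c(mean = mean(rep_means), sd = sd(rep_means))
```

Keeping the per-group counts in a named vector means a different split (for example 6 and 6) only needs a change to `n_draw`. Adding `replace = TRUE` to the `sample()` call would turn each draw into sampling with replacement.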




