Tuesday, 23 July 2019

How to randomly sample a data frame (sample_n), calculate summary statistics after group_by, and iterate 999 times?

I want to resample my data frame (test_df) and calculate summary statistics (mean and standard deviation) of a numeric response variable (sp_rich), after grouping the data by two categorical factors (plant_sp = plant species, and site). I would then like this process to be iterated, say 999 times. Additionally, I would like to resample the data frame at multiple sample sizes, calculating the above statistics and repeating the iterations for each size.

Ideally, I would like this to be in a dplyr/tidy framework, as I am more familiar with this style, but I am open to base R or other options.

So here is an example data frame:

test_df <- structure(list(plant_sp = c("plant_1", "plant_1", "plant_1", "plant_1", "plant_1",
                                       "plant_1", "plant_1", "plant_1", "plant_1", "plant_1", 
                                       "plant_2", "plant_2", "plant_2", "plant_2", "plant_2",
                                       "plant_2", "plant_2", "plant_2", "plant_2", "plant_2"), 
                          site = c("a", "a", "a", "a", "a",  
                                   "b", "b", "b", "b", "b",  
                                   "a", "a", "a", "a", "a",
                                   "b", "b", "b", "b", "b"),
                          sp_rich = c(5, 3, 5, 3, 5, 
                                      7, 8, 8, 8, 10,
                                      1, 4, 5, 6, 3, 
                                      7, 3, 12, 12,11)), 
                     row.names = c(NA, -20L), class = "data.frame", 
                     .Names = c("plant_sp", "site", "sp_rich"))

# I can calculate the summary statistics for one iteration,
# and for one sample size at a time:

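library(dplyr)

mean_calc <- test_df %>%
  group_by(plant_sp, site) %>%
  # draw 3 rows at random (without replacement) from each plant_sp x site group
  do(sample_n(., 3)) %>%
  # record n() here: after summarise() the data are grouped by plant_sp only,
  # so a later mutate(sample_size = n()) would count sites (2), not sampled rows
  summarise(mean = mean(sp_rich),
            sd = sd(sp_rich),
            sample_size = n())
mean_calc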

# I can also perform the calculations manually for
# each sample size, and put the data together (hack):

# Do this manually for two different sample sizes
mean_calc_3 <- test_df %>%
  group_by(plant_sp, site) %>%
  do(sample_n(., 3)) %>%
  summarise(mean = mean(sp_rich),
            sd = sd(sp_rich)) %>%
  mutate(sample_size = 3)
mean_calc_3

mean_calc_4 <- test_df %>%
  group_by(plant_sp, site) %>%
  do(sample_n(., 4)) %>%
  summarise(mean = mean(sp_rich),
            sd = sd(sp_rich)) %>%
  mutate(sample_size = 4)
mean_calc_4

mean_calc <- bind_rows(mean_calc_3, mean_calc_4) 
(mean_calc <- mean_calc %>%
    group_by(plant_sp, site, sample_size) %>%
    arrange(sample_size, plant_sp, site))

I would really like to automate these calculations across multiple sample sizes (e.g. n = 3 and n = 4 in this example; the real data would have roughly 5-10 different size classes), and then iterate the entire process 999 times.

The structure of the mean_calc data frame is the output I am looking for, except that instead of calculating the mean and sd once, the summary statistics should be calculated 999 times and then averaged.
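
To make the target concrete, here is a minimal sketch of one way the whole procedure could be wrapped up. It is not a definitive answer. The helper name resample_summary and its arguments sizes and n_iter are hypothetical (they do not come from the code above), and it assumes dplyr >= 1.0.0 for the .groups argument of summarise(). It also assumes that "averaged" means taking the mean of the per-iteration means and the mean of the per-iteration standard deviations.

```r
library(dplyr)

# Hypothetical helper (not part of the original code): for every sample size in
# `sizes`, draw that many rows from each plant_sp x site group, summarise them,
# repeat `n_iter` times, then average the per-iteration statistics.
resample_summary <- function(df, sizes = c(3, 4), n_iter = 999) {
  # one row per (sample_size, iteration) combination
  runs <- expand.grid(sample_size = sizes, iter = seq_len(n_iter))

  # one grouped sample and one summary table per combination
  per_iter <- lapply(seq_len(nrow(runs)), function(i) {
    df %>%
      group_by(plant_sp, site) %>%
      sample_n(runs$sample_size[i]) %>%   # without replacement by default
      summarise(iter_mean = mean(sp_rich),
                iter_sd = sd(sp_rich),
                .groups = "drop") %>%
      mutate(sample_size = runs$sample_size[i], iter = runs$iter[i])
  })

  # average the statistics over iterations (assumed meaning of "averaged")
  bind_rows(per_iter) %>%
    group_by(plant_sp, site, sample_size) %>%
    summarise(mean = mean(iter_mean),
              sd = mean(iter_sd),
              .groups = "drop") %>%
    arrange(sample_size, plant_sp, site)
}
```

Calling resample_summary(test_df, sizes = c(3, 4), n_iter = 999) should return one row per plant_sp, site and sample_size, in the same layout as mean_calc. Because sample_n() samples without replacement by default, every sample size must be no larger than the smallest group (5 rows in the example data); calling set.seed() beforehand makes the result reproducible.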



