random: How to randomly sample dataframe (sample_n) and calculate summary statistics after using group

Tuesday, July 23, 2019

How to randomly sample dataframe (sample_n) and calculate summary statistics after using group_by, and iterate 999 times?

I want to resample my data frame (test_df) and calculate summary statistics (mean and standard deviation) of a numeric response variable (sp_rich) after grouping the data by two categorical factors (plant_sp = plant species, and site). I would then like to iterate this process, say, 999 times. Additionally, I would like to resample the data frame at multiple sample sizes, calculating the same statistics and repeating the iteration for each size.

Ideally, the solution would use a dplyr/tidyverse framework, as I am more familiar with that style, but I am open to base R or other options.

So here is an example data frame:

test_df <- structure(list(plant_sp = c("plant_1", "plant_1", "plant_1", "plant_1", "plant_1",
                                       "plant_1", "plant_1", "plant_1", "plant_1", "plant_1", 
                                       "plant_2", "plant_2", "plant_2", "plant_2", "plant_2",
                                       "plant_2", "plant_2", "plant_2", "plant_2", "plant_2"), 
                          site = c("a", "a", "a", "a", "a",  
                                   "b", "b", "b", "b", "b",  
                                   "a", "a", "a", "a", "a",
                                   "b", "b", "b", "b", "b"),
                          sp_rich = c(5, 3, 5, 3, 5, 
                                      7, 8, 8, 8, 10,
                                      1, 4, 5, 6, 3, 
                                      7, 3, 12, 12,11)), 
                     row.names = c(NA, -20L), class = "data.frame", 
                     .Names = c("plant_sp", "site", "sp_rich"))

I can calculate the summary statistics for one iteration, and for one sample size at a time:

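library(dplyr)

mean_calc <- test_df %>%
  group_by(plant_sp, site) %>%
  sample_n(3) %>%  # sample_n() samples within each group, so do() is not needed
  summarise(mean = mean(sp_rich),
            sd = sd(sp_rich),
            # count rows here, while each group still holds the sampled rows;
            # calling n() after summarise() would count summary rows instead
            sample_size = n())
mean_calc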

I can also perform the calculations manually for each sample size and put the data together (a hack):

# Do this manually for two different samples sizes
mean_calc_3 <- test_df %>%
  group_by(plant_sp, site) %>%
  do(sample_n(., 3)) %>%
  summarise(mean = mean(sp_rich),
            sd = sd((sp_rich))) %>%
  mutate(sample_size = 3)
mean_calc_3

mean_calc_4 <- test_df %>%
  group_by(plant_sp, site) %>%
  do(sample_n(., 4)) %>%
  summarise(mean = mean(sp_rich),
            sd = sd((sp_rich))) %>%
  mutate(sample_size = 4)
mean_calc_4

mean_calc <- bind_rows(mean_calc_3, mean_calc_4) 
(mean_calc <- mean_calc %>%
    group_by(plant_sp, site, sample_size) %>%
    arrange(sample_size, plant_sp, site))

I would really like to automate these calculations across multiple sample sizes (e.g. n = 3 and n = 4 in this example; the real data would have ~5-10 different size classes), and then iterate the entire process 999 times.

The structure of the mean_calc df is ultimately the output that I am looking for; the only difference is that, instead of being calculated once, the mean and sd are calculated 999 times and then averaged.
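To make the request concrete, here is a minimal sketch of one way this could be automated. It is not a definitive solution. It assumes dplyr >= 1.0, tidyr >= 1.0 and purrr are installed. The helper `resample_once()`, the objects `sample_sizes`, `n_iter` and `all_runs`, and the column names `iter`, `iter_mean` and `iter_sd` are hypothetical names introduced for illustration, and the seed is fixed only so the run is repeatable. test_df is rebuilt compactly from the values above so the block runs on its own.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Same values as the example data frame above, built more compactly
test_df <- data.frame(
  plant_sp = rep(c("plant_1", "plant_2"), each = 10),
  site     = rep(rep(c("a", "b"), each = 5), times = 2),
  sp_rich  = c(5, 3, 5, 3, 5, 7, 8, 8, 8, 10,
               1, 4, 5, 6, 3, 7, 3, 12, 12, 11),
  stringsAsFactors = FALSE
)

set.seed(2019)             # assumption: fixed seed, only for repeatability
sample_sizes <- c(3, 4)    # assumption: the two sizes used in the manual example
n_iter <- 999              # number of resampling iterations per sample size

# Hypothetical helper: draw `size` rows per plant_sp x site group
# (without replacement) and summarise that single draw
resample_once <- function(df, size) {
  df %>%
    group_by(plant_sp, site) %>%
    sample_n(size) %>%
    summarise(iter_mean = mean(sp_rich),
              iter_sd   = sd(sp_rich),
              .groups   = "drop")
}

# One row per (sample_size, iteration), each holding its own summary table
all_runs <- expand_grid(sample_size = sample_sizes,
                        iter = seq_len(n_iter)) %>%
  mutate(stats = map(sample_size, ~ resample_once(test_df, .x))) %>%
  unnest(stats)

# Average the per-iteration statistics, matching the layout of mean_calc
mean_calc <- all_runs %>%
  group_by(plant_sp, site, sample_size) %>%
  summarise(mean = mean(iter_mean),
            sd   = mean(iter_sd),
            .groups = "drop") %>%
  arrange(sample_size, plant_sp, site)

mean_calc
```

Adding more size classes would only require extending `sample_sizes`, provided no size exceeds the number of rows in the smallest group (5 here), since sample_n() samples without replacement by default.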
