Wednesday, May 2, 2018

Sampling design based on a set of rules from data.table or data.frame in R

Just edited to simplify the approach. Let's say we have a data.table with the following structure:

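 library(data.table)
 library(foreach)
 set.seed(123)
 # Build one block of rows per letter; each group gets between 1 and 9 rows
 # (rep() truncates the fractional count drawn by runif()).
 dt <- foreach(i = seq_along(letters), .combine = 'rbind') %do% {
   group <- rep(letters[i], runif(1, 1, 10))
   data.table(group)
 }
 # Outcome y and predictor x, rounded to whole numbers
 dt$y <- round(runif(nrow(dt), 10, 20), digits = 0)
 dt$x <- round(runif(nrow(dt), 30, 40), digits = 0)
 # Group sizes, smallest first
 dt[, .N, by = group][order(N)]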

I want to run a series of regressions within a 'foreach' loop on different samples that need to be constructed based on the following rules:

  1. Create an initial sample of 26 observations, one randomly selected from each group, without replacement.
  2. Each subsequent sample should add another set of 26 observations drawn as in (1), for as long as at least one unused case remains in every group.
  3. When one or more groups have no cases left, draw additional cases from the other groups until the set of 26 observations is complete. Groups with the fewest observations should have priority.
  4. Regress y on x each time a new sample is drawn, until all cases in dt have been used.

At the end I expect a list ('models') holding the regression results for each sample. Ideally the solution would use 'foreach' and 'data.table' so the approach can scale up, but I am willing to consider data.frame solutions and, of course, base R tricks.
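To make the four rules concrete, here is a minimal base R sketch of one way they could be read. It is an illustration under stated assumptions, not a definitive answer. The helper name `assign_draws` is hypothetical. "Fewest observations" in rule 3 is taken to mean the fewest rows still unsampled. The samples are assumed to be cumulative, so model k is fitted on draws 1 to k. The toy data frame below only stands in for the dt built above.

```r
# Hypothetical helper (not part of the original question): for every row,
# return the number of the draw in which that row enters the sample.
assign_draws <- function(group, size = length(unique(group))) {
  group <- as.character(group)            # avoid empty factor levels in split()
  draw <- rep(NA_integer_, length(group))
  k <- 0L
  while (anyNA(draw)) {
    k <- k + 1L
    left <- which(is.na(draw))            # rows not yet sampled
    # Rules 1-2: one random row from every group that still has rows left.
    pick <- unname(vapply(split(left, group[left]),
                          function(i) i[sample.int(length(i), 1L)],
                          integer(1)))
    # Rule 3 (assumed reading): fill any shortfall from the other rows,
    # taking groups with the fewest unsampled rows first, ties at random.
    rest <- setdiff(left, pick)
    short <- min(size - length(pick), length(rest))
    if (short > 0L) {
      n_left <- ave(seq_along(rest), group[rest], FUN = length)
      pick <- c(pick, rest[order(n_left, runif(length(rest)))][seq_len(short)])
    }
    draw[pick] <- k
  }
  draw
}

# Toy stand-in data with the same columns as dt (group, y, x).
set.seed(123)
dt <- data.frame(group = rep(letters, times = sample(1:9, 26, replace = TRUE)))
dt$y <- round(runif(nrow(dt), 10, 20))
dt$x <- round(runif(nrow(dt), 30, 40))
dt$draw <- assign_draws(dt$group)

# Rule 4: refit y ~ x on the cumulative sample after each draw.
models <- lapply(seq_len(max(dt$draw)),
                 function(k) lm(y ~ x, data = dt[dt$draw <= k, ]))
```

Because the helper only needs the group vector, it should work unchanged on a data.table column. The `lapply()` call could presumably be swapped for `foreach(k = ...) %do% lm(...)` if a 'foreach' loop is preferred, since `%do%` also returns a list by default.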



