Thursday, March 1, 2018

Stratified Random Sampling with multiple conditions in R

I have a dataset of 10,000 survey responses. I need to create a stratified random sample of ~2000 observations such that:

  1. There is equal representation from each of the 4 Campaign groups (i.e., ~500 each)
  2. The Age groups are divided in the following ratio: A1: 0.25, A2: 0.25, A3: 0.25, A4: 0.25
  3. Hispanic_origin is divided in the ratio: Hispanic: 0.40, Non-Hispanic: 0.60
  4. The Race categories match the original dataset as closely as possible
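
To make these targets concrete, the marginal ratios can be turned into per-cell counts. The sketch below assumes the three ratio constraints are meant to hold independently of one another, so that a cell's share is the product of its marginal shares; the question does not state this. The names `targets` and `cells` are illustrative only.

```r
n_total <- 2000

# Target marginal shares taken from requirements 1-3
targets <- list(
  Campaign        = c(D2D = 0.25, F2F = 0.25, TM = 0.25, WW = 0.25),
  Age             = c(A1 = 0.25, A2 = 0.25, A3 = 0.25, A4 = 0.25),
  Hispanic_origin = c(Hispanic = 0.40, `Non-Hispanic` = 0.60)
)

# One row per Campaign x Age x Hispanic_origin combination (4 * 4 * 2 = 32)
cells <- expand.grid(lapply(targets, names), stringsAsFactors = FALSE)

# Assumption: independence, so the cell share is the product of the marginals
cells$share <- unname(
  targets$Campaign[cells$Campaign] *
    targets$Age[cells$Age] *
    targets$Hispanic_origin[cells$Hispanic_origin]
)

# Number of rows to draw from each cell
cells$n <- round(n_total * cells$share)
```

Under that assumption every Hispanic cell needs 50 rows and every Non-Hispanic cell needs 75, which adds up to 2000.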

Here is my original dataset of 10,000 samples

ndf <- 10000
original <- data.frame(
  ID = sample(ndf),
  Name = sample(ndf),
  Campaign = sample(x = c("D2D", "F2F", "TM", "WW"), size = ndf,
                    prob = c(0.15, 0.05, 0.4, 0.4), replace = TRUE),
  Age = sample(x = c("A1", "A2", "A3", "A4"), size = ndf,
               prob = c(0.2, 0.3, 0.2, 0.3), replace = TRUE),
  Hispanic_origin = sample(x = c("Hispanic", "Non-Hispanic"), size = ndf,
                           prob = c(0.3, 0.7), replace = TRUE),
  # The Race probabilities were left blank in the original post
  Race = sample(x = c("American Indian or Alaska Native", "Asian",
                      "Black or African American",
                      "Native Hawaiian or Pacific Islander",
                      "White", "Two or more races"),
                size = ndf, prob = , replace = TRUE)
)

I have tried using the "survey" package in R to make a stratified sample whose characteristics are similar to those of the original dataset. In my case, however, I am trying to create a sample whose distribution differs from the original dataset and satisfies the requirements listed above.

I've also referred to questions previously asked:

  1. Random Sample with multiple probabilities in R
  2. generate random integers between two values with a given probability using R

I tried this:

library(survey)
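# Proportional stratified sample: n rows allocated by each stratum's observed share
rpss <- function(stratum, n) {
  props <- table(stratum)/length(stratum)  # observed share of each stratum
  nstrat <- as.vector(round(n*props))      # rows to draw per stratum
  nstrat[nstrat==0] <- 1                   # draw at least one row per stratum
  names(nstrat) <- names(props)
  stratsample(stratum, nstrat)             # row indices from survey::stratsample
}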

# take a random proportional stratified sample of size 2000
selrows <- rpss(
  stratum = interaction(original$Campaign, original$Age,
                        original$Hispanic_origin, original$Race, drop = TRUE),
  n = 2000
)
final_sample <- original[selrows, ]

I need help changing this function so that it takes the new probabilities for each group when creating a new random sample of ~2000 observations.
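
A minimal sketch of one possible direction, not a definitive implementation: draw a simple random sample inside each Campaign x Age x Hispanic_origin cell, with the cell size set by the product of the target marginal shares (an independence assumption the question does not state). `sample_to_targets` is a hypothetical helper written in base R, not a function from the "survey" package. Race is not constrained; it is left to the random draw within each cell, which only approximates requirement 4. The data are re-simulated without the Race and Name columns so that the block runs on its own.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
ndf <- 10000
original <- data.frame(
  ID = seq_len(ndf),
  Campaign = sample(c("D2D", "F2F", "TM", "WW"), ndf, replace = TRUE,
                    prob = c(0.15, 0.05, 0.4, 0.4)),
  Age = sample(c("A1", "A2", "A3", "A4"), ndf, replace = TRUE,
               prob = c(0.2, 0.3, 0.2, 0.3)),
  Hispanic_origin = sample(c("Hispanic", "Non-Hispanic"), ndf, replace = TRUE,
                           prob = c(0.3, 0.7)),
  stringsAsFactors = FALSE
)

# Target marginal shares (requirements 1-3 of the question)
targets <- list(
  Campaign        = c(D2D = 0.25, F2F = 0.25, TM = 0.25, WW = 0.25),
  Age             = c(A1 = 0.25, A2 = 0.25, A3 = 0.25, A4 = 0.25),
  Hispanic_origin = c(Hispanic = 0.40, `Non-Hispanic` = 0.60)
)

# Hypothetical helper: sample without replacement inside every cell,
# using round(n * product of target shares) as the cell size.
sample_to_targets <- function(data, targets, n) {
  vars <- names(targets)
  cell <- interaction(data[vars], drop = TRUE, sep = "|")
  rows_by_cell <- split(seq_len(nrow(data)), cell)
  picked <- lapply(names(rows_by_cell), function(key) {
    lev <- strsplit(key, "|", fixed = TRUE)[[1]]
    share <- prod(mapply(function(v, l) targets[[v]][[l]], vars, lev))
    rows <- rows_by_cell[[key]]
    k <- min(round(n * share), length(rows))  # cap at the rows available
    rows[sample.int(length(rows), k)]
  })
  data[unlist(picked), ]
}

final_sample <- sample_to_targets(original, targets, n = 2000)

# Compare the achieved shares with the targets
prop.table(table(final_sample$Campaign))
prop.table(table(final_sample$Age))
prop.table(table(final_sample$Hispanic_origin))
```

One caveat with the simulated data: F2F makes up only about 5% of the 10,000 rows, so some F2F cells (for example F2F x A1 x Hispanic) are likely to hold fewer rows than their target. The sketch caps those cells at the rows available, so the result can fall short of 2000; the alternatives would be sampling with replacement or relaxing one of the ratios.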



