Tuesday, August 9, 2022

Resampling a dataframe/survey: troubleshooting

I am having trouble resampling my dataframe and would appreciate some help.

I have a household survey in country X. Country X is divided into 3000 counties of different population sizes. The share of sampled households varied by county size: smaller counties were sampled at close to 100%, and as county population grew, the NUMBER of sampled households increased but the PROPORTION dropped.

I want to readjust the proportion of sampled households to be the same across all counties (1.45%). I calculated how many households I need to include in my final dataframe.

household.prop <- data.frame(county.id=c(1001,1002,2001,2002),
                     total.households=c(201071,10007,12834,3465),
                     new.households=c(2916, 145, 186, 50))

county.id is the county id; total.households is the total number of households in the county; new.households is the number of households I want to randomly sample from that county.
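For reference, the new.households values above are consistent with multiplying total.households by 1.45% and rounding to the nearest whole household. The rounding rule is an assumption, since I only state the target proportion. A minimal sketch:

```r
# Target sampling fraction from the post (1.45%).
target.prop <- 0.0145

# Total households for the four example counties.
total.households <- c(201071, 10007, 12834, 3465)

# Assumption: round to the nearest whole household.
new.households <- round(total.households * target.prop)
print(new.households)  # 2916 145 186 50
```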

My second dataframe contains the household id for each household in the county. Note how ids are repeated across counties. Below is an example of what my dataframe looks like (there would be 201071 rows for county 1001, 10007 for 1002, 12834 for 2001, and 3465 for 2002).

household.ids <- data.frame(county.id=c(1001,1001,1001,1002,1002,1002),
                     household.id=c(100001,100002,100003,100001,100002,100003))

What is an efficient way to randomly sample the specified number of households from each county? In other words, I need to extract 2916 household ids from county 1001, 145 from 1002, 186 from 2001, and 50 from 2002.
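To make the goal concrete, here is a rough sketch of one possible approach in base R: split the row numbers by county, then draw the requested number of rows from each county without replacement. The toy data and the names rows.by.county, n.by.county and sampled.rows are made up for illustration, and the sketch assumes every county in household.ids also appears in household.prop.

```r
set.seed(42)  # only so the toy example is reproducible

# Toy stand-ins for the real data (values are illustrative).
household.ids <- data.frame(
  county.id    = rep(c(1001, 1002), times = c(10, 6)),
  household.id = c(100001:100010, 100001:100006)
)
household.prop <- data.frame(county.id = c(1001, 1002),
                             new.households = c(3, 2))

# Row numbers of household.ids, grouped by county.
rows.by.county <- split(seq_len(nrow(household.ids)), household.ids$county.id)

# Target sample size per county, in the same order as rows.by.county.
n.by.county <- household.prop$new.households[
  match(names(rows.by.county), as.character(household.prop$county.id))
]

# Draw row numbers without replacement within each county.
# sample.int() on the group length avoids the surprise that sample(x, n)
# treats a length-one numeric x as 1:x.
sampled.rows <- unlist(
  Map(function(rows, n) rows[sample.int(length(rows), min(n, length(rows)))],
      rows.by.county, n.by.county),
  use.names = FALSE
)

resampled <- household.ids[sampled.rows, ]
```

The result keeps both county.id and household.id, which matters because household ids repeat across counties.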

Ideally, I would like a vector with unique ids (resampled.ids) that I could use to filter my original dataset. As in:

library(dplyr)  # provides %>% and filter()
total.data <- total.data %>%
  filter(household.id %in% resampled.ids)
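One caveat with this filter: because household ids repeat across counties, matching on household.id alone could also keep households from counties where that id was not sampled. A minimal sketch of matching on both columns in base R follows. The toy data, the income column and the key.* names are hypothetical.

```r
# Toy stand-ins (hypothetical values); income is a made-up survey variable.
total.data <- data.frame(
  county.id    = c(1001, 1001, 1002, 1002),
  household.id = c(100001, 100002, 100001, 100002),
  income       = c(10, 20, 30, 40)
)
# Suppose these two county/household pairs were sampled.
resampled <- data.frame(county.id    = c(1001, 1002),
                        household.id = c(100001, 100002))

# Matching on household.id alone keeps all four rows here.
by.id.only <- total.data[total.data$household.id %in% resampled$household.id, ]

# Matching on a combined county + household key keeps only the sampled pairs.
key.total  <- paste(total.data$county.id, total.data$household.id, sep = "_")
key.sample <- paste(resampled$county.id, resampled$household.id, sep = "_")
by.both    <- total.data[key.total %in% key.sample, ]
```

With dplyr, semi_join(total.data, resampled, by = c("county.id", "household.id")) should express the same idea without building a key by hand.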

IMPORTANT: My dataset contains 10 million rows grouped in 3000 counties. The code needs to be efficient; otherwise my PC will crash.
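Given that size, one idea that avoids calling a function once per county is to shuffle rows within each county using a single sort, then keep the first n rows of each county. This is only a sketch on toy data, not benchmarked. It assumes every county in household.ids has a row in household.prop, and the names u, ord, rank.in.county and n.target are made up for illustration.

```r
set.seed(1)  # only so the toy example is reproducible

# Toy stand-ins for the real data (values are illustrative).
household.ids <- data.frame(
  county.id    = rep(c(1001, 1002, 2001), times = c(8, 5, 4)),
  household.id = c(100001:100008, 100001:100005, 100001:100004)
)
household.prop <- data.frame(county.id = c(1001, 1002, 2001),
                             new.households = c(3, 2, 1))

# One random number per row; sorting by (county, random number)
# puts the rows of each county in random order.
u   <- runif(nrow(household.ids))
ord <- order(household.ids$county.id, u)
sorted.county <- household.ids$county.id[ord]

# Position of each row within its county after shuffling: 1, 2, 3, ...
rank.in.county <- sequence(rle(sorted.county)$lengths)

# Target sample size for the county of each (sorted) row.
n.target <- household.prop$new.households[
  match(sorted.county, household.prop$county.id)
]

# Keep the first n.target shuffled rows of every county.
resampled <- household.ids[ord[rank.in.county <= n.target], ]
```

Every step here is a vectorised operation over the whole table, so the cost is dominated by one sort of the row index rather than 3000 separate sampling calls.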

Thanks a bunch!



