I have a population-level data set of 12m learners.
I am trying to sample at the school level because I do not have the compute to work with all 12m learners.
I need to keep whole schools so that I can subsequently calculate school-level stats, such as the fraction of female learners in the school.
If I take a simple random sample of schools, small schools will be over-represented at the learner level.
I want the average school size (learner_count), at the learner level, to be within sampling error of the population school size at the learner level. However, my code returns a sample in which the school-level average school size of the learner-weighted sample of schools matches the population learner-level school size, but the sample's learner-level school size is much higher than the population learner-level school size. I have also tried using school size weights (learners at school / all learners) as the weight in the sample function, but that returns the same result.
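To pin down the two averages being compared, here is a minimal sketch on a made-up population of three schools. The school names and sizes are purely illustrative and are not taken from the real data. It shows that the learner-level average school size is the school-level average weighted by school size.

```r
# Toy population: three schools with made-up sizes (illustrative only)
schools <- data.frame(
  school_id     = c("A", "B", "C"),
  learner_count = c(10, 100, 1000)
)

# School-level average: every school counts once
school_level_mean <- mean(schools$learner_count)

# Learner-level average: every learner counts once, so each school
# is weighted by its own number of learners
learner_level_mean <- weighted.mean(schools$learner_count,
                                    w = schools$learner_count)

# Equivalent form: sum(N^2) / sum(N)
learner_level_check <- sum(schools$learner_count^2) / sum(schools$learner_count)

print(c(school_level = school_level_mean, learner_level = learner_level_mean))
```

In this toy case the learner-level average sits far above the school-level one, because most learners attend the largest school.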
I am using the following code to sample:
library(dplyr)

# Create school-level data: one row per school with its learner count
unique_schools <- all_learners %>%
  filter(!is.na(school_id),
         !is.na(learner_count)) %>%
  group_by(school_id) %>%
  summarise(learner_count = first(learner_count))
# %>%
#   mutate(learners_tot = sum(learners_master, na.rm = T),
#          learner_weight = learner_count / learners_tot) %>%
#   select(school_id, learner_weight)

# Randomly select 1000 schools, weighted by school size
selected_schools <- unique_schools %>%
  sample_n(1000, weight = learner_count, replace = FALSE) # or weight = learner_weight

# Create the `sample` column and filter the dataset
sample <- all_learners %>%
  mutate(sample = if_else(school_id %in% selected_schools$school_id, 1, 0)) %>%
  filter(sample == 1)
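
One way to see what the size-weighted draw does is to simulate it. The sketch below is not the real data: it assumes a hypothetical population of 20,000 schools with lognormal sizes, and it uses base R `sample()` with `prob =` as a stand-in for `sample_n(weight = ...)`, on the assumption that both draw without replacement with probability proportional to the weight. It compares the learner-level average school size when whole schools are kept after (a) a size-weighted draw and (b) an equal-probability draw.

```r
set.seed(1)

# Hypothetical population of schools with skewed sizes (not the real data)
n_schools     <- 20000
learner_count <- round(rlnorm(n_schools, meanlog = 6, sdlog = 0.5)) + 1

# Learner-level average school size: each school weighted by its size
learner_mean <- function(n) sum(n^2) / sum(n)

pop_learner_mean <- learner_mean(learner_count)

# (a) Size-weighted draw of 1000 schools, keeping every learner in them
pps_idx          <- sample(n_schools, 1000, replace = FALSE, prob = learner_count)
pps_school_mean  <- mean(learner_count[pps_idx])          # schools counted once
pps_learner_mean <- learner_mean(learner_count[pps_idx])  # learners counted once

# (b) Equal-probability draw of 1000 schools, keeping every learner in them
srs_idx          <- sample(n_schools, 1000, replace = FALSE)
srs_learner_mean <- learner_mean(learner_count[srs_idx])

print(round(c(
  population_learner_level = pop_learner_mean,
  weighted_school_level    = pps_school_mean,
  weighted_learner_level   = pps_learner_mean,
  unweighted_learner_level = srs_learner_mean
)))
```

Under these assumptions the simulation shows the same pattern as described above: the size-weighted draw matches the population value at the school level but overshoots it at the learner level, since large schools are favoured once when selected and again when all of their learners are kept. In this simulated setting the equal-probability draw of whole schools lands close to the population learner-level value, which may be worth checking against the real data.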