Friday, 28 April 2023

Learner-weighted sampling of schools overweights large schools at the learner level

I have a population-level dataset of 12 million learners.

I want to sample at the school level because I do not have the compute to work with all 12 million learners.

I need to keep whole schools so that I can subsequently calculate school-level statistics, such as the fraction of female learners in each school.

If I take a simple random sample of schools, small schools will be over-represented at the learner level.

I want the average school size (learner_count), measured at the learner level, to be within sampling error of the population average school size at the learner level. However, my code returns a sample in which the school-level average school size of the learner-weighted sampled schools matches the population learner-level average, while the sample's learner-level average school size is much higher than the population learner-level average. I have also tried using school-size weights (learners at school / all learners) as the weight in the sample function, but that returns the same result.
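To pin down the two averages being compared, here is a minimal sketch on a toy population of three schools. The sizes are made up for illustration and are not taken from the real data; only the name learner_count comes from the question.

```r
# Toy population of three schools (hypothetical sizes, for illustration only)
learner_count <- c(100, 300, 600)

# School-level average: every school counts once
school_level_mean <- mean(learner_count)

# Learner-level average: every learner counts once, so each school
# is weighted by its own number of learners
learner_level_mean <- weighted.mean(learner_count, w = learner_count)

print(c(school_level = school_level_mean, learner_level = learner_level_mean))
```

The learner-level average is always at least as large as the school-level one, because large schools contribute more learners.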

I am using the following code to sample:

library(dplyr)

# Create school-level data: one row per school with its learner count
  unique_schools <- all_learners %>%
    filter(!is.na(school_id),
           !is.na(learner_count)) %>%
    group_by(school_id) %>%
    summarise(learner_count = first(learner_count))
# Alternative weight (commented out): each school's share of all learners
#%>%
# mutate(learners_tot = sum(learner_count, na.rm = TRUE),
#     learner_weight = learner_count / learners_tot) %>%
# select(school_id, learner_weight)

# Randomly select 1000 schools, with selection weights proportional to learner_count
  selected_schools <- unique_schools %>%
    sample_n(1000, weight = learner_count, replace = FALSE) #or weight = learner_weight

# Create the `sample` column and filter the dataset
  sample <- all_learners %>%
    mutate(sample = if_else(school_id %in% selected_schools$school_id, 1, 0)) %>%
    filter(sample == 1)
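
The behaviour described above can be reproduced without the full data. The following is a rough sketch on simulated school sizes, not the real dataset: the lognormal size distribution, the seed, the 20,000 schools and the helper learner_level_mean are all assumptions made for illustration. It uses base R's sample() with prob, which draws size-weighted samples without replacement, in place of sample_n(weight = ...).

```r
set.seed(42)

# Hypothetical population: 20,000 schools with right-skewed sizes
n_schools <- 20000
school_size <- round(rlnorm(n_schools, meanlog = 6, sdlog = 0.6)) + 10

# Learner-level mean school size when whole schools are kept:
# each school appears once per learner, so it is weighted by its size
learner_level_mean <- function(size) sum(size^2) / sum(size)

# (a) Simple random sample of 1000 whole schools
srs <- sample(school_size, 1000)

# (b) Size-weighted sample of 1000 whole schools
pps <- sample(school_size, 1000, prob = school_size)

results <- c(
  pop_school_level  = mean(school_size),
  pop_learner_level = learner_level_mean(school_size),
  srs_learner_level = learner_level_mean(srs),
  pps_school_level  = mean(pps),
  pps_learner_level = learner_level_mean(pps)
)
print(round(results))
```

Under these assumptions, the school-level mean of the size-weighted draw lands near the population learner-level mean. Expanding those schools to all of their learners then weights by size a second time, which pushes the learner-level mean of the sample well above the population value. The simple random sample of whole schools is included for comparison, since it gives every learner the same chance of selection.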


