I am trying to build a mock employee data set to practice some analyses. I already have a mock data set with fake employee names, work IDs, gender, and ethnicity. I also want to add other variables, such as supervisor status and pay grade. However, in the actual data set male employees are more likely than female employees to be supervisors. So rather than telling R to make 30% of all cases supervisors and 70% non-supervisors, I want R to make, for instance, 20% of female cases and 30% of male cases supervisors.
I've tried using case_when() or group_by() along with the sample() function, but I can't get it to work.
An ideal solution would scale beyond dichotomous variables, because pay grade and ethnicity each have 5 levels. Better still would be a solution that can condition on several variables at once (say, gender and ethnicity).
Here's some fake data with 5 male and 5 female cases. For this example, let's say I want 40% of male cases to be supervisors (2/5) and only 20% of female cases to be supervisors (1/5).
library(tidyverse)
test <- tibble(emp_num = 1:10,
               ethnicity = c("White", "White", "Hispanic", "Black", "Asian",
                             "White", "White", "Hispanic", "Black", "Asian"),
               gender = c("Male", "Female", "Male", "Female", "Male",
                          "Female", "Male", "Female", "Male", "Female"))
Here is how the answer should look with the correct proportions. Which employee numbers end up as supervisors doesn't matter, as long as the target proportions for male and female cases come out right.
sample_answer <- tibble(emp_num = 1:10,
                        ethnicity = c("White", "White", "Hispanic", "Black", "Asian",
                                      "White", "White", "Hispanic", "Black", "Asian"),
                        gender = c("Male", "Female", "Male", "Female", "Male",
                                   "Female", "Male", "Female", "Male", "Female"),
                        sup_status = c("Supervisor", "Supervisor", "Supervisor", "Non-Super", "Non-Super",
                                       "Non-Super", "Non-Super", "Non-Super", "Non-Super", "Non-Super"))
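For reference, one possible way to get exact group-specific proportions is sketched below. It is only a sketch under stated assumptions, not a definitive answer: the lookup table `sup_props`, its column `p_sup`, and the helper `assign_sup()` are hypothetical names introduced here for illustration. The idea is to join the target share onto each row, then within each gender build a label vector with the right number of supervisors and shuffle it.

```r
library(dplyr)
library(tibble)

set.seed(42)  # only so the shuffle is reproducible

test <- tibble(
  emp_num = 1:10,
  ethnicity = c("White", "White", "Hispanic", "Black", "Asian",
                "White", "White", "Hispanic", "Black", "Asian"),
  gender = c("Male", "Female", "Male", "Female", "Male",
             "Female", "Male", "Female", "Male", "Female")
)

# Hypothetical lookup table: target supervisor share per group.
sup_props <- tibble(
  gender = c("Male", "Female"),
  p_sup  = c(0.4, 0.2)
)

# Hypothetical helper: returns n labels in random order,
# with round(n * p) of them set to "Supervisor".
assign_sup <- function(n, p) {
  n_sup <- round(n * p)
  sample(rep(c("Supervisor", "Non-Super"), times = c(n_sup, n - n_sup)))
}

result <- test %>%
  left_join(sup_props, by = "gender") %>%             # attach each row's target share
  group_by(gender) %>%
  mutate(sup_status = assign_sup(n(), first(p_sup))) %>%  # shuffle labels within group
  ungroup() %>%
  select(-p_sup)

# Check the supervisor counts within each gender.
count(result, gender, sup_status)
```

Under the same assumptions, conditioning on several variables would mean adding them to the lookup table and to both the `by =` argument and `group_by()` (for example gender and ethnicity). For a variable with more than two levels, such as pay grade, a looser alternative is `sample(levels, n(), replace = TRUE, prob = ...)` within each group, which gives the target proportions only approximately rather than exactly.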