Monday, March 1, 2021

Reshuffling data and running multiple regressions

I am trying to replicate several studies that use the same longitudinal dataset. Observations are at the individual-year level, and individuals are nested within countries. I want to demonstrate that there is selection bias: the primary independent variable (age) is positively associated with the binary outcome y only because countries with more of y are more likely to select older individuals. For theoretical reasons, I can't use an instrument or a multilevel model.

One way I am trying to show this is by resampling the entry age across individuals within the same country. The distribution of age stays roughly the same in each country, but the values are reassigned to different individuals within that country.

I already know how to do this once. What I need help with is doing the resampling hundreds of times AND running a logistic regression each time. I want to keep track of the coefficient and p-value from each regression in a data frame, or somehow "pool" all of the coefficients and p-values into one. The idea is to show that even if you randomly shuffle the entry age of individuals within countries, you still get the same result: a positive and significant association.

Data:

set.seed(123)
library(dplyr)
library(miceadds)  # provides glm.cluster()
n <- 50
df <- data.frame(country = rep(c("A", "B"), each = n), individual = rep(c(1:50, 51:100), each = 1))
df <- df %>%
  mutate(age = ifelse(country == "A", round(rnorm(n, 60,5)), round(rnorm(n, 30,5))))

# Repeat each individual's row 1 to 10 times to create individual-year observations
df <- df[rep(row.names(df), sample(1:10, 100, replace = TRUE)), ]

df <- df %>%
  group_by(individual) %>%
  # Age increases by one per row; y is drawn with a country-specific probability
  mutate(age = age + row_number(),
         y = ifelse(country == "A", rbinom(n(), 1, prob = .75), rbinom(n(), 1, prob = .25))) %>%
  ungroup()

mod1 <- glm.cluster(df, y ~ age, cluster = "individual", family = binomial(link = "logit"))
summary(mod1)

In the original data, an individual-year observation is sometimes duplicated when y = 1 (if the individual did y more than once in a year), but I am not sure how to recreate that in a reproducible example. I don't think it matters for what I am trying to do, though.
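One way the duplication might be mimicked is sketched below on a tiny toy data frame. The data frame `toy`, the `copies` vector, and the rule that a y = 1 row appears once or twice are all assumptions made up for illustration, not features of the real dataset.

```r
set.seed(1)

# Hypothetical individual-year rows
toy <- data.frame(individual = c(1, 1, 2, 2, 3),
                  year       = c(1, 2, 1, 2, 1),
                  y          = c(1, 0, 0, 1, 1))

# Assumed rule: rows with y == 1 appear once or twice, rows with y == 0 appear once
copies <- ifelse(toy$y == 1, sample(1:2, nrow(toy), replace = TRUE), 1L)

# Repeat each row according to its copy count
toy_dup <- toy[rep(seq_len(nrow(toy)), times = copies), ]
rownames(toy_dup) <- NULL
toy_dup
```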

I can do the reshuffle described above once:

seednum <- sample(1:1000,1)
set.seed(seednum)

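# One row per individual (first observed row), keeping country and age
df1 <- df %>% distinct(individual, .keep_all = TRUE)

# Shuffle ages across individuals within each country
df1 <- df1 %>%
  group_by(country) %>%
  mutate(age_random = sample(age, n())) %>%
  ungroup() %>%
  dplyr::select(individual, age_random)

# Attach each individual's shuffled age to all of their rows
df <- left_join(df, df1, by = "individual")

# Increment the shuffled age over each individual's successive rows
df <- df %>%
  group_by(individual) %>%
  mutate(age_random = age_random + row_number()) %>%
  ungroup()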

mod1 <- glm.cluster(df, y ~ age_random, cluster = "individual", family = binomial(link = "logit"))
summary(mod1)
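Repeating this many times could look roughly like the sketch below, which wraps one shuffle-and-fit step in a function and stacks the results into a data frame. It is base R only and builds its own toy data so that it runs on its own. The names `base`, `panel`, `permute`, `one_shuffle`, `n_reps`, and `results` are made up for illustration. Note the assumption that plain `glm()` stands in for `glm.cluster()`: the standard errors and p-values here are not cluster-adjusted, and matching the models above would mean swapping in the clustered fit and extracting its coefficient table instead.

```r
set.seed(123)

# Toy data shaped like the example: one row per individual, then expanded to individual-years
n_ind <- 100
base <- data.frame(individual = 1:n_ind,
                   country    = rep(c("A", "B"), each = n_ind / 2))
base$entry_age <- ifelse(base$country == "A",
                         round(rnorm(n_ind, 60, 5)),
                         round(rnorm(n_ind, 30, 5)))
n_years <- sample(1:10, n_ind, replace = TRUE)
panel <- base[rep(seq_len(n_ind), n_years), ]
panel$year_index <- sequence(n_years)  # 1, 2, ... within each individual
panel$y <- rbinom(nrow(panel), 1, ifelse(panel$country == "A", 0.75, 0.25))

# Permute by index so a length-1 group is returned unchanged
# (sample(x) on a single number x would draw from 1:x instead)
permute <- function(x) x[sample.int(length(x))]

# One replicate: shuffle entry ages within country, rebuild age, refit the logit
one_shuffle <- function(base, panel) {
  shuffled <- ave(base$entry_age, base$country, FUN = permute)
  idx <- match(panel$individual, base$individual)
  panel$age_random <- shuffled[idx] + panel$year_index
  # Assumption: plain glm(), so no cluster-robust standard errors
  fit <- glm(y ~ age_random, data = panel, family = binomial(link = "logit"))
  coefs <- coef(summary(fit))
  data.frame(estimate  = coefs["age_random", "Estimate"],
             std_error = coefs["age_random", "Std. Error"],
             p_value   = coefs["age_random", "Pr(>|z|)"])
}

# Repeat the shuffle and collect one row per replicate
n_reps <- 200
results <- do.call(rbind, lapply(seq_len(n_reps), function(i) one_shuffle(base, panel)))
results$rep <- seq_len(n_reps)

# Simple summaries across replicates
mean(results$estimate)                                # average coefficient
quantile(results$estimate, c(0.025, 0.975))           # spread of coefficients
mean(results$estimate > 0 & results$p_value < 0.05)   # share positive and significant
```

Reporting the share of replicates with a positive, significant coefficient, along with the spread of the estimates, is probably easier to defend than averaging p-values, since a mean of p-values has no standard interpretation.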


