Monday, 6 September 2021

In R, sample [n] unique IDs for whom the value of [column] equals zero

Background

Here's d, an R dataframe:

d <- data.frame(ID = c("a","a","b","b","c","d","d"),
                gender = c(0,0,0,0,0,1,1),
                zip = c(48601,48601,NA,29910,54220,NA,44663),
                stringsAsFactors = FALSE)

It looks like this:

ID  gender    zip
 a       0  48601
 a       0  48601
 b       0     NA
 b       0  29910
 c       0  54220
 d       1     NA
 d       1  44663

The Problem

I'd like to sample conditionally from d, but I'm getting tripped up on the details.

Specifically, I'd like to sample ...

  • All the rows belonging to a fixed number (2, in this case) of unique d$ID values ...
  • ... where those IDs are drawn only from rows for which d$gender is zero

Phrased differently, I'm saying to R: "sample 2 distinct IDs who have gender = 0".

What I want is a dataframe d2 that could look like this:

ID  gender    zip
 a       0  48601
 a       0  48601
 b       0     NA
 b       0  29910

Because it's sampling, of course, it could also look something like this:

ID  gender    zip
 b       0     NA
 b       0  29910
 c       0  54220

The real dataset I'm working with has hundreds of thousands of unique IDs. I want to sample from them (instead of just subsetting all of them) because using them all in my analysis would take too much memory and, for statistical reasons, I don't need all those IDs.

What I've tried

I've attempted things like this:

library(magrittr)  # provides the %>% pipe used below
set.seed(123)
d2 <- sample(subset(unique(d$ID), d$gender==0), size = 2) %>% as.data.frame()

This runs, but the output is odd:

.
a
d
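A likely reason "d" can show up at all, sketched under the assumption that subset() on a plain vector simply indexes it with the logical condition: unique(d$ID) has 4 elements, while d$gender==0 has 7 (one per row of d), so the two no longer line up. The names ids, keep and candidates below are mine, introduced only for illustration.

```r
d <- data.frame(ID = c("a","a","b","b","c","d","d"),
                gender = c(0,0,0,0,0,1,1),
                stringsAsFactors = FALSE)

ids  <- unique(d$ID)     # one entry per distinct ID
keep <- d$gender == 0    # one entry per ROW of d, not per distinct ID

length(ids)              # 4
length(keep)             # 7

# The logical index is longer than the vector it indexes, so positions
# 1-4 are all TRUE (every ID survives, "d" included) and position 5
# points past the end of ids, which yields NA.
candidates <- subset(ids, keep)
candidates               # "a" "b" "c" "d" NA
```

So the pool being sampled from is not restricted to gender-zero IDs. The "." header in the output most likely comes from piping a bare vector into as.data.frame(), which gives the single column that name.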

I've also seen several posts asking about conditional sampling (in fact, I've asked one myself before), but my parameters are slightly different and I can't quite find what I need. I think I'm not too far from a solution, but it eludes me enough to ask for your help. Thanks.
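For reference, the requirement can be read as three steps: find the IDs that have gender equal to zero, sample n of those IDs, then keep every row of d that belongs to a sampled ID. Below is a minimal base-R sketch of that reading, not a definitive solution. It assumes gender is constant within each ID (as it is in d); n_ids, eligible_ids and sampled_ids are hypothetical names that do not appear in the original.

```r
d <- data.frame(ID = c("a","a","b","b","c","d","d"),
                gender = c(0,0,0,0,0,1,1),
                zip = c(48601,48601,NA,29910,54220,NA,44663),
                stringsAsFactors = FALSE)

n_ids <- 2   # hypothetical name: how many distinct IDs to draw

set.seed(123)

# Step 1: filter the rows first, THEN take unique IDs, so the condition
# and the ID vector have the same length. which() drops rows where
# gender is NA instead of propagating NA into the ID pool.
eligible_ids <- unique(d$ID[which(d$gender == 0)])

# Step 2: draw n_ids distinct IDs (sample() is without replacement by default).
sampled_ids <- sample(eligible_ids, size = n_ids)

# Step 3: keep every row belonging to a sampled ID.
d2 <- d[d$ID %in% sampled_ids, ]
d2
```

If an ID could have mixed gender values across its rows, step 3 would also return that ID's non-zero rows; adding `& d$gender == 0` to the row filter would be one way to exclude them. On the real data, sampling the IDs first also means the large data frame is subset only once.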



