Sunday, 4 February 2018

Extracting certain levels more than others

I'm trying to simulate the sampling of wildlife from a given site. I've made a species list that contains all species that can be found at that site and their associated rarity.

df <- data.frame(rarity = rep(c('common', 'uncommon', 'rare'), each = 2),
                 species = letters[1:6])
print(df)
    rarity species
1   common       a
2   common       b
3 uncommon       c
4 uncommon       d
5     rare       e
6     rare       f

I then create another data set based on the random sampling of rows from df.

df.sampled <- df[sample(seq_len(nrow(df)), 30, replace = TRUE), ]

The trouble is that this isn't realistic: you won't encounter rare species as often as uncommon ones, or uncommon ones as often as common ones. For example, 6 out of 10 animals encountered should be common, 3 out of 10 should be uncommon, and 1 out of 10 should be rare. Here, we're getting all three rarities at equal frequency:

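# Repeat the uniform sampling 1000 times and record the count per rarity class.
rarity.levels <- c('common', 'uncommon', 'rare')
df.matrix <- matrix(NA, ncol = 3, nrow = 1000,
                    dimnames = list(NULL, rarity.levels))
for(i in 1:1000){
  df.sampled <- df[sample(seq_len(nrow(df)), 30, replace = TRUE), ]
  # Fixed factor levels keep all three columns, in this order, even if a class is absent
  df.matrix[i,] <- c(table(factor(df.sampled$rarity, levels = rarity.levels)))
}
# Mean count per class; with uniform sampling each is close to 10 out of 30
colMeans(df.matrix)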

Is there a way I can sample particular rows more often than others given their rarity? I have a feeling qnorm() should be used, but I could be wrong...
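
For what it's worth, one possible direction is the `prob` argument of `sample()`, which takes a weight for each element being sampled. The sketch below is only an illustration under some assumptions: the 6 : 3 : 1 split is the example rate from the text above, `rarity.weight` and `row.prob` are names made up here, and each class's rate is assumed to be shared equally among the species in that class.

```r
# Species list from the question.
df <- data.frame(rarity = rep(c('common', 'uncommon', 'rare'), each = 2),
                 species = letters[1:6])

# Assumed encounter rate per rarity class (the 6 : 3 : 1 example above).
rarity.weight <- c(common = 0.6, uncommon = 0.3, rare = 0.1)

# Per-row weight: the class rate divided by the number of species in that class
# (assumption: species within a class are equally likely).
class.of.row <- as.character(df$rarity)
n.per.class  <- table(class.of.row)
row.prob <- rarity.weight[class.of.row] / as.numeric(n.per.class[class.of.row])

# Weighted sampling of row indices, with replacement.
set.seed(1)  # only so the sketch is reproducible
idx <- sample(seq_len(nrow(df)), 30, replace = TRUE, prob = row.prob)
df.sampled <- df[idx, ]

# Counts per class, in a fixed order.
table(factor(df.sampled$rarity, levels = names(rarity.weight)))
```

Plugging `prob = row.prob` into the 1000-run loop above should pull the mean counts toward 18, 9 and 3 out of 30 rather than 10 each. `qnorm()` is the quantile function of the normal distribution, so it would not do the weighting by itself.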



