I'm trying to sample from a very large data frame with many factor levels. From each level I would like to sample one row, with a sampling probability assigned to each row. I found solutions for this kind of problem using several approaches (see "selecting n random rows across all levels of a factor within a dataframe").
The issue is speed, and I thought that using tapply would perform better. However, I am trying to figure out how to incorporate probabilities into the script. Let's say: df <- data.frame(fac=c('A','A','B','B'),p=c(0.1,0.9,0.8,0.2))
I would like to sample df and get one row from each 'fac' level according to the probability in column 'p'; e.g. most of the time I expect the second row of level 'A' and the first row of level 'B' to be sampled. Using tapply, the code for sampling per level is: df[tapply(1:nrow(df),df$fac,sample,1),]
Where can I incorporate the argument 'prob' from function 'sample' to achieve my goal?
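One possible way to do this, as a minimal sketch: since tapply forwards extra arguments to the function it calls, the full 'p' column can be passed along and subset by the row indices of each level. The helper names pick_one and sample_per_level and the argument name weights are made up for illustration. The sketch assumes every level of 'fac' has at least one row and that 'p' holds non-negative weights.

```r
df <- data.frame(fac = c("A", "A", "B", "B"), p = c(0.1, 0.9, 0.8, 0.2))

# Hypothetical helper: pick one row index from idx, weighted by weights[idx].
pick_one <- function(idx, weights) {
  # sample() treats a single number n as 1:n, so a level with only one row
  # is returned directly instead of being passed to sample().
  if (length(idx) == 1L) return(idx)
  sample(idx, size = 1, prob = weights[idx])
}

# Hypothetical wrapper: one weighted draw per level of d$fac.
sample_per_level <- function(d) {
  # Extra arguments to tapply (here: weights) are forwarded to pick_one.
  rows <- tapply(seq_len(nrow(d)), d$fac, pick_one, weights = d$p)
  d[as.vector(rows), ]
}

res <- sample_per_level(df)
res
```

The prob vector does not need to sum to 1 within a level, because sample() normalizes it.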
Any answer will be greatly appreciated, including alternative methods for the same task that are likely to run fast on a data frame with more than 1,000,000 rows. Thanks!
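As one alternative, here is a sketch that avoids calling a function per level. It swaps in a different technique, the "exponential race": give every row a key drawn as rexp(1) / p and keep the row with the smallest key in each level, which selects row i with probability p_i divided by the sum of p in its level. The function name sample_per_level_fast is made up for illustration, the sketch assumes all weights are strictly positive, and whether it is actually faster on a given data set is untested here.

```r
df <- data.frame(fac = c("A", "A", "B", "B"), p = c(0.1, 0.9, 0.8, 0.2))

# Hypothetical helper: returns one row index per level of fac,
# chosen with probability proportional to p within that level.
sample_per_level_fast <- function(fac, p) {
  # Exponential race: key_i ~ Exp(rate = p_i); the smallest key wins.
  key <- rexp(length(p)) / p
  # Sort rows by level, then by key, so each level's winner comes first.
  o <- order(fac, key)
  # Keep the first row of each level in the sorted order.
  o[!duplicated(fac[o])]
}

idx <- sample_per_level_fast(df$fac, df$p)
df[idx, ]
```

Everything here is vectorized over the whole data frame, so the cost is dominated by a single call to order().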