Wednesday, 30 November 2016

Random samples but grouped by certain values in columns

I've looked everywhere but can't find someone doing this. I imagine there must be a way in R though.

I have a dataset of around 200k rows that looks like this:

Report ID | Month | Day | Year | Location ID | comments
1         | 4     | 1   | 2015 | 200         | blah blah blah
2         | 11    | 3   | 2014 | 100         | blah blah blah
3         | 4     | 5   | 2015 | 203         | blah blah blah
4         | 8     | 30  | 2012 | 204         | blah blah blah
5         | 11    | 5   | 2013 | 204         | blah blah blah
6         | 11    | 1   | 2015 | 100         | blah blah blah
7         | 11    | 10  | 2013 | 204         | blah blah blah

I need to draw a sample of report IDs that is evenly distributed across location IDs, years, and months. I know this wouldn't be a truly random sample, but the location IDs skew heavily toward a few locations, and some months have far more reports than others.

I have tried various sampling and subsetting techniques in R, but they all seem to sample the data set as a whole, and I have been unable to find a way to ask for, say, 500 report IDs for each location, let alone to then ask for an even distribution of years and months within those 500. Any suggestions?
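
What is described here is usually called stratified sampling: split the rows into groups (strata) and draw the same number from each. Below is a minimal sketch in base R, not a definitive solution. The toy data frame, the column names (report_id, month, year, location_id), the helper sample_rows, and the per_cell value are all assumptions made for illustration. It treats every location x year x month combination as one stratum; with 48 year-month cells per location, about 10 rows per cell would give roughly 500 per location.

```r
# Toy data standing in for the real 200k-row data set.
# Column names and values are assumptions, not the real schema.
set.seed(42)
n <- 20000
reports <- data.frame(
  report_id   = seq_len(n),
  month       = sample(1:12, n, replace = TRUE),
  year        = sample(2012:2015, n, replace = TRUE),
  # Skewed on purpose: location 100 gets most of the reports.
  location_id = sample(c(100, 200, 203, 204), n, replace = TRUE,
                       prob = c(0.6, 0.2, 0.1, 0.1))
)

# Hypothetical helper: draw up to `size` row indices from one stratum.
# If the stratum has fewer rows than `size`, all of them are kept.
# sample.int() on the length avoids the sample(x) surprise when x is
# a single number.
sample_rows <- function(rows, size) {
  rows[sample.int(length(rows), min(size, length(rows)))]
}

# One stratum per location x year x month combination that actually occurs.
per_cell <- 10
strata <- split(seq_len(nrow(reports)),
                list(reports$location_id, reports$year, reports$month),
                drop = TRUE)

# Sample within each stratum, then collect the chosen rows.
picked   <- unlist(lapply(strata, sample_rows, size = per_cell),
                   use.names = FALSE)
balanced <- reports[picked, ]

# Check the balance across locations.
table(balanced$location_id)
```

Sparse cells (a location with few reports in some month) will contribute fewer than per_cell rows, so the result is only as even as the data allows; sampling with replacement would force equal counts at the cost of duplicate reports.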



