I want to select a random sample of rows from a large R data frame df (around 10 million rows) in such a way that all distinct values of two columns are included in the resulting sample. df looks like:
StoreID  WEEK  Units  Value  ProdID
2001     1     1      3.5    20702
2001     2     2      3      20705
2002     32    3      6      23568
2002     35    5      15     24025
2003     1     2      10     21253
The two columns have the following numbers of distinct values: StoreID: 1433 and WEEK: 52. When I generate a random sample of rows from df, the sample must contain at least one row for each StoreID value and at least one row for each WEEK value.
I tried the function sample_frac in dplyr several times, but it does not ensure that every distinct value of StoreID and WEEK is included at least once in the resulting sample. How can I achieve what I want?
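
To make the requirement concrete, here is a minimal base R sketch of one possible approach, not a definitive solution: first force in one randomly chosen row per distinct StoreID and one per distinct WEEK, then fill the rest of the sample at random from the remaining rows. The helper name sample_covering, its arguments, and the toy data are all hypothetical and exist only for illustration.

```r
# Hypothetical helper: sample about `frac` of the rows of `df` while
# guaranteeing that every distinct value of each column named in `cols`
# appears at least once. Assumes 0 <= frac <= 1.
sample_covering <- function(df, cols, frac) {
  n <- nrow(df)
  # Pick one row index at random from a vector of candidate indices.
  pick_one <- function(idx) idx[sample.int(length(idx), 1)]
  # One randomly chosen row per distinct value of each column.
  forced <- unique(unlist(lapply(cols, function(col) {
    vapply(split(seq_len(n), df[[col]], drop = TRUE), pick_one, integer(1))
  })))
  # Fill up to the target size from the rows not yet chosen.
  target  <- min(n, max(length(forced), round(frac * n)))
  rest    <- setdiff(seq_len(n), forced)
  n_extra <- target - length(forced)
  extra   <- if (n_extra > 0) rest[sample.int(length(rest), n_extra)] else integer(0)
  df[sort(c(forced, extra)), , drop = FALSE]
}

# Toy data standing in for the real 10-million-row df (made up for illustration).
set.seed(1)
df <- data.frame(
  StoreID = sample(2001:2050, 5000, replace = TRUE),
  WEEK    = sample(1:52, 5000, replace = TRUE),
  Units   = sample(1:5, 5000, replace = TRUE)
)
smp <- sample_covering(df, c("StoreID", "WEEK"), frac = 0.05)
```

Note that the forced rows make the result slightly non-uniform: rows belonging to rare StoreID or WEEK values are more likely to be selected than under plain sample_frac. With 1433 stores and 52 weeks against 10 million rows, at most 1485 rows are forced, so the effect should be small for any reasonably sized sample.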