Thursday, 25 July 2019

How to select random rows from an R data frame so that all distinct values of two columns are included

I want to select a random sample of rows from a large R data frame df (around 10 million rows) so that every distinct value of two columns, StoreID and WEEK, appears in the resulting sample. df looks like this:

StoreID      WEEK      Units      Value          ProdID
2001         1         1          3.5            20702
2001         2         2          3              20705
2002         32        3          6              23568
2002         35        5          15             24025
2003         1         2          10             21253

StoreID has 1433 distinct values and WEEK has 52. Any random sample of rows I draw from df must contain at least one row for each StoreID value and at least one row for each WEEK value.

I tried the sample_frac function from dplyr in several trials, but it does not guarantee that every distinct value of StoreID and WEEK appears at least once in the resulting sample. How can I achieve this?
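
To make the requirement concrete, here is a minimal base R sketch of one possible approach. It is not the only way to do this, and it does not use sample_frac. The idea is to first pick one random row for each distinct StoreID and one for each distinct WEEK, then top up with rows drawn at random from the rest until a target size is reached. The function name sample_covering, its arguments n and cols, and the toy data frame are all hypothetical and exist only for illustration.

```r
# Hypothetical helper: draw about `n` rows from `df` while guaranteeing that
# every distinct value in each column named in `cols` appears at least once.
sample_covering <- function(df, n, cols = c("StoreID", "WEEK")) {
  idx <- seq_len(nrow(df))

  # Pick one element of a vector at random. sample.int() is used on the
  # length because sample(x, 1) treats a length-one numeric x as 1:x.
  pick_one <- function(i) i[sample.int(length(i), 1)]

  # One randomly chosen row index per distinct value, for each column.
  must <- unique(unlist(lapply(cols, function(col) {
    vapply(split(idx, df[[col]], drop = TRUE), pick_one, integer(1))
  }), use.names = FALSE))

  # Fill up to the target size with rows not already chosen.
  rest <- setdiff(idx, must)
  n_extra <- min(max(n - length(must), 0), length(rest))
  extra <- if (n_extra > 0) rest[sample.int(length(rest), n_extra)] else integer(0)

  df[sort(c(must, extra)), , drop = FALSE]
}

# Toy data standing in for the real df (assumed values, not the real data).
set.seed(1)
df <- data.frame(
  StoreID = sample(2001:2020, 5000, replace = TRUE),
  WEEK    = sample(1:52, 5000, replace = TRUE),
  Units   = sample(1:5, 5000, replace = TRUE)
)

s <- sample_covering(df, n = 250)
```

Two caveats apply. If n is smaller than the number of forced rows (at most 1433 + 52 for the real data), the result will contain more than n rows. The result is also not a simple random sample, because rows carrying rare StoreID or WEEK values are more likely to be selected than they would be under plain random sampling.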



