Summary:
How do I sample at the group level from a large panel dataset without loading it into memory?
Note: I use python-3.x.
Detail:
I have a large dataframe/csv file (>20 GB) that is impractical to fit in memory. The data is structured as a panel, meaning it consists of groups of observations that share the same id. For example, there are 20 million people with 100 observations each.
I want to sample at the user level, meaning any sample should include either all of a user's observations or none of them.
Ideas:
1. Make a hash of the `id` in a way that is agnostic to the format of the `id` (within reason). For example, the ids could be sequential user numbers or some alphanumeric sequence.
2. Filter the hash values with a `filter` function that keeps a given percentage of them.
3. Run the `filter` across every observation, building a dataframe (see the sketch after this list).
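Here is a minimal sketch of what I have in mind, assuming a csv at `panel.csv` with an `id` column; `keep_id`, `sample_panel`, and the salt are hypothetical names, and md5 stands in for whatever format-agnostic hash turns out to be appropriate:

```python
import hashlib

import pandas as pd


def keep_id(uid, pct, salt="panel-sample"):
    """Deterministic, format-agnostic membership test for one id.

    md5 spreads any string-like id roughly uniformly over its output
    space, so comparing the digest (mod 10**6) against pct * 10**6
    keeps about pct of all ids, whatever the id's original format.
    """
    digest = hashlib.md5((salt + str(uid)).encode("utf-8")).hexdigest()
    return int(digest, 16) % 10**6 < pct * 10**6


def sample_panel(path, id_col, pct, chunksize=1_000_000):
    """Stream the csv in chunks, keeping every row whose id passes."""
    kept = []
    for chunk in pd.read_csv(path, chunksize=chunksize):
        mask = chunk[id_col].map(lambda uid: keep_id(uid, pct))
        kept.append(chunk[mask])
    return pd.concat(kept, ignore_index=True)


# e.g. a 1% user-level sample:
# sample = sample_panel("panel.csv", id_col="id", pct=0.01)
```

Because every observation of a given id hashes to the same value, each user is kept or dropped as a whole, which is exactly the all-or-nothing property I need.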
The problem is that I'm not confident about steps 1 and 2. I'm not sure how to create a hash that behaves randomly enough for all reasonable statistical purposes.
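The best check I can think of is empirical (an illustrative test on synthetic ids, using the same assumed md5-based bucketing as above): hash a large batch of made-up ids and confirm the pass rate lands near the requested percentage.

```python
import hashlib

# Hash 100k synthetic ids and confirm that roughly 10% of them
# fall below a 10% threshold, i.e. the buckets look uniform.
def bucket(uid):
    return int(hashlib.md5(str(uid).encode()).hexdigest(), 16) % 10**6

ids = [f"user_{i}" for i in range(100_000)]
hit_rate = sum(bucket(uid) < 0.10 * 10**6 for uid in ids) / len(ids)
print(round(hit_rate, 4))  # expect a value close to 0.10
```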
I'm sure this is a solved problem. Does anyone have any ideas?