Thursday, July 6, 2017

Existing Package or Strategy to Randomly Sample Panel Data by User from a Large Dataframe

Summary:

How do I sample at the group level from a large panel dataset without loading it all into memory?

Note: I use python-3.x.

Detail:

I have a large dataframe/CSV file (>20 GB) that is impractical to fit in memory. The data is a panel, meaning it consists of groups of observations sharing the same id. For example, there are 20 million people with 100 observations each.

I want to sample at the user level, meaning any sample should include all of the observations for each selected user.

Ideas:

  1. Make a hash of the id in a way that is agnostic to the format of the id (within reason). For example, the id could be a user number or some alphanumeric sequence.

  2. I would then filter on this hash with a function that accepts a given percentage of ids.

  3. Then run the filter across each observation, building a dataframe.
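A minimal sketch of the three steps above, assuming a cryptographic hash (MD5 here) as the format-agnostic hash and pandas chunked reading to avoid loading the file into memory. The column name `"id"`, the file path, and the chunk size are placeholders, not part of the original question:

```python
import hashlib

import pandas as pd

SAMPLE_PCT = 10  # keep roughly 10% of users


def keep_user(user_id, pct=SAMPLE_PCT):
    """Deterministic, format-agnostic test: hash the id (as text),
    map the hash to one of 100 buckets, and keep the user if the
    bucket falls below the target percentage."""
    digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < pct


def sample_users(path, chunksize=1_000_000):
    """Stream the big CSV in chunks so it never has to fit in memory,
    keeping every observation for each selected user."""
    parts = []
    for chunk in pd.read_csv(path, chunksize=chunksize):
        parts.append(chunk[chunk["id"].map(keep_user)])
    return pd.concat(parts, ignore_index=True)
```

Because the decision depends only on the id, every observation for a given user lands on the same side of the filter, regardless of which chunk it arrives in.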

The problem is that I'm not confident about steps 1 and 2. In particular, I'm not sure how to create a hash that behaves randomly for all reasonable statistical purposes.
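One way to reassure yourself about the statistical behavior of a candidate hash is to measure its bucket distribution empirically on synthetic ids. The sketch below (the 100-bucket scheme and the MD5 choice are assumptions, not from the original question) checks that ids spread roughly uniformly:

```python
import hashlib
from collections import Counter


def bucket(user_id):
    """Map an id of any printable format to one of 100 buckets."""
    return int(hashlib.md5(str(user_id).encode("utf-8")).hexdigest(), 16) % 100


# Empirical distribution over 100 buckets for 100k synthetic ids;
# each bucket should hold roughly 1,000 ids if the hash is uniform.
counts = Counter(bucket(i) for i in range(100_000))
```

A cryptographic hash like MD5 or SHA-256 is designed so its output is indistinguishable from uniform, which is far stronger than what group-level sampling needs; the caveat is that the split is deterministic, so rerunning with the same ids always selects the same users (often a feature for reproducibility).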

I'm sure this is a solved problem. Does anyone have any ideas?



