random: Pandas representative sampling across multiple columns

Monday, 23 November 2020

I have a dataframe that represents a population, with each column denoting a different characteristic of a person. How can I draw a sample from that dataframe that is representative of the population as a whole across all characteristics?

Suppose I have a dataframe which represents a workforce of 650 people as follows:

import pandas as pd
import numpy as np
c = np.random.choice  # short alias used to fill each column with random picks

colours = ['blue', 'yellow', 'green', 'green... no, blue']
knights = ['Bedevere', 'Galahad', 'Arthur', 'Robin', 'Lancelot']
qualities = ['wise', 'brave', 'pure', 'not quite so brave']

df = pd.DataFrame({'name_id': c(range(3000), 650, replace=False),
                   'favourite_colour': c(colours, 650),
                   'favourite_knight': c(knights, 650),
                   'favourite_quality': c(qualities, 650)})

I can get a sample of the above that reflects the distribution of a single column as follows:

# Find the distribution of a particular column using value_counts and normalize:
knight_weight = df['favourite_knight'].value_counts(normalize=True)

# Add this to my dataframe as a weights column:
df['knight_weight'] = df['favourite_knight'].map(knight_weight)

# Then sample my dataframe using the weights column I just added as the 'weights' argument:
df_sample = df.sample(140, weights=df['knight_weight'])

This will return a sample dataframe (df_sample) such that:

df_sample['favourite_knight'].value_counts(normalize=True)
is approximately equal to
df['favourite_knight'].value_counts(normalize=True)
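
The "approximately equal" check can be made concrete by putting the proportions side by side. The following is only a sketch: the fixed seed, the `shares` helper and the `comparison` frame are assumptions added for illustration and are not part of the original post. Since `weights` in `DataFrame.sample` sets per-row selection probabilities, an unweighted sample of the same size is included as a baseline.

```python
import numpy as np
import pandas as pd

# Rebuild a comparable population; the fixed seed is an assumption for repeatability.
rng = np.random.RandomState(0)
knights = ['Bedevere', 'Galahad', 'Arthur', 'Robin', 'Lancelot']
df = pd.DataFrame({'favourite_knight': rng.choice(knights, 650)})

# Weighted sample as in the question: each row is weighted by its knight's frequency.
knight_weight = df['favourite_knight'].value_counts(normalize=True)
weighted = df.sample(140, weights=df['favourite_knight'].map(knight_weight),
                     random_state=0)

# Plain uniform sample of the same size, as a baseline.
uniform = df.sample(140, random_state=0)


def shares(frame):
    """Hypothetical helper: proportion of rows per favourite knight."""
    return frame['favourite_knight'].value_counts(normalize=True)


# One row per knight, one column per dataset; a knight missing from a sample gets 0.
comparison = pd.concat(
    {'population': shares(df), 'weighted': shares(weighted), 'uniform': shares(uniform)},
    axis=1,
).fillna(0.0)

print(comparison.round(3))

# Largest absolute gap to the population proportions, per sample.
gaps = comparison[['weighted', 'uniform']].sub(comparison['population'], axis=0).abs()
print(gaps.max().round(3))
```

The same comparison can be repeated for any other column by changing the column name inside `shares`.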

My question is this: how can I generate a sample dataframe (df_sample) such that the condition above, i.e.:

df_sample[column].value_counts(normalize=True)
is approximately equal to
df[column].value_counts(normalize=True)

holds for all columns (except 'name_id') instead of just one of them? A population of 650 with a sample size of 140 is roughly what I'm working with, so performance isn't much of an issue. I'll happily accept solutions that take a couple of minutes to run, as this will still be considerably faster than producing the sample manually. Thank you for any help.
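
One possible approach, offered as a sketch rather than a definitive answer, is stratified sampling: treat every combination of the three characteristics as a stratum and take rows from each stratum in proportion to its size. The function `stratified_sample`, its parameters and the fixed seed below are hypothetical names introduced for illustration. Fractional quotas are rounded with a largest-remainder rule so that the sample has exactly the requested number of rows.

```python
import numpy as np
import pandas as pd


def stratified_sample(df, cols, n, seed=0):
    """Hypothetical helper: draw n rows, giving each joint combination of
    `cols` a share of the sample proportional to its share of `df`."""
    rng = np.random.RandomState(seed)
    groups = [group for _, group in df.groupby(cols)]
    sizes = np.array([len(group) for group in groups])

    exact = sizes * n / len(df)           # ideal, fractional quota per stratum
    quota = np.floor(exact).astype(int)   # round down first
    shortfall = n - quota.sum()           # rows still to hand out
    # Give the remaining rows to the strata with the largest fractional parts.
    order = np.argsort(-(exact - quota), kind='stable')
    quota[order[:shortfall]] += 1

    parts = [group.sample(int(k), random_state=rng)
             for group, k in zip(groups, quota) if k > 0]
    return pd.concat(parts)


# Rebuild the example population (the seed is an assumption for repeatability).
rng = np.random.RandomState(0)
colours = ['blue', 'yellow', 'green', 'green... no, blue']
knights = ['Bedevere', 'Galahad', 'Arthur', 'Robin', 'Lancelot']
qualities = ['wise', 'brave', 'pure', 'not quite so brave']
df = pd.DataFrame({'name_id': rng.choice(3000, 650, replace=False),
                   'favourite_colour': rng.choice(colours, 650),
                   'favourite_knight': rng.choice(knights, 650),
                   'favourite_quality': rng.choice(qualities, 650)})

cols = ['favourite_colour', 'favourite_knight', 'favourite_quality']
df_sample = stratified_sample(df, cols, 140)

# Largest absolute gap between sample and population proportions, per column.
max_diff = {}
for col in cols:
    gap = (df_sample[col].value_counts(normalize=True)
           .sub(df[col].value_counts(normalize=True), fill_value=0).abs())
    max_diff[col] = gap.max()
    print(col, round(max_diff[col], 3))
```

With 80 possible combinations and only 650 rows, each stratum is small, so rounding accounts for most of the remaining mismatch. If more columns were added, the strata would shrink further and this approach would likely need rethinking.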
