Friday, December 3, 2021

Parallel sampling and groupby in pandas

I have a large df (>=100k rows and 40 columns) that I am looking to repeatedly sample and group by. The code below works, but I was wondering if there is a way to speed things up by parallelising any part of the process. The df can live in shared memory, and nothing in it ever gets changed; I just need to return one or more aggregates for each column.

import pandas as pd
import numpy as np
from tqdm import tqdm

data = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))

data['variant'] = np.repeat(['A', 'B'],50)

samples_list = []
for i in tqdm(range(1000)):
    df = data.sample(
            frac=1,         # take the same number of samples as there are rows
            replace=True,   # allow the same row to be drawn multiple times
            random_state=i  # set the seed to i for reproducibility
            ).groupby(['variant']).agg(
                {
                    'A': 'count',
                    'B': ['mean', 'sum', 'median', 'count'],  # string aggregators skip NaN, like np.nanmean
                    'C': ['mean', 'sum'],
                    'D': ['sum']
                }
                )
    df['experiment'] = i
    samples_list.append(df)

# Convert to a df
samples = pd.concat(samples_list)

samples.head()
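
If the goal is simply to use more cores, one option is to farm the bootstrap iterations out to worker processes. Below is a minimal sketch using joblib (my choice for illustration, not part of the original code; concurrent.futures.ProcessPoolExecutor would work the same way). Each task takes a seed, draws its own resample, aggregates it, and returns only the small result frame to the parent.

import pandas as pd
import numpy as np
from joblib import Parallel, delayed

data = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
data['variant'] = np.repeat(['A', 'B'], 50)

def one_bootstrap(i):
    # data is only read, never modified, so workers can safely share it
    df = data.sample(
            frac=1,        # resample as many rows as the df has
            replace=True,  # with replacement, i.e. a bootstrap sample
            random_state=i
            ).groupby(['variant']).agg(
                {
                    'A': 'count',
                    'B': ['mean', 'sum', 'median', 'count'],
                    'C': ['mean', 'sum'],
                    'D': ['sum']
                }
                )
    df['experiment'] = i
    return df

# n_jobs=-1 uses every core; backend='multiprocessing' forks on Linux,
# letting children read the parent's df copy-on-write instead of copying it
samples_list = Parallel(n_jobs=-1, backend='multiprocessing')(
    delayed(one_bootstrap)(i) for i in range(1000)
)
samples = pd.concat(samples_list)

Two caveats worth stating: with joblib's default loky backend each worker receives its own pickled copy of data, so true sharing only happens with a fork-based start method (Linux); and because each individual sample-plus-groupby is fast, batching many seeds per task (say, 100 at a time) is usually needed before the parallel version beats the plain loop, since process and serialisation overhead can otherwise dominate.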


