I have a large DataFrame (at least 100k rows and 40 columns) that I want to repeatedly sample and group by. The code below works, but I was wondering whether the process can be sped up by parallelising any part of it. The DataFrame can live in shared memory and is never modified; I just need to return one or more aggregates for each column.
import pandas as pd
import numpy as np
from tqdm import tqdm
data = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
data['variant'] = np.repeat(['A', 'B'],50)
samples_list = []
for i in tqdm(range(0, 1000)):
    df = data.sample(
        frac=1,  # take the same number of samples as there are rows
        replace=True,  # allow the same row to be drawn multiple times
        random_state=i  # set state to be i for reproducibility
    ).groupby(['variant']).agg(
        {
            'A': 'count',
            'B': [np.nanmean, np.sum, np.median, 'count'],
            'C': [np.nanmean, np.sum],
            'D': [np.sum]
        }
    )
    df['experiment'] = i
    samples_list.append(df)
# Convert to a df
samples = pd.concat(samples_list)
samples.head()
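One possible direction, offered as a sketch rather than a benchmarked answer: each iteration depends only on its seed `i`, so the loop body can be moved into a function and mapped over worker processes with the standard library's `concurrent.futures.ProcessPoolExecutor`. The names `AGGS`, `one_replicate`, `run_parallel` and the `chunksize` value are made up for this illustration, and the string aggregation names (`'mean'`, `'sum'`, `'median'`) are swapped in for the NumPy functions used above. The frame is kept as a module-level global: with the `fork` start method the workers inherit it without an explicit copy, while with `spawn` each worker re-imports the module and rebuilds it.

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np
import pandas as pd

# Same toy frame as in the question, seeded here so the sketch is repeatable.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.integers(0, 100, size=(100, 4)), columns=list("ABCD"))
data["variant"] = np.repeat(["A", "B"], 50)

# Hypothetical name: the aggregation spec from the loop, using pandas'
# built-in string aggregations (which skip NaN by default).
AGGS = {
    "A": "count",
    "B": ["mean", "sum", "median", "count"],
    "C": ["mean", "sum"],
    "D": ["sum"],
}


def one_replicate(i):
    """Draw one bootstrap sample seeded with i and aggregate it per variant."""
    out = (
        data.sample(frac=1, replace=True, random_state=i)
        .groupby(["variant"])
        .agg(AGGS)
    )
    out["experiment"] = i
    return out


def run_parallel(n_replicates, max_workers=None, chunksize=50):
    """Spread the replicates over worker processes and stack the results."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        # chunksize batches several seeds per task to cut inter-process overhead.
        parts = pool.map(one_replicate, range(n_replicates), chunksize=chunksize)
        return pd.concat(list(parts))


if __name__ == "__main__":
    samples = run_parallel(1000)
    print(samples.head())
```

Whether this is actually faster depends on how the cost of one sample-and-aggregate step compares with the cost of sending each small result frame back to the parent process, so it is worth timing against the plain loop on the real data.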