I have a simple DataFrame of shape (10000, 8) (you can use one like it for testing).
I need to extract 100 samples without replacement in each iteration of a loop (think 100K iterations).
The first approach that comes to mind is DataFrame.sample(), but after some testing its processing time seems high:
import timeit

print(timeit.timeit(
    'df_sample = df.sample(100, replace=False)',
    setup="import pandas; df = pandas.read_csv('Free_Test_Data_500KB_CSV-1.csv')",
    number=100_000))

print(timeit.timeit(
    'df_sample = df.iloc[random.sample(range(len(df)), 100)]',
    setup="import pandas; import random; df = pandas.read_csv('Free_Test_Data_500KB_CSV-1.csv')",
    number=100_000))
and this is the result:
25.144644900006824
8.81066109999665
If I set replace=True, the processing time gets close to the second snippet's, but that is not what I want. (I tested this and believe random.sample does return a sample without replacement.)
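To back up that claim, here is a minimal check (using a stand-in index range, not the CSV above): random.sample draws k unique elements from the population, so the row positions it returns contain no duplicates, which is exactly sampling without replacement.

```python
import random

# random.sample picks k distinct elements from the population,
# so every drawn row index is unique -- no replacement.
idx = random.sample(range(10_000), 100)
assert len(idx) == 100
assert len(set(idx)) == 100  # no duplicate indices
```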
Is there a better/more efficient approach for a small number of samples?
P.S. I also noticed that if I increase the number of samples from 100 to 1000, the results change completely, but my use case involves a small number of samples:
28.4684341000102
37.28109739998763
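One direction worth benchmarking (a sketch only, on a synthetic (10000, 8) frame since the CSV above isn't included): draw the row positions with NumPy's Generator.choice, which supports replace=False, and index with DataFrame.take, which skips some of the overhead of DataFrame.sample.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for the (10000, 8) DataFrame from the question.
df = pd.DataFrame(rng.standard_normal((10_000, 8)))

# Generator.choice with replace=False yields unique row positions;
# df.take indexes by position like iloc.
idx = rng.choice(len(df), size=100, replace=False)
df_sample = df.take(idx)

assert len(df_sample) == 100
assert df_sample.index.is_unique  # no row drawn twice
```

Whether this beats the random.sample + iloc variant for 100-row draws would need to be measured with timeit on the real data.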