I have a simple DataFrame of shape (10000, 8) (you can use one like it for testing).
I need to extract 100 samples without replacement in each iteration of a loop (think 100K iterations).
The first approach that comes to mind is DataFrame.sample(), but after some testing its processing time seems high:
import timeit

print(timeit.timeit(
    'df_sample = df.sample(100, replace=False)',
    setup="import pandas; df = pandas.read_csv('Free_Test_Data_500KB_CSV-1.csv')",
    number=100_000))

print(timeit.timeit(
    'df_sample = df.iloc[random.sample(range(len(df)), 100)]',
    setup="import pandas; import random; df = pandas.read_csv('Free_Test_Data_500KB_CSV-1.csv')",
    number=100_000))
and this is the result:
25.144644900006824
8.81066109999665
If I set replace=True, the processing time gets close to the second snippet's, but that is not what I want. (I tested this and believe random.sample does return a sample without replacement.)
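To back up that claim, here is a minimal check (using a stand-in index range, not the CSV above): random.sample draws k unique elements from the population, so the row positions it returns contain no duplicates, which is exactly sampling without replacement.

```python
import random

# random.sample picks k distinct elements from the population,
# so every drawn row index is unique -- no replacement.
idx = random.sample(range(10_000), 100)
assert len(idx) == 100
assert len(set(idx)) == 100  # no duplicate indices
```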
Is there a better/more efficient approach for a small number of samples?
P.S. I also noticed that if I increase the number of samples from 100 to 1000, the results change completely, but my use case involves a small number of samples:
28.4684341000102
37.28109739998763
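One direction worth benchmarking (a sketch only, on a synthetic (10000, 8) frame since the CSV above isn't included): draw the row positions with NumPy's Generator.choice, which supports replace=False, and index with DataFrame.take, which skips some of the overhead of DataFrame.sample.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for the (10000, 8) DataFrame from the question.
df = pd.DataFrame(rng.standard_normal((10_000, 8)))

# Generator.choice with replace=False yields unique row positions;
# df.take indexes by position like iloc.
idx = rng.choice(len(df), size=100, replace=False)
df_sample = df.take(idx)

assert len(df_sample) == 100
assert df_sample.index.is_unique  # no row drawn twice
```

Whether this beats the random.sample + iloc variant for 100-row draws would need to be measured with timeit on the real data.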