mardi 11 décembre 2018

How to randomly sample files from a filesystem in Python

Is there a performant way to sample files from a file system until you hit a target sample size in Python?

For example, let's say I have 10 million files in an arbitrarily nested folder structure and I want a sample of 20,000 files.

Currently, for small-ish flat directories of ~100k or so, I can do something like this:

import os
import random
sample_size = 20_000
sample = random.sample(list(os.scandir(path)), sample_size)
for direntry in sample:
    print(direntry.path)

However, this doesn't scale up well. So, I thought maybe put the random check in the loop. This sort of works, but has the problem of if the number of files in the directory is close the sample_size, it may not pick up the full target sample_size and I would need to keep track of which files were included in the sample and then keep looping until I fill up the sample bucket.

import os
import random
sample_size = 20_000
count = 0
for direntry in os.scandir(path):
    if random.randint(0, 10) < 5:
        continue
    print(direntry.path)
    count += 1
    if count >= sample_size:
        print("reached sample_size")
        break

Any ideas on how to randomly sample a large selection of files from a large directory structure?




Aucun commentaire:

Enregistrer un commentaire