Friday, 29 April 2016

Most efficient way to randomly sample a directory in Python

I have a directory with millions of items in it on a fairly slow disk. I want to randomly sample 100 of those items, restricted to the ones that match a glob pattern.

One way to do it is to get a glob of every file in the directory, then sample that:

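import glob
import random

# Build the full, sorted list of matching files (this is the slow part).
files = sorted(glob.glob('*.xml'))
# Draw 100 distinct file names from the list, not just 100 indices.
random_files = random.sample(files, 100)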

But this is really slow because I have to build up the big list of millions of files, which has to do a lot of disk crawling.

Is there a faster way to do this that doesn't hit the disk as much? It doesn't have to be a perfectly distributed sample or even do exactly 100 items, provided it's fast.
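To show what "not perfect, but fast" could look like, here is a minimal sketch that reads only the first part of the directory and samples from that. The function name `sample_prefix` and the `max_scan` cutoff are hypothetical, and the result is biased toward whatever order the filesystem returns entries in, so it is only acceptable if that bias doesn't matter:

```python
import fnmatch
import itertools
import os
import random


def sample_prefix(path, pattern, k, max_scan=10_000):
    """Sample up to k matching names from only the first max_scan entries.

    Hypothetical helper: stops reading the directory after max_scan entries,
    so the sample is NOT uniform over the whole directory.
    """
    with os.scandir(path) as entries:
        head = [
            entry.name
            for entry in itertools.islice(entries, max_scan)
            if fnmatch.fnmatch(entry.name, pattern)
        ]
    # May return fewer than k names if too few matches were seen.
    return random.sample(head, min(k, len(head)))
```

Something like `sample_prefix('.', '*.xml', 100)` would then return at most 100 names without listing the whole directory.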

I'm thinking that:

  • Maybe we can use the inodes to be faster?
  • Maybe we can select items without knowing the entirety of what's on disk?
  • Maybe there's some shortcut that can make this faster.
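To make the second idea concrete, here is a rough sketch that streams directory entries with `os.scandir` and keeps a uniform random sample using reservoir sampling, so the full list is never built or sorted. The name `sample_directory` is hypothetical, and this still reads every directory entry once, so I would not assume it reduces disk reads, only memory use and sorting:

```python
import fnmatch
import os
import random


def sample_directory(path, pattern, k):
    """Reservoir-sample up to k names matching pattern, in a single pass.

    Hypothetical helper. Note that fnmatch, unlike glob, also matches
    names that start with a dot.
    """
    reservoir = []
    seen = 0  # number of matching entries seen so far
    with os.scandir(path) as entries:
        for entry in entries:
            if not fnmatch.fnmatch(entry.name, pattern):
                continue
            seen += 1
            if len(reservoir) < k:
                # Fill the reservoir with the first k matches.
                reservoir.append(entry.name)
            else:
                # Keep the new entry with probability k / seen.
                j = random.randrange(seen)
                if j < k:
                    reservoir[j] = entry.name
    return reservoir
```

Calling `sample_directory('.', '*.xml', 100)` would return 100 names, or fewer if the directory holds fewer than 100 matches.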


