I have a directory with millions of items in it on a fairly slow disk. I want to sample 100 of those items randomly, and I want to select them with a glob pattern as well.
One way to do it is to get a glob of every file in the directory, then sample that:
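import glob
import random

# Build the full list of matching files (this is the slow part).
files = sorted(glob.glob('*.xml'))
# Sample 100 of the files themselves rather than 100 indices.
random_files = random.sample(files, 100)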
But this is really slow, because building the full list of millions of files requires a lot of disk crawling.
Is there a faster way to do this that doesn't hit the disk as much? It doesn't have to be a perfectly distributed sample or even do exactly 100 items, provided it's fast.
I'm thinking that:
- Maybe we can use the inodes to be faster?
- Maybe we can select items without knowing the entirety of what's on disk?
- Maybe there's some shortcut that can make this faster.
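To make the second idea concrete, here is a rough sketch of what I mean by selecting items without holding the whole listing: stream the directory with os.scandir, filter names with fnmatch, and keep a reservoir sample of k names. The function name sample_matching and its parameters are made up for illustration. It still has to read every directory entry once, so I'm not sure how much disk work it actually saves; it only avoids building and sorting the full list in memory.

```python
import fnmatch
import os
import random


def sample_matching(directory, pattern, k, rng=random):
    """Return up to k entry names in `directory` that match `pattern`.

    Hypothetical helper: uses reservoir sampling (Algorithm R), so only
    k names are held in memory at any time.
    """
    reservoir = []
    seen = 0  # number of matching entries encountered so far
    with os.scandir(directory) as entries:
        for entry in entries:
            # Only the name is inspected, so no stat() call is needed.
            if not fnmatch.fnmatch(entry.name, pattern):
                continue
            seen += 1
            if len(reservoir) < k:
                # Fill the reservoir with the first k matches.
                reservoir.append(entry.name)
            else:
                # Replace a random slot with probability k / seen.
                j = rng.randrange(seen)
                if j < k:
                    reservoir[j] = entry.name
    return reservoir
```

Usage would be something like random_files = sample_matching('.', '*.xml', 100). If there are fewer than k matches, it just returns all of them.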