mardi 30 avril 2019

Randomly sampling from large combination generator

At a high level, I'm trying to sample n_samples items across all combinations of n items from a list. At small values of n and relatively small list lengths (n <= 5, len(list) < 75) this is fine - I just use itertools to generate combinations, convert to a list, and randomly sample the correct number using random.sample.

However, my use case requires that I generate the combinations, randomly sample several thousand elements, and then remove one of the combinations from the list and start again with the smaller list.

This creates a problem at high values of n and the len(list) - with 120 list items and n = 5, this usecase means that I have to do list conversion many times and I thus become time constrained by generator --> list conversion for a generator with ~190 million items. This takes an extremely long time (in excess of 20 minutes for especially bad examples).

The usecase doesn't require statistically uniform samples or anything, and I am purely using sampling because with high n and long lists processing every possible combination is computationally impractical and fast processing is extremely important.

I switched to using the iterator.islice method to only take the first n_samples items from the generator and use those. That dramatically increases speed (the example which took 20 minutes now takes 34 seconds), but performance is taking a hit. I think this is due to how itertools generates combinations - as an example,

list(itertools.combinations(list(range(4)), 2))

produces this list: [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]

so it seems that if I have a long enough list and a large enough n, sampling even 100,000+ items by just pulling them off the generator will result in 100,000+ items where the first element is the same which is not ideal. As I said, I don't need perfect random sampling, but I think my performance crash from using this method instead of randomly sampling across the whole list is due to this.

Basically, I need a good way to efficiently sample n_samples items (where n_samples is somewhere from 10k to 500k) from all possible combinations of length n (where n is typically in a range of around 2-8) from a list of length which can vary from ~20 to ~200.

Thanks so much for any advice or resources you can provide!




Aucun commentaire:

Enregistrer un commentaire