Friday, 26 February 2016

Random sample from a very long sequence, in Python

I have a long Python generator that I want to "thin out" by randomly selecting a subset of its values. Unfortunately, random.sample() does not work with arbitrary iterables. Apparently, it needs something that supports len() (and perhaps random access to the elements, though that's not clear to me). And I don't want to build an enormous list just so I can thin it out.

As a matter of fact, it is possible to sample uniformly from a sequence without knowing its length; there's a nice algorithm in Programming Perl that does just that. But does anyone know of a standard Python module that provides this functionality?
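For reference, the general technique I have in mind is usually called reservoir sampling. Here is a minimal sketch of it (the textbook "Algorithm R", not necessarily the exact variant printed in Programming Perl). The function name reservoir_sample and its rng parameter are my own, not part of any standard module.

```python
import random

def reservoir_sample(iterable, k, rng=random):
    """Return k items chosen uniformly at random from an iterable
    of unknown length, reading it once and keeping only k items."""
    reservoir = []
    for i, item in enumerate(iterable):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item number i (0-based) replaces a random slot
            # with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    if len(reservoir) < k:
        raise ValueError("sample larger than population")
    return reservoir

# Works directly on an iterator; no list of the whole input is built.
picked = reservoir_sample(iter(range(10**6)), 5)
```

This makes a single pass and uses memory proportional to k, but it has to consume the whole generator before it can return anything, and the result is not in the original order.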

Demo of the problem (Python 3)

>>> import random
>>> random.sample(iter("abcd"), 2)
...
TypeError: Population must be a sequence or set.  For dicts, use list(d).

On Python 2, the error is more transparent:

Traceback (most recent call last):
  File "<pyshell#12>", line 1, in <module>
    random.sample(iter("abcd"), 2)
  File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/random.py", line 321, in sample
    n = len(population)
TypeError: object of type 'iterator' has no len()

If there's no alternative to random.sample(), I'd try my luck with wrapping the generator in an object that provides a __len__ method (I can find out the length in advance). So I'll also accept an answer that shows how to do that cleanly.
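Since the length is known in advance, one possible workaround that avoids the wrapper altogether is to sample the positions rather than the items. This is only a sketch under that assumption: thin is a hypothetical helper name, n must be the true length of the generator, and the code is for Python 3, where range is a lazy sequence that random.sample() accepts.

```python
import random

def thin(iterable, n, k):
    """Lazily yield k randomly chosen items from an iterable
    whose length n is known in advance, in original order."""
    # Choose k distinct positions without materialising the items.
    wanted = set(random.sample(range(n), k))
    for i, item in enumerate(iterable):
        if i in wanted:
            yield item

# The generator is consumed once; only the k chosen indices are stored.
subset = list(thin((x * x for x in range(1000)), 1000, 10))
```

Unlike a one-pass sample of unknown length, this stays a generator and preserves the input order. If n overstates the real length, fewer than k items come out, so the length has to be right.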



