mardi 19 juillet 2016

Return non-duplicate random values from a large set

I would like a function that will produce k pseudo-random values from a set of n integers, zero to n-1, without repeating any previous result. k is less than or equal to n. O(n) memory is unacceptable.

Normally if I wanted duplicate-free random values I'd shuffle an array, but that's O(n) memory. n is likely to be too large for that to work.

If the size of the set is a power of two, then I would probably devise a set of 1:1 hash-like operations which accept randomised keys. Each hash would guarantee that every input produces a unique output within the same range. Then I would simply count from zero to k and hash each value to produce the next result. This isn't ideal because it doesn't guarantee a fair shuffle, or that the first k values have a statistically fair chance of producing any possible value.

If the size isn't a power of two, then in theory I could do in another base whatever I did in base two for my previous approach; so I could break n down into its prime factors and treat each factor as a digit, then devise similar hash-like functions to those used above. This probably isn't easy -- particularly if n is prime -- because I can't necessarily apply the same careful hash design for every possible factor.

The best practical solution that I can come up with is to pick the next power of two greater or equal to n, apply hashes, and then drop any out-of-range values and repeat. Unlike truly random streams I know for certain that there are a finite number of unusable results so I know that I won't continue rejecting values forever.

What I don't like about that is the uncertainty about the distribution of results. Is there a better way to handle this situation?

One thing I'm thinking of is a hash-like operation where I contrive to have the first j results perfectly random through carefully designed mapping, and then any results between j and k would simply extrapolate on that pattern (albeit in a predictable way). The value j would be chosen to produce a tolerable memory footprint.




Aucun commentaire:

Enregistrer un commentaire