Friday, April 27, 2018

constant memory reservoir sampling, O(k) possible?

I have an input stream of size n, and I want to produce an output stream of size k containing distinct random elements of the input stream, without requiring any additional memory to store the selected elements.

The algorithm I am currently using is basically as follows:

for each element in input stream
    if random()<k/n
        decrement k
        output element
        if k = 0
            halt
        end if
    end if
    decrement n
end for
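For reference, this is the classic selection-sampling technique (Knuth's Algorithm S). A minimal Python sketch of the pseudocode above, assuming the stream length n is known up front (the function name and signature are my own for illustration):

```python
import random

def selection_sample(stream, n, k):
    """Yield k distinct elements from an iterable of known length n.

    Each k-subset is equally likely, and the selected elements are
    yielded in their original stream order, using O(1) extra memory.
    """
    needed = k      # elements still to be selected
    remaining = n   # elements not yet examined
    for element in stream:
        # Select the current element with probability needed/remaining.
        if random.random() < needed / remaining:
            needed -= 1
            yield element
            if needed == 0:
                return  # early exit once the last element is chosen
        remaining -= 1
```

Note that when needed equals remaining, the selection probability is 1, so the algorithm always manages to output exactly k elements before the stream ends.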

The function random() returns a number drawn uniformly at random from [0, 1), and I trust its principle of operation is straightforward.

Although this algorithm can terminate early once it selects the last of the k elements, in general it still runs in O(n) time. It seems to work as intended (outputting elements that are roughly uniformly spread across the input stream, yet still random), but I'm wondering whether a faster algorithm exists. Obviously, since k elements must be generated, the algorithm cannot be faster than O(k). I would still like to keep the requirement of not using any additional memory, however.



