I have an input stream of size n, and I want to produce an output stream of k distinct elements chosen at random from the input, without using any additional memory to store the selected elements.
The algorithm I am currently using is basically as follows:
for each element in input stream
    if random() < k/n
        decrement k
        output element
        if k = 0
            halt
        end if
    end if
    decrement n
end for
The function random() returns a number drawn uniformly at random from [0, 1); I take its operation as given.
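For concreteness, here is a minimal sketch of the same idea in Python (the name sample_stream is just illustrative, and I use Python's random.random() for the uniform draw):

    import random

    def sample_stream(stream, n, k):
        """Yield k distinct elements from `stream` of known length n,
        selecting each element i with probability (still needed)/(still remaining)."""
        remaining = n      # elements not yet examined
        needed = k         # elements still to be selected
        for element in stream:
            if needed == 0:
                break      # sample complete, stop early
            if random.random() < needed / remaining:
                yield element
                needed -= 1
            remaining -= 1

    # Example: pick 5 distinct numbers from 0..99, in stream order.
    print(list(sample_stream(range(100), 100, 5)))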
Although this algorithm can terminate early once it selects its last element, in general it is still O(n), since it examines the input one element at a time. It seems to work as intended (the output is random, with each input element equally likely to be selected), but I'm wondering if a faster algorithm exists. Obviously, since k elements must be output, the algorithm cannot be faster than O(k). I would still like to keep the requirement of not using any additional memory, however.