samedi 21 avril 2018

How many random requests do I need to make to a set of records to get 80% of the records?

Suppose I have an array of 100_000 records ( this is Ruby code, but any language will do)

ary = ['apple','orange','dog','tomato', 12, 17,'cat','tiger' .... ]
results = []

I can only make random calls to the array ( I cannot traverse it in any way)

results << ary.sample 
# in ruby this will pull a random record from the array, and 
# push into results array

How many random calls like that, do I need to make, to get least 80% of records from ary. Or expressed another way - what should be the size of results so that results.uniq will contain around 80_000 records from ary.

From my rusty memory of Stats class in college, I think it's needs to be 2*result set size = or around 160_000 requests ( assuming random function is random, and there is no some other underling issue) . My testing seems to confirm this.

ary = [*1..100_000];
result = [];  
160_000.times{result << ary.sample}; 
result.uniq.size # ~ 80k

This is stats, so we are talking about probabilities, not guaranteed results. I just need a reasonable guess.

So the question really, what's the formula to confirm this?




Aucun commentaire:

Enregistrer un commentaire