Say I have N data points spread across m machines, with N on the order of millions, and I want to draw a sample of K data points in a distributed fashion. I also don't know how many data points each machine holds. One way is to go over each machine and each data point, generate a random number r uniformly in [0, 1), and keep the point as one of the samples if r <= K / N; otherwise move on to the next data point. In expectation this gives me K samples across all points (a fraction K / N of the data), but the actual count varies from run to run. I want exactly K points: how can I make sure I get exactly K samples, no more and no fewer, with only one pass over the data?
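To make the problem concrete, here is a minimal sketch of the per-point coin-flip scheme described above, run on a single process that stands in for the m machines. The function name `bernoulli_sample`, the toy `machines` list, and the values of `k` and the seeds are all made up for illustration; a real setup would run the inner loop on each machine separately.

```python
import random


def bernoulli_sample(machines, k, n, seed=None):
    """Single pass: keep each point independently with probability k / n."""
    rng = random.Random(seed)
    p = k / n
    kept = []
    for points in machines:        # one iteration per machine
        for x in points:           # local size is never needed up front
            if rng.random() < p:   # the "r <= K / N" test from the question
                kept.append(x)
    return kept


# Hypothetical toy data: three machines holding uneven amounts of data.
machines = [list(range(0, 500)), list(range(500, 2000)), list(range(2000, 10000))]
n = sum(len(m) for m in machines)
k = 100

# Repeat the experiment with different seeds and look at the sample sizes.
sizes = [len(bernoulli_sample(machines, k, n, seed=s)) for s in range(200)]
print("min:", min(sizes), "mean:", sum(sizes) / len(sizes), "max:", max(sizes))
```

The sample size here follows a Binomial(N, K / N) distribution, so it averages K but is only rarely exactly K, which is the gap the question is asking how to close.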