samedi 7 juillet 2018

Probability of generating a set of M elements from an array of size N

I'm trying to understand solution for the following task: Randomly generate a set of M elements from an array of size N. Each element must have equal probability of being chosen.

I found the following solution (I've already read this question, and this one, but I still have questions that are too long for comments):

int rand(int min, int max) { 
  return min + (int)(Math.random() * (max - min + 1));
}

int[] generateSet(int[] arr, int m, int n) {
    if (n + 1 == m) { //base case
        int[] set = new int[m];
        for (int k = 0; k < m; k++) {
            set[k] = arr[k];
        }
        return set;
    }

    int[] set = generateSet(arr, m, n - 1);
    int r = rand(0, n);
    if (r < m) set[r] = arr[n];
    return set;
}
// rand() function returns inclusive value 
// i.e. rand(0, 5) will return from 0 to 5

This code was found in book "Cracking the coding interview" (Section Hard, Task 3). Author explains it as follows:

Suppose we have an algorithm that can pull a random set of m elements from an array of size n - 1. How can we use this algorithm to pull a random set of m elements from an array of size n? We can first pull a random set of size m from the first n - 1 elements. Then, we just need to decide if array[n] should be inserted into our subset (which would require pulling out a random element from it). An easy way to do this is to pick a random number k from 0 through n. If k < m, then insert array[n] into subset[k]. This will both "fairly" (i.e., with proportional probability) insert array[n] into the subset and "fairly" remove a random element from the subset. This is even cleaner to write iteratively. In this approach, we initialize an array subset to be the first m elements in original. Then, we iterate through the array, starting at element m, inserting array[i] into the subset at (random) position k whenever k < m.

I fully understand the base case.It says that: if we have an array of size N and M == N, therefore, we should return first M elements from the array, because each of them will be selected with equal probability.

Than comes the hard part (recursive case) that I do not understand at all.

  1. Code generates set of size M from array of size N - 1
  2. Now code should decide add or not "new" element arr[N] to the set
  3. With the probability M / N code decides add or not "new" element. Random works as follow:

    1. Generates random number r between 0 and N
    2. If (r < m) it means that r was generated with M / N probability
    3. Also if (r < m) it means that with probability 1 / M we will change one of M elements in the set.

I have no idea where M / N probability comes from. I think code should be implemented as follow instead:

  1. We need to decide take Nth element or not. We have N elements so we should decide take or not with 1 / N probability i.e. r = random(0, n)
  2. If r == n then we should take this element
  3. Next with probability 1/M we should find what element in the set we will replace, so it will be set[random(0, m)] = arr[n]

Therefore, the whole recursive part will be:

int[] set = generateSet(arr, m, n - 1);
int r = rand(0, n);
if (r == n) {
  int indexToReplace = random(0, m);
  set[indexToReplace] = arr[n];     
}
return set;

Could you help me understand:

  1. Why we should decide with M / N probability replace element in the set or not?
  2. Is my code incorrect?



Aucun commentaire:

Enregistrer un commentaire