mardi 25 février 2020

Elements of Programming Interview: Sample Online Data Algorithm

I am going through a book called "Elements of Programming Interview" and have gotten stuck at the following problem:

Implement an algorithm that takes as input an array of distinct elements and a size, and returns a subset of the given size of the array elements. All subsets should be equally likely. Return the result in input array itself.

The solution they provide below is:

import random
def random_sampling(k, A):
    for i in range(k):
        # Generate a random index in [i, len(A) - 1].
        r = random.randint(i, len(A) - 1)
        A[i], A[r] = A[r], A[i]


A = [3, 7, 5, 11]
k = 3

print(random_sampling(k, A))

I so not understand what the authors are trying to do intuitively. Their explanation is below

Another approach is to enumerate all subsets of size k and then select one at random from these. Since there are (n to k) subsets of size k, the time and space complexity are huge. The key to efficiently building a random subset of size exactly k is to first build one of size k - 1 and then adding one more element, selected randomly from the rest. The problem is trivial when k = 1. We make one call to the random number generator, take the returned value mod n (call it r), and swap A[0] with A[r]. The entry A[0] now holds the result.

For k > 1, we begin by choosing one element at random as above and we now repeat the same process with n - 1 element sub-array A[1, n -1]. Eventually, the random subset occupies the slots A[0, k - 1] and the remaining elements are in the last n - k slots.

Intuitively, if all subsets of size k are equally likely, then the construction process ensures that the subset of size k + 1 are also equally likely. A formal proof for this uses mathematical induction - the induction hypothesis is that every permutation of every size k subset of A is equally likely to be in A[0, k -1].

As a concrete example, let the input be A = <3, 7, 5, 11> and the size be 3. In the first iteration, we use the random number generator to pick a random integer in the interval [0,3]. Let the returned random number be 2. We swap A[0] with A[2] - now the array is <5, 7, 3, 11>. Now we pick a random integer in the interval [1, 3]. Let the returned random number be 3. We swap A[1] with A[3] - now the resulting array is <5, 11, 3, 7>. Now we pick a random integer in the interval [2,3]. Let the returned random number be 2. When we swap A[2] with itself the resulting array is unchanged. The random subset consists of he first three entries, ie., {5, 11, 3}.

Sorry for the long text; my questions are this

  1. What is the key to efficiency they are referring to? Its not clicking in my head
  2. What did they mean by "eventually, the random subset occupies the slots A[0, k-1] and the remaining elements are in the last n - k slots"
  3. is there a clear reason why "every permutation of every size k subset of A is equally likely to be in A[0, k - 1]"?
  4. Can you explain the theory behind the algorithm in clearer terms?
  5. What is the return of the algorithm supposed to be?

thanks




Aucun commentaire:

Enregistrer un commentaire