mercredi 26 février 2020

Elements of Programming Interview: Sample Online Data Algorithm

I am going through a book called "Elements of Programming Interview" and have gotten stuck at the following problem:

Implement an algorithm that takes as input an array of distinct elements and a size, and returns a subset of the given size of the array elements. All subsets should be equally likely. Return the result in input array itself.

The solution they provide below is:

import random
def random_sampling(k, A):
    for i in range(k):
        # Generate a random index in [i, len(A) - 1].
        r = random.randint(i, len(A) - 1)
        A[i], A[r] = A[r], A[i]


A = [3, 7, 5, 11]
k = 3

print(random_sampling(k, A))

I so not understand what the authors are trying to do intuitively. Their explanation is below

Another approach is to enumerate all subsets of size k and then select one at random from these. Since there are (n to k) subsets of size k, the time and space complexity are huge. The key to efficiently building a random subset of size exactly k is to first build one of size k - 1 and then adding one more element, selected randomly from the rest. The problem is trivial when k = 1. We make one call to the random number generator, take the returned value mod n (call it r), and swap A[0] with A[r]. The entry A[0] now holds the result.

For k > 1, we begin by choosing one element at random as above and we now repeat the same process with n - 1 element sub-array A[1, n -1]. Eventually, the random subset occupies the slots A[0, k - 1] and the remaining elements are in the last n - k slots.

Intuitively, if all subsets of size k are equally likely, then the construction process ensures that the subset of size k + 1 are also equally likely. A formal proof for this uses mathematical induction - the induction hypothesis is that every permutation of every size k subset of A is equally likely to be in A[0, k -1].

As a concrete example, let the input be A = <3, 7, 5, 11> and the size be 3. In the first iteration, we use the random number generator to pick a random integer in the interval [0,3]. Let the returned random number be 2. We swap A[0] with A[2] - now the array is <5, 7, 3, 11>. Now we pick a random integer in the interval [1, 3]. Let the returned random number be 3. We swap A[1] with A[3] - now the resulting array is <5, 11, 3, 7>. Now we pick a random integer in the interval [2,3]. Let the returned random number be 2. When we swap A[2] with itself the resulting array is unchanged. The random subset consists of he first three entries, ie., {5, 11, 3}.

Can someone translate the authors notes to something an amateur can understand?




Aucun commentaire:

Enregistrer un commentaire