mardi 15 décembre 2020

Iterate through a subset of a Cartesian product where all elements are selected (near-)equally

I have a large set of samples which can be described by three parameters (let's call them a, b, and c), for instance in a tuple (a, b, c), which each can have a finite number of values. For example, there are 25 possible values for a (indexed 0..24), 20 possible values for b, and 3 possible values for c. Every combination of a, b, and c is represented in this dataset, so in this example, my dataset has 1500 samples (25 × 20 × 3).

I want to randomly select a subset of n samples from this dataset (without duplicates). However, this random sample must have the property that all possible values for a, b, and c are represented equally (or as close to equally as possible, if the number of selected samples is not divisible by the number of possible values for the parameters).

For example, if I select 100 samples, I want each value of a to be represented 4 times, each value of b to be represented 5 times, and each value of c to be represented 33 times (one value can be represented 34 times to satisfy the total number of samples selected, and it doesn't matter which value this is). I do not care about the exact combinations of (a, b, c), as long as the total number of occurrences of each parameter value is correct.

My current implementation is as follows:

import random

n_a = 25
n_b = 20
n_c = 3

n_desired = 100

# generate random ordering for selections
order_a = random.sample(range(n_a), k=n_a)
order_b = random.sample(range(n_b), k=n_b)
order_c = random.sample(range(n_c), k=n_c)

# select random samples
samples = []
for i in range(n_desired):
    idx_a = order_a[i % n_a]
    idx_b = order_b[i % n_b]
    idx_c = order_c[i % n_c]

    samples.append((idx_a, idx_b, idx_c))

(I know this code can be written a bit differently, e.g. combining all the operations on a, b, and c using list comprehensions or using itertools.cycle instead of i % n indexing, but I find this more readable, also because a, b, and c have meaningful—but irrelevant for this question—names in the original code.)

By generating random orderings of the possible values of a, b, and c and cycling through them, we ensure that the number of occurrences of the parameter values never differ by more than 1 (first, all parameter values are selected once, then twice, then thrice, etc.)

We can verify that this code achieves the desired result (equal representation (±1) of all possible parameter values):

from collections import Counter

count_a = Counter()
count_b = Counter()
count_c = Counter()

count_a.update(sample[0] for sample in samples)
count_b.update(sample[1] for sample in samples)
count_c.update(sample[2] for sample in samples)

print(f'a values are represented between {min(count_a.values())} and {max(count_a.values())} times')
print(f'b values are represented between {min(count_b.values())} and {max(count_b.values())} times')
print(f'c values are represented between {min(count_c.values())} and {max(count_c.values())} times')

This prints the following result:

a values are represented between 4 and 4 times
b values are represented between 5 and 5 times
c values are represented between 33 and 34 times

However, an issue with this implementation is that it only works if n_desired ≤ lcm(n_a, n_b, n_c), where lcm() is the least common multiple (the smallest positive integer divisible by n_a, n_b, and n_c). In our example, lcm(n_a, n_b, n_c) = lcm(25, 20, 3) = 300. If we run the above implementation with n_desired > 300, we will see that the selected samples repeat with a period of 300. This is undesired, as this ignores 80% of the original dataset and doesn't allow us to select more than 300 unique samples.

A simple solution would be to make sure that lcm(n_a, n_b, n_c) = n_a × n_b × n_c, which would be the case if all three were prime. However, I would like this algorithm to work for any values, partly because I cannot ensure that all values are prime (for example, in my application, n_a is always the result of an integer squared).

Simply generating a list using itertools.product(range(n_a), range(n_b), range(n_c)) provides me with all possible combinations, but these are ordered sequentially, and by shuffling this full list and selecting the first n_desired samples, we lose the property of equal representation of all possible parameter values.

This is the point where I got stuck, since I am not knowledgeable enough in combinatorics to solve this issue nor do I know which terms I need to search for to find a solution. How would I solve this problem?




Aucun commentaire:

Enregistrer un commentaire