mercredi 21 août 2019

C++ random non-repeated integers with weights

I want to efficiently generate a random sample of unique (non-repeated) integers in a (closed) range [0, rnd_max], with each number in the range being possible to choose, and each being associated with a sample weight (the more weight, the more likely it should be that the number is chosen, with probability exactly weight[i] / sum(weight[]) to be chosen next if it's not already taken in the sample).

I see C++ has std::discrete_distribution which can generate random weighted integers, but if I use it to generate random integers and discard repeated ones, when the sample to take is large relative to the length of the possible range, there will be a lot of failed samples which are already taken, resulting in a highly inefficient procedure. It's not clear to me if Floyd's algorithm has some extension to the case with sample weights (https://math.stackexchange.com/questions/178690/whats-the-proof-of-correctness-for-robert-floyds-algorithm-for-selecting-a-sin) - I personally cannot think of one.

It's also possible to e.g. use std::discrete_distribution dropping the weight to zero, or performing a partial weighted shuffle like in this answer: C++. Weighted std::shuffle - but in that answer, std::discrete_distribution is re-generated at each iteration and thus the running time becomes quadratic (it needs to cycle through the weights that are passed to it every time).

In wondering what could be an efficient weighted random sample for unique integers in C++, that would work well for varying sample sizes (e.g. from 1% to 90% of sampled numbers in the available range).

#include <vector>
#include <random>
#include <algorithm>

int main()
{
    size_t rnd_max = 1e5;
    size_t ntake = 1e3;

    unsigned int seed = 12345;
    std::mt19937 rng(seed);
    std::gamma_distribution<double> rgamma(1.0, 1.0);
    std::vector<double> weights(rnd_max);
    for (double &w : weights) w = rgamma(rng);

    std::vector<int> chosen_sample(ntake);
    // sampler goes here...

    return 0;
}




Aucun commentaire:

Enregistrer un commentaire