Saturday, October 3, 2020

How to create random samples without replacement from a population in a fast way?

I have a problem where I need to create m samples of size n from a population, without replacement. Additionally, these samples must retain the original order of the population vector. All of this needs to be super fast.

Population = [50, 30, 12, 24, 420, 243, 173, 194, 123, 43, 21, 64, 34...]

For example, 300 samples, each a combination of 3 elements:
[[24, 21, 34], [50, 194, 21], [12, 173, 64], [30, 173, 194].... [12, 243, 34]]

These samples must be independent, and in my case I need to preserve the order of the original population array. There are multiple possible answers, but none of them are fast; they all become bottlenecks in my code. I'm using NumPy for the random number generation.

Some of the most promising approaches are the following:

  1. Use numpy.random.Generator.choice, which almost does the job, but to generate all samples in one call it has to sample with replacement, which produces samples with repeated numbers. This is very fast, but then I would need to get rid of the bad samples in a fast way (see the sketch after this code block).
    gen = np.random.default_rng()
    def random_combination(population, sample, number = 3):
        # population is assumed to be a NumPy array so fancy indexing works
        with_replacement_samples = gen.choice(len(population), size=(sample, number))
        pairs = np.sort(with_replacement_samples)  # sorting keeps the original population order
        for i in pairs:
            # indices are sorted, so any duplicated index is adjacent
            if i[0] == i[1] or i[1] == i[2]:
                continue  # I would need to generate a new sample each time ... the check is expensive
            yield population[i]

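To get rid of the bad samples without the per-row if, one idea is to reject the duplicate rows in a single vectorized pass and only redraw the few that were rejected. This is just a rough sketch (the function name vectorized_without_replacement is mine, and population is assumed to already be a NumPy array):

    import numpy as np

    gen = np.random.default_rng()

    def vectorized_without_replacement(population, sample, number = 3):
        out = np.empty((sample, number), dtype=np.intp)
        filled = 0
        needed = sample
        while needed > 0:
            # draw with replacement, sort each row, and keep only rows with no adjacent duplicates
            cand = np.sort(gen.choice(len(population), size=(needed, number)), axis=1)
            good = cand[np.all(cand[:, 1:] != cand[:, :-1], axis=1)]
            out[filled:filled + len(good)] = good
            filled += len(good)
            needed -= len(good)
        return population[out]  # shape (sample, number), original order preserved in each row

In practice the rejection rate is tiny when number is much smaller than len(population), so the while loop usually runs only once or twice.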
  2. Another way is using this method I found in another answer, but it's very slow (an array-returning variant is sketched after this code block):
def random_combination4(posiciones, sample, number = 3):
    pair = np.argpartition(gen.random((sample, len(posiciones))), number - 1, axis=-1)[:, :number]
    pair = np.sort(pair)
    for i in posiciones[pair]:
        yield i

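For what it's worth, most of the remaining cost in the version above seems to be the Python-level yield loop rather than the argpartition itself, so returning all samples as one array may already help. A minimal sketch, again assuming posiciones is a NumPy array and gen is the generator defined earlier:

def random_combination4_array(posiciones, sample, number = 3):
    # one random key per element per sample; the `number` smallest keys in a row pick that row's sample
    keys = gen.random((sample, len(posiciones)))
    pair = np.argpartition(keys, number - 1, axis=-1)[:, :number]
    pair.sort(axis=1)  # restore the original population order within each sample
    return posiciones[pair]  # shape (sample, number)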
  3. The last interesting way is using this method from another answer, but that solution was written before NumPy solved its performance issues with random number generation (a short usage sketch follows the code):
def random_combination3(population, sample, number = 3, probabilities = None):
    num_elements = len(population)
    if probabilities is None:
        # uniform weights when no probabilities are given
        replicated_probabilities = np.tile(np.full(shape=num_elements, fill_value=1/num_elements), (sample, 1))
    else:
        replicated_probabilities = np.tile(probabilities, (sample, 1))
    # replicate probabilities as many times as `sample`
    # get random shifting numbers & scale them correctly
    random_shifts = gen.random(replicated_probabilities.shape)
    random_shifts /= random_shifts.sum(axis=1)[:, np.newaxis]
    # shift by numbers & find largest (by finding the smallest of the negative)
    shifted_probabilities = random_shifts - replicated_probabilities
    combinations = np.sort(np.argpartition(shifted_probabilities, number, axis=1)[:, :number])
    for i in combinations:
        yield population[i]

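A short usage sketch for the version above, assuming the population list is converted to a NumPy array first (the variable names here are just for illustration):

population = np.array(Population)
samples = list(random_combination3(population, sample=300, number=3))
# each element of samples is an array of 3 values, in the original population order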

  4. The last way would be to use a plain for loop, but that's really expensive (a fully vectorized alternative is sketched after this code block).

def random_combination2(population, sample, number = 3):
    for i in range(sample):
        pair = np.sort(gen.choice(len(population), size=number, replace=False))
        yield population[pair[0]], population[pair[1]], population[pair[2]]


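One more option I am considering, in case someone can confirm it is reasonable: newer NumPy releases (1.20+) add Generator.permuted, which shuffles each row of a matrix independently, so the whole job can be done without any Python loop. This is only a sketch, and it tiles one index row per sample, so memory grows as sample * len(population):

def random_combination_permuted(population, sample, number = 3):
    # population is assumed to be a NumPy array
    idx = np.tile(np.arange(len(population)), (sample, 1))
    idx = gen.permuted(idx, axis=1)[:, :number]  # each row becomes an independent permutation; keep its first `number` indices
    idx.sort(axis=1)  # keep the original population order inside each sample
    return population[idx]  # shape (sample, number)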
