Thursday, June 14, 2018

Python random sampling based on a distribution

Before getting to the topic, let's first take a look at Python's default sampling method:

>>> import random
>>> c=[1,2,3,100,101,102,103,104,105,106,109,110,111,112,113,114]
>>> random.sample(c,1)
[103]
>>> random.sample(c,1)
[3]
>>> random.sample(c,1)
[3]
>>> random.sample(c,1)
[2]
>>> random.sample(c,1)
[3]
>>> random.sample(c,1)
[2]
>>> random.sample(c,1)
[106]
>>> random.sample(c,1)
[3]
>>> random.sample(c,1)
[105]
>>> random.sample(c,1)
[110]
>>> random.sample(c,1)
[103]
>>> random.sample(c,1)

From the source code we can easily see what it actually does (the main portion of the linked code is shown below):

selected = set()
selected_add = selected.add
for i in xrange(k):
    j = _int(random() * n)
    # redraw until we hit an index that has not been used yet
    while j in selected:
        j = _int(random() * n)
    selected_add(j)
    result[i] = population[j]

This method picks an index uniformly at random, so every member of the population is equally likely to be selected. That means a member far from the bulk of the data, such as 1 in the list above, can be selected just as easily as any other.
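To make the uniform behaviour concrete, here is a minimal sketch that repeats the single-element draw many times and counts the results. The helper name draw_counts, the number of trials and the seed are illustrative choices, not something from the library.

```python
import random
from collections import Counter

c = [1, 2, 3, 100, 101, 102, 103, 104, 105, 106, 109, 110, 111, 112, 113, 114]

def draw_counts(population, trials, seed=0):
    """Count how often each member is returned by sample(population, 1)."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    return Counter(rng.sample(population, 1)[0] for _ in range(trials))

counts = draw_counts(c, 16000)

# With 16 members and 16000 trials, each member should show up
# roughly 1000 times, whether it is 1 or 110.
for value in c:
    print(value, counts[value])
```

The counts come out roughly equal for every member, which is what selecting an index uniformly implies.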

But let's concentrate on a more realistic scenario. Assume you have 16 numbers, each representing the frequency of a label from 0 to 15.

freq_array = [1, 2, 3, 100, 100, 100, 102, 102, 102, 100, 99, 50, 20, 1, 2, 3]

The index of each position is the label. From the list above, the population of label 0 is 1, the population of label 3 is 100, the population of label 2 is 3, and so on.

Now, if you want to select 5 members from the population, can we generate a new list that says how many members to take from each label?

An example (maybe not the answer):

new_array = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

This means we should take 1 member from each of labels 4-8.
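As a rough sketch of how such a list could be produced, the snippet below expands the frequency list into one entry per population member and draws 5 of them without replacement, so each label is chosen in proportion to its frequency. This is plain proportional allocation, not yet Normal-based, and the function name allocate is hypothetical.

```python
import random

freq_array = [1, 2, 3, 100, 100, 100, 102, 102, 102, 100, 99, 50, 20, 1, 2, 3]

def allocate(freq, k, rng=random):
    """Return a list where entry i is how many of the k draws fall on label i."""
    # One entry per population member, tagged with its label.
    members = [label for label, count in enumerate(freq) for _ in range(count)]
    allocation = [0] * len(freq)
    # Sampling without replacement, so no label exceeds its own frequency.
    for label in rng.sample(members, k):
        allocation[label] += 1
    return allocation

new_array = allocate(freq_array, 5)
print(new_array)
```

The result has the same shape as the example list: 16 entries that sum to 5, with most of the draws landing on the high-frequency labels.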

So maybe the question is better asked in the following way:

How can we sample members from a population based on some Normal distribution? (For the time being, let's restrict it to the Normal distribution.)

I searched for functions in both Python's random module and the np.random library but could not find anything useful. Ideas or suggestions are highly appreciated, and code too if possible.
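One possible reading of the question, sketched under assumptions: weight each label by a Normal density evaluated at its index and draw labels with those weights using random.choices (Python 3.6+). The function names and the values of mu and sigma are illustrative guesses, not part of the original problem.

```python
import math
import random

def normal_weights(n_labels, mu, sigma):
    """Unnormalised Normal density evaluated at label indices 0..n_labels-1."""
    return [math.exp(-0.5 * ((label - mu) / sigma) ** 2)
            for label in range(n_labels)]

def allocate_normal(n_labels, k, mu, sigma, rng=random):
    """Draw k labels with Normal-shaped weights; entry i counts draws on label i."""
    weights = normal_weights(n_labels, mu, sigma)
    allocation = [0] * n_labels
    # choices() draws with replacement and normalises the weights itself.
    for label in rng.choices(range(n_labels), weights=weights, k=k):
        allocation[label] += 1
    return allocation

# mu and sigma are assumptions chosen for illustration only.
new_array = allocate_normal(16, k=5, mu=7.0, sigma=2.0)
print(new_array)
```

Because the draws are made with replacement and ignore the per-label frequencies, a label could in principle be asked for more members than it has, so a real implementation would probably need to cap each entry at its frequency.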



