Wednesday, January 2, 2019

How to produce a budget-constrained weighted random sample, where items have varying probabilities and weights?

Suppose I want to select two records from a set of three, where the probabilities of the three are 0.1, 0.5, and 0.4, respectively. Per this SO answer, numpy.random.choice will work:

import pandas as pd
from numpy import random

df = pd.DataFrame({
    'id': [1, 2, 3],
    'prob': [0.1, 0.5, 0.4]
})

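random.seed(0)  # fix the seed so the draw is reproducible
# Draw two distinct ids; p gives each id's selection probability.
random.choice(df.id, p=df.prob, size=2, replace=False)
# array([2, 3])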

Now suppose each item also has a weight, and rather than selecting a fixed number of items, I want to select items up to a maximum total weight. So if these items have weights of 4, 5, and 6, and I have a budget of 10, I could select {1, 2}, {1, 3}, or {3}. The relative chance of each item being included would still be governed by the probabilities (though in practice I think an algorithm would return item 1 more often, because its low weight lets it serve as a filler).

Is there a way to adapt random.choice for this, or another approach to yield this result?
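
For concreteness, here is a minimal sketch of one possible interpretation, not a definitive answer: draw a weighted random ordering of all items with numpy's Generator.choice, then walk through that ordering and keep each item that still fits in the remaining budget. The helper name budgeted_sample, the weight column, and the skip_oversized flag are assumptions made for illustration. The sketch also assumes every probability is nonzero, since a full ordering without replacement cannot be drawn otherwise.

```python
import numpy as np
import pandas as pd


def budgeted_sample(df, budget, rng, skip_oversized=True):
    """Hypothetical helper: weighted random order, then greedy fill up to budget."""
    # Weighted permutation of row positions: rows with a larger prob
    # tend to appear earlier in the ordering.
    order = rng.choice(len(df), size=len(df), replace=False,
                       p=df["prob"].to_numpy())
    chosen, remaining = [], budget
    for pos in order:
        w = df["weight"].iloc[pos]
        if w <= remaining:
            # The item fits: take it and shrink the remaining budget.
            chosen.append(df["id"].iloc[pos])
            remaining -= w
        elif not skip_oversized:
            # Alternative rule: stop at the first item that does not fit.
            break
    return chosen


df = pd.DataFrame({
    "id": [1, 2, 3],
    "prob": [0.1, 0.5, 0.4],
    "weight": [4, 5, 6],
})

rng = np.random.default_rng(0)
# Repeat the draw many times to see how often each subset comes up.
samples = [tuple(sorted(budgeted_sample(df, budget=10, rng=rng)))
           for _ in range(1000)]
counts = pd.Series(samples).value_counts(normalize=True)
print(counts)
```

Under the skip rule, the only subsets this sketch can return for the example data are {1, 2} and {1, 3}: whichever of items 2 and 3 comes first is taken, the other no longer fits, and item 1 always fits in what is left. That matches the filler intuition, since item 1 lands in every sample despite its probability of 0.1. Passing skip_oversized=False stops at the first item that does not fit, which makes single-item results such as {3} possible. Whether either rule gives the inclusion frequencies you want is a separate question.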



