Thursday, 25 January 2018

Sampling data based on posterior joint probabilities

I have a dataset and would like to draw a sample from it based on probabilities that I set manually.

Example: id is the user, score is sorted in descending order, and b1-b6 are dummy variables, where 1 means the user has the feature and 0 means they do not.

id  score  b1  b2  b3  b4  b5  b6
1   0.99   1   0   0   0   1   0
2   0.98   1   0   0   0   0   0
3   0.97   1   1   1   0   1   1
4   0.96   0   1   0   0   0   0

A parameter set (p1, p2, p3, p4, p5, p6) is given, which controls the proportion of users that have the feature in columns (b1, b2, b3, b4, b5, b6) respectively.

Let's say I set p1 = 0.1, p2 = 0.2, p3 = 0.9, p4 = 0.32, p5 = 0.2, p6 = 0.21. I then expect to draw a sample from the dataset whose distribution approximately follows the p1-p6 values (about 10% of users have 1 in b1, 20% of users have 1 in b2, and so on).

The problem is that the original dataset has its own distribution across b1 to b6. How can I get a sample from it whose distribution follows the p1-p6 values?
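To make the gap concrete, here is a minimal sketch that measures the share of 1s in each column and compares it with the targets. It uses only the four example rows above; the helper name `feature_proportions` is mine, not part of any library.

```python
def feature_proportions(rows):
    """Return the share of rows holding a 1 in each dummy column."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

# The b1-b6 values of the four example rows (ids 1 to 4).
rows = [
    (1, 0, 0, 0, 1, 0),
    (1, 0, 0, 0, 0, 0),
    (1, 1, 1, 0, 1, 1),
    (0, 1, 0, 0, 0, 0),
]
target = [0.1, 0.2, 0.9, 0.32, 0.2, 0.21]  # p1-p6 from the question

observed = feature_proportions(rows)
# Positive gap: the feature is over-represented relative to its target.
gap = [round(o - t, 2) for o, t in zip(observed, target)]
print(observed)
print(gap)
```

In this four-row excerpt b4 is 0 everywhere, so no subsample of these rows alone could reach p4 = 0.32; the full dataset would need enough rows with b4 = 1 for the target to be attainable.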

Any thoughts would be appreciated

UPDATES: The goal is to draw a sample from a large dataset (a 1k sample from 1000k rows) that follows the distributions (p1, p2, etc.), rather than to simulate phony data.

Approach 1: It may be solved by drawing random samples repeatedly and keeping the one closest to the targets (this needs resampling or iteration tricks).
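A rough sketch of Approach 1, assuming squared distance between observed and target proportions as the measure of "closest" (the question does not fix one). The function names and the randomly generated stand-in rows are illustrative only; the real b1-b6 rows would take their place.

```python
import random

def feature_proportions(rows):
    """Share of rows holding a 1 in each dummy column."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def distance(props, target):
    """Squared distance between observed and target proportions (an assumption)."""
    return sum((p - t) ** 2 for p, t in zip(props, target))

def best_of_n_samples(rows, target, sample_size, n_tries, seed=0):
    """Draw n_tries simple random samples and keep the closest one."""
    rng = random.Random(seed)
    best, best_dist = None, float("inf")
    for _ in range(n_tries):
        candidate = rng.sample(rows, sample_size)  # without replacement
        d = distance(feature_proportions(candidate), target)
        if d < best_dist:
            best, best_dist = candidate, d
    return best, best_dist

# Stand-in data only, so the sketch runs; replace with the real b1-b6 rows.
data_rng = random.Random(42)
rows = [tuple(data_rng.randint(0, 1) for _ in range(6)) for _ in range(5000)]
target = [0.1, 0.2, 0.9, 0.32, 0.2, 0.21]

sample, dist = best_of_n_samples(rows, target, sample_size=100, n_tries=200)
print(feature_proportions(sample), dist)
```

Simple random samples tend to reproduce the proportions of the original dataset, so when the targets are far from those proportions the best of many draws may still be a poor match. That is likely where the resampling or iteration tricks would come in.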

Approach 2: Use a linear optimisation algorithm (this may be complicated, as there are 2^6 possible feature combinations and a large system of equations to solve).
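The sketch below is not the linear program itself. It swaps in a simpler iterative scheme, raking (iterative proportional fitting), which works on the same 2^6 patterns and targets the same per-column constraints by reweighting patterns until the weighted share of each feature matches its p value. It assumes every target lies strictly between 0 and 1 and that the targets are reachable from the patterns present in the data. The names `rake_pattern_weights` and `weighted_sample_indices` are hypothetical, and the rows are stand-in data.

```python
import itertools
import random
from collections import Counter

def rake_pattern_weights(rows, target, n_iter=200):
    """Reweight observed 0/1 patterns so weighted feature shares approach target."""
    counts = Counter(rows)                       # at most 2**6 distinct patterns
    total = sum(counts.values())
    weights = {pat: c / total for pat, c in counts.items()}
    for _ in range(n_iter):
        for j, p in enumerate(target):
            ones = [pat for pat in weights if pat[j] == 1]
            if len(ones) in (0, len(weights)):
                continue                         # feature is constant; cannot adjust it
            share = sum(weights[pat] for pat in ones)
            for pat in weights:
                # Scale the "1" group to total p and the "0" group to total 1 - p.
                factor = p / share if pat[j] == 1 else (1 - p) / (1 - share)
                weights[pat] *= factor
    return weights

def weighted_sample_indices(rows, weights, sample_size, seed=0):
    """Pick row indices with probability proportional to their pattern weight.

    Uses random.choices, which samples WITH replacement, so duplicates are possible.
    """
    counts = Counter(rows)
    row_weights = [weights[r] / counts[r] for r in rows]
    rng = random.Random(seed)
    return rng.choices(range(len(rows)), weights=row_weights, k=sample_size)

# Stand-in data only: every 0/1 pattern once. Replace with the real b1-b6 rows.
rows = list(itertools.product((0, 1), repeat=6))
target = [0.1, 0.2, 0.9, 0.32, 0.2, 0.21]

weights = rake_pattern_weights(rows, target)
idx = weighted_sample_indices(rows, weights, sample_size=1000)
picked = [rows[i] for i in idx]
print([sum(col) / len(picked) for col in zip(*picked)])
```

Because the weights live on patterns rather than rows, the size of the problem is bounded by 64 regardless of how many rows the dataset has. Drawing without replacement from a 1000k-row dataset would need a different sampling step than `random.choices`.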



