I have a dataset and would like to draw a sample from it based on probabilities that I set manually.
Example: id = user, score (sorted in descending order), b1-b6 (dummy variables): 1 means the user has the feature, 0 otherwise.
id score b1 b2 b3 b4 b5 b6
1 0.99 1 0 0 0 1 0
2 0.98 1 0 0 0 0 0
3 0.97 1 1 1 0 1 1
4 0.96 0 1 0 0 0 0
A parameter set (p1,p2,p3,p4,p5,p6) is given that controls the proportion of users having the corresponding feature in columns (b1,b2,b3,b4,b5,b6) respectively.
Let's say I set p1 = 0.1, p2 = 0.2, p3 = 0.9, p4 = 0.32, p5 = 0.2, p6 = 0.21. The sample drawn from the dataset is expected to approximately follow the p1-p6 values (about 10% of users have 1 in b1, 20% of users have 1 in b2, and so on).
The problem is that the original dataset has its own distributions across b1 to b6. How can I get a sample from it whose distributions follow the p1-p6 values?
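To make the gap concrete, here is a minimal sketch in Python that computes the share of users with each feature in the four example rows and compares it with the targets. The names `rows`, `targets` and `feature_proportions` are only illustrative, and the rows are assumed to be held as plain tuples:

```python
# Example rows from the table above: (id, score, b1..b6).
rows = [
    (1, 0.99, 1, 0, 0, 0, 1, 0),
    (2, 0.98, 1, 0, 0, 0, 0, 0),
    (3, 0.97, 1, 1, 1, 0, 1, 1),
    (4, 0.96, 0, 1, 0, 0, 0, 0),
]
# Target proportions p1..p6 from the question.
targets = [0.1, 0.2, 0.9, 0.32, 0.2, 0.21]


def feature_proportions(rows):
    """Return the fraction of rows with a 1 in each of b1..b6."""
    n = len(rows)
    return [sum(row[2 + j] for row in rows) / n for j in range(6)]


observed = feature_proportions(rows)
for j, (obs, tgt) in enumerate(zip(observed, targets), start=1):
    print(f"b{j}: observed {obs:.2f}, target {tgt:.2f}, gap {obs - tgt:+.2f}")
```

In these four rows, for instance, 75% of users have b1 while the target is 10%, so a plain random draw would not land near the targets.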
Any thoughts would be appreciated
UPDATES: The goal is to draw a sample from a large dataset (a 1k sample from 1000k) that follows the distributions (p1, p2, etc.), instead of simulating phony data.
Approach 1: It may be solved by drawing random samples repeatedly and keeping the closest one (needs resampling or iteration tricks).
Approach 2: Use a linear optimisation algorithm (may be complicated, as there are 2^6 possible feature combinations and a large system of equations to solve).
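A rough sketch of Approach 1 in Python, using only the standard library. The function names, the squared-gap distance, and the synthetic `population` are assumptions made for illustration; the real data would replace the synthetic rows, and the sizes are kept small so it runs quickly:

```python
import random


def feature_proportions(rows):
    """Fraction of rows with a 1 in each feature column (rows are 0/1 tuples)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def distance(rows, targets):
    """Sum of squared gaps between observed and target proportions."""
    return sum((o - t) ** 2 for o, t in zip(feature_proportions(rows), targets))


def best_of_random_samples(population, targets, k, n_tries, seed=0):
    """Draw n_tries simple random samples of size k and keep the closest one."""
    rng = random.Random(seed)
    best, best_dist = None, float("inf")
    for _ in range(n_tries):
        candidate = rng.sample(population, k)
        d = distance(candidate, targets)
        if d < best_dist:
            best, best_dist = candidate, d
    return best, best_dist


# Synthetic stand-in for the real dataset (assumption for the demo only):
# 10,000 rows of six 0/1 features, each present with probability 0.5.
rng = random.Random(42)
population = [tuple(rng.randint(0, 1) for _ in range(6)) for _ in range(10_000)]
targets = [0.1, 0.2, 0.9, 0.32, 0.2, 0.21]

sample, dist = best_of_random_samples(population, targets, k=100, n_tries=200)
print([round(p, 2) for p in feature_proportions(sample)], round(dist, 4))
```

Each simple random sample tends to mirror the proportions of the dataset it is drawn from, so this is likely to get close only when the targets are not far from the original proportions; otherwise the draws would need to be weighted in some way.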