Thursday, June 30, 2022

Python/numpy - conditional sampling of variables, where the distribution of each subsequent value depends on the result of the previous one

I am trying to generate a random sample of multiple variables which are loosely related to each other, meaning that the "allowed" values of some variables depend on the values set for other variables.

For simplicity, let's imagine that I have just two variables, A and B, and that both of them have a uniform or Gaussian distribution (we don't really care which exact distribution they follow and can accept either). For this discussion, let's assume both are uniform.

Let's say that variable A can take any value between 0 and 100. We can easily sample, say, 1000 data points from this distribution.
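With numpy this step is a one-liner. A minimal sketch, assuming numpy's default_rng Generator API:

    import numpy as np

    rng = np.random.default_rng()                # numpy's recommended generator API
    a = rng.uniform(low=0, high=100, size=1000)  # 1000 draws of A ~ U(0, 100)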

Now, we also want to generate values for variable B, which can take any value between, say, 50 and 150. The catch is that there is a constraint on the resulting sample: the sum of A and B must be between 60 and 160.
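For this two-variable case, one way to sample conditionally (as the title suggests) is to draw A first and then draw B from its per-sample feasible interval, i.e. the intersection of B's own bounds with the joint constraint. A sketch under those assumptions, relying on the fact that Generator.uniform broadcasts array-valued bounds:

    import numpy as np

    rng = np.random.default_rng()
    n = 1000
    a = rng.uniform(0, 100, size=n)          # draw A first

    # Given a, B must satisfy 50 <= b <= 150 AND 60 <= a + b <= 160,
    # i.e. b lies in [max(50, 60 - a), min(150, 160 - a)].
    b_lo = np.maximum(50.0, 60.0 - a)
    b_hi = np.minimum(150.0, 160.0 - a)
    b = rng.uniform(b_lo, b_hi)              # per-element bounds broadcast

One caveat: this accepts every A and squeezes B into whatever slice remains, so the joint distribution is not uniform over the feasible region (narrow slices get over-weighted relative to plain rejection sampling).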

The final catch is that the precise sampling boundaries change each time we run the sampling process (for example, in one run A can be between 0 and 100 as above; the next day it needs to be between -10 and 75, etc.). Basically, the boundaries evolve from day to day.

Right now we do it in a very inefficient way: we generate a completely random grid of A and B values independently, then eliminate all of the A and B combinations which don't satisfy the constraints we specify, and then use the survivors in subsequent steps. For example, such a grid could look like:

[figure: example grid of randomly sampled (A, B) combinations]
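In code, the current procedure amounts to something like the following sketch (a reconstruction of the approach described above, using the two-variable example):

    import numpy as np

    rng = np.random.default_rng()
    n = 1_000_000                            # has to be huge to survive filtering

    # Independent sampling of each variable...
    a = rng.uniform(0, 100, size=n)
    b = rng.uniform(50, 150, size=n)

    # ...then elimination of combinations violating the joint constraint.
    keep = (a + b >= 60) & (a + b <= 160)
    a, b = a[keep], b[keep]                  # surviving count varies run to run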

However, as you can guess, this is super inefficient. In reality we have many variables (30+) and a large set of constraints to apply. Completely random grid generation leads to cases where, after applying all constraints, we end up with no points at all unless the sample size is large enough; to ensure we always have at least some points, we need to generate a grid with millions of points. Beyond that, each re-run of the sampling procedure produces a different-sized result: sometimes all points are eliminated, sometimes we get 10 points, and sometimes 1000.

So my question is: is there a way to do this more efficiently and in a "statistically correct" way, ideally one which lets us specify how many sample points satisfying all constraints we want to end up with? Any guidance or pointers to code examples would be much appreciated.
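For reference, one pattern that guarantees a fixed number of output points while staying exactly on the constrained distribution is to wrap the rejection step in a loop that keeps drawing batches until the target count is reached. A sketch of that idea (the helper name and signature are illustrative, not from the original post):

    import numpy as np

    def sample_constrained(n_target, bounds, constraint, batch=100_000, seed=None):
        # Batched rejection sampling: draw a batch, filter it, and repeat
        # until n_target accepted points have accumulated. bounds is a list
        # of (low, high) pairs, one per variable; constraint is a vectorized
        # boolean test over an (m, n_vars) array.
        rng = np.random.default_rng(seed)
        kept, n_kept = [], 0
        while n_kept < n_target:
            pts = np.column_stack(
                [rng.uniform(lo, hi, size=batch) for lo, hi in bounds]
            )
            pts = pts[constraint(pts)]
            kept.append(pts)
            n_kept += len(pts)
        return np.concatenate(kept)[:n_target]

    # The two-variable example from above:
    pts = sample_constrained(
        1000,
        bounds=[(0, 100), (50, 150)],
        constraint=lambda p: (p[:, 0] + p[:, 1] >= 60)
                             & (p[:, 0] + p[:, 1] <= 160),
    )

With 30+ variables and tight constraints the acceptance rate can still be tiny, so this only fixes the "how many points" part of the problem; it does not make each batch cheaper to draw.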



