Sunday, July 26, 2020

Sampling without replacement with unequal, dependent probabilities

From the answers to the question Randomly Generating Combinations From Variable Weights, I've managed to sample without replacement given unequal probabilities. I've implemented this in my particular context, and it works well.

Now I have access to additional information about my distribution. Specifically, I've been using a neural network to generate these probabilities, and I've now trained it to output a probability for each pair of distinct objects, in addition to the probabilities for individual objects.

So if the population is [A, B, C, D, E, F] and I'll be sampling groups of size 3, then I have access to the probability that the resulting sample will contain both A and B, or the probability that the resulting sample will contain both C and E, etc. I also still have access to the probabilities that the sample contains any of the individual elements.

The question above has made me realize that sampling is a much harder problem than I had initially expected it to be. Still, I'm wondering if there's any way this information about the probability of these pairs could be used to make my sample better match my actual target distribution.

Using the method from the first answer to that question, where gradient descent is used to find weights that yield the desired per-element inclusion probabilities, I supplied the following probabilities (one for each element of the population above):

[0.349003, 0.56702, 0.384279, 0.627456, 0.585684, 0.486558]

A sample of 3,000,000 groups then follows the distribution below, where entry (i, j) is the observed probability that a sampled group contains both element i and element j, and the diagonal holds the single-element inclusion probabilities:

   A          B          C          D          E          F
A [0.34860067 0.15068367 0.09183067 0.17385267 0.15775267 0.12308167]
B [0.15068367 0.567596   0.168863   0.309292   0.28285467 0.22349867]
C [0.09183067 0.168863   0.384101   0.19396467 0.175895   0.13764867]
D [0.17385267 0.309292   0.19396467 0.62724833 0.32153967 0.25584767]
E [0.15775267 0.28285467 0.175895   0.32153967 0.58571833 0.23339467]
F [0.12308167 0.22349867 0.13764867 0.25584767 0.23339467 0.48673567]
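For reference, a matrix like the one above can be estimated from any batch of sampled groups with a short helper. A minimal sketch (the function name is mine):

```python
import itertools

import numpy as np

population = ['A', 'B', 'C', 'D', 'E', 'F']

def cooccurrence_matrix(samples, population=population):
    """Estimate, from a list of sampled groups, the matrix whose entry
    (i, j) is P(sample contains both element i and element j); the
    diagonal holds the single-element inclusion probabilities."""
    index = {x: i for i, x in enumerate(population)}
    n = len(population)
    counts = np.zeros((n, n))
    for group in samples:
        # combinations_with_replacement yields (x, x) pairs too,
        # which populate the diagonal.
        for a, b in itertools.combinations_with_replacement(group, 2):
            i, j = index[a], index[b]
            counts[i, j] += 1.0
            if i != j:
                counts[j, i] += 1.0
    return counts / len(samples)
```

This is the empirical estimator the 3,000,000-group figures above correspond to; it converges to the true co-occurrence probabilities as the number of samples grows.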

I want to, for example, be able to provide my sampling algorithm with the following matrix:

   A        B        C        D        E        F
A [0.349003 0.148131 0.15166  0.098827 0.151485 0.147903]
B [0.148131 0.56702  0.140829 0.327032 0.331278 0.18677 ]
C [0.15166  0.140829 0.384279 0.190329 0.09551  0.19023 ]
D [0.098827 0.327032 0.190329 0.627456 0.391803 0.246921]
E [0.151485 0.331278 0.09551  0.391803 0.585684 0.201292]
F [0.147903 0.18677  0.19023  0.246921 0.201292 0.486558]

and have the resulting sample distribution match this supplied matrix more closely than when only the individual probabilities are used, as was done above.

I'm okay with the sampling algorithm not matching the probabilities perfectly, though it would be nice if that were possible. From my own attempts at this problem, I realize things are more difficult because even though P(A and B) is supplied, P(A and B and C) cannot be derived from the information in the matrix alone. So my question is: is there any way to use this additional information (the probabilities for pairs) to produce samples whose distribution models my intended distribution better than using only the probabilities for each individual object?
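One way to exploit the pairwise matrix when the population is small is to extend the first answer's idea from weights per element to a probability per subset: treat the probability of every size-k group as a free parameter and fit all of them by gradient descent so that the implied singleton and pairwise inclusion probabilities match the supplied matrix in least squares. A hedged sketch (the function name is mine; this enumerates all C(n, k) subsets, so it only scales to small populations, and a perfect fit may be impossible if the matrix is not realizable by any joint distribution):

```python
import itertools

import numpy as np

def fit_subset_distribution(target, k=3, steps=20000, lr=0.1):
    """Fit a probability distribution over all size-k subsets of
    range(n) whose singleton (diagonal) and pairwise (off-diagonal)
    inclusion probabilities approximate the n x n matrix `target`
    in least squares."""
    n = target.shape[0]
    subsets = list(itertools.combinations(range(n), k))
    # Incidence matrix: one row per (i, j) with i <= j, marking the
    # subsets that contain both i and j (i == j gives the rows for
    # the single-element inclusion probabilities).
    rows, t = [], []
    for i in range(n):
        for j in range(i, n):
            rows.append([float(i in S and j in S) for S in subsets])
            t.append(target[i, j])
    A, t = np.array(rows), np.array(t)
    theta = np.zeros(len(subsets))          # softmax logits, one per subset
    for _ in range(steps):
        e = np.exp(theta - theta.max())
        p = e / e.sum()                     # softmax keeps p a distribution
        g = 2.0 * A.T @ (A @ p - t)         # gradient of the loss w.r.t. p
        theta -= lr * p * (g - p @ g)       # chain rule through the softmax
    e = np.exp(theta - theta.max())
    return subsets, e / e.sum()
```

Once fitted, sampling a group is just a categorical draw over the enumerated subsets, e.g. `subsets[np.random.default_rng().choice(len(subsets), p=p)]`. Because every subset has its own parameter, this can match any realizable pairwise matrix exactly, at the cost of exponential growth in C(n, k).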



