vendredi 16 février 2018

Coin tosses, arithmetic of random variables, and PyMC3

I find myself wanting to perform arithmetic of random variables in Python; for the sake of example, let us consider the experiment of repeatedly tossing two independent fair coins and counting the number of heads.

Sampling from each random variable independently is straightforward with scipy.stats, and we can start getting results right away

In [5]: scipy.stats.bernoulli(0.5).rvs(10) + scipy.stats.bernoulli(0.5).rvs(10)
Out[5]: array([1, 0, 0, 0, 1, 1, 1, 2, 1, 2])

Now, a pessimist would remark that we wouldn't even have to go that far and could instead just do np.random.randint(2, size=10) + np.random.randint(2, size=10), and a cynic would notice that we could just calculate the sum and never have to sample anything.

And they'd be right. So, say that we have many more variables and more complex operations to perform on them, and graphical models quickly becomes useful. That is, we'd want to operate on the random variables themselves and only start sampling when our graph of computation is set up. In lea, which does exactly that, the example above becomes

In [1]: from lea import Lea

In [7]: (Lea.bernoulli(0.5) + Lea.bernoulli(0.5)).random(10)
Out[7]: (0, 2, 0, 2, 0, 2, 1, 1, 1, 2)

Works like a charm. Enter PyMC3, one of the more popular libraries for probabilistic programming. Now, PyMC3 is intended for usage with MCMC and Bayesian modeling in particular, but it has the building blocks we need for our experiment above. Alas,

In [1]: import pymc3 as pm

In [2]: with pm.Model() as model:
   ...:     x = pm.Bernoulli('x', 0.5)
   ...:     y = pm.Bernoulli('y', 0.5)
   ...:     z = pm.Deterministic('z', x+y)
   ...:     trace = pm.sample(10)
   ...:
Assigned BinaryGibbsMetropolis to x
Assigned BinaryGibbsMetropolis to y
100%|███████████████████████████████████████| 510/510 [00:02<00:00, 254.22it/s]

In [3]: trace['z']
Out[3]: array([2, 0, 2, 0, 2, 0, 2, 0, 2, 0], dtype=int64)

Not exactly random. Unfortunately, I lack the theoretical understanding of why the Gibbs sampler produces this particular result (and really I should probably just hit the books). Using step=pm.Metropolis() instead, we get the correct distribution at the end of the day, even if the individual samples are not pairwise independent (as is to be expected from MCMC).

In [8]: with pm.Model() as model:
   ...:     x = pm.Bernoulli('x', 0.5)
   ...:     y = pm.Bernoulli('y', 0.5)
   ...:     z = pm.Deterministic('z', x+y)
   ...:     trace = pm.sample(10000, step=pm.Metropolis())
   ...:
100%|██████████████████████████████████████████████████████████████████████████████████████████| 10500/10500 [00:02<00:00, 5161.18it/s]

In [14]: collections.Counter(trace['z'])
Out[14]: Counter({0: 2493, 1: 5024, 2: 2483})

So, maybe I could just go ahead and use pm.Metropolis for simulating my post-arithmetic distribution, but I'd be afraid that I was missing something, and so the question finally becomes: Why does the step-less simulation above fail, and are there any pitfalls in using PyMC3 for ordinary, non-MC, MC?




Aucun commentaire:

Enregistrer un commentaire