vendredi 24 avril 2020

What are the exact semantics of RandomGen's split function supposed to be?

It seems that in Haskell, the behavior of split can be strongly dependent on the chosen (pseudo)random number generator (PRNG).

I was drawn to split by looking at various ways within the API to generate random numbers.

Say we take the comparatively recent Threefish generator.

We will need:

 λ> import System.Random
 λ> import System.Random.TF
 λ> import Control.Monad.Random

Next, we generate sequences of 20 two-digits numbers in 3 different ways:

 λ> tg0 = mkTFGen 42
 λ> 
 λ> evalRand (sequence (replicate 20 $ getRandomR (10,99))) tg0
[62,99,78,18,38,21,54,25,54,94,62,57,55,69,98,78,99,73,59,24]
 λ> 
 λ> take 20 evalRand (sequence (repeat getRandomR (10,99))) tg0
[62,99,78,18,38,21,54,25,54,94,62,57,55,69,98,78,99,73,59,24]
 λ> 
 λ> take 20 $ evalRand (getRandomRs (10,99)) tg0
[62,99,78,18,38,21,54,25,54,94,62,57,55,69,98,78,99,73,59,24]
 λ> 

We get 3 times the same sequence, regardless of whether we're take'ing a prefix of an unlimited sequence or not.

But this is not so for the “standard” StdGen generator. Here:

 λ> 
 λ> sg0 = mkStdGen 42
 λ> 
 λ> evalRand (sequence (replicate 20 $ getRandomR (10,99))) sg0
[69,79,35,32,84,89,76,39,16,73,73,37,34,96,12,23,45,59,31,34]
 λ> 
 λ> take 20 evalRand (sequence (repeat getRandomR (10,99))) sg0
[69,79,35,32,84,89,76,39,16,73,73,37,34,96,12,23,45,59,31,34]
 λ> 
 λ> take 20 $ evalRand (getRandomRs (10,99)) sg0
[33,66,18,33,73,38,73,35,59,82,42,44,48,20,58,21,89,14,43,14]
 λ> 

So here, the third sequence differs from the other ones. It turns out that this is because getRandomRs calls split on the current generator.

This can be checked directly in this way:

 λ> 
 λ> (sg0a, sg0b) = split sg0
 λ> (tg0a, tg0b) = split tg0
 λ> 
 λ>: {
|λ> let { getRandomDouble::RandomGen g => g -> Double ;
|λ> getRandomDouble g = fst $ random g }
|λ>: }
 λ> 
 λ> λ> 
 λ> getRandomDouble tg0
0.6499718678721916
 λ> getRandomDouble tg0a
0.6499718678721916
 λ> 
 λ> getRandomDouble tg0b
0.7724686371301966
 λ> 
 λ> getRandomDouble sg0
1.0663729393723398e-2
 λ> getRandomDouble sg0a
0.36531519389010025
 λ> 
 λ> getRandomDouble sg0b
0.7740913257381021
 λ> 

So, in the case of the Threefish generator, the left component of the split (used by getRandomRs) generates the same sequence as the original. For the StdGen generator, it generates a different sequence, leading to the surprising result noted in the beginning.

The source code for the stdSplit function is here. It includes a comment:

    -- no statistical foundation for this!

Question: is the behavior of stdSplit legal (in conformity with the language standard) and/or statistically legitimate? I understand there might be a necessity not to break existing client code, but are there guidelines about this to be applied to newer PRNGs?

__Note: __ It is rather unusual for a PRNG API to provide a split function. It is more common to provide an advance function, like here in Python/NumPy for example, and extra generators can be created as required using that advance function. For some PRNGs, the advance function has a cost of only O(log(numSteps)).




Aucun commentaire:

Enregistrer un commentaire