random: Threadsafe Cython RNG nogil Parallelization

vendredi 4 septembre 2020

Threadsafe Cython RNG nogil Parallelization

I'm attempting to generate random numbers inside a nogil prange loop in Cython by extending numpy.random RNGs. I'm attempting to use this example, and combine it with intuitions from here to create separate bit generators for each thread to avoid locking requirements. I am clearly misunderstanding something, however, as the bitgen_t entity's state and memory location are the same across what I assumed would be different bit generators with independent states. The result is that, e.g., when generating 1000 random numbers in this way, there are something like 40% duplicate numbers (which makes sense, given there is only actually one bit generator being used). When generating large amounts of random numbers, it also segfaults, which I imagine is related.

If there is an easier way to do this, let me know!

Here is current code:

#!/usr/bin/env python3
#cython: language_level=3
import numpy as np
cimport numpy as np
cimport cython
from cpython.pycapsule cimport PyCapsule_IsValid, PyCapsule_GetPointer
from cython.parallel import prange, threadid
from numpy.random import PCG64, SeedSequence, Generator
from numpy.random cimport bitgen_t
from libc.stdlib cimport malloc, free
from libc.stdint cimport uintptr_t


@cython.boundscheck(False)
@cython.wraparound(False)
def uniforms(Py_ssize_t n):
    """
    Create an array of `n` uniformly distributed doubles.
    A 'real' distribution would want to process the values into
    some non-uniform distribution
    """
    cdef Py_ssize_t i, thid
    cdef const char *capsule_name = "BitGenerator"
    cdef double[:] random_values = np.ones(n, dtype='float64') * -1.

    # Create an RNG for each thread so we don't have to lock them.
    cdef Py_ssize_t num_threads = 16
    capsules = [Generator(PCG64(s)).bit_generator.capsule for s in SeedSequence(1234).spawn(num_threads)]

    cdef bitgen_t **rngs_c

    try:
        # Cast capsules into ptr array for use outside of gil.
        rngs_c = <bitgen_t**>malloc(num_threads * sizeof(bitgen_t*))
        for i, capsule in enumerate(capsules):
            if not PyCapsule_IsValid(capsule, capsule_name):
                raise ValueError("Invalid pointer to anon_func_state")
            rngs_c[i] = <bitgen_t *> PyCapsule_GetPointer(capsule, capsule_name)
            print(<uintptr_t>rngs_c[i])

        with nogil:
            for i in prange(n, schedule="static", num_threads=num_threads):
                thid = threadid()
                rng = rngs_c[thid]
                random_values[i] = rng.next_double(rng.state)
    finally:
        free(rngs_c)

    return np.asarray(random_values)

If I run this as is, I get this output (note the print(<uintptr_t>rngs_c[i]) that prints the address of the bitgen_t as retrieved from the different capsules.

>>> x = uniforms(1000)
4931169440
4931169440
4931169440
4931169440
4931169440
4931169440
4931169440
4931169440
4931169440
4931169440
4931169440
4931169440
4931169440
4931169440
4931169440
4931169440
>>> np.unique(x).size
594

Note that compiling this (and observing issues related to parallelization) requires compiling with openmp. Details can be found here.

random

vendredi 4 septembre 2020

Threadsafe Cython RNG nogil Parallelization

Aucun commentaire:

Enregistrer un commentaire