Thursday, June 20, 2019

Fastest precise normal RNG for CUDA

For simulations, I need to generate billions of normally-distributed random numbers in CUDA. The Box-Muller transform is not an option (see here) because with single-precision floats it never produces values beyond ±5.7681074142, so it loses high-impact tail values that occur with non-negligible probability. Consequently cuRAND is not an option either, because it uses the Box-Muller transform (I've looked at the source code as of CUDA 10.0).

Now I am considering the following methods/algorithms:

  1. Marsaglia polar method
  2. Ziggurat algorithm
  3. Inverse transform sampling method
  4. Ratio-of-uniforms

The Marsaglia polar and ziggurat algorithms contain rejection loops, which cause thread divergence within CUDA warps, so it's hard for me to estimate how well they will perform in practice. Furthermore, the ziggurat algorithm can apparently be initialized with different constants, and I don't know how these affect precision and performance. Inverse transform sampling seems to have a branch (typically separate approximations for the central region and the tails) that will usually be taken by at least one thread in a warp, forcing all the other threads to wait. Furthermore, I'm not sure about its precision - please clarify. I haven't found comprehensive information on method #4 yet - please clarify that as well.

The underlying integer PRNG is xorshift+.

Having said that, before I go ahead and benchmark all of the above methods, I would like to ask you to:

  1. Mention other promising methods of generating normally-distributed values that suit CUDA (methods with truncated tails, such as Box-Muller, are not an option).
  2. Give me some tips on implementing the above or other methods in CUDA.


