samedi 21 mai 2016

Minimize the number of cuRAND states stored during a MC simulation

I am currently writing a Monte-Carlo simulation in CUDA. As such, I need to generate lots of random numbers on the fly using the cuRAND library. Each thread processes one element in a huge float array (omitted in the example) and generates 1 or 2 random numbers per kernel invocation.

The usual approach (see example below) seems to be to allocate one state per thread. The states are reused in subsequent kernel invocations. However, this does not scale well when the number of threads increases (up to 10⁸ in my application), since it becomes the dominant memory (and bandwidth) usage in the program.

I know that one possible approach would be to process multiple array elements per thread in a grid stride loop fashion. Here I want to investigate a different method.

I am aware that for the compute capability I am targeting (3.5), the maximum number of resident threads per SM is 2048, i.e. 2 blocks in the example below.
Is it possible to use only 2048 states per multiprocessor, regardless of the total number of threads ? All random numbers generated should still be statistically independent.

I think this might be done if each resident thread were associated a unique index modulo 2048, which could then be used to fetch the state from an array. Does such an index exist ?

More generally, are there other ways to reduce the memory footprint of the RNG states ?

Example code (one state per thread)

#include <cuda.h>
#include <curand.h>
#include <curand_kernel.h>
#include <assert.h>

#define gridSize 100
#define blockSize 1024
#define iter 100

__global__ void rng_init(unsigned long long seed, curandState * states) {
  const size_t Idx = blockIdx.x * blockDim.x + threadIdx.x;
  curand_init(seed, Idx, 0, &states[Idx]);
}

__global__ void kernel(curandState * states) {
  const size_t Idx = blockIdx.x * blockDim.x + threadIdx.x;
  const float x = curand_uniform(&states[Idx]);
  // Do stuff with x ...
}

int main() {
  curandState * states;
  assert(cudaMalloc(&states, gridSize*blockSize*sizeof(curandState)) == cudaSuccess);
  rng_init<<<gridSize,blockSize>>>(clock(), states);
  assert(cudaDeviceSynchronize() == cudaSuccess);

  for (size_t i = 0 ; i < iter ; ++i) {
    kernel<<<gridSize,blockSize>>>(states);
    assert(cudaDeviceSynchronize() == cudaSuccess);
  }
  return 0;
}




Aucun commentaire:

Enregistrer un commentaire