Wednesday, 20 October 2021

CUDA curand state pointers inside C++ class

I'm currently trying to create a C++ class to use with curand. I would like to have this class hold a pointer to the curand states on the device, which I can then pass to a kernel for it to use. For example:

__global__ void setup_kernel(curandState *state, int localsites);

class cudaRNG {
  
    public:

    curandState *devStates;
    int nthreads;
    int blocksize;
    int gridsize;
    int localsites;

    cudaRNG(){
   
        localsites = 32*32*32;
        nthreads = 64;
        gridsize = (localsites + nthreads - 1) / nthreads;

        cudaMalloc((void **)&devStates,localsites*sizeof(curandState));
        setup_kernel<<<gridsize,nthreads>>>(devStates, localsites);
    }

    ~cudaRNG(){
         cudaFree(devStates);
    }
};

__global__ void setup_kernel(curandState *state, int localsites){
  
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    if (id < localsites) curand_init(0,0,0,&state[id]);
}

I understand that __global__ kernels cannot be member functions of a class, which is why I defined the setup kernel outside the class.

The code compiles fine, but when I run it and instantiate an object of this cudaRNG class, I get the following error: Caught signal 11 (Segmentation fault: address not mapped to object at address (nil)). This happens in the curand_init step. From the error, I can only surmise that a host pointer is being passed to the kernel, which runs on the device and therefore cannot find the requested address. What confuses me is that the exact same sequence (declaring the curandState pointer, allocating device memory, and calling setup_kernel) works normally when done outside the class. For instance, if I were to do:

__global__ void setup_kernel(curandState *state, int localsites);

int main(){

    curandState *devStates;
    int localsites = 32*32*32;  // declared first: gridsize depends on it
    int nthreads = 64;
    int gridsize = (localsites + nthreads - 1) / nthreads;

    cudaMalloc((void **)&devStates,localsites*sizeof(curandState));
    setup_kernel<<<gridsize,nthreads>>>(devStates, localsites);

} 

__global__ void setup_kernel(curandState *state, int localsites){
  
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    if (id < localsites) curand_init(0,0,0,&state[id]);
}

Then this runs as expected. Is this a scope issue? Or is the manner in which pointers are allocated inside classes fundamentally different and thus incompatible with CUDA kernels?

Finally, is there any way of making this work? As in, having a class to initialise and contain the device RNG states so they can be passed to the relevant kernels as necessary?
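
A minimal sketch of such a wrapper class follows, assuming the CUDA runtime and the cuRAND device header are available and the file is compiled with nvcc. It keeps the names used in the code above. The CUDA_CHECK macro is a hypothetical helper, not part of CUDA, added so that a failed allocation or kernel launch is reported as a CUDA error at the call site. The deleted copy operations are also an assumption, added so that two objects never free the same device pointer. This is not a diagnosis of the segmentation fault, only one layout that surfaces errors explicitly.

```cuda
#include <cstdio>
#include <cstdlib>
#include <curand_kernel.h>

// Hypothetical helper (not part of CUDA): abort with a message on any CUDA error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

__global__ void setup_kernel(curandState *state, int localsites) {
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    // Same arguments as in the post. Passing a per-thread sequence number
    // (e.g. id) is the usual way to give each thread its own stream.
    if (id < localsites) curand_init(0, 0, 0, &state[id]);
}

class cudaRNG {
public:
    // Members are initialised in declaration order, so gridsize can
    // safely use localsites and nthreads.
    curandState *devStates = nullptr;
    int localsites = 32 * 32 * 32;
    int nthreads = 64;
    int gridsize = (localsites + nthreads - 1) / nthreads;

    cudaRNG() {
        CUDA_CHECK(cudaMalloc((void **)&devStates,
                              localsites * sizeof(curandState)));
        setup_kernel<<<gridsize, nthreads>>>(devStates, localsites);
        CUDA_CHECK(cudaGetLastError());        // launch configuration errors
        CUDA_CHECK(cudaDeviceSynchronize());   // errors raised while the kernel ran
    }

    ~cudaRNG() { cudaFree(devStates); }

    // Assumption: forbid copies so the device pointer has a single owner.
    cudaRNG(const cudaRNG &) = delete;
    cudaRNG &operator=(const cudaRNG &) = delete;
};

int main() {
    cudaRNG rng;  // constructed inside main, after the program has started
    std::printf("initialised %d states\n", rng.localsites);
    return 0;
}
```

Other kernels could then take rng.devStates as a curandState * argument, in the same way setup_kernel does.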

Thank you very much in advance!



