There are a lot of questions about optimizing the number of threads and blocks for a CUDA kernel, but everything I've found is about matching them to the problem size.
I have a task with no problem size. My inputs are random numbers and my output is to be reduced to a single scalar. For simplicity, suppose that I'm computing π by throwing uniformly random points x ∈ (−1, 1) and y ∈ (−1, 1) to see how many fall within the unit circle. I'll need a random number generator and an output array as large as my total number of threads, so that each thread can increment its own count without atomics/contention. Then I'll need to sum the output array to get a single scalar.
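For concreteness, here's a minimal sketch of the kernel I have in mind, using cuRAND for the per-thread generators (the kernel name `count_hits` and the `SAMPLES_PER_THREAD` constant are just placeholders I made up):

    // Minimal sketch; count_hits and SAMPLES_PER_THREAD are my placeholders.
    #include <cstdio>
    #include <cuda_runtime.h>
    #include <curand_kernel.h>

    #define SAMPLES_PER_THREAD 100000

    __global__ void count_hits(unsigned long long seed, unsigned int *counts)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;

        // One stateful generator per thread, seeded once and then kept
        // running for the whole loop.
        curandState state;
        curand_init(seed, tid, 0, &state);

        unsigned int hits = 0;
        for (int i = 0; i < SAMPLES_PER_THREAD; ++i) {
            // curand_uniform returns floats in (0, 1]; rescale to (-1, 1].
            float x = 2.0f * curand_uniform(&state) - 1.0f;
            float y = 2.0f * curand_uniform(&state) - 1.0f;
            if (x * x + y * y <= 1.0f)
                ++hits;
        }

        // Each thread writes to its own slot: no atomics, no contention.
        counts[tid] = hits;
    }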
I think the optimal way to do this is to use 1024 threads in exactly 1 block, since everything I've read says that an NVIDIA GPU can run no more than 1024 threads at a time. I don't see any advantage to stopping a thread and continuing the random number generation in the next block, and if the random number generator is stateful, it could be better to keep it running in a for loop within the thread anyway.
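Continuing that sketch, this is the launch I'm proposing: one block of 1024 threads, then a plain host-side loop to sum the 1024 per-thread counts (error checking omitted; the seed is arbitrary):

    int main()
    {
        const int nthreads = 1024;

        unsigned int *d_counts;
        cudaMalloc(&d_counts, nthreads * sizeof(unsigned int));

        // 1 block of 1024 threads, as proposed above.
        count_hits<<<1, nthreads>>>(1234ULL, d_counts);
        cudaDeviceSynchronize();

        unsigned int h_counts[nthreads];
        cudaMemcpy(h_counts, d_counts, sizeof(h_counts), cudaMemcpyDeviceToHost);
        cudaFree(d_counts);

        // Final reduction to a single scalar: only 1024 values, so the
        // host can just add them up.
        unsigned long long total = 0;
        for (int i = 0; i < nthreads; ++i)
            total += h_counts[i];

        double samples = (double)nthreads * SAMPLES_PER_THREAD;
        printf("pi ~= %f\n", 4.0 * total / samples);
        return 0;
    }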
Am I missing something? And why isn't this use case more common (why don't I see more questions about it)? Monte Carlo's a thing...