Sunday, October 29, 2023

Optimal CUDA thread/block count for a function with no inputs (random sampling)

There are a lot of questions about optimizing the number of threads and blocks in a CUDA function, but everything I've found is about matching them to the problem size.

I have a task with no problem size: my inputs are random numbers and my output is reduced to a single scalar. For concreteness, suppose I'm estimating π by drawing uniformly random points x ∈ (−1, 1) and y ∈ (−1, 1) and counting how many fall within the unit circle. I'll need a random number generator and an output array as large as my total number of threads, so that each thread can increment its own count without atomics or contention. Then I'll sum the output array to get the single scalar.
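A minimal sketch of that pattern, assuming cuRAND for the per-thread generator; the kernel name `count_in_circle`, the seed, and the iteration count are all illustrative choices, not anything from the question:

```cuda
// Sketch: per-thread hit counters for a Monte Carlo pi estimate.
// Assumes cuRAND for the per-thread RNG; all names are hypothetical.
#include <cstdio>
#include <curand_kernel.h>

__global__ void count_in_circle(unsigned long long seed, int iters,
                                unsigned int *counts)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // One independent RNG stream per thread, kept alive across the loop.
    curandState state;
    curand_init(seed, tid, 0, &state);

    unsigned int hits = 0;
    for (int i = 0; i < iters; ++i) {
        // curand_uniform returns a float in (0, 1]; rescale to (-1, 1].
        float x = 2.0f * curand_uniform(&state) - 1.0f;
        float y = 2.0f * curand_uniform(&state) - 1.0f;
        if (x * x + y * y <= 1.0f)
            ++hits;
    }

    // One output slot per thread: no atomics, no contention.
    counts[tid] = hits;
}

int main()
{
    const int blocks = 1, threads = 1024, iters = 1 << 16;
    unsigned int *d_counts;
    cudaMalloc(&d_counts, blocks * threads * sizeof(unsigned int));

    count_in_circle<<<blocks, threads>>>(1234ULL, iters, d_counts);

    unsigned int h_counts[1024];
    cudaMemcpy(h_counts, d_counts, sizeof(h_counts),
               cudaMemcpyDeviceToHost);

    // Final reduction on the host: sum per-thread counts to one scalar.
    unsigned long long total = 0;
    for (int t = 0; t < blocks * threads; ++t)
        total += h_counts[t];

    double pi = 4.0 * (double)total / ((double)blocks * threads * iters);
    printf("pi estimate: %f\n", pi);

    cudaFree(d_counts);
    return 0;
}
```

The host-side sum is deliberately naive here; with only a few thousand per-thread counts it is cheap, and a device-side reduction would only matter at much larger thread counts.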

I think the optimal configuration is 1024 threads on exactly 1 block, since everything I've read says that an NVIDIA GPU can run no more than 1024 threads at a time. I don't see any advantage to stopping a thread and continuing the random number generation in the next block, and if the random number generator is stateful, it may be better to keep it running in a for loop within each thread anyway.

Am I missing something? And why isn't this use case more common (why don't I see more questions about it)? Monte Carlo is a thing...



