Friday, December 25, 2015

Weighted random sampling with CUDA

What is the most efficient way to implement weighted random sampling with CUDA?

I have an iterative process over a large set of weighted objects. At each iteration I need to sample one object (with probability proportional to its weight), then recalculate all the weights based on the selected object, and so on.

In pure C++ it would look something like this:

// w and w_cumsum are std::vector<double> of the same size.
while (true) {
    std::partial_sum(w.begin(), w.end(), w_cumsum.begin());
    // rand() must be normalized to [0, 1) before scaling by the total weight
    double r = rand() / (RAND_MAX + 1.0) * w_cumsum.back();
    int i = std::upper_bound(w_cumsum.begin(), w_cumsum.end(), r) - w_cumsum.begin();
    recalc_weights(w, objects[i]);
}

But I'm not sure which way is faster: copying the values from the GPU to host memory and doing the upper_bound binary search on the host, or doing the binary search on the GPU (which would run in a single thread of a single block and could be really slow). Or are both approaches bad, and there is some much better way?
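One option that avoids both the bulk device-to-host copy and a hand-written single-thread kernel is to keep everything in Thrust containers and let the library do the scan and the binary search on the device. A sketch, assuming the weights already live in a thrust::device_vector (the function name and signature here are my own, not from the original post):

```cpp
#include <thrust/binary_search.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// w: weights resident on the device; cumsum: scratch vector of the same size;
// u: a uniform random number in [0, 1) generated on the host.
int sample_index(const thrust::device_vector<double>& w,
                 thrust::device_vector<double>& cumsum, double u) {
    // Parallel prefix sum on the device.
    thrust::inclusive_scan(w.begin(), w.end(), cumsum.begin());
    // cumsum.back() triggers only a single 8-byte device-to-host transfer.
    double r = u * cumsum.back();
    // The search runs on the device; only the resulting index comes back.
    return thrust::upper_bound(cumsum.begin(), cumsum.end(), r) - cumsum.begin();
}
```

The per-iteration traffic is then one double and one index rather than the whole cumulative array, at the cost of one scan and one search kernel launch per sample.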

Also, a more general question: is it fine to launch a single-block, single-thread kernel when we need to do some non-parallel task on a small amount of data, or is it better to copy all the needed data into host memory, process it there, and then copy the results back to the device?
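For concreteness, the single-thread pattern the question refers to would be a minimal sketch along these lines (kernel and variable names are hypothetical):

```cpp
// A <<<1, 1>>> launch that performs the sequential upper_bound search
// entirely on the device, so the cumulative array never leaves the GPU.
__global__ void sample_kernel(const double* cumsum, int n, double r, int* out) {
    int lo = 0, hi = n;  // upper_bound over cumsum[0..n): first element > r
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (cumsum[mid] <= r) lo = mid + 1;
        else hi = mid;
    }
    *out = lo;
}

// Host side:
//   sample_kernel<<<1, 1>>>(d_cumsum, n, r, d_out);
//   cudaMemcpy(&i, d_out, sizeof(int), cudaMemcpyDeviceToHost);
// Only sizeof(int) travels back, instead of the whole cumulative array.
```

The search itself is O(log n) even in one thread, so the real cost to weigh is the kernel-launch latency against the latency of copying the array to the host.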



