I have just started using R's doParallel package to split up large tasks to run concurrently.
I need the random number generator (RNG) inside a given worker to be as independent as possible from all the others. I also need its seed to be programmatically specified, if desired, to make that worker reproduce results.
The doParallel package vignette falls short on this point: it merely mentions that
When using multicore-like functionality, the doParallel package allows you to specify various options [...like...] "set.seed"
So, when exactly are you using "multicore-like functionality"? (I do not see where the vignette defines this, but it seems to be when you use an explicit cluster.) Is "multicore-like functionality" available on all operating systems (probably not on Windows, for example)? What exactly does "set.seed" do?
So, I began web searching. It was hard to find pure doParallel specific advice. But the general form of this question has bedeviled a legion of others, such as this and this.
When I look at the proposed solutions, I am confused. Many seem to be package specific (e.g. call clusterSetupRNG). Even worse, some packages are operating system specific (e.g. multicore only works on Unix). Finally, it is not obvious whether they will reliably work with doParallel (which is an interface between foreach and parallel, and parallel is a merger of multicore and snow).
I noticed that one commonality of all the solutions I have seen is that they want to do configuration outside of each worker, before you start the computation, to influence the random numbers each one subsequently internally generates.
What about the opposite approach: have each worker explicitly call set.seed with a unique argument?
To make this concrete, consider this code (adapted from the second Stack Overflow link above):
library(doParallel)
cluster = makeCluster(2)
registerDoParallel(cluster)
set.seed(123)
a = foreach(i = 1:2, .combine = cbind) %dopar% {rnorm(5)}
a
set.seed(123)
b = foreach(i = 1:2, .combine = cbind) %dopar% {rnorm(5)}
b
stopCluster(cluster)
When I execute it, I too see that a and b are different, just as mpiktas did.
Now consider my proposed alternative solution: put the call to set.seed inside each worker, and supply each worker with a unique seed:
library(doParallel)
cluster = makeCluster(2)
registerDoParallel(cluster)
f = function(i) {
  set.seed(i)
  return( rnorm(5) )
}
a = foreach(i = 1:2, .combine = cbind) %dopar% f(i)
a
b = foreach(i = 1:2, .combine = cbind) %dopar% f(i)
b
stopCluster(cluster)
When I execute the revised code, a and b are now the same, which indicates that each worker correctly used its specified seed.
Are there any problems with this approach?
The above example is problematic because, for simplicity, it uses the loop index i directly as the seed. Something like a proper hash of that index would surely be better. In the calculations that I will be doing, each worker receives several parameters that I could easily hash individually and then XOR together, which should generate well-separated unique seeds. So this will likely not be a problem for me, but it is something to be aware of.
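To make the hash-and-XOR idea concrete, here is a minimal sketch in base R. The helper names mix_int and combine_seed are hypothetical, and the single Lehmer (MINSTD) multiplication is only a stand-in for a proper hash, chosen because it stays exact in double arithmetic. It assumes the parameters are non-negative integers. One caveat: XOR is symmetric, so this sketch gives the parameter sets (1, 2) and (2, 1) the same seed.

```r
# Hypothetical helper: scramble one non-negative integer with a single
# Lehmer (MINSTD) step. This is a stand-in for a real hash function.
mix_int <- function(x) {
  # The product stays below 2^53, so the double arithmetic is exact.
  as.integer((as.numeric(x) * 48271) %% 2147483647)
}

# Hypothetical helper: mix each parameter, then XOR the results together.
combine_seed <- function(...) {
  Reduce(bitwXor, lapply(list(...), mix_int))
}

# Example: three integer parameters that identify one task.
seed <- combine_seed(3L, 7L, 42L)
set.seed(seed)
x <- rnorm(5)
```

Inside a worker, f(i) would then call set.seed(combine_seed(...)) on its own parameters instead of set.seed(i).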
The bigger problem is that I do not know whether there are subtle dependencies that will cause the worker RNGs to interfere with each other and so ruin the desired independence. My mental model is that all of R's concurrency packages execute each worker in its own dedicated R process, not in a thread within the parent process. Processes should mostly be isolated from each other. So the workers' RNGs ought to be totally independent of one another EXCEPT for how they are initialized: the default initialization could come from state that is outside the process, and so processes could still influence each other. But if each worker, at its start, explicitly calls set.seed, do you eliminate that possibility, and so eliminate RNG interference?
The best low level discussion that I have found on this is section "6 Random-number generation" in the package parallel vignette.
The second paragraph describes how an R process's initial seed comes from .Random.seed and how R processes can interfere with each other via that.
The third paragraph starts with this sentence: "The alternative is to set separate seeds for each worker process in some reproducible way from the seed in the master process". The first part of that sentence sounds compatible with what I am proposing. What bothers me is the second part, where it claims that each worker's seed needs to come "in some reproducible way from the seed in the master process". I do not follow that: the requirement that worker seeds depend on the master seed seems too strict to me. That paragraph's second sentence states the concern "This is generally plenty safe enough, but there have been worries that the random-number streams in the workers might somehow get into step".
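For reference, my reading of what that sentence describes is sketched below: derive one L'Ecuyer-CMRG stream per task from a single master seed using parallel::nextRNGStream, then install that stream inside each task. The helper names make_task_seeds and draw_with_stream are hypothetical, and this is only a sketch of the idea under those assumptions, not necessarily what doParallel or parallel do internally.

```r
library(doParallel)  # attaches foreach; the parallel package ships with R

# Hypothetical helper: build one L'Ecuyer-CMRG stream seed per task, all
# derived from a single master seed. Note that this switches the RNG kind
# of the master process to L'Ecuyer-CMRG.
make_task_seeds <- function(n_tasks, master_seed) {
  RNGkind("L'Ecuyer-CMRG")
  set.seed(master_seed)
  seeds <- vector("list", n_tasks)
  s <- .Random.seed
  for (i in seq_len(n_tasks)) {
    seeds[[i]] <- s
    s <- parallel::nextRNGStream(s)  # advance to the next stream
  }
  seeds
}

# Hypothetical helper, run inside a worker: install the given stream as
# that process's RNG state, then draw from it.
draw_with_stream <- function(seed, n) {
  assign(".Random.seed", seed, envir = .GlobalEnv)
  rnorm(n)
}

cluster <- makeCluster(2)
registerDoParallel(cluster)

seeds <- make_task_seeds(2, master_seed = 123)
a <- foreach(s = seeds, .combine = cbind) %dopar% draw_with_stream(s, 5)
b <- foreach(s = seeds, .combine = cbind) %dopar% draw_with_stream(s, 5)

stopCluster(cluster)

# Each task installs the same stream in both runs, so this should be TRUE
# no matter which worker picks up which task.
identical(a, b)
```

Because the seed travels with the task rather than with the worker, the result does not depend on how foreach schedules tasks onto the cluster.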
I await your excellent informed feedback.