I am trying to generate random numbers on the individual processes of a FORK cluster. I found Steve Weston's answer, which outlines three ways of doing this. However, that answer is mainly geared towards parallel::mclapply, and I would like to know whether I can achieve something similar with a manually created FORK cluster, and whether the approach below has any side effects I am not aware of.
The problem.
# Ensure the `.Random.seed` exists in the main process.
rnorm(1)
# Print the seed.
print(.Random.seed)
# [1] 10407 -1622412906 1293611072 -102526474 -427981783 -110857513 1641877986
# Generate random numbers on the cluster (the problem).
for (i in 1:2) {
    # Feedback for clarity.
    cat("Run number:", i, "\n")
    # Create the `FORK` cluster.
    backend <- parallel::makeCluster(2, "FORK")
    # Check the values of `.Random.seed` on the child processes.
    print(parallel::clusterEvalQ(backend, { .Random.seed }))
    # Generate random numbers.
    print(parallel::parSapply(backend, 1:2, function(x) rnorm(1)))
    # Stop the cluster.
    parallel::stopCluster(backend)
    # Feedback for clarity.
    cat(rep("-", 30), "\n\n")
}
# Run number: 1
# [1] 10407 -1622412906 1293611072 -102526474 -427981783 -110857513 1641877986
# [1] 10407 -1622412906 1293611072 -102526474 -427981783 -110857513 1641877986
# [1] 0.4935776 0.4935776
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
# Run number: 2
# [1] 10407 -1622412906 1293611072 -102526474 -427981783 -110857513 1641877986
# [1] 10407 -1622412906 1293611072 -102526474 -427981783 -110857513 1641877986
# [1] 0.4935776 0.4935776
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# ...
As you can see, and in accordance with Steve Weston's answer, the child processes are created with a copy of the .Random.seed that exists in the main process. Hence, unless something advances the RNG state in the main process (or removes .Random.seed) between runs, I end up with the same numbers each time I create a new cluster.
A perhaps naive approach.
The documentation of mcparallel says that if the RNG state exists, it will be copied by the child processes; otherwise, each child process will initialize the RNG based on the clock time and process ID:
If mc.set.seed = FALSE, the child process has the same initial random number generator (RNG) state as the current R session. If the RNG has been used (or .Random.seed was restored from a saved workspace), the child will start drawing random numbers at the same point as the current session. If the RNG has not yet been used, the child will set a seed based on the time and process ID when it first uses the RNG: this is pretty much guaranteed to give a different random-number stream from the current session and any other child process.
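The quoted behaviour can be checked directly with mcparallel. The following is only a minimal sketch, assuming a Unix-alike system where forking is available; the seed value 42 and the variable names are arbitrary choices for illustration.

```r
# Case 1: the RNG state exists in the main process, so both children inherit it.
set.seed(42)
job_a <- parallel::mcparallel(rnorm(1), mc.set.seed = FALSE)
job_b <- parallel::mcparallel(rnorm(1), mc.set.seed = FALSE)
inherited <- unlist(parallel::mccollect(list(job_a, job_b)), use.names = FALSE)

# Case 2: no `.Random.seed` in the main process, so each child seeds itself
# from the time and its process ID on first use of the RNG.
rm(".Random.seed", envir = .GlobalEnv)
job_c <- parallel::mcparallel(rnorm(1), mc.set.seed = FALSE)
job_d <- parallel::mcparallel(rnorm(1), mc.set.seed = FALSE)
self_seeded <- unlist(parallel::mccollect(list(job_c, job_d)), use.names = FALSE)

print(inherited)    # two equal values
print(self_seeded)  # two values that should differ
```
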
Hence, is it safe to remove the .Random.seed copied from the main process and let the child processes initialize their own based on the clock time and process ID? For example, something like the following:
# ...
# Generate random numbers on the cluster (a potential approach)?
for (i in 1:2) {
    # Feedback for clarity.
    cat("Run number:", i, "\n")
    # Create the `FORK` cluster.
    backend <- parallel::makeCluster(2, "FORK")
    # Check the inherited values of `.Random.seed` on the child processes.
    print(parallel::clusterEvalQ(backend, { .Random.seed }))
    # Upon creation of the cluster remove the inherited `.Random.seed`.
    # Note: this clears every object in each worker's global environment,
    # not only `.Random.seed`.
    parallel::clusterEvalQ(backend, rm(list = ls(all.names = TRUE)))
    # Generate random numbers.
    print(parallel::parSapply(backend, 1:2, function(x) rnorm(1)))
    # Check the created values of `.Random.seed` on the child processes.
    print(parallel::clusterEvalQ(backend, { .Random.seed }))
    # Stop the cluster.
    parallel::stopCluster(backend)
    # Feedback for clarity.
    cat(rep("-", 30), "\n\n")
}
# Which results in the following output.
# Run number: 1
# [1] 10407 -1622412906 1293611072 -102526474 -427981783 -110857513 1641877986
# [1] 10407 -1622412906 1293611072 -102526474 -427981783 -110857513 1641877986
# [1] -0.5715241 1.1185240
# [1] 10407 526289805 353889827 1059054224 -1056295096 -865117501 2032268800
# [1] 10407 154241933 -599585698 -497710325 -694601912 -34060352 -398990459
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
# Run number: 2
# [1] 10407 -1622412906 1293611072 -102526474 -427981783 -110857513 1641877986
# [1] 10407 -1622412906 1293611072 -102526474 -427981783 -110857513 1641877986
# [1] -1.1293323 -0.8339863
# [1] 10407 1114672013 -219668828 1458973439 -2063321272 -775346660 1678610177
# [1] 10407 -1649177715 -1213192943 -1982050350 1287862088 -2081396332 185873261
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
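One side effect visible in the code above is that rm(list = ls(all.names = TRUE)) clears everything the workers inherited, not just the seed. A narrower variant might look like the sketch below; it assumes a Unix-alike system where FORK clusters are available, and that dropping only .Random.seed is all that is wanted. The call to set.seed(123) is just an arbitrary way of making sure the seed exists in the main process.

```r
# Make sure `.Random.seed` exists in the main process (arbitrary seed value).
set.seed(123)

# Create the `FORK` cluster; the workers inherit the seed set above.
backend <- parallel::makeCluster(2, type = "FORK")

# Remove only the inherited seed and leave all other inherited objects alone.
invisible(parallel::clusterEvalQ(backend, {
    if (exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE)) {
        rm(".Random.seed", envir = .GlobalEnv)
    }
    NULL
}))

# Each worker now seeds itself from the time and its process ID on first use.
draws <- parallel::parSapply(backend, 1:2, function(x) rnorm(1))

parallel::stopCluster(backend)
print(draws)
```

Because the workers seed themselves from the clock and process ID, the resulting numbers are not reproducible from run to run.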
A quick mention: I tried using parallel::clusterSetRNGStream, but I didn't have much luck there.
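For reference, a minimal sketch of how parallel::clusterSetRNGStream can be applied to a FORK cluster is shown below. It is not necessarily what I tried. The helper name draw_with_streams and the iseed values are hypothetical and chosen only for illustration, and a Unix-alike system is assumed.

```r
# Hypothetical helper: create a FORK cluster, give each worker its own
# L'Ecuyer-CMRG stream derived from `iseed`, and draw one number per worker.
draw_with_streams <- function(iseed) {
    backend <- parallel::makeCluster(2, type = "FORK")
    # Ensure the cluster is stopped even if an error occurs.
    on.exit(parallel::stopCluster(backend))
    # Replace the inherited seed on each worker with a separate stream.
    parallel::clusterSetRNGStream(backend, iseed = iseed)
    parallel::parSapply(backend, 1:2, function(x) rnorm(1))
}

run_1 <- draw_with_streams(iseed = 1)
run_2 <- draw_with_streams(iseed = 1)  # same `iseed`, same numbers
run_3 <- draw_with_streams(iseed = 2)  # different `iseed`, different numbers

print(run_1)
print(run_2)
print(run_3)
```

With this approach the workers draw from distinct streams, and the results are reproducible for a given iseed; getting different numbers on each run then requires supplying a different iseed each time.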