Tuesday, August 3, 2021

How to generate random numbers on a `FORK` cluster using `parallel::parSapply` in `R`?

I am trying to generate random numbers on the individual processes of a FORK cluster. I found Steve Weston's answer, which outlines three ways to do this. However, that answer is mainly geared towards parallel::mclapply, and I would like to know whether I can achieve something similar with a manually created FORK cluster, and whether the approach below has any side effects I am not aware of.

The problem.

# Ensure the `.Random.seed` exists in the main process.
rnorm(1)

# Print the seed.
print(.Random.seed)
# [1] 10407 -1622412906  1293611072  -102526474  -427981783  -110857513  1641877986

# Generate random numbers on the cluster (the problem).
for(i in 1:2) {
    # Feedback for clarity.
    cat("Run number:", i, "\n")

    # Create the `FORK` cluster.
    backend <- parallel::makeCluster(2, "FORK")

    # Check the values of `.Random.seed` on the child processes.
    print(parallel::clusterEvalQ(backend, { .Random.seed }))

    # Generate random numbers.
    print(parallel::parSapply(backend, 1:2, function(x) rnorm(1)))

    # Stop the cluster.
    parallel::stopCluster(backend)

    # Feedback for clarity.
    cat(rep("-", 30), "\n\n")
}

# Run number: 1 
# [1] 10407 -1622412906  1293611072  -102526474  -427981783  -110857513  1641877986
# [1] 10407 -1622412906  1293611072  -102526474  -427981783  -110857513  1641877986
# [1] 0.4935776 0.4935776
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
#
# Run number: 2 
# [1] 10407 -1622412906  1293611072  -102526474  -427981783  -110857513  1641877986
# [1] 10407 -1622412906  1293611072  -102526474  -427981783  -110857513  1641877986
# [1] 0.4935776 0.4935776
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 

# ...

As you can see, in accordance with Steve Weston's answer, the child processes are created with a copy of the .Random.seed that exists in the main process. Hence, unless something advances the RNG state of the main process (or removes .Random.seed) between runs, every newly created cluster produces the same numbers.

A perhaps naive approach.

The documentation of mcparallel says that if the RNG state exists, it is copied by the child processes; otherwise, each child process initializes its own RNG based on the clock time and process ID:

If mc.set.seed = FALSE, the child process has the same initial random number generator (RNG) state as the current R session. If the RNG has been used (or .Random.seed was restored from a saved workspace), the child will start drawing random numbers at the same point as the current session. If the RNG has not yet been used, the child will set a seed based on the time and process ID when it first uses the RNG: this is pretty much guaranteed to give a different random-number stream from the current session and any other child process.
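The quoted behaviour can be seen in isolation with a minimal sketch using parallel::mclapply. It assumes a Unix-alike system, since forking is not available on Windows. With mc.set.seed = FALSE and an RNG that has already been used, both children should start from the parent's state and return the same draw.

```r
library(parallel)

# Ensure the `.Random.seed` exists in the main process.
rnorm(1)

# Fork two children, one task each, both inheriting the parent's RNG state.
same_draws <- unlist(mclapply(
    1:2,
    function(x) rnorm(1),
    mc.cores = 2,
    mc.set.seed = FALSE
))

# Both children draw from the same point in the stream.
print(same_draws[1] == same_draws[2])
```
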

Hence, is it safe to remove the .Random.seed copied from the main process and let the child processes initialize their own based on the clock time and process ID? For example, something like the following:

# ...

# Generate random numbers on the cluster (a potential approach)?
for(i in 1:2) {
    # Feedback for clarity.
    cat("Run number:", i, "\n")

    # Create the `FORK` cluster.
    backend <- parallel::makeCluster(2, "FORK")

    # Check the inherited values of `.Random.seed` on the child processes.
    print(parallel::clusterEvalQ(backend, { .Random.seed }))

    # Upon creation of the cluster, clear the global environment of each child
    # process, which also removes the inherited `.Random.seed`.
    parallel::clusterEvalQ(backend, rm(list = ls(all.names = TRUE)))

    # Generate random numbers.
    print(parallel::parSapply(backend, 1:2, function(x) rnorm(1)))

    # Check the created values of `.Random.seed` on the child processes.
    print(parallel::clusterEvalQ(backend, { .Random.seed }))

    # Stop the cluster.
    parallel::stopCluster(backend)

    # Feedback for clarity.
    cat(rep("-", 30), "\n\n")
}

# Which results in the following output.
# Run number: 1 
# [1] 10407 -1622412906  1293611072  -102526474  -427981783  -110857513  1641877986
# [1] 10407 -1622412906  1293611072  -102526474  -427981783  -110857513  1641877986
# [1] -0.5715241  1.1185240
# [1] 10407   526289805   353889827  1059054224 -1056295096  -865117501  2032268800
# [1] 10407  154241933 -599585698 -497710325 -694601912  -34060352 -398990459
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
# 
# Run number: 2 
# [1] 10407 -1622412906  1293611072  -102526474  -427981783  -110857513  1641877986
# [1] 10407 -1622412906  1293611072  -102526474  -427981783  -110857513  1641877986
# [1] -1.1293323 -0.8339863
# [1] 10407  1114672013  -219668828  1458973439 -2063321272  -775346660  1678610177
# [1] 10407 -1649177715 -1213192943 -1982050350  1287862088 -2081396332   185873261
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
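One visible side effect of the loop above is that rm(list = ls(all.names = TRUE)) clears every object in the global environment of the child processes, not just the seed. A narrower variant, sketched below under the assumption of a Unix-alike system, removes only the inherited .Random.seed. The variable keep_me is hypothetical and exists only to show that other inherited objects survive.

```r
library(parallel)

# Hypothetical object in the main process that the children should keep.
keep_me <- 1

# Ensure the `.Random.seed` exists in the main process.
rnorm(1)

# Create the `FORK` cluster.
backend <- makeCluster(2, "FORK")

# Remove only the inherited `.Random.seed` on each child process.
invisible(clusterEvalQ(backend, {
    if (exists(".Random.seed", envir = globalenv(), inherits = FALSE)) {
        rm(".Random.seed", envir = globalenv())
    }
    NULL
}))

# Each child now initializes its own RNG on first use.
draws <- parSapply(backend, 1:2, function(x) rnorm(1))

# Check that the other inherited object is still available on the children.
kept <- unlist(clusterEvalQ(backend, exists("keep_me")))

# Stop the cluster.
stopCluster(backend)

print(draws)
print(kept)
```

Seeds initialized this way depend on the time and process ID, so the draws are not reproducible across runs.
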

A quick note: I tried using parallel::clusterSetRNGStream, but I didn't have much luck with it.
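For reference, here is a minimal sketch of how parallel::clusterSetRNGStream is usually called. It is not necessarily what was tried here. The function gives each child process its own L'Ecuyer-CMRG stream derived from iseed. The value 123 is an arbitrary illustrative seed, and the example again assumes a Unix-alike system for the FORK cluster.

```r
library(parallel)

# Create the `FORK` cluster.
backend <- makeCluster(2, "FORK")

# Assign a distinct L'Ecuyer-CMRG stream to each child process.
clusterSetRNGStream(backend, iseed = 123)  # Arbitrary seed, for illustration.
first <- parSapply(backend, 1:2, function(x) rnorm(1))

# Resetting the streams with the same seed repeats the same draws.
clusterSetRNGStream(backend, iseed = 123)
second <- parSapply(backend, 1:2, function(x) rnorm(1))

# Stop the cluster.
stopCluster(backend)

print(first)
print(second)
```

With a fixed iseed the two child processes draw from different streams, and the results are reproducible. This is the opposite trade-off to letting each child seed itself from the clock time and process ID.
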



