Wednesday, 22 May 2019

R caret: fully reproducible results with parallel rfe on different machines

The following code runs recursive feature elimination (rfe) in caret with a random forest as the model. When run in parallel, it is fully reproducible as long as it is run on the same machine:

library(doParallel)
library(caret)

recursive_feature_elimination <- function(dat){

  all_preds <- dat[,which(names(dat) %in% c("Time", "Chick", "Diet"))]
  response <- dat[,which(names(dat) == "weight")]

  sizes <- c(1:(ncol(all_preds)-1))

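  # set the RNG explicitly so that the seed list below is reproducible
  set.seed(42, kind = "Mersenne-Twister", normal.kind = "Inversion")
  # rfeControl expects `seeds` as a list with (number * repeats) + 1 elements,
  # here 3 * 5 + 1 = 16. Each of the first 15 elements is an integer vector of
  # length length(sizes) + 1 (one seed per subset size, including the full set).
  seeds <- vector(mode = "list", length = 16)
  for (i in 1:15) seeds[[i]] <- sample.int(n = 1000, size = length(sizes) + 1)
  # the last element is a single integer, used for the final model
  seeds[[16]] <- sample.int(1000, 1)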
  seeds_list <- list(rfe_seeds = seeds,
                     train_seeds = NA)

  # specify rfeControl
  contr <- caret::rfeControl(functions=rfFuncs, method="repeatedcv", number=3, repeats=5, 
                             saveDetails = TRUE, seeds = seeds, allowParallel = TRUE)

  # recursive feature elimination caret 
  results <- caret::rfe(x = all_preds, 
                        y = response,
                        sizes = sizes, 
                        method ="rf",
                        ntree = 250, 
                        metric= "RMSE", 
                        rfeControl=contr )


  return(results)
}

dat <- as.data.frame(ChickWeight)

cores <- detectCores()
cl <- makePSOCKcluster(cores, outfile="")
registerDoParallel(cl)
results <- recursive_feature_elimination(dat)
stopCluster(cl)
registerDoSEQ()

The outcome on my machine is:

 Variables  RMSE Rsquared   MAE RMSESD RsquaredSD MAESD Selected
         1 39.14   0.6978 24.60  2.755    0.02908 1.697         
         2 23.12   0.8998 13.90  2.675    0.02273 1.361        *
         3 28.18   0.8997 20.32  2.243    0.01915 1.225         

The top 2 variables (out of 2):
   Time, Chick

I am using Windows on a machine with one CPU and 4 cores. When the code is run on a UNIX machine with multiple CPUs, each with multiple cores, the outcome is different. I suspect the cause is random number generation, which presumably differs between my system and the multi-CPU system.
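One way to test this suspicion is to compare the RNG configuration on both machines before looking at caret itself. The sketch below is only a diagnostic under stated assumptions: it assumes the two machines might run different R versions, and `seed_check` is a hypothetical name introduced here. From R 3.6.0 on, `sample()` and `sample.int()` default to "Rejection" sampling instead of the older "Rounding" behaviour, so the same `set.seed(42, ...)` call can produce a different `seeds` list on each side.

```r
# Run this on each machine and compare the printed values.
cat(R.version.string, "\n")
print(RNGkind())  # generator, normal generator and, on R >= 3.6.0, sample kind

# Assumption: the reference machine used the pre-3.6.0 sampling behaviour.
# `sample.kind` is only accepted by set.seed() in R >= 3.6.0, and selecting
# "Rounding" raises a warning, which is suppressed here.
if (getRversion() >= "3.6.0") {
  suppressWarnings(
    set.seed(42, kind = "Mersenne-Twister", normal.kind = "Inversion",
             sample.kind = "Rounding")
  )
} else {
  set.seed(42, kind = "Mersenne-Twister", normal.kind = "Inversion")
}

# Hypothetical check value: draw a few seeds the same way the rfe code does.
seed_check <- sample.int(n = 1000, size = 3)
print(seed_check)  # should match across machines if the RNG setup matches
```

If `seed_check` differs between the machines, the seed lists passed to `rfeControl` differ as well, and the results cannot match regardless of how the work is parallelized.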

How can I produce fully reproducible results independent of the OS and of how many CPUs and cores are used for parallelization?

How can I ensure that the same random numbers are used for each internal process of rfe and randomForest, regardless of the order in which the processes run during parallel computation?

How are the random numbers generated for each parallel process?
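To make the last question concrete: the caret documentation describes the `seeds` argument as a set of integers used to set the seed at each resampling iteration. The sketch below illustrates that general principle with base `parallel` only; it is not caret's internal code. The names `seeded_task` and `task_seeds` are hypothetical. When each task seeds the RNG itself, the draws depend on the task's seed and not on which worker runs it or in which order.

```r
library(parallel)

# Hypothetical task: seed the RNG from the task's own seed, then draw.
# The seed travels with the task, so the worker that runs it is irrelevant.
seeded_task <- function(seed) {
  set.seed(seed, kind = "Mersenne-Twister", normal.kind = "Inversion")
  runif(2)
}

task_seeds <- c(101L, 202L, 303L, 404L)  # illustrative values only

# Sequential reference run
sequential <- lapply(task_seeds, seeded_task)

# The same tasks distributed over a 2-worker PSOCK cluster
cl <- makePSOCKcluster(2)
parallel_run <- parLapply(cl, task_seeds, seeded_task)
stopCluster(cl)

# Compare the sequential and parallel draws
print(identical(sequential, parallel_run))
```

On a single machine the comparison should print TRUE for any number of workers. Across machines it additionally requires the same R version and RNG settings on both sides.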



