Tuesday, 1 June 2021

Very weird behaviour of xgboost in R


I am learning to use the XGBoost package in R and I have run into some very weird behaviour that I'm not sure how to explain. Perhaps someone can point me in the right direction. I have simplified the R code as much as possible:

# start from a clean session, load xgboost, and set the working directory
rm(list = ls())
library(xgboost)
setwd("/home/my_username/Documents/R_files")

# read the data and derive a binary label from the continuous outcome
my_data <- read.csv("my_data.csv")
my_data$outcome_01 <- ifelse(my_data$outcome_continuous > 0.0, 1, 0)

# separate feature sets for the regression and the classification model
reg_features <- c("feature_1", "feature_2")
class_features <- c("feature_1", "feature_3")

set.seed(93571)
# use every other row as training data
train_data <- my_data[seq(1, nrow(my_data), 2), ]

# design matrix and DMatrix for the regression model
mm_reg_train <- model.matrix(~ . + 0, data = train_data[, reg_features])
train_DM_reg <- xgb.DMatrix(data = mm_reg_train, label = train_data$outcome_continuous)

# var_nrounds is the only value I change between runs
var_nrounds <- 190
xgb_reg_model <- xgb.train(data = train_DM_reg, booster = "gbtree", objective = "reg:squarederror",
                           nrounds = var_nrounds, eta = 0.07,
                           max_depth = 5, min_child_weight = 0.8, subsample = 0.6, colsample_bytree = 1.0,
                           verbose = FALSE)

# design matrix and DMatrix for the classification model
mm_class_train <- model.matrix(~ . + 0, data = train_data[, class_features])
train_DM_class <- xgb.DMatrix(data = mm_class_train, label = train_data$outcome_01)

xgb_class_model <- xgb.train(data = train_DM_class, booster = "gbtree", objective = "binary:logistic",
                             eval_metric = "auc", nrounds = 70, eta = 0.1,
                             max_depth = 3, min_child_weight = 0.5, subsample = 0.75, colsample_bytree = 0.5,
                             verbose = FALSE)

# predict() already returns probabilities for a binary:logistic booster;
# a 'type' argument is not part of predict.xgb.Booster's signature
probabilities <- predict(xgb_class_model, newdata = train_DM_class)
print(paste0("simple check: ", sum(probabilities)), quote = FALSE)

Here is the problem: the outcome of sum(probabilities) depends on the value of var_nrounds!
How can that be? After all, var_nrounds enters only into xgb_reg_model, while the probabilities are computed with xgb_class_model, which should know nothing about the value of var_nrounds. The only thing I change in this code is the value of var_nrounds, and yet the sum of the probabilities changes when I rerun it. The change is also deterministic: with var_nrounds = 190 I always get (with my data) 5324.3, and with var_nrounds = 285 I always get 5322.8. However, if I remove the line set.seed(93571), the result changes non-deterministically every time I rerun the code.
Could it be that XGBoost has some built-in stochastic behaviour that depends on the number of boosting rounds run beforehand in another model, and that is also controlled by setting a seed before training? Any ideas?
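
One way to test this: if XGBoost's row and column subsampling draws its random numbers from R's global RNG stream, then training the first model would advance that stream by an amount that depends on var_nrounds, and the second model's subsampling would start from a different point. A minimal check, reusing train_DM_reg and var_nrounds from above (state_before and state_after are just illustrative names):

set.seed(93571)
state_before <- .Random.seed
xgb_reg_model <- xgb.train(data = train_DM_reg, booster = "gbtree", objective = "reg:squarederror",
                           nrounds = var_nrounds, eta = 0.07,
                           max_depth = 5, min_child_weight = 0.8, subsample = 0.6, colsample_bytree = 1.0,
                           verbose = FALSE)
state_after <- .Random.seed
# FALSE would mean training consumed random numbers, so any stochastic step
# that follows starts from a different point in the stream depending on var_nrounds
identical(state_before, state_after)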
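
And if that is indeed what happens, re-seeding immediately before training the second model should decouple it from the first one, so that sum(probabilities) comes out identical for var_nrounds = 190 and 285. A sketch of that check:

set.seed(93571)
xgb_class_model <- xgb.train(data = train_DM_class, booster = "gbtree", objective = "binary:logistic",
                             eval_metric = "auc", nrounds = 70, eta = 0.1,
                             max_depth = 3, min_child_weight = 0.5, subsample = 0.75, colsample_bytree = 0.5,
                             verbose = FALSE)
probabilities <- predict(xgb_class_model, newdata = train_DM_class)
# if the hypothesis is right, this sum no longer changes with var_nrounds
print(paste0("simple check: ", sum(probabilities)), quote = FALSE)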



