I am learning to use the XGBoost package in R and I encountered some very weird behaviour that I'm not sure how to explain. Perhaps someone can give me some directions. I simplified the R code as much as possible:
rm(list = ls())
library(xgboost)

setwd("/home/my_username/Documents/R_files")
my_data <- read.csv("my_data.csv")

# Binary outcome derived from the continuous one
my_data$outcome_01 <- ifelse(my_data$outcome_continuous > 0.0, 1, 0)

reg_features   <- c("feature_1", "feature_2")
class_features <- c("feature_1", "feature_3")

set.seed(93571)
train_data <- my_data[seq(1, nrow(my_data), 2), ]

# Model 1: regression on the continuous outcome
mm_reg_train <- model.matrix(~ . + 0, data = train_data[, reg_features])
train_DM_reg <- xgb.DMatrix(data = mm_reg_train, label = train_data$outcome_continuous)

var_nrounds <- 190
xgb_reg_model <- xgb.train(data = train_DM_reg, booster = "gbtree",
                           objective = "reg:squarederror",
                           nrounds = var_nrounds, eta = 0.07,
                           max_depth = 5, min_child_weight = 0.8,
                           subsample = 0.6, colsample_bytree = 1.0,
                           verbose = FALSE)

# Model 2: binary classification on the thresholded outcome
mm_class_train <- model.matrix(~ . + 0, data = train_data[, class_features])
train_DM_class <- xgb.DMatrix(data = mm_class_train, label = train_data$outcome_01)

xgb_class_model <- xgb.train(data = train_DM_class, booster = "gbtree",
                             objective = "binary:logistic",
                             eval_metric = "auc", nrounds = 70, eta = 0.1,
                             max_depth = 3, min_child_weight = 0.5,
                             subsample = 0.75, colsample_bytree = 0.5,
                             verbose = FALSE)

# For binary:logistic, predict() returns probabilities directly
probabilities <- predict(xgb_class_model, newdata = train_DM_class)
print(paste0("simple check: ", sum(probabilities)), quote = FALSE)
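One way to check whether the two trainings share a source of randomness is to look at R's RNG state around the first xgb.train() call. Here is a minimal diagnostic sketch, reusing the objects defined above; the .Random.seed comparison assumes xgboost draws its subsampling randomness from R's random stream:

# Diagnostic: does xgb.train() consume draws from R's RNG stream?
set.seed(93571)
state_before <- .Random.seed
tmp_model <- xgb.train(data = train_DM_reg, booster = "gbtree",
                       objective = "reg:squarederror",
                       nrounds = var_nrounds, eta = 0.07,
                       max_depth = 5, min_child_weight = 0.8,
                       subsample = 0.6, colsample_bytree = 1.0,
                       verbose = FALSE)
state_after <- .Random.seed
# FALSE here would mean training advanced the RNG state, so the second
# model starts from a state that depends on var_nrounds
print(identical(state_before, state_after))

If this prints FALSE, the two models are coupled through R's RNG state even though neither references the other.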
Here is the problem: the outcome of sum(probabilities) depends on the value of var_nrounds! How can that be? After all, var_nrounds enters only into xgb_reg_model, while the probabilities are computed with xgb_class_model, which does not (should not) know anything about the value of var_nrounds. The only thing I change in this code is the value of var_nrounds, and yet the sum of the probabilities changes when I rerun it. The change is also deterministic: with var_nrounds = 190 I always get (with my data) 5324.3, and with var_nrounds = 285 I get 5322.8. However, if I remove the line set.seed(93571), the result changes non-deterministically every time I rerun the code.
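For anyone who wants to poke at this without my_data.csv, here is a self-contained sketch with simulated data (hypothetical features and outcomes, invented for illustration); the exact sums will differ from the ones above, but if the effect is real, the printed sum should still change with var_nrounds:

# Simulated stand-in for my_data.csv (hypothetical data, illustration only)
library(xgboost)
set.seed(1)
n <- 1000
sim <- data.frame(feature_1 = rnorm(n),
                  feature_2 = rnorm(n),
                  feature_3 = rnorm(n))
sim$outcome_continuous <- sim$feature_1 + 0.5 * sim$feature_2 + rnorm(n)
sim$outcome_01 <- ifelse(sim$outcome_continuous > 0.0, 1, 0)

for (var_nrounds in c(190, 285)) {
  set.seed(93571)  # same seed each time, as in the script above
  dm_reg <- xgb.DMatrix(data = as.matrix(sim[, c("feature_1", "feature_2")]),
                        label = sim$outcome_continuous)
  reg_model <- xgb.train(data = dm_reg, objective = "reg:squarederror",
                         nrounds = var_nrounds, eta = 0.07, max_depth = 5,
                         subsample = 0.6, verbose = FALSE)
  dm_class <- xgb.DMatrix(data = as.matrix(sim[, c("feature_1", "feature_3")]),
                          label = sim$outcome_01)
  class_model <- xgb.train(data = dm_class, objective = "binary:logistic",
                           nrounds = 70, eta = 0.1, max_depth = 3,
                           subsample = 0.75, colsample_bytree = 0.5,
                           verbose = FALSE)
  probs <- predict(class_model, newdata = dm_class)
  print(paste0("nrounds = ", var_nrounds, ": sum = ", sum(probs)), quote = FALSE)
}

Only var_nrounds changes between the two iterations; everything downstream of the first model is identical.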
Could it be that XGBoost has some sort of built-in stochastic behaviour that changes depending on the number of rounds run beforehand in another model, and that is also controlled by setting a seed somewhere in the code before training? Any ideas?
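If that explanation is right, then re-seeding immediately before the second training should decouple it from var_nrounds. A small sketch to test this, reusing the objects from the script above (the seed value 12345 is arbitrary, not anything special):

# Pin the RNG state right before the second model, so whatever the first
# model consumed from the random stream no longer matters
set.seed(12345)  # arbitrary fixed value
xgb_class_model <- xgb.train(data = train_DM_class, booster = "gbtree",
                             objective = "binary:logistic",
                             eval_metric = "auc", nrounds = 70, eta = 0.1,
                             max_depth = 3, min_child_weight = 0.5,
                             subsample = 0.75, colsample_bytree = 0.5,
                             verbose = FALSE)
probabilities <- predict(xgb_class_model, newdata = train_DM_class)
# If the hypothesis holds, this sum should no longer vary with var_nrounds
print(paste0("simple check: ", sum(probabilities)), quote = FALSE)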