Wednesday, 12 October 2022

Non-reproducible results in repeated nested CV

My results lack reproducibility, and I don't understand why.

Context:

Using sklearn, I am training a pipeline

[transform, dimredux, model]

where transform is non-stochastic.

I am using an inner CV for model selection and a repeated outer CV for performance estimation.

I use GridSearchCV to search over a few dimension reduction / feature selection methods, a few models, and a few hyper-parameter values for those models. This is the inner CV of the nested CV.

This GridSearchCV is nested within an outer CV, which is repeated several times to estimate generalization performance.
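For illustration, a minimal sketch of the setup described above. It is not the actual code from this question: the synthetic data from make_classification, StandardScaler as the non-stochastic transform, the integer SEED, the grid values, and the helper name run_nested_cv are all assumptions made for the example.

```python
from functools import partial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    GridSearchCV,
    RepeatedStratifiedKFold,
    StratifiedKFold,
    cross_val_score,
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

SEED = 42  # assumption: a plain integer seed

# Placeholder data; the real data set is not shown in the question.
X, y = make_classification(
    n_samples=120, n_features=20, n_informative=5, random_state=SEED
)

# [transform, dimredux, model]; the steps are swapped out by the grid below.
pipe = Pipeline([
    ("transform", StandardScaler()),
    ("dimredux", PCA(random_state=SEED)),
    ("model", LogisticRegression(random_state=SEED)),
])

# Search over dimension reduction methods, models, and hyper-parameters.
param_grid = [
    {
        "dimredux": [PCA(n_components=5, random_state=SEED)],
        "model": [LogisticRegression(max_iter=1000, random_state=SEED)],
        "model__C": [0.1, 1.0],
    },
    {
        # mutual_info_classif takes random_state itself, so bind it here.
        "dimredux": [
            SelectKBest(
                score_func=partial(mutual_info_classif, random_state=SEED), k=5
            )
        ],
        "model": [SVC(random_state=SEED)],
        "model__C": [0.1, 1.0],
    },
]

# Inner CV: model selection. Outer CV: repeated performance estimation.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=SEED)
outer_cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=2, random_state=SEED)

search = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring="accuracy")


def run_nested_cv():
    """Run one complete repeated nested CV; returns one score per outer fold."""
    return cross_val_score(search, X, y, cv=outer_cv, scoring="accuracy")


scores_a = run_nested_cv()
scores_b = run_nested_cv()
print(scores_a.mean(), np.allclose(scores_a, scores_b))
```

Running run_nested_cv twice and comparing the per-fold scores is one way to test reproducibility across complete repetitions.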

The results need to be reproducible, so I provide a value for every random_state argument I can find. Yet my results still vary.

Overview of random_state arguments:

  • As I use the full data set in a repeated nested CV context, no train/test split is done. So no randomness there.

  • dim redux: PCA, SelectKBest(score_func=mutual_info_classif), and RFE(estimator=SVC()) each take a random_state argument (for the latter two, through mutual_info_classif and SVC), which I set to np.random.seed(42). I also use VarianceThreshold, which has no such argument.

  • models: LogisticRegression, RandomForestClassifier, SVC, and GradientBoostingClassifier all have a random_state argument, which I set to np.random.seed(42)

  • (inner) GridSearchCV: for the cv argument, I provide a StratifiedKFold instance with np.random.seed(42) as the value of its random_state argument

  • repeated outer CV: a RepeatedStratifiedKFold instance is used, again with the same value for the random_state argument. Upon checking, the indices are different across splits and repeats and are reproducible.
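The check on the outer splits mentioned in the last bullet could be sketched as follows. The placeholder labels, the integer seed 42, the split counts, and the helper name outer_test_indices are assumptions for the example, not taken from the original code.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Placeholder labels; the splitter only needs y for stratification.
y = np.array([0, 1] * 20)
X = np.zeros((len(y), 1))

N_SPLITS = 4
N_REPEATS = 3


def outer_test_indices(random_state):
    """Collect the test indices of every outer fold, in order."""
    cv = RepeatedStratifiedKFold(
        n_splits=N_SPLITS, n_repeats=N_REPEATS, random_state=random_state
    )
    return [test.tolist() for _, test in cv.split(X, y)]


first = outer_test_indices(42)
second = outer_test_indices(42)

# Same integer seed: the full sequence of folds should match between runs.
same_across_runs = first == second

# Within one run, the folds of repeat 1 should differ from those of repeat 2.
repeats_differ = first[:N_SPLITS] != first[N_SPLITS:2 * N_SPLITS]

print(same_across_runs, repeats_differ)
```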

I first used large numbers as values for random_state, then integers, and now np.random.seed(42). With every setup, however, I am unable to reproduce my results across complete repetitions of the repeated nested CV.
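One thing that can be checked about the last setup is what value an estimator actually receives. np.random.seed(42) reseeds NumPy's global generator in place and returns None, so passing its result as random_state is the same as passing random_state=None. With None, scikit-learn draws from the global generator, whose state advances with every use. A minimal sketch of that check, with placeholder labels and a hypothetical helper fold_indices:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# np.random.seed seeds the global RNG as a side effect and returns None.
returned = np.random.seed(42)

# The estimator therefore stores None, not 42.
clf = RandomForestClassifier(random_state=np.random.seed(42))
stored = clf.random_state

# Placeholder labels for the split comparison.
y = np.array([0, 1] * 50)
X = np.zeros((len(y), 1))


def fold_indices(random_state):
    """Test indices of a shuffled 5-fold split for the given random_state."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)
    return [test.tolist() for _, test in cv.split(X, y)]


np.random.seed(42)
# random_state=None: each call consumes the global RNG, so two calls differ.
none_repeatable = fold_indices(None) == fold_indices(None)
# Integer random_state: each call starts from the same seed.
int_repeatable = fold_indices(42) == fold_indices(42)

print(returned, stored, none_repeatable, int_repeatable)
```

If the complete nested CV is repeated within one session, every component created with random_state=None would continue from wherever the global generator currently stands, which may be worth ruling out as a source of the variation.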

Versions:

  • Python 3.8.13
  • sklearn 1.0.2
  • numpy 1.21.5


