Thursday, 28 October 2021

sklearn RandomForestClassifier.fit() not reproducible despite set random state and same input

While tuning a random forest model with scikit-learn, I noticed that its accuracy score differed between runs, even though I used the same RandomForestClassifier instance and the same input data. I searched Google and Stack Exchange, but the only vaguely similar case I could find is this post. There, the problem was that the classifier was instantiated without a proper random state, which is not the case here.

I'm using the following code:

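import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# max_features takes 'sqrt'; max_depth must be an integer (or None)
clf = RandomForestClassifier(n_estimators=65, max_features='sqrt', max_depth=9, random_state=np.random.RandomState(123))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=np.random.RandomState(159))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)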

X and y are my data and the corresponding labels, but I found that the choice of dataset doesn't affect the problem. When I run the train_test_split line I get the same split every time, so there is no randomness there. Running predict() on the same fitted model also gives the same results every time, which suggests my problem is different from the one in the post I linked to above. However, every time I rerun fit(), predict() gives a different prediction! This happens even when I don't touch X_train and y_train. So just running these two lines

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

gives a different result every time. As far as I can tell from the documentation, .fit() is not supposed to do anything random. Without reproducible output it is impossible to tune the model, so I'm pretty sure there is an error somewhere. What am I missing? Has anyone encountered this before, or does anyone have any clue as to why this is happening?
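One thing that may be worth checking is the difference between passing an integer and passing a np.random.RandomState instance as random_state. The scikit-learn documentation notes that an instance is consumed by each call to fit(), so repeated fits on the same estimator can draw different random numbers, while an integer seed is re-applied on every fit. The sketch below compares the two cases on synthetic data; make_classification, the small forest size and the helper fit_twice are assumptions chosen for illustration and are not part of the original setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real data (assumption: the original X and y are not shown).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)


def fit_twice(random_state):
    """Fit one estimator twice on the same data; return the probabilities from each fit."""
    clf = RandomForestClassifier(n_estimators=10, random_state=random_state)
    first = clf.fit(X, y).predict_proba(X)
    second = clf.fit(X, y).predict_proba(X)
    return first, second


# Integer seed: every fit() call starts from a fresh generator built from the seed.
int_first, int_second = fit_twice(123)
int_seed_reproducible = np.array_equal(int_first, int_second)

# RandomState instance: the same generator object is reused, so its state keeps advancing.
rs_first, rs_second = fit_twice(np.random.RandomState(123))
instance_reproducible = np.array_equal(rs_first, rs_second)

print("int seed, identical fits:", int_seed_reproducible)
print("RandomState instance, identical fits:", instance_reproducible)
```

If this is indeed the cause, replacing random_state=np.random.RandomState(123) with random_state=123 in the classifier should make repeated calls to fit() on the same data produce the same model.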



