I'm using a random forest (RF) for descriptive modelling as follows:
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import class_weight
# Recent scikit-learn versions require classes and y to be passed as keywords
class_weights = class_weight.compute_class_weight('balanced', classes=np.unique(y), y=y)
class_weights = dict(enumerate(class_weights))
class_weights
{0: 0.5561096747856852, 1: 4.955559597429368}
clf = RandomForestClassifier(class_weight=class_weights, random_state=0)
clf = clf.fit(X, y)
cross_val_score(clf, X, y, cv=10, scoring='f1').mean()
And I plot the variable importances with:
import matplotlib.pyplot as plt
def plot_importances(clf, features, n):
    importances = clf.feature_importances_
    indices = np.argsort(importances)[::-1]
    if n:
        indices = indices[:n]
    plt.figure(figsize=(10, 5))
    plt.title("Feature importances")
    plt.bar(range(len(indices)), importances[indices], align='center')
    plt.xticks(range(len(indices)), features[indices], rotation=90)
    plt.xlim([-1, len(indices)])
    plt.show()
    return features[indices]
imp = plot_importances(clf, X.columns, 30)
I was expecting the variable importances to be the same across multiple runs. However, they change whenever I re-run the notebook.
I don't understand why that is. Is it somehow related to the cross_val_score method?
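To narrow this down, here is a minimal sketch of a check that could be run in isolation. It uses synthetic data from make_classification as a stand-in for the real X and y, and the names X_demo, y_demo, clf_a and clf_b are hypothetical. It tests two things: whether calling cross_val_score alters the importances of an already fitted forest, and whether two forests fitted on identical data with the same random_state agree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in for the real X and y (an assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=0
)

# First forest: fit, then record its importances.
clf_a = RandomForestClassifier(
    n_estimators=20, class_weight="balanced", random_state=0
).fit(X_demo, y_demo)
before = clf_a.feature_importances_.copy()

# cross_val_score fits clones of the estimator on each fold,
# so clf_a itself should not be refitted by this call.
scores = cross_val_score(clf_a, X_demo, y_demo, cv=5, scoring="f1")
unchanged_by_cv = np.allclose(before, clf_a.feature_importances_)

# Second forest: same data, same seed, fitted independently.
clf_b = RandomForestClassifier(
    n_estimators=20, class_weight="balanced", random_state=0
).fit(X_demo, y_demo)
same_across_fits = np.allclose(before, clf_b.feature_importances_)

print("unchanged by cross_val_score:", unchanged_by_cv)
print("same across fits with equal seed:", same_across_fits)
```

If both checks pass on the real data as well, the variation probably comes from something other than the forest or cross_val_score, for example X or y differing between notebook runs because of an unseeded sampling or shuffling step earlier on. That is only a guess about my notebook, not something this check confirms.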