Friday, February 26, 2021

SelectKBest with `f_regression` behaves strangely when I change the random_state parameter of train_test_split

I am working on a regression project using the Audi dataset from Kaggle.

I looked at other notebooks and saw that people use SelectKBest. I tried the same thing, but when I split my data into train and test sets with random_state=42 and then ran SelectKBest, I got a lot of warnings. This is how I scale and split the data:

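import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

minmax_scaler = MinMaxScaler()

# cars_df_with_dummies is a DataFrame I created with pd.get_dummies(cars_df)
cars_scaled = minmax_scaler.fit_transform(cars_df_with_dummies)
cars_scaled = pd.DataFrame(cars_scaled, columns=cars_df_with_dummies.columns)

scaled_price_label = cars_scaled['price']
scaled_cars_without_price = cars_scaled.drop(['price'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    scaled_cars_without_price, scaled_price_label, test_size=0.2, random_state=42
)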

The code to select the best features (taken from the top-rated notebook):

column_names = cars_df_with_dummies.drop(columns = ['price']).columns

no_of_features = []
r_squared_train = []
r_squared_test = []


for k in range(3, 35, 2):  # k = 3, 5, ..., 33 (odd values only; 35 itself is excluded)
    selector = SelectKBest(f_regression, k=k)
    X_train_transformed = selector.fit_transform(X_train, y_train)
    X_test_transformed = selector.transform(X_test)
    regressor = LinearRegression()
    regressor.fit(X_train_transformed, y_train)
    no_of_features.append(k)
    r_squared_train.append(regressor.score(X_train_transformed, y_train))
    r_squared_test.append(regressor.score(X_test_transformed, y_test))
    
sns.lineplot(x = no_of_features, y = r_squared_train, legend = 'full')
sns.lineplot(x = no_of_features, y = r_squared_test, legend = 'full')
plt.show()

Running this loop prints the following warnings:

/anaconda3/lib/python3.8/site-packages/sklearn/feature_selection/_univariate_selection.py:302: RuntimeWarning: invalid value encountered in true_divide
  corr /= X_norms
/anaconda3/lib/python3.8/site-packages/scipy/stats/_distn_infrastructure.py:1932: RuntimeWarning: invalid value encountered in less_equal
  cond2 = cond0 & (x <= _a)
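
The first warning points at the line `corr /= X_norms`, a division by the per-column norms of the centered features. One way such a division can produce an invalid value is a column that is constant in the training data: after centering, its norm is zero and the result is 0/0. The sketch below is a simplified NumPy re-implementation of a per-column Pearson correlation, not scikit-learn's actual code; `pearson_per_column`, `X_demo` and `y_demo` are hypothetical names and made-up values used only for illustration.

```python
import numpy as np


def pearson_per_column(X, y):
    """Pearson correlation of each column of X with y (simplified sketch)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    X_centered = X - X.mean(axis=0)
    y_centered = y - y.mean()
    x_norms = np.sqrt((X_centered ** 2).sum(axis=0))
    y_norm = np.sqrt((y_centered ** 2).sum())
    # Silence the RuntimeWarning here so the NaN can be inspected directly.
    with np.errstate(divide="ignore", invalid="ignore"):
        corr = (X_centered.T @ y_centered) / (x_norms * y_norm)
    return corr


# Made-up data: the second column is constant, like a dummy variable
# whose category never appears in the training rows.
X_demo = np.array([[0.10, 0.0], [0.40, 0.0], [0.35, 0.0], [0.80, 0.0]])
y_demo = np.array([1.0, 2.0, 1.5, 3.0])

corr_demo = pearson_per_column(X_demo, y_demo)
print(corr_demo)  # first entry is finite, second entry is nan
```

Whether a constant column is really what triggers the warning on the Audi data is an assumption that would need to be checked against the actual training split.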

As soon as I changed to random_state=0 in the code above, it worked, but I have no idea why. I have read about the random_state parameter, and people say it doesn't matter which value is assigned to it.

From the docs:

An integer

Use a new random number generator seeded by the given integer. Using an int will produce the same results across different calls. However, it may be worthwhile checking that your results are stable across a number of different distinct random seeds. Popular integer random seeds are 0 and 42.

Why did the random_state parameter affect SelectKBest?
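
A possible explanation, offered as an assumption rather than a confirmed diagnosis: with `pd.get_dummies`, a rare category can be represented by very few rows, and for some seeds all of those rows may land in the test set. The corresponding column is then constant in `X_train`, which would be consistent with the warnings. A minimal sketch for checking this follows; `constant_columns`, `X_train_demo`, `mileage` and `model_rare` are hypothetical names with made-up values, standing in for the real training split.

```python
import pandas as pd


def constant_columns(frame):
    """Return the names of columns that hold a single distinct value."""
    return [col for col in frame.columns if frame[col].nunique(dropna=False) <= 1]


# Hypothetical training split: 'model_rare' stands in for a dummy column
# whose only non-zero row ended up in the test set.
X_train_demo = pd.DataFrame({
    "mileage": [0.10, 0.40, 0.35, 0.80],
    "model_rare": [0.0, 0.0, 0.0, 0.0],
})

bad_columns = constant_columns(X_train_demo)
print(bad_columns)

# Drop the constant columns before feature selection.
X_train_clean = X_train_demo.drop(columns=bad_columns)
```

Running the same check on the real `X_train` for random_state=42 and random_state=0 would show whether the two splits differ in this respect. If constant columns are dropped from `X_train`, the same columns should be dropped from `X_test` as well.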



