I am working on a regression project using the Audi dataset from Kaggle.
I looked at other notebooks and saw that people use SelectKBest, so I tried the same thing. When splitting my data into train and test sets I used random_state = 42, and when I then ran SelectKBest I got a lot of warnings. This is my preprocessing and split:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

minmax_scaler = MinMaxScaler()
# cars_df_with_dummies -> this is a dataframe I created with pd.get_dummies(cars_df)
cars_scaled = minmax_scaler.fit_transform(cars_df_with_dummies)
cars_scaled = pd.DataFrame(cars_scaled, columns=cars_df_with_dummies.columns)
scaled_price_label = cars_scaled['price']
scaled_cars_without_price = cars_scaled.drop(['price'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(scaled_cars_without_price, scaled_price_label, test_size=0.2, random_state=42)
The code to select the best features (was taken from the top rated notebook):
column_names = cars_df_with_dummies.drop(columns = ['price']).columns
no_of_features = []
r_squared_train = []
r_squared_test = []
for k in range(3, 35, 2):  # k = 3, 5, ..., 33 (odd values only; the stop value 35 is excluded)
    selector = SelectKBest(f_regression, k=k)
    X_train_transformed = selector.fit_transform(X_train, y_train)
    X_test_transformed = selector.transform(X_test)
    regressor = LinearRegression()
    regressor.fit(X_train_transformed, y_train)
    no_of_features.append(k)
    r_squared_train.append(regressor.score(X_train_transformed, y_train))
    r_squared_test.append(regressor.score(X_test_transformed, y_test))
sns.lineplot(x = no_of_features, y = r_squared_train, legend = 'full')
sns.lineplot(x = no_of_features, y = r_squared_test, legend = 'full')
plt.show()
/anaconda3/lib/python3.8/site-packages/sklearn/feature_selection/_univariate_selection.py:302: RuntimeWarning: invalid value encountered in true_divide
corr /= X_norms
/anaconda3/lib/python3.8/site-packages/scipy/stats/_distn_infrastructure.py:1932: RuntimeWarning: invalid value encountered in less_equal
cond2 = cond0 & (x <= _a)
As soon as I changed to random_state=0 in the code above, it worked, but I have no idea why. I have read about the random_state parameter, and people say it doesn't matter which value is assigned to it.
From the Docs:
An integer
Use a new random number generator seeded by the given integer. Using an int will produce the same results across different calls. However, it may be worthwhile checking that your results are stable across a number of different distinct random seeds. Popular integer random seeds are 0 and 42.
Why did the random_state parameter affect SelectKBest?
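One possible explanation, offered as a sketch rather than a confirmed diagnosis: the warning points at the line corr /= X_norms, which suggests a division by a zero norm. That can happen when a column is constant within the training split, for example a rare dummy column whose few 1s all land in the test set for one seed but not for another. The toy example below uses only NumPy; the helper names (pearson_like_scores, constant_columns_in_train), the toy data, and the permutation-based split standing in for train_test_split are all assumptions made for illustration, not the scikit-learn implementation.

```python
import numpy as np

def pearson_like_scores(X, y):
    """Per-column correlation with y: center, take dot products, divide by norms."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    x_norms = np.sqrt((Xc ** 2).sum(axis=0))
    # A column that is constant has norm 0 after centering, so this is 0/0 -> NaN.
    # The warnings are silenced here only to keep the demo output clean.
    with np.errstate(divide="ignore", invalid="ignore"):
        return (yc @ Xc) / (x_norms * np.linalg.norm(yc))

def constant_columns_in_train(X, test_size, seed):
    """Indices of columns that are constant in the training part of a random split.

    Hypothetical stand-in for train_test_split: shuffle rows, hold out the first part.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(test_size * len(X)))
    train = X[order[n_test:]]
    return np.flatnonzero(train.max(axis=0) == train.min(axis=0))

# Toy data: column 1 mimics a rare dummy that is 1 in a single row only.
rng = np.random.default_rng(0)
X = np.column_stack([rng.random(10), np.zeros(10)])
X[3, 1] = 1.0
y = 2.0 * X[:, 0] + 0.1 * rng.random(10)

# With the rare row present, both columns get a finite score.
scores_all = pearson_like_scores(X, y)
# With the rare row removed (as if it fell into the test set), column 1 is all zeros.
scores_without_row3 = pearson_like_scores(np.delete(X, 3, axis=0), np.delete(y, 3))

print("all rows:     ", scores_all)
print("without row 3:", scores_without_row3)

# Whether column 1 is constant in the training part depends on the seed.
for seed in range(5):
    print(seed, constant_columns_in_train(X, test_size=0.2, seed=seed))
```

If this is what happens on the Audi data, checking X_train.nunique() for columns with a single unique value under random_state=42 should reveal the offending dummy columns. Dropping such columns before feature selection (scikit-learn's VarianceThreshold is one option) would then make the result independent of the seed.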