Tuesday, May 28, 2019

Function that computes the ROC AUC score of a feature after randomly shuffling the training targets does not behave randomly

I am trying to write a function that returns the average ROC AUC score of 10 logistic regression classifiers, one feature at a time, where each classifier is trained on a different random shuffling of the training target data. The purpose is to compare this average against the ROC score obtained without shuffling. However, the individual ROC scores I get look very strange and not random at all.

I have tried using np.random.shuffle instead of the pandas sample method and got the same result.

import numpy as np
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

def shuffled_roc(df, feature):
    # Shuffle the rows once (fixed seed) and keep only rows where the feature is finite.
    df = df.sample(frac=1, random_state=0)
    mask = np.isfinite(df[feature])
    x = df.loc[mask, feature].copy()
    y = df.loc[mask, 'target'].copy()

    # 80/20 train/test split.
    split = int(0.8 * len(x))
    x_train, x_test = x.iloc[:split], x.iloc[split:]
    y_train, y_test = y.iloc[:split], y.iloc[split:]

    rocs = []
    for i in range(10):
        # Permute the training labels so they no longer line up with x_train.
        y_train_shuffled = y_train.sample(frac=1).reset_index(drop=True)
        lr = LogisticRegression(solver='lbfgs').fit(x_train.values.reshape(-1, 1), y_train_shuffled)

        # Score on the untouched test labels.
        roc = metrics.roc_auc_score(y_test, lr.predict_proba(x_test.values.reshape(-1, 1))[:, 1])
        rocs.append(roc)
    print(rocs)
    return np.mean(rocs)

shuffled_roc(df_accident, 'target_suspension_count')

I expect 10 different values for the 10 ROC scores, but instead I get only two distinct values:

[0.7572317596566523, 0.24276824034334765, 0.24276824034334765, 0.7572317596566523, 0.7572317596566523, 0.7572317596566523, 0.24276824034334765, 0.7572317596566523, 0.7572317596566523, 0.24276824034334765]
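The two values in this output sum to 1, which suggests one possible reading (a sketch, not a confirmed diagnosis of df_accident): with a single feature, the probability predicted by a logistic regression is a monotonic function of x, and ROC AUC depends only on how the test scores are ranked. Whatever the shuffled labels do to the fit, the test ranking can then only follow x (positive coefficient) or reverse it (negative coefficient), giving a or 1 - a. The pure-Python illustration below uses synthetic data and hand-picked coefficients as stand-ins for the refitted models; the helper names roc_auc and predict_proba, the data, and the coefficient list are all assumptions made for this example.

```python
import math
import random

def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def predict_proba(x, coef, intercept):
    """Probability of class 1 from a one-feature logistic model."""
    return [1.0 / (1.0 + math.exp(-(coef * xi + intercept))) for xi in x]

# Hypothetical test set: an integer count feature loosely related to the target.
rng = random.Random(0)
x_test = [rng.randint(0, 5) for _ in range(200)]
y_test = [1 if xi + rng.gauss(0, 2) > 2.5 else 0 for xi in x_test]

# Stand-ins for models fitted on shuffled labels: arbitrary coefficients
# of either sign, each with an arbitrary intercept.
models = [(0.03, -1.2), (-0.01, 0.4), (0.2, 0.0), (-0.5, 2.0), (1.5, -3.0)]
aucs = [roc_auc(y_test, predict_proba(x_test, c, b)) for c, b in models]

# Only the sign of the coefficient matters, so at most two distinct values appear.
distinct = sorted({round(a, 12) for a in aucs})
print(distinct)
```

Under that reading, the shuffling itself is random; what is constrained is the metric, since only the sign of the fitted coefficient can change the ranking of the test set.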



