Tuesday, April 26, 2016

Scikit-learn SVC always giving accuracy 0 on random data cross validation

In the following code I create a random data set of 50 samples with 20 features each. I then generate a random target vector made up of half True and half False values.

All of the values are stored in Pandas objects, since this simulates a real scenario in which the data will arrive in that form.

I then perform a manual leave-one-out inside a loop: each time I select an index, drop its respective data, fit a default SVC on the rest of the data, and finally run a prediction on the left-out sample.

import random
import numpy as np
import pandas as pd
from sklearn.svm import SVC

n_samp = 50
m_features = 20

X_val = np.random.rand(n_samp, m_features)
X = pd.DataFrame(X_val, index=range(n_samp))
# print(X_val)

y_val = [True] * (n_samp // 2) + [False] * (n_samp // 2)
random.shuffle(y_val)
y = pd.Series(y_val, index=range(n_samp))
# print(y_val)

success_count = 0
for idx in y.index:
    clf = SVC()  # Can be inside or outside loop. Result is the same.

    # Leave-one-out for the fitting phase
    loo_X = X.drop(idx)
    loo_y = y.drop(idx)
    clf.fit(loo_X.values, loo_y.values)

    # Make a prediction on the sample that was left out
    pred_X = X.loc[idx:idx]
    pred_result = clf.predict(pred_X.values)
    print(y.loc[idx], pred_result[0])  # Actual value vs. predicted value - always opposite!
    is_success = y.loc[idx] == pred_result[0]
    success_count += 1 if is_success else 0

print('\nSuccess count:', success_count)  # Almost always 0!

Now here's the strange part: I expect to get an accuracy of about 50%, since this is random data, but instead I almost always get exactly 0! I say almost always because roughly one run in ten of this exact code yields a few correct hits.
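The same check can be written more compactly with scikit-learn's built-in leave-one-out cross-validation. A minimal sketch, assuming scikit-learn 0.18+ (where these helpers live in sklearn.model_selection; older versions keep them in sklearn.cross_validation) and reusing the X and y defined above:

from sklearn.model_selection import LeaveOneOut, cross_val_score

# One fit per sample; cross_val_score returns one accuracy (0 or 1) per left-out sample
scores = cross_val_score(SVC(), X.values, y.values, cv=LeaveOneOut())
print('Mean leave-one-out accuracy:', scores.mean())

This is the same procedure as the loop above, just without the manual drop/loc bookkeeping.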

What's really crazy to me is that if I take the opposite of every predicted answer, I get 100% accuracy. On random data!
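To make the "opposite" observation concrete, here is a minimal sketch (reusing X, y, np and SVC from the code above) that collects all the leave-one-out predictions and scores both the raw and the inverted predictions against the true labels:

# Collect every leave-one-out prediction, then score raw vs. inverted predictions
preds = []
for idx in y.index:
    clf = SVC()
    clf.fit(X.drop(idx).values, y.drop(idx).values)
    preds.append(clf.predict(X.loc[idx:idx].values)[0])

preds = np.array(preds)
print('Accuracy:         ', np.mean(preds == y.values))   # almost always 0.0
print('Inverted accuracy:', np.mean(~preds == y.values))  # almost always 1.0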

What am I missing here?



