Thursday, 21 April 2016

Python scikit-learn - Weird behaviour using LabelShuffleSplit

Following the scikit-learn documentation for LabelShuffleSplit, I want to randomise my train/validation splits so that, across iterations, I end up training on all of the available data (e.g. for an ensemble).

According to the doc, I should see something like:

>>> from sklearn.cross_validation import LabelShuffleSplit

>>> labels = [1, 1, 2, 2, 3, 3, 4, 4]
>>> slo = LabelShuffleSplit(labels, n_iter=4, test_size=0.5, random_state=0)
>>> for train, test in slo:
...     print("%s %s" % (train, test))
...
[0 1 2 3] [4 5 6 7]
[2 3 6 7] [0 1 4 5]
[2 3 4 5] [0 1 6 7]
[4 5 6 7] [0 1 2 3]
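
(Side note, a sketch rather than anything from the original docs: in scikit-learn >= 0.18 the cross_validation module is deprecated and this splitter lives in sklearn.model_selection as GroupShuffleSplit, with the labels passed to split() as groups:)

>>> from sklearn.model_selection import GroupShuffleSplit
>>> groups = [1, 1, 2, 2, 3, 3, 4, 4]
>>> X = list(range(len(groups)))  # dummy data; split() only needs its length
>>> gss = GroupShuffleSplit(n_splits=4, test_size=0.5, random_state=0)
>>> for train, test in gss.split(X, groups=groups):
...     print("%s %s" % (train, test))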

So to test the behaviour, I also tried labels = [0, 0, 0, 0, 0, 0, 0, 0], which returned:
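
(For completeness, here is the call I believe produces that output, assuming the same n_iter=4, test_size=0.5 and random_state=0 as above.)

>>> labels = [0, 0, 0, 0, 0, 0, 0, 0]
>>> slo = LabelShuffleSplit(labels, n_iter=4, test_size=0.5, random_state=0)
>>> for train, test in slo:
...     print("%s %s" % (train, test))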

... 
[] [0 1 2 3 4 5 6 7]
[] [0 1 2 3 4 5 6 7]
[] [0 1 2 3 4 5 6 7]
[] [0 1 2 3 4 5 6 7]

I understand that in this case it doesn't really matter which indices are put into the train/validation sets, but I was hoping it would still be a 50:50 split?
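
Rereading the docstring, I think I can partly answer my own question: for LabelShuffleSplit, test_size and train_size refer to the unique labels, not to the samples. With only one unique label there is nothing to divide, so the whole label (all eight samples) lands on the test side. A minimal sketch of the arithmetic as I understand it (the rounding is my assumption, not a quote from the source):

>>> import numpy as np
>>> labels = [0, 0, 0, 0, 0, 0, 0, 0]
>>> n_unique = len(np.unique(labels))      # only 1 unique label
>>> n_test = int(np.ceil(0.5 * n_unique))  # ceil(0.5 * 1) -> 1 label in test
>>> n_train = n_unique - n_test            # -> 0 labels left for train
>>> print(n_unique, n_test, n_train)
1 1 0

On that reading, a 50:50 split over samples only falls out when the unique labels themselves divide evenly, e.g. labels = [0, 0, 0, 0, 1, 1, 1, 1].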
