Following the scikit-learn documentation for LabelShuffleSplit, I wish to randomise my train/validation batches to ensure I'm training on all possible data (e.g. for an ensemble).
According to the doc, I should see something like:
>>> from sklearn.cross_validation import LabelShuffleSplit
>>> labels = [1, 1, 2, 2, 3, 3, 4, 4]
>>> slo = LabelShuffleSplit(labels, n_iter=4, test_size=0.5, random_state=0)
>>> for train, test in slo:
...     print("%s %s" % (train, test))
...
[0 1 2 3] [4 5 6 7]
[2 3 6 7] [0 1 4 5]
[2 3 4 5] [0 1 6 7]
[4 5 6 7] [0 1 2 3]
So to test the behaviour, I also tried using labels = [0, 0, 0, 0, 0, 0, 0, 0]
which returned:
...
[] [0 1 2 3 4 5 6 7]
[] [0 1 2 3 4 5 6 7]
[] [0 1 2 3 4 5 6 7]
[] [0 1 2 3 4 5 6 7]
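For reference, this is roughly the snippet that produced the output above (a sketch, assuming the same n_iter, test_size and random_state as in the documentation example):
>>> from sklearn.cross_validation import LabelShuffleSplit
>>> labels = [0, 0, 0, 0, 0, 0, 0, 0]   # every sample carries the same label
>>> slo = LabelShuffleSplit(labels, n_iter=4, test_size=0.5, random_state=0)
>>> for train, test in slo:
...     print("%s %s" % (train, test))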
I understand that in this case it doesn't really matter which indices end up in the train/validation sets, but I was hoping it would still produce a 50%/50% split. Why does the train set come back empty?
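For comparison, a plain ShuffleSplit over the eight sample indices (ignoring the labels entirely) does behave the way I expected, producing four train and four test indices per iteration. This is only a sketch using the same sklearn.cross_validation module, with parameters mirroring the example above:
>>> from sklearn.cross_validation import ShuffleSplit
>>> ss = ShuffleSplit(8, n_iter=4, test_size=0.5, random_state=0)
>>> for train, test in ss:
...     print("%s %s" % (train, test))   # each iteration yields a 4/4 split of the indices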