Thursday, September 5, 2019

Random subsets of a dataset

I would like to compare the classification performance (accuracy) of different classifiers (e.g. CNN, SVM, ...) depending on the size of the training data set. Given is a dataset of images (e.g. MNIST), from which 80% of the images are drawn at random while preserving the class balance. From this subset, the next smaller subset is then drawn in the same way, again keeping 80% of the images. This is repeated until a small training set of about 1000 images is reached. Each classifier should then be trained on each of these subsets.

The aim is to be able to make a statement such as: from a training set size of 5000 images onwards, classifier A is significantly better than classifier B.

from sklearn.model_selection import train_test_split

# Repeatedly keep the stratified 80% "train" part of the previous subset;
# the discarded 20% "test" part is not reused.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2, stratify=y)
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_train, y_train, random_state=0, test_size=0.2, stratify=y_train)
X_train_3, X_test_3, y_train_3, y_test_3 = train_test_split(X_train_2, y_train_2, random_state=0, test_size=0.2, stratify=y_train_2)
# ... repeat until the subset has about 1000 images
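The same procedure can also be written as a loop, so the repeated splits do not have to be spelled out by hand. This is only a sketch, assuming X_train and y_train come from the first split above and that subsets down to about 1000 images are wanted:

from sklearn.model_selection import train_test_split

subsets = [(X_train, y_train)]
X_sub, y_sub = X_train, y_train
while len(y_sub) * 0.8 >= 1000:
    # keep a stratified, random 80% of the previous subset; the other 20% is dropped
    X_sub, _, y_sub, _ = train_test_split(
        X_sub, y_sub, train_size=0.8, stratify=y_sub, random_state=0)
    subsets.append((X_sub, y_sub))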

My problem is that I am not sure whether this really is random sampling when I use the above code. Would it be better to draw the subsets manually, e.g. using numpy.random.randint?
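For comparison, this is roughly what a manual, class-balanced draw with numpy could look like. It is only a sketch: stratified_subset is a hypothetical helper name, and sampling the indices of each class without replacement (rather than numpy.random.randint, which samples with replacement) is my assumption about what is intended:

import numpy as np

def stratified_subset(X, y, fraction=0.8, seed=0):
    # keep a random `fraction` of each class, sampled without replacement
    rng = np.random.default_rng(seed)
    keep = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)              # indices of this class
        rng.shuffle(idx)                              # random order within the class
        keep.append(idx[:int(len(idx) * fraction)])   # first 80% of the shuffled indices
    keep = np.concatenate(keep)
    return X[keep], y[keep]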

I would be very grateful for any help.



