Tuesday, 11 September 2018

Splitting training data with an equal number of rows for each class

I have a very large dataset with 314554097 rows and 3 columns. The third column is the class, and there are two classes, 0 and 1. I need to split the data into training and test sets. To split the data I can use

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.75, random_state=0)

But the dataset is about 99 percent class 0 and only 1 percent class 1. In the training set I need an equal number of rows from each class, say 30000 rows of each. How can I do it?
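One possible way to get such a split, sketched here with plain NumPy rather than train_test_split, is to sample a fixed number of row indices from each class for training and treat every remaining row as test data. The helper name balanced_train_test_split, its parameters, and the small synthetic arrays are assumptions made up for illustration; X and y are assumed to be NumPy arrays already loaded in memory.

```python
import numpy as np

def balanced_train_test_split(X, y, n_per_class, seed=0):
    """Hypothetical helper: draw n_per_class rows of every class for training;
    all rows not drawn become the test set."""
    rng = np.random.default_rng(seed)
    picked = []
    for label in np.unique(y):
        label_idx = np.flatnonzero(y == label)  # row positions of this class
        if len(label_idx) < n_per_class:
            raise ValueError(f"class {label} has only {len(label_idx)} rows")
        # Sample without replacement so no row appears twice in training.
        picked.append(rng.choice(label_idx, size=n_per_class, replace=False))
    train_idx = np.concatenate(picked)
    rng.shuffle(train_idx)  # mix the classes instead of leaving them in blocks
    test_mask = np.ones(len(y), dtype=bool)
    test_mask[train_idx] = False  # everything not picked goes to the test set
    return X[train_idx], X[test_mask], y[train_idx], y[test_mask]

# Synthetic stand-in for the real data: 100000 rows, exactly 1 percent class 1.
demo_rng = np.random.default_rng(0)
y = np.zeros(100_000, dtype=int)
y[:1_000] = 1
demo_rng.shuffle(y)
X = demo_rng.normal(size=(100_000, 2))

X_train, X_test, y_train, y_test = balanced_train_test_split(X, y, n_per_class=300)
print(np.bincount(y_train))  # number of training rows in class 0 and class 1
```

With the real data the call would use n_per_class=30000. Note that the test set produced this way keeps roughly the original 99/1 imbalance, which may or may not be what is wanted for evaluation.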



