Wednesday, 21 December 2016

How to create a balanced training and an unbalanced test data set in R?

I have a data set with 10,000 observations. My target variable has two classes, "Y" and "N". Below is the distribution of "Y" and "N":

> table(data$Target_Var)
Y    N 
2000 8000 

Now I want to create a balanced training data set that contains 50% (1000) of the "Y" rows. Since the training set is supposed to be balanced, it will also contain 1000 "N" rows, for a total of 2000 observations.

> table(Training$Target_Var)
Y    N 
1000 1000

The test data set will be unbalanced, with the same ratio of "Y" to "N" as in the population, i.e. Test will have 5000 observations: 1000 "Y" rows and 4000 "N" rows.

> table(Test$Target_Var)
Y    N 
1000 4000 

Now, I can write a function to do this, but is there a built-in R function that can do it? I explored the sampling functions of the caret and sampling packages, but could not find one that creates a BALANCED training data set. SMOTE does balance the classes, but it does so by creating new observations.
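
For reference, a minimal base R sketch of such a hand-written split. The helper name split_balanced() and its arguments are hypothetical (they do not come from any package), and the simulated data frame only stands in for the real data with the same 2000/8000 class counts. It also assumes the test rows should be drawn only from rows not used for training.

```r
# Hypothetical stand-in for the real data: 2000 "Y" rows and 8000 "N" rows.
set.seed(42)
data <- data.frame(
  x = rnorm(10000),
  Target_Var = factor(rep(c("Y", "N"), times = c(2000, 8000)),
                      levels = c("Y", "N"))
)

# split_balanced() is a hypothetical helper, not part of any package.
# It returns a balanced training set and a test set that keeps the
# population class ratio, with no row shared between the two.
split_balanced <- function(df, target, n_train_per_class, n_test) {
  # Row indices grouped by class, e.g. list(Y = ..., N = ...).
  idx_by_class <- split(seq_len(nrow(df)), df[[target]])

  # Balanced training set: the same number of rows from every class.
  train_idx <- unlist(lapply(idx_by_class, function(i) {
    i[sample.int(length(i), n_train_per_class)]
  }), use.names = FALSE)

  # Share of each class in the full data (0.2 for "Y", 0.8 for "N" here).
  class_share <- lengths(idx_by_class) / nrow(df)

  # Test set: sampled from the rows left over after training,
  # in proportion to the population class shares.
  test_idx <- unlist(lapply(names(idx_by_class), function(cl) {
    remaining <- setdiff(idx_by_class[[cl]], train_idx)
    n_cl <- round(n_test * class_share[[cl]])
    remaining[sample.int(length(remaining), n_cl)]
  }), use.names = FALSE)

  list(Training = df[train_idx, ], Test = df[test_idx, ])
}

parts <- split_balanced(data, "Target_Var",
                        n_train_per_class = 1000, n_test = 5000)
table(parts$Training$Target_Var)
table(parts$Test$Target_Var)
```

With the counts above, this gives 1000 "Y" and 1000 "N" rows in Training and 1000 "Y" and 4000 "N" rows in Test. Sampling is without replacement, so sample.int() would raise an error if a class had fewer rows left than requested.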



