Thursday, 30 April 2015

Split data based on the outcome and on the predictor date

In a simple classification problem we can split the data with the caret package, for example:

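library(caret)
set.seed(3456)  # make the random split reproducible
# Draw 80% of the row indices, stratified by Species
trainIndex <- createDataPartition(iris$Species, p = .8,
                                  list = FALSE)
# Selected rows form the training set, the rest the test set
irisTrain <- iris[ trainIndex,]
irisTest  <- iris[-trainIndex,]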

This approach preserves the overall class distribution of the data. But what if we want to split on a second factor as well? Say we have one year of data from four neighbouring sites and a two-class outcome. The split we want keeps all sites for a given date together in the same set and, at the same time, tries to preserve the overall class distribution of the data. For example:

The training set will be:
Class A, site a, January 25
Class A, site b, January 25
Class B, site c, January 25
Class A, site d, January 25
Class B, site c, January 27
Class A, site d, January 27
....

The testing set will be:
Class B, site a, January 26
Class A, site b, January 26
Class B, site c, January 26
Class A, site d, January 26
Class A, site c, January 28
Class A, site d, January 28
....

I'm looking for a solution to this problem: how can the data be split based not only on the outcome, but also on a predictor (the date)?
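One possible direction, given only as a rough sketch and not as a caret feature: treat each date as the unit to be sampled, describe every date by its share of class A, and draw about 80% of the dates within each group of dates that have the same share. Everything below is an assumption made for illustration: the data frame `dat`, its columns `site`, `date`, `class` and `day`, and the randomly generated outcome are hypothetical, and the code uses base R only.

```r
set.seed(3456)

# Hypothetical data: four sites observed on 60 consecutive days,
# with a randomly generated two-class outcome.
dates <- seq(as.Date("2015-01-01"), by = "day", length.out = 60)
dat <- data.frame(
  site = rep(c("a", "b", "c", "d"), times = length(dates)),
  date = rep(dates, each = 4)
)
dat$class <- factor(sample(c("A", "B"), nrow(dat), replace = TRUE,
                           prob = c(0.6, 0.4)))

# Use a character key for the date so that names and lookups match exactly.
dat$day <- format(dat$date)

# Share of class A on each date; dates with the same share form one stratum.
prop_a <- tapply(dat$class == "A", dat$day, mean)
strata <- split(names(prop_a), as.vector(prop_a))

# Within each stratum, draw about 80% of the dates for training.
train_days <- unlist(lapply(strata, function(d) {
  d[sample.int(length(d), size = round(0.8 * length(d)))]
}), use.names = FALSE)

# Whole dates go to one set, so all four sites of a day stay together.
in_train <- dat$day %in% train_days
train <- dat[in_train, ]
test  <- dat[!in_train, ]

# Compare the class distribution of the two sets.
print(prop.table(table(train$class)))
print(prop.table(table(test$class)))
```

Because whole dates are sampled, the class proportions in the two sets will only match approximately, and strata that contain just one or two dates end up entirely in the training set. With real data, the random draw could be repeated with different seeds and the split with the closest class proportions kept.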



