Friday, 4 November 2016

R: Cross-validation with low-sample categorical variables

I want to perform cross-validation in R.

My data has a categorical variable with 80 levels, but some of these levels have only a few observations (often 10 or fewer). I want to be sure that every training and test set contains enough samples from these sparsely represented levels for the analysis to run correctly.

However, data are usually split at random in cross-validation, so I'm concerned that low-sample categories won't be well represented in the training and test sets.

Is there a way in R to split my data in a way that ensures that low-frequency levels of a given categorical variable are well distributed between training and test sets?


Context: I have ~90000 repeated-measures tree growth samples which represent ~80 levels of a categorical variable (species).
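
For illustration, here is a minimal base-R sketch of one way to get such a split: assign fold numbers separately within each level, so that each level's rows are dealt out as evenly as its count allows. The data frame `dat`, its columns, the helper `assign_folds`, and the choice of `k = 5` are all hypothetical stand-ins rather than anything from the real data set, and the sketch ignores the repeated-measures structure (it stratifies on species only).

```r
# Hypothetical stand-in data: 'species' plays the role of the 80-level factor.
set.seed(42)
dat <- data.frame(
  species = factor(c(rep("common_a", 60), rep("common_b", 30), rep("rare_c", 10))),
  growth  = rnorm(100)
)

k <- 5  # number of folds (an assumption; pick what suits the data)

# Hypothetical helper: return n fold labels drawn from 1..k, balanced as
# evenly as possible. The order of the k labels is shuffled first, so a
# level with fewer than k rows does not always land in the low-numbered folds.
assign_folds <- function(n, k) {
  labels <- sample.int(k)[rep_len(seq_len(k), n)]
  labels[sample.int(n)]  # shuffle which row gets which label
}

# Apply the helper within each species; ave() returns a vector aligned with dat.
dat$fold <- ave(seq_len(nrow(dat)), dat$species,
                FUN = function(idx) assign_folds(length(idx), k))

# Rows per species per fold: each row of this table should be near-constant.
fold_counts <- table(dat$species, dat$fold)
print(fold_counts)

# Training and test sets for the first fold.
test  <- dat[dat$fold == 1, ]
train <- dat[dat$fold != 1, ]
```

With this scheme a level that has fewer than `k` rows will necessarily be missing from some test folds, so very rare levels may still call for a smaller `k` or for pooling. If an existing package is preferred, `createFolds()` in the caret package also samples within the levels of a factor outcome, which may be enough when the stratifying variable can be passed as `y`.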



