I want to perform cross validation in R.
My data has a categorical variable with 80 levels, but some of these levels have only a few observations (often 10 or fewer). I want to be sure that my training and test sets each contain enough samples from these sparsely represented levels for the model to run correctly.
However, the splitting of data in cross validation is usually random, so I'm concerned that low-sample categories won't be well represented in the training and test sets.
Is there a way in R to split my data in a way that ensures that low-frequency levels of a given categorical variable are well distributed between training and test sets?
Context: I have ~90,000 repeated-measures tree growth samples which represent ~80 levels of a categorical variable (species).
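
To make the goal concrete, here is a minimal base-R sketch of the kind of split I have in mind: fold assignment is done separately within each level, so every level's rows are dealt out as evenly as possible across the folds. The function name `stratified_folds` and the toy `species` vector are hypothetical and only for illustration. The sketch also ignores the repeated-measures structure (rows from the same tree can end up in different folds), and a level with fewer than `k` rows will necessarily be absent from some test folds.

```r
# Hypothetical helper: assign each row to one of k folds, stratified by a
# factor, so that each level's rows are spread evenly across the folds.
stratified_folds <- function(strata, k = 5) {
  folds <- integer(length(strata))
  # split() gives one vector of row indices per level of `strata`
  for (idx in split(seq_along(strata), strata)) {
    # Shuffle the rows of this level (sample.int avoids the sample(x)
    # surprise when a level has a single row)
    shuffled <- idx[sample.int(length(idx))]
    # Deal rows out round-robin, starting at a random fold so that
    # small levels do not all pile into fold 1
    start <- sample.int(k, 1)
    folds[shuffled] <- ((start + seq_along(shuffled) - 2) %% k) + 1
  }
  folds
}

# Toy data (made up): one common level and two rare ones
set.seed(1)
species <- factor(c(rep("common", 100), rep("rare", 10), rep("very_rare", 5)))
folds <- stratified_folds(species, k = 5)

# Rows per level in each fold
print(table(species, folds))

# Train/test indices for fold 1
test_idx  <- which(folds == 1)
train_idx <- which(folds != 1)
```

With these toy counts and `k = 5`, each fold receives 20 `common`, 2 `rare`, and 1 `very_rare` row, so every level appears in both the training and the test set of every fold.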