Saturday, January 8, 2022

Partitioning data into training and test sets while keeping the right prevalence

I have a dataset with 100 variables and more than 3000 observations.

Many of these variables are factors in which a level appears, for example, only 3 or 4 times across all the observations in the dataset.

For example:

dat$Var_56
1
1
1
0 # the "0" level occurs only 4 times in 3000 obs
1
1
..

This happens for many variables (the levels are not really 0 and 1 but, for example, the zone of the city). When I split the dataset into training and test sets and then prepare each one with model.matrix(), which expands every factor into dummy columns, the two sets end up with different columns, so it is impossible to check the validity of the estimated linear model on the test set.
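To make the mismatch concrete, here is a minimal sketch on toy data. The data frame, the column names `y` and `zone`, and the fixed split are all made up for illustration, and the sketch assumes unused factor levels get dropped in each subset (for instance via droplevels(), or because the factors are created separately after the split).

```r
# Toy data (hypothetical): the factor `zone` has a rare level "D" seen once
dat <- data.frame(
  y    = c(1.2, 0.7, 3.1, 2.4, 1.9, 2.8, 0.5, 1.1),
  zone = factor(c("A", "B", "A", "B", "C", "C", "A", "D"))
)

# Fixed split for reproducibility: rows 1-6 train, rows 7-8 test
train_idx <- 1:6

# Assumption: unused levels are dropped in each subset
train <- droplevels(dat[train_idx, ])
test  <- droplevels(dat[-train_idx, ])

# Expand the factor separately in each subset
X_train <- model.matrix(y ~ zone, data = train)
X_test  <- model.matrix(y ~ zone, data = test)

# The two design matrices no longer share the same columns
print(colnames(X_train))
print(colnames(X_test))
print(setdiff(colnames(X_test), colnames(X_train)))  # columns only in the test set
```

In this toy case the level "D" lands only in the test rows, so the test matrix contains a dummy column that the training matrix, and therefore the fitted model, knows nothing about.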

What can we do in these situations?
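One option that may help, sketched here on made-up data rather than offered as a definitive answer, is to call model.matrix() once on the full dataset and split the rows afterwards, so that both sets share the same columns by construction. The names `dat`, `y`, `zone`, and `train_idx` are hypothetical.

```r
# Toy data (hypothetical): the factor `zone` has a rare level "D" seen once
dat <- data.frame(
  y    = c(1.2, 0.7, 3.1, 2.4, 1.9, 2.8, 0.5, 1.1),
  zone = factor(c("A", "B", "A", "B", "C", "C", "A", "D"))
)
train_idx <- 1:6  # fixed split for illustration

# Expand the factors once, on all observations, then split by row
X_all   <- model.matrix(y ~ zone, data = dat)
X_train <- X_all[train_idx, , drop = FALSE]
X_test  <- X_all[-train_idx, , drop = FALSE]

# Both matrices now have identical column names
same_cols <- identical(colnames(X_train), colnames(X_test))
print(same_cols)

# Dummy columns that are all zero in training: levels the model never sees
unseen <- colnames(X_train)[colSums(X_train != 0) == 0]
print(unseen)
```

This only aligns the columns. A level that never occurs in the training rows still yields an all-zero column there, so its coefficient cannot be estimated (lm() reports it as NA). Merging rare levels into an "other" category, or stratifying the split on the rare factors, are possible complements.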



