I'm trying to run k-fold cross-validation for a glm model whose factor levels are unevenly distributed, so when I split the data into separate calibration/validation data frames, I inevitably end up with some factor levels present in only one of the two.
So say I have the following data frame:
set.seed(3.14)
df <- data.frame(x1 = sample(0:1, size = 100, replace = TRUE),
                 x2 = sample(0:2, size = 100, replace = TRUE),
                 y  = sample(0:1, size = 100, replace = TRUE))
# Convert every column to a factor; apply() would coerce to a character matrix
df[] <- lapply(df, as.factor)
> sapply(df,FUN=summary)
$x1
0 1
51 49
$x2
0 1 2
37 32 31
$y
0 1
48 52
How can I randomly split it into two data frames with roughly equal proportions of factor levels across all variables? For example, the summary for an 80/20 split would look something like this:
Calibration:
$x1
0 1
41 39
$x2
0 1 2
30 26 25
$y
0 1
38 42
Validation:
$x1
0 1
10 10
$x2
0 1 2
7 6 6
$y
0 1
10 10
Note: This is a simplified example. The actual data has 20+ variables with as many as 9 or 10 factor levels.
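To make the goal concrete, here is a minimal base R sketch of one possible direction, not a tested solution: treat each observed combination of factor levels as a stratum and sample about 80% of the rows within each stratum. `stratified_split` and its `p` argument are hypothetical names introduced for illustration; they do not come from any package.

```r
# Hypothetical helper (illustration only): split a data frame of factors so
# that each observed combination of levels is sampled at proportion p.
stratified_split <- function(data, p = 0.8) {
  # One stratum per observed combination of factor levels
  strata <- interaction(data, drop = TRUE)
  # Row indices grouped by stratum
  idx_by_stratum <- split(seq_len(nrow(data)), strata)
  calib_idx <- unlist(lapply(idx_by_stratum, function(idx) {
    n_calib <- round(p * length(idx))
    # Sample positions within idx so single-row strata are handled correctly
    idx[sample.int(length(idx), n_calib)]
  }), use.names = FALSE)
  # Everything not drawn for calibration goes to validation
  valid_idx <- setdiff(seq_len(nrow(data)), calib_idx)
  list(calibration = data[calib_idx, , drop = FALSE],
       validation  = data[valid_idx, , drop = FALSE])
}

# Example data with the same shape as the data frame in the question
set.seed(1)
df <- data.frame(x1 = factor(sample(0:1, 100, replace = TRUE)),
                 x2 = factor(sample(0:2, 100, replace = TRUE)),
                 y  = factor(sample(0:1, 100, replace = TRUE)))

parts <- stratified_split(df, p = 0.8)
sapply(parts$calibration, summary)
sapply(parts$validation, summary)
```

This only balances the joint combinations that actually occur. With 20+ variables and 9 or 10 levels each, most combinations would likely contain a single row, so the joint strata degenerate and this sketch would not guarantee balanced marginal proportions. Stratifying on the response plus a few of the rarest factors might be a more practical starting point.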