Thursday, 27 July 2017

Split a data frame into two random samples with equal proportions of multiple variables

I'm trying to run k-fold cross-validation for a glm model whose factor levels are unequally distributed, so when I split the data into separate calibration and validation data frames, I inevitably end up with some factor levels present in only one of the two.

So say I have the following data frame:

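set.seed(3.14)
# All three columns are sampled with size = 100 (100 rows, as in the summaries below)
df <- data.frame(x1 = sample(0:1, size = 100, replace = TRUE),
                 x2 = sample(0:2, size = 100, replace = TRUE),
                 y  = sample(0:1, size = 100, replace = TRUE))
# Convert every column to a factor; apply() would coerce to a character matrix
df[] <- lapply(df, as.factor)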
> sapply(df,FUN=summary)
$x1
0  1 
51 49 

$x2
0  1  2 
37 32 31 

$y
0  1 
48 52 

How can I randomly split it into two data frames with roughly equal proportions of factor levels across all variables? For example, the summaries for an 80/20 split would look something like this:

Calibration:

$x1
0   1
41  39
$x2
0    1   2
30   25  25
$y
0   1
38  42

Validation:

$x1
0   1
10  10
$x2
0   1  2
7   7  6
$y
0   1
10  10

Note: This is a simplified example. The actual data has 20+ variables with as many as 9 or 10 factor levels.
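
One way to picture what is being asked for is a brute-force search: draw many random 80/20 splits, discard any in which a factor level is missing from either part, and keep the split whose level proportions differ least between the two parts. The sketch below does this in base R. The function name `balanced_split` and its arguments `p` and `n_try` are made up for illustration, and the approach is only an assumption about what "somewhat-equal" should mean, not an established method.

```r
# Hypothetical helper (not from the original post): try many random splits
# and keep the one with the smallest gap in factor-level proportions.
balanced_split <- function(data, p = 0.8, n_try = 1000) {
  data[] <- lapply(data, factor)   # make sure every column is a factor
  n <- nrow(data)
  n_cal <- round(p * n)

  # Largest absolute difference in level proportions between the two parts
  imbalance <- function(idx) {
    max(sapply(data, function(f) {
      cal <- prop.table(table(f[idx]))
      val <- prop.table(table(f[-idx]))
      max(abs(cal - val))
    }))
  }

  # TRUE if every level of every variable occurs in both parts
  all_levels_present <- function(idx) {
    all(sapply(data, function(f) {
      all(table(f[idx]) > 0) && all(table(f[-idx]) > 0)
    }))
  }

  best_idx <- NULL
  best_score <- Inf
  for (i in seq_len(n_try)) {
    idx <- sample.int(n, n_cal)
    if (!all_levels_present(idx)) next
    score <- imbalance(idx)
    if (score < best_score) {
      best_idx <- idx
      best_score <- score
    }
  }
  if (is.null(best_idx)) stop("no split contained every level in both parts")

  list(calibration = data[best_idx, , drop = FALSE],
       validation  = data[-best_idx, , drop = FALSE],
       imbalance   = best_score)
}

# Usage on toy data shaped like the example above
set.seed(42)
toy <- data.frame(x1 = sample(0:1, 100, replace = TRUE),
                  x2 = sample(0:2, 100, replace = TRUE),
                  y  = sample(0:1, 100, replace = TRUE))
parts <- balanced_split(toy, p = 0.8)
lapply(parts$calibration, table)
lapply(parts$validation, table)
```

With 20+ variables and rare levels, random draws may seldom (or never) place every level in both parts, so `n_try` might need to be much larger or the rare levels handled separately. This also produces a single split rather than k folds.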



