Monday, May 4, 2020

Train/test samples do not look random when downsampling in R

My data set consists of information collected from inpatients about their satisfaction with the services they received at the hospital. The data looks like this (only a subset of the variables is shown):

 $ Advised                                : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 1 2 2 ...
 $ Overall_Rate_Discharge_Process         : Factor w/ 5 levels "1","2","3","4",..: 3 4 5 5 4 4 4 4 4 5 ...
 $ Rights_Responsibilities                : Factor w/ 2 levels "0","1": 1 2 2 2 2 2 2 2 1 2 ...
 $ Overall_Care                           : Factor w/ 5 levels "1","2","3","4",..: 4 4 5 5 4 4 4 3 5 5 ...
 $ Recommend_Employees                    : Factor w/ 2 levels "0","1": 1 1 2 2 2 1 2 1 1 2 ...
 $ NPSVal3.1                              : Factor w/ 3 levels "Detractor","Passive",..: 3 2 3 3 3 2 2 1 3 3 ...

My objective is to find the factors that affect the NPSVal3.1 of the patients (using ordinal logistic regression). The levels of the NPSVal3.1 column do not have equal numbers of rows:

Detractor   Passive  Promoter
      981     12932      8560

Therefore, I'm trying downsampling to select the training set. Below is the code I used (downSample() is from the caret package):

train3.1 <- downSample(mydata3.1, mydata3.1$NPSVal3.1)

When I checked head() and tail() of the training set, it did not look random (the row IDs are in order, and the first rows are all Detractor):

> head(train3.1)

  Discharge_Instructions_Treatment_Plans Advised Overall_Rate_Discharge_Process Rights_Responsibilities Overall_Care
1                                      1       1                              2                       1            3
2                                      1       1                              4                       0            4
3                                      1       0                              4                       0            5
4                                      1       1                              3                       1            4
5                                      1       1                              4                       0            4
6                                      1       0                              4                       1            4
  Recommend_Employees NPSVal3.1     Class
1                   0 Detractor Detractor
2                   0 Detractor Detractor
3                   0 Detractor Detractor
4                   0 Detractor Detractor
5                   0 Detractor Detractor
6                   1 Detractor Detractor

Also, when I extracted the test set, it did not look random either. Below is the code I used:

test3.1 <- dplyr::anti_join(mydata3.1, train3.1)
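A side note on the anti_join() call above, as a hedged sketch on toy data rather than the real survey: when no `by` argument is given, anti_join() matches on all columns the two tables share, and it drops every row of the first table whose values appear in the second. With purely categorical columns that contain many identical rows, this can remove far more rows than were actually sampled. The data frame `full` and the `id` column below are hypothetical and exist only for illustration.

```r
library(dplyr)

# Toy data with duplicated value combinations (hypothetical, for illustration)
full <- data.frame(
  a = c(1, 1, 2, 2, 3),
  b = c("x", "x", "y", "y", "z")
)
full$id <- seq_len(nrow(full))   # explicit row identifier (an assumption)

# Pretend rows 1 and 3 were sampled into the training set
train <- full[c(1, 3), ]

# Matching on values: rows 2 and 4 are dropped too, because they are
# identical in value to the sampled rows 1 and 3
test_by_value <- anti_join(full, train, by = c("a", "b"))
nrow(test_by_value)   # 1

# Matching on the row identifier: only the sampled rows are excluded
test_by_id <- anti_join(full, train, by = "id")
nrow(test_by_id)      # 3
```

Adding an identifier column before sampling also makes it possible to see which original rows ended up in the training set, since the row names shown by head() are renumbered in the output.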

Are these data sets random? If so, how can I verify that? If not, how can I make both the train and test sets random? Thank you for your support!
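One possible way to approach this, offered as a minimal sketch and not a definitive answer: as far as I understand it, downSample() does pick rows at random within each class, but it returns them stacked class by class with renumbered row names, which is why head() shows only Detractor rows with IDs 1 to 6. The sketch below splits the data first with createDataPartition(), downsamples only the training part, and then shuffles the row order. The `toy` data frame, the `row_id` column, the seed, and the 70/30 split are all assumptions made for illustration; they are not taken from the real data.

```r
library(caret)

set.seed(42)  # arbitrary seed, used only to make the sketch reproducible

# Toy stand-in for mydata3.1 (hypothetical values, not the real survey)
n <- 600
toy <- data.frame(
  Advised      = factor(sample(0:1, n, replace = TRUE)),
  Overall_Care = factor(sample(1:5, n, replace = TRUE)),
  NPSVal3.1    = factor(
    sample(c("Detractor", "Passive", "Promoter"), n,
           replace = TRUE, prob = c(0.05, 0.55, 0.40)),
    levels = c("Detractor", "Passive", "Promoter")
  )
)
toy$row_id <- seq_len(n)  # keeps track of the original rows

# 1. Stratified random split of the original (imbalanced) data
idx       <- createDataPartition(toy$NPSVal3.1, p = 0.7, list = FALSE)
train_raw <- toy[idx, ]
test_set  <- toy[-idx, ]

# 2. Downsample the training part only; yname keeps the outcome's own name
#    instead of adding a second column called "Class"
predictors <- setdiff(names(train_raw), "NPSVal3.1")
train_down <- downSample(x = train_raw[, predictors],
                         y = train_raw$NPSVal3.1,
                         yname = "NPSVal3.1")

# 3. Shuffle the rows so the classes are no longer stacked in blocks
train_down <- train_down[sample(nrow(train_down)), ]

table(train_down$NPSVal3.1)                             # equal counts per class
length(intersect(train_down$row_id, test_set$row_id))   # no shared rows
```

With this layout the test set keeps the original class proportions, which may or may not be what is wanted. The shuffle in step 3 is mostly cosmetic: the row order should not change the fit of an ordinal logistic regression, so the grouped output of downSample() is not in itself a sign that the sampling was non-random.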



