Tuesday, 19 February 2019

How to efficiently select a random sample of variables from a set of variables in a dataframe

I would appreciate any help with randomly selecting 5 of the 10 var.w_X variables in my sample data sampleDT, while keeping all the other variables that do not start with var.w_.

Below is the sample data sampleDT, which contains, among other variables (all of which should be kept), X variables whose names start with var.w_ (those from which to draw the random sample).

In the current example, X = 10, so the var.w_ variables run from var.w_1 to var.w_10, and I want to draw a random sample of 5 of these 10. However, in my actual data X > 1,000,000, and I might want to draw a sample of 7,500 var.w_ variables out of these.

Efficiency is therefore paramount in any solution, since I recently ran into performance issues with mutate_at whose cause I still cannot explain.

#sample data

sampleDT<-structure(list(n = c(62L, 96L, 17L, 41L, 212L, 143L, 143L, 143L, 
73L, 73L), r = c(3L, 1L, 0L, 2L, 170L, 21L, 0L, 33L, 62L, 17L
), p = c(0.0483870967741935, 0.0104166666666667, 0, 0.0487804878048781, 
0.80188679245283, 0.146853146853147, 0, 0.230769230769231, 0.849315068493151, 
0.232876712328767), group = c(1L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 
0L, 0L), treat = c(0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L), c1 = c(1.941115288, 
1.186583128, 1.159882668, 1.159882668, 1.133397521, 1.128993008, 
1.128993008, 1.128993008, 1.121927228, 1.121927228), c2 = c(0.1438, 
0.237, 0.2774, 0.2774, 0.2093, 0.1206, 0.1707, 0.0699, 0.1351, 
0.1206), var.w_1 = c(1.941115288, 1.186583128, 3.159882668, 3.159882668, 
1.133397521, 1.128993008, 2.128993008, 1.128993008, 1.121927228, 
1.121927228), var.w_2 = c(1.931115288, 1.176583128, 3.149882668, 
3.149882668, 1.123397521, 1.118993008, 2.118993008, 1.118993008, 
1.111927228, 1.111927228), var.w_3 = c(1.946115288, 1.191583128, 
3.164882668, 3.164882668, 1.138397521, 1.133993008, 2.133993008, 
1.133993008, 1.126927228, 1.126927228), var.w_4 = c(1.93778195466667, 
1.18324979466667, 3.15654933466667, 3.15654933466667, 1.13006418766667, 
1.12565967466667, 2.12565967466667, 1.12565967466667, 1.11859389466667, 
1.11859389466667), var.w_5 = c(1.943615288, 1.189083128, 3.162382668, 
3.162382668, 1.135897521, 1.131493008, 2.131493008, 1.131493008, 
1.124427228, 1.124427228), var.w_6 = c(1.939115288, 1.184583128, 
3.157882668, 3.157882668, 1.131397521, 1.126993008, 2.126993008, 
1.126993008, 1.119927228, 1.119927228), var.w_7 = c(1.94278195466667, 
1.18824979466667, 3.16154933466667, 3.16154933466667, 1.13506418766667, 
1.13065967466667, 2.13065967466667, 1.13065967466667, 1.12359389466667, 
1.12359389466667), var.w_8 = c(1.94254385942857, 1.18801169942857, 
3.16131123942857, 3.16131123942857, 1.13482609242857, 1.13042157942857, 
2.13042157942857, 1.13042157942857, 1.12335579942857, 1.12335579942857
), var.w_9 = c(1.942365288, 1.187833128, 3.161132668, 3.161132668, 
1.134647521, 1.130243008, 2.130243008, 1.130243008, 1.123177228, 
1.123177228), var.w_10 = c(1.94222639911111, 1.18769423911111, 
3.16099377911111, 3.16099377911111, 1.13450863211111, 1.13010411911111, 
2.13010411911111, 1.13010411911111, 1.12303833911111, 1.12303833911111
)), class = "data.frame", row.names = c(NA, -10L))
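
For illustration only, the operation described above (keep every column that does not start with var.w_, plus a random 5 of those that do) could be sketched in base R as follows. The data frame demoDT and the names w_idx, keep_w, keep and subDT are hypothetical stand-ins, not part of the original data, and this is one possible approach under those assumptions rather than a benchmarked solution.

```r
# Hypothetical toy data with the same naming pattern as sampleDT
set.seed(42)  # only so the sketch is reproducible
demoDT <- data.frame(n = 1:4, treat = c(0L, 0L, 1L, 1L))
for (i in 1:10) demoDT[[paste0("var.w_", i)]] <- rnorm(4)

# Positions of the columns whose names start with "var.w_"
w_idx <- which(startsWith(names(demoDT), "var.w_"))

# Draw 5 of those positions at random, without replacement.
# sample.int() on the length avoids the sample() pitfall when
# w_idx happens to have length 1.
keep_w <- w_idx[sample.int(length(w_idx), 5)]

# Keep all non-var.w_ columns plus the sampled ones, in original column order
keep <- sort(c(setdiff(seq_along(demoDT), w_idx), keep_w))
subDT <- demoDT[keep]

names(subDT)
```

The sampling step works only on the vector of column names and positions, so it does not touch the column contents; whether the final subsetting step is fast enough for X > 1,000,000 columns would need to be measured on the real data.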

Thanks in advance for any help.



