Tuesday, April 10, 2018

Why does R 'sample' some columns more than others?

I am testing the impact of missing data on regression analysis. Using a simulated dataset, I want to randomly remove a proportion of observations (individual values, not entire rows) from a designated set of columns. I am using 'sample' to do this. Unfortunately, this leaves some columns with many more missing values than others. See the example below:

#Data frame with 5 columns, 10 rows
DF = data.frame(A = paste(letters[1:10]), B = rnorm(10, 1, 10), C = rnorm(10, 1, 10), D = rnorm(10, 1, 10), E = rnorm(10, 1, 10))

#Function to randomly delete a proportion (ProportionRemove) of records per column, for a designated set of columns (ColumnStart - ColumnEnd)
RandomSample = function(DataFrame, ColumnStart, ColumnEnd, ProportionRemove) {
  #ci is the complement of ProportionRemove, i.e. the proportion to keep
  ci = 1 - ProportionRemove
  Missing = sapply(DataFrame[ColumnStart:ColumnEnd], function(x) {
    x[sample(c(TRUE, NA), prob = c(ci, ProportionRemove), size = length(DataFrame), replace = TRUE)]
  })
}

#Randomly sample columns 2-5 within DF, deleting 80% of the observations per column
Test = RandomSample(DF, 2, 5, 0.8)

I understand there is an element of randomness to this, but in 10 trials (10*4 = 40 columns), 17 of the columns had no data, and in one trial, one column still had 6 records (rather than the expected ~2) - see below.

       B         C         D  E
 [1,] NA 24.004402  7.201558 NA
 [2,] NA        NA        NA NA
 [3,] NA  4.029659        NA NA
 [4,] NA        NA        NA NA
 [5,] NA 29.377632        NA NA
 [6,] NA  3.340918 -2.131747 NA
 [7,] NA        NA        NA NA
 [8,] NA 15.967318        NA NA
 [9,] NA        NA        NA NA
[10,] NA -8.078221        NA NA 

In summary, I want to replace a proportion of the observations in each column with NAs.

Any help is greatly appreciated!!!
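
One possible cause, offered as a sketch rather than a definitive diagnosis: for a data frame, `length(DataFrame)` returns the number of columns (5 here), not the number of rows (10). The sampled index vector is therefore only 5 long and gets recycled over the 10 rows, which would explain why the surviving values in the output above come in pairs five rows apart (rows 1 and 6, 3 and 8, 5 and 10). The version below uses `nrow()` instead and draws an exact number of row positions per column, so every column loses the same count of values. The helper name `remove_proportion` and its argument names are hypothetical, and rounding the count with `round()` is an assumption about the desired behaviour.

```r
# Hypothetical helper (not from the original post): set an exact
# proportion of the values in each selected column to NA.
remove_proportion <- function(data, col_start, col_end, proportion_remove) {
  n_rows <- nrow(data)                           # rows, not length(data), which counts columns
  n_remove <- round(n_rows * proportion_remove)  # assumed rounding rule for the count to delete
  cols <- col_start:col_end
  data[cols] <- lapply(data[cols], function(x) {
    x[sample.int(n_rows, n_remove)] <- NA        # independent draw of row positions per column
    x
  })
  data
}

set.seed(1)  # only to make this illustration reproducible
DF <- data.frame(A = letters[1:10], B = rnorm(10, 1, 10), C = rnorm(10, 1, 10),
                 D = rnorm(10, 1, 10), E = rnorm(10, 1, 10))
Test <- remove_proportion(DF, 2, 5, 0.8)
colSums(is.na(Test))  # number of NAs in each column
```

With 10 rows and a proportion of 0.8, each of columns B to E ends up with 8 NAs and column A is left untouched. If a random number of NAs per column is actually wanted, keeping the original `sample(c(TRUE, NA), ...)` call but passing `size = nrow(DataFrame)` should be the smaller change.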



