I am testing the impact of missing data on regression analysis. Using a simulated dataset, I want to randomly remove a proportion of observations (individual values, not entire rows) from a designated set of columns. I am using 'sample' to do this. Unfortunately, this leaves some columns with many more missing values than others. See the example below:
# Data frame with 5 columns, 10 rows
DF = data.frame(A = paste(letters[1:10]), B = rnorm(10, 1, 10), C = rnorm(10, 1, 10), D = rnorm(10, 1, 10), E = rnorm(10, 1, 10))

# Function to randomly delete a proportion (ProportionRemove) of records per column,
# for a designated set of columns (ColumnStart to ColumnEnd)
RandomSample = function(DataFrame, ColumnStart, ColumnEnd, ProportionRemove) {
  # ci is the complement of ProportionRemove, i.e. the proportion to keep
  ci = 1 - ProportionRemove
  Missing = sapply(DataFrame[ColumnStart:ColumnEnd], function(x)
    x[sample(c(TRUE, NA), prob = c(ci, ProportionRemove), size = length(DataFrame), replace = TRUE)])
}

# Randomly sample columns 2 to 5 within DF, deleting 80% of the observations per column
Test = RandomSample(DF, 2, 5, 0.8)
I understand there is an element of randomness to this, but in 10 trials (10*4 = 40 columns), 17 of the columns had no data, and in one trial, one column still had 6 records (rather than the expected ~2) - see below.
       B         C         D  E
 [1,] NA 24.004402  7.201558 NA
 [2,] NA        NA        NA NA
 [3,] NA  4.029659        NA NA
 [4,] NA        NA        NA NA
 [5,] NA 29.377632        NA NA
 [6,] NA  3.340918 -2.131747 NA
 [7,] NA        NA        NA NA
 [8,] NA 15.967318        NA NA
 [9,] NA        NA        NA NA
[10,] NA -8.078221        NA NA
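One possible explanation, offered as a guess rather than a confirmed diagnosis: `length()` of a data frame returns its number of columns, so `size = length(DataFrame)` draws only 5 TRUE/NA values, and R then recycles that short logical index down the 10 rows. That would fit the output above, where column C keeps rows 1, 3, 5 and again 6, 8, 10, and column D keeps rows 1 and 6. A small check (the 3-column `DF` and the `idx` vector here are made up for illustration):

```r
# Illustrative check of the suspected cause; DF and idx are made-up examples.
DF <- data.frame(A = letters[1:10], B = rnorm(10, 1, 10), C = rnorm(10, 1, 10))

length(DF)  # number of columns (3 here), not the number of rows
nrow(DF)    # number of rows (10)

x <- 1:10
# A length-5 logical index, like one draw of sample(c(TRUE, NA), size = 5, ...)
idx <- c(TRUE, NA, TRUE, NA, NA)

# The index is recycled to length 10, so the same keep/NA pattern
# repeats in rows 1-5 and rows 6-10.
x[idx]
```

If this is the cause, a column comes out entirely NA whenever all 5 draws are NA, which at 80% happens with probability 0.8^5 (about 0.33), roughly in line with 17 empty columns out of 40.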
In summary, I want to replace a proportion of observations with NAs in each column.
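A minimal sketch of one way this might be done, assuming an exact count of NAs per column is acceptable (rather than each value being dropped independently with the given probability). The helper name `RemoveProportion` and its arguments are hypothetical, chosen only for this illustration:

```r
# Hypothetical helper: sets round(nrow * ProportionRemove) randomly chosen
# values to NA in each of the given columns, independently per column.
RemoveProportion <- function(DataFrame, Columns, ProportionRemove) {
  n <- nrow(DataFrame)                      # rows, not length(DataFrame)
  n_remove <- round(n * ProportionRemove)   # exact number of values to blank
  for (col in Columns) {
    drop_rows <- sample.int(n, n_remove)    # row positions drawn without replacement
    DataFrame[drop_rows, col] <- NA
  }
  DataFrame
}

set.seed(1)  # only to make the illustration reproducible
DF <- data.frame(A = letters[1:10], B = rnorm(10, 1, 10), C = rnorm(10, 1, 10),
                 D = rnorm(10, 1, 10), E = rnorm(10, 1, 10))

Test <- RemoveProportion(DF, 2:5, 0.8)
colSums(is.na(Test[2:5]))  # count of NAs in each of columns B to E
```

Because the rows are drawn separately for each column, the NAs land in different rows per column while every column loses the same number of values. If the count should itself vary randomly around the target proportion, sampling TRUE/NA with `size = nrow(DataFrame)` would be the alternative.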
Any help is greatly appreciated!!!