Saturday, 26 August 2017

Random data generation leading to good prediction on random labels

I've been experimenting with implementing cross-validation (CV) in R, but I ran into a strange problem with the values returned across folds in leave-one-out CV (LOOCV).

First I randomly generate data and labels, then I fit a randomForest to what should be pure noise. From the predictions collected in the loop I get not only a good AUC but also a significant p-value from a t-test. I don't understand how this could happen in theory, so I am wondering whether the way I generated the data and labels is sound.

Here is a code snippet that shows my issue.

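library(randomForest)
library(pROC)  # roc() below is assumed to come from pROC

n = 30   # number of samples
p = 900  # number of noise features

XX = matrix(rnorm(n * p, 0, 1), nrow = n)
YY = as.factor(sample(c('P', 'C'), n, replace = TRUE))
resp = numeric(n)  # held-out predicted probability for each sample

# Leave-one-out CV: fit on all rows except i, then predict row i
for (i in 1:n) {
  fit = randomForest(XX[-i, ], YY[-i])
  # drop = FALSE keeps the held-out row as a 1 x p matrix;
  # [2] selects the second class column ('P', since levels sort as C, P)
  pred = predict(fit, XX[i, , drop = FALSE], type = "prob")[2]
  resp[i] = pred
}

t.test(resp ~ YY)$p.value

roc(YY, resp)$auc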
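
One thing that may be worth ruling out when reading the AUC above: assuming `roc()` comes from the pROC package, its `direction` argument defaults to `"auto"`, which picks the comparison direction from the data, so the reported AUC tends not to fall below 0.5. The sketch below is a base-R cross-check (the helper name `auc_fixed` is hypothetical, not part of any package) that computes the AUC with the direction fixed in advance through the rank-sum identity, so anti-predictive scores would show up as a value below 0.5.

```r
# Hypothetical helper: AUC with a fixed direction (higher score = positive class).
# Uses the identity AUC = (R_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg),
# where R_pos is the sum of the ranks of the positive-class scores.
auc_fixed <- function(labels, scores, positive) {
  is_pos <- labels == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  stopifnot(n_pos > 0, n_neg > 0)
  r <- rank(scores)  # ties receive average ranks
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy scores, made up purely for illustration
labels <- factor(c("C", "C", "C", "P", "P", "P"))
auc_hi <- auc_fixed(labels, c(0.1, 0.2, 0.3, 0.7, 0.8, 0.9), positive = "P")  # positives ranked highest
auc_lo <- auc_fixed(labels, c(0.7, 0.8, 0.9, 0.1, 0.2, 0.3), positive = "P")  # positives ranked lowest
```

With the variables from the snippet above, the corresponding call would be `auc_fixed(YY, resp, positive = "P")`, since `resp` holds the predicted probability of class 'P'.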

I tried several ways of generating the data, all of which give the same result:

XX=matrix(runif(n*p), nrow=n)
XX=matrix(rnorm(n*p, 0, 1) , nrow=n)

and

random_data=matrix(0, n, p)
for(i in 1:n){
  random_data[i,]=jitter(runif(p), factor = 1, amount = 10)
}
XX=as.matrix(random_data)

Since the randomForest appears to find relevant predictors in this scenario, I suspect the data may not be truly random. Is there a better way to generate the data or the random labels? Is it possible that this is an issue with R?
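
A minimal sketch of one way to check the generated data directly, assuming nothing beyond base R: if the features and the labels are independent, a per-feature t-test should reject at the 5% level for roughly 5% of the columns. The names `pvals` and `frac_sig` and the seed are illustrative choices, not part of the original code.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible

n <- 30
p <- 900
XX <- matrix(rnorm(n * p, 0, 1), nrow = n)
YY <- as.factor(sample(c("P", "C"), n, replace = TRUE))

# One two-sample (Welch) t-test per feature column against the random labels
pvals <- apply(XX, 2, function(x) t.test(x ~ YY)$p.value)

# Under independence this fraction should be near 0.05
frac_sig <- mean(pvals < 0.05)
frac_sig
```

A fraction close to 0.05 would suggest the generators are behaving as expected, in which case the cause may lie elsewhere in the procedure rather than in the random data itself.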



