random: Random data generation leading to good prediction on random labels

samedi 26 août 2017

Random data generation leading to good prediction on random labels

I've been playing around with implementing CV in R but encountered a weird problem with the returned value among folds in LOOCV.

First I'll randomly generate data as well as labels, then I'll fit a randomForest on what should be just noise. From the returned loop I get not only a good AUC but a significant value from a t-test. I don't understand how this could be theoretically happening so I was curious if the ways I attempted to generate data/labels was best?

Here is a code snippet that shows my issue.

library(randomForest)
n=30
p=900

XX=matrix(rnorm(n*p, 0, 1) , nrow=n)
YY=as.factor(sample(c('P', 'C'), n, replace=T))
resp = c()

for(i in 1:n){
  fit = randomForest(XX[-i,], YY[-i])
  pred = predict(fit, XX[i,], type = "prob")[2]
  resp <- c(resp, pred)
}

t.test(resp~YY)$p.value

roc(YY, resp)$auc

I tried multiple ways of generating data all of which result in the same thing

XX=matrix(runif(n*p), nrow=n)
XX=matrix(rnorm(n*p, 0, 1) , nrow=n)

and

random_data=matrix(0, n, p)
for(i in 1:n){
  random_data[i,]=jitter(runif(p), factor = 1, amount = 10)
}
XX=as.matrix(random_data)

Since the randomForest is finding relevant predictors in this scenario that leads me to believe that data may not be truly random. Is there a better possible way I could generate data, or generate the random labels? is it possible that this is an issue with R?

random

samedi 26 août 2017

Random data generation leading to good prediction on random labels

Aucun commentaire:

Enregistrer un commentaire