Friday, 3 February 2017

How to sample a data.frame into training and testing data whilst ensuring that every column has a value?

How can I sample a data.frame into training and testing data whilst ensuring that every column has a value in both subsets?

My idea was to do something like this:

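data(iris)

# Draw a random ~80/20 split and redraw it until every numeric column
# has a positive sum in both the training and the testing rows.
random.sample = function(df){
  num = sapply(df, is.numeric)  # colSums() only works on numeric columns
  repeat {
    # 1 = training (~80%), 2 = testing (~20%)
    ind = sample(2, nrow(df), replace = TRUE, prob = c(0.8, 0.2))
    df1 = df[ind == 1, num, drop = FALSE]
    b = min(colSums(df1))  # smallest column sum in the training rows
    df2 = df[ind == 2, num, drop = FALSE]
    c = min(colSums(df2))  # smallest column sum in the testing rows
    # check for success: both subsets need a positive sum in every column
    check = min(b, c)
    if (check > 0.01) break
  }
  ind
}  # this function makes sure that every trait has a value (could change this to be count = n)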
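The count = n idea from the comment above could be sketched roughly as follows. This assumes that "has a value" means a non-zero, non-NA entry; random.sample.n and its arguments n, prob and max.tries are hypothetical names that are not part of my original code. The cap on the number of tries is there so the loop cannot run forever when no valid split exists.

```r
# Hypothetical variant of random.sample(): every numeric column must have
# at least n non-zero, non-NA entries in both subsets.
random.sample.n = function(df, n = 1, prob = c(0.8, 0.2), max.tries = 1000){
  num = sapply(df, is.numeric)  # only numeric columns are checked
  x = df[, num, drop = FALSE]
  nonzero = !is.na(x) & x != 0  # logical matrix: TRUE where a value is present
  for (i in seq_len(max.tries)) {
    ind = sample(2, nrow(df), replace = TRUE, prob = prob)
    n1 = colSums(nonzero[ind == 1, , drop = FALSE])  # counts in training rows
    n2 = colSums(nonzero[ind == 2, , drop = FALSE])  # counts in testing rows
    if (min(n1, n2) >= n) return(ind)
  }
  stop("no valid split found in ", max.tries, " tries")
}

set.seed(1)
ind = random.sample.n(iris, n = 5)
table(ind)
```

Counting entries instead of summing them also avoids the case where positive and negative values cancel out in a column sum.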

And you can check the data using:

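library(magrittr)  # provides %>%

# Column sums of the testing rows for one random split
tester_1 = function(df){
  ind = random.sample(df)
  num = sapply(df, is.numeric)
  data = df[ind == 2, num, drop = FALSE]
  data.frame(colSums(data))
}

tester_1(iris)
b = replicate(20, tester_1(iris), simplify = FALSE)  # 20 independent splits
c = do.call(cbind, b) %>% as.data.frame
str(c)

d <- apply(c, 2, min)  # smallest column sum in each split
table(d)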

I know I only checked one half of the split (the testing rows), but there were already errors, which suggests that something is wrong with my original code, probably in random.sample.

Any help greatly appreciated.

I have tagged random forest here because this was a problem when looping through several training data.frames. (I wanted to see whether I had randomly chosen a poor test data set, by repeating the randomisation and comparing the OOB and predicted accuracy.)

Perhaps there is also a more elegant solution where one can control where the random samples come from column-wise. E.g. if I wanted to set aside 20% of the rows, but for that to be a 'representative' subset along the columns, so that each column would contribute ~20% of its values.
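One way to get part of the way there is stratified sampling: draw the held-out rows separately within each level of a grouping column. Below is a minimal base R sketch, assuming a single factor column to stratify on (Species in iris). The names stratified.split, group and test.frac are hypothetical, and this only controls one column rather than all columns at once.

```r
# Hypothetical helper: mark ~test.frac of the rows within each level of
# df[[group]] as testing (2) and the rest as training (1).
stratified.split = function(df, group, test.frac = 0.2){
  ind = rep(1L, nrow(df))
  rows.by.level = split(seq_len(nrow(df)), df[[group]])
  for (rows in rows.by.level) {
    k = round(length(rows) * test.frac)
    if (k > 0) {
      # index into rows so that a level with a single row is handled correctly
      ind[rows[sample.int(length(rows), k)]] = 2L
    }
  }
  ind
}

set.seed(1)
ind = stratified.split(iris, "Species")
table(iris$Species, ind)
```

With iris (50 rows per species), this puts 10 rows of each species in the testing set. The returned vector uses the same 1/2 coding as random.sample, so it can be used in the same way, e.g. iris[ind == 2, ].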



