How to split a data set into training and testing data whilst ensuring that every column has a value?
My idea was to do something like this:
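data(iris)
random.sample = function(df){
  # df must contain only numeric columns, because colSums() fails on factors such as iris$Species
  repeat {
    # assign each row to group 1 (80%) or group 2 (20%)
    ind = sample(2, nrow(df), replace = TRUE, prob = c(0.8, 0.2))
    df1 = df[ind == 1, , drop = FALSE]
    b = min(colSums(df1))   # smallest column total in group 1
    df2 = df[ind == 2, , drop = FALSE]
    c = min(colSums(df2))   # smallest column total in group 2
    # check for success: the smallest column total in either group must be above zero
    check = min(b, c)
    if (check > 0.01) break
  }
  ind
} # this function makes sure that every trait has a value (could change this to be count = n)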
You can then check the result using:
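library(magrittr)      # provides %>%
df = iris[, 1:4]       # numeric columns only, since colSums() fails on the Species factor
tester_1 = function(df){
  ind = random.sample(df)
  data = data.frame(df[ind == 2, ])   # the 20% subset
  a = data.frame(colSums(data))
  a                                   # return the column totals explicitly
}
tester_1(df)
b = replicate(20, tester_1(df))
c = do.call(cbind, b) %>% as.data.frame
str(c)
d <- apply(c, 2, min)   # smallest column total in each replicate
table(d)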
I know I only checked one of the two subsets (the ind == 2 rows), but there were errors already, which suggests that something is wrong with my original code, probably in random.sample.
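Since the check above only looks at one subset, a minimal sketch of a check that covers both might look like the following. The helper name `check_split` is hypothetical, and the sketch assumes that all columns are numeric and non-negative, and that "has a value" means a non-zero column total.

```r
# Hypothetical helper: TRUE only if every column has a non-zero total
# in both the ind == 1 rows and the ind == 2 rows.
# Assumes df contains only numeric, non-negative columns.
check_split = function(df, ind) {
  totals1 = colSums(df[ind == 1, , drop = FALSE])
  totals2 = colSums(df[ind == 2, , drop = FALSE])
  all(totals1 > 0) && all(totals2 > 0)
}

data(iris)
df = iris[, 1:4]   # drop the Species factor
set.seed(1)        # arbitrary seed, only for reproducibility
ind = sample(2, nrow(df), replace = TRUE, prob = c(0.8, 0.2))
check_split(df, ind)
```

An empty subset yields column totals of zero, so the helper also returns FALSE when one group receives no rows at all.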
Any help greatly appreciated.
I have tagged random forest here because this was a problem when looping through several training data.frames (I wanted to see whether I had randomly chosen a poor test data set, by repeating the randomisation and comparing the OOB and predicted accuracy!)
Perhaps there is also a more elegant solution where one can control where the random samples come from columnwise. E.g. if I wanted to save 20% of the rows for training, but for that to be a 'representative' subset along the columns, so that in each column I would have ~20% of the values.
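One way to read the 'representative' idea is as a rejection-sampling loop: redraw the split until every column has roughly the target share of its non-zero entries in the smaller subset. The sketch below is only an illustration under that reading. The function name `representative.sample` and the parameters `p`, `tol` and `max_tries` are hypothetical, and it assumes numeric columns that each contain at least one non-zero entry.

```r
# Hypothetical helper: redraw the split until, for every column, the share of
# its non-zero entries that falls in group 2 is within `tol` of `p`.
# Assumes numeric columns, each with at least one non-zero entry.
representative.sample = function(df, p = 0.2, tol = 0.1, max_tries = 1000) {
  nonzero = as.matrix(df) != 0
  for (i in seq_len(max_tries)) {
    ind = sample(2, nrow(df), replace = TRUE, prob = c(1 - p, p))
    share = colSums(nonzero[ind == 2, , drop = FALSE]) / colSums(nonzero)
    if (all(abs(share - p) <= tol)) return(ind)
  }
  stop("no split within tolerance after max_tries draws")
}

data(iris)
df = iris[, 1:4]   # drop the Species factor
set.seed(1)        # arbitrary seed, only for reproducibility
ind = representative.sample(df)
# share of each column's non-zero entries that ended up in group 2
colSums(df[ind == 2, ] != 0) / colSums(df != 0)
```

With many sparse columns a tight `tol` may never be satisfied by random redraws, which is why the loop is capped by `max_tries` instead of using an open-ended `repeat`.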