I am using regression tree analysis on data contained in a pandas DataFrame. In order to perform V-fold cross-validation, I need to split my data into V random, mutually exclusive subsets.
Here is what I've worked out so far. I add a new column, V, to the dataframe (initialised to V = 10) to denote which subset each sample belongs to:
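    def Vfold_Subsets(Data, V):
        subs = Data                      # rows not yet assigned to a subset
        Data['V'] = V                    # default label: the last subset
        N = Data.shape[0]
        n = N // V                       # size of subsets 1..V-1
        for v in range(1, V):
            sample = subs.sample(n=n)    # draw n of the remaining rows at random
            Data.loc[sample.index, 'V'] = v
            # drop() returns a new frame, so reassign to actually remove the rows
            subs = subs.drop(sample.index)
        return Data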
This method works, but I have a feeling there is a better way to do it. One downside is that if N = 108 (with V = 10, so n = 108 // 10 = 10), then
    for v in range(1, V+1):
        print(v, ':', Data['V'][Data['V'] == v].count())
returns:
1 : 10
2 : 10
3 : 10
4 : 10
5 : 10
6 : 10
7 : 10
8 : 10
9 : 10
10 : 18
I think it would be better if I could achieve something like this:
1 : 10
2 : 11
3 : 11
4 : 11
5 : 11
6 : 11
7 : 11
8 : 11
9 : 10
10 : 10
That way, the remaining samples are not all lumped into the last bin.
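One possible way to get subset sizes that differ by at most one is to build the list of labels first and then shuffle it. The following is only a minimal sketch under that assumption; the function name `balanced_fold_labels` and its `seed` parameter are made up for illustration and are not part of pandas.

```python
import random
from collections import Counter

def balanced_fold_labels(n_samples, n_folds, seed=None):
    """Return a shuffled list of labels 1..n_folds whose counts differ by at most one.

    Hypothetical helper for illustration only.
    """
    # Cycle through the labels 1, 2, ..., n_folds so that the first
    # (n_samples % n_folds) labels each receive one extra sample.
    labels = [i % n_folds + 1 for i in range(n_samples)]
    # Shuffle so that membership of each subset is random.
    random.Random(seed).shuffle(labels)
    return labels

labels = balanced_fold_labels(108, 10, seed=0)
counts = Counter(labels)
for v in range(1, 11):
    print(v, ':', counts[v])
```

For N = 108 and V = 10 this gives eight subsets of 11 samples and two of 10, so the 8 leftover samples are spread out rather than collected in the last subset. With a dataframe, the labels could then presumably be attached with something like `Data['V'] = balanced_fold_labels(len(Data), 10)`, since the list has one label per row.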