Friday, March 17, 2017

Split pandas dataframe into mutually exclusive subsets

I am using regression tree analysis on data contained in a pandas dataframe. In order to perform V-fold cross-validation, I need to split my data into V random, mutually exclusive subsets.

Here is what I've worked out so far, using V = 10. I add a new column V to the dataframe to denote which subset each sample is a member of:

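def Vfold_Subsets(Data, V):
    # Adds a column 'V' to Data in place and returns it.
    subs = Data
    Data['V'] = V  # rows never sampled below stay in subset V
    N = Data.shape[0]
    n = N // V
    for v in range(1, V):
        # Draw n rows from those not yet assigned to a subset.
        sample = subs.sample(n=n)
        # .loc avoids chained assignment; assumes the index is unique.
        Data.loc[sample.index, 'V'] = v
        # drop() returns a new frame, so the result must be reassigned.
        subs = subs.drop(sample.index)
    return Data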

This method works, but I have a feeling there is a better way to do it. A downside of this method is that if N = 108 and V = 10, then

for v in range(1, V + 1):
    print(v, ': ', Data['V'][Data['V'] == v].count())

returns:

1 :  10
2 :  10
3 :  10
4 :  10
5 :  10
6 :  10
7 :  10
8 :  10
9 :  10
10 :  18
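
These counts follow from the integer division in the function: each of the first V - 1 subsets gets N // V samples, and everything left over stays in subset V. A small sketch of that arithmetic (the helper name `naive_fold_sizes` is hypothetical, used only for illustration):

```python
def naive_fold_sizes(N, V):
    """Subset sizes when the first V - 1 subsets each get N // V samples."""
    n = N // V
    # The last subset keeps every sample that was never drawn.
    return [n] * (V - 1) + [N - n * (V - 1)]

print(naive_fold_sizes(108, 10))
```

For N = 108 and V = 10, nine subsets take 10 samples each (90 in total), which leaves 18 for the last one.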

I think it would be better if I could achieve something like this:

1 :  11
2 :  11
3 :  11
4 :  11
5 :  11
6 :  11
7 :  11
8 :  11
9 :  10
10 :  10

So that I don't lump all the remaining samples into the last bin.
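
One possible way to get subset sizes that differ by at most one is to build the labels 1..V in round-robin order and then shuffle them across the rows. This is only a minimal sketch under my own assumptions: the function name `balanced_vfold_labels`, the `seed` parameter, and the choice to return a copy rather than modify the dataframe in place are all hypothetical and not part of the code above.

```python
import numpy as np
import pandas as pd

def balanced_vfold_labels(data, V, seed=None):
    """Return a copy of `data` with a column 'V' holding labels 1..V."""
    N = len(data)
    # Labels 1, 2, ..., V, 1, 2, ... so the first N % V subsets get one extra row.
    labels = np.arange(N) % V + 1
    # Shuffle the labels so that subset membership is random.
    rng = np.random.default_rng(seed)
    out = data.copy()
    out['V'] = rng.permutation(labels)
    return out

# Illustrative data only: 108 rows, split into 10 subsets.
df = pd.DataFrame({'x': range(108)})
folds = balanced_vfold_labels(df, 10, seed=0)
print(folds['V'].value_counts().sort_index())
```

With N = 108 and V = 10 this gives eight subsets of 11 samples and two of 10. Because every row receives exactly one label, the subsets are mutually exclusive by construction, and passing a fixed `seed` makes the split reproducible.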



