random: Split pandas dataframe into mutually exclusive subsets

Friday, 17 March 2017

I am using regression tree analysis on data contained in a pandas dataframe. In order to perform V-fold cross validation, I need to split my data into V random, mutually exclusive subsets.
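To make the goal concrete, here is a minimal sketch of how such subsets would be consumed once every row carries a subset label. The toy frame, the label column name `V`, and the helper name `iter_folds` are assumptions for illustration only, not part of any library.

```python
import pandas as pd

def iter_folds(data, label_col="V"):
    """Yield (v, train, test) for each distinct fold label in label_col."""
    for v in sorted(data[label_col].unique()):
        is_test = data[label_col] == v
        # The held-out fold is the test set; all other rows are training data
        yield int(v), data[~is_test], data[is_test]

# Toy data: 6 rows already labelled with 3 folds (hypothetical values)
toy = pd.DataFrame({"x": range(6), "V": [1, 2, 3, 1, 2, 3]})
fold_sizes = {v: (len(train), len(test)) for v, train, test in iter_folds(toy)}
```

Because each row has exactly one label, every row lands in exactly one test set, which is what makes the subsets mutually exclusive.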

Here is what I've worked out so far. I add a new column V to the dataframe to denote which subset each sample is a member of (here with V = 10 subsets):

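def Vfold_Subsets(Data, V):
    # Work on a copy so that dropping rows does not touch Data itself
    subs = Data.copy()
    # Every row starts in the last subset, V
    Data['V'] = V
    N = Data.shape[0]
    n = N // V  # size of each of the first V-1 subsets
    for v in range(1, V):
        # Draw n rows from those not yet assigned to a subset
        sample = subs.sample(n=n)
        # Use .loc to avoid chained assignment
        Data.loc[Data.index.isin(sample.index), 'V'] = v
        # drop returns a new frame, so the result must be reassigned
        subs = subs.drop(sample.index)
    return Data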

This method works, but I have a feeling there is a better way to do it. A downside of this method is that if N = 108, then

for v in range(1, V + 1):
    print(v, ': ', Data['V'][Data['V'] == v].count())

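returns a count of 10 for each of subsets 1 to 9 and 18 for subset 10 (n = 108 // 10 = 10, and the 18 rows that are never sampled keep the initial label V = 10).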

I think it would be better if I could achieve a more even split, with subset sizes that differ by at most one, so that I don't lump all the remaining samples into the last bin.
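One possible way to get such an even split, sketched here under the assumption that NumPy is available: build the labels 1..V round-robin so that the counts differ by at most one, then shuffle them so that membership is random. The function name `assign_folds` and the `seed` parameter are hypothetical and not from the code above.

```python
import numpy as np
import pandas as pd

def assign_folds(data, V, seed=None):
    """Return a copy of data with a 'V' column of fold labels 1..V.

    Fold sizes differ by at most one row.
    """
    rng = np.random.default_rng(seed)
    N = len(data)
    # Labels repeated round-robin: 1, 2, ..., V, 1, 2, ...
    labels = np.arange(N) % V + 1
    # Shuffle in place so that which rows share a label is random
    rng.shuffle(labels)
    out = data.copy()
    out["V"] = labels
    return out

# Example with N = 108 rows and V = 10 subsets
df = pd.DataFrame({"x": np.arange(108)})
folded = assign_folds(df, V=10, seed=0)
counts = folded["V"].value_counts().sort_index()
```

With N = 108 and V = 10, `counts` holds 11 for subsets 1 to 8 and 10 for subsets 9 and 10. Unlike the function above, this sketch returns a labelled copy rather than modifying the input frame.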
