Thursday, April 5, 2018

Sample pandas group using weights (or counts) from another column

Consider the following dataframe:

import pandas as pd

df = pd.DataFrame({
    'A': [1, 3, 1, 1, 5, 1, 6, 6, 3, 6],
    'B': ['x', 'y', 'a', 'f', 'e', 'y', 'q', 'o', 'h', 'j'],
    'Weights': [0.5, 1, 0.5, 0.5, 2, 0.5, 0.66666, 0.66666, 1, 0.66666],
    'Counts': [4, 2, 4, 4, 1, 4, 3, 3, 2, 3],
})

It looks like this:

A  B  Counts  Weights
1  x       4  0.50000
3  y       2  1.00000
1  a       4  0.50000
1  f       4  0.50000
5  e       1  2.00000
1  y       4  0.50000
6  q       3  0.66666
6  o       3  0.66666
3  h       2  1.00000
6  j       3  0.66666

Counts and Weights are generated columns. Counts is the number of rows that share the same value of A. Weights is the over- (or under-) sampling factor, here 2 / Counts, by which I want to sample the data (columns A and B) so that every group ends up with 2 elements. The output should be something like this:
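For reference, here is a minimal sketch of how the two generated columns could be derived from A alone. The name TARGET_SIZE is an assumption introduced for illustration (it is the group size of 2 mentioned above), and the exact value 2/3 stands in for the truncated 0.66666 shown in the table.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 3, 1, 1, 5, 1, 6, 6, 3, 6],
    'B': ['x', 'y', 'a', 'f', 'e', 'y', 'q', 'o', 'h', 'j'],
})

TARGET_SIZE = 2  # hypothetical name: desired number of rows per value of A

# Counts: number of rows that share the same value of A
df['Counts'] = df.groupby('A')['A'].transform('count')

# Weights: factor by which each group has to grow (> 1) or shrink (< 1)
# in order to reach TARGET_SIZE rows
df['Weights'] = TARGET_SIZE / df['Counts']
```

A weight above 1 means the group has to be sampled with replacement; a weight below 1 means it can be sampled without replacement.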

A B
1 x
1 f
3 y
3 h
5 e
5 e
6 j
6 q

Here is what I have tried:

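def sampler(x, p):
    # x is one group (a DataFrame), p is that group's weight
    if p >= 1:
        # oversample: draw int(p) rows with replacement
        q = int(p)
        return x.sample(n=q, replace=True)
    else:
        # undersample: keep a fraction p of the rows, without replacement
        return x.sample(frac=p, replace=False)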

#(not sure I am using the 'sample' method properly)

newDF = df.groupby('Weights').apply(lambda x: sampler(x,x['Weights'].iloc[0])).reset_index()

...but I get the following error:

ValueError: cannot insert Weights, already exists
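A likely reading of the error: the result of groupby('Weights').apply(...) carries the group key as an index level named Weights, while the sampled rows still contain a Weights column, so reset_index() tries to insert a second column with the same name. Calling reset_index(drop=True) instead should avoid the exception. Note also that int(p) treats the weight as a row count, so a group with weight 1 and 2 rows would come back with a single row.

The sketch below is one possible alternative, not the only way to do it. It groups by A rather than by Weights, converts the weight into a row count with round(weight * group size), and loops over the groups instead of using apply. The names sample_group and seed are assumptions introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 3, 1, 1, 5, 1, 6, 6, 3, 6],
    'B': ['x', 'y', 'a', 'f', 'e', 'y', 'q', 'o', 'h', 'j'],
    'Weights': [0.5, 1, 0.5, 0.5, 2, 0.5, 0.66666, 0.66666, 1, 0.66666],
    'Counts': [4, 2, 4, 4, 1, 4, 3, 3, 2, 3],
})


def sample_group(group, seed=None):
    """Sample one group of A according to its (constant) weight."""
    # Weights is the same for every row of a group, so the first value suffices
    weight = group['Weights'].iloc[0]
    # Turn the factor into a row count, e.g. 0.66666 * 3 rounds to 2
    n = int(round(weight * len(group)))
    # Draw with replacement only when more rows are needed than exist
    return group.sample(n=n, replace=n > len(group), random_state=seed)


# Iterate over the groups explicitly and stitch the samples back together;
# drop=True discards the old index instead of inserting it as a column
pieces = [sample_group(group, seed=0) for _, group in df.groupby('A')]
newDF = pd.concat(pieces)[['A', 'B']].reset_index(drop=True)
```

Which rows are picked within each group depends on the seed, but every value of A should end up with exactly 2 rows, and the single row of group 5 is repeated.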



