Consider the following dataframe:
    import pandas as pd

    df = pd.DataFrame({
        'A': [1, 3, 1, 1, 5, 1, 6, 6, 3, 6],
        'B': ['x', 'y', 'a', 'f', 'e', 'y', 'q', 'o', 'h', 'j'],
        'Weights': [0.5, 1, 0.5, 0.5, 2, 0.5, 0.66666, 0.66666, 1, 0.66666],
        'Counts': [4, 2, 4, 4, 1, 4, 3, 3, 2, 3],
    })
It looks like this:
    A  B  Counts  Weights
    1  x       4  0.50000
    3  y       2  1.00000
    1  a       4  0.50000
    1  f       4  0.50000
    5  e       1  2.00000
    1  y       4  0.50000
    6  q       3  0.66666
    6  o       3  0.66666
    3  h       2  1.00000
    6  j       3  0.66666
Counts and Weights are generated columns. Counts is the number of rows that share the same value of A, and Weights is the over- (or under-) sampling factor, 2 / Counts, by which I want to sample the data (columns A and B) so that every group ends up with exactly 2 elements. The output should look something like this:
    A  B
    1  x
    1  f
    3  y
    3  h
    5  e
    5  e
    6  j
    6  q
Here is what I have tried:
    def sampler(x, p):
        if p >= 1:
            q = int(p)
            return x.sample(n=q, replace=True)
        else:
            return x.sample(frac=p, replace=False)
    # (not sure I am using the 'sample' method properly)

    newDF = df.groupby('Weights').apply(lambda x: sampler(x, x['Weights'].iloc[0])).reset_index()
...but I get the following error:
ValueError: cannot insert Weights, already exists
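For reference, the error most likely comes from reset_index(): after groupby('Weights').apply(...), the group key Weights ends up in the index while a Weights column still exists in the sampled rows, so pandas cannot insert it a second time. Below is a minimal sketch of one way to get 2 rows per value of A without that clash. It is not necessarily the only or best approach. It groups on A rather than on Weights and samples with replacement only when a group has fewer than 2 rows. The names sample_group, TARGET, seed and new_df are made up for illustration.

```python
import pandas as pd

# Same data as in the question, without the generated columns.
df = pd.DataFrame({
    'A': [1, 3, 1, 1, 5, 1, 6, 6, 3, 6],
    'B': ['x', 'y', 'a', 'f', 'e', 'y', 'q', 'o', 'h', 'j'],
})

TARGET = 2  # hypothetical name: desired number of rows per group


def sample_group(group, n=TARGET, seed=0):
    """Draw n rows from one group (hypothetical helper).

    Rows are drawn with replacement only when the group has fewer than
    n rows (oversampling); otherwise they are drawn without replacement.
    """
    return group.sample(n=n, replace=len(group) < n, random_state=seed)


# Iterate over the groups explicitly and concatenate the samples, so the
# group key never ends up in the index and no reset_index() clash occurs.
parts = [sample_group(g) for _, g in df.groupby('A')]
new_df = pd.concat(parts, ignore_index=True)

print(new_df)
```

Which rows are picked depends on the seed, but each value of A should appear exactly twice. If you would rather keep the original groupby('Weights').apply(...) call, passing drop=True to reset_index() should also avoid the ValueError, since the index levels are then discarded instead of being inserted as columns.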