I have a DataFrame and I want to sample 20 percent of the data. However, my data is not balanced, so I need to sample 20 percent of each category (20% positive, 20% negative, and 20% neutral).
After sampling, I want to save the rest of the data in a new DataFrame.
This is my DataFrame:
import pandas as pd

df = pd.DataFrame({'text': ['hello', 'how', 'good', 'bad', 'ok', 'bye', 'ol'], 'Sentiment': [0, 1, 1, 1, 2, 0, 2]})
## Sample 20% (for simplicity, n=1) of the data based on the distribution in the Sentiment column:
dfsample = df.groupby('Sentiment').apply(lambda x: x.sample(n=1))
text   Sentiment
hello  0
how    1
ol     2
## Keep the rest of the data:
df_rest = df.loc[~df.index.isin(dfsample.index)]
This df_rest outputs the original df, because the indexes in dfsample do not match the indexes in df.
The problem is that when we use apply, the index changes, so I cannot retrieve the remaining data.
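A minimal sketch of what seems to be happening, assuming default groupby settings: apply puts the group key in front of the original row label, so dfsample ends up with a two-level index that never matches the flat index of df. The random_state argument and the name original_labels are additions for illustration, not part of the original code, and recent pandas versions may print a deprecation warning about grouping columns for this call.

```python
import pandas as pd

df = pd.DataFrame({'text': ['hello', 'how', 'good', 'bad', 'ok', 'bye', 'ol'],
                   'Sentiment': [0, 1, 1, 1, 2, 0, 2]})

# Same sampling call as in the question; random_state is an added assumption
# so that repeated runs pick the same rows.
dfsample = df.groupby('Sentiment').apply(lambda x: x.sample(n=1, random_state=0))

# The result is indexed by (Sentiment, original row label), i.e. two levels.
print(dfsample.index.nlevels)

# Dropping the outer (group key) level recovers the original row labels.
original_labels = dfsample.index.droplevel(0)

# With matching labels, the isin filter now removes the sampled rows.
df_rest = df.loc[~df.index.isin(original_labels)]
print(df_rest)
```

Passing group_keys=False to groupby should have a similar effect, since the group key is then not added to the index in the first place.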
I searched the internet and learned that transform keeps the original index. But transform is used to return one scalar value based on the group.
This is the ideal output in this simple example:
text  Sentiment
good  1
bad   1
ok    2
bye   0
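One possible way to get this output, sketched under the assumption that pandas 1.1 or later is available: GroupBy.sample draws rows within each group and keeps the original row labels, so the remaining rows can be selected by dropping those labels. The random_state value is an arbitrary choice for repeatability, so the rows picked may differ from the ones shown above.

```python
import pandas as pd

df = pd.DataFrame({'text': ['hello', 'how', 'good', 'bad', 'ok', 'bye', 'ol'],
                   'Sentiment': [0, 1, 1, 1, 2, 0, 2]})

# Draw one row per Sentiment value; the original index labels are preserved.
# random_state is an assumption added so the example is repeatable.
dfsample = df.groupby('Sentiment').sample(n=1, random_state=0)

# Everything that was not sampled: drop the sampled rows by label.
df_rest = df.drop(dfsample.index)

print(dfsample)
print(df_rest)
```

On real data, frac=0.2 can be passed in place of n=1 to take 20% of each category. On a frame this small, a 20% fraction may select no rows from the smallest groups, which is why n=1 is used here.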