Monday, 4 October 2021

How to sample from a pandas DataFrame without changing the index, and keep the remaining rows

I have a DataFrame and I want to sample 20 percent of the data. However, my data is not balanced, so I need to sample 20 percent of each category (20% positive, 20% negative and 20% neutral).

After sampling, I want to save the rest of the data in a new DataFrame.

This is my data frame:

import pandas as pd
df = pd.DataFrame({'text':['hello', 'how', 'good', 'bad', 'ok', 'bye', 'ol'], 'Sentiment':[0, 1, 1, 1, 2, 0, 2]})

## Sample 20% of each category in the Sentiment column (n=1 here for simplicity):

dfsample = df.groupby('Sentiment').apply(lambda x: x.sample(n=1))

text    Sentiment
hello   0
how     1
ol      2

##keep the rest of the data:

df_rest = df.loc[~df.index.isin(dfsample.index)]

df_rest ends up identical to the original df, because the index of dfsample does not match the index of df.

The problem is that apply changes the index, so I cannot retrieve the remaining rows.

I searched online and learned that transform keeps the original index. However, transform is used to return one scalar value per group.
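To make the index change concrete, here is a minimal sketch of what seems to happen. It assumes the default group_keys=True behaviour of groupby, and the random_state value is only there to make the run repeatable (it is not part of the original code). After apply, the result carries a two-level index (group key, original row label), so a plain isin check against it matches nothing; the original labels can still be read from the last level.

```python
import pandas as pd

df = pd.DataFrame({'text': ['hello', 'how', 'good', 'bad', 'ok', 'bye', 'ol'],
                   'Sentiment': [0, 1, 1, 1, 2, 0, 2]})

# group_keys=True is the groupby default; spelled out here for clarity.
# random_state=0 is an assumption added for reproducibility.
dfsample = df.groupby('Sentiment', group_keys=True).apply(
    lambda x: x.sample(n=1, random_state=0)
)

# The index now has two levels: (Sentiment value, original row label).
print(dfsample.index)

# Comparing flat integer labels against (group, label) tuples finds no match,
# so the naive filter keeps every row of df.
naive_rest = df.loc[~df.index.isin(dfsample.index)]

# The original row labels live in the last index level.
original_labels = dfsample.index.get_level_values(-1)
df_rest = df.loc[~df.index.isin(original_labels)]
print(df_rest)
```

Passing group_keys=False to groupby is another way to avoid the extra index level, at the cost of losing the group key from the index.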

This is the ideal output in this simple example:

text    Sentiment
good    1
bad     1
ok      2
bye     0
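One possible way to reach this kind of output, sketched under a few assumptions: pandas 1.1 or later (where groupby objects have their own sample method), a unique index on df, and a random_state chosen here only for repeatability. Because the draw is random, the specific rows picked may differ from the example above, but each category contributes one row and the remainder keeps its original labels.

```python
import pandas as pd

df = pd.DataFrame({'text': ['hello', 'how', 'good', 'bad', 'ok', 'bye', 'ol'],
                   'Sentiment': [0, 1, 1, 1, 2, 0, 2]})

# Sampling directly on the groupby object keeps the original row labels.
# n=1 mirrors the simplified example; on real data frac=0.2 would draw 20% per group.
dfsample = df.groupby('Sentiment').sample(n=1, random_state=42)

# Drop the sampled labels to get the remaining rows (assumes df.index is unique).
df_rest = df.drop(dfsample.index)

print(dfsample)
print(df_rest)
```

Note that with very small groups, frac=0.2 can round down to zero rows for a group, which is why the toy example sticks to n=1.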


