random: For each unique value in a pandas DataFrame column, how can I randomly select a proportion of rows?

Saturday, 16 October 2021

Python newbie here. Imagine a csv file that looks something like this:

(...except that in real life, there are 20 distinct names in the Person column, and each Person has 300-500 rows. Also, there are multiple data columns, not just one.)

What I want to do is randomly flag 10% of each Person's rows and mark this in a new column. I came up with a ridiculously convoluted way to do this--it involved creating a helper column of random numbers and all sorts of unnecessarily complicated jiggery-pokery. It worked, but was crazy. More recently, I came up with this:

import pandas as pd

df = pd.read_csv('source.csv')
df['selected'] = ''

names = list(df['Person'].unique())  # gets list of unique names

for name in names:
    df_temp = df[df['Person'] == name]
    samp = int(len(df_temp) / 10)  # I want to sample 10% for each name
    df_temp = df_temp.sample(samp)
    df_temp['selected'] = 'bingo!'  # a new column to mark the rows I've randomly selected
    df = df.merge(df_temp, how='left', on=['Person', 'data'])
    df['temp'] = [f"{a} {b}" for a, b in zip(df['selected_x'], df['selected_y'])]
    # Note: initially instead of the line above, I tried the line below, but it didn't work too well:
    # df['temp'] = df['selected_x'] + df['selected_y']
    df = df[['Person', 'data', 'temp']]
    df = df.rename(columns={'temp': 'selected'})

df['selected'] = df['selected'].str.replace('nan', '').str.strip()  # cleans up the column

As you can see, I'm essentially pulling out a temporary DataFrame for each Person, using DataFrame.sample(n) to do the randomising, then using DataFrame.merge to get the 'marked' rows back into the original DataFrame. It also involves iterating through a list to create each temporary DataFrame, and my understanding is that iterating is kind of lame.
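For what it's worth, the loop can probably be shortened even before removing the iteration. The sketch below assumes the DataFrame has a unique index (which a fresh pd.read_csv gives you): it records the index labels that sample returns and marks them with .loc, so the merge and the string clean-up are no longer needed. The function name flag_by_loop and its parameters are hypothetical, not something from the code above.

```python
import numpy as np
import pandas as pd


def flag_by_loop(df, group_col="Person", seed=None):
    """Mark roughly 10% of each group's rows, still using a loop.

    Assumes df has a unique index; flag_by_loop is an illustrative name.
    """
    out = df.copy()
    out["selected"] = ""
    # One shared random state so each group draws different random numbers.
    rng = np.random.RandomState(seed)
    for name in out[group_col].unique():
        group = out[out[group_col] == name]
        n = len(group) // 10  # same rounding-down as int(len(df_temp) / 10)
        picked = group.sample(n, random_state=rng).index
        # Write the flag straight into the copy via the index labels.
        out.loc[picked, "selected"] = "bingo!"
    return out
```

Calling flag_by_loop(df) returns a flagged copy and leaves the original DataFrame untouched.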

There's got to be a more Pythonic, vectorised way to do this, right? Without iterating. Maybe something involving groupby? Any thoughts or advice much appreciated.
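One possible groupby-based sketch, offered as an illustration rather than the definitive answer: it assumes pandas 1.1 or later (where the groupby object has a sample method) and a unique index on the DataFrame. The function name flag_fraction and its parameters are hypothetical.

```python
import pandas as pd


def flag_fraction(df, group_col="Person", frac=0.1, seed=None):
    """Flag a random fraction of rows within each group, without an explicit loop.

    Assumes pandas >= 1.1 and a unique index; flag_fraction is an illustrative name.
    """
    # Sample within each group; the result keeps the original index labels.
    picked = df.groupby(group_col).sample(frac=frac, random_state=seed).index
    out = df.copy()
    out["selected"] = ""
    out.loc[picked, "selected"] = "bingo!"
    return out
```

A call such as flag_fraction(df, frac=0.1) returns a copy with the new 'selected' column. As far as I can tell, frac rounds each group's sample size to the nearest whole row, whereas int(len(df_temp) / 10) rounds down, so the per-Person counts may differ by one from the loop version.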
