Monday, October 21, 2019

Groupby and replacing certain values with a random number in a certain range for each group

This question is about replacing some values in a column, group by group after a groupby, with values that are unique within each group and depend on a condition.

I have a dataframe like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({'match_id': ['m1', 'm1', 'm1', 'm1', 'm1', 'm1', 'm2', 'm2', 'm2', 'm2', 'm2', 'm2', 'm3', 'm3', 'm3', 'm3'],
                   'name': ['peter', 'mike', 'jeff', 'john', 'alex', 'joe', 'jeff', 'peter', 'alex', 'li', 'joe', 'tom', 'mike', 'john', 'tom', 'peter'],
                   'rank': [3, 3, 3, 3, 1, 2, 1, 2, 4, 2, 3, 2, 1, 2, 3, 2],
                   'rating': [1500, 1500, 1500, 1500, 1550, 1540, 1640, 1500, 1390, 1500, 1450, 1500, 1720, 1500, 1320, 1500]})

Within each group of "match_id" values, I need to modify some numbers based on a condition on another column.

So I first group by match_id. Then, for every row whose "rating" is 1500, I want to replace the value in "rank" with a number between 1 and the size of that row's group, and that number must be unique within the group.

This is what I've done so far:


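new = pd.DataFrame()
grouped = df.groupby('match_id', sort=False)
for name, dfg in grouped:
    dfm = dfg.copy()
    # The comparison yields one boolean per row, so this is the group size,
    # not the number of 1500s (that would be (dfm['rating'] == 1500).sum()).
    num = len(dfm['rating'] == 1500)
    # Draw a permutation of 1..len(dfm) and keep it only where rating is 1500.
    dfm['rank'] = np.where(dfm['rating'] == 1500, np.random.choice(range(1, len(dfm) + 1), num, replace=False), dfm['rank'])
    new = pd.concat([new, dfm], sort=True)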
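A side note on `num` in the loop above: taking `len()` of a boolean comparison counts every row, while `.sum()` counts only the matches. The loop still runs because `np.where` needs its replacement array to have one entry per row. A small sketch using the ratings of group m3 (the variable names here are only illustrative):

```python
import pandas as pd

rating = pd.Series([1720, 1500, 1320, 1500])  # ratings of group m3
mask = rating == 1500                         # one boolean per row

n_rows = len(mask)            # counts every row, matching or not
n_matches = int(mask.sum())   # counts only the True entries
print(n_rows, n_matches)
```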

This runs, but it has two problems. First, a number generated this way may already be the rank of another row in the same group. I want each generated number to be unique, meaning that no other row in the corresponding group has it.
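One way to read the uniqueness requirement is: sample only from the ranks in 1..n that the untouched rows of the group do not use. The sketch below does that for a single group. The helper name `fill_unique_ranks`, its `target` parameter, and the seed are assumptions made for illustration, not part of the original code.

```python
import numpy as np

def fill_unique_ranks(rank, rating, rng, target=1500):
    """Return a copy of `rank` in which rows with rating == target get
    random ranks from 1..n that no untouched row of the group uses."""
    rank = np.asarray(rank).copy()
    mask = np.asarray(rating) == target
    if not mask.any():
        return rank
    n = len(rank)
    # Ranks in 1..n that are not held by the rows we leave alone.
    free = np.setdiff1d(np.arange(1, n + 1), rank[~mask])
    # Sampling without replacement keeps the new ranks distinct from each other.
    rank[mask] = rng.choice(free, size=int(mask.sum()), replace=False)
    return rank

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
# Group m3 from the question: ranks [1, 2, 3, 2], ratings [1720, 1500, 1320, 1500]
new_rank = fill_unique_ranks([1, 2, 3, 2], [1720, 1500, 1320, 1500], rng)
print(new_rank)
```

The pool of free ranks is always at least as large as the number of rows to fill, because the untouched rows can occupy at most one distinct rank each.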

Second, this is far too slow for my real dataset (125000 groups), so I also need something much more efficient than this loop with np.where().
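Part of the cost is likely the repeated `pd.concat` and the per-group DataFrame copies, since each concat copies everything accumulated so far. A possible sketch, untested on the real dataset, works on plain NumPy arrays and writes results into one preallocated array, using the integer row positions from `GroupBy.indices`. The smaller example frame and the seed are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Reduced example data (illustrative, not the full frame from the question).
df = pd.DataFrame({
    'match_id': ['m1', 'm1', 'm1', 'm2', 'm2', 'm2', 'm2'],
    'rank':     [3, 3, 1, 1, 2, 4, 2],
    'rating':   [1500, 1500, 1550, 1640, 1500, 1390, 1500],
})

rng = np.random.default_rng(0)                    # arbitrary seed
rank = df['rank'].to_numpy().copy()               # output array, filled in place
mask = (df['rating'] == 1500).to_numpy()

# .indices maps each match_id to the integer positions of its rows.
for pos in df.groupby('match_id', sort=False).indices.values():
    m = mask[pos]
    if not m.any():
        continue
    # Ranks in 1..group size not held by rows that keep their rank.
    free = np.setdiff1d(np.arange(1, len(pos) + 1), rank[pos][~m])
    rank[pos[m]] = rng.choice(free, size=int(m.sum()), replace=False)

result = df.assign(rank=rank)  # single assignment, no concat in the loop
print(result)
```

This still loops over groups in Python, so it is not fully vectorized; it only removes the per-iteration DataFrame work.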

Any help is highly appreciated.



