Monday, November 4, 2019

Randomize values within a DataFrame column

I have a DataFrame (df) with a variable containing a group number. Each observation has a group number from 1 to 80. I would like to create a new variable, called new_group, containing a new random number from 1 to 80 for each observation. However, the new group numbers must be consistent with the original ones: if two observations were in group 1, both observations should receive the same new random group number.

Example:

observation    group   random_group
0                1         4
1                2         3
2                1         4
3                43        1
4                1         4
5                21        80
6                43        1
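The consistency requirement can be written as a small check. This is only a sketch: the `example` frame below is rebuilt by hand from the table above, and the second test assumes that distinct original groups must also receive distinct new numbers (which is what shuffling 1 to 80 implies).

```python
import pandas as pd

# The example table from the question, rebuilt by hand
example = pd.DataFrame({
    "group": [1, 2, 1, 43, 1, 21, 43],
    "random_group": [4, 3, 4, 1, 4, 80, 1],
})

# Each original group must map to exactly one new group number
one_label_per_group = bool(
    example.groupby("group")["random_group"].nunique().eq(1).all()
)

# Assumption: distinct original groups must not share a new number
pairs = example.drop_duplicates()
no_collisions = not pairs["random_group"].duplicated().any()

print(one_label_per_group, no_collisions)
```

Running the same two checks on the real DataFrame after any of the attempts below would show whether the mapping was applied consistently.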

I am using Python 3.7. I tried the following:

  1. I created a dictionary with keys from 1 to 80 and values from 1 to 80 in a different, random order. The idea is to use this dictionary for an Excel "vlookup"-style match.

  2. I created a new DataFrame with two columns: one column with values from 1 to 80, and another with the numbers 1 to 80 in a different, random order. The idea is to merge the original DataFrame with the new one.

Here is what I did:

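import random

# Keys 1..80 in order, values 1..80 in random order
ordered_group = list(range(1, 81))
random_group = random.sample(range(1, 81), 80)
group_dict = dict(zip(ordered_group, random_group))

# Look up each observation's group in the dictionary
df['new_group'] = df.group.map(group_dict)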

The new_group column contains only NaN.

I also tried this instead of the last line:

df['new_group'] = df["group"].apply(lambda x: group_dict.get(x))

Now it maps each of the 80 groups correctly once, but it does not cover all observations.
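When a dictionary lookup leaves NaN behind, it can help to inspect which values in the column have no matching key. The question does not show the dtype of the group column, so the sketch below uses hypothetical data in which the group numbers are stored as strings. That is only one possible cause, not a confirmed diagnosis.

```python
import random
import pandas as pd

# Hypothetical data: group numbers stored as strings, not integers
df = pd.DataFrame({"group": ["1", "2", "1", "43", "1", "21", "43"]})

group_dict = dict(zip(range(1, 81), random.sample(range(1, 81), 80)))

# Inspect the column type and the values that have no key in the dictionary
print(df["group"].dtype)
unmatched = df.loc[~df["group"].isin(list(group_dict)), "group"]
print(unmatched.unique())

# With string values and integer keys, every lookup misses
mismatched = df["group"].map(group_dict)

# Converting the column to integers lets the keys match
df["group"] = df["group"].astype(int)
df["new_group"] = df["group"].map(group_dict)
```

If `unmatched` is empty on the real data, the cause lies elsewhere and the types are not the problem.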

I also tried using merge instead of map:

import random
import pandas as pd

# Lookup table: 'group' holds 1..80, 'new_group' holds 1..80 in random order
random_group = list(range(1, 81))
random_group = pd.DataFrame(random_group)
random_group['new_group'] = random.sample(range(1, 81), 80)
random_group.rename(columns={0: 'group'}, inplace=True)

df = df.merge(random_group, on='group', how='outer')

It maps each of the 80 groups correctly once, but it does not cover all observations.

So I get something like this:

observation    group   random_group
0                1         4
1                2         3
2                1         nan
3                43        1
4                1         nan
5                21        80
6                43        nan
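For the merge attempt, one thing worth comparing is a left merge. An outer merge also adds a row for every group number that never appears in df, whereas a left merge keeps exactly the rows of df in their original order. The sketch below uses a hypothetical seven-row frame and a `lookup` table built the same way as in the question. It illustrates the left-merge behavior and does not explain the NaN values shown above.

```python
import random
import pandas as pd

# Hypothetical sample data with repeated group numbers
df = pd.DataFrame({"group": [1, 2, 1, 43, 1, 21, 43]})

# Lookup table: one row per group number, new numbers in random order
lookup = pd.DataFrame({
    "group": list(range(1, 81)),
    "new_group": random.sample(range(1, 81), 80),
})

# Left merge: every row of df is kept; validate raises an error
# if a group number appears more than once in the lookup table
merged = df.merge(lookup, on="group", how="left", validate="many_to_one")

print(merged)
```

Every row with the same group value receives the same new_group value, because each group number appears exactly once in `lookup`.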

My two methods seem to work, but they do not cover the whole DataFrame. Any idea where I went wrong? Any more efficient method is also welcome.

Thank you!



