random: Pandas Replace NaN values based on random sample of values conditional on another column

Thursday, January 30, 2020

Say I have a dataframe like so:

import pandas as pd
import numpy as np

np.random.seed(0)

df = {}
df['x'] = np.concatenate([np.random.uniform(0, 5, 4), np.random.uniform(5, 10, 4)])
df['y'] = np.concatenate([[0] * 4, [1] * 4])
df = pd.DataFrame(df)

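# Append two rows with a missing x (np.nan; the np.NaN alias was removed in NumPy 2.0).
# Using len(df) + 1 as the label gives rows 9 and 10, skipping label 8.
df.loc[len(df) + 1] = [np.nan, 0]
df.loc[len(df) + 1] = [np.nan, 1]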
df
Out[232]: 
           x    y
0   2.744068  0.0
1   3.575947  0.0
2   3.013817  0.0
3   2.724416  0.0
4   7.118274  1.0
5   8.229471  1.0
6   7.187936  1.0
7   9.458865  1.0
9        NaN  0.0
10       NaN  1.0

What I want to do is fill each NaN in x with a value drawn at random from the non-missing x values that share the same y value.

For example, in row 9 where y is 0, I want to replace the NaN with a number randomly sampled only from x values where the value of y is 0. Effectively, I'd be sampling from this list:

df.loc[df['y'] == 0, 'x'].dropna().tolist()
Out[233]: [2.7440675196366238, 3.5759468318620975, 3.0138168803582195, 2.724415914984484]

Similarly, for row 10 I'd sample only from the x values where y is 1 rather than 0. I can't figure out a way to do this programmatically, at least not without resorting to bad practice such as iterating over the dataframe rows.

I've consulted "Pandas: Replace NaN Using Random Sampling of Column Values", which shows how to randomly sample from all values in a column, but I need the sample to be conditional on the distinct values of another column. I've also seen answers that replace NaNs with a conditional mean (such as this), but I want to sample randomly rather than use the mean.
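
To make the goal concrete, here is a minimal sketch of the kind of thing I'm after, not a definitive solution. It assumes sampling with replacement is acceptable and that a group with no observed x values should simply be left as NaN. The helper name fill_nan_by_group_sample and its seed parameter are made up for illustration; groupby(...).transform and numpy's Generator.choice are the only library pieces it relies on.

```python
import numpy as np
import pandas as pd


def fill_nan_by_group_sample(df, value_col, group_col, seed=None):
    """Hypothetical helper: fill NaNs in value_col by sampling (with
    replacement) from the observed values that share the same group_col."""
    rng = np.random.default_rng(seed)

    def fill_group(s):
        missing = s.isna()
        observed = s[~missing].to_numpy()
        # Only touch groups that have both gaps and something to sample from.
        if missing.any() and observed.size > 0:
            s = s.copy()
            s[missing] = rng.choice(observed, size=int(missing.sum()))
        return s

    out = df.copy()
    # transform calls fill_group once per distinct group value, not per row.
    out[value_col] = out.groupby(group_col)[value_col].transform(fill_group)
    return out


# Rebuild the example frame from the question.
np.random.seed(0)
df = pd.DataFrame({
    'x': np.concatenate([np.random.uniform(0, 5, 4), np.random.uniform(5, 10, 4)]),
    'y': np.concatenate([[0] * 4, [1] * 4]),
})
df.loc[len(df) + 1] = [np.nan, 0]
df.loc[len(df) + 1] = [np.nan, 1]

filled = fill_nan_by_group_sample(df, 'x', 'y', seed=0)
```

With this, filled.loc[9, 'x'] would be one of the four x values where y is 0, and filled.loc[10, 'x'] one of the four where y is 1, while df itself is left unchanged. The work is done per group rather than per row, which is the part I couldn't work out how to express.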
