Thursday, January 30, 2020

Pandas: Replace NaN values with a random sample of values conditional on another column

Say I have a dataframe like so:

import pandas as pd
import numpy as np

np.random.seed(0)

df = {}
df['x'] = np.concatenate([np.random.uniform(0, 5, 4), np.random.uniform(5, 10, 4)])
df['y'] = np.concatenate([[0] * 4, [1] * 4])
df = pd.DataFrame(df)

df.loc[len(df) + 1] = [np.nan, 0]
df.loc[len(df) + 1] = [np.nan, 1]
df
Out[232]: 
           x    y
0   2.744068  0.0
1   3.575947  0.0
2   3.013817  0.0
3   2.724416  0.0
4   7.118274  1.0
5   8.229471  1.0
6   7.187936  1.0
7   9.458865  1.0
9        NaN  0.0
10       NaN  1.0

What I want to do is fill in each NaN in x with a value randomly sampled from the x values that share the same y value.

For example, in row 9 where y is 0, I want to replace the NaN with a number randomly sampled only from x values where the value of y is 0. Effectively, I'd be sampling from this list:

df[df['y'] == 0]['x'].dropna().values.tolist()
Out[233]: [2.7440675196366238, 3.5759468318620975, 3.0138168803582195, 2.724415914984484]

Similarly, for row 10 I'd sample only from 'x' values where y is 1 rather than 0. I can't figure out a way to do this programmatically, at least not without resorting to bad practice such as iterating through the dataframe rows.

I've consulted "Pandas: Replace NaN Using Random Sampling of Column Values", which shows how to randomly sample from all values in a column, but I need the random sample to be conditional on another column's distinct values. I've also seen answers for replacing NaNs with a conditional mean (such as this), but I'm looking to randomly sample rather than use the mean.
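
For illustration, here is a minimal sketch of one way the behaviour described above might be achieved. It is not a definitive answer. It loops over the distinct y groups (not over rows) and draws replacements with NumPy's random generator. The function name fill_nan_by_group_sample and its seed parameter are hypothetical names introduced for this example, and the sketch assumes sampling with replacement is acceptable and that every group with a NaN has at least one non-missing x value.

```python
import numpy as np
import pandas as pd


def fill_nan_by_group_sample(df, value_col, group_col, seed=None):
    """Hypothetical helper: fill NaNs in value_col by sampling from the
    non-missing values of value_col within the same group_col group."""
    rng = np.random.default_rng(seed)
    out = df.copy()  # leave the caller's dataframe untouched
    missing = out[value_col].isna()

    # Iterate over groups of observed values (one pass per distinct key,
    # not per row); each `pool` holds the candidates for that group.
    for key, pool in out.loc[~missing].groupby(group_col)[value_col]:
        target = missing & (out[group_col] == key)
        n_missing = int(target.sum())
        if n_missing:
            # Sample with replacement (an assumption of this sketch).
            out.loc[target, value_col] = rng.choice(pool.to_numpy(), size=n_missing)
    return out


# Rebuild the example dataframe from the question.
np.random.seed(0)
df = pd.DataFrame({
    'x': np.concatenate([np.random.uniform(0, 5, 4), np.random.uniform(5, 10, 4)]),
    'y': np.concatenate([[0] * 4, [1] * 4]),
})
df.loc[len(df) + 1] = [np.nan, 0]
df.loc[len(df) + 1] = [np.nan, 1]

filled = fill_nan_by_group_sample(df, 'x', 'y', seed=0)
print(filled)
```

Groups whose x values are all missing never appear in the loop, so their NaNs are left as they are; whether that is the right fallback depends on the data.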



