Thursday, June 17, 2021

How to replace null values with randomly sampled values in PySpark?

I have a Spark data frame and I want to replace its null values with values randomly sampled from the same column. I know how to do this in pandas, but when I try the same thing in PySpark I get an error. Since I am new to PySpark, I am wondering where I am going wrong.

df =
   Name   age
0  Jhon  20.0
1   NaN  30.0
2  jack   NaN
3  jhon  40.0
4  jack   NaN
5  prem  20.0

# Sample as many non-null names as there are nulls
random_sample = df['Name'].dropna().sample(df['Name'].isnull().sum(), random_state=0)
print(random_sample)
# Align the sampled values with the null positions, then fill them in
random_sample.index = df[df['Name'].isnull()].index
df.loc[df['Name'].isnull(), 'Name'] = random_sample
df
3 jhon
    Name    age
0   Jhon    20.0
1   jhon    30.0
2   jack    NaN
3   jhon    40.0
4   jack    NaN
5   prem    20.0

PySpark:

rand = df.filter(df['Name'].isNull())
null = df.where(col("Name").isNull()).count()
rand.sample(null, random_state=1)


TypeError: sample() got an unexpected keyword argument 'random_state'

Is the sampling function different in PySpark? How do I fill the null values using random sampling in PySpark?



