I have a Spark data frame, and I want to replace the null values with values randomly sampled from the same column. I know how to do it in Python (pandas), but when I try the same in PySpark I get an error. Since I am new to PySpark, I am wondering where I am going wrong.
df =
Name age
0 Jhon 20.0
1 NaN 30.0
2 jack NaN
3 jhon 40.0
4 jack NaN
5 prem 20.0
random_sample = df['Name'].dropna().sample(df['Name'].isnull().sum(), random_state=0)
print(random_sample)
3    jhon
random_sample.index = df[df['Name'].isnull()].index
df.loc[df['Name'].isnull(), 'Name'] = random_sample
df
Name age
0 Jhon 20.0
1 jhon 30.0
2 jack NaN
3 jhon 40.0
4 jack NaN
5 prem 20.0
PySpark:
rand = df.filter(df['Name'].isNull())
null = df.where(col("Name").isNull()).count()
rand.sample(null, random_state=1)
TypeError: sample() got an unexpected keyword argument 'random_state'
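From the error and a look at the docs, sampling does appear to work differently: pandas' Series.sample takes a row count and a random_state seed, whereas in recent PySpark versions DataFrame.sample has the signature sample(withReplacement=None, fraction=None, seed=None). It expects a fraction of rows between 0.0 and 1.0 rather than a count, uses seed rather than random_state, and returns an approximate rather than exact number of rows. So a call that at least executes would be something like:

rand.sample(fraction=0.5, seed=1)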
So sampling clearly does not work the same way in PySpark. How can I fill the null values using the random sampling method in PySpark?
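For reference, here is the direction I am considering, as a sketch rather than something I am sure is idiomatic. Note that the pandas version samples the replacement values from the non-null names (dropna().sample(...)), while my attempt above filtered the null rows, so the pool has to come from the non-null side. The sketch assumes the column's non-null values are few enough to collect to the driver, and Spark 2.4+ for element_at:

from pyspark.sql import functions as F

# Pool of non-null names, collected to the driver
# (assumes this column's values fit in driver memory)
names = [r["Name"] for r in df.filter(F.col("Name").isNotNull()).select("Name").collect()]

df_filled = (
    df.withColumn("pool", F.array(*[F.lit(n) for n in names]))
      # rand(seed) is uniform on [0, 1); scale it to a 1-based array index
      .withColumn("idx", (F.rand(seed=0) * F.size(F.col("pool"))).cast("int") + F.lit(1))
      # replace only the null names with a randomly chosen pool element
      .withColumn("Name", F.when(F.col("Name").isNull(),
                                 F.expr("element_at(pool, idx)"))
                           .otherwise(F.col("Name")))
      .drop("pool", "idx")
)
df_filled.show()

Is something along these lines the right way, or is there a more standard approach?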