I've got some nulls to impute in a column of a DataFrame. The weights were taken from a value_counts() of the non-null values. The following line of code works perfectly and returns the correct weights, but it takes too long because of the size of the DataFrame:
dataset_2021["genero_usuario"] = dataset_2021["genero_usuario"].apply(lambda x : x if pd.isnull(x) == False else np.random.choice(a = ["M","F"], p=[0.656,0.344]))
The faster vectorized version I want to use doesn't work. First attempt:
dataset_2021.loc[dataset_2021.genero_usuario.isnull(), dataset_2021.genero_usuario] = np.random.choice(a = ["M","F"], p=[0.656,0.344])
This throws the error:
Cannot mask with non-boolean array containing NA / NaN values
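My suspicion is that the column indexer is wrong: dataset_2021.genero_usuario passes the Series itself (whose values contain NaN) instead of the column label, so pandas tries to mask with a non-boolean, NaN-containing array. In other words:

mask = dataset_2021["genero_usuario"].isnull()  # boolean mask, what .loc expects
# dataset_2021.loc[mask, dataset_2021.genero_usuario]  # column indexer is the values -> the error above
# dataset_2021.loc[mask, "genero_usuario"]             # column indexer is the label

Is that reading correct?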
Second attempt:
dataset_2021.fillna(value = {"genero_usuario" : np.random.choice(a = ["M","F"], p=[0.656,0.344])}, inplace = True)
This decreases the weight of "M" and increases the weight of "F": value_counts() now gives M 0.616 and F 0.384.
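My guess at the cause: without a size argument, np.random.choice returns a single scalar, so fillna fills every null with that one draw. Whichever value happened to be drawn on that run ("F" in my case, apparently) gets its share inflated:

import numpy as np

one_draw = np.random.choice(a=["M", "F"], p=[0.656, 0.344])
# one_draw is a single string like 'F', not one draw per null;
# fillna then uses this same value for ALL the nulls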
- Why does the first attempt give an error? (My guess is above.)
- Why does the difference in weights happen? With the lambda the proportions stay unchanged.
- How can I solve it? I don't want to use lambda; I want the code to stay fast. (A sketch of what I'm aiming for is below.)
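For the third bullet, this is the vectorized version I have in mind (a sketch, not yet validated on the full data): build the boolean mask once, draw exactly one value per null with size=mask.sum(), and assign through .loc using the column label:

import numpy as np

mask = dataset_2021["genero_usuario"].isnull()
dataset_2021.loc[mask, "genero_usuario"] = np.random.choice(
    a=["M", "F"], p=[0.656, 0.344], size=mask.sum()
)

Is something like this the right approach?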
Thanks in advance