Saturday, May 22, 2021

np.random.choice not returning correct weights when vectorized

I've got some nulls to impute in a column of a DataFrame. The weights were taken from a value_counts() of the non-null values. The following line of code works perfectly and returns the correct weights, but takes too long because of the size of the DataFrame:

dataset_2021["genero_usuario"] = dataset_2021["genero_usuario"].apply(lambda x : x if pd.isnull(x) == False else np.random.choice(a = ["M","F"], p=[0.656,0.344]))

The faster vectorized version I want to use doesn't work. First attempt:

dataset_2021.loc[dataset_2021.genero_usuario.isnull(), dataset_2021.genero_usuario] = np.random.choice(a = ["M","F"], p=[0.656,0.344])

This throws the error:

Cannot mask with non-boolean array containing NA / NaN values

Second attempt:

dataset_2021.fillna(value = {"genero_usuario" : np.random.choice(a = ["M","F"], p=[0.656,0.344])}, inplace = True)

This decreases the weight of "M" and increases the weight of "F": value_counts() now gives M 0.616 and F 0.384.
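The shift is consistent with the dict value passed to fillna() being evaluated exactly once, before fillna() runs: np.random.choice() returns a single scalar, and every null receives that same scalar. A minimal sketch on toy data (the DataFrame here is hypothetical, standing in for dataset_2021):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"genero_usuario": ["M", "F", "M", None, None, None]})

# The dict value is computed once, before fillna() is called, producing a
# single scalar ("M" or "F"); fillna() then copies that one scalar into
# every null slot.
pick = np.random.choice(a=["M", "F"], p=[0.656, 0.344])
df.fillna(value={"genero_usuario": pick}, inplace=True)

# All three former nulls now hold the identical value `pick`.
print(df["genero_usuario"].tolist())
```

If the draw happens to be "F", every null becomes "F" at once, which is enough to pull the observed proportions from 0.656/0.344 to roughly 0.616/0.384 on a column with a few percent of nulls.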

  1. Why does the first attempt raise that error?
  2. Why does the shift in weights happen? With the lambda the weights stay correct.
  3. How can I solve this? I don't want to use the lambda; I want the code to stay fast.
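One vectorized approach (a sketch, assuming a string column like the one above): select the column by its label rather than passing the Series itself as the second indexer, and ask np.random.choice() for one draw per null via size=mask.sum():

```python
import numpy as np
import pandas as pd

# Toy stand-in for dataset_2021 (hypothetical data).
dataset_2021 = pd.DataFrame(
    {"genero_usuario": ["M", None, "F", None, "M", None]}
)

# Boolean mask of the nulls; one random draw per null, not one scalar
# shared by all of them.
mask = dataset_2021["genero_usuario"].isnull()
dataset_2021.loc[mask, "genero_usuario"] = np.random.choice(
    a=["M", "F"], size=mask.sum(), p=[0.656, 0.344]
)

print(dataset_2021["genero_usuario"].tolist())
```

The column indexer in .loc must be a label (or a boolean mask); passing the Series dataset_2021.genero_usuario there, as in the first attempt, hands pandas a non-boolean array that still contains NaN, which matches the error message shown above.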

Thanks in advance



