Friday, July 30, 2021

Randomly sample non-empty column values for each row of a pandas dataframe

For each row, I would like to randomly sample k columnar indices that correspond to non-null values.

If I start with this dataframe,

import numpy as np
import pandas as pd

A = pd.DataFrame([
    [1, np.nan, 3, 5],
    [np.nan, 2, np.nan, 7],
    [4, 8, 9, np.nan]  # pad the short row explicitly
])
>>> A
     0    1    2    3
0  1.0  NaN  3.0  5.0
1  NaN  2.0  NaN  7.0
2  4.0  8.0  9.0  NaN

To randomly sample 2 non-null values in each row and set them to -1, one approach is:

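import random

B = A.copy()

for i in A.index:
    s = A.loc[i]
    s = s[s.notnull()]  # keep only the non-null entries of this row
    # random.sample raises ValueError if the row has fewer than 2 non-null values
    col_idx = random.sample(s.index.tolist(), 2)
    B.loc[i, col_idx] = -1  # label-based, matching the labels taken from s.index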

>>> B
     0    1    2    3
0 -1.0  NaN -1.0  5.0
1  NaN -1.0  NaN -1.0
2 -1.0 -1.0  9.0  NaN

Is there a better way to do this natively in pandas, without a for loop? The pandas.DataFrame.sample method does not seem to fit: it samples whole rows or whole columns, so every row would get the same sampled columns. Because the dataframe has holes, the set of non-null columns differs from row to row.
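One possible loop-free approach, offered only as a sketch: give every cell a random key, push the null cells to the back of each row, and keep the k smallest keys per row. The helper name `mask_random_non_null` and its `k`, `value`, and `seed` parameters are hypothetical names introduced for illustration. The sketch also assumes that a row with fewer than k non-null values should have all of them replaced rather than raise an error.

```python
import numpy as np
import pandas as pd


def mask_random_non_null(df, k, value=-1, seed=None):
    """Replace k randomly chosen non-null entries per row with `value`.

    Hypothetical helper. Rows with fewer than k non-null entries
    have all of their non-null entries replaced (an assumption).
    """
    rng = np.random.default_rng(seed)
    not_null = df.notna().to_numpy()

    # One random key in [0, 1) per cell; null cells get a larger key
    # so they always sort after the non-null cells of the same row.
    keys = rng.random(df.shape)
    keys[~not_null] = 2.0

    # Double argsort turns the keys into within-row ranks (0 = smallest).
    ranks = keys.argsort(axis=1).argsort(axis=1)

    # The k lowest-ranked non-null cells in each row are the sample.
    chosen = pd.DataFrame(
        (ranks < k) & not_null, index=df.index, columns=df.columns
    )
    return df.mask(chosen, value)


A = pd.DataFrame([
    [1, np.nan, 3, 5],
    [np.nan, 2, np.nan, 7],
    [4, 8, 9, np.nan],
])
B = mask_random_non_null(A, k=2, seed=0)
print(B)
```

Because the keys are independent and uniform, each size-k subset of a row's non-null cells is equally likely, which matches what random.sample does row by row. The work is done on NumPy arrays of the same shape as the dataframe, so memory use grows with the full size of the frame.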



