Wednesday, August 5, 2020

Generating random hierarchical data in a Pandas data frame

I am trying to generate hierarchical random data in a Pandas data frame. As a toy example, suppose I sample x from some distribution, then sample y from a conditional distribution given x, and then sample z from a conditional distribution given x and y, as shown below. In my real problem, x, y, and z can take many more values than just 0 and 1, but the distributions are still represented as dictionaries in the same way.

Is there a more elegant way to generate this data frame? It seems particularly ugly that I have to generate an array of size 1 with np.random.choice and then pick out its only element. The code for generating z is also awkward: I have to extract the x and y columns from row, rather than being able to write something like lambda x, y: ... and have the row automatically unpacked into its columns.

import numpy as np
import pandas as pd

p_x = {0: 0.2, 1: 0.8}
p_y_given_x = {
    0: {0: 0.3, 1: 0.7},
    1: {0: 0.5, 1: 0.5},
}
p_z_given_x_and_y = {
    0: {0: {0: 0.1, 1: 0.9}, 1: {0: 0.5, 1: 0.5}},
    1: {0: {0: 0.5, 1: 0.5}, 1: {0: 0.7, 1: 0.3}},
}

data = pd.DataFrame({
    'x': np.random.choice(a=list(p_x), size=10, p=list(p_x.values()))
})
data['y'] = data['x'].apply(
    lambda x: np.random.choice(
        list(p_y_given_x[x]),
        size=1,
        p=list(p_y_given_x[x].values()),
    )[0],
)
data['z'] = data.apply(
    lambda row: np.random.choice(
        list(p_z_given_x_and_y[row['x']][row['y']]),
        size=1,
        p=list(p_z_given_x_and_y[row['x']][row['y']].values()),
    )[0],
    axis=1,
)
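One possible way to tidy this up, offered as a sketch rather than a definitive answer: group the rows by the values of the parent columns and draw all samples for each group in a single call, so that no per-row lambda and no `[0]` indexing is needed. The helper names `draw` and `draw_given` are hypothetical (they are not part of Pandas or NumPy), and the sketch assumes that every outcome is an integer and that the nested dictionaries contain every parent combination that can occur. If per-row sampling is kept instead, note that np.random.choice returns a single scalar when size is omitted.

```python
import numpy as np
import pandas as pd

p_x = {0: 0.2, 1: 0.8}
p_y_given_x = {
    0: {0: 0.3, 1: 0.7},
    1: {0: 0.5, 1: 0.5},
}
p_z_given_x_and_y = {
    0: {0: {0: 0.1, 1: 0.9}, 1: {0: 0.5, 1: 0.5}},
    1: {0: {0: 0.5, 1: 0.5}, 1: {0: 0.7, 1: 0.3}},
}

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible


def draw(dist, size, rng):
    """Sample `size` values from a {value: probability} dictionary."""
    return rng.choice(list(dist), size=size, p=list(dist.values()))


def draw_given(data, parents, table, rng):
    """Sample one value per row, conditioning on the columns in `parents`.

    `table` is a nested dictionary keyed by the parent values in order.
    Assumption: outcomes are integers (hence dtype=int below).
    """
    result = np.empty(len(data), dtype=int)
    # .indices maps each observed parent combination to its row positions.
    for key, positions in data.groupby(parents).indices.items():
        dist = table
        # The key may be a scalar or a tuple depending on the pandas version.
        for k in key if isinstance(key, tuple) else (key,):
            dist = dist[k]
        # One vectorized draw for all rows that share this combination.
        result[positions] = draw(dist, len(positions), rng)
    return result


data = pd.DataFrame({'x': draw(p_x, 10, rng)})
data['y'] = draw_given(data, ['x'], p_y_given_x, rng)
data['z'] = draw_given(data, ['x', 'y'], p_z_given_x_and_y, rng)
```

With this layout, adding a further level of the hierarchy is one more call to `draw_given` with a longer list of parent columns, and the number of sampling calls grows with the number of distinct parent combinations rather than with the number of rows.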


