I am trying to generate hierarchical random data in a pandas DataFrame. As a toy example, suppose I sample x from some distribution, then sample y from a conditional distribution given x, and then sample z from a conditional distribution given x and y, as shown below. In my real problem, x, y, and z can take many more values than just 0 and 1, but the distributions are still represented as nested dictionaries like the ones below.

Is there a more elegant way to generate this data frame? It seems particularly ugly that I have to generate an array of size 1 using np.random.choice and then pick out its only element. The code for generating z is also awkward: I have to extract the x and y columns from row, rather than being able to write something like lambda x, y: ... and have the row automatically unpacked into arguments.
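import numpy as np
import pandas as pd

# Marginal distribution of x: {value: probability}
p_x = {0: 0.2, 1: 0.8}

# Conditional distribution of y given x: {x: {y: probability}}
p_y_given_x = {
    0: {0: 0.3, 1: 0.7},
    1: {0: 0.5, 1: 0.5},
}

# Conditional distribution of z given x and y: {x: {y: {z: probability}}}
p_z_given_x_and_y = {
    0: {0: {0: 0.1, 1: 0.9}, 1: {0: 0.5, 1: 0.5}},
    1: {0: {0: 0.5, 1: 0.5}, 1: {0: 0.7, 1: 0.3}},
}

# Sample 10 values of x from its marginal distribution
data = pd.DataFrame({
    'x': np.random.choice(a=list(p_x), size=10, p=list(p_x.values()))
})

# Sample y row by row, conditioning on the x value in that row
data['y'] = data['x'].apply(
    lambda x: np.random.choice(
        list(p_y_given_x[x]),
        size=1,
        p=list(p_y_given_x[x].values()),
    )[0],
)

# Sample z row by row, conditioning on both x and y
data['z'] = data.apply(
    lambda row: np.random.choice(
        list(p_z_given_x_and_y[row['x']][row['y']]),
        size=1,
        p=list(p_z_given_x_and_y[row['x']][row['y']].values()),
    )[0],
    axis=1,
)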
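For reference, here is a rough sketch of the kind of simplification I have in mind. It is not a definitive solution. It assumes the same nested-dictionary layout as above and relies on NumPy's choice returning a single scalar when size is left as None, which removes the size=1 and [0] steps. The helper name sample and the seeded generator rng are names I made up for illustration. Iterating with zip over the columns gives the lambda x, y style of access without going through row.

```python
import numpy as np
import pandas as pd

# Seeded generator so the sketch is reproducible (the seed is arbitrary)
rng = np.random.default_rng(0)

p_x = {0: 0.2, 1: 0.8}
p_y_given_x = {
    0: {0: 0.3, 1: 0.7},
    1: {0: 0.5, 1: 0.5},
}
p_z_given_x_and_y = {
    0: {0: {0: 0.1, 1: 0.9}, 1: {0: 0.5, 1: 0.5}},
    1: {0: {0: 0.5, 1: 0.5}, 1: {0: 0.7, 1: 0.3}},
}


def sample(dist, size=None):
    """Draw from a {value: probability} dict (hypothetical helper).

    With size=None a single scalar is returned instead of an array.
    """
    return rng.choice(list(dist), size=size, p=list(dist.values()))


# x is drawn in one vectorized call
data = pd.DataFrame({'x': sample(p_x, size=10)})

# y depends on x: one scalar draw per row
data['y'] = [sample(p_y_given_x[x]) for x in data['x']]

# z depends on x and y: zip unpacks the two columns into x, y directly
data['z'] = [
    sample(p_z_given_x_and_y[x][y])
    for x, y in zip(data['x'], data['y'])
]
```

This still draws one sample per row in a Python loop, so it is tidier rather than faster. For large frames, grouping rows by their (x, y) values and drawing each group in one call would probably scale better.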