Thursday, October 20, 2022

Choose from a multinomial distribution

I have a series of values and a probability with which I want each of those values to be sampled. Is there a PySpark method to sample from that distribution for each row? I know how to hard-code it with a random number generator, but I want the method to be flexible for any number of assignment values and probabilities:

assignment_values = ["foo", "buzz", "boo"]
value_probabilities = [0.3, 0.3, 0.4]

Hard-coded method with random number generator:

from pyspark.sql import Row, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = [
    {"person": 1, "company": "5g"},
    {"person": 2, "company": "9s"},
    {"person": 3, "company": "1m"},
    {"person": 4, "company": "3l"},
    {"person": 5, "company": "2k"},
    {"person": 6, "company": "7c"},
    {"person": 7, "company": "3m"},
    {"person": 8, "company": "2p"},
    {"person": 9, "company": "4s"},
    {"person": 10, "company": "8y"},
]
df = spark.createDataFrame(Row(**x) for x in data)

(
    df
    .withColumn("rand", F.rand())
    .withColumn(
        "assignment", 
        F.when(F.col("rand") < F.lit(0.3), "foo")
        .when(F.col("rand") < F.lit(0.6), "buzz")
        .otherwise("boo")
    )
    .show()
)
+-------+------+-------------------+----------+
|company|person|               rand|assignment|
+-------+------+-------------------+----------+
|     5g|     1| 0.8020603266148111|       boo|
|     9s|     2| 0.1297179045352752|       foo|
|     1m|     3|0.05170251723736685|       foo|
|     3l|     4|0.07978240998283603|       foo|
|     2k|     5| 0.5931269297050258|      buzz|
|     7c|     6|0.44673560271164037|      buzz|
|     3m|     7| 0.1398027427612647|       foo|
|     2p|     8| 0.8281404801171598|       boo|
|     4s|     9|0.15568513681001817|       foo|
|     8y|    10| 0.6173220502731542|       boo|
+-------+------+-------------------+----------+
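To generalize the hard-coded `when` chain, note that it is really comparing one uniform draw against the cumulative probabilities `[0.3, 0.6, 1.0]`. The sketch below (plain Python, hypothetical `assign` helper) shows that threshold logic; the commented lines indicate how the same loop would build the PySpark column expression generically.

```python
from itertools import accumulate

assignment_values = ["foo", "buzz", "boo"]
value_probabilities = [0.3, 0.3, 0.4]

# Cumulative upper bounds: [0.3, 0.6, 1.0]
thresholds = list(accumulate(value_probabilities))

def assign(u, values=assignment_values, bounds=thresholds):
    """Map a uniform draw u in [0, 1) to the first value whose bound exceeds u."""
    for value, bound in zip(values, bounds):
        if u < bound:
            return value
    return values[-1]  # guard against floating-point rounding of the last bound

# The same loop can build the PySpark column for any number of values,
# replacing the hard-coded F.when(...).when(...).otherwise(...) chain:
#   expr = F.when(F.col("rand") < thresholds[0], assignment_values[0])
#   for value, bound in zip(assignment_values[1:-1], thresholds[1:-1]):
#       expr = expr.when(F.col("rand") < bound, value)
#   expr = expr.otherwise(assignment_values[-1])
#   df = df.withColumn("assignment", expr)
```

With the probabilities above, a draw of 0.45 lands between 0.3 and 0.6, so it maps to `"buzz"`, matching the hard-coded version.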


