Saturday, March 5, 2022

Select N values from an array based on probabilities - PySpark

I want to sample 3 items from an array of 4 items, weighted by each item's probability of occurring. The input has one array of items and a second array giving the probability of each item being selected (see "Input" below).

I tried creating a UDF and passing the array values to it. With some modification I got that to work, but it picks items with replacement; I want the items in the sample to be unique.

import pyspark.sql.functions as F
import pyspark.sql.types as T

df = spark.createDataFrame(
    [(1, ['A','B','C','D'], [0.5,0.3,0.1,0.1]), 
     (2, ['X','Y','Z','Q'], [0.8,0.1,0.1,0.0]),
     (3, ['Z','P','Q','R'], [0.6,0.3,0.1,0.1]),
     (4, ['P','M','R','Q'], [0.7,0.2,0.1,0.0]),
     (5, ['P','M','R','Q'], [0.5,0.3,0.1,0.1])], 
    ["id", "item_id", "probability"])

import random
def wt_sample(item, wt):
    # random.choices draws WITH replacement, so an item can repeat
    return random.choices(item, weights=wt, k=3)

udf_wt_sample = F.udf(wt_sample, T.ArrayType(T.StringType()))
df.withColumn("sample", udf_wt_sample(F.col("item_id"), F.col("probability"))).show()

I also tried np.random.choice, which lets me specify whether to sample with replacement, but this takes too long and errors out:

import numpy as np
def wt_sample(item, wt):
    return np.random.choice(item,size=3,replace=False, p=wt)

udf_wt_sample = F.udf(wt_sample, T.ArrayType(T.StringType()))
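
A possible explanation for the error, offered as a guess since no traceback is shown: np.random.choice raises a ValueError when p does not sum to 1 (the probabilities for id 3 above sum to 1.1), and it returns a NumPy array, which a UDF declared as ArrayType(StringType()) may fail to convert. The sketch below rescales the weights and returns a plain Python list. The name wt_sample_np and the k parameter are illustrative, not part of the original attempt.

```python
import numpy as np

def wt_sample_np(items, weights, k=3):
    """Draw k distinct items; weights are rescaled so they sum to 1."""
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()  # np.random.choice raises ValueError if p does not sum to 1
    # replace=False also requires at least k non-zero entries in p
    picked = np.random.choice(items, size=k, replace=False, p=p)
    return picked.tolist()  # plain Python str values instead of numpy.str_

# Row with id 3 from the data above, whose weights sum to 1.1
print(wt_sample_np(['Z', 'P', 'Q', 'R'], [0.6, 0.3, 0.1, 0.1]))
```

If this behaves as expected, it could be wrapped the same way as before, with F.udf(wt_sample_np, T.ArrayType(T.StringType())).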

Input:

+--+-----------+-----------------+
|id|    item_id|      probability|
+--+-----------+-----------------+
| 1|  [A,B,C,D]|[0.5,0.3,0.1,0.1]|
| 2|  [X,Y,Z,Q]|[0.8,0.1,0.1,0.0]|
| 3|  [Z,P,Q,R]|[0.6,0.3,0.1,0.1]|
| 4|  [P,M,R,Q]|[0.7,0.2,0.1,0.0]|
| 5|  [P,M,R,Q]|[0.5,0.3,0.1,0.1]|
+--+-----------+-----------------+

The output should look like this:

+--+-----------+-----------------+-------+
|id|    item_id|      probability| sample|
+--+-----------+-----------------+-------+
| 1|  [A,B,C,D]|[0.5,0.3,0.1,0.1]|[A,B,D]|
| 2|  [X,Y,Z,Q]|[0.8,0.1,0.1,0.0]|[X,Z,Y]|
| 3|  [Z,P,Q,R]|[0.6,0.3,0.1,0.1]|[Z,P,Q]|
| 4|  [P,M,R,Q]|[0.7,0.2,0.1,0.0]|[P,M,R]|
| 5|  [P,M,R,Q]|[0.5,0.3,0.1,0.1]|[M,R,Q]|
+--+-----------+-----------------+-------+

The samples should be drawn randomly according to the probabilities and shouldn't just be the top-3 values. That's why I tried random.choices() with weights, but I need the sampled values to be unique, i.e. drawn without replacement. (Currently, with random.choices(), sampling is with replacement, so the same item can be picked twice in the sample.)
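
One way the requirement could be met with the standard library alone is the Efraimidis-Spirakis method: give each item a random key u ** (1 / w) and keep the k items with the largest keys. This is a minimal sketch, not a tested Spark solution; the function name and the k and rng parameters are illustrative, and the weights are assumed to be non-negative (they do not need to sum to 1).

```python
import math
import random

def weighted_sample_without_replacement(items, weights, k=3, rng=random):
    """Efraimidis-Spirakis sampling: keep the k items with the largest keys."""
    keyed = []
    for item, w in zip(items, weights):
        if w <= 0:
            continue  # zero-weight items can never be drawn
        u = 1.0 - rng.random()  # u lies in (0, 1], which avoids log(0)
        # log(u) / w is the logarithm of u ** (1 / w), so the ordering is the same
        keyed.append((math.log(u) / w, item))
    if len(keyed) < k:
        raise ValueError("fewer positive-weight items than k")
    keyed.sort(reverse=True)
    return [item for _, item in keyed[:k]]

print(weighted_sample_without_replacement(['A', 'B', 'C', 'D'], [0.5, 0.3, 0.1, 0.1]))
```

Each call returns 3 distinct items, with higher-weight items more likely to be included. Wrapping it with F.udf(..., T.ArrayType(T.StringType())) should work as in the snippets above, since it returns a plain list of strings.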



