I want to sample 3 items from an array of 4 items, weighted by each item's probability of occurring. The input is an array of items plus a second array holding the probability of each item being selected.
I tried creating a UDF and passing the array values to it. With some modification I got that to work, but it picks items with replacement; I want the items in the sample to be unique.
import pyspark.sql.functions as F
import pyspark.sql.types as T

df = spark.createDataFrame(
    [(1, ['A','B','C','D'], [0.5,0.3,0.1,0.1]),
     (2, ['X','Y','Z','Q'], [0.8,0.1,0.1,0.0]),
     (3, ['Z','P','Q','R'], [0.6,0.3,0.1,0.1]),
     (4, ['P','M','R','Q'], [0.7,0.2,0.1,0.0]),
     (5, ['P','M','R','Q'], [0.5,0.3,0.1,0.1])],
    ["id", "item_id", "probability"])

import random

def wt_sample(item, wt):
    # random.choices samples WITH replacement, so duplicates are possible
    return random.choices(item, weights=wt, k=3)

udf_wt_sample = F.udf(wt_sample, T.ArrayType(T.StringType()))
# The items column is named "item_id" in the DataFrame above
df.withColumn("sample", udf_wt_sample(F.col("item_id"), F.col("probability"))).show()
I also tried np.random.choice, which lets me specify whether I want replacement, but this takes too long and errors out:
import numpy as np

def wt_sample(item, wt):
    return np.random.choice(item, size=3, replace=False, p=wt)

udf_wt_sample = F.udf(wt_sample, T.ArrayType(T.StringType()))
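A possible cause of the error, not confirmed here: np.random.choice requires p to sum to 1, and the weights for id 3 (0.6, 0.3, 0.1, 0.1) sum to 1.1. It also returns a numpy.ndarray rather than a Python list, which a UDF declared with ArrayType may fail to serialize, so converting the result with .tolist() may be needed as well. The sketch below only illustrates the first point; normalize_weights is a hypothetical helper name, not part of the original code.

```python
def normalize_weights(weights):
    """Rescale non-negative weights so they sum to 1 (hypothetical helper)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return [w / total for w in weights]


# The weights for id 3 in the example data do not sum to 1 as given.
row3 = [0.6, 0.3, 0.1, 0.1]
row3_normalized = normalize_weights(row3)
```

The normalized list could then be passed as p, assuming the weights are meant as relative rather than exact probabilities.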
Input:
+--+---------+-----------------+
|id|  item_id|      probability|
+--+---------+-----------------+
| 1|[A,B,C,D]|[0.5,0.3,0.1,0.1]|
| 2|[X,Y,Z,Q]|[0.8,0.1,0.1,0.0]|
| 3|[Z,P,Q,R]|[0.6,0.3,0.1,0.1]|
| 4|[P,M,R,Q]|[0.7,0.2,0.1,0.0]|
| 5|[P,M,R,Q]|[0.5,0.3,0.1,0.1]|
+--+---------+-----------------+
The output should look like this:
+--+---------+-----------------+-------+
|id|  item_id|      probability| sample|
+--+---------+-----------------+-------+
| 1|[A,B,C,D]|[0.5,0.3,0.1,0.1]|[A,B,D]|
| 2|[X,Y,Z,Q]|[0.8,0.1,0.1,0.0]|[X,Z,Q]|
| 3|[Z,P,Q,R]|[0.6,0.3,0.1,0.1]|[Z,P,Q]|
| 4|[P,M,R,Q]|[0.7,0.2,0.1,0.0]|[P,M,R]|
| 5|[P,M,R,Q]|[0.5,0.3,0.1,0.1]|[M,R,Q]|
+--+---------+-----------------+-------+
The samples should be drawn randomly according to the probabilities and shouldn't just be the top-3 values. That's why I tried random.choices() with weights, but I need the sampled values to be unique, i.e. drawn without replacement. Currently, with random.choices(), the same item can be picked twice in a sample.
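One way the requirement above could be met in pure Python is weighted sampling without replacement via random keys (the Efraimidis-Spirakis scheme): each item gets the key log(u) / w with u uniform in (0, 1], and the k items with the largest keys are kept. This is a minimal sketch under the assumption that weights are relative and non-negative; the function name weighted_sample_unique and its parameters are illustrative, not taken from the original code.

```python
import heapq
import math
import random


def weighted_sample_unique(items, weights, k=3, rng=random):
    """Draw up to k distinct items, favouring larger weights.

    Each positive-weight item gets the key log(u) / w, which orders items
    the same way as u ** (1 / w); the k largest keys win.
    Weights do not need to sum to 1.
    """
    keyed = []
    for item, w in zip(items, weights):
        if w <= 0:
            continue  # zero-weight items can never be selected
        u = 1.0 - rng.random()  # uniform in (0, 1], so log(u) is defined
        keyed.append((math.log(u) / w, item))
    # If fewer than k items have positive weight, fewer than k are returned.
    top = heapq.nlargest(k, keyed, key=lambda pair: pair[0])
    return [item for _, item in top]


sample_row1 = weighted_sample_unique(['A', 'B', 'C', 'D'], [0.5, 0.3, 0.1, 0.1])
sample_row2 = weighted_sample_unique(['X', 'Y', 'Z', 'Q'], [0.8, 0.1, 0.1, 0.0])
```

Since the function returns a plain Python list of strings, it could presumably be wrapped the same way as wt_sample above, with F.udf and T.ArrayType(T.StringType()). Note that an item with probability 0.0 is never returned, so a row with only three positive weights always yields those three items in some order.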