I am trying to efficiently add an integer column to a DataFrame, where each value is a random integer i from 0 to N such that i is not in an excluded set R. The excluded set for each row is obtained by a mapping from the value of the first column, word1: every value of word1 maps to a set of excluded integers.
word1 | word2 | word3
Idaho | New York | <random integer based on word1 map>
I'm trying to do this without a UDF. Below is the pseudo-code I've landed on, but I really don't know if this is even possible with PySpark. I'm stuck on how to generate a random integer within a range while excluding certain values. Normally, I'd generate random values until one meets the condition, but I think any while loop would have to go into a UDF, which I'm trying to avoid.
df_expanded = df.withColumn('word3', rand() where rand() not in map(col('word1')))
Here is how I'd write it as a UDF, but this is very slow.
import random

def get_sample(all_ii, word):
    # the set of integers to exclude for this word
    s = iid[word]
    # all_ii is the set of all possible integers in the range
    a = all_ii - s
    if a:
        return random.sample(sorted(a), 1)[0]
    return ''
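For reference, the UDF body above can be made self-contained and cheaper by precomputing the complement of each exclusion set once per key, instead of doing the set difference on every row. A minimal sketch, with a hypothetical N and exclusion map iid:

```python
import random

# Hypothetical example data: N = 9 and an exclusion map keyed by word1.
N = 9
iid = {
    "Idaho": {0, 1, 2},
    "Ohio": {7, 8, 9},
}

# Precompute the allowed values once per key, rather than recomputing
# the set difference for every row.
allowed = {word: sorted(set(range(N + 1)) - excl) for word, excl in iid.items()}

def get_sample(word):
    """Return a random integer in [0, N] outside the excluded set for word."""
    a = allowed.get(word, [])
    return random.choice(a) if a else None

value = get_sample("Idaho")
```

Note that random.sample on a set is deprecated since Python 3.9 (and removed in 3.11), so sampling from a sorted list also keeps the code forward-compatible.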