Saturday, August 1, 2020

PySpark, generate random integer for row not in list

I am trying to efficiently add an integer column to a dataframe, where each value is an integer i drawn from 0 to N, with the constraint that i is not in an exclusion set R. The exclusion set for a row is determined by the value of the first column, word1: every word1 value maps to its own set of excluded integers.

word1     | word2     | word3
----------|-----------|-----------------------------------
Idaho     | New York  | <rand integer based on word1 map>

I'm trying to do this without a UDF. This is the pseudo-code I've landed on, but I don't know whether it is even possible in PySpark. I'm stuck on how to generate a random integer within a range that excludes certain values. Normally I'd generate random values in a loop until one meets the condition, but I think I'd have to move any such while loop into a UDF, which I'm trying to avoid.

df_expanded = df.withColumn('word3', rand() where rand() not in map(col('word1')))

Here is how I'd write it as a UDF, but this is very slow.

import random

def get_sample(all_ii, word):
    # iid maps each word1 value to the set of integers to exclude
    s = iid[word]
    # all_ii is the set of all possible integers in the range
    a = all_ii - s
    if a:
        # random.sample/random.choice require a sequence, not a set
        # (a TypeError as of Python 3.11), so materialize the set first
        return random.choice(tuple(a))
    return None  # no valid integer remains for this word
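The sampling logic itself can be exercised outside Spark. A self-contained sketch, again assuming a hypothetical N and exclusion map iid:

```python
import random

N = 10  # hypothetical upper bound (exclusive)
all_ii = set(range(N))
# hypothetical exclusion map: word1 value -> set of disallowed integers
iid = {"Idaho": {0, 1, 2}, "Ohio": {5}}

def get_sample(all_ii, word):
    """Return a random integer in [0, N) not excluded for `word`, or None."""
    a = all_ii - iid.get(word, set())
    if a:
        # set difference gives the complement directly: no retry loop needed
        return random.choice(tuple(a))
    return None

print(get_sample(all_ii, "Idaho"))
```

Sampling directly from the complement set like this is O(|allowed|) per call and never loops, unlike rejection sampling, whose expected retries grow as the exclusion set approaches the full range.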


