Wednesday, November 27, 2019

pyspark - getting reproducible random values across Spark sessions

I want to add a column of random integers to a DataFrame for something I am testing. I am struggling to get results that are reproducible across Spark sessions. I am able to reproduce the results by using

from pyspark.sql.functions import rand

# rand(seed=...) yields a column of uniform doubles in [0.0, 1.0)
new_df = my_df.withColumn("rand_index", rand(seed=7))

but it only works within the same Spark session: I do not get the same results once I relaunch Spark and rerun my script. I also tried defining a UDF that draws from Python's random module with random.seed set, roughly as sketched below, but to no avail.
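For reference, the UDF attempt was along these lines (a reconstructed sketch; the integer range and the names are illustrative):

import random

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

random.seed(7)  # seeding here did not make the results reproducible

@udf(returnType=IntegerType())
def rand_int():
    # executed on the executors, independently of the seed set above
    return random.randint(0, 100)

new_df = my_df.withColumn("rand_index", rand_int())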

Is there a way to ensure reproducible random number generation across Spark sessions? One direction I am wondering about, sketched below, is deriving the values deterministically from an existing key column instead of using an RNG at all, but I am not sure whether that is a reasonable approach.
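from pyspark.sql import functions as F

# F.hash() is Spark's deterministic Murmur3 hash, so the derived value depends
# only on the data in the key column, not on any RNG state; "id" is a
# placeholder for whatever stable key column the DataFrame has
new_df = my_df.withColumn("rand_index", F.abs(F.hash(F.col("id"))) % 100)

I would really appreciate some guidance :) Thanks for the help!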



