I want to add a column of random integers to a DataFrame for something I am testing, but I am struggling to get reproducible results across Spark sessions. Within a single session, I can reproduce the results by using:
from pyspark.sql.functions import rand
new_df = my_df.withColumn("rand_index", rand(seed=7))
but this only works while I am running it in the same Spark session. I do not get the same results once I relaunch Spark and rerun my script. I also tried defining a UDF that uses Python's random module with random.seed set, but to no avail.
Is there a way to ensure reproducible random number generation across Spark sessions? I would really appreciate some guidance :) Thanks for the help!
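To make the goal concrete, here is a minimal sketch of one approach that is sometimes used for this kind of problem: deriving each value from a hash of a stable row key plus a seed, so the result does not depend on partitioning, row order, or the session. It assumes the DataFrame has a stable, unique key column, which the code above does not show; the function name `stable_rand_int`, the `upper` bound, and the sample keys are all hypothetical.

```python
import hashlib


def stable_rand_int(key, seed=7, upper=1000):
    """Map a row key to an integer in [0, upper) that depends only on (seed, key)."""
    # SHA-256 is a pure function of its input bytes, so the same (seed, key)
    # pair gives the same digest in every process and every session.
    digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
    # Read the first 8 bytes as an unsigned integer, then reduce to the range.
    return int.from_bytes(digest[:8], "big") % upper


# Hypothetical row keys standing in for a stable, unique column of my_df.
keys = ["a", "b", "c"]
values = [stable_rand_int(k) for k in keys]
```

In PySpark, a function like this could presumably be wrapped with `udf(lambda k: stable_rand_int(k), IntegerType())` and applied to the key column with `withColumn`. The values are pseudo-random in the sense of being well scattered, not statistically rigorous, and two rows with the same key would get the same value.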