Hello, I have a dataframe with an ID and the associated quarter for that ID (table1 in the image). My objective, at the end of the day, is to randomly select an ID regardless of the quarter.
In order to randomly select an ID, I add a Random_Num column using the PySpark rand function and also set the seed value so the results can be replicated (for example, random_num = rand(seed=1234)). After adding the Random_Num column (table2 in the image), I sort table2 by ID and Random_Num and then randomly select the ID using the dropDuplicates function. Once I run dropDuplicates, I get table3, shown in the image.
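To make the steps concrete, here is a simplified version of what I am doing; the sample rows and the extra columns are placeholders, not my actual schema:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder input: one row per (ID, Quarter) pair, as in table1.
df = spark.createDataFrame(
    [(1, "3/31/2015"), (1, "6/30/2015"), (2, "3/31/2015"), (2, "9/30/2015")],
    ["ID", "Quarter"],
)

# Step 1: add a random number with a fixed seed (table2).
df_rand = df.withColumn("Random_Num", F.rand(seed=1234))

# Step 2: sort by ID and Random_Num, then keep one row per ID (table3).
df_selected = df_rand.orderBy("ID", "Random_Num").dropDuplicates(["ID"])
```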
After this step, I do other data processing, which is not a concern here. At the end, I run an aggregate function and count the IDs for each quarter. However, every time I run it, I get a different count and different associated aggregated statistics, such as the average value. For example, one run of the aggregation gives a count of 200 for the quarter 3/31/2015; the next time I run the same code, I get 210 for the same quarter.
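The aggregation step looks roughly like this (again with placeholder column names; the real job also computes averages and other statistics):

```python
# Count how many selected IDs fall in each quarter; this is the number that
# changes between runs (e.g. 200 vs. 210 for 3/31/2015).
quarter_counts = df_selected.groupBy("Quarter").agg(F.count("ID").alias("id_count"))
quarter_counts.show()
```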
Note that I have several million records in my dataframe.
My question: is it expected to get a different count every time I run this, since I am using the random function (I verified that the random function with a seed value generates the same numbers every time), or is there some other problem that I don't know of? (FYI: I am new to PySpark.)
Lastly, I found two similar Stack Overflow questions, and the discussion there was to either cache() or persist() the dataframe. Is this the only solution?
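If caching is indeed the fix, my understanding from those answers is that it would be applied right after the random column is added, something like the sketch below (my guess, not something I have verified at scale):

```python
# Materialize the random column once so later actions reuse the stored values
# instead of recomputing the rand() column from the lineage on every run.
df_rand = df.withColumn("Random_Num", F.rand(seed=1234)).cache()
df_rand.count()  # action to force evaluation and populate the cache

df_selected = df_rand.orderBy("ID", "Random_Num").dropDuplicates(["ID"])
```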
I appreciate any guidance or help you can provide.
Reference:
spark inconsistency when running count command
Spark re-samples my data every time I run something related to the sample