I have 5 large tables in Hive and I want to create a smaller sample of each of them using PySpark.
I found at least 3 ways to do this:
# best option? (takeSample returns a plain Python list, so it has to go back through createDataFrame)
spark.createDataFrame(df.rdd.takeSample(withReplacement=False, num=5000), schema=df.schema)
# does not shuffle, just takes the first rows
df.limit(5000)
# takes a fraction instead of a number of rows
df.sample(withReplacement=False, fraction=0.01)
The last option requires a fraction instead of a specific number of rows, so the fraction I need depends on the size of the initial DataFrame, and df.count() takes a long time.
I need a random sample, and limit does not shuffle (?) the DataFrame.
So I've gone with df.rdd.takeSample. Is that the best way to take a fast random sample of a DataFrame?
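
For context, here is a minimal sketch of one more option I've considered: combining sample with limit so that no full df.count() is needed. The helper name sample_n, the approx_total argument, and the 20% padding factor are all my own assumptions, not part of any Spark API:

from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.getOrCreate()

def sample_n(df: DataFrame, n: int, approx_total: int, seed: int = 42) -> DataFrame:
    # approx_total is a rough row-count estimate (e.g. from Hive table statistics),
    # so no full df.count() is run; the fraction is padded so sample() usually
    # returns at least n rows, and limit() then trims the result to exactly n.
    fraction = min(1.0, 1.2 * n / approx_total)  # 20% padding is an arbitrary choice
    return df.sample(withReplacement=False, fraction=fraction, seed=seed).limit(n)

# e.g. small_df = sample_n(df, 5000, approx_total=10_000_000)

The caveat is that limit() keeps whichever sampled rows come first, so the result is only approximately uniform; as far as I can tell, takeSample is the option that guarantees an exact, uniformly drawn number of rows.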