Monday, February 18, 2019

What is the fastest way to randomly sample a Hive table in Spark?

I have 5 large tables in Hive, and I want to create a smaller sample of each of them using PySpark.

I found that there are at least three ways to do this.

# best option? (takeSample returns a plain Python list of Rows,
# so it has to go back through spark.createDataFrame)
spark.createDataFrame(df.rdd.takeSample(withReplacement=False, num=5000), schema=df.schema)

# does not shuffle
df.limit(5000)

# fraction instead of n_rows
df.sample(withReplacement=False, fraction=0.01)

The last option takes a fraction instead of a specific number of rows, so the fraction I need depends on the size of the initial DataFrame, and df.count() takes a long time.
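One workaround I'm considering (just a sketch, assuming a rough row-count estimate is available cheaply, e.g. from Hive table statistics; the names sample_exact and approx_rows are mine): oversample slightly with sample() and trim to the exact size with limit().

# Sketch: get ~n rows without an exact df.count(), given a rough estimate
# of the table size. May still return fewer than n rows if the estimate is
# far off.
def sample_exact(df, n, approx_rows, oversample=1.2):
    # oversample a bit so the fraction-based sample rarely falls short of n,
    # then cut down to exactly n with limit()
    fraction = min(1.0, n * oversample / approx_rows)
    return df.sample(withReplacement=False, fraction=fraction).limit(n)

small_df = sample_exact(df, n=5000, approx_rows=10_000_000)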

I need a random sample, and limit does not shuffle (?) the DataFrame.

So I've chosen df.rdd.takeSample. Is it the best way to take a fast random sample of a DataFrame?
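To check this empirically, I could time each option with a rough harness like the one below (only a sketch; the timing code and the trailing .count() to force evaluation are my own additions, not part of any of the options).

import time

def time_it(label, make_df):
    start = time.perf_counter()
    make_df().count()  # force the sample to actually be computed
    print(f"{label}: {time.perf_counter() - start:.1f}s")

time_it("takeSample", lambda: spark.createDataFrame(
    df.rdd.takeSample(withReplacement=False, num=5000), schema=df.schema))
time_it("limit", lambda: df.limit(5000))
time_it("sample", lambda: df.sample(withReplacement=False, fraction=0.01))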
