Monday, February 18, 2019

What is the fastest way to randomly sample a Hive table in Spark?

I have 5 large tables in Hive, and I want to create a smaller sample of each of them using PySpark.

I found that there are at least three ways to do this.

# best option? (takeSample returns a plain Python list of Rows,
# so it has to go back through spark.createDataFrame)
spark.createDataFrame(df.rdd.takeSample(withReplacement=False, num=5000), schema=df.schema)

# does not shuffle
df.limit(5000)

# fraction instead of n_rows
df.sample(withReplacement=False, fraction=0.01)

The last option takes a fraction instead of a specific number of rows, so the fraction I need depends on the size of the initial DataFrame, and df.count() takes a long time.
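One workaround I'm considering (just a sketch, assuming a rough row-count estimate is available cheaply, e.g. from Hive table statistics; the names sample_exact and approx_rows are mine): oversample slightly with sample() and trim to the exact size with limit().

# Sketch: get ~n rows without an exact df.count(), given a rough estimate
# of the table size. May still return fewer than n rows if the estimate is
# far off.
def sample_exact(df, n, approx_rows, oversample=1.2):
    # oversample a bit so the fraction-based sample rarely falls short of n,
    # then cut down to exactly n with limit()
    fraction = min(1.0, n * oversample / approx_rows)
    return df.sample(withReplacement=False, fraction=fraction).limit(n)

small_df = sample_exact(df, n=5000, approx_rows=10_000_000)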

I need a random sample, and limit does not shuffle (?) the DataFrame.

So I've chosen df.rdd.takeSample. Is it the best way to take a fast random sample of a DataFrame?
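To check this empirically, I could time each option with a rough harness like the one below (only a sketch; the timing code and the trailing .count() to force evaluation are my own additions, not part of any of the options).

import time

def time_it(label, make_df):
    start = time.perf_counter()
    make_df().count()  # force the sample to actually be computed
    print(f"{label}: {time.perf_counter() - start:.1f}s")

time_it("takeSample", lambda: spark.createDataFrame(
    df.rdd.takeSample(withReplacement=False, num=5000), schema=df.schema))
time_it("limit", lambda: df.limit(5000))
time_it("sample", lambda: df.sample(withReplacement=False, fraction=0.01))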
