Friday, January 8, 2021

Fastest way to take a random sample of 100,000 rows from each partition of a Hive table

I have a Hive table partitioned by day, with each partition containing almost 80 million rows.

I want to take a random sample of 100000 rows from each partition for a particular month.

Currently I rank the rows within each partition, ordering by rand(), and then filter on the rank, but this takes 45-60 minutes.
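
To make the current approach concrete, here is a minimal in-memory Python sketch of the same idea, not the actual Hive query. The function name `sample_per_partition`, the `partition_of` callback, and the toy row layout are hypothetical and chosen only for illustration. Each row gets a random key, and the k rows with the smallest keys in each partition are kept, which corresponds to filtering on rank <= k after ordering by rand().

```python
import heapq
import random


def sample_per_partition(rows, partition_of, k, seed=None):
    """Return up to k uniformly sampled rows per partition.

    Hypothetical helper: mirrors "rank over (partition by ... order by rand())"
    followed by a filter on rank <= k, but on an in-memory iterable.
    """
    rng = random.Random(seed)

    # Attach a random key to every row and group the pairs by partition.
    keyed = {}
    for row in rows:
        keyed.setdefault(partition_of(row), []).append((rng.random(), row))

    # Keeping the k smallest random keys is equivalent to rank <= k.
    return {
        part: [row for _, row in heapq.nsmallest(k, pairs, key=lambda p: p[0])]
        for part, pairs in keyed.items()
    }


if __name__ == "__main__":
    # Toy data: two daily partitions of 1000 rows each (assumed layout).
    days = ("2021-01-01", "2021-01-02")
    rows = [(day, i) for day in days for i in range(1000)]
    sample = sample_per_partition(rows, lambda r: r[0], k=10, seed=42)
    for day, picked in sample.items():
        print(day, len(picked))
```

The sketch also hints at where the time may be going: every row in a partition receives a key and takes part in the ordering, even though only a small fraction of the rows is kept.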

Is there a faster way to do the same thing without compromising on the quality of the sample?



