mercredi 21 février 2018

Efficient sampling of a fixed number of rows in BigQuery

I have a large dataset of size N, and want to get a (uniformly) random sample of size n. This question offers two possible solutions:

SELECT foo FROM mytable WHERE RAND() < n/N

→ This is fast, but doesn't give me exactly n rows (only approximately).

SELECT foo, RAND() as r FROM mytable ORDER BY r LIMIT n

→ This requires to sort N rows, which seems unnecessary and wasteful (especially if n << N).

Is there a solution that combines the advantages of both? I imagine I could use the first solution to select 2n rows, then sort this smaller dataset, but it's sort of ugly and not guaranteed to work, so I'm wondering whether there's a better option.




Aucun commentaire:

Enregistrer un commentaire