I have a large dataset of size N, and want to get a (uniformly) random sample of size n. This question offers two possible solutions:
SELECT foo FROM mytable WHERE RAND() < n/N
→ This is fast, but doesn't give me exactly n rows (only approximately).
SELECT foo, RAND() as r FROM mytable ORDER BY r LIMIT n
→ This requires to sort N rows, which seems unnecessary and wasteful (especially if n << N).
Is there a solution that combines the advantages of both? I imagine I could use the first solution to select 2n rows, then sort this smaller dataset, but it's sort of ugly and not guaranteed to work, so I'm wondering whether there's a better option.
Aucun commentaire:
Enregistrer un commentaire