I have a table in databricks with ~500 million rows covering 25 months of data (2018-2020). I want to pull a 1 million row sample for some quick analysis & modeling. When I use tablesample(1000000 rows)
or limit 1000000
only 8 of the 25 months are represented in the sample. Is there a way to fix this?
%sql
with c1 as(
select *
from my_table
TABLESAMPLE(1000000 ROWS)
)
select month_id, count(*) cnt
from c1
group by month_id
order by month_id
Returns 8 rows all in the last 8 months of the data set.
%sql
with c1 as(
select *
from my_table
limit 1000000
)
select month_id, count(*) cnt
from c1
group by month_id
order by month_id
Returns 8 rows all in the last 8 months of the data set.
%sql
with c1 as(
select *
from my_table
--TABLESAMPLE(1000000 ROWS)
--limit 1000000
)
select month_id, count(*) cnt
from c1
group by month_id
order by month_id
Returns 25 rows covering all months
Aucun commentaire:
Enregistrer un commentaire