lundi 24 mai 2021

Databricks : tablesample only using certain groups

I have a table in databricks with ~500 million rows covering 25 months of data (2018-2020). I want to pull a 1 million row sample for some quick analysis & modeling. When I use tablesample(1000000 rows) or limit 1000000 only 8 of the 25 months are represented in the sample. Is there a way to fix this?

%sql
with c1 as(
  select *
  from my_table
  TABLESAMPLE(1000000 ROWS)
)
select month_id, count(*) cnt
from c1
group by month_id
order by month_id

Returns 8 rows all in the last 8 months of the data set.

%sql
with c1 as(
  select *
  from my_table
  limit 1000000
)
select month_id, count(*) cnt
from c1
group by month_id
order by month_id

Returns 8 rows all in the last 8 months of the data set.

%sql
with c1 as(
  select *
  from my_table
  --TABLESAMPLE(1000000 ROWS)
  --limit 1000000
)
select month_id, count(*) cnt
from c1
group by month_id
order by month_id

Returns 25 rows covering all months




Aucun commentaire:

Enregistrer un commentaire