mercredi 2 juin 2021

BigQuery: Sample a varying number of rows per group

I have two tables. One has a list of items, and for each item, a number n.

item | n
--------
a    | 1
b    | 2
c    | 3

The second one has a list of rows containing item, uid, and other rows.

item | uid | data
------------------
a    | x   | foo
a    | x   | baz
a    | x   | bar
a    | z   | arm
a    | z   | leg
b    | x   | eye
b    | x   | eye
b    | x   | eye
b    | x   | eye
b    | z   | tap
c    | y   | tip
c    | z   | top

I would like to sample, for each (item,uid) pair, n rows (arbitrary, it's better if this is uniformly random, but it doesn't have to be). In the example above, I want to keep maximum one row per user for item a, two rows per user for item b, and three rows per user to item c:

item | uid | data
------------------
a    | x   | baz
a    | z   | arm
b    | x   | eye
b    | x   | eye
b    | z   | tap
c    | y   | tip
c    | z   | top

n can be large (on the order of 100,000), so ARRAY_AGG with LIMIT n won't work. Are there any other options?




Aucun commentaire:

Enregistrer un commentaire