vendredi 17 septembre 2021

Select N random rows with matching conditions in PostgreSQL

I have a small table (10K records) in PostgreSQL of individuals that I want to randomly select an age+gender match from a big (100M records) table and get several additional columns about these people.

There are a couple of considerations:

  1. I'd like an efficient solution as the table is kinda big
  2. While unlikely, I don't want to accidentally select any people in the small table from records from the big table. While a complete without replacement would be ideal, I am OK with just removing all the people in small table from the big table.
  3. The big table can have multiple records for everyone, thus a DISTINCT is needed.
  4. Once I get the N random matches I have to rejoin the results to mybigtable in order to get additional columns that I want
  5. In this database, I have privileges to create TEMP tables but I cannot load data from a CSV into them and I cannot create regular tables.

I have figured out (below) how to inefficiently randomly select N (in this case 3) records for one person.

What I really want to do is be able to generalize this so it randomly selects 10 records for the all the people in the table mymatch, matching on values age+gender. I can't quite understand how to move to this.

DROP TABLE IF EXISTS mybigtable;  -- this is 100M
CREATE TEMPORARY TABLE mybigtable (ID varchar, eID varchar, age INT, gender VARCHAR);

INSERT INTO mybigtable VALUES 
    ('1', 'aaa', 84, 'F'),('2', 'aaa', 16, 'M'),('3', 'aaa', 23, 'F'),('4', 'aaa', 16, 'F'),('5', 'aaa', 94, 'F'),('6', 'aaa', 91, 'F'),('7', 'aaa', 18, 'M'),('8', 'aaa', 57, 'F'),('9', 'aaa', 84, 'F'),('10', 'aaa', 80, 'M'),('11', 'aaa', 16, 'M'),('12', 'aaa', 46, 'M'),('13', 'aaa', 84, 'F'),('14', 'aaa', 16, 'M'),('15', 'aaa', 23, 'F'),('16', 'aaa', 84, 'F'),('17', 'aaa', 30, 'M'),('18', 'aaa', 15, 'M'),('19', 'aaa', 16, 'M'),('20', 'aaa', 23, 'F'),('21', 'aaa', 84, 'F'),('22', 'aaa', 14, 'M'),('23', 'aaa', 84, 'F'),('24', 'aaa', 57, 'M'),('25', 'aaa', 89, 'M'),('1', 'bbb', 83, 'F'),('2', 'bbb', 19, 'M'),('3', 'bbb', 64, 'F'),('4', 'bbb', 92, 'M'),('5', 'bbb', 23, 'F'),('6', 'bbb', 62, 'M'),('7', 'bbb', 43, 'M'),('8', 'bbb', 16, 'M'),('9', 'bbb', 93, 'M'),('10', 'bbb', 45, 'M'),('11', 'bbb', 96, 'M'),('12', 'bbb', 68, 'M'),('13', 'bbb', 16, 'M'),('14', 'bbb', 97, 'F'),('15', 'bbb', 31, 'M'),('16', 'bbb', 23, 'F'),('17', 'bbb', 32, 'F'),('18', 'bbb', 18, 'F'),
    ('19', 'bbb', 23, 'F'),('20', 'bbb', 16, 'M'),('21', 'bbb', 35, 'M'),('22', 'bbb', 84, 'F'),('23', 'bbb', 48, 'F'),('24', 'bbb', 73, 'F'),('25', 'bbb', 46, 'F'),('26', 'bbb', 16, 'M'),('27', 'bbb', 39, 'M'),('28', 'bbb', 86, 'M'),('29', 'bbb', 78, 'F'),('30', 'bbb', 28, 'M'),('31', 'bbb', 32, 'F'),('32', 'bbb', 43, 'M'),('33', 'bbb', 64, 'F'),('34', 'bbb', 26, 'M'),('35', 'bbb', 81, 'M'),('36', 'bbb', 84, 'F'),('37', 'bbb', 23, 'F'),('38', 'bbb', 49, 'F'),('39', 'bbb', 66, 'F'),('40', 'bbb', 23, 'F'),('41', 'bbb', 23, 'F'),('42', 'bbb', 16, 'M'),('43', 'bbb', 92, 'M'),
    ('44', 'bbb', 16, 'M'),('45', 'bbb', 62, 'M'),('46', 'bbb', 16, 'M'),('47', 'bbb', 24, 'M'),('48', 'bbb', 16, 'M'),('49', 'bbb', 94, 'F'),('50', 'bbb', 58, 'F'),('1', 'ccc', 69, 'F'),('2', 'ccc', 97, 'M'),('3', 'ccc', 84, 'F'),('4', 'ccc', 78, 'M'),('5', 'ccc', 84, 'F'),('6', 'ccc', 54, 'M'),('7', 'ccc', 21, 'M'),('8', 'ccc', 23, 'F'),('9', 'ccc', 26, 'M'),('10', 'ccc', 84, 'M'),('11', 'ccc', 84, 'F'),('12', 'ccc', 69, 'M'),('13', 'ccc', 74, 'M'),('14', 'ccc', 83, 'F'),('15', 'ccc', 97, 'M'),('16', 'ccc', 55, 'M'),('17', 'ccc', 23, 'F'),('18', 'ccc', 59, 'F'),('19', 'ccc', 23, 'F'),('20', 'ccc', 68, 'F'),('21', 'ccc', 23, 'F'),('22', 'ccc', 84, 'F'),('23', 'ccc', 63, 'M'),('24', 'ccc', 88, 'M'),('25', 'ccc', 70, 'M');

DROP TABLE IF EXISTS mymatch;  -- this will be about 10000
CREATE TEMPORARY TABLE mymatch (ID varchar, eID varchar, age INT, gender VARCHAR);

INSERT INTO mymatch VALUES
    ('16', 'aaa', 84, 'F'),('8', 'bbb', 16, 'M'),('15', 'aaa', 23, 'F');

DROP TABLE IF EXISTS mynotin;
CREATE TEMPORARY TABLE mynotin (ID varchar, eID varchar, age INT, gender VARCHAR);

--Create a table that does not have the people of interest
INSERT INTO mynotin
    SELECT DISTINCT ID, eID, age, gender 
    FROM   mybigtable mbt 
    WHERE  NOT EXISTS 
        (SELECT  
        FROM   mymatch
        WHERE  mymatch.ID = mbt.ID AND mymatch.eID = mbt.eID);


--This is the SELECT statement to get 3 random rows.  Eventually this has to go to a table so I can join it to mybigtable and get additional columns of interest for the matched people.
SELECT id, eid, age, gender  
    FROM (
        SELECT 
            t.*, 
            row_number() OVER(partition by age, gender ORDER BY RANDOM()) rn  -- is there a more efficient method 
        FROM mynotin t
        WHERE age=84 AND gender='F') t  -- These are the conditions I want to change to the table mymatch
    WHERE rn <= 3;  --three for the example this will change to 10

DROP TABLE IF EXISTS mybigtable, mymatch, mynotin;



Aucun commentaire:

Enregistrer un commentaire