mercredi 17 février 2021

PostgreSQL -- how to generate random number of rows of fake data with foreign key constraints?

Note: I tried searching for other questions and answers that address this, but I couldn't find one that really corresponded to my problem and presented a full solution.

I'm trying to generate random, synthetic data to test my database schema, using SQL. While generating a bunch of random values in PostgreSQL is easy enough with random(), generating random data sets that preserve the constraints and characteristics of the data I'm expecting to see is not. Specifically, I have the following tables:

CREATE TABLE suites(
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
name TEXT
);

INSERT INTO suites(name)
SELECT 'suite' || g FROM generate_series(1,50) g;

CREATE TABLE tests(
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
name TEXT
);

INSERT INTO tests(name)
SELECT 'test' || g FROM generate_series(1,100) g;

CREATE TABLE tests_in_suites(
suite_id BIGINT,
test_id BIGINT,
PRIMARY KEY (suite_id, test_id)
);

DB Fiddle

I want to populate tests_in_suites with random values, such that each suite contains a random number (between 3 and 7) of tests, selected uniformly from tests. I want the selection to be random and uniform, and avoid cycles and other repeating patterns. I have tried the following approach:

SELECT s.id, t.id FROM
(select id from suites) s,
(SELECT id FROM tests ORDER BY random() LIMIT 2 + ceil(random() * 5)) t
ORDER BY s.id, t.id;

DB Fiddle

But it always chooses the same number of tests for each suite and the tests selected are identical, because the optimiser replaces the subquery for s with a constant. I tried introducing a dependency on the current suite being considered, but it complains the value I'm trying to use is not accessible:

SELECT s.id, t.id FROM
(select id from suites) s,
(SELECT id FROM tests ORDER BY random() LIMIT 2 + ceil(random() * 5 + s.id*0)) t
ORDER BY s.id, t.id;

ERROR:  invalid reference to FROM-clause entry for table "s"
LINE 3: ...s ORDER BY random() LIMIT 2 + ceil(random() * 5 + s.id*0)) t
                                                             ^
HINT:  There is an entry for table "s", but it cannot be referenced from this part of the query.

DB Fiddle

How can I generate my random data without falling victim to either the optimiser or invalid data dependencies in my query?




Aucun commentaire:

Enregistrer un commentaire