Note: I tried searching for other questions and answers that address this, but I couldn't find one that really corresponded to my problem and presented a full solution.
I'm trying to generate random, synthetic data to test my database schema, using SQL. While generating a bunch of random values in PostgreSQL is easy enough with random()
, generating random data sets that preserve the constraints and characteristics of the data I'm expecting to see is not. Specifically, I have the following tables:
CREATE TABLE suites(
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
name TEXT
);
INSERT INTO suites(name)
SELECT 'suite' || g FROM generate_series(1,50) g;
CREATE TABLE tests(
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
name TEXT
);
INSERT INTO tests(name)
SELECT 'test' || g FROM generate_series(1,100) g;
CREATE TABLE tests_in_suites(
suite_id BIGINT,
test_id BIGINT,
PRIMARY KEY (suite_id, test_id)
);
I want to populate tests_in_suites
with random values, such that each suite contains a random number (between 3 and 7) of tests, selected uniformly from tests
. I want the selection to be random and uniform, and avoid cycles and other repeating patterns. I have tried the following approach:
SELECT s.id, t.id FROM
(select id from suites) s,
(SELECT id FROM tests ORDER BY random() LIMIT 2 + ceil(random() * 5)) t
ORDER BY s.id, t.id;
But it always chooses the same number of tests for each suite and the tests selected are identical, because the optimiser replaces the subquery for s
with a constant. I tried introducing a dependency on the current suite being considered, but it complains the value I'm trying to use is not accessible:
SELECT s.id, t.id FROM
(select id from suites) s,
(SELECT id FROM tests ORDER BY random() LIMIT 2 + ceil(random() * 5 + s.id*0)) t
ORDER BY s.id, t.id;
ERROR: invalid reference to FROM-clause entry for table "s"
LINE 3: ...s ORDER BY random() LIMIT 2 + ceil(random() * 5 + s.id*0)) t
^
HINT: There is an entry for table "s", but it cannot be referenced from this part of the query.
How can I generate my random data without falling victim to either the optimiser or invalid data dependencies in my query?
Aucun commentaire:
Enregistrer un commentaire