mardi 22 juin 2021

How to generate realistic, arbitrary length text in (plain) SQL

I have a testing requirement to populate a text field in a SQL database, of arbitrary length (e.g. between 100 and 32k characters), to an arbitrary number of rows (up to several million rows). The text should be vaguely realistic (English), so a single character repeated N times is no good. The example below is written for IBM Db2, but the idea is that it should work in any major RDBMS without using functionality which is uniquely specific to a single database (if your answer only works in Sql Server, or Postgres that's fine, but the more generic the better). So far I've come up with the following approach:

  1. Create a table with an id and text fields:
CREATE TABLE STAGING.DATA_TEST (
    id      BIGINT          NOT NULL    GENERATED BY DEFAULT AS IDENTITY,
    text    VARCHAR(32000)  NULL
);
  1. Insert some pre-canned text of arbitrary length (e.g. quotes from a popular book):
INSERT INTO STAGING.DATA_TEST (text) 
VALUES 
    ('It was all very well to say “Drink me,” but the wise little Alice was not going to do that in a hurry. “No, I’ll look first,” she said, “and see whether it’s marked ‘poison’ or not."'),
    ('“Curiouser and curiouser!” cried Alice (she was so much surprised, that for the moment she quite forgot how to speak good English).'),
    ('“I wish I hadn’t cried so much!” said Alice, as she swam about, trying to find her way out. “I shall be punished for it now, I suppose, by being drowned in my own tears! That will be a queer thing, to be sure! However, everything is queer to-day.”'),
    ('“Well! I’ve often seen a cat without a grin,” thought Alice; “but a grin without a cat! It’s the most curious thing I ever saw in all my life!” '),
    ('“Who are you?” said the Caterpillar. This was not an encouraging opening for a conversation. Alice replied, rather shyly, “I—I hardly know, Sir, just at present—at least I know who I was when I got up this morning, but I think I must have been changed several times since then.”')
;
  1. Join the above table to itself (pseudo-random order) and concatenate the text fields. Insert the results into the same table:
INSERT INTO STAGING.DATA_TEST (text)
    SELECT CONCAT(t1.text, t2.text)
        FROM STAGING.DATA_TEST as t1
        JOIN (SELECT MAX(id) AS MAXID, MIN(id) AS MINID FROM STAGING.DATA_TEST) AS C ON 1 = 1
        JOIN STAGING.DATA_TEST AS t2 ON t1.id = (MOD(t2.id * 7, MAXID - MINID) + MINID)
;

I then repeat step 3 which gives me double the rows and rougly double the length of text in those rows each time. At some point I run out of column width, at which point I can't repeat step 3 any more, and if I still need more rows I can just duplicate the whole table again and again as required. The problem I have with this approach, is:

  1. I don't have fine-grained control over the distribution of the length of text generated.
  2. I don't have fine-grained control over the number of rows generated (after running out of column width by the random join approach above).
  3. I don't know whether my naive source of pseudo-randomness (see the join statement above) gives me good entropy in my generated data.

To start to answer 3 I tested the above with starting values [A,B,C,D,E] and it produces strings like this: CBCEDCBBCCBABAEBCCBADABCBDEADBCDEDADEBBCCBCBCBCBCBADBCDEDADEBDADADBCCBADBCDEDAD

This appears random enough to me, but without a background in statistics I am not sure if it's any good. Would appreciate feedback and any suggested improvements on the above.




Aucun commentaire:

Enregistrer un commentaire