samedi 2 mai 2020

Update PostgreSQL columns with random numbers according to a uniform or normal law

Context

I have a PostgreSQL table which contains several hundreds of thousands of rows and many columns.

Short: I'd like to initialize random values in some columns according to either a uniform distribution or a normal distribution.


Uniform distribution

I have 3 empty columns which I'd like to initialize with uniform random numbers (i.e. according to a uniform distribution along each column).

For that, I'm using the PostgreSQL random() function, but it is not clearly explained in the documentation whether the generated numbers are picked from a uniform or a normal distribution:

PostgreSQL (unclear) documentation for the random function Source: https://www.postgresql.org/docs/12/functions-math.html

So I made the (I hope correct) hypothesis it is a uniform distribution from now.


Normal distribution

And I have 3 other empty columns, which I'd like to initialize with normal random numbers (i.e. according to a normal distribution along each column):

Normal distribution for each of the 3 columns


Results

For a uniform distribution

I did this (I actually figured it out while writing this post);

UPDATE schema.tables
     SET col1 = (1 * random() + 1)::float4,
     col2 = (1 * random() + 1)::float4,
     col3 = (1 * random() + 1)::float4

Which is a bit slow but seems to work, because here is an example of generated data:

3 columns of uniform random number

And the data histogram for one column is almost uniform, so I guess it's OK:

Example of a uniform distribution across one column

But what would be great is to set the entire columns values in one shot instead of row by row (or maybe this row-by-row way of updating the table performs really better in SQL. I don't know, but it's more a linear algebra reasoning that I have in the background).


For a normal distribution

I'm stuck with the following query, inspired by the previous one but using the normal_rand function from the tablefunc extension:

UPDATE schema.table
  SET col1 = (1 * normal_rand(1,50.0,20.0) + 2)::float4),
  col2 = (1 * normal_rand(1,50.0,20.0) + 2)::float4),
  col3 = (1 * normal_rand(1,50.0,20.0) + 2)::float4),

but here, I am facing the following error: set-returning functions are not allowed in UPDATE.

So, I guessed I have to use a SELECT sub-query then, hence I also tried this, but without much success:

UPDATE schema.table
  SET col1 = sub.col
  FROM (SELECT (1 * normal_rand(1, 50.0, 20.0) + 2)::float4 as col) AS sub,
  col2 = sub.col
  FROM (SELECT (1 * normal_rand(1, 50.0, 10.0) + 2)::float4 as col) AS sub,
  col3 = sub.col
  FROM (SELECT (1 * normal_rand(1, 50.0, 5.0) + 2)::float4 as col) AS sub

Where I get a syntax error at or near col2

But If I play with one single column:

UPDATE schema.table
  SET col1 = sub.col
  FROM (SELECT (1 * normal_rand(1, 50.0, 20.0) + 2)::float4 as col) AS sub;

It almost works, the query is successful, but I have the exact same number in each row of the column, which obviously doesn't make it a normal distribution!

So my dream would be to be able to update all whole column in one shot, using something such as:

UPDATE schema.table 
SET col1 = sub.col 
FROM (
    SELECT (
        1 * normal_rand(
            SELECT count(*) FROM schema.table, 
            50.0, 
            20.0
            ) + 2
    ):: float4 AS col
) AS sub;

But here again, I got a syntax error at the 2nd SELECT position; syntax error at or near "select"

Question

How can I set one or more entire column(s) with random numbers according to a normal distribution?




Aucun commentaire:

Enregistrer un commentaire