Context
I have a PostgreSQL table which contains several hundreds of thousands of rows and many columns.
Short: I'd like to initialize random values in some columns according to either a uniform distribution or a normal distribution.
Uniform distribution
I have 3 empty columns which I'd like to initialize with uniform random numbers (i.e. according to a uniform distribution along each column).
For that, I'm using the PostgreSQL random() function, but it is not clearly explained in the documentation whether the generated numbers are picked from a uniform or a normal distribution:
Source: https://www.postgresql.org/docs/12/functions-math.html
So I made the (I hope correct) hypothesis it is a uniform distribution from now.
Normal distribution
And I have 3 other empty columns, which I'd like to initialize with normal random numbers (i.e. according to a normal distribution along each column):
Results
For a uniform distribution
I did this (I actually figured it out while writing this post);
UPDATE schema.tables
SET col1 = (1 * random() + 1)::float4,
col2 = (1 * random() + 1)::float4,
col3 = (1 * random() + 1)::float4
Which is a bit slow but seems to work, because here is an example of generated data:
And the data histogram for one column is almost uniform, so I guess it's OK:
But what would be great is to set the entire columns values in one shot instead of row by row (or maybe this row-by-row way of updating the table performs really better in SQL. I don't know, but it's more a linear algebra reasoning that I have in the background).
For a normal distribution
I'm stuck with the following query, inspired by the previous one but using the normal_rand function from the tablefunc extension:
UPDATE schema.table
SET col1 = (1 * normal_rand(1,50.0,20.0) + 2)::float4),
col2 = (1 * normal_rand(1,50.0,20.0) + 2)::float4),
col3 = (1 * normal_rand(1,50.0,20.0) + 2)::float4),
but here, I am facing the following error: set-returning functions are not allowed in UPDATE.
So, I guessed I have to use a SELECT sub-query then, hence I also tried this, but without much success:
UPDATE schema.table
SET col1 = sub.col
FROM (SELECT (1 * normal_rand(1, 50.0, 20.0) + 2)::float4 as col) AS sub,
col2 = sub.col
FROM (SELECT (1 * normal_rand(1, 50.0, 10.0) + 2)::float4 as col) AS sub,
col3 = sub.col
FROM (SELECT (1 * normal_rand(1, 50.0, 5.0) + 2)::float4 as col) AS sub
Where I get a syntax error at or near col2
But If I play with one single column:
UPDATE schema.table
SET col1 = sub.col
FROM (SELECT (1 * normal_rand(1, 50.0, 20.0) + 2)::float4 as col) AS sub;
It almost works, the query is successful, but I have the exact same number in each row of the column, which obviously doesn't make it a normal distribution!
So my dream would be to be able to update all whole column in one shot, using something such as:
UPDATE schema.table
SET col1 = sub.col
FROM (
SELECT (
1 * normal_rand(
SELECT count(*) FROM schema.table,
50.0,
20.0
) + 2
):: float4 AS col
) AS sub;
But here again, I got a syntax error at the 2nd SELECT position; syntax error at or near "select"
Question
How can I set one or more entire column(s) with random numbers according to a normal distribution?



Aucun commentaire:
Enregistrer un commentaire