jeudi 4 mars 2021

Unexpected behaviour of rand() in MySQL

I encountered a very weird result while trying to filter my data using RAND() function. Suppose i have a table filled with some data:

CREATE TABLE `status_log` (
 `id` int(11) NOT NULL AUTO_INCREMENT,
`rank` int(11) DEFAULT 50,
)

Then i do the following simple select: select id,rank as rank,(rand()*100) as thres from status_log where rank = 50, and have a clear and expected output:

<...skip...>
| 6575476 |   50 |     34.51090244065123 |
| 6575511 |   50 |     67.84258230388404 |
| 6575589 |   50 |     35.68020727083106 |
| 6575644 |   50 |     74.87329251586766 |
| 6575723 |   50 |     67.32584384020961 |
| 6575771 |   50 |    12.009344726809621 |
| 6575863 |   50 |     58.06919518678374 |
+---------+------+-----------------------+
66169 rows in set (2.502 sec).

So, i generate some random data from 0 to 100 and join each result to the table, around 66000 results in total.

Then i want only the (random) part of the data to be shown. It doesn't have any purpose for production, by the way, it's just some artificial test, so let's not discuss it.

select * from (select id,rank as rank,(rand()*100) as thres from status_log where rank = 50) t where thres>rank order by thres;

After that i get the following:

<...skip...>
| 4396732 |   50 |  99.97966075314177 |
| 4001782 |   50 |  99.98002871869134 |
| 1788580 |   50 |  99.98064143581375 |
| 5300286 |   50 |  99.98275954274717 |
|  146401 |   50 |  99.98552389441573 |
| 4744748 |   50 |  99.98644758014609 |
+---------+------+--------------------+
16449 rows in set (2.188 sec)

It's obvious that for the mean of 50 the expected number of results should be around 33000 out of total 66000. So it seems that the distribution of rand() is biased, correct?

Let's then change > to <:

select * from (select id,rank as rank,(rand()*100) as thres from status_log where rank = 50) t where thres<rank order by thres;

<...skip...>
| 4653786 |   50 |    49.98035016467827 |
| 6041489 |   50 |   49.980370281245904 |
| 5064204 |   50 |   49.989308742796354 |
| 1699741 |   50 |   49.991373205549436 |
| 3234039 |   50 |    49.99390454030959 |
|  806791 |   50 |    49.99575274996064 |
| 3713581 |   50 |    49.99814410693771 |
+---------+------+----------------------+
16562 rows in set (2.373 sec)

Again 16000! So not the half but the quarter of all results is shown! It seems that the output of rand() inside the brackets is somehow influenced with the expression outside them. How is this possible?

I can also union it:

select * from (select id,rank as rank,(rand()*100) as thres from status_log where rank = 50) t where thres<50
UNION ALL
select * from (select id,rank as rank,(rand()*100) as thres from status_log where rank = 50) t where thres>=50;

The expected number of results has to be somewhere around 66000, but it returns only 33000 or so.

I observe this behavior only when rand() is non-deterministic and is generated dynamically each time. If i do ...select id,rank as rank,(rand(id)*100)... (i.e. make the output of rand() dependent of id), i start getting the expected number of results (33000-ish). The same happens if i precalculate and fill a temporary field in the table.

I also tried making the filtering with rank=30, and the results were ~6000 and ~32000 for < and > respectively.

Version 10.5.8-MariaDB-3, InnoDB




Aucun commentaire:

Enregistrer un commentaire