I want to RANDOMLY select one row from a large table (>30 million rows) that is under heavy writes/reads.
My problem: I can't leave the pick to PostgreSQL's arbitrary choice (that would have been the cheapest/fastest query, just using 'LIMIT 1'), because it behaves erratically and in "obscure ways": see my initial problem here: postgresql 9.4 - prevent app selecting always the latest updated rows
Here is my current query:
UPDATE opportunities s
SET    opportunity_available = false
FROM  (
   SELECT id
   FROM   opportunities
   WHERE  deal_id = #{@deal.id}
   AND    opportunity_available
   AND    pg_try_advisory_xact_lock(id)
   LIMIT  1
   FOR    UPDATE
   ) sub
WHERE  s.id = sub.id
RETURNING s.prize_id, s.id;
-- inspired by http://ift.tt/1PLUXyH
I asked a first question (postgresql 9.4 - prevent app selecting always the latest updated rows), and even though there is no definitive answer there, what seems to happen is that PostgreSQL is left free to make an arbitrary pick (I only use 'LIMIT 1' because I wanted the cheapest/fastest query), which is VERY DIFFERENT from a RANDOM pick. As a consequence, PostgreSQL often returns the rows most recently updated by the administrator (which are always opportunities that have all prizes), instead of really choosing a row at random.
I think I need to move away from the arbitrary pick to get a RANDOM pick.
In that context, what is the best choice, i.e. the fastest way to select such a row? (Note the 'FOR UPDATE' and the advisory locks: I need to lock rows while they are being updated so concurrent calls don't grab the same one... I'll switch to SKIP LOCKED as soon as PostgreSQL 9.5 goes out of beta.)
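For the locking part only, here is a minimal sketch of what my query could look like on 9.5 (same opportunities table and Ruby interpolation as above; this is an assumption about how I'd rewrite it, not tested code). SKIP LOCKED does not make the pick random, it only changes how concurrent callers behave, so the options below are still needed:
UPDATE opportunities s
SET    opportunity_available = false
FROM  (
   SELECT id
   FROM   opportunities
   WHERE  deal_id = #{@deal.id}
   AND    opportunity_available
   LIMIT  1
   -- SKIP LOCKED (9.5+) replaces the pg_try_advisory_xact_lock(id) filter:
   -- concurrent transactions silently skip rows that are already locked instead of waiting.
   FOR    UPDATE SKIP LOCKED
   ) sub
WHERE  s.id = sub.id
RETURNING s.prize_id, s.id;
So, the candidate approaches for getting a truly random pick: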
- Use ORDER BY random(): notoriously slow on large datasets (many posts on Stack Overflow and the DBA Stack Exchange say so): "ORDER BY RAND() is slow because the DBMS has to read all rows, sort them all, just to keep only a few rows. So the performance of this query heavily depends on the number of rows in the table, and decreases as the number of rows increases.", as explained here or here (sketched after this list).
- Use OFFSET with a random offset: also known to be slow for large datasets.
- Use sampling, as explained/advised here by what seem to be big experts: http://ift.tt/1H3dWN5 (also sketched after this list).
- Use another advanced technique you might suggest.
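To make the trade-off concrete, here is a minimal sketch of the first and third options. This is not my production code, and the second form assumes opportunities.id is an integer primary key without huge gaps (gaps skew the probabilities):
-- 1) ORDER BY random(): exact, but it has to read and sort every qualifying row
--    just to keep one, so it degrades badly on ~30M rows.
SELECT id
FROM   opportunities
WHERE  deal_id = #{@deal.id}
AND    opportunity_available
ORDER  BY random()
LIMIT  1;

-- 2) Random-id probe (the "sampling" idea): jump to a random point in the id range
--    and take the first available row at or after it. Fast (one index probe), but
--    rows that follow gaps in the id sequence are slightly more likely to be picked,
--    and a probe near the top of the range can miss entirely (retry in that case).
SELECT id
FROM   opportunities
WHERE  deal_id = #{@deal.id}
AND    opportunity_available
AND    id >= (SELECT (min(id) + floor(random() * (max(id) - min(id) + 1)))::bigint
              FROM opportunities)
ORDER  BY id
LIMIT  1;
Either SELECT could replace the inner sub-select of my UPDATE above, keeping pg_try_advisory_xact_lock(id) and FOR UPDATE around it.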