samedi 15 août 2020

How do I quickly get a weighted random instance of a Django model instance based on a weight field on that model?

I'm using Postgres as the database backend but I don't think that will matter. Also, I'd like to still use sqlite3 as the local development database, so ideally any approach will work for both.

By "weighted", I mean that some items in that database table are more likely to appear than others based on a heuristic value ranging from 0 to +inf where 0 is "never going to be picked", 1 is "as equal to be picked as any other instance, and 2 is "twice as likely to be picked than any other instance".

I've read other SO posts about pulling random instances of models, but as far as I've seen there are no ways to do this quickly involving weights.

My model:

  • Has millions of instances.
  • Has a weight DecimalField which can be updated at any time, even during runtime.
  • Is not referenced anywhere except for using this random selection algorithm (so, for example, it can be deleted and recreated at any time with no problems).

What I'm after is a fast way to do this that is faster than the solutions I've tried or an explanation to why one of my solutions that I've tried is the fastest I can get.

Avoiding XY problem

I want to select "fresh" stuff from a database table but still give the chance of seeing some of the older content. If some content has been viewed too frequently or it is not well-received, it should appear less frequently. Ideally, I would be able to control how frequently this is: "ah, so this will appear 1.5 times more than other content on the site."

Stuff I've tried

Selecting randomly and trying again based on a probability roll

  • Count the total number of instances of the model. E.g.: 100.
  • Select a random instance of the model.
  • Roll a dice from 1 * weight / instances_count to determine if the item should be thrown back and a random choice be made again.

This seems like it's pretty slow and the random nature of the "throw back" might never terminate. In general, this is really janky and I would hate to use it. Putting it first as it is pretty "simple" and I'm most likely to dismiss it.

SELECTing every element ID and weight and using a random weight algorithm to select an ID

  • SELECT all of the rows.
  • Assign each row a range of IDs based on the weights.
  • Add up all of the weights dynamically.
  • Roll a sum_of_all_weights-sided dice.
  • Whatever is selected, pick an ID based on the weights.

Problem is, this algorithm seems slow for millions of rows. This is the "naive" solution.

Assigning a range of IDs depending on the weight and dynamically delete/re-create instances

  • Whenever something is added or the weight is changed, delete all instances containing the unique details of the instance and create weight more instances with the same meta-information.
  • Pick a random instance normally.

A caveat with this is that only integer-based weightings are possible. Also, the performance problem is offloaded from the SELECT to the INSERT and DELETE operations.

Rethink the entire model and introduce a "throw-back" value

  • Add a throw_back_probability field instead of weight.
  • If the probability is 0.0 it will never be thrown back. Otherwise, roll a die and throw back if required depending on the throw_back_probability.
  • Limit the algorithm to 3 "throw-back"s (or some arbitrary number).

This ultimately solves the problem but still probably requires more database calls and is an indirect solution.


It's SO, so I'm sure there are clever annotation-based solutions to this (or similar) that I'm overlooking. Thanks for the help in advance!




Aucun commentaire:

Enregistrer un commentaire