dimanche 8 avril 2018

Django queryset annotation mutated by randomization

I've been working on a project where I need to perform a fairly involved filter & sort operation on a reasonably large dataset in Django, then serialize the result out to be consumed by some frontend application.

Here's the goal:

  1. Filter StoreItem results by some criteria. Let's just assume this is already done.
  2. Sort these StoreItem instances by their respective number of Discounts, in descending order.
  3. For collisions (i.e. when two StoreItems have the same number of Discounts), randomize that subset of results.
  4. Paginate the sorted results and return a subset of data.

We have a couple of oversimplified models:

class StoreItem(models.Model):
    # Model Fields
    name = models.CharField(max_length=128)
    price = models.DecimalField(max_digits=4, decimal_places=2)

class Discount(models.Model):
    # Model Fields
    name = models.CharField(max_length=128)
    is_active = models.BooleanField(default=True)
    client_verified = models.BooleanField(default=False)
    value = models.DecimalField(max_digits=4, decimal_places=2)

    # Relationships
    store_item = models.ForeignKey(StoreItem, on_delete=models.CASCADE, related_name='discounts')

Pretty straightforward: we have a bunch of StoreItem records, each of which can have N Discounts.

=== Attempt One ===

To solve item #2, we'll use a queryset annotation to calculate the number of Discounts for each StoreItem. For #3, we'll use order_by with the randomization flag.

Let's skip forward and pretend we already have serializers and a StoreItem viewset via Django Rest Framework. In the StoreItem viewset class, we have the following member method:

def filter_queryset(self, queryset):
    # perform some complex filtering operation elsewhere
    initial_queryset = super(StoreItemViewset, self).filter_queryset(queryset)

    # run the annotation to count the number of "valid" discounts
    annotated_queryset = initial_queryset.annotate(discount_count=Sum(
        Case(
            When(is_active=True, client_verified=True, then=1),
            default=0, output_field=IntegerField()
        )
    ))

    # order our results and return them for serialization (NOTE: doesn't work)
    return annotated_queryset.order_by(F('discount_count').desc(nulls_last=True), '?')

For some reason, this doesn't work -- it looks like Django's queryset annotations don't play nicely with the order_by randomization flag.

sorted_results_without_randomization = list(set(list(annotated_queryset.order_by(F('discount_count').desc(nulls_last=True)).values_list('discount_count', flat=True))))
# annotated_results = [0, 1, 2, 3, 5, 6, 10, 30]

sorted_results_with_randomization = list(set(list(annotated_queryset.order_by(F('discount_count').desc(nulls_last=True), '?').values_list('discount_count', flat=True))))
# annotated_results = [0, 1]

The randomization flag destroys/mutates the annotation results. I have no idea why. Maybe there's no convenient to do this in SQL in the first place and so the Django ORM is stuck in the same boat -- but this is pretty unintuitive.

=== Attempt Two ===

Okay, so maybe we'll be able to break the queryset into two parts -- the part with discounts, and the part without them -- then do the sorting logic separately and mash them back together. This obviously won't have the ideal collision resolution I described above, but at least the general order of the queryset will still be right (which is more important).

def filter_queryset(self, queryset):
    # perform some complex filtering operation elsewhere
    initial_queryset = super(StoreItemViewset, self).filter_queryset(queryset)

    # run the annotation to count the number of "valid" discounts
    annotated_queryset = initial_queryset.annotate(discount_count=Sum(
        Case(
            When(is_active=True, client_verified=True, then=1),
            default=0, output_field=IntegerField()
        )
    ))

    has_discounts = annotated_queryset.filter(discount_count__gt=0).order_by(F('discount_count').desc(nulls_last=True))
    no_discounts = annotated_queryset.exclude(discount_count__gt=0).order_by('?')

    # important: we can't use any nice set math here (like has_discounts.union(no_discounts))
    # since we'd lose our ordering
    return list(has_discounts) + list(no_discounts)

This works, but it is unbelievably slow thanks to the list conversion in the last step (around an order of magnitude slower than pure database operations). I've tried a bunch of other approaches -- using itertools, for example -- but none of them have really helped.

Does anyone have thoughts on how this can be made faster? I feel like I'm missing something really obvious, since this doesn't strike me as an uncommon use-case.

Thanks!




Aucun commentaire:

Enregistrer un commentaire