I've been working on a project where I need to perform a fairly involved filter & sort operation on a reasonably large dataset in Django, then serialize the result out to be consumed by some frontend application.
Here's the goal:
- Filter StoreItem results by some criteria. Let's just assume this is already done.
- Sort these StoreItem instances by their respective number of Discounts, in descending order.
- For collisions (i.e. when two StoreItems have the same number of Discounts), randomize that subset of results.
- Paginate the sorted results and return a subset of data.
We have a couple of oversimplified models:
class StoreItem(models.Model):
# Model Fields
name = models.CharField(max_length=128)
price = models.DecimalField(max_digits=4, decimal_places=2)
class Discount(models.Model):
# Model Fields
name = models.CharField(max_length=128)
is_active = models.BooleanField(default=True)
client_verified = models.BooleanField(default=False)
value = models.DecimalField(max_digits=4, decimal_places=2)
# Relationships
store_item = models.ForeignKey(StoreItem, on_delete=models.CASCADE, related_name='discounts')
Pretty straightforward: we have a bunch of StoreItem records, each of which can have N Discounts.
=== Attempt One ===
To solve item #2, we'll use a queryset annotation to calculate the number of Discounts for each StoreItem. For #3, we'll use order_by with the randomization flag.
Let's skip forward and pretend we already have serializers and a StoreItem viewset via Django Rest Framework. In the StoreItem viewset class, we have the following member method:
def filter_queryset(self, queryset):
# perform some complex filtering operation elsewhere
initial_queryset = super(StoreItemViewset, self).filter_queryset(queryset)
# run the annotation to count the number of "valid" discounts
annotated_queryset = initial_queryset.annotate(discount_count=Sum(
Case(
When(is_active=True, client_verified=True, then=1),
default=0, output_field=IntegerField()
)
))
# order our results and return them for serialization (NOTE: doesn't work)
return annotated_queryset.order_by(F('discount_count').desc(nulls_last=True), '?')
For some reason, this doesn't work -- it looks like Django's queryset annotations don't play nicely with the order_by randomization flag.
sorted_results_without_randomization = list(set(list(annotated_queryset.order_by(F('discount_count').desc(nulls_last=True)).values_list('discount_count', flat=True))))
# annotated_results = [0, 1, 2, 3, 5, 6, 10, 30]
sorted_results_with_randomization = list(set(list(annotated_queryset.order_by(F('discount_count').desc(nulls_last=True), '?').values_list('discount_count', flat=True))))
# annotated_results = [0, 1]
The randomization flag destroys/mutates the annotation results. I have no idea why. Maybe there's no convenient to do this in SQL in the first place and so the Django ORM is stuck in the same boat -- but this is pretty unintuitive.
=== Attempt Two ===
Okay, so maybe we'll be able to break the queryset into two parts -- the part with discounts, and the part without them -- then do the sorting logic separately and mash them back together. This obviously won't have the ideal collision resolution I described above, but at least the general order of the queryset will still be right (which is more important).
def filter_queryset(self, queryset):
# perform some complex filtering operation elsewhere
initial_queryset = super(StoreItemViewset, self).filter_queryset(queryset)
# run the annotation to count the number of "valid" discounts
annotated_queryset = initial_queryset.annotate(discount_count=Sum(
Case(
When(is_active=True, client_verified=True, then=1),
default=0, output_field=IntegerField()
)
))
has_discounts = annotated_queryset.filter(discount_count__gt=0).order_by(F('discount_count').desc(nulls_last=True))
no_discounts = annotated_queryset.exclude(discount_count__gt=0).order_by('?')
# important: we can't use any nice set math here (like has_discounts.union(no_discounts))
# since we'd lose our ordering
return list(has_discounts) + list(no_discounts)
This works, but it is unbelievably slow thanks to the list conversion in the last step (around an order of magnitude slower than pure database operations). I've tried a bunch of other approaches -- using itertools, for example -- but none of them have really helped.
Does anyone have thoughts on how this can be made faster? I feel like I'm missing something really obvious, since this doesn't strike me as an uncommon use-case.
Thanks!
Aucun commentaire:
Enregistrer un commentaire