Model.objects.all().distinct('foo').order_by('bar')
The above generates the following SQL:
SELECT DISTINCT ON ("app_model"."foo") <lots of columns here>
FROM "app_model"
ORDER BY "app_model"."bar" ASC;
I am not sure about all backends however the above syntax is not allowed in PostgreSQL which produces the following error:
ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions
Here are PostgreSQL docs explaining why that is not allowed:
DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. [...] Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first. [...] The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s).
This ticket proposes to use subqueries in such situations which would use SQL:
SELECT *
FROM (
SELECT DISTINCT ON ("app_model"."foo") <lots of columns here>
FROM "app_model"
) result
ORDER BY "app_model"."bar" ASC;
The above is perfectly valid SQL and produces expected results (please note that ORDER_BY
is in the outer query to guarantee that distinct results are correctly sorted).
I created a simple proof-of-concept patch by overwriting few things in SQLCompiler.as_sql()
which seems to work pretty well. The patch only creates subquery when the above dilema is encountered which should not have any negative side-effects on existing queries (since such queries were not allowed by SQL). The patch also works on the .count()
queries since Django then strips any ordering hence the subquery is never created.
Thanks for the patch! The feature idea generally makes sense to me - taking something that currently results in an error and making it instead return sane results. Tentatively accepting based on that. Some thoughts:
- We need to check what the other core backends currently do with this query. If there are backends in which that query is currently not an error, but this patch would change their results, that might be a problem.
- If I understand the Postgres docs correctly, your proposed subquery SQL results in an undefined/unpredictable selection of which actual row is chosen of each group where the
DISTINCT ON
query returns the same value. I guess if this is a problem, the answer is "make your ORDER BY
match the DISTINCT ON
", but it seems like a non-obvious subtlety that might bite people.
If nobody else sees any blockers that I'm missing, the things that would still be needed here are:
a) Add a test demonstrating the new behavior.
b) Add/update the documentation as needed (I think this is a behavior that ought to be mentioned in the docs somewhere).
c) Turn it into a pull request so the CI runs on it and we can see if it breaks any existing tests.
Once those are done there may be some comments on code style or implementation choices. The POC code seems clear enough, but I'd defer to others who know the ORM better on whether it's actually implemented at the right level / in the right place.
Per the Postgres documentation snippet, shouldn't the ORDER BY
actually be applied to the sub-query to ensure a predictable result?
Also, make sure to test behavior with both bare distinct()
calls and calls where joins are involved (e.g. Model.distinct('fkfield__related_model_field')
)
I think the idea is that in this case the user has supplied an ORDER BY
which _can't_ be applied to the sub-query (because it doesn't match the DISTINCT ON
), and so we accept unpredictable results in order to provide some results which match what the user requested.
This business of unpredictable results is what makes me most uncomfortable with the patch.
Would it be better to auto-generate an ORDER BY
for the subquery matching the DISTINCT ON
?
If we did that, the subquery isn't even needed: we could get the same results by just automatically prepending the ORDER BY
such that it matches the DISTINCT ON
.
Not sure if that's a good idea, but it seems like perhaps at least as good an idea as the one proposed here.
I completely agree with that statement. Let me explain my-use case and then maybe possible solutions.
class ModelFoo(models.Model):
bars = models.ManyToManyField(ModelBar, related_name='foos')
class ModelBar(models.Model):
latlon = models.PointField(spatial_index=True)
# query foos where they have related bars within certain distance
# and sort by increasing distance
(ModelFoo.objects
# filter by distance
.filter(bars__latlon__distance_lte=(point, Distance(mi=1)))
# add distance annotation
.distance(point, field_name='bars__latlon')
# sort by distance
.extra(order_by=['distance'])
# distinct foos
.distinct('id'))
The reason I needed distinct here is because otherwise whenever foo had multiple bars, I would get duplicate foo models due to the join. Since in my case I simply used bars for filtering foos, I didnt care about "unpredictable" order at all since I only cared about foos. As for sorting, initially I also thought that ORDER BY
should be inside the inner query and DISTINCT
on the outer query however for some reason when I did that, foos were not sorted by distance anymore.
This however is just my use-case and I do see how this can cause issues in other scenarios since you cannot be sure what comes first in the DISTINCT
.
Here are some possible solutions:
- It seems that whenever a pattern of many-guys is used to filter single guys on one/many-to-many and sorting is required, this pattern might be useful. So maybe this functionality should only be enabled in those cases.
- What about if this behavior will be configurable? More specifically what about if the user will be able to control:
- explicitly enable this functionary since otherwise Django cannot guarantee data integrity for all use-cases. This will force the user to explicitly acknowledge they want to do this hence eliminating the burden from Django to always provide data-integrity (and explain all the cases when it is not provided).
- what goes to inner and outer subquery (
ORDER BY
or DISTINCT
). Maybe even allow ORDER BY
in both queries. So the user will pick some sort columns for both inner and outer query hence solving "predictability" of DISTINCT
and allowing to sort overall results.
I do like the idea of making these things customizable however not sure what would be an API to do those customizations. Is there precedent like that in Django ORM for query customizations?
- Perhaps the right solution is try not to solve my particular use-case but provide a more generic solution to use nested queries. Maybe use API something like:
Model.objects.nested(
Model.objects.filter(...).distinct(...).order_by(...) # inner
).order_by(...) # outer
Let me know your thoughts.
We need to go wirh option 2, specifically the solution where the inner and outer query have differen order by clauses. This way you can order the results the way you want, yet pick the distinct element as you wish.
I'vw thought of a solution something like this
qs.distict('foo').order_by('foo').barrier().order_by('bar')
The idea is that the barrier call effectively forces everything left of the call to go into subquery, then things after the barrier go into outer query. This construct would be hugely useful in other use cases, too.
@miki725 - In a case like yours, I've generally just done .distinct()
instead of .distinct('id')
. Since you're only selecting columns from foo, that should give the same results, and it makes Postgres happy with any ORDER BY
clause you like, with no need for a subquery. So I'm not sure that your particular case actually offers a use case for this feature - in this case our support for DISTINCT ON
is actually an "attractive nuisance."
@akaariai - that API doesn't feel quite right to me. order_by()
in every other case determines the ordering of the returned results. Just because in this case we are also applying a SQL ORDER BY
clause in the subquery (an implementation detail) doesn't mean that overloading order_by()
is the right API for it. For addressing this particular use case, I think a more intuitive API would be to introduce an ordering argument to .distinct()
to allow customization of the ordering used to pick the distinct row. Of course, that's a less general API - but I think we would need a more thorough list of the cases where subqueries are needed in order to see what a generic subqueries API should look like (e.g. are you intending to limit this API to cases where the outer query is simply a SELECT *
from the subquery?). It feels to me that a generic subqueries API should involve passing one queryset to another, or wrapping one in another, much like the current case where passing a queryset to an __in
filter results in a subquery.
Model.objects.filter(
pk__in=Subquery(
Model.objects.all().distinct('foo').order_by('-baz').values('pk')
).order_by('bar')