Less noise, more data. Get the biggest data report on software developer careers in South Africa.

Dev Report mobile

How Figuring Out the Django ORM Powered Up My Workflow

27 June 2019, by Jethro Muller

The Django ORM is a powerful tool. However, it is often misunderstood and confusing to debug, optimise, and use effectively. Using the Django ORM properly, however, can lead to incredible speed improvements and allow complicated queries to be expressed simply. In this article, I’ll share the most common debugging, optimisation, and querying techniques my team has found to improve productivity and overcome common pitfalls.

Jethro-Muller_how-to-power-up-your-django-orm_Inner-article-image-01

I work as a software developer at Flickswitch, a business service provider in the telecommunications space. Currently, we use Django as the basis for the majority of our product offerings and as the backing for SIMcontrol, our main product that services thousands of clients and processes around 500 000 tasks a day.

All of these clients, tasks, and backend processes interact with our PostgreSQL database via the Django ORM. This important tool that is part of the Django web-framework allows you to access data from different databases with queries written in Python. The ORM takes care of translating the Python into the various different SQL dialects as necessary.

This ability means it produces cross-platform SQL, which allows for easily moving between database implementations. The main benefit of such a system is that no specialised SQL knowledge is required to get started using the Django ORM. Anyone that can write Python can query a database with the ORM.

The automatic generation of SQL is magical. However, unless you know where to look and what to do, improperly written queries in the ORM can cause slowdowns. ORM query code is translated into SQL before being submitted to the database. This translation process, because it’s automated, can do things you don’t expect if you aren’t familiar with the way Django behaves.

This was something we experienced at Flickswitch. For a while, speed had not been a priority for our product. It was more important that we got everything working and had the requisite features ready as soon as possible. This worked initially, but as our reporting and views got more complicated, we experienced slowdowns for our bigger customers, sometimes preventing them from receiving the reports they needed.

As this became increasingly urgent for my team, we decided to learn more about how the ORM could help us fix this. As a start, we decided we needed to know more about debugging the queries that were causing the bottlenecks and slowing down processes. Once we had figured out how to do that, we had to understand which of the ORM’s tools could best help us with speeding up our workflow. We found the following to be particularly useful:

  • Select related
  • Prefetch related
  • Values and value list
  • Subqueries

Using these tools has resulted in fantastic results for my team. Our workflow has sped up enormously and we are now able to deliver reports to our customers more efficiently. I’ll walk you through the process we followed to get to where we are now.

Debugging Queries

Before we could make pages faster, we had to thoroughly understand where and why queries were slow in the first place.

A query in the Django ORM is a Python expression that Django turns into a SQL query in the dialect of the database you’re connecting to. These queries can range from being as simple as selecting all the data from a specific table, to doing complex aggregations, subqueries, and window functions.

They allow developers who are inexperienced with the intricacies of the various SQL databases to write easily readable and maintainable data queries.

Here is a very simple example of a query using the ORM. This query retrieves all the Post instances from the database:

Post.objects.all()

Here is the equivalent SQL query:

SELECT * FROM post;

In these simple cases, everything often goes as planned. The more complex the query becomes, the more likely it is that something will misbehave, resulting in unintended consequences. These issues are often exacerbated by how far removed the developer is from the actual SQL that gets run on the database.

The first and most important step to take when trying to address ORM query problems is to identify and understand the issues at hand. For this to happen, it’s important to find the query causing the slowdown.

To do this, my team used the Django Debug Toolbar.

Django Debug Toolbar

The Django Debug Toolbar is invaluable in determining the cause of any number of issues on a given page on your Django site. It manifests as a slide-out bar on the side of your viewport when DEBUG=True, and displays metrics like code execution time, the current page’s context and template hierarchy, and importantly for us when looking at slow code, the SQL query execution time.

This SQL query view provides a breakdown of the speed of each SQL query on the page, as well as the number of similar and duplicate queries that exist. This view gives us valuable feedback on what the queries that provide the data for our page are doing, which is incredibly important when trying to figure out if there are any problems. If there are duplicate queries on the page, you can take it as a sign that there is likely an issue with the way the queries are being done.

For example, a common issue is looping over the results of an ORM query and accessing a related object (one linked via foreign key) inside the loop. This specific issue is common in ORMs and is called the N+1 selects problem. It is one of the best examples of the problems caused by not fully understanding how the ORM works and the queries it generates.

Printing raw SQL

To understand what an individual query is actually doing behind the scenes, printing the raw SQL is helpful. In the past, my team spent a huge amount of time poring over raw SQL queries to figure out exactly why a query was slower, returning the wrong results or otherwise being uncooperative.

This proved to be unnecessary when we discovered that we could simply call .query on the results of a query. This returned a queryset – the data structure that Django uses to hold results from an ORM query. If printed or case to a string, this .query value would show the raw SQL query.

Similarly, we found that we could use django-extensions, which add many useful utilities and features. I highly recommend this to anyone facing the same problems as us. If you run ./manage.py shell_plus or ./manage.py runserver_plus with the --print-sql option, all ORM queries will automatically print the generated SQL as they’re run against the database.

Optimising Slow Queries

After we had covered the basics of finding and fixing our slow queries, it was time to look at the tools the Django ORM could provide us to help speed them up.

As a starting point, we decided to address:

  • Fixing N+1 querying issues
  • Addressing slow JOINs, and
  • Making queries faster by specifying exactly what data to retrieve.

Select Related

First, we focused on the N+1 issues, as they generally have fairly straightforward solutions. The nuance, we discovered, is in knowing how and when to apply them.

To understand this, here’s a simple example: In SIMcontrol, we have a Recharges model and a Rules model. These represent the basis of SIMcontrol’s functionality – Recharges are done by pre-defined Rules. Due to this business rule, Recharges (in most cases) have a Rule associated with them. This relationship is a foreign key relationship with many Recharges being related to the same Rule.

Part of our reporting involves providing our customers with a list of all the Recharges done on their account for all their managed SIMs. We do so in a Recharges table. In this table, we show the name of the Rule that caused the Recharge.

Before that table was optimised, we were making an additional query for every Recharge in the table. We were doing this because Django does lazy-loading on relationships unless told otherwise. Lazy-loading was great in most cases as we didn’t want extra data being transferred from the server if we weren’t using it. However, when focusing on speed, it became problematic because it caused all the data to be retrieved but in the worst possible way – across multiple queries.

Using the ORM, we wrote this code and used it to get the data required for the Recharges table.

In the snippet below, we used a Django feature called select_related. This feature removes N+1 issues on queries by telling Django before the query is made that we want additional data to be retrieved in a single query.


> Recharge.objects.filter(
    ...  # Excluded for brevity
).select_related(
    ...  # Excluded for brevity
    'rule',
)

The SQL this query generates when it’s evaluated will have a JOIN to the Rule table to request all the data for the rule associated with each Recharge. We thought this was great as it solved the N+1 issue we were facing and turned N+1 queries into one query as if by magic!

Some things to consider

As with most magic tricks, however, we found that there are drawbacks and limitations to this technique. select_related can only be used for foreign key and one-to-one relationships, which begs the question of what to do for other types of relationships?

Another concern worth mentioning is that select_related has to do a JOIN. This can be expensive if there are few related objects being JOINed onto a large dataset as it results in massive amounts of redundant data being transported for each request.

Prefetch Related

These issues around limitations and cost were of concern to us until we discovered prefetch_related – the answer to both of these problems. Very convenient!

This tool solves the same issues that select_related solves, however, it works with many-to-many and many-to-one relationships. It also prevents the issues of loading redundant data over the network, which is an important consideration if you have very few unique related objects.

In the below query, we replaced the select_related with a prefetch_related:


> Recharge.objects.filter(
    ...  # Excluded for brevity
).prefetch_related(
    ...  # Excluded for brevity
    'rule',
)

The result is the same removal of the N+1 queries, but this time it’s replaced with two queries: One does the fetching of the Recharge objects, and the second gets all the Rule objects associated with the Recharge objects retrieved initially. This data is then linked up correctly in Python without any manual intervention.

Some things to consider

When I first started using Django and I needed to use these two tools, I was very confused about the appropriate time and place to apply them.

Using the --print-sql method, that was mentioned earlier, to inspect the query, I found it was easier to understand precisely what each query was actually doing. Knowing that, I was able to understand why I was seeing certain performance characteristics when using the tools on different queries. This made it clearer to see when and where it would be best to apply each of them going forward.

Values and Values List

When writing SQL without an ORM, it’s normal to select only the columns you want to return as part of the SELECT. We found that this isn’t done automatically by the Django ORM and was, therefore, something that needed to be considered when working on making queries faster.

We saw that using either values or values_list would result in only the specified fields being selected for the query:

Query 1

> recharges = Recharge.objects.values('id', 'rule__name')
> recharges[0]
{'id': 1, 'rule__name': '50MB Bundle Topup'}

Query 2

> recharges = Recharge.objects.values_list('id', 'rule__name')
> recharges[0]
(1, '50MB Bundle Topup')

While both of these queries result in the same SQL query, they return different values in Python. Query 1 will return a list of dictionaries with the field names (the strings in values) as the keys, and Query 2 will return a list of tuples with the field’s values in the same order they were passed into values_list.

We were excited to find that values_list has some other tricks up its sleeve too: If you pass in named=True to Query 2, the result will instead be a list of named tuples.

> recharges = Recharge.objects.values_list('id', 'rule__name', named=True)
> recharges[0].id
1
> recharges[0].rule__name
'50MB Bundle Topup'

These tuples are great because they allow for direct dot notation to access the results, meaning that you can reference the values of each result by name instead of by position, something that you would have had to have done in Query 2 above.

Another important feature of values_list that we discovered is the flat=True keyword argument. This only works when a single field is passed into values_list.

> Recharge.objects.values_list('rule__name', flat=True)
['50MB Bundle Topup', 'R12 Airtime Topup', '15MB Topup', ...]

This option flattens the returned list of tuples for a single field, which means that the result is just a list of values.

This is particularly useful when checking more complicated queries where you might want to use the in keyword (like in SQL - SELECT * FROM table WHERE id IN (1, 2, 3);).

> filtered_recharge_ids = Recharge.objects.filter(
    ...  # Some complicated filter
).values_list('id', flat=True)
> Rule.objects.filter(recharges__id__in=filtered_recharge_ids)

Subqueries

Finally, we decided to look into subqueries. These are queries that are done within the context of another query. This allows the subquery to use context from the main query to do filtering, for example. Subqueries can be nested so that you have a subquery within a subquery, within the main query.

We quickly found that learning when to use subqueries is a lot more complicated than can be summarised in this short section. It’s worth noting that unless you need to use subqueries, it is often better to stick to JOINs, as the query optimiser in the database has an easier time optimising them.

That said, we saw that they can be incredibly useful when used in the right place. In SIMcontrol, for example, we use subqueries to gather all of the user email addresses that are associated with an organisation to send notification emails to. The query does the aggregating of the users’ email addresses in the subquery and appends the result to the main queryset:

> user_emails = OrganisationUser.objects.order_by().filter(
      organization_id=OuterRef('id'),
      user__is_active=True,
      user__daily_recharge_digest_mail_opt_in=True,
  ).values('organization_id').annotate(
      emails=ArrayAgg('user__email', distinct=True),
  ).values('emails')

> organisations_with_user_emails = Organisation.objects.annotate(
      emails=Subquery(
          user_emails,
          output_field=ArrayField(base_field=CharField()),
      ),
  )

In the user_emails query, there is an empty order_by call. This call clears any default ordering enforced by the model or present from a previous query. This is important because we are doing a type of aggregation and can’t have fields in the order_by that aren’t used in the aggregation.

In the .filter function call, we have an OuterRef object. This object provides us with a way to make references to the outer context, or the query that the subquery is being done inside. In this example, we are comparing the organization_id to the ids of each row in the Organisation query assigned to organisations_with_user_emails.

In the query assigned to organisations_with_user_emails, we annotate, or, in other words, attach, the results of the subquery to each row of the result of the organisations_with_user_emails query.

Working through examples like this, we got a real sense of the power that could come from using subqueries. They allow you to make more complicated queries and then break them down into simpler pieces that can then be fit together clearly.

The de-mystified ORM

These tools have really improved our day-to-day functioning as a team because they can create performant queries in a way that is easy to read and reason about.

As with anything, knowing that these tools actually exist and figuring out how to use them is the first step to becoming proficient with them. In our experience, it’s only possible to know when to use various tools by trying them out and seeing when and where they work. It’s important to experiment.


Jethro Muller is a software developer at Flickswitch. He primarily works in Python with Django on server-side code. He enjoys ORM query optimisation, building pipelines and tooling, optimising workflows, and playing with AI.

Source-banner--1-

Recent posts

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.