PostgreSQL, Django and Universally Unique Identifiers (UUID)
This article examines a case for using GUID/UUIDs and how both PostgreSQL and Django work together to make it easy to use and scale them. There are insights into how indexes are used, measuring performance, and discovering the size of an index on disk.
What is a Universally Unique Identifier?
A UUID, also known as a GUID Globally unique identifier, is a very very large number, often random, with special properties to make it easier to use.
What are the special properties of a UUID?
- They are very large numbers;
- They are considered globally unique;
- They are often completely random;
- They are people friendly with a recognised and standardised display format;
- They are a compact 128 bit number;
To put this into context:
- 8 bit numbers are between 0 and 255;
- 16 bit numbers are between 0 and 65,535;
- 32 bit numbers are between 0 and 4,294,967,295;
- 64 bit numbers are between 0 and 18,446,744,073,709,551,615;
- 128 bit numbers are between 0 and 340,282,366,920,938,463,463,374,607,431,768,211,455.
To make them easier for us humans to use, they are usually represented as follows:
a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11
There are different versions, the one above is version 4 - random.
UUIDs are mainly used internally by computer software, but they can be used as unique references and in web URIs, for example:
https://example.com/info/a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11
Anonymous? When random, they cannot be guessed. Because they have a standard representation layout it's impossible deduce any meaningful information from them.
They usually go unnoticed but if you see them at all you would probably find them in a link or scan of a QR code.
UUIDs in PostgreSQL
Databases can store UUIDs as a 36 character text field or as a 128 bit number, PostgreSQL have chosen to store UUIDs as a compact 16 byte (128 bit) number with efficient conversion between the two representations.
Our example table uses a uuid_ref in addition to an integer id because as UUIDs are not sequential they make poor primary keys.
CREATE TABLE example_uuid (
id integer PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
uuid_ref uuid,
title varchar(250)
);
We will insert a single record using the following:
INSERT INTO example_uuid (uuid_ref, title)
VALUES(‘a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11’, 'This is the first UUID');
As Django has good SQL management, this article will manage table and index creation using that framework.
Testing and measurement should be done with realistic data. If you use test data, any measurements will be performed against that data. Our example_uuid table has been populated with 3 million rows and that is left as an exercise for the reader.
Size matters so check the size of the data using the following SQL:
SELECT
id,
pg_column_size(id),
uuid_ref,
pg_column_size(uuid_ref),
title,
pg_column_size(title),
FROM example_uuid;
As the following result set shows, PostgreSQL stores an integer as 4 bytes and uuid as 16:
id | id_size | uuid_ref | uuid_ref_size
----+---------+--------------------------------------+--------------
1 | 4 | a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11 | 16
Why does size matter? The more data there is to process, the more resources the query uses and the longer it takes to deliver results. A smaller dataset will be faster and use fewer resources.
UUIDs in Django
Django understands UUIDs. It can ensure an incomming request has a valid UUID and thanks to Python's uuid library, has native RFC 4122 UUID objects.
URL Dispatcher
On entry, Django's URL dispatcher can test for valid strings, capture then pass them onto a view. With the example url http://127.0.0.1:8000/uuid/a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11/ path('uuid/<uuid:uuid_ref>/', views.view_uuid) reliably passes the captured uuid as uuid_ref to views.view_uuid(), if the uuid is badly formed then the dispatcher tries other urlpatterns.
Database
Django understands PostgreSQL and MySQL handle UUIDs differently and does the right thing.
Python
Python's uuid library, has native RFC 4122 UUID objects that has extensive support including generating and converting between formats.
The Django app
To focus on how Django and PostgreSQL work together, the code in this article is over simplified and error checking omitted.
This article uses the following models, urls and views.
# models.py
import uuid
from django.db import models
UUID_NULL = uuid.UUID('00000000-0000-0000-0000-000000000000')
class ExampleUUID(models.Model):
uuid_ref = models.UUIDField('UUID Reference', default=UUID_NULL, db_index=True)
title = models.CharField(max_length=250)
def __str__(self):
ret = '{}: {}: {}'.format(self.id, self.uuid_ref, self.title)
return ret
class Meta:
managed = True
db_table = 'example_uuid'
# urls.py
from django.contrib import admin
from django.urls import path
from . import views
urlpatterns = [
path('admin/', admin.site.urls),
path('uuid/<uuid:uuid_ref>/', views.view_uuid),
re_path('search_uuid/(?P<search>[0-9a-fA-F]+)/', views.search_uuid),
]
* Note that Django pattern matches an incoming URL then forwards it to the relevant view. <uuid:uuid_ref> checks the string is a valid UUID, then passes it as a valid parameter to view view_uuid(). The search path only checks for characters found in the first part of a uuid.
# views.py
from example_uuid.models import ExampleUUID
from django.http import HttpResponse
def view_uuid(request, uuid_ref):
"""View a UUID"""
data = ExampleUUID.objects.get(uuid_ref=uuid_ref)
res = "<html><body><p>ID: {}<br>Reference: {}<br>Title: {}</p></body></html>".format(
data.id,
data.uuid_ref,
data.title)
return HttpResponse(res)
def search_uuid(request, search):
"""Search for UUIDs"""
data = ExampleUUID.objects.filter(
uuid_ref__uuid_startswith=search
).order_by('uuid_ref')
uuid_list = [str(d.uuid_ref) for d in data]
res = "<html><body><div>{}</div></body></html>".format('<br/>'.join(uuid_list))
return HttpResponse(res)
Django database management
Django has good database management and normally does the right thing.
We can check what SQL would be generated to create our tables
example_uuid % ./manage.py showmigrations
admin
[X] 0001_initial
[X] 0002_logentry_remove_auto_add
[X] 0003_logentry_add_action_flag_choices
auth
[X] 0001_initial
[X] 0002_alter_permission_name_max_length
[X] 0003_alter_user_email_max_length
[X] 0004_alter_user_username_opts
[X] 0005_alter_user_last_login_null
[X] 0006_require_contenttypes_0002
[X] 0007_alter_validators_add_error_messages
[X] 0008_alter_user_username_max_length
[X] 0009_alter_user_last_name_max_length
[X] 0010_alter_group_name_max_length
[X] 0011_update_proxy_permissions
contenttypes
[X] 0001_initial
[X] 0002_remove_content_type_name
example_uuid
[ ] 0001_initial
sessions
[X] 0001_initial
example_uuid % ./manage.py sqlmigrate example_uuid 0001_initial
BEGIN;
--
-- Create model ExampleUUID
--
CREATE TABLE "example_uuid" ("id" serial NOT NULL PRIMARY KEY, "uuid_ref" uuid NOT NULL, "title" varchar(250) NOT NULL);
CREATE INDEX "example_uuid_uuid_ref_a11b7e48" ON "example_uuid" ("uuid_ref");
COMMIT;
Although good, there are edge cases where, for example in this article, we need to customise Django to allow us to lookup UUIDs in a more efficient manner. We will create a custom lookup uuid_startswith as follows:
import uuid
from django.db.models.fields import Field
@Field.register_lookup
class UUIDStartswith(Lookup):
"""uuid_startswith produces a SQL BETWEEN statement
with the search string padded with 0s and Fs
to form a well formed uuid"""
lookup_name = 'uuid_startswith'
def as_sql(self, compiler, connection):
lhs, lhs_params = self.process_lhs(compiler, connection)
rhs, rhs_params = self.process_rhs(compiler, connection)
params = []
uuid_min = '00000000-0000-0000-0000-000000000000'
uuid_max = 'ffffffff-ffff-ffff-ffff-ffffffffffff'
uuid_search_min = "{}{}".format(rhs_params[0], uuid_min[len(rhs_params[0]):36])
uuid_search_max = "{}{}".format(rhs_params[0], uuid_max[len(rhs_params[0]):36])
try:
uuid_search_min = str(uuid.UUID(uuid_search_min))
uuid_search_max = str(uuid.UUID(uuid_search_max))
sql = "{} BETWEEN '{}' AND '{}'".format(lhs, uuid_search_min, uuid_search_max)
except ValueError as err:
sql = "{} = '{}'".format(lhs, uuid_min)
return sql, params
By ensuring generated UUIDs are valid, we ensure errors are caught early and not passed to the database. If an invalid search produces an poorly formed uuid, the query will ensure no data is returned and few resources are wasted.
The above view and custom lookup generated the following SQL from http://127.0.0.1:8000/search_uuid/a0ede/:
(0.001)
SELECT "example_uuid"."id", "example_uuid"."uuid_ref", "example_uuid"."title"
FROM "example_uuid"
WHERE "example_uuid"."uuid_ref"
BETWEEN 'a0ede000-0000-0000-0000-000000000000'
AND 'a0edefff-ffff-ffff-ffff-ffffffffffff'
ORDER BY "example_uuid"."uuid_ref" ASC
LIMIT 21; args=()
As can be seen from the results it returned four rows out of a potential 3m in 0.001 second.
Django can be used to create and manage custom indexes
Lets create a custom index on SUBSTR(CAST(uuid_ref AS text), 1, 8), first create an empty migration:
example_uuid % python manage.py makemigrations --empty --name index_substr_uuid_ref example_uuid
Migrations for 'example_uuid':
example_uuid/migrations/0002_index_substr_uuid_ref.py
By giving --name index_substr_uuid_ref it's easier to see which migration will create the index.
The following code shows creating and dropping the index in example_uuid/migrations/0002_index_substr_uuid_ref.py:
from django.db import migrations
class Migration(migrations.Migration):
dependencies = [
('example_uuid', '0001_initial'),
]
operations = [
migrations.RunSQL(
[("CREATE INDEX example_uuid__substr_uuid_ref "
"ON example_uuid (SUBSTR(CAST(uuid_ref AS text), 1, 8));")],
[("DROP INDEX example_uuid__substr_uuid_ref;")],
),
]
It's easy to check the SQL that will be generated:
example_uuid % python manage.py sqlmigrate example_uuid 0002_index_substr_uuid_ref
BEGIN;
--
-- Raw SQL operation
--
CREATE INDEX example_uuid__substr_uuid_ref ON example_uuid (SUBSTR(CAST(uuid_ref AS text), 1, 8));
COMMIT;
Then apply the migration:
example_uuid % python manage.py migrate
Operations to perform:
Apply all migrations: admin, auth, contenttypes, example_uuid, sessions
Running migrations:
Applying example_uuid.0002_index_substr_uuid_ref... OK
This explain shows that the index is successfully used by the query and returns the result in under 3ms.
example_uuid=# EXPLAIN ANALYSE SELECT uuid_ref FROM example_uuid WHERE substr(uuid_ref::text, 1, 8) > 'aaa' AND substr(uuid_ref::text, 1, 8) < 'aab';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on example_uuid (cost=386.17..30997.42 rows=14999 width=16) (actual time=1.251..2.590 rows=745 loops=1)
Recheck Cond: ((substr((uuid_ref)::text, 1, 8) > 'aaa'::text) AND (substr((uuid_ref)::text, 1, 8) < 'aab'::text))
Heap Blocks: exact=741
-> Bitmap Index Scan on example_uuid__substr_uuid_ref (cost=0.00..382.42 rows=14999 width=0) (actual time=0.991..0.991 rows=745 loops=1)
Index Cond: ((substr((uuid_ref)::text, 1, 8) > 'aaa'::text) AND (substr((uuid_ref)::text, 1, 8) < 'aab'::text))
Planning Time: 0.142 ms
Execution Time: 2.725 ms
(7 rows)
As indexes can be an overhead unless they are not used enough to warrant their existence, it's easy to un-apply the last migration by specifying the migration before this one. By using --fake the migration can be logged but not applied. Perhaps a future version will be able to select and apply migrations out of order.
example_uuid % python manage.py migrate example_uuid 0001
Operations to perform:
Target specific migration: 0001_initial, from example_uuid
Running migrations:
Rendering model states... DONE
Unapplying example_uuid.0002_index_substr_uuid_ref... OK
example_uuid % python manage.py migrate --fake example_uuid 0002
Operations to perform:
Target specific migration: 0002_index_substr_uuid_ref, from example_uuid
Running migrations:
Applying example_uuid.0002_index_substr_uuid_ref... FAKED
As can be seen, the query still runs, it's just not optimised with the index.
example_uuid=# EXPLAIN ANALYSE SELECT uuid_ref FROM example_uuid WHERE substr(uuid_ref::text, 1, 8) > 'aaa' AND substr(uuid_ref::text, 1, 8) < 'aab';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..82852.64 rows=14999 width=16) (actual time=0.900..678.534 rows=745 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on example_uuid (cost=0.00..80352.74 rows=6250 width=16) (actual time=6.603..665.920 rows=248 loops=3)
Filter: ((substr((uuid_ref)::text, 1, 8) > 'aaa'::text) AND (substr((uuid_ref)::text, 1, 8) < 'aab'::text))
Rows Removed by Filter: 999691
Planning Time: 0.122 ms
Execution Time: 678.601 ms
(8 rows)
Without the index, insertions will apply faster and some queries will be slower. YMMV
After creating and populating our database with data we can start to test performance.
SQL Performance testing
As this part is about measuring performance of UUIDs, we'll use the SQL generated by the Django app in psql to evaluate how the database is performing and offering insights using our data. It's worth remembering that any evaluation is against test data so it should be realistic.
How are UUIDs used?
There are two main uses for UUID data types, both are mainly used as unique random reference numbers.
The first is a single record exact match, for example a clicked link or match against an identity reference such as scanning a QR or bar code. As this is an exact match using an index to return a single record, the performance should be almost instant.
The second is as a multiple record lookup for humans to make a selection, for example in a call centre where the first few characters could filter to a few perhaps using the last few digits as a check.
Is the data random?
In our testing, the UUID generation appeared genuinely random using the following SQL GROUP BY to measure uniqueness of our 3m sample records, the minimum count was 20 and the max 80. With 3 it was 613 and 830. This suggests using the first four and last two characters of the text representation would normally be enough to identify a record.
SELECT SUBSTR(uuid_ref::text,1,4) AS sub_uuid_ref, count(*)
FROM example_uuid
GROUP BY 1
ORDER BY 2;
B-tree indexes are ideal for both exact matches and range lookups, but there’s a catch. Remember the difference between a uuid being a text and a number? This makes a difference in how indexes are used.
Before looking at the results it's worth listing the types of lookup and the index size. Size matters because the more data there is to process, the more resources the query uses and the longer it takes to deliver results.
There are two indexes that we will use, a native UUID and a SUBSTR of the first few characters of the text representation.
CREATE INDEX example_uuid__uuid_ref ON example_uuid (uuid_ref);
CREATE INDEX example_uuid__substr_uuid_ref ON example_uuid (SUBSTR(CAST(uuid_ref AS text), 1, 8));
After we have added our sample data (3m rows) the size of the table and indexes are as follows:
SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) AS size
FROM pg_class
WHERE relname IN (
'example_uuid',
'example_uuid__uuid_ref',
'example_uuid__substr_uuid_ref'
);
relname | size
-------------------------------+--------
example_uuid | 841 MB
example_uuid__uuid_ref | 90 MB
example_uuid__substr_uuid_ref | 90 MB
Interestingly, the uuid and SUBSTR(CAST(example_uuid.uuid_ref AS text)), 1, 8) indexes take up the same size on disk. If we were to use the whole uuid for a text index it would be almost twice the size at 169Mb.
As maintaining indexes that are poorly or not used is wasteful, it's worth ensuring their use is optimised and only the minimum number of indexes are added.
How does an exact match query perform?
As expected, the exact match uses an index and is almost instant.
# Django model query
data = ExampleUUID.objects.get(uuid_ref=uuid_ref)
example_uuid=# EXPLAIN ANALYSE SELECT * FROM example_uuid WHERE uuid_ref = 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using example_uuid_uuid_ref_a11b7e48 on example_uuid (cost=0.43..8.45 rows=1 width=81) (actual time=0.022..0.024 rows=1 loops=1)
Index Cond: (uuid_ref = 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11'::uuid)
Planning Time: 0.107 ms
Execution Time: 0.046 ms
(4 rows)
By way of comparison if we performed a text match without an index it would have taken significantly longer. It's worth noting that PostgreSQL optimised the query and used a Parallel Sequential Scan.
example_uuid=# EXPLAIN ANALYSE SELECT * FROM example_uuid WHERE uuid_ref::text = 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..67228.35 rows=14999 width=81) (actual time=0.244..266.180 rows=1 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on example_uuid (cost=0.00..64728.45 rows=6250 width=81) (actual time=166.425..251.602 rows=0 loops=3)
Filter: ((uuid_ref)::text = 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11'::text)
Rows Removed by Filter: 999939
Planning Time: 0.095 ms
Execution Time: 266.203 ms
(8 rows)
How does a multiple record query perform?
Our search is for a uuid starting with a0ed and the three SQL queries are:
SELECT uuid_ref FROM example_uuid
WHERE uuid_ref::text LIKE 'a0ed%';
SELECT uuid_ref FROM example_uuid
WHERE SUBSTR(CAST(uuid_ref AS text), 1, 8) LIKE 'a0ed%';
SELECT uuid_ref FROM example_uuid
WHERE uuid_ref
BETWEEN 'a0ed0000-0000-0000-000000000000'
AND 'a0edFFFF-FFFF-FFFF-FFFFFFFFFFFF';
All three queries returned 41 rows from our test data.
Text query
As expected our text match does a Paralled Sequential Scan
example_uuid=# EXPLAIN ANALYSE SELECT uuid_ref FROM example_uuid
WHERE uuid_ref::text LIKE 'a0ed%';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..67228.35 rows=14999 width=81) (actual time=16.102..270.789 rows=41 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on example_uuid (cost=0.00..64728.45 rows=6250 width=81) (actual time=21.058..256.119 rows=14 loops=3)
Filter: ((uuid_ref)::text ~~ 'a0ed%'::text)
Rows Removed by Filter: 999926
Planning Time: 0.094 ms
Execution Time: 270.818 ms
(8 rows)
Substring text query
example_uuid=# EXPLAIN ANALYSE SELECT uuid_ref FROM example_uuid
WHERE SUBSTR(uuid_ref::text, 1, 8) LIKE 'a0ed%';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..70353.13 rows=14999 width=81) (actual time=19.336..392.100 rows=41 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on example_uuid (cost=0.00..67853.23 rows=6250 width=81) (actual time=27.615..377.215 rows=14 loops=3)
Filter: (substr((uuid_ref)::text, 1, 8) ~~ 'a0ed%'::text)
Rows Removed by Filter: 999926
Planning Time: 0.092 ms
Execution Time: 392.128 ms
(8 rows)
B-tree Indexes are used for the following indexes <, <=, =, >= and >, as can be seen LIKE does not use an index.
Between UUIDs
As expected the Between query used the index on uuid_ref and performed best with between 0.070 and 0.047 ms.
# Django custom lookup
data = ExampleUUID.objects.filter(uuid_ref__uuid_startswith=search).order_by('uuid_ref')
example_uuid=# EXPLAIN ANALYSE
SELECT uuid_ref FROM example_uuid
WHERE uuid_ref
BETWEEN 'a0ed0000-0000-0000-0000-000000000000'
AND 'a0edFFFF-FFFF-FFFF-FFFF-FFFFFFFFFFFF';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using example_uuid_uuid_ref_a11b7e48 on example_uuid (cost=0.43..8.45 rows=1 width=81) (actual time=0.009..0.039 rows=41 loops=1)
Index Cond: ((uuid_ref >= 'a0ed0000-0000-0000-0000-000000000000'::uuid) AND (uuid_ref <= 'a0edffff-ffff-ffff-ffff-ffffffffffff'::uuid))
Planning Time: 0.039 ms
Execution Time: 0.047 ms
(4 rows)
Text Index on full uuid
SELECT can use a text index created as follows:
example_uuid=# CREATE INDEX example_uuid_cast_uuid_ref ON example_uuid(CAST(uuid_ref AS text));
Where a select will return the correct 41 rows from our data.
example_uuid=# EXPLAIN ANALYSE
SELECT uuid_ref FROM example_uuid
WHERE uuid_ref::text
>= 'a0ed0000-0000-0000-0000-000000000000'
AND uuid_ref::text <= 'a0edFFFF-FFFF-FFFF-FFFF-FFFFFFFFFFFF';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on example_uuid (cost=590.29..31126.55 rows=14999 width=16) (actual time=0.161..0.252 rows=41 loops=1)
Recheck Cond: (((uuid_ref)::text >= 'a0ed0000-0000-0000-0000-000000000000'::text) AND ((uuid_ref)::text <= 'a0edFFFF-FFFF-FFFF-FFFF-FFFFFFFFFFFF'::text))
Heap Blocks: exact=41
-> Bitmap Index Scan on example_uuid_cast_uuid_ref (cost=0.00..586.54 rows=14999 width=0) (actual time=0.138..0.140 rows=41 loops=1)
Index Cond: (((uuid_ref)::text >= 'a0ed0000-0000-0000-0000-000000000000'::text) AND ((uuid_ref)::text <= 'a0edFFFF-FFFF-FFFF-FFFF-FFFFFFFFFFFF'::text))
Planning Time: 0.165 ms
Execution Time: 0.306 ms
(7 rows)
Creating this custom lookup is similar to the previous one and is left as an exercise for the reader.
Summary
PostgreSQL and Django work well together with good SQL management.
There are edge cases where custom lookups offer significant advantages by reducing a table scan of about 0.400 ms to an index scan of less than 0.050 ms. Django is well suited to creating custom lookups making it easier to do the right thing, thus significantly improving performance and reducing resources needed to deliver results.
Regular maintenance of UUID indexes is recommended due to their random nature.
Performance of MySQL and PostgreSQL is similar.