Fast Django model inserts with Postgres

At Cedar we love Django. It has been our framework of choice for the last five years. As the data we work with grew from thousands of rows to millions, and more recently to billions, we ran into performance bottlenecks, so we had to come up with a creative solution.

For those who aren’t familiar, Django has a couple of ways to save models (data) to the database. For single models, you can use code like the following:

Model.objects.create(...)
# or
m = Model(...)
m.save()

This works great for one model, or a handful, but quickly breaks down when saving many models. Each .save()/.create() call requires a round trip to the database, which slows things down. In addition, most databases update indexes for every row inserted, adding overhead. Writing multiple rows in a single batch is generally better.

Luckily, Django provides a better mechanism for inserting a large number of models into the database: bulk_create. bulk_create is a significant improvement over saving models one at a time and we used it for a while at Cedar.
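For example, a save() loop can be replaced with a single bulk_create call. Here is a minimal sketch; the app and its Payment model are hypothetical names used for illustration:

```python
# Sketch: batching inserts with Django's bulk_create.
# The import is deferred because this needs a configured Django project;
# "myapp" and its Payment model are hypothetical.
def insert_payments(amounts):
    from myapp.models import Payment

    # One INSERT per batch_size rows instead of one round trip per row.
    Payment.objects.bulk_create(
        [Payment(amount=a) for a in amounts],
        batch_size=1000,
    )
```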

Unfortunately, it wasn’t fast enough once we were loading more than 10,000 models at a time. Some loads took over 10 minutes to execute, which would be acceptable if they ran only occasionally, but we insert millions of models per day. The time really adds up, so we decided to find a better way.

We started to investigate how bulk_create works. Django has backend-specific code that translates a bulk_create call into the correct SQL for each database type. In Cedar’s case that’s Postgres, so the operation looks something like this:

INSERT INTO model_table ("field_1", "field_2")
VALUES
	('field_1_value1', 'field_2_value1'),
	('field_1_value2', 'field_2_value2');

Basically, each model’s values are appended after VALUES as one row of the statement. It seems simple enough, and nothing on the surface looks wrong, but we decided to look for a faster way to insert data.
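To make the shape of that statement concrete, here is a simplified sketch of how a backend can assemble a multi-row INSERT. Note that Django’s SQL compiler actually works with placeholders and bound parameters, never raw string interpolation; the helper name here is our own:

```python
# Illustrative only: batch rows into a single parameterized
# INSERT ... VALUES statement, one placeholder group per row.
def build_insert(table, columns, rows):
    col_list = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join(
        "(" + ", ".join("%s" for _ in columns) + ")" for _ in rows
    )
    params = [value for row in rows for value in row]
    return f"INSERT INTO {table} ({col_list}) VALUES {placeholders}", params

sql, params = build_insert(
    "model_table",
    ["field_1", "field_2"],
    [("a1", "b1"), ("a2", "b2")],
)
# sql is: INSERT INTO model_table ("field_1", "field_2") VALUES (%s, %s), (%s, %s)
```

A single statement like this means one round trip and one index-update pass for the whole batch, which is exactly why it beats per-row saves.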

Luckily, we aren’t the first to need to insert a large number of rows into Postgres. Many articles have been written on the topic, and many suggest using the Postgres COPY command instead of INSERT; published metrics suggest COPY can be up to six times faster. Unfortunately, Django doesn’t support the COPY command out of the box, so we needed to write some custom code.

If we were going to invest time in custom code, we first needed to validate the performance gain, so we tested whether the COPY command really provided the speedup we were looking for. In our tests we replicated the 6x speedup when loading 50,000 or more rows. With that validated, we decided to build a new library around the COPY command. We wanted the library to support updates and upserts as well, since we often update data in addition to inserting it.
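As a rough sketch of the COPY approach (not django-bulk-load’s actual internals): serialize the rows into an in-memory CSV buffer, then stream it to Postgres with psycopg2’s copy_expert. The table name and connection are assumptions here:

```python
import csv
import io

def rows_to_csv(rows):
    # Serialize rows into the CSV shape that COPY ... WITH CSV expects.
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    buf.seek(0)
    return buf

def copy_rows(conn, table, columns, rows):
    # Requires psycopg2 and a live Postgres connection (assumed, not shown).
    buf = rows_to_csv(rows)
    col_list = ", ".join(f'"{c}"' for c in columns)
    with conn.cursor() as cur:
        cur.copy_expert(f"COPY {table} ({col_list}) FROM STDIN WITH CSV", buf)

buf = rows_to_csv([("a1", "b1"), ("a2", "b2")])
# buf contains "a1,b1\r\na2,b2\r\n" (csv.writer's default line terminator)
```

COPY streams the whole batch through a single command, which is where the speedup over multi-row INSERT comes from.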

We created the library django-bulk-load. It uses the Postgres COPY command and temporary tables to handle inserts, updates, and upserts. Here are some benchmarks of django-bulk-load compared to Django’s bulk_create and bulk_update:
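Based on the function names in the library’s README, usage looks roughly like this; treat it as a sketch, since it needs a configured Django project and a Postgres database to run (the Payment model is a hypothetical example):

```python
# Sketch of django-bulk-load usage. Imports are deferred because the
# library requires a configured Django project; the arguments shown are
# the simple case from the README.
def load_payments(new_payments, changed_payments):
    from django_bulk_load import (
        bulk_insert_models,
        bulk_update_models,
        bulk_upsert_models,
    )

    # COPY-backed insert of new model instances in one operation.
    bulk_insert_models(models=new_payments)
    # Update existing rows, or insert-or-update, each in a single call.
    bulk_update_models(models=changed_payments)
    bulk_upsert_models(models=changed_payments)
```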

bulk_create
count: 1,000
bulk_create (Django): 0.049s
bulk_insert_models:   0.031s

count: 10,000
bulk_create (Django): 0.460s
bulk_insert_models:   0.191s

count: 100,000
bulk_create (Django): 4.875s
bulk_insert_models:   1.765s

count: 1,000,000
bulk_create (Django): 59.170s
bulk_insert_models:   18.651s

bulk_update
count: 1,000
bulk_update (Django):             0.453s
bulk_update (django-bulk-update): 0.104s
bulk_update_models:               0.045s

count: 10,000
bulk_update (Django):             6.084s
bulk_update (django-bulk-update): 2.433s
bulk_update_models:               0.109s

count: 100,000
bulk_update (Django):             647.665s
bulk_update (django-bulk-update): 619.064s
bulk_update_models:               0.963s

count: 1,000,000
bulk_update (Django):             Does not complete
bulk_update (django-bulk-update): Does not complete
bulk_update_models:               14.924s

We’ve seen the biggest performance improvement with UPDATEs; Django’s bulk_update has some significant performance issues. See more details in this thread.

We’ve been using django-bulk-load in production at Cedar for two years now, and it’s been an amazing tool. Today we are open sourcing the code for everyone. It is Cedar’s first open source library, and we plan to release many more over the next few months. Stay tuned!

GitHub - cedar-team/django-bulk-load: Load large batches of Django models into the database
Taylor Hakes

Cedar Chief Architect