At Cedar we love Django. It has been our framework of choice over the last five years. As the amount of data we have worked with went from the thousands of rows, to millions of rows, to the billions more recently, we have run into some performance bottlenecks–so, we had to come up with a creative solution.
For those who aren’t familiar, Django has a couple of ways to save models (data) to the database. For single models, you can use code like the following:
Model.objects.create(...) # or m = Model(...) m.save()
This works great for a single model or a couple of models but quickly breaks down when saving a lot of models. Each
.create requires a round trip to the database which can slow things down. In addition, in most databases it will cause indexes to be updated for each row adding overhead. Updating multiple rows in a batch is generally better.
Luckily, Django provides a better mechanism for inserting a large number of models into the database: bulk_create.
bulk_create is a significant improvement over saving models one at a time and we used it for a while at Cedar.
Unfortunately, it wasn’t fast enough when we loaded more than 10,000 models. Some loads would take over 10 minutes to execute, which would be acceptable if it was only done every once in a while, but we insert millions of models per day. The time really adds up so we decided to find a better way.
We started to investigate how
bulk_create works in Django. For bulk_create, Django has code specific to each database that translates a
bulk_create operation into the correct SQL for the database type. In Cedar’s case, we use Postgres, so the operation looks something like this:
INSERT INTO model_table (“field_1”, “field_2”) VALUES ( (“field_1_value1”, “field_2_value1”), (“field_1_value2”, “field_2_value2”) );
Basically, each model and its values are added after
VALUES. It seems simple enough and nothing on the surface seems wrong but we decided to look into a faster way to insert data.
Luckily, this isn’t the first time someone had to insert a lot of rows into Postgres. There are many articles written on the topic and many suggest using the Postgres COPY command instead of
INSERT. The metrics suggest that
COPY is up to six times faster than
INSERT. Unfortunately, Django doesn’t support the
COPY command out of the box. We needed to write some custom code.
If we were going to invest time into custom code, we first needed to validate the performance speedup. We did some testing to make sure the
COPY command really provided the speedup we were looking for. In our tests we replicated the 6x speed up when loading 50,000 or more rows. With some testing complete, we decided to create a new library using the
COPY command. We wanted the library to support updates and upserts to models as well, since we often update data in addition to insert.
We created the library
django-bulk-load. It uses the Postgres
COPY command and temp tables to handle inserts, updates and upserts. Here are some benchmarks of
django-bulk-load in comparison to
bulk_update in Django:
bulk_create count: 1,000 Bulk_create (Django): 0.048630714416503906 bulk_insert_models: 0.03132152557373047 count: 10,000 Bulk_create (Django): 0.45952868461608887 bulk_insert_models: 0.1908433437347412 count: 100,000 Bulk_create (Django): 4.875206708908081 bulk_insert_models: 1.764514684677124 count: 1,000,000 Bulk_create (Django): 59.16990399360657 bulk_insert_models: 18.651455640792847 bulk_update count: 1,000 bulk_update (Django): 0.45329761505126953 bulk_update (django-bulk-update): 0.1036691665649414 bulk_update_models: 0.04524850845336914 count: 10,000 bulk_update (Django): 6.0840747356414795 bulk_update (django-bulk-update): 2.433042049407959 bulk_update_models: 0.10899758338928223 count: 100,000 bulk_update (Django): 647.6648473739624 bulk_update (django-bulk-update): 619.0643970966339 bulk_update_models 0.9625072479248047 count: 1,000,000 bulk_update (Django): Does not complete bulk_update (django-bulk-update): Does not complete bulk_update_models: 14.923949003219604
We’ve seen the biggest performance improvement of
UPDATEs. There are some significant performance issues with Django's
bulk_update. See more details in this thread.
We’ve been using
django-bulk-load in production at Cedar for two years now and it’s been an amazing tool. Today we are open sourcing the code for everyone, which will be Cedar’s first of many open source libraries. We plan to release many more over the next few months–stay tuned for more!