Kfir

Performant Mass "Update or Create" Strategy for Data Imports

Performing a mass "Update or Create" is not a hard task. Doing it efficiently, however, requires some understanding of how things work "under the hood". Let's explore this area for a little bit and offer a potentially performant solution for that task.

Preface

A while ago I had a request from a client. He wanted to be able to export those huge Excel files from his local CRM and upload them to his new web platform. By "upload" I actually mean import them into the system. One of the rules was that if the uploaded Excel file contained a record that already exists in the database, the database record should be synced with the Excel record while ignoring certain columns.

Doesn't sound very hard, and it's actually relatively easy to do. We can simply loop through the Excel file and figure out what to do with every single record in it: if it does not exist, create it; if it does, update it while ignoring certain fields.

A naive implementation may look something like:

foreach ($excelRecords as $record) {
    // One query to check for an existing record, then a second one to update or insert it.
    if (DB::table('customers')->where('customer_number', $record['customer_number'])->exists()) {
        $this->updateRecord($record);

        continue;
    }

    $this->insertRecord($record);
}

While the above implementation will work as expected, it is painfully slow. In fact, running it on an Excel file with 5k rows took over 15 seconds to complete on my local machine, which is likely faster than a typical production server, so the real-world numbers would be even worse. And don't forget the constraints the database has to check every time we insert or update a record.

The Bottleneck 🍾

Unsatisfied with the result, I started thinking about what I could do to improve it.

For each record in the Excel file, we needed to perform 2 queries: one query to check whether the record already exists and another query to insert or update it.

According to the MySQL documentation, the time required for inserting a row is determined by the following factors, where the numbers indicate approximate proportions:

  • Connecting: (3)
  • Sending query to server: (2)
  • Parsing query: (2)
  • Inserting row: (1 × size of row)
  • Inserting indexes: (1 × number of indexes)
  • Closing: (1)

This does not take into consideration the initial overhead to open tables, which is done once for each concurrently running query.

Given we have 5k rows to insert, we will perform 10k queries against the server, where half of them are select queries and the rest are insert or update queries. Doing some simple math, we can deduce that we are going to spend 10,000 * (3 + 2 + 2 + 1) = 80,000 proportional units of time just on connecting, sending the query, parsing it and closing the connection. Outrageous, I would say, especially when we compare that to the time we are actually going to spend selecting, inserting or updating the rows: roughly 5,000 * (select time) + 5,000 * (size of row) + 5,000 * (number of indexes).

Well, I think we just found our bottleneck.

Power Booster ⚡️

Knowing where the bottleneck is, I started thinking about how I could reduce that overhead of connecting, sending the query, opening tables, and so on. The answer is actually quite obvious: perform only a single query. But how?

Admittedly, there are many ways to achieve that. Some are complex and some are quite easy. I had my eyes on 2 possible solutions:

  1. REPLACE
  2. ON DUPLICATE KEY UPDATE

At first glance, they both seem to achieve the exact same end-result: insert new records while "updating" existing records. The main difference is that the "REPLACE" approach will delete the existing database record before inserting the new one, while the "ON DUPLICATE KEY UPDATE" approach will update the existing rows.

If our database schema has any foreign key constraint that uses ON DELETE CASCADE, the "REPLACE" approach will end up deleting data from other tables, since it performs a delete operation on the existing record! More often than not, this is not the behavior we intend when we talk about bulk imports.

On the other hand, the second approach, "ON DUPLICATE KEY UPDATE", performs an update query behind the scenes and won't trigger any cascading deletes. Sounds great to me.
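
To make the difference concrete, here is a minimal sketch, assuming a made-up orders table with a unique customer_number and an invoices table that references it with ON DELETE CASCADE:

-- Hypothetical schema, for illustration only.
create table `orders` (
    `id` int auto_increment primary key,
    `customer_number` varchar(32) unique,
    `status` varchar(32)
);

create table `invoices` (
    `id` int auto_increment primary key,
    `order_id` int,
    foreign key (`order_id`) references `orders` (`id`) on delete cascade
);

-- REPLACE deletes the conflicting `orders` row first, so its `invoices`
-- rows are cascaded away before the "new" row gets inserted:
replace into `orders` (`customer_number`, `status`) values ('C-1001', 'shipped');

-- ON DUPLICATE KEY UPDATE updates the existing row in place,
-- so the related `invoices` rows are left untouched:
insert into `orders` (`customer_number`, `status`) values ('C-1001', 'shipped')
on duplicate key update `status`=VALUES(`status`);

With REPLACE, the conflicting row is removed and re-inserted under a new auto-increment id and the cascade wipes the related invoices; with ON DUPLICATE KEY UPDATE, the row and its id stay put.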

Generating the Query

Generating an ON DUPLICATE KEY UPDATE query is not really complicated, but doing it for many records can get tedious real fast, especially when we need to ignore certain fields in the update operation.

The basic form of an ON DUPLICATE KEY UPDATE query looks like:

insert into `table_name` (`insert_columns`)
values (`record1_fields`),(`record2_fields`),(`recordN_fields`)
on duplicate key update `column1ToUpdate`=VALUES(`column1ToUpdate`), `column2ToUpdate`=VALUES(`column2ToUpdate`), `columnNToUpdate`=VALUES(`columnNToUpdate`)

Note that the VALUES(columnName) expression in the on duplicate key update clause may be changed to any operation you want, such as quantity = quantity + anotherQuantity.
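
For instance, a stock import could accumulate quantities instead of overwriting them. A sketch, assuming a hypothetical products table with a unique sku column:

insert into `products` (`sku`, `quantity`)
values (?,?),(?,?)
on duplicate key update `quantity`=`quantity`+VALUES(`quantity`)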

Since I ❤️ you all, I've written a compact library for PHP that generates a prepared-statement-compatible query string and the correct bindings.

Installation

You can add this library as a per-project dependency using Composer:

composer require kfirba/import-query-generator

Usage

use Kfirba\QueryGenerator;

$table = 'users';
$data = [
    ['name' => 'John', 'email' => 'john@example.com', 'password' => 'hashed_password', 'created_at' => date('Y-m-d'), 'updated_at' => date('Y-m-d')],
    ['name' => 'Jane', 'email' => 'jane@example.com', 'password' => 'hashed_password', 'created_at' => date('Y-m-d'), 'updated_at' => date('Y-m-d')],
    ['name' => 'Susy', 'email' => 'susy@example.com', 'password' => 'hashed_password', 'created_at' => date('Y-m-d'), 'updated_at' => date('Y-m-d')],
];
$excludedColumnsFromUpdate = ['password', 'created_at'];

$queryObject = (new QueryGenerator)->generate($table, $data, $excludedColumnsFromUpdate);

$queryObject->getQuery();
// -> "insert into `users` (`name`,`email`,`password`,`created_at`,`updated_at`) values (?,?,?,?,?),(?,?,?,?,?),(?,?,?,?,?) on duplicate key update `name`=VALUES(`name`),`email`=VALUES(`email`),`updated_at`=VALUES(`updated_at`)"

$queryObject->getBindings();
// -> ['John', 'john@example.com', 'hashed_password', '2018-01-12', '2018-01-12', 'Jane', 'jane@example.com', 'hashed_password', '2018-01-12', '2018-01-12', 'Susy', 'susy@example.com', 'hashed_password', '2018-01-12', '2018-01-12']

As you may have noticed, the generator defaults to column=VALUES(column) since this is usually what we use when we attempt to bulk import some data. Need another behavior? You can submit a PR or just open an issue and we can talk about it 🤓.
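
The generator only builds the query string and its bindings; executing them is up to you. Here is a minimal sketch using PDO, continuing from the snippet above (the DSN and credentials are placeholders):

$pdo = new PDO('mysql:host=127.0.0.1;dbname=app;charset=utf8mb4', 'username', 'secret', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

$queryObject = (new QueryGenerator)->generate($table, $data, $excludedColumnsFromUpdate);

// Prepare the generated statement once and bind all values positionally.
$statement = $pdo->prepare($queryObject->getQuery());
$statement->execute($queryObject->getBindings());

If you are inside a Laravel application, passing the same two values to DB::insert() should work just as well.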