Kfir

SQL - Querying Against Dates Taking into Account Timezones

Retrieving records from a database table that were created between 2 dates is a relatively easy task. Retrieving records between 2 dates taking into account the user’s timezone, well, that might not be as straightforward as you may think - performance wise.

The Naive Approach

Let's say we have an events table that stores events that took place for each user in our system such as: login attempts, risky logins, profile changes, etc. with the following structure and data (note that the created_at column is in UTC):

id user_id name created_at
1 1 name_changed 2017-12-28 23:01:02
2 1 risky_login_attempt 2017-12-29 03:30:42
3 1 password_changed 2017-12-29 16:04:33
4 1 login_failed 2017-12-30 17:06:15
5 1 login_successful 2017-12-31 17:06:49

Since we know we are going to perform many queries against the events table that filter the results based on the created_at column, we've indexed that column for performance gains.

When the site administrator wants to see every event that took place for user_id 1 in our system between 2 dates, the query that we need will be something like:

select `name`
from `events`
where `user_id` = 1
and `created_at` between '2017-12-28 00:00:00' and '2017-12-29 23:59:59';

The query above will return every event that took place on user_id 1 account between December 28, 2017, and December 29, 2017:

name
name_changed
risky_login_attempt
password_changed

Easy, right? It's also super fast since the created_at column is indexed. However, what if we need to fetch all events that took place between 2 dates taking into account the user's timezone? Doesn't sound very hard right? We can use SQL's convert_tz() method or date_add() method to calculate the offset. If our user's timezone is +10:00, we simply need to add 10 hours to our created_at values to take that into account! Easy enough:

select `name`
from `events`
where `user_id` = 1
and convert_tz(`created_at`, "+00:00", (select `timezone` from `users` where `id` = 1 limit 1))
between '2017-12-28 00:00:00' and '2017-12-29 23:59:59';

The query above will convert the database's created_at time from UTC timezone to +10:00 timezone and then compare it against our 2 dates. The results of this query are:

name
name_changed
risky_login_attempt

What's going on here? Previously we had 3 results, now we have only 2 results! We are missing the password_changed event! Let's briefly explain why is that.

The password_changed event took place on December 29, 2017 at 16:04:33 UTC. If our user's timezone is +10:00, then for him that event took place on December 30, 2017, at 02:04:33. For that user, the event didn't happen on December 29 as our database sees it but on December 30 which is why we don't get that result back when we look for every event that happened between December 28 and December 29.

After we understand what's going on here I guess we can conclude that this is not really hard to take into account the user's timezone, right? Well, it's definitely not hard to just get the results, but what about performance? Let me spoil it for you, the query above will be slow. Very slow.

Off to The Races 🏎!

If we compare the performance for the 2 queries above on a large dataset, the processing time of the first query without taking into account the user's timezone is much faster than the other one. In fact, on a medium-sized dataset, the first query took a few milliseconds to complete while the second query took almost 12 seconds!

You may wonder to yourself: "how is that even possible? Our created_at column is indexed!". While it's true that your created_at column is indexed there is one thing caveat. Let's take a look at our first query:

select `name`
from `events`
where `user_id` = 1
and `created_at` between '2017-12-28 00:00:00' and '2017-12-29 23:59:59';

Note how the created_at column is just there, "exposed". It's not "hidden" in any method call. By doing that, the database can use the index for that column, resulting in a very fast and efficient query.

However, let's see what happens in our second query:

select `name`
from `events`
where `user_id` = 1
and convert_tz(`created_at`, "+00:00", (select `timezone` from `users` where `id` = 1 limit 1))
between '2017-12-28 00:00:00' and '2017-12-29 23:59:59';

This time, the created_at column is "hidden" inside the convert_tz() method. By doing that, the database will have to recalculate the values for the created_at column and will not be able to use the index, resulting in a full-table scan. For small tables, you may not even feel the difference but as your table grows, you are bound to face a very slow query performance.

Considering that, we can deduce that we shouldn't "hide" indexed columns inside methods (oh and yea, even doing something like created_at + interval 10 hour is considered "hiding" the created_at column inside a method).

Winning the Race 🥇

Okay, so what can we do to solve that? Let's re-iterate a bit. In order to take the user's timezone into account, we "converted" the created_at column's values to match the user's timezone by adding 10 hours - normalizing the database to match the user. By doing so, as we learned before, we get a performance penalty. Why don't we try "normalizing" the user's input to match the database?

Instead of adding 10 hours to the database, let's subtract 10 hours from the user's input - normalizing the input to match the database!

Let's take a look at how our query will need to change:

select `name`
from `events`
where `user_id` = 1
and `created_at`
between convert_tz('2017-12-28 00:00:00', "+10:00", "+00:00") and convert_tz('2017-12-29 23:59:59', "+10:00", "+00:00");

Look how the created_at column is "exposed" and not "hidden" in any method. Exactly what we wanted! The results of that query are:

name
name_changed
risky_login_attempt

Exactly the same results we got with the much slower query but now we are able to use the database index for the created_at column, resulting in a fast and efficient query!

Don't we all love more performant apps 🤓?