SQL - Querying Against Dates Taking into Account Timezones
This is part of a series of posts on Dates and Timezones
- SQL - Querying Against Dates Taking into Account Timezones
- Querying Against Dates and Timezones with Eloquent
Retrieving records from a database table that were created between 2 dates is a relatively easy task. Retrieving records between 2 dates taking into account the user’s timezone, well, that might not be as straightforward as you may think - performance wise.
The Naive Approach
Let's say we have an events
table that stores events that took place for each user in our system such as: login attempts, risky logins, profile changes, etc. with the following structure and data (note that the created_at
column is in UTC):
id | user_id | name | created_at |
---|---|---|---|
1 | 1 | name_changed | 2017-12-28 23:01:02 |
2 | 1 | risky_login_attempt | 2017-12-29 03:30:42 |
3 | 1 | password_changed | 2017-12-29 16:04:33 |
4 | 1 | login_failed | 2017-12-30 17:06:15 |
5 | 1 | login_successful | 2017-12-31 17:06:49 |
Since we know we are going to perform many queries against the events
table that filter the results based on the created_at
column, we've indexed that column for performance gains.
When the site administrator wants to see every event that took place for user_id
1 in our system between 2 dates, the query that we need will be something like:
select `name`
from `events`
where `user_id` = 1
and `created_at` between '2017-12-28 00:00:00' and '2017-12-29 23:59:59';
The query above will return every event that took place on user_id
1 account between December 28, 2017, and December 29, 2017:
name |
---|
name_changed |
risky_login_attempt |
password_changed |
Easy, right? It's also super fast since the created_at
column is indexed. However, what if we need to fetch all events that took place between 2 dates taking into account the user's timezone? Doesn't sound very hard right? We can use SQL's convert_tz()
method or date_add()
method to calculate the offset. If our user's timezone is +10:00
, we simply need to add 10 hours to our created_at
values to take that into account! Easy enough:
select `name`
from `events`
where `user_id` = 1
and convert_tz(`created_at`, "+00:00", (select `timezone` from `users` where `id` = 1 limit 1))
between '2017-12-28 00:00:00' and '2017-12-29 23:59:59';
The query above will convert the database's created_at
time from UTC
timezone to +10:00
timezone and then compare it against our 2 dates. The results of this query are:
name |
---|
name_changed |
risky_login_attempt |
What's going on here? Previously we had 3 results, now we have only 2 results! We are missing the password_changed
event! Let's briefly explain why is that.
The password_changed
event took place on December 29, 2017 at 16:04:33 UTC. If our user's timezone is +10:00
, then for him that event took place on December 30, 2017, at 02:04:33. For that user, the event didn't happen on December 29 as our database sees it but on December 30 which is why we don't get that result back when we look for every event that happened between December 28 and December 29.
After we understand what's going on here I guess we can conclude that this is not really hard to take into account the user's timezone, right? Well, it's definitely not hard to just get the results, but what about performance? Let me spoil it for you, the query above will be slow. Very slow.
Off to The Races 🏎!
If we compare the performance for the 2 queries above on a large dataset, the processing time of the first query without taking into account the user's timezone is much faster than the other one. In fact, on a medium-sized dataset, the first query took a few milliseconds to complete while the second query took almost 12 seconds!
You may wonder to yourself: "how is that even possible? Our created_at
column is indexed!". While it's true that your created_at
column is indexed there is one thing caveat. Let's take a look at our first query:
select `name`
from `events`
where `user_id` = 1
and `created_at` between '2017-12-28 00:00:00' and '2017-12-29 23:59:59';
Note how the created_at
column is just there, "exposed". It's not "hidden" in any method call. By doing that, the database can use the index for that column, resulting in a very fast and efficient query.
However, let's see what happens in our second query:
select `name`
from `events`
where `user_id` = 1
and convert_tz(`created_at`, "+00:00", (select `timezone` from `users` where `id` = 1 limit 1))
between '2017-12-28 00:00:00' and '2017-12-29 23:59:59';
This time, the created_at
column is "hidden" inside the convert_tz()
method. By doing that, the database will have to recalculate the values for the created_at
column and will not be able to use the index, resulting in a full-table scan. For small tables, you may not even feel the difference but as your table grows, you are bound to face a very slow query performance.
Considering that, we can deduce that we shouldn't "hide" indexed columns inside methods (oh and yea, even doing something like created_at + interval 10 hour
is considered "hiding" the created_at
column inside a method).
Winning the Race 🥇
Okay, so what can we do to solve that? Let's re-iterate a bit. In order to take the user's timezone into account, we "converted" the created_at
column's values to match the user's timezone by adding 10 hours - normalizing the database to match the user. By doing so, as we learned before, we get a performance penalty. Why don't we try "normalizing" the user's input to match the database?
Instead of adding 10 hours to the database, let's subtract 10 hours from the user's input - normalizing the input to match the database!
Let's take a look at how our query will need to change:
select `name`
from `events`
where `user_id` = 1
and `created_at`
between convert_tz('2017-12-28 00:00:00', "+10:00", "+00:00") and convert_tz('2017-12-29 23:59:59', "+10:00", "+00:00");
Look how the created_at
column is "exposed" and not "hidden" in any method. Exactly what we wanted! The results of that query are:
name |
---|
name_changed |
risky_login_attempt |
Exactly the same results we got with the much slower query but now we are able to use the database index for the created_at
column, resulting in a fast and efficient query!
Don't we all love more performant apps 🤓?