8 Common SQL Slow Query Statements and How to Optimize Them

Learn how to optimize SQL queries for better performance

 

Thanks to Moore’s Law, computer performance has greatly improved, along with advancements in databases and various anti-pattern designs advocated by microservices. As a result, we now have fewer opportunities to write complex SQL queries. The industry (yes, even Google) has started advocating against specialized SQL optimization, as the resources saved do not outweigh the cost of employee salaries. However, as engineers, we should strive for technical excellence to become rocket scientists in our field.

This article is first published in the medium MPP plan. If you are a medium user, please follow me in medium. Thank you very much.

In this article, I will introduce eight common SQL slow query statements and explain how to optimize their performance. I hope this will be helpful to you.

LIMIT Statement

Pagination is one of the most commonly used scenarios, but it is also prone to problems. For example, for the simple statement below, a typical solution suggested by DBAs is to add a composite index on the type, name, and create_time fields. This way, the conditions and sorting can effectively utilize the index, resulting in a significant performance improvement.

1
2
3
4
5
6
SELECT *
FROM   operation
WHERE  type = 'SQLStats'
       AND name = 'SlowLog'
ORDER  BY create_time
LIMIT  1000, 10;

Okay, this might solve the problem for over 90% of DBAs. However, when the LIMIT clause becomes “LIMIT 1000000, 10”, programmers still complain, “Why is it slow when I’m only fetching 10 records?” You see, the database doesn’t know where the 1,000,000th record starts, so even with an index, it still needs to calculate from the beginning. In most cases, this performance issue is caused by lazy programming.

In scenarios such as frontend data browsing or exporting large data in batches, you can use the maximum value of the previous page as a parameter for querying. The SQL can be redesigned as follows:

1
2
3
4
5
6
7
SELECT   *
FROM     operation
WHERE    type = 'SQLStats'
AND      name = 'SlowLog'
AND      create_time > '2017-03-16 14:00:00'
ORDER BY create_time
LIMIT    10;

With this new design, the query time remains constant and does not change with the increasing data volume.

Implicit Conversion

Another common mistake in SQL statements is when the types of query variables and field definitions do not match. Take the following statement as an example:

1
2
3
4
5
6
mysql> explain extended SELECT *
     > FROM   my_balance b
     > WHERE  b.bpn = 14000000123
     >       AND b.isverified IS NULL ;
mysql> show warnings;
| Warning | 1739 | Cannot use ref access on index 'bpn' due to type or collation conversion on field 'bpn'

In this case, the field bpn is defined as varchar(20), and MySQL’s strategy is to convert the string to a number before comparing. This causes the function to be applied to the table field, rendering the index ineffective.

Such cases may be caused by parameters automatically filled in by the application framework, rather than the programmer’s intention. Nowadays, application frameworks are often complex, and while they provide convenience, they can also create pitfalls.

Join Updates and Deletions

Although MySQL 5.6 introduced materialization, it only optimizes SELECT statements. For UPDATE or DELETE statements, you need to manually rewrite them using JOIN.

For example, consider the following UPDATE statement. MySQL actually performs a loop/nested subquery (DEPENDENT SUBQUERY), and you can imagine the execution time.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
UPDATE operation o
SET    status = 'applying'
WHERE  o.id IN (SELECT id
                FROM   (SELECT o.id,
                               o.status
                        FROM   operation o
                        WHERE  o.group = 123
                               AND o.status NOT IN ( 'done' )
                        ORDER  BY o.parent,
                                  o.id
                        LIMIT  1) t);

The execution plan is as follows:

1
2
3
4
5
6
7
+----+--------------------+-------+-------+---------------+---------+---------+-------+------+-----------------------------------------------------+
| id | select_type        | table | type  | possible_keys | key     | key_len | ref   | rows | Extra                                               |
+----+--------------------+-------+-------+---------------+---------+---------+-------+------+-----------------------------------------------------+
| 1  | PRIMARY            | o     | index |               | PRIMARY | 8       |       | 24   | Using where; Using temporary                        |
| 2  | DEPENDENT SUBQUERY |       |       |               |         |         |       |      | Impossible WHERE noticed after reading const tables |
| 3  | DERIVED            | o     | ref   | idx_2,idx_5   | idx_5   | 8       | const | 1    | Using where; Using filesort                         |
+----+--------------------+-------+-------+---------------+---------+---------+-------+------+-----------------------------------------------------+

After rewriting it as a JOIN, the subquery’s select type changes from DEPENDENT SUBQUERY to DERIVED, significantly speeding up the execution time from 7 seconds to 2 milliseconds.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
UPDATE operation o
       JOIN  (SELECT o.id,
                            o.status
                     FROM   operation o
                     WHERE  o.group = 123
                            AND o.status NOT IN ( 'done' )
                     ORDER  BY o.parent,
                               o.id
                     LIMIT  1) t
         ON o.id = t.id
SET    status = 'applying';

The simplified execution plan is as follows:

1
2
3
4
5
6
+----+-------------+-------+------+---------------+-------+---------+-------+------+-----------------------------------------------------+
| id | select_type | table | type | possible_keys | key   | key_len | ref   | rows | Extra                                               |
+----+-------------+-------+------+---------------+-------+---------+-------+------+-----------------------------------------------------+
| 1  | PRIMARY     |       |      |               |       |         |       |      | Impossible WHERE noticed after reading const tables |
| 2  | DERIVED     | o     | ref  | idx_2,idx_5   | idx_5 | 8       | const | 1    | Using where; Using filesort                         |
+----+-------------+-------+------+---------------+-------+---------+-------+------+-----------------------------------------------------+

Mixed Sorting

MySQL cannot utilize indexes for mixed sorting. However, in certain scenarios, there are still opportunities to improve performance using special methods.

1
2
3
4
5
6
SELECT *
FROM   my_order o
       INNER JOIN my_appraise a ON a.orderid = o.id
ORDER  BY a.is_reply ASC,
          a.appraise_time DESC
LIMIT  0, 20;

The execution plan shows a full table scan:

1
2
3
4
5
6
+----+-------------+-------+--------+-------------+---------+---------+---------------+---------+-+
| id | select_type | table | type   | possible_keys     | key     | key_len | ref      | rows    | Extra
+----+-------------+-------+--------+-------------+---------+---------+---------------+---------+-+
|  1 | SIMPLE      | a     | ALL    | idx_orderid | NULL    | NULL    | NULL    | 1967647 | Using filesort |
|  1 | SIMPLE      | o     | eq_ref | PRIMARY     | PRIMARY | 122     | a.orderid |       1 | NULL           |
+----+-------------+-------+--------+---------+---------+---------+-----------------+---------+-+

Since is_reply only has two states, 0 and 1, we can rewrite it as follows, reducing the execution time from 1.58 seconds to 2 milliseconds:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
SELECT *
FROM   ((SELECT *
         FROM   my_order o
                INNER JOIN my_appraise a
                        ON a.orderid = o.id
                           AND is_reply = 0
         ORDER  BY appraise_time DESC
         LIMIT  0, 20)
        UNION ALL
        (SELECT *
         FROM   my_order o
                INNER JOIN my_appraise a
                        ON a.orderid = o.id
                           AND is_reply = 1
         ORDER  BY appraise_time DESC
         LIMIT  0, 20)) t
ORDER  BY  is_reply ASC,
          appraisetime DESC
LIMIT  20;

EXISTS Statement

When dealing with EXISTS clauses, MySQL still uses nested subqueries for execution. Take the following SQL statement as an example:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
SELECT *
FROM   my_neighbor n
       LEFT JOIN my_neighbor_apply sra
              ON n.id = sra.neighbor_id
                 AND sra.user_id = 'xxx'
WHERE  n.topic_status < 4
       AND EXISTS(SELECT 1
                  FROM   message_info m
                  WHERE  n.id = m.neighbor_id
                         AND m.inuser = 'xxx')
       AND n.topic_type <> 5;
1
2
3
4
5
6
7
+----+--------------------+-------+------+-----+------------------------------------------+---------+-------+---------+ -----+
| id | select_type        | table | type | possible_keys | key     | key_len | ref      | rows    | Extra
+----+--------------------+-------+------+ -----+------------------------------------------+---------+-------+---------+ -----+
|  1 | PRIMARY            | n     | ALL  |  | NULL     | NULL    | NULL    | 1086041 | Using where                   |
|  1 | PRIMARY            | sra   | ref  |  | idx_user_id | 123     | const |       1 | Using where          |
|  2 | DEPENDENT SUBQUERY | m     | ref  |  | idx_message_info   | 122     | const |       1 | Using index condition; Using where |
+----+--------------------+-------+------+ -----+------------------------------------------+---------+-------+---------+ -----+

By removing the EXISTS clause and changing it to a JOIN, we can avoid nested subqueries and reduce the execution time from 1.93 seconds to 1 millisecond.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
SELECT *
FROM   my_neighbor n
       INNER JOIN message_info m
               ON n.id = m.neighbor_id
                  AND m.inuser = 'xxx'
       LEFT JOIN my_neighbor_apply sra
              ON n.id = sra.neighbor_id
                 AND sra.user_id = 'xxx'
WHERE  n.topic_status < 4
       AND n.topic_type <> 5;

The new execution plan is as follows:

1
2
3
4
5
6
7
+----+-------------+-------+--------+ -----+------------------------------------------+---------+ -----+------+ -----+
| id | select_type | table | type   | possible_keys | key   | key_len | ref   | rows | Extra |
+----+-------------+-------+--------+ -----+------------------------------------------+---------+ -----+------+ -----+
|  1 | SIMPLE      | m     | ref    | | idx_message_info   | 122     | const |    1 | Using index condition |
|  1 | SIMPLE      | n     | eq_ref | | PRIMARY   | 122     | ighbor_id |    1 | Using where      |
|  1 | SIMPLE      | sra   | ref    | | idx_user_id | 123     | const |    1 | Using where           |
+----+-------------+-------+--------+ -----+------------------------------------------+---------+ -----+------+ -----+

Condition Pushdown

There are cases where external query conditions cannot be pushed down to complex views or subqueries:

  1. Aggregated subqueries
  2. Subqueries with LIMIT
  3. UNION or UNION ALL subqueries
  4. Subqueries in output fields

Consider the following statement, where the condition affects the aggregated subquery:

1
2
3
4
5
6
SELECT *
FROM   (SELECT target,
               Count(*)
        FROM   operation
        GROUP  BY target) t
WHERE  target = 'rm-xxxx';
1
2
3
4
5
6
7
+----+-------------+------------+-------+---------------+-------------+---------+-------+------+-------------+
| id | select_type | table      | type  | possible_keys | key         | key_len | ref   | rows | Extra       |
+----+-------------+------------+-------+---------------+-------------+---------+-------+------+-------------+
|  1 | PRIMARY     | n          | ALL   | NULL          | NULL        | NULL    | NULL  | 1086041 | Using where |
|  1 | PRIMARY     | sra        | ref   | NULL          | idx_user_id | 123     | const |    1 | Using where |
|  2 | DEPENDENT SUBQUERY | m | ref   | NULL          | idx_message_info   | 122     | const |    1 | Using index condition; Using where |
+----+-------------+------------+-------+---------------+-------------+---------+-------+------+-------------+

By removing the EXISTS clause and changing it to a JOIN, we can avoid nested subqueries and reduce the execution time from 1.93 seconds to 1 millisecond.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
SELECT *
FROM   my_neighbor n
       INNER JOIN message_info m
               ON n.id = m.neighbor_id
                  AND m.inuser = 'xxx'
       LEFT JOIN my_neighbor_apply sra
              ON n.id = sra.neighbor_id
                 AND sra.user_id = 'xxx'
WHERE  n.topic_status < 4
       AND n.topic_type <> 5;

The new execution plan is as follows:

1
2
3
4
5
6
7
+----+-------------+-------+--------+ -----+------------------------------------------+---------+ -----+------+ -----+
| id | select_type | table | type   | possible_keys | key   | key_len | ref   | rows | Extra |
+----+-------------+-------+--------+ -----+------------------------------------------+---------+ -----+------+ -----+
|  1 | SIMPLE      | m     | ref    | | idx_message_info   | 122     | const |    1 | Using index condition |
|  1 | SIMPLE      | n     | eq_ref | | PRIMARY   | 122     | ighbor_id |    1 | Using where      |
|  1 | SIMPLE      | sra   | ref    | | idx_user_id | 123     | const |    1 | Using where           |
+----+-------------+-------+--------+ -----+------------------------------------------+---------+ -----+------+ -----+

Narrowing the Scope in Advance

Let’s take a look at the following partially optimized example (main table in the left join acts as a primary query condition):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
SELECT    a.*,
          c.allocated
FROM      (
              SELECT   resourceid
              FROM     my_distribute d
                   WHERE    isdelete = 0
                   AND      cusmanagercode = '1234567'
                   ORDER BY salecode limit 20) a
LEFT JOIN
          (
              SELECT   resourcesid, sum(ifnull(allocation, 0) * 12345) allocated
              FROM     my_resources
                   GROUP BY resourcesid) c
ON        a.resourceid = c.resourcesid;

Does this statement still have other issues? It is clear that subquery c is an aggregate query on the entire table, which can cause performance degradation when dealing with a large number of tables.

In fact, for subquery c, the left join result set only cares about the data that can be matched with the primary table’s resourceid. Therefore, we can rewrite the statement as follows, reducing the execution time from 2 seconds to 2 milliseconds:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
SELECT    a.*,
          c.allocated
FROM      (
                   SELECT   resourceid
                   FROM     my_distribute d
                   WHERE    isdelete = 0
                   AND      cusmanagercode = '1234567'
                   ORDER BY salecode limit 20) a
LEFT JOIN
          (
                   SELECT   resourcesid, sum(ifnull(allocation, 0) * 12345) allocated
                   FROM     my_resources r,
                            (
                                     SELECT   resourceid
                                     FROM     my_distribute d
                                     WHERE    isdelete = 0
                                     AND      cusmanagercode = '1234567'
                                     ORDER BY salecode limit 20) a
                   WHERE    r.resourcesid = a.resourcesid
                   GROUP BY resourcesid) c
ON        a.resourceid = c.resourcesid;

However, the subquery a appears multiple times in our SQL statement. This approach not only incurs additional costs but also makes the statement more complex. We can simplify it using the WITH statement:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
WITH a AS
(
         SELECT   resourceid
         FROM     my_distribute d
         WHERE    isdelete = 0
         AND      cusmanagercode = '1234567'
         ORDER BY salecode limit 20)
SELECT    a.*,
          c.allocated
FROM      a
LEFT JOIN
          (
                   SELECT   resourcesid, sum(ifnull(allocation, 0) * 12345) allocated
                   FROM     my_resources r,
                            a
                   WHERE    r.resourcesid = a.resourcesid
                   GROUP BY resourcesid) c
ON        a.resourceid = c.resourcesid;

Conclusion

The database compiler generates execution plans that determine how SQL statements are actually executed. However, compilers can only do their best to serve, and no database compiler is perfect. The scenarios mentioned above also exist in other databases. Understanding the characteristics of the database compiler allows us to work around its limitations and write high-performance SQL statements.

When designing data models and writing SQL statements, it is important to bring algorithmic thinking or awareness into the process. Developing the habit of using the WITH statement when writing complex SQL statements can simplify them and reduce the burden on the database.

Finally, here is the execution order of SQL statements:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
FROM
<left_table>

ON
<join_condition>

<join_type>
 JOIN
<right_table>

WHERE
<where_condition>

GROUP BY
<group_by_list>

HAVING
<having_condition>

SELECT

DISTINCT
<select_list>

ORDER BY
<order_by_condition>

LIMIT
<limit_number>
Licensed under CC BY-NC-SA 4.0
Built with Hugo
Theme Stack designed by Jimmy