Optimizing the telemetry queries on JSON columns

Motivation

In the original article we have discussed the possibility of using MariaDB and its new JSON features to create an app telemetry. The principle seems to be valid, and data are stored.

We have gathered a reasonable amount of data to evaluate our first set of statistic queries. They can be divided into two groups - one with simple queries filtering the events and the second with more complex queries following the dependencies of certain events on each other.

We see no problem with simple queries. The complication has occurred in queries where we perform the left join on the table using the JSON data.

Optimization image

Let us have an example of the left join statistic query.

We store events ShipmentCatalogueRequestAdded and PickupCatalogueRequestAdded for all events of adding the catalog request. We also store events CatalogueRequestPickedUp and CatalogueRequestPaid.

We would like to get all records with any of the RequestAdded events and join them with the goal state represented by one of the CatalogueRequestPickedUp and CatalogueRequestPaid events so that we could see how many people reached it.

For the record, here is the schema of the table from the first article.

CREATE TABLE `snp_event` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(128),
  `time` datetime,
  `gid` varchar(128),
  `data` mediumtext,
  PRIMARY KEY (`id`),
  KEY `INDEX_1` (`name`),
  KEY `INDEX_2` (`time`),
  KEY `INDEX_3` (`gid`)
) ENGINE=InnoDB;

I have tested all queries on the local machine with MariaDB 10.11 installed. The size of the snp_event table is 12904 records.

Original query

-- original left join example
SELECT
  ...
  FROM snp_event s1
    LEFT JOIN snp_event s2
    ON (
      JSON_VALUE(s1.data, '$.mail') = JSON_VALUE(s2.data, '$.mail')
      OR JSON_VALUE(s1.data, '$.phone') = JSON_VALUE(s2.data, '$.phone')
    )
  WHERE s1.name IN ('ShipmentCatalogueRequestAdded','PickupCatalogueRequestAdded')
    AND s1.time BETWEEN '2023-10-01' AND '2024-01-15'
    AND s2.name IN ('CatalogueRequestPickedUp','CatalogueRequestPaid')

Execution time: 50.3s

| id | select_type | table      | type  | possible_keys   | key    | key_len | ref | rows    | Extra                                                  |
|----|-------------|------------|-------|-----------------|--------|---------|-----|---------|--------------------------------------------------------|
| 1  | PRIMARY     | <derived2> | ALL   |                 |        |         |     | 3250781 | Using temporary; Using filesort                        |
| 2  | DERIVED     | s2         | range | INDEX_1         | INDEX_1| 387     |     | 2011    | Using index condition; Using temporary; Using filesort |
| 2  | DERIVED     | s1         | ALL   | INDEX_1,INDEX_2 |        |         |     | 12904   | Using where; Using join buffer (flat, BNL join)        |

The original thought was that there is so much JSON_VALUE function execution that it slows down the entire process. So, the obvious solution is to use derived tables and minimize the number of JSON_VALUE executions as much as possible.

Derived tables query

-- left join example with derived tables
SELECT
  ...
  FROM (
    SELECT 
      ...
      JSON_VALUE(data, '$.phone') AS phone,
      JSON_VALUE(data, '$.mail') AS mail
      FROM snp_event 
      WHERE name IN ('ShipmentCatalogueRequestAdded','PickupCatalogueRequestAdded')
        AND time BETWEEN '2023-10-01' AND '2024-01-15'
  ) s1
  LEFT JOIN (
    SELECT
      ...
      JSON_VALUE(data, '$.phone') AS phone,
      JSON_VALUE(data, '$.mail') AS mail
      FROM snp_event
      WHERE name IN ('CatalogueRequestPickedUp','CatalogueRequestPaid')
  ) s2
  ON ( 
    s1.phone=s2.phone OR s1.mail=s2.mail
  )

Execution time: 22.6s

| id | select_type | table      | type  | possible_keys   | key       | key_len | ref | rows    | Extra                                           |
|----|-------------|------------|-------|-----------------|-----------|---------|-----|---------|-------------------------------------------------|
| 1  | PRIMARY     | <derived2> | ALL   |                 |           |         |     | 3250781 | Using temporary; Using filesort                 |
| 2  | DERIVED     | snp_event  | ALL   | INDEX_1,INDEX_2 |           |         |     | 12904   | Using where; Using temporary; Using filesort    |
| 2  | DERIVED     | snp_event  | range | INDEX_1         | INDEX_1   | 387     |     | 2011    | Using where; Using join buffer (flat, BNL join) |

Execution time decreased by 50% but remained still high. I tested partial queries and quickly identified a condition causing the major delay: ... ON ( s1.phone=s2.phone OR s1.mail=s2.mail ) .... OR in table join often causes delays, so I tested the condition without OR but saw no improvement at all. My guess was insufficient use of indices because phone and mail were extracted in derived tables while no possible index was used. This presents a problem since JSON values cannot be indexed unless you use generated columns.

So my next improvement was to add generated columns mail and phone to the snp_event table and create respective indices. The JSON_VALUE function retrieving the mail and phone data was then modified to fetch the data from the newly added columns.

Derived tables with indices

-- adding generated columns with respective indeces
ALTER TABLE snp_event
  ADD COLUMN mail VARCHAR(255) GENERATED ALWAYS AS (JSON_VALUE(data, '$.mail')) STORED,
  ADD COLUMN phone VARCHAR(255) GENERATED ALWAYS AS (JSON_VALUE(data, '$.phone')) STORED,
  ADD INDEX INDEX_MAIL (mail),
  ADD INDEX INDEX_PHONE (phone);

-- modified left join example using the newly added generated columns with indexes
SELECT
  ...
  FROM (
    SELECT 
      ...
      phone,
      mail
      FROM snp_event 
      WHERE name IN ('ShipmentCatalogueRequestAdded','PickupCatalogueRequestAdded')
        AND time BETWEEN '2023-10-01' AND '2024-01-15'
  ) s1
  LEFT JOIN (
    SELECT
      ...
      phone,
      mail
      FROM snp_event
      WHERE name IN ('CatalogueRequestPickedUp','CatalogueRequestPaid')
  ) s2
  ON ( 
    s1.phone=s2.phone OR s1.mail=s2.mail
  )

Execution time: 0.8s

| id | select_type | table      | type  | possible_keys                  | key       | key_len | ref | rows    | Extra                                           |
|----|-------------|------------|-------|--------------------------------|-----------|---------|-----|---------|-------------------------------------------------|
| 1  | PRIMARY     | <derived2> | ALL   |                                |           |         |     | 3250781 | Using temporary; Using filesort                 |
| 2  | DERIVED     | snp_event  | ALL   | INDEX_1,INDEX_2                |           |         |     | 12904   | Using where; Using temporary; Using filesort    |
| 2  | DERIVED     | snp_event  | range | INDEX_1,INDEX_PHONE,INDEX_MAIL | INDEX_1   | 387     |     | 2011    | Using where; Using join buffer (flat, BNL join) |

Execution time under a second. This tiny difference in the possible_keys column INDEX_1,INDEX_PHONE,INDEX_MAIL vs INDEX_1 is responsible for almost 30 times faster execution. Somehow, the MariaDB engine can use indices even in derived tables by keeping the reference to the original table and using the indices. Since the delay wasn’t caused by multiple JSON_VALUE function execution, I projected the modifications to the original query.

Original query with indices

-- modified left join example using the newly added generated columns with indexes
SELECT
  ...
  FROM snp_event s1
    LEFT JOIN snp_event s2
    ON (
      s1.phone=s2.phone OR s1.mail=s2.mail
    )
  WHERE s1.name IN ('ShipmentCatalogueRequestAdded','PickupCatalogueRequestAdded')
    AND s1.time BETWEEN '2023-10-01' AND '2024-01-15'
    AND s2.name IN ('CatalogueRequestPickedUp','CatalogueRequestPaid')

Execution time: 0.5s

| id | select_type | table      | type  | possible_keys                          | key       | key_len | ref | rows    | Extra                                                  |
|----|-------------|------------|-------|----------------------------------------|-----------|---------|-----|---------|--------------------------------------------------------|
| 1  | PRIMARY     | <derived2> | ALL   |                                        |           |         |     | 3250781 | Using temporary; Using filesort                        |
| 2  | DERIVED     | s2         | range | INDEX_1,INDEX_PHONE,INDEX_MAIL         | INDEX_1   | 387     |     | 2011    | Using index condition; Using temporary; Using filesort |
| 2  | DERIVED     | s1         | ALL   | INDEX_1,INDEX_2,INDEX_PHONE,INDEX_MAIL |           |         |     | 12904   | Using where; Using join buffer (flat, BNL join)        |

The execution time is even faster here. That means the MariaDB engine can optimize the query even better. It makes no sense to keep the complexity of Cartesian product down by creating the subqueries with the smallest possible count.

query	execution time
original	50.3s
derived tables	22.6s
derived tables + indeces	0.8s
original + indeces	0.5s

Summary

We have demonstrated that JSON functions have no significant impact on the efficiency of simple queries. The drag we have reached so far is caused by a lack of indices over the JSON values, but it can be addressed using the generated indices for the most common values. In this case, it is recommended to keep these values in the same position for all JSONs you store in the table and also to keep the naming convention.

The second takeaway from this example is that the inner optimizer in MariaDB works well and that no further steps are usually necessary to keep the cardinality of the subresult down for the Cartesian product.