An introduction to Functional indexes in MySQL 8.0, and their gotchas

March 10, 2020 -
Tags: databases, indexes, mysql

Another interesting feature released with MySQL 8.0 is full support for functional indexes.

Although this is not a strictly new concept in the MySQL world (indexed generated columns provided the same functionality), I find it worth reviewing, through some applications, notes and considerations.

All in all, I’m not 100% bought into functional indexes (as opposed to indexed generated columns); I’ll elaborate on this over the course of the article.

As a natural fit, generated columns are included in the article; additionally, some constructs build on my previous article, in relation to the subject of CTEs.

Updated on 12/Mar/2020: Found another bug.

Contents:

Terminology

In this article I’ll use the term “Functional index” to the refer to indexes both with (8.0) and without (5.7) underlying generated columns.

Where I need to refer to the 8.0 version, I’ll use the term “Functional key part” (even if it may not be entirely appropriate).

Generated columns, and their application on JSON data

Before explaining the functional indexes, I’ll give a brief introduction to generated columns, since the latter are built on top of the former.

A generated column is a column whose content is a function of another column.

Virtual generated columns - the default type - take no storage; the alternative type, “stored”, actually store the data. In this article I’ll refer exclusively to the virtual ones.

The syntax is simple: in the most minimal form, the definition is <column_name> <data_type> AS (<function>).

This is a sample table:

CREATE TEMPORARY TABLE t_generated_column
(
  id               INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  parameters       JSON NOT NULL,
  parameter_serial CHAR(4) AS (parameters ->> '$.serial')
);

INSERT INTO t_generated_column (parameters)
VALUES
  ('{"serial": "foo0", "reserved": true}'),
  ('{"serial": "bar1", "reserved": false}'),
  ('{"serial": "baz2", "reserved": false}');

There are a few interesting concepts here.

First, the fact that a JSON column is used to store documents; we’re using MySQL as rudimentary document storage.
This is an interesting use case for generated columns (and likely, the original driver). On a complex enough application, at some point documents may be stored; if their usage is not sophisticated enough to require an external storage engine, MySQL can act as good enough tool for the job, in order to keep the system architecture as simple as possible.

The way the generated columns are defined, and work, is simple. In this case, the operator ->> (JSON inline path) is used, which is a shorthand for JSON_UNQUOTE(JSON_EXTRACT()). By default, JSON_EXTRACT includes quotes in the result (for strings), which we don’t require (in this context).

Finally, we can’t specify a NOT NULL constraint on the generated column - attempting to do so will return a syntax error.

Let’s have at look at how the data looks on SELECTion:

SELECT * FROM t_generated_column;

-- +----+---------------------------------------+------------------+
-- | id | parameters                            | parameter_serial |
-- +----+---------------------------------------+------------------+
-- |  1 | {"serial": "foo0", "reserved": true}  | foo0             |
-- |  2 | {"serial": "bar1", "reserved": false} | bar1             |
-- |  3 | {"serial": "baz2", "reserved": false} | baz2             |
-- +----+---------------------------------------+------------------+

Nice!

Functional indexes

Storing the data with the intention of unindexed access has definitely use cases, however, in applications where a significant part of the access to this data is performed at the DB layer, indexing will be crucial.

Generated columns can be indexed as any other column - in MySQL 5.7, this was the only way to build a functional index.

This is the previous table, with the index added and sample data:

CREATE TEMPORARY TABLE t_indexed_generated_column
(
  id               INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  parameters       JSON NOT NULL,
  parameter_serial CHAR(4) AS (parameters ->> '$.serial'),
  KEY (parameter_serial)
)
WITH RECURSIVE counter (n) AS
(
  SELECT 0
  UNION ALL
  SELECT n + 1 FROM counter WHERE n + 1 < 100000
)
SELECT /*+ SET_VAR(cte_max_recursion_depth = 1M) */
  CONCAT('{"serial": "', HEX(RANDOM_BYTES(2)), '"}') `parameters`
FROM counter;

ANALYZE TABLE t_indexed_generated_column;

Now we have a mean to address the JSON document via index (of course, limited to the specific field):

EXPLAIN FORMAT=TREE SELECT COUNT(*) FROM t_indexed_generated_column WHERE parameter_serial = 'CAFE';

-- -> Aggregate: count(0)
--     -> Index lookup on t_indexed_generated_column using parameter_serial (parameter_serial='CAFE')  (cost=1.10 rows=1)

The functionality above applies also to MySQL versions prior to 8.0, however, the latest version lifted a restriction: the backing generated column is not required anymore. A specific name is also given: “Functional key parts”, because indexes can now be composed of both functions and column references.

Behind the scenes, there’s nothing really new; appropriately, the engineers recycled the existing functionality, so that a functional indexes are backed by a hidden generated column.

Let’s create the table without the generated column, and fill it with random strings:

CREATE TEMPORARY TABLE t_functional_index
(
  id         INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  parameters JSON NOT NULL,
  KEY ( (CAST(parameters ->> '$.serial' AS CHAR(4))) )
);

INSERT INTO t_functional_index (parameters)
WITH RECURSIVE counter (n) AS
(
  SELECT 0
  UNION ALL
  SELECT n + 1 FROM counter WHERE n + 1 < 100000
)
SELECT /*+ SET_VAR(cte_max_recursion_depth = 1M) */
  CONCAT('{"serial": "', HEX(RANDOM_BYTES(2)), '"}') `parameters`
FROM counter;

ANALYZE TABLE t_functional_index;

The syntax is conceptually the same as generated columns - the function is wrapped by round brackets (the surrounding spaces are cosmetic).

Note that in this case, we must CAST the extracted value to CHAR, because we Cannot create a functional index on an expression that returns a BLOB or TEXT: the implicit function JSON_UNQUOTE return type is LONGTEXT.
We’re also hitting a limitation of functional indexes - while with normal indexes we could specify an index prefix (thus, converting the LONGTEXT into a (VAR)CHAR), this is not possible with functional indexes.

Now let’s test the index:

EXPLAIN FORMAT=TREE SELECT COUNT(*) FROM t_functional_index WHERE parameters ->> '$.serial' = 'CAFE';

-- -> Aggregate: count(0)
--     -> Filter: (json_unquote(json_extract(t_functional_index.parameters,'$.serial')) = 'CAFE')  (cost=10384.20 rows=100312)
--         -> Table scan on t_functional_index  (cost=10384.20 rows=100312)

Nuts! A table scan. What happened?

JSON functional index gotchas

I’ll summarize here a few gotchas with JSON functional indexes. While the expression exactness is obvious, the other two aren’t [so much 😉].

Expression exactness

When using functional indexes, the match condition must be exact, in order for the index to be used. This is because MySQL needs to evaluates expressions in a general form, and, although some expressions can certainly be transformed (and some actually are, by the optimizer), a sensible design choice is to shift the burden to the developer, in some cases, including this one.

Let’s use a condition with the same function as the index definition:

EXPLAIN FORMAT=TREE SELECT COUNT(*) FROM t_functional_index WHERE CAST(parameters ->> '$.serial' AS CHAR(4)) = 'CAFE';

-- -> Aggregate: count(0)
--    -> Index lookup on t_functional_index using functional_index (cast(json_unquote(json_extract(t_functional_index.parameters,_utf8mb4'$.serial')) as char(4) charset utf8mb4)='CAFE')  (cost=1.10 rows=1)

Even a minor change will make the optimizer discard the index:

EXPLAIN FORMAT=TREE SELECT COUNT(*) FROM t_functional_index WHERE CAST(parameters ->> '$.serial' AS CHAR(5)) = 'CAFE';

-- -> Aggregate: count(0)
--     -> Filter: (cast(json_unquote(json_extract(t_functional_index.parameters,'$.serial')) as char(5) charset utf8mb4) = 'CAFE')  (cost=10384.20 rows=100312)
--         -> Table scan on t_functional_index  (cost=10384.20 rows=100312)

Inconsistent behavior between generated columns with index, and functional indexes

Interestingly, if we use the form generated column with index, in place of the functional index, the index will be used:

EXPLAIN FORMAT=TREE SELECT COUNT(*) FROM t_indexed_generated_column WHERE parameters ->> '$.serial' = 'CAFE';

-- -> Aggregate: count(0)
--     -> Index lookup on t_indexed_generated_column using parameter_serial (parameter_serial='CAFE')  (cost=1.10 rows=1)

there is an inconsistency between a functional index and its generated column and index equivalent.

Let’s review the table definitions:

CREATE TEMPORARY TABLE t_indexed_generated_column
(
  id                 INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  parameters         JSON NOT NULL,
  parameter_serial   CHAR(4) AS (parameters ->> '$.serial'),
  KEY (parameter_serial)
);

CREATE TEMPORARY TABLE t_functional_index
(
  id         INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  parameters JSON NOT NULL,
  KEY ( (CAST(parameters ->> '$.serial' AS CHAR(4))) )
);

There is no obvious reason for the optimizer not to use the functional index; it would definitely benefit from this improvement, in order for functional indexes to be a solid choice.

Encoding inconsistency based on the index usage

The combination of the CAST and JSON_UNQUOTE required in the context of functional indexes/generated columns has also another unintended effect: different results, based on the collation chosen by the query structure.

Let’s create a table with a generated column and an index:

CREATE TEMPORARY TABLE t_encoding_test
(
  id                INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  parameters        JSON NOT NULL,
  parameters_serial CHAR(4) AS (CAST(parameters ->> '$.serial' AS CHAR(4))),
  KEY (parameters_serial)
)
SELECT '{"serial": "CAFE"}' `parameters`;

If a query uses the index indirectly (here we query on parameters, but the optimizer automatically uses the index on parameters_serial), we get a case insensitive search:

SELECT COUNT(*) FROM t_encoding_test WHERE parameters ->> '$.serial' = 'CAFe';

-- +----------+
-- | COUNT(*) |
-- +----------+
-- |        1 |
-- +----------+

this happens because the CAST function used to build the index, is associated to the system collation, which is case insensitive (by default, utf8mb4_0900_ai_ci).

However, if the index is not used:

SELECT COUNT(*) FROM t_encoding_test USE INDEX () WHERE parameters ->> '$.serial' = 'CAFe';

-- +----------+
-- | COUNT(*) |
-- +----------+
-- |        0 |
-- +----------+

the record is not matched! This is because the ->> operator uses JSON_UNQUOTE, whose hardcoded collation is utf8mb4_bin, which is case insensitive.

For more details, see the MySQL manpage or even the worklog.

An example of functional index with dates

Let’s take another example, and test the index:

CREATE TEMPORARY TABLE date_functional_index
(
  id         INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  created_at DATETIME NOT NULL,
  INDEX ( (DATE(created_at)) )
);

INSERT INTO date_functional_index (created_at)
WITH RECURSIVE sequence (n) AS
(
  SELECT 0
  UNION ALL
  SELECT n + 1 FROM sequence WHERE n + 1 < 100000
)
SELECT /*+ SET_VAR(cte_max_recursion_depth = 100K) */
  NOW() - INTERVAL (90 * RAND()) DAY `created_at`
FROM sequence;

ANALYZE TABLE date_functional_index;

(There are two issues in relation to this test; the details are given below)

Let’s test the index access:

EXPLAIN FORMAT=TREE SELECT COUNT(*) FROM date_functional_index WHERE DATE(created_at) = CURDATE();

-- -> Aggregate: count(0)
--     -> Index lookup on date_functional_index using functional_index (cast(date_functional_index.created_at as date)=curdate())  (cost=668.80 rows=608)

Works as expected; with this data type, we don’t need to deal with BLOBs and/or collations.

Gotcha: JOINs don’t use functional key parts

How about joins?

EXPLAIN FORMAT=TREE
WITH RECURSIVE dates_range (d) AS
(
  SELECT CURDATE() - INTERVAL 90 DAY
  UNION ALL
  SELECT d + INTERVAL 1 DAY FROM dates_range WHERE d + INTERVAL 1 day <= CURDATE()
)
SELECT d, COUNT(id)
FROM
  dates_range
  LEFT JOIN date_functional_index ON d = DATE(created_at)
GROUP BY d;

-- -> Table scan on <temporary>
--     -> Aggregate using temporary table
--         -> Nested loop left join
--             -> Table scan on dates_range
--                 -> [...]
--             -> Filter: (dates_range.d = cast(date_functional_index.created_at as date))  (cost=3429.97 rows=100649)
--                 -> Table scan on date_functional_index  (cost=3429.97 rows=100649)

Ouch! The index is not used; this is definitely something that needs to be considered.

Indexes on generated columns exhibit the same behavior, however, we can perform the join against the generated column, whose index is then used by the optimizer:

CREATE TEMPORARY TABLE date_generated_column_functional_index
(
  id              INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  created_at      DATETIME NOT NULL,
  created_at_date DATE AS (DATE(created_at)),
  INDEX (created_at_date)
)
WITH RECURSIVE sequence (n) AS
(
  SELECT 0
  UNION ALL
  SELECT n + 1 FROM sequence WHERE n + 1 < 100000
)
SELECT /*+ SET_VAR(cte_max_recursion_depth = 100K) */
  NOW() - INTERVAL (90 * RAND()) DAY `created_at`
FROM sequence;

ANALYZE TABLE date_generated_column_functional_index;

EXPLAIN FORMAT=TREE
WITH RECURSIVE dates_range (d) AS
(
  SELECT CURDATE() - INTERVAL 90 DAY
  UNION ALL
  SELECT d + INTERVAL 1 DAY FROM dates_range WHERE d + INTERVAL 1 day <= CURDATE()
)
SELECT d, COUNT(id)
FROM
  dates_range
  LEFT JOIN date_generated_column_functional_index ON d = created_at_date
GROUP BY d;

-- -> Table scan on <temporary>
--     -> Aggregate using temporary table
--         -> Nested loop left join
--             -> Table scan on dates_range
--                 -> [...]
--             -> Index lookup on date_generated_column_functional_index using created_at_date (created_at_date=dates_range.d)  (cost=36.18 rows=1026)

Therefore, it’s not possible to use functional key parts with JOINs at all, while it’s possible with indexed generated columns. This makes functional key parts less appealing, when considering the overall design.

I’ve filed this as feature request.

Bugs

Bug on CREATE TABLE ... SELECT

In some of the previous queries I’ve used CREATE TABLE + INSERT instead of CREATE TABLE ... SELECT. Why?

Because of a bug:

CREATE TEMPORARY TABLE bug_functional_index (
  sold_on DATETIME NOT NULL,
  INDEX sold_on_date ((DATE(sold_on)))
)
SELECT NOW() `sold_on`;

-- ERROR 3105 (HY000): The value specified for generated column '3351ae78dcbae4f473d53aebdc350681' in table 'bug_functional_index' is not allowed.

the above should work, considering split form works ok:

CREATE TEMPORARY TABLE bug_functional_index (
  sold_on DATETIME NOT NULL,
  INDEX sold_on_date ((DATE(sold_on)))
);

INSERT INTO bug_functional_index VALUES (NOW());

-- Query OK, 1 row affected (0,00 sec)

I’ve reported this to the MySQL bug tracker.

Bug on LOAD DATA INFILE

There is also an additional bug: LOAD DATA INFILE statements will fail, if the columns are not explicitly specified:

echo '[]' > /tmp/test_data.csv

mysql <<'SQL'
  CREATE SCHEMA IF NOT EXISTS tmp;

  CREATE TEMPORARY TABLE tmp.issue_load_data_on_functional_index
  (
    json_col JSON,
    KEY json_col ( (CAST(json_col -> '$' AS UNSIGNED ARRAY)) )
  );

  LOAD DATA INFILE '/tmp/test_data.csv' INTO TABLE tmp.issue_load_data_on_functional_index;
SQL

# ERROR 1261 (01000) at line 9: Row 1 doesn't contain data for all columns

The workaround is to explicitly specify the columns:

LOAD DATA INFILE '/tmp/test_data.csv' INTO TABLE tmp.issue_load_data_on_functional_index (json_col);

I’ve reported this bug as well.

Conclusion

I’m not bought into functional key parts.

While I find functional indexes an important functionality of solid, modern, RDBMSs, I think that the functional key parts feature itself needs some time to mature, especially considering that indexed generated columns can do the same work (with some exceptions, e.g. multi-valued indexing).

Now moving on to another new 8.0 interesting feature (window functions!) 😄