UPDATE [...] SET mycol = (@myvar := EXPRESSION(@myvar, mycol))
pattern.
This pattern though doesn’t play well with the optimizer (leading to non-deterministic behavior), so it has been deprecated. This left a sort of void, since the (relatively) sophisticated logic is now harder to reproduce, at least with the same simplicity.
In this article, I’ll have a look at two ways to apply such logic: using, canonically, window functions, and, a bit more creatively, using recursive CTEs.
Although CTEs are fairly intuitive, I advise those unfamiliar with them to read my previous post on the subject.
The same applies to window functions: I will break the query and concepts down, but it helps to have at least a basic idea of how they work. There is a vast amount of literature about window functions (which is the reason why I haven’t written about them until now); pretty much all the tutorials use corporate budgets or populations/countries as examples. Here, instead, I’ll use a real-world case.
In relation to the software, MySQL 8.0.19 is convenient (but not required). All the statements need to be run in the same console, due to the reuse of @venue_id.
There is always an architectural dilemma between placing logic at the application level as opposed to the database level. Although this is a worthwhile debate, in this context the underlying assumption is that the logic must stay at the database level; a typical requirement leading to this is speed, which has actually been our case.
In this problem, we manage venue (theater) seats.
As a business requirement, we need to assign a “grouping”: an additional number attached to each seat, which increases by 1 for adjacent seats and by 2 whenever a new group starts (a gap after the last seat, or a new row).
In order to set the grouping value, in pseudocode:
current_grouping = 0

for each row:
    for each number:
        if (is_there_a_space_after_last_seat or is_a_new_row) and is_not_the_first_seat:
            current_grouping += 2
        else:
            current_grouping += 1

        seat.grouping = current_grouping
In practice, we want the setup on the left to have the corresponding values on the right:
x→ 0 1 2 0 1 2
y ╭───┬───┬───╮ ╭───┬───┬───╮
↓ 0 │ x │ x │ │ │ 1 │ 2 │ │
├───┼───┼───┤ ├───┼───┼───┤
1 │ x │ │ x │ │ 4 │ │ 6 │
├───┼───┼───┤ ├───┼───┼───┤
2 │ x │ │ │ │ 8 │ │ │
╰───┴───┴───╯ ╰───┴───┴───╯
Let’s use a minimalist design for the underlying table:
CREATE TABLE seats (
id INT AUTO_INCREMENT PRIMARY KEY,
venue_id INT,
y INT,
x INT,
`row` VARCHAR(16),
number INT,
`grouping` INT,
UNIQUE venue_id_y_x (venue_id, y, x)
);
We won’t need the row/number columns; on the other hand, we don’t want to use a table whose records are fully contained in an index, in order to be closer to a real-world setting.
Based on the diagram of the previous section, the seat coordinates, in the form (y, x), are: (0, 0), (0, 1), (1, 0), (1, 2), (2, 0).
Note that we’re using y as the first coordinate, because it makes it easier to reason in terms of rows.
We’re going to load a large enough number of records, in order to make sure the optimizer doesn’t take unexpected shortcuts. We use recursive CTEs, of course 😉:
INSERT INTO seats(venue_id, y, x, `row`, number)
WITH RECURSIVE venue_ids (id) AS
(
SELECT 0
UNION ALL
SELECT id + 1 FROM venue_ids WHERE id + 1 < 100000
)
SELECT /*+ SET_VAR(cte_max_recursion_depth = 1M) */
v.id,
c.y, c.x,
CHAR(ORD('A') + FLOOR(RAND() * 3) USING ASCII) `row`,
FLOOR(RAND() * 3) `number`
FROM venue_ids v
JOIN (
VALUES
ROW(0, 0),
ROW(0, 1),
ROW(1, 0),
ROW(1, 2),
ROW(2, 0)
) c (y, x)
;
ANALYZE TABLE seats;
A couple of notes:

- we use the VALUES ROW()... construct in order to represent a (joinable) table without actually creating it;
- we don’t care about the exact row/number data, as they’re filler.

The old-school solution is very straightforward:
SET @venue_id = 5000; -- arbitrary venue id; any (stored) id will do
SET @grouping = -1;
SET @y = -1;
SET @x = -1;
WITH seat_groupings (id, y, x, `grouping`, tmp_y, tmp_x) AS
(
SELECT
id, y, x,
@grouping := @grouping + 1 + (seats.x > @x + 1 OR seats.y != @y),
@y := seats.y,
@x := seats.x
FROM seats
WHERE venue_id = @venue_id
ORDER BY y, x
)
UPDATE
seats s
JOIN seat_groupings sg USING (id)
SET s.grouping = sg.grouping
;
-- Query OK, 5 rows affected, 3 warnings (0,00 sec)
Nice and easy (but keep in mind the warnings)!
A little side note: I’m taking advantage of boolean arithmetic properties here; specifically, the following statements are equivalent:
SELECT seats.x > @x + 1 OR seats.y != @y `increment`;
SELECT IF (
seats.x > @x + 1 OR seats.y != @y,
1,
0
) `increment`;
Some people find it intuitive, some don’t - it’s a matter of taste; now that it’s been clarified, I will use the compact form for the rest of the article.
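As a quick standalone illustration (not part of the article’s workflow), a boolean expression evaluates to 1 or 0 when used in a numeric context:

SELECT (2 > 1) `is_true`, (1 > 2) `is_false`, 10 + (2 > 1) `sum`;
-- +---------+----------+------+
-- | is_true | is_false | sum  |
-- +---------+----------+------+
-- |       1 |        0 |   11 |
-- +---------+----------+------+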
Let’s see the outcome:
SELECT id, y, x, `grouping` FROM seats WHERE venue_id = @venue_id ORDER BY y, x;
-- +-------+------+------+----------+
-- | id | y | x | grouping |
-- +-------+------+------+----------+
-- | 24887 | 0 | 0 | 1 |
-- | 27186 | 0 | 1 | 2 |
-- | 29485 | 1 | 0 | 4 |
-- | 31784 | 1 | 2 | 6 |
-- | 34083 | 2 | 0 | 8 |
-- +-------+------+------+----------+
This approach is ideal!
It has just a “small” defect: it may work… or not.
The reason is that the query optimizer doesn’t necessarily evaluate left to right, so the assignment operations (:=
) may be evaluated out of order, causing the result to be wrong. This is a problem typically experienced after MySQL upgrades.
As of MySQL 8.0, this functionality is indeed deprecated:
-- To be run immediately after the UPDATE.
--
SHOW WARNINGS\G
-- *************************** 1. row ***************************
-- Level: Warning
-- Code: 1287
-- Message: Setting user variables within expressions is deprecated and will be removed in a future release. Consider alternatives: 'SET variable=expression, ...', or 'SELECT expression(s) INTO variables(s)'.
-- [...]
Let’s fix this!
Window functions have been a long-awaited functionality in the MySQL world.
Generally speaking, the “rolling” nature of window functions fits accumulating functions very well. However, some complex accumulating functions require the result of the latest expression to be available, which is something window functions don’t support, since they work on a column basis.
This doesn’t mean that the problem can’t be solved; rather, it needs to be re-thought.
In this case, we split the problem into two concepts; we think of the grouping value for each seat as the sum of two values: the seat’s sequence number within the (ordered) venue, and the cumulative sum of the increments caused by gaps and new rows up to that seat.
Those familiar with window functions will recognize the patterns here 🙂
The sequence number of each seat is a built-in function:
ROW_NUMBER() OVER <window>
The cumulative value is where things get interesting. In order to accomplish this task, we perform two steps: first, we compute the increment introduced by each seat relative to the previous one; then, we compute the running sum of those increments.
Let’s see the SQL:
WITH
increments (id, increment) AS
(
SELECT
id,
x > LAG(x, 1, x - 1) OVER tzw + 1 OR y != LAG(y, 1, y) OVER tzw
FROM seats
WHERE venue_id = @venue_id
WINDOW tzw AS (ORDER BY y, x)
)
SELECT
s.id, y, x,
ROW_NUMBER() OVER tzw + SUM(increment) OVER tzw `grouping`
FROM seats s
JOIN increments i USING (id)
WINDOW tzw AS (ORDER BY y, x)
;
-- +-------+---+---+----------+
-- | id | y | x | grouping |
-- +-------+---+---+----------+
-- | 24887 | 0 | 0 | 1 |
-- | 27186 | 0 | 1 | 2 |
-- | 29485 | 1 | 0 | 4 |
-- | 31784 | 1 | 2 | 6 |
-- | 34083 | 2 | 0 |        8 |
-- +-------+---+---+----------+
Nice!
(Note that for simplicity, I’ll omit the UPDATE from now on.)
Let’s review the query.
The CTE (edited):
SELECT
id,
x > LAG(x, 1, x - 1) OVER tzw + 1 OR y != LAG(y, 1, y) OVER tzw `increment`
FROM seats
WHERE venue_id = @venue_id
WINDOW tzw AS (ORDER BY y, x)
;
-- +-------+-----------+
-- | id | increment |
-- +-------+-----------+
-- | 24887 | 0 |
-- | 27186 | 0 |
-- | 29485 | 1 |
-- | 31784 | 1 |
-- | 34083 | 1 |
-- +-------+-----------+
calculates the increment for each seat, compared to the previous one (more on LAG() later). It works purely on each record and the previous one; it’s not cumulative.
Now, in order to calculate the cumulative increments, we just use a window function to compute the sum, for and up to each seat:
-- (CTE here...)
SELECT
s.id, y, x,
ROW_NUMBER() OVER tzw `pos.`,
SUM(increment) OVER tzw `cum.incr.`
FROM seats s
JOIN increments i USING (id)
WINDOW tzw AS (ORDER BY y, x);
-- +-------+---+---+------+-----------+
-- | id | y | x | pos. | cum.incr. | (grouping)
-- +-------+---+---+------+-----------+
-- | 24887 | 0 | 0 | 1 | 0 | = 1 + 0 (curr.)
-- | 27186 | 0 | 1 | 2 | 0 | = 2 + 0 (#24887) + 0 (curr.)
-- | 29485 | 1 | 0 | 3 | 1 | = 3 + 0 (#24887) + 0 (#27186) + 1 (curr.)
-- | 31784 | 1 | 2 | 4 | 2 | = 4 + 0 (#24887) + 0 (#27186) + 1 (#29485) + 1 (curr.)
-- | 34083 | 2 | 1 | 5 | 3 | = 5 + 0 (#24887) + 0 (#27186) + 1 (#29485) + 1 (#31784)↵
-- +-------+---+---+------+-----------+ + 1 (curr.)
The LAG() window function

The LAG function, in its simplest form (LAG(x)), returns the previous value of the given column. A typical nuisance of window functions is dealing with the first record(s) in the window - since there is no previous record, they return NULL. With LAG, we can specify the value we want as the third parameter:
LAG(x, 1, x - 1) -- defaults to `x - 1`
LAG(y, 1, y)     -- defaults to `y`
By specifying the defaults above, we make sure that the very first seat in the window will be treated by the logic as adjacent to the previous one (x - 1) and in the same row (y).
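A minimal standalone illustration of the effect of the third parameter (the two-row derived table here is hypothetical, not part of the article’s schema):

SELECT x, LAG(x) OVER w `lag_plain`, LAG(x, 1, x - 1) OVER w `lag_default`
FROM (SELECT 0 `x` UNION ALL SELECT 1) t
WINDOW w AS (ORDER BY x);
-- +---+-----------+-------------+
-- | x | lag_plain | lag_default |
-- +---+-----------+-------------+
-- | 0 |      NULL |          -1 |
-- | 1 |         0 |           0 |
-- +---+-----------+-------------+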
The alternative to defaults is typically IFNULL, which is very intrusive, especially considering the relative complexity of the expression:
-- Both valid. And both ugly!
--
IFNULL(x > LAG(x) OVER tzw + 1 OR y != LAG(y) OVER tzw, 0)
IFNULL(x > LAG(x) OVER tzw + 1, FALSE) OR IFNULL(y != LAG(y) OVER tzw, FALSE)
The second LAG() parameter is the number of positions to go back in the window; 1 is the previous, which is also the default value.
In this query, we’re using the same window multiple times. The following queries are formally equivalent:
SELECT
id,
x > LAG(x, 1, x - 1) OVER tzw + 1
OR y != LAG(y, 1, y) OVER tzw
FROM seats
WHERE venue_id = @venue_id
WINDOW tzw AS (ORDER BY y, x);
SELECT
id,
x > LAG(x, 1, x - 1) OVER (ORDER BY y, x) + 1
OR y != LAG(y, 1, y) OVER (ORDER BY y, x)
FROM seats
WHERE venue_id = @venue_id;
However, the latter may cause a suboptimal plan (which I’ve experienced, at least in the past); the optimizer may treat the windows as independent, and iterate them separately.
For this reason, I advise always using named windows, at least when the same window is repeated.
The PARTITION BY clause

Typically, window functions are executed over a partition, which in this case would be:
SELECT
id,
x > LAG(x, 1, x - 1) OVER tzw + 1
OR y != LAG(y, 1, y) OVER tzw
FROM seats
WHERE venue_id = @venue_id
WINDOW tzw AS (PARTITION BY venue_id ORDER BY y, x); -- here!
Since the window matches the full set of records (which is filtered by the WHERE condition), we don’t need to specify it.
If we had to run this query over the whole seats table, then we’d need it, so that the window is reset for each venue_id.
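For reference, a sketch of that whole-table variant (same columns as before; the WHERE clause is simply dropped, and the partition takes over its role):

SELECT
  id,
  x > LAG(x, 1, x - 1) OVER tzw + 1 OR y != LAG(y, 1, y) OVER tzw `increment`
FROM seats
WINDOW tzw AS (PARTITION BY venue_id ORDER BY y, x);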
In the query, the ORDER BY is specified at the window level:
SELECT
id,
x > LAG(x, 1, x - 1) OVER tzw + 1
OR y != LAG(y, 1, y) OVER tzw
FROM seats
WHERE venue_id = @venue_id
WINDOW tzw AS (ORDER BY y, x)
The window ordering is separate from the SELECT one. This is crucial! The behavior of this query:
SELECT
id,
x > LAG(x, 1, x - 1) OVER tzw + 1
OR y != LAG(y, 1, y) OVER tzw
FROM seats
WHERE venue_id = @venue_id
WINDOW tzw AS ()
ORDER BY y, x
is unspecified. Let’s have a look at the manpage:
Query result rows are determined from the FROM clause, after WHERE, GROUP BY, and HAVING processing, and windowing execution occurs before ORDER BY, LIMIT, and SELECT DISTINCT.
Abstractly speaking, in order to solve this class of problems, instead of representing each entry as a function of the previous one, we calculate the state change for each entry, then sum the changes up.
Although more complex than the functionality it replaces, this solution is very solid. This approach, though, may not always be possible, or at least easy; that’s where the recursive CTE solution comes into play.
This approach requires a workaround due to a limitation in MySQL’s CTE functionality, but, on the other hand, it’s a generic, direct, solution, and as such, it doesn’t require any rethinking of the approach.
Let’s start from a simplified version of the final query:
-- `p_` is for `Previous`, in order to make the conditions a bit more intuitive.
--
WITH RECURSIVE groupings (p_id, p_venue_id, p_y, p_x, p_grouping) AS
(
(
SELECT id, venue_id, y, x, 1
FROM seats
WHERE venue_id = @venue_id
ORDER BY y, x
LIMIT 1
)
UNION ALL
SELECT
s.id, s.venue_id, s.y, s.x,
p_grouping + 1 + (s.x > p_x + 1 OR s.y != p_y)
FROM groupings, seats s
WHERE s.venue_id = p_venue_id AND (s.y, s.x) > (p_y, p_x)
ORDER BY s.venue_id, s.y, s.x
LIMIT 1
)
SELECT * FROM groupings;
Bingo! This query is (relatively) simple, but most importantly, it expresses the grouping accumulating function in the simplest possible way:
p_grouping + 1 + (s.x > p_x + 1 OR s.y != p_y)
-- the above is equivalent to:
@grouping := @grouping + 1 + (seats.x > @x + 1 OR seats.y != @y),
@y := seats.y,
@x := seats.x
Even for those who are not accustomed to CTEs, the logic is simple.
The initial row is the first seat of the venue, in order:
SELECT id, venue_id, y, x, 1
FROM seats
WHERE venue_id = @venue_id
ORDER BY y, x
LIMIT 1
In the recursive part, we proceed with the iteration:
SELECT
s.id, s.venue_id, s.y, s.x,
p_grouping + 1 + (s.x > p_x + 1 OR s.y != p_y)
FROM groupings, seats s
WHERE s.venue_id = p_venue_id AND (s.y, s.x) > (p_y, p_x)
ORDER BY s.venue_id, s.y, s.x
LIMIT 1
The WHERE condition, along with the ORDER BY and LIMIT clauses, simply finds the next seat: the one seat with the same venue id which, in order of (venue_id, y, x), has greater (y, x) coordinates.
The s.venue_id part of the ordering is crucial! This allows us to use the index.
The SELECT clause takes care of:

- computing the new grouping value (based on p_grouping);
- passing the current seat’s values (s.id, s.venue_id, s.y, s.x) on to the next cycle.

We select FROM groupings so that we fulfill the requirements for the CTE to be recursive.
What’s interesting here is that we use the recursive CTE essentially as an iterator, by selecting from the groupings table in the recursive subquery, while joining with seats in order to find the data to work on.
The JOIN is formally a cross join; however, only one record is returned, due to the LIMIT clause.
Unfortunately, the above query doesn’t work, because the ORDER BY clause is currently not supported in the recursive subquery; additionally, the semantics of the LIMIT as used here are not the intended ones, as it applies to the outermost query:

LIMIT is now supported […] The effect on the result set is the same as when using LIMIT in the outermost SELECT
However, it’s not a significant problem. Let’s have a look at the working version:
WITH RECURSIVE groupings (p_id, p_venue_id, p_y, p_x, p_grouping) AS
(
(
SELECT id, venue_id, y, x, 1
FROM seats
WHERE venue_id = @venue_id
ORDER BY y, x
LIMIT 1
)
UNION ALL
SELECT
s.id, s.venue_id, s.y, s.x,
p_grouping + 1 + (s.x > p_x + 1 OR s.y != p_y)
FROM groupings, seats s WHERE s.id = (
SELECT si.id
FROM seats si
WHERE si.venue_id = p_venue_id AND (si.y, si.x) > (p_y, p_x)
ORDER BY si.venue_id, si.y, si.x
LIMIT 1
)
)
SELECT * FROM groupings;
-- +-------+------+------+------------+
-- | p_id | p_y | p_x | p_grouping |
-- +-------+------+------+------------+
-- | 24887 | 0 | 0 | 1 |
-- | 27186 | 0 | 1 | 2 |
-- | 29485 | 1 | 0 | 4 |
-- | 31784 | 1 | 2 | 6 |
-- | 34083 | 2 | 0 | 8 |
-- +-------+------+------+------------+
It’s a bit of a shame having to use a subquery, but it works, and the boilerplate is minimal, as several clauses are required anyway.
Here, instead of performing the ordering and limiting on the relation resulting from the join of groupings and seats, we do it in a subquery, and pass the result to the outer query, which consequently selects only the target record.
Let’s have a look at the query plan, using the EXPLAIN ANALYZE functionality:
mysql> EXPLAIN ANALYZE WITH RECURSIVE groupings [...]
-> Table scan on groupings (actual time=0.000..0.001 rows=5 loops=1)
-> Materialize recursive CTE groupings (actual time=0.140..0.141 rows=5 loops=1)
-> Limit: 1 row(s) (actual time=0.019..0.019 rows=1 loops=1)
-> Index lookup on seats using venue_id_y_x (venue_id=(@venue_id)) (cost=0.75 rows=5) (actual time=0.018..0.018 rows=1 loops=1)
-> Repeat until convergence
-> Nested loop inner join (cost=3.43 rows=2) (actual time=0.017..0.053 rows=2 loops=2)
-> Scan new records on groupings (cost=2.73 rows=2) (actual time=0.001..0.001 rows=2 loops=2)
-> Filter: (s.id = (select #5)) (cost=0.30 rows=1) (actual time=0.020..0.020 rows=1 loops=5)
-> Single-row index lookup on s using PRIMARY (id=(select #5)) (cost=0.30 rows=1) (actual time=0.014..0.014 rows=1 loops=5)
-> Select #5 (subquery in condition; dependent)
-> Limit: 1 row(s) (actual time=0.007..0.008 rows=1 loops=9)
-> Filter: ((si.y,si.x) > (groupings.p_y,groupings.p_x)) (cost=0.75 rows=5) (actual time=0.007..0.007 rows=1 loops=9)
-> Index lookup on si using venue_id_y_x (venue_id=groupings.p_venue_id) (cost=0.75 rows=5) (actual time=0.006..0.006 rows=4 loops=9)
The plan is very much as expected. The foundation of an optimal plan, in this case, is the index lookups:
-> Nested loop inner join (cost=3.43 rows=2) (actual time=0.017..0.053 rows=2 loops=2)
-> Single-row index lookup on s using PRIMARY (id=(select #5)) (cost=0.30 rows=1) (actual time=0.014..0.014 rows=1 loops=5)
-> Index lookup on si using venue_id_y_x (venue_id=groupings.p_venue_id) (cost=0.75 rows=5) (actual time=0.006..0.006 rows=4 loops=9)
which are paramount; if even just an index scan is performed (in short, when the index entries are scanned linearly, instead of directly finding the desired one), the performance will tank.
Therefore, the requirements for this strategy to work are that the related indexes are in place, and that the optimizer uses them efficiently.
It’s expected that, in the future, if the restrictions are lifted, not having to use the subquery will make the task considerably simpler for the optimizer.
For particular use cases where an optimal plan can’t be found, just use a temporary table:
CREATE TEMPORARY TABLE selected_seats (
id INT NOT NULL PRIMARY KEY,
y INT,
x INT,
UNIQUE (y, x)
)
SELECT id, y, x
FROM seats WHERE venue_id = @venue_id;
WITH RECURSIVE
groupings (p_id, p_y, p_x, p_grouping) AS
(
(
SELECT id, y, x, 1
FROM seats
WHERE venue_id = @venue_id
ORDER BY y, x
LIMIT 1
)
UNION ALL
SELECT
s.id, s.y, s.x,
p_grouping + 1 + (s.x > p_x + 1 OR s.y != p_y)
FROM groupings, seats s WHERE s.id = (
SELECT ss.id
FROM selected_seats ss
WHERE (ss.y, ss.x) > (p_y, p_x)
ORDER BY ss.y, ss.x
LIMIT 1
)
)
SELECT * FROM groupings;
Even if index scans are performed in this query, they’re very cheap, as the selected_seats table is very small.
I’m very pleased that a very effective but flawed workflow can be replaced with the clean (enough) functionalities brought by MySQL 8.0.
There are still new (underlying) functionalities in development in the 8.0 series, which therefore keeps proving to be a very strong release.
Happy recursion 😄
While MySQL is not there yet, it’s now possible to cover a significant use case: storing denormalized columns (or arrays in general), and accessing them via index.
In this article I’ll give some context about denormalized data and indexes, including the workaround for such functionality on MySQL 5.7, and describe how this is (rather) cleanly accomplished on MySQL 8.0.
Although B-trees are technically inverted indexes, in this context I’ll use the “inverted index” term to describe document-oriented indexes, like PostgreSQL’s GIN or InnoDB’s fulltext index, and I’ll refer to B-trees with their name.
Also, I won’t make any distinction between B-trees and B+trees, using only the “B-tree” term.
MySQL doesn’t have an array data type. This is a fundamental problem in architectures where storing denormalized rows is a requirement, for example, where MySQL is (also) used for data warehousing.
Storage and access are two sides of the same coin: missing optimal storage data structures for a certain class of data almost certainly implies the lack of optimal related algorithms; in this case, it translates to lack of (direct) indexing.
Storing arrays is not a big problem in itself: assuming simple data types, like integers, we can easily adopt the workaround of using a VARCHAR/TEXT column to store the values with an arbitrary separator (space is the most convenient); however, MySQL is (was) not designed to index this scenario.
Again, we can adopt another workaround: fulltext indexes. We can either set the InnoDB fulltext minimum token size to 1 (which has the downside of being a global setting), or pad the values (which works, although it’s suboptimal in terms of storage).
This is a working solution, if one really needs it: it comes with the downsides of InnoDB’s fulltext index support, which are not few, but it’s good enough.
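A minimal sketch of that workaround (the table and values are hypothetical; values are zero-padded to 4 characters so they reach the default InnoDB fulltext minimum token size, avoiding the global setting change):

CREATE TABLE t_pseudo_array (
  id INT PRIMARY KEY AUTO_INCREMENT,
  c_values VARCHAR(255) NOT NULL, -- e.g. '0001 0002 0003'
  FULLTEXT (c_values)
);

INSERT INTO t_pseudo_array (c_values) VALUES ('0001 0002 0003'), ('0004 0005 0006');

SELECT * FROM t_pseudo_array WHERE MATCH (c_values) AGAINST ('0002' IN BOOLEAN MODE);
-- expected to return the first row only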
MySQL can store arrays since v5.7, through the JSON data type:
-- Note how we're using v8.0.19's new `ROW()` construct for inserting multiple rows.
--
CREATE TEMPORARY TABLE t_json_arrays(
id INT PRIMARY KEY AUTO_INCREMENT,
c_array JSON NOT NULL
)
SELECT *
FROM (
VALUES
ROW("[1, 2, 3]"),
ROW(JSON_ARRAY(4, 5, 6))
) v (c_array);
SELECT * FROM t_json_arrays;
-- +----+-----------+
-- | id | c_array |
-- +----+-----------+
-- | 1 | [1, 2, 3] |
-- | 2 | [4, 5, 6] |
-- +----+-----------+
We can insert a JSON document (array) either as a string, or using the JSON_ARRAY function.
Some operators are available for accessing the data stored in the JSON document, e.g. ->:
-- Functionality for accessing JSON data
--
SELECT id, c_array -> "$[1]" `array_entry_1` FROM t_json_arrays;
-- +----+---------------+
-- | id | array_entry_1 |
-- +----+---------------+
-- | 1 | 2 |
-- | 2 | 5 |
-- +----+---------------+
However, indexing has been introduced only with v8.0.17, along with new search functionalities:
-- This is a functional index.
--
ALTER TABLE t_json_arrays ADD KEY ( (CAST(c_array -> '$' AS UNSIGNED ARRAY)) );
SELECT * FROM t_json_arrays WHERE 3 MEMBER OF (c_array);
-- +----+-----------+
-- | id | c_array |
-- +----+-----------+
-- | 1 | [1, 2, 3] |
-- +----+-----------+
EXPLAIN FORMAT=TREE SELECT * FROM t_json_arrays WHERE 3 MEMBER OF (c_array -> '$');
-- -> Filter: json'3' member of (cast(json_extract(t_json_arrays.c_array,_utf8mb4'$') as unsigned array)) (cost=1.10 rows=1)
-- -> Index lookup on t_json_arrays using functional_index (cast(json_extract(t_json_arrays.c_array,_utf8mb4'$') as unsigned array)=json'3') (cost=1.10 rows=1)
Note how the WHERE condition must replicate exactly the functional key part (in this case, c_array -> '$').
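For comparison, a hypothetical counter-check (not part of the original example): if the expression doesn’t replicate the key part, we’d expect the multi-valued index not to be used:

EXPLAIN FORMAT=TREE SELECT * FROM t_json_arrays WHERE 3 MEMBER OF (c_array);
-- expected: a filter over a table scan, rather than an index lookup on functional_index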
According to the functionality worklog, the index is a slightly modified B-tree:
In general, multi-valued index is a regular functional index, with the exception that it requires additional handling under the hood on INSERT/UPDATE for multi-valued key parts.
SHOW INDEXES FROM t_json_arrays WHERE Key_name NOT LIKE 'PRIMARY'\G
-- *************************** 1. row ***************************
-- Table: t_json_arrays
-- Key_name: functional_index
-- Index_type: BTREE
-- [...]
Using a simple B-tree for this purpose has the mirror-opposite advantages and disadvantages of inverted indexes, the crucial difference being that the cost of operations increases linearly with the size of the stored array.
This is because B-trees don’t have optimizations for large/batch insertions (inverted indexes are document-oriented, so insertions are expected to be large); each array entry is one key in the index.
On the other hand, the DML cost is constant¹; there are no spikes caused by maintenance operations (i.e. index merging).
An interesting point is that:
Only one multi-valued key part is allowed per index, to avoid exponential explosion. E.g if there would be two multi-valued key parts, and server would provide 10 values for each, SE would have to store 100 index records.
Why is that? Because there are no convenient data structures for optimizing such a case.
With the current data structure, the tuple [1, 2], [4, 5] would generate the index keys: (1, 4), (1, 5), (2, 4), (2, 5).
Suppose that we tackled the problem by reducing the keys to a composition of each value of the first array with the whole second array: (1, 4, 5), (2, 4, 5). Then we couldn’t efficiently search in both arrays, since the index is usable only on the first element; for example, searching on (1, 4) could only look up the 1 entries, not the 4 ones.
Sound familiar? This is essentially the leftmost string prefix search problem.
The arrays of each tuple can still be indexed independently; such a configuration could probably lead to the index merge intersection optimization.
We’ve played with array storage and indexing; how about creating a column of the UNSIGNED ARRAY data type?
CREATE TEMPORARY TABLE t_json_arrays(
id INT PRIMARY KEY AUTO_INCREMENT,
c_array UNSIGNED ARRAY NOT NULL
);
-- ERROR 1064 (42000): You have an error in your SQL syntax [...] near 'UNSIGNED ARRAY NOT NULL
Ouch! There is currently no such data type. Internally, everything is done via JSON; the worklog explains this:
[…] server creates virtual generated column using the typed array field (instead of a regular field) for a function for which is_returns_array() method returns true. This WL adds one such function - CAST(… AS … ARRAY).
The typed array field (Field_typed_array class) essentially is a JSON field, a descendant of Field_json, but it reports itself as a regular field which type is typed array element’s type. […]
Adding a new data type would require a considerable amount of work; the team’s resources are evidently focused on other functionalities, so they released a good-enough feature which, in my opinion, is a balanced choice.
We’re very excited by the introduction of this functionality, and we’re in the process of migrating the fulltext indexes used for pseudo-arrays to JSON-based array columns/indexes; I think this is a very significant step in making MySQL a well-rounded RDBMS, and it covers an important use case in applications of a certain size.
¹: Insertion cost in B-trees is not constant, however, the maintenance cost (rebalancing) is negligible in this context.
Although this is not a strictly new concept in the MySQL world (indexed generated columns provided the same functionality), I find it worth reviewing, through some applications, notes and considerations.
All in all, I’m not 100% bought into functional indexes (as opposed to indexed generated columns); I’ll elaborate on this over the course of the article.
As a natural fit, generated columns are included in the article; additionally, some constructs build on my previous article, in relation to the subject of CTEs.
Updated on 12/Mar/2020: Found another bug.
In this article I’ll use the term “functional index” to refer to indexes both with an explicit underlying generated column (5.7) and without one (8.0).
Where I need to refer to the 8.0 version, I’ll use the term “Functional key part” (even if it may not be entirely appropriate).
Before explaining functional indexes, I’ll give a brief introduction to generated columns, since the former are built on top of the latter.
A generated column is a column whose content is a function of another column.
Virtual generated columns - the default type - take no storage; the alternative type, “stored”, actually stores the data. In this article I’ll refer exclusively to the virtual ones.
The syntax is simple: in the most minimal form, the definition is <column_name> <data_type> AS (<function>).
This is a sample table:
CREATE TEMPORARY TABLE t_generated_column
(
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
parameters JSON NOT NULL,
parameter_serial CHAR(4) AS (parameters ->> '$.serial')
);
INSERT INTO t_generated_column (parameters)
VALUES
('{"serial": "foo0", "reserved": true}'),
('{"serial": "bar1", "reserved": false}'),
('{"serial": "baz2", "reserved": false}');
There are a few interesting concepts here.
First, the fact that a JSON column is used to store documents; we’re using MySQL as a rudimentary document store.
This is an interesting use case for generated columns (and likely, the original driver). In a complex enough application, at some point documents may need to be stored; if their usage is not sophisticated enough to require an external storage engine, MySQL can act as a good-enough tool for the job, keeping the system architecture as simple as possible.
The way generated columns are defined, and work, is simple. In this case, the ->> operator (JSON inline path) is used, which is a shorthand for JSON_UNQUOTE(JSON_EXTRACT()). By default, JSON_EXTRACT includes quotes in the result (for strings), which we don’t need in this context.
Finally, we can’t specify a NOT NULL constraint on the generated column - attempting to do so will return a syntax error.
Let’s have a look at how the data looks on SELECTion:
SELECT * FROM t_generated_column;
-- +----+---------------------------------------+------------------+
-- | id | parameters | parameter_serial |
-- +----+---------------------------------------+------------------+
-- | 1 | {"serial": "foo0", "reserved": true} | foo0 |
-- | 2 | {"serial": "bar1", "reserved": false} | bar1 |
-- | 3 | {"serial": "baz2", "reserved": false} | baz2 |
-- +----+---------------------------------------+------------------+
Nice!
Storing the data with the intention of unindexed access definitely has use cases; however, in applications where a significant part of the access to this data is performed at the DB layer, indexing will be crucial.
Generated columns can be indexed like any other column - in MySQL 5.7, this was the only way to build a functional index.
This is the previous table, with the index added and sample data:
CREATE TEMPORARY TABLE t_indexed_generated_column
(
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
parameters JSON NOT NULL,
parameter_serial CHAR(4) AS (parameters ->> '$.serial'),
KEY (parameter_serial)
)
WITH RECURSIVE counter (n) AS
(
SELECT 0
UNION ALL
SELECT n + 1 FROM counter WHERE n + 1 < 100000
)
SELECT /*+ SET_VAR(cte_max_recursion_depth = 1M) */
CONCAT('{"serial": "', HEX(RANDOM_BYTES(2)), '"}') `parameters`
FROM counter;
ANALYZE TABLE t_indexed_generated_column;
Now we have a means to address the JSON document via an index (of course, limited to the specific field):
EXPLAIN FORMAT=TREE SELECT COUNT(*) FROM t_indexed_generated_column WHERE parameter_serial = 'CAFE';
-- -> Aggregate: count(0)
-- -> Index lookup on t_indexed_generated_column using parameter_serial (parameter_serial='CAFE') (cost=1.10 rows=1)
The functionality above applies also to MySQL versions prior to 8.0, however, the latest version lifted a restriction: the backing generated column is not required anymore. A specific name is also given: “Functional key parts”, because indexes can now be composed of both functions and column references.
Behind the scenes, there’s nothing really new; appropriately, the engineers recycled the existing functionality, so that functional indexes are backed by a hidden generated column.
Let’s create the table without the generated column, and fill it with random strings:
CREATE TEMPORARY TABLE t_functional_index
(
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
parameters JSON NOT NULL,
KEY ( (CAST(parameters ->> '$.serial' AS CHAR(4))) )
);
INSERT INTO t_functional_index (parameters)
WITH RECURSIVE counter (n) AS
(
SELECT 0
UNION ALL
SELECT n + 1 FROM counter WHERE n + 1 < 100000
)
SELECT /*+ SET_VAR(cte_max_recursion_depth = 1M) */
CONCAT('{"serial": "', HEX(RANDOM_BYTES(2)), '"}') `parameters`
FROM counter;
ANALYZE TABLE t_functional_index;
The syntax is conceptually the same as generated columns - the function is wrapped by round brackets (the surrounding spaces are cosmetic).
Note that in this case, we must CAST the extracted value to CHAR, because we “Cannot create a functional index on an expression that returns a BLOB or TEXT”: the return type of the implicit JSON_UNQUOTE function is LONGTEXT.
We’re also hitting a limitation of functional indexes - while with normal indexes we could specify an index prefix (thus, converting the LONGTEXT into a (VAR)CHAR), this is not possible with functional indexes.
Now let’s test the index:
EXPLAIN FORMAT=TREE SELECT COUNT(*) FROM t_functional_index WHERE parameters ->> '$.serial' = 'CAFE';
-- -> Aggregate: count(0)
-- -> Filter: (json_unquote(json_extract(t_functional_index.parameters,'$.serial')) = 'CAFE') (cost=10384.20 rows=100312)
-- -> Table scan on t_functional_index (cost=10384.20 rows=100312)
Nuts! A table scan. What happened?
I’ll summarize here a few gotchas with JSON functional indexes. While the expression exactness is obvious, the other two aren’t [so much 😉].
When using functional indexes, the match condition must be exact in order for the index to be used. This is because MySQL needs to evaluate expressions in a general form, and, although some expressions can certainly be transformed (and some actually are, by the optimizer), a sensible design choice is to shift the burden to the developer in some cases, including this one.
Let’s use a condition with the same function as the index definition:
EXPLAIN FORMAT=TREE SELECT COUNT(*) FROM t_functional_index WHERE CAST(parameters ->> '$.serial' AS CHAR(4)) = 'CAFE';
-- -> Aggregate: count(0)
-- -> Index lookup on t_functional_index using functional_index (cast(json_unquote(json_extract(t_functional_index.parameters,_utf8mb4'$.serial')) as char(4) charset utf8mb4)='CAFE') (cost=1.10 rows=1)
Even a minor change will make the optimizer discard the index:
EXPLAIN FORMAT=TREE SELECT COUNT(*) FROM t_functional_index WHERE CAST(parameters ->> '$.serial' AS CHAR(5)) = 'CAFE';
-- -> Aggregate: count(0)
-- -> Filter: (cast(json_unquote(json_extract(t_functional_index.parameters,'$.serial')) as char(5) charset utf8mb4) = 'CAFE') (cost=10384.20 rows=100312)
-- -> Table scan on t_functional_index (cost=10384.20 rows=100312)
Interestingly, if we use the generated-column-with-index form in place of the functional index, the index will be used:
EXPLAIN FORMAT=TREE SELECT COUNT(*) FROM t_indexed_generated_column WHERE parameters ->> '$.serial' = 'CAFE';
-- -> Aggregate: count(0)
-- -> Index lookup on t_indexed_generated_column using parameter_serial (parameter_serial='CAFE') (cost=1.10 rows=1)
So there is an inconsistency between a functional index and its generated-column-plus-index equivalent.
Let’s review the table definitions:
CREATE TEMPORARY TABLE t_indexed_generated_column
(
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
parameters JSON NOT NULL,
parameter_serial CHAR(4) AS (parameters ->> '$.serial'),
KEY (parameter_serial)
);
CREATE TEMPORARY TABLE t_functional_index
(
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
parameters JSON NOT NULL,
KEY ( (CAST(parameters ->> '$.serial' AS CHAR(4))) )
);
There is no obvious reason for the optimizer not to use the functional index; this would definitely be a worthwhile improvement, and would make functional indexes a more solid choice.
The combination of CAST and JSON_UNQUOTE required in the context of functional indexes/generated columns also has another unintended effect: different results, depending on the collation chosen by the query structure.
Let’s create a table with a generated column and an index:
CREATE TEMPORARY TABLE t_encoding_test
(
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
parameters JSON NOT NULL,
parameters_serial CHAR(4) AS (CAST(parameters ->> '$.serial' AS CHAR(4))),
KEY (parameters_serial)
)
SELECT '{"serial": "CAFE"}' `parameters`;
If a query uses the index indirectly (here we query on parameters, but the optimizer automatically uses the index on parameters_serial), we get a case-insensitive search:
SELECT COUNT(*) FROM t_encoding_test WHERE parameters ->> '$.serial' = 'CAFe';
-- +----------+
-- | COUNT(*) |
-- +----------+
-- | 1 |
-- +----------+
This happens because the CAST function used to build the index is associated with the system collation, which is case insensitive (by default, utf8mb4_0900_ai_ci).
However, if the index is not used:
SELECT COUNT(*) FROM t_encoding_test USE INDEX () WHERE parameters ->> '$.serial' = 'CAFe';
-- +----------+
-- | COUNT(*) |
-- +----------+
-- | 0 |
-- +----------+
the record is not matched! This is because the ->> operator uses JSON_UNQUOTE, whose hardcoded collation is utf8mb4_bin, which is case sensitive.
For more details, see the MySQL manpage or even the worklog.
Let’s take another example, and test the index:
CREATE TEMPORARY TABLE date_functional_index
(
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
created_at DATETIME NOT NULL,
INDEX ( (DATE(created_at)) )
);
INSERT INTO date_functional_index (created_at)
WITH RECURSIVE sequence (n) AS
(
SELECT 0
UNION ALL
SELECT n + 1 FROM sequence WHERE n + 1 < 100000
)
SELECT /*+ SET_VAR(cte_max_recursion_depth = 100K) */
NOW() - INTERVAL (90 * RAND()) DAY `created_at`
FROM sequence;
ANALYZE TABLE date_functional_index;
(There are two issues in relation to this test; the details are given below)
Let’s test the index access:
EXPLAIN FORMAT=TREE SELECT COUNT(*) FROM date_functional_index WHERE DATE(created_at) = CURDATE();
-- -> Aggregate: count(0)
-- -> Index lookup on date_functional_index using functional_index (cast(date_functional_index.created_at as date)=curdate()) (cost=668.80 rows=608)
Works as expected; with this data type, we don’t need to deal with BLOBs and/or collations.
How about joins?
EXPLAIN FORMAT=TREE
WITH RECURSIVE dates_range (d) AS
(
SELECT CURDATE() - INTERVAL 90 DAY
UNION ALL
SELECT d + INTERVAL 1 DAY FROM dates_range WHERE d + INTERVAL 1 day <= CURDATE()
)
SELECT d, COUNT(id)
FROM
dates_range
LEFT JOIN date_functional_index ON d = DATE(created_at)
GROUP BY d;
-- -> Table scan on <temporary>
-- -> Aggregate using temporary table
-- -> Nested loop left join
-- -> Table scan on dates_range
-- -> [...]
-- -> Filter: (dates_range.d = cast(date_functional_index.created_at as date)) (cost=3429.97 rows=100649)
-- -> Table scan on date_functional_index (cost=3429.97 rows=100649)
Ouch! The index is not used; this is definitely something that needs to be considered.
Indexes on generated columns exhibit the same behavior, however, we can perform the join against the generated column, whose index is then used by the optimizer:
CREATE TEMPORARY TABLE date_generated_column_functional_index
(
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
created_at DATETIME NOT NULL,
created_at_date DATE AS (DATE(created_at)),
INDEX (created_at_date)
)
WITH RECURSIVE sequence (n) AS
(
SELECT 0
UNION ALL
SELECT n + 1 FROM sequence WHERE n + 1 < 100000
)
SELECT /*+ SET_VAR(cte_max_recursion_depth = 100K) */
NOW() - INTERVAL (90 * RAND()) DAY `created_at`
FROM sequence;
ANALYZE TABLE date_generated_column_functional_index;
EXPLAIN FORMAT=TREE
WITH RECURSIVE dates_range (d) AS
(
SELECT CURDATE() - INTERVAL 90 DAY
UNION ALL
SELECT d + INTERVAL 1 DAY FROM dates_range WHERE d + INTERVAL 1 day <= CURDATE()
)
SELECT d, COUNT(id)
FROM
dates_range
LEFT JOIN date_generated_column_functional_index ON d = created_at_date
GROUP BY d;
-- -> Table scan on <temporary>
-- -> Aggregate using temporary table
-- -> Nested loop left join
-- -> Table scan on dates_range
-- -> [...]
-- -> Index lookup on date_generated_column_functional_index using created_at_date (created_at_date=dates_range.d) (cost=36.18 rows=1026)
Therefore, it’s not possible to use functional key parts with JOINs at all, while it’s possible with indexed generated columns. This makes functional key parts less appealing, when considering the overall design.
I’ve filed this as a feature request.
CREATE TABLE ... SELECT

In some of the previous queries I’ve used CREATE TABLE + INSERT instead of CREATE TABLE ... SELECT. Why?
Because of a bug:
CREATE TEMPORARY TABLE bug_functional_index (
sold_on DATETIME NOT NULL,
INDEX sold_on_date ((DATE(sold_on)))
)
SELECT NOW() `sold_on`;
-- ERROR 3105 (HY000): The value specified for generated column '3351ae78dcbae4f473d53aebdc350681' in table 'bug_functional_index' is not allowed.
The above should work, considering that the split form works fine:
CREATE TEMPORARY TABLE bug_functional_index (
sold_on DATETIME NOT NULL,
INDEX sold_on_date ((DATE(sold_on)))
);
INSERT INTO bug_functional_index VALUES (NOW());
-- Query OK, 1 row affected (0,00 sec)
I’ve reported this to the MySQL bug tracker.
LOAD DATA INFILE

There is also an additional bug: LOAD DATA INFILE statements will fail if the columns are not explicitly specified:
echo '[]' > /tmp/test_data.csv
mysql <<'SQL'
CREATE SCHEMA IF NOT EXISTS tmp;
CREATE TEMPORARY TABLE tmp.issue_load_data_on_functional_index
(
json_col JSON,
KEY json_col ( (CAST(json_col -> '$' AS UNSIGNED ARRAY)) )
);
LOAD DATA INFILE '/tmp/test_data.csv' INTO TABLE tmp.issue_load_data_on_functional_index;
SQL
# ERROR 1261 (01000) at line 9: Row 1 doesn't contain data for all columns
The workaround is to explicitly specify the columns:
LOAD DATA INFILE '/tmp/test_data.csv' INTO TABLE tmp.issue_load_data_on_functional_index (json_col);
I’ve reported this bug as well.
I’m not bought into functional key parts.
While I find functional indexes an important feature of solid, modern RDBMSs, I think that the functional key parts feature itself needs some time to mature, especially considering that indexed generated columns can do the same work (with some exceptions, e.g. multi-valued indexing).
Now moving on to another new 8.0 interesting feature (window functions!) 😄
As of MySQL 8.0, this functionality is still not supported in a general sense; however, it’s now possible to generate a sequence to be used within a single query.
In this article, I’ll give a brief introduction to CTEs, and explain how to build different sequence generators; additionally, I’ll introduce the new (cool) MySQL 8.0 query hint SET_VAR
, and a pinch of virtual columns and functional indexes (“functional key parts”, another MySQL 8.0 feature).
Roughly, Common Table Expressions (CTEs) can be thought of as ephemeral views or temporary tables.
CTEs bring very significant advantages, one of the most important being recursion, which, barring hacks, wasn’t supported before.
The simplest syntax is:
WITH <cte_name> (<columns>) AS
(
<cte_query>
)
<main_query>
for example¹:
CREATE TABLE line_items(
item_number INT UNSIGNED PRIMARY KEY,
item_total DECIMAL(8,2) NOT NULL,
order_number INT UNSIGNED NOT NULL
);
INSERT INTO line_items VALUES
(1, 10, 1),
(2, 10, 1),
(3, 15, 2)
;
WITH order_totals(order_number, order_total) AS
(
SELECT order_number, SUM(item_total) `order_total`
FROM line_items
GROUP BY order_number
)
SELECT item_number, item_total, order_number, order_total
FROM line_items
JOIN order_totals USING (order_number)
;
-- +-------------+------------+--------------+-------------+
-- | item_number | item_total | order_number | order_total |
-- +-------------+------------+--------------+-------------+
-- | 1 | 10.00 | 1 | 20.00 |
-- | 2 | 10.00 | 1 | 20.00 |
-- | 3 | 15.00 | 2 | 15.00 |
-- +-------------+------------+--------------+-------------+
The syntax is intuitive; in this example, it’s used very much like a temporary table, with the advantage that no cleanup (DROP TEMPORARY TABLE) is needed.
If one has to create a table filled with integers, say, as an example for a blog post 😉, the common approach is to use extended INSERTs (the form that stores multiple rows in one statement).
We can accomplish this more elegantly with a CTE, specifically, with a recursive one.
The syntax of recursive CTEs is:
WITH RECURSIVE <cte_name> (<columns>) AS
(
<base_case_query>
UNION ALL
<recursive_step_query> -- invoke the CTE here!
)
<main_query>
The concept we apply here is to simulate iteration via recursion (more on this later).
Straight to the generator!:
-- Create a table with the integers in the range [0, 10].
--
CREATE TABLE int_sequence
WITH RECURSIVE sequence (n) AS
(
SELECT 0
UNION ALL
SELECT n + 1 FROM sequence WHERE n + 1 <= 10
)
SELECT n
FROM sequence;
The table creation syntax may be slightly odd - one may expect CREATE TABLE to be below the WITH clause - but the working is straightforward.
When the SELECT invokes the CTE, the base case query is executed first (SELECT 0); then the recursive step query is executed repeatedly, each iteration referencing the rows produced by the previous one, until no new rows are generated.
This is, all in all, simple. However, something important to pay attention to is the termination condition: WHERE n + 1 <= 10. Why not use WHERE n <= ...?
Because this is a part where it’s easy to make a fencepost error. Let’s see the wrong case:
-- Attempt to select the integers in the range [0, 10], the wrong way.
--
WITH RECURSIVE sequence (n) AS
(
SELECT 0
UNION ALL
SELECT n + 1 FROM sequence WHERE n <= 10
)
SELECT n
FROM sequence;
What happens here is that one confuses the returned row with the last verified condition. On the last two steps:

- n = 10: the condition (10 <= 10) holds, so SELECT n + 1 is executed, returning 11;
- n = 11: the condition fails and the recursion stops, but 11 has already been added to the result.

Now, two alternatives are the conditions WHERE n <= 9 or WHERE n < 10; while they are correct, they may be less intuitive than WHERE n + 1 <= 10, which mimics the SELECTed expression.
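A quick way to verify the off-by-one (a standalone check, not from the original article): the “wrong” version returns 12 rows, 0 through 11.

WITH RECURSIVE sequence (n) AS
(
  SELECT 0
  UNION ALL
  SELECT n + 1 FROM sequence WHERE n <= 10
)
SELECT COUNT(*) `rows`, MAX(n) `max_n` FROM sequence;
-- +------+-------+
-- | rows | max_n |
-- +------+-------+
-- |   12 |    11 |
-- +------+-------+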
I’ll conclude with two final notes.
First, we’re using recursion as a way of performing iteration; this is subject to the same criticism of teaching recursion via Fibonacci series: it can arguably be considered as an overengineered/underperforming solution to a problem.
I don’t take any position in this case; however, my personal order of increasing elegance for filling a table with a series of numbers is:

1. extended INSERTs,
2. recursive CTEs,
3. a native sequence generator.

Since MySQL doesn’t provide 3., I’m happy to use 2. 😬.
The second note is more interesting, and I’ll highlight it with a dedicated section.
MySQL limits the number of recursions to 1000 by default, via the cte_max_recursion_depth sysvar.
Now, if we want to generate a long sequence, we should raise cte_max_recursion_depth, run the query, then restore the variable to its previous value. This procedure consists of three statements, which is of course inconvenient. What do we do?
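Spelled out, the inconvenient version would look something like this (a sketch; the depth value is arbitrary):

SET SESSION cte_max_recursion_depth = 1000000;
-- ... run the recursive CTE query here ...
SET SESSION cte_max_recursion_depth = DEFAULT;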
Enter the per-statement variables setting.
This is a lesser-known new MySQL 8.0 feature, which comes in very handy where needed.
In short, SET_VAR is a query hint that allows one or more variables to be set exclusively within the scope of a statement.
In this case, if we want to generate a sequence of 1M numbers, we set cte_max_recursion_depth:
-- Select the integers in the range [0, 1000000].
--
WITH RECURSIVE sequence (n) AS
(
SELECT 0
UNION ALL
SELECT n + 1 FROM sequence WHERE n + 1 <= 1000000
)
SELECT /*+ SET_VAR(cte_max_recursion_depth = 1M) */
n
FROM sequence;
(I’ve actually opened a bug report suggesting to mention this hint in the CTE manpage.)
If we want to create random numbers, we use RAND()² and SELECT only the associated expression:
-- Create a table with 1000 random integers in the range [0, 65536).
--
CREATE TABLE random_int_sequence
WITH RECURSIVE sequence (n) AS
(
SELECT 0
UNION ALL
SELECT n + 1 FROM sequence WHERE n + 1 < 1000
)
SELECT FLOOR(65536 * RAND()) `rand_n`
FROM sequence;
Nothing prohibits us from generating a sequence of characters; in this case, we’ll use the CHAR() and ORD() functions to increment the current value:
CREATE TABLE random_char_sequence
WITH RECURSIVE sequence (c) AS
(
SELECT 'A'
UNION ALL
SELECT CHAR(ORD(c) + 1 USING ASCII) FROM sequence WHERE CHAR(ORD(c) + 1 USING ASCII) <= 'Z'
)
SELECT c
FROM sequence;
Finally, we’ll generate a range of dates.
In this section, it’s worth mentioning an interesting usage. Suppose one is reporting monthly sales. Is this query correct?:
-- Underlying table structure.
--
-- CREATE TABLE line_items(
-- id INT UNSIGNED PRIMARY KEY,
-- total DECIMAL(8,2) NOT NULL,
-- sold_on DATETIME NOT NULL
-- );
SELECT YEAR(sold_on) `sale_year`, MONTH(sold_on) `sale_month`, SUM(total) `month_sales`
FROM line_items
GROUP BY sale_year, sale_month;
The answer is: it depends on the requirements.
If the requirement is that all the months must be displayed, one may miss rows for months when there are no sales.
A solution is to use a sequence with all the months in the required interval, and (left) join the CTE with the table.
Let’s prepare some data (via CTE, of course! 😉), for a few months (except the current):
CREATE TABLE line_items(
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
total DECIMAL(8,2) NOT NULL,
sold_on DATETIME NOT NULL,
sold_on_date DATE AS (DATE(sold_on)),
KEY (sold_on_date)
)
WITH RECURSIVE sequence (n) AS
(
SELECT 0
UNION ALL
SELECT n + 1 FROM sequence WHERE n + 1 < 100000
)
SELECT /*+ SET_VAR(cte_max_recursion_depth = 1M) */
CAST(20 * RAND() AS DECIMAL) `total`,
NOW() - INTERVAL DAYOFMONTH(CURDATE()) DAY - INTERVAL (100 * RAND()) DAY `sold_on`
FROM sequence;
There are a couple of interesting concepts here.
The first is that by using NOW() - INTERVAL DAYOFMONTH(CURDATE()) DAY as a base, we ensure that we don’t store any sales for the current month.
The second is that, in order to perform an efficient left join, a functional index is required; there are a few considerations about this subject, which I’ll leave to a separate article.
Additionally, note that floating-point INTERVALs are rounded (but it’s irrelevant in this context).
Now we can query!
WITH RECURSIVE dates_range (d) AS
(
SELECT CURDATE() - INTERVAL 124 DAY
UNION ALL
SELECT d + INTERVAL 1 DAY FROM dates_range WHERE d + INTERVAL 1 day <= CURDATE()
)
SELECT YEAR(d) `sales_year`, MONTH(d) `sales_month`, SUM(total) `month_total_sales`
FROM
dates_range
LEFT JOIN line_items ON d = sold_on_date
GROUP BY sales_year, sales_month
ORDER BY sales_year, sales_month;
-- +------------+-------------+-------------------+
-- | sales_year | sales_month | month_total_sales |
-- +------------+-------------+-------------------+
-- | 2019 | 11 | 27895.00 |
-- | 2019 | 12 | 331700.00 |
-- | 2020 | 1 | 335775.00 |
-- | 2020 | 2 | 306289.00 |
-- | 2020 | 3 | NULL |
-- +------------+-------------+-------------------+
Excellent. The current month is displayed, as intended, even if it has no sales.
Let’s check the optimizer plan (note that I’ve removed the ORDER BY clause for simplicity):
EXPLAIN FORMAT=TREE
WITH RECURSIVE dates_range (d) AS
(
SELECT CURDATE() - INTERVAL 124 DAY
UNION ALL
SELECT d + INTERVAL 1 DAY FROM dates_range WHERE d + INTERVAL 1 day <= CURDATE()
)
SELECT YEAR(d) `sales_year`, MONTH(d) `sales_month`, SUM(total) `month_total_sales`
FROM
dates_range
LEFT JOIN line_items ON d = sold_on_date
GROUP BY sales_year, sales_month\G
-- *************************** 1. row ***************************
-- EXPLAIN: -> Table scan on <temporary>
-- -> Aggregate using temporary table
-- -> Nested loop left join
-- -> Table scan on dates_range
-- -> Materialize recursive CTE dates_range
-- -> Rows fetched before execution
-- -> Repeat until convergence
-- -> Filter: ((dates_range.d + interval 1 day) <= <cache>(curdate())) (cost=2.73 rows=2)
-- -> Scan new records on dates_range (cost=2.73 rows=2)
-- -> Index lookup on line_items using sold_on_date (sold_on_date=dates_range.d) (cost=0.28 rows=1)
The plan has a few interesting points, but they are left to the reader, since they are out of the scope of this article.
MySQL 8.0 brought many very interesting features. Although sequences/generators are still not fully supported, we can use the (very flexible) CTEs to cover a part of the use cases.
Happy querying with MySQL 8.0!
¹: Please note that real-world schemas are generally designed differently, and this example has been written with simplicity in mind instead.
²: Remember that RAND() is not a cryptographically secure function.
I’ve already published two posts on two specific issues; in this article, I’ll give the complete picture.
As usual, I’ll use this post to introduce tooling concepts that may be useful in generic system administration.
The presentation code is hosted on a GitHub repository (including the source files and the output slides in PDF format), and on Slideshare.
The following are the basic issues to handle when migrating - most notably, the move to the utf8mb4 charset and the utf8mb4_0900_ai_ci collation.
Of course, the larger the scale, the more aspects will need to be considered; for example, large-scale write-bound systems may need to handle further concerns, beyond those covered here.
In this article, I’ll only deal with what can be reasonably considered the lowest common denominator of all the migrations.
All the SQL examples are executed on MySQL 8.0.
utf8mb4/utf8mb4_0900_ai_ci

(Reference: converting from the utf8 to the utf8mb4 charset.)

MySQL introduces a new collation - utf8mb4_0900_ai_ci. Why?
Basically, it’s an improved version of the general_ci version - it supports Unicode 9.0, it irons out a few issues, and it’s faster.
The collation utf8(mb4)_general_ci wasn’t entirely correct; a typical example is Å:
-- Å = U+212B
SELECT "sÅverio" = "saverio" COLLATE utf8mb4_general_ci;
-- +--------+
-- | result |
-- +--------+
-- | 0 |
-- +--------+
SELECT "sÅverio" = "saverio"; -- Default (COLLATE utf8mb4_0900_ai_ci);
-- +--------+
-- | result |
-- +--------+
-- | 1 |
-- +--------+
From this, you can also guess what ai_ci means: accent insensitive / case insensitive.
So, what’s the problem?
Legacy.
Technically, utf8mb4 has been available in MySQL for a long time. At least a part of the industry started the migration long ago, and publicly documented the process.
However, at that time, only utf8mb4_general_ci was available. Therefore, a vast amount of documentation around suggests moving to that collation.
While this is not an issue per se, it is a big issue when considering that the two collations are incompatible.
For people who like (and frequently use) them, regular expressions are a fundamental tool.
In particular when performing administration tasks (using them in an application for data matching is a different topic), they can streamline some queries, avoiding lengthy concatenations of conditions.
In particular, I find them practical as a sophisticated SHOW <object> supplement.
SHOW <object>, in MySQL, supports LIKE; however, it’s fairly limited in functionality, for example:
SHOW GLOBAL VARIABLES LIKE 'character_set%'
-- +--------------------------+-------------------------------------------------------------------------+
-- | Variable_name | Value |
-- +--------------------------+-------------------------------------------------------------------------+
-- | character_set_client | utf8mb4 |
-- | character_set_connection | utf8mb4 |
-- | character_set_database | utf8mb4 |
-- | character_set_filesystem | binary |
-- | character_set_results | utf8mb4 |
-- | character_set_server | utf8mb4 |
-- | character_set_system | utf8 |
-- | character_sets_dir | /home/saverio/local/mysql-8.0.19-linux-glibc2.12-x86_64/share/charsets/ |
-- +--------------------------+-------------------------------------------------------------------------+
Let’s turbocharge it!
Let’s get all the meaningful charset-related variables, but not one more, in a single swoop:
SHOW GLOBAL VARIABLES WHERE Variable_name RLIKE '^(character_set|collation)_' AND Variable_name NOT RLIKE 'system|data';
-- +--------------------------+--------------------+
-- | Variable_name | Value |
-- +--------------------------+--------------------+
-- | character_set_client | utf8mb4 |
-- | character_set_connection | utf8mb4 |
-- | character_set_results | utf8mb4 |
-- | character_set_server | utf8mb4 |
-- | collation_connection | utf8mb4_general_ci |
-- | collation_server | utf8mb4_general_ci |
-- +--------------------------+--------------------+
Nice. The first regex reads: “string starting with (^) either character_set or collation, followed by _”. Note that if we don’t group character_set and collation (via (…)), the ^ metacharacter applies only to the first.
Character set and collation are a very big deal, because changing them in this case requires literally (in a literal sense 😉) rebuilding the entire database - all the records (and related indexes) including strings will need to be rebuilt.
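For reference, the rebuild boils down to a statement like the following, per table (the table name is hypothetical; on large production tables this is typically run through an online schema change tool):

ALTER TABLE mytable CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;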
In order to understand the concepts, let’s have a look at the MySQL server settings again; I’ll reorder and explain them.
Literals sent by the client are assumed to be in the following charset:

- character_set_client (default: utf8mb4)

after which they’re converted and processed by the server, using:

- character_set_connection (default: utf8mb4)
- collation_connection (default: utf8mb4_0900_ai_ci)

The above settings are crucial, as literals are a foundation for exchanging data with the server. For example, when an ORM inserts data in a database, it creates an INSERT with a set of literals.
When the database system sends the results, it sends them in the following charset:

- character_set_results (default: utf8mb4)

Literals are not the only foundation. Database objects are the other side of the coin. Base defaults for database objects (e.g. the databases) use:

- character_set_server (default: utf8mb4)
- collation_server (default: utf8mb4_0900_ai_ci)

Some developers would define a string as a stream of bytes; this is not entirely correct.
To be exact, a string is a stream of bytes associated with a character set.
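A tiny standalone illustration of this (the character is arbitrary): the same character corresponds to different byte streams depending on the charset.

SELECT HEX(CONVERT('ä' USING utf8mb4)) `utf8mb4_bytes`, HEX(CONVERT('ä' USING latin1)) `latin1_bytes`;
-- +---------------+--------------+
-- | utf8mb4_bytes | latin1_bytes |
-- +---------------+--------------+
-- | C3A4          | E4           |
-- +---------------+--------------+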
Now, this concept applies to strings in isolation. How about operations on sets of strings, e.g. comparisons?
In a similar way, we need another concept: the “collation”.
A collation is a set of rules that defines how strings are sorted, which is required to perform comparisons.
In a database system, a collation is associated to objects and literals, both through system and specific defaults: a column, for example, will have its own collation, while a literal will use the default, if not specified.
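As a minimal illustration (my own example, not from the original), the same literals compare differently under the default accent/case-insensitive collation and an explicit binary one:
SELECT 'abc' = 'ABC' `default_ci`, 'abc' = 'ABC' COLLATE utf8mb4_bin `explicit_bin`;
-- default_ci:   1 (the default utf8mb4_0900_ai_ci is case-insensitive)
-- explicit_bin: 0 (the binary collation compares code points, so case matters)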
But when comparing two strings with different collations, how is it decided which collation to use?
Enter the “Collation coercibility”.
general <> 0900_ai
Reference: Collation Coercibility in Expressions
Coercibility is a property associated to each value in an expression, which defines the priority of its collation in the context of a comparison.
MySQL has seven coercibility values:
0: An explicit COLLATE clause (not coercible at all)
1: The concatenation of two strings with different collations
2: The collation of a column or a stored routine parameter or local variable
3: A "system constant" (the string returned by functions such as USER() or VERSION())
4: The collation of a literal
5: The collation of a numeric or temporal value
6: NULL or an expression that is derived from NULL
It's not necessary to know them by heart, since their ordering makes sense, but it's important to know how the main ones work in the context of a migration.
What we want to know is what happens in the workflow of a migration, in particular, if we:
Let’s create a table with all the related collations:
CREATE TABLE chartest (
c3_gen CHAR(1) CHARACTER SET utf8mb3 COLLATE utf8mb3_general_ci,
c4_gen CHAR(1) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
c4_900 CHAR(1) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci
);
INSERT INTO chartest VALUES('ä', 'ä', 'ä');
Note how we insert characters in the Basic Multilingual Plane (BMP, essentially, the one supported by utf8mb3) - we're simulating a database where we only changed the defaults, not the data.
Let's compare with a BMP utf8mb4 literal:
SELECT c3_gen = 'ä' `result` FROM chartest;
-- +--------+
-- | result |
-- +--------+
-- | 1 |
-- +--------+
Nice; it works. Coercion values: the column has coercibility 2, the literal 4, so the column's collation wins, and the literal is converted (successfully, since it's a BMP character).
More critical: we compare against a character in the Supplementary Multilingual Plane (SMP, essentially, one added by utf8mb4), with an explicit collation:
SELECT c3_gen = '🍕' COLLATE utf8mb4_0900_ai_ci `result` FROM chartest;
-- +--------+
-- | result |
-- +--------+
-- | 0 |
-- +--------+
Coercion values: the explicit COLLATE clause has coercibility 0, while the column has 2; therefore, MySQL converts the first value and uses the explicit collation.
Most critical: compare against a character in the SMP, without implicit collation:
SELECT c3_gen = '🍕' `result` FROM chartest;
ERROR 1267 (HY000): Illegal mix of collations (utf8_general_ci,IMPLICIT) and (utf8mb4_general_ci,COERCIBLE) for operation '='
WAT!!
Weird?
Well, this is because:
MySQL tries to coerce the charset/collation to the column’s one, and fails!
This gives a clear indication to the migration: do not allow SMP characters in the system, until the entire dataset has been migrated.
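A quick way to verify the constraint is to look for 4-byte sequences directly. This is a sketch of mine (the table/column names are hypothetical), based on the fact that, in valid UTF-8, only 4-byte sequences start with a 0xF0-0xF4 byte:
SELECT id
FROM comments
WHERE HEX(description) RLIKE '^(..)*F[0-4]';
-- Any row returned contains at least one SMP character, which needs to be dealt with
-- before the migration completes.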
Now, let’s see what happens between columns!
SELECT COUNT(*) FROM chartest a JOIN chartest b ON a.c3_gen = b.c4_gen;
-- +----------+
-- | COUNT(*) |
-- +----------+
-- | 1 |
-- +----------+
SELECT COUNT(*) FROM chartest a JOIN chartest b ON a.c3_gen = b.c4_900;
-- +----------+
-- | COUNT(*) |
-- +----------+
-- | 1 |
-- +----------+
SELECT COUNT(*) FROM chartest a JOIN chartest b ON a.c4_gen = b.c4_900;
ERROR 1267 (HY000): Illegal mix of collations (utf8mb4_general_ci,IMPLICIT) and (utf8mb4_0900_ai_ci,IMPLICIT) for operation '='
Ouch. BIG OUCH!
Why?
This is what happens to people who migrated, referring to obsolete documentation, to utf8mb4_general_ci
- they can’t easily migrate to the new collation.
The migration path outlined:
is viable for production systems.
There’s another unexpected property of the new collation.
Let’s simulate MySQL 5.7:
-- Not exact, but close enough
--
SELECT '' = _utf8' ' COLLATE utf8_general_ci;
-- +---------------------------------------+
-- | '' = _utf8' ' COLLATE utf8_general_ci |
-- +---------------------------------------+
-- | 1 |
-- +---------------------------------------+
How does this work on MySQL 8.0?:
-- Current (8.0):
--
SELECT '' = ' ';
-- +----------+
-- | '' = ' ' |
-- +----------+
-- | 0 |
-- +----------+
Ouch!
Where does this behavior come from? Let’s get some more info from the collations (with a regular expression, of course 😉):
SHOW COLLATION WHERE Collation RLIKE 'utf8mb4_general_ci|utf8mb4_0900_ai_ci';
-- +--------------------+---------+-----+---------+----------+---------+---------------+
-- | Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute |
-- +--------------------+---------+-----+---------+----------+---------+---------------+
-- | utf8mb4_0900_ai_ci | utf8mb4 | 255 | Yes | Yes | 0 | NO PAD |
-- | utf8mb4_general_ci | utf8mb4 | 45 | | Yes | 1 | PAD SPACE |
-- +--------------------+---------+-----+---------+----------+---------+---------------+
Hmmmm 🤔. Let’s have a look at the formal rules from the SQL (2003) standard (section 8.2):
3) The comparison of two character strings is determined as follows:
a) Let CS be the collation […]
b) If the length in characters of X is not equal to the length in characters of Y, then the shorter string is effectively replaced, for the purposes of comparison, with a copy of itself that has been extended to the length of the longer string by concatenation on the right of one or more pad characters, where the pad character is chosen based on CS. If CS has the NO PAD characteristic, then the pad character is an implementation-dependent character different from any character in the character set of X and Y that collates less than any string under CS. Otherwise, the pad character is a space.
In other words: the new collation does not pad.
This is not a big deal. Just trim the data before migrating, and make 100% sure that new trailing spaces are not introduced by the application before the migration is completed.
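A minimal sketch of the cleanup (my own, with hypothetical table/column names); the search intentionally uses LIKE, since, unlike =, it doesn't pad, so it can find the offending rows while the old PAD SPACE collation is still in place:
UPDATE comments
SET description = TRIM(TRAILING ' ' FROM description)
WHERE description LIKE '% ';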
Triggers are fairly easy to handle, as they can be dropped/rebuilt with the new settings - just make sure to consider comparisons inside the trigger body.
Sample of a trigger (edited):
SHOW CREATE TRIGGER enqueue_comments_update_instance_event\G
-- SQL Original Statement:
CREATE TRIGGER `enqueue_comments_update_instance_event`
AFTER UPDATE ON `comments`
FOR EACH ROW
trigger_body: BEGIN
SET @changed_fields := NULL;
IF NOT (OLD.description <=> NEW.description COLLATE utf8_bin AND CHAR_LENGTH(OLD.description) <=> CHAR_LENGTH(NEW.description)) THEN
SET @changed_fields := CONCAT_WS(',', @changed_fields, 'description');
END IF;
IF @changed_fields IS NOT NULL THEN
SET @old_values := NULL;
SET @new_values := NULL;
INSERT INTO instance_events(created_at, instance_type, instance_id, operation, changed_fields, old_values, new_values)
VALUES(NOW(), 'Comment', NEW.id, 'UPDATE', @changed_fields, @old_values, @new_values);
END IF;
END
-- character_set_client: utf8mb4
-- collation_connection: utf8mb4_0900_ai_ci
-- Database Collation: utf8mb4_0900_ai_ci
As you see, a trigger has associated charset/collation settings. This is because, unlike a statement, it's not sent by a client, so it needs to carry its own settings.
In the trigger above, dropping/recreating in the context of a system with the new default works, however, it’s not enough - there’s a comparison in the body!
Conclusion: don’t forget to look inside the triggers. Or better, make sure you have a solid test suite 😉.
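As a starting point for the review, one can grep the trigger bodies via the information schema; this is just a sketch (the regex only catches the collations mentioned in this article):
SELECT TRIGGER_SCHEMA, TRIGGER_NAME
FROM information_schema.TRIGGERS
WHERE ACTION_STATEMENT RLIKE 'utf8(mb3)?_(bin|general_ci)';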
We’ve been long time users of MySQL triggers. They make a wonderful callback system.
When a system grows, it’s increasingly hard (tipping into the unmaintainable) to maintain application-level callbacks. Triggers will never miss any database update, and with a logic like the above, a queue processor can process the database changes.
Now that we’ve examined the compatibility, let’s examine the performance aspect.
Indexes are still usable cross-charset, due to automatic conversion performed by MySQL. The point to be aware of is that the values are converted after being read from the index.
Let’s create test tables:
CREATE TABLE indextest3 (
c3 CHAR(1) CHARACTER SET utf8,
KEY (c3)
);
INSERT INTO indextest3 VALUES ('a'), ('b'), ('c'), ('d'), ('e'), ('f'), ('g'), ('h'), ('i'), ('j'), ('k'), ('l'), ('m');
CREATE TABLE indextest4 (
c4 CHAR(1) CHARACTER SET utf8mb4,
KEY (c4)
);
INSERT INTO indextest4 SELECT * FROM indextest3;
Querying against a constant yields interesting results:
EXPLAIN FORMAT=TREE SELECT COUNT(*) FROM indextest4 WHERE c4 = _utf8'n'\G
-- -> Aggregate: count(0)
-- -> Filter: (indextest4.c4 = 'n') (cost=0.35 rows=1)
-- -> Index lookup on indextest4 using c4 (c4='n') (cost=0.35 rows=1)
MySQL recognizes that n
is a valid utf8mb4 character, and matches it directly.
Against a column with index:
EXPLAIN SELECT COUNT(*) FROM indextest3 JOIN indextest4 ON c3 = c4;
-- +----+-------------+------------+------------+-------+---------------+------+---------+------+------+----------+--------------------------+
-- | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
-- +----+-------------+------------+------------+-------+---------------+------+---------+------+------+----------+--------------------------+
-- | 1 | SIMPLE | indextest3 | NULL | index | NULL | c3 | 4 | NULL | 13 | 100.00 | Using index |
-- | 1 | SIMPLE | indextest4 | NULL | ref | c4 | c4 | 5 | func | 1 | 100.00 | Using where; Using index |
-- +----+-------------+------------+------------+-------+---------------+------+---------+------+------+----------+--------------------------+
EXPLAIN FORMAT=TREE SELECT COUNT(*) FROM indextest3 JOIN indextest4 ON c3 = c4\G
-- -> Aggregate: count(0)
-- -> Nested loop inner join (cost=6.10 rows=13)
-- -> Index scan on indextest3 using c3 (cost=1.55 rows=13)
-- -> Filter: (convert(indextest3.c3 using utf8mb4) = indextest4.c4) (cost=0.26 rows=1)
-- -> Index lookup on indextest4 using c4 (c4=convert(indextest3.c3 using utf8mb4)) (cost=0.26 rows=1)
MySQL is using the index, so all good. However, what's the func?
It simply tells us that the value used against the index is the result of a function. In this case, MySQL is converting the charset for us (convert(indextest3.c3 using utf8mb4)).
This is another crucial consideration for a migration - indexes will still be effective. Of course, (very) complex queries will need to be carefully examined, but there are the grounds for a smooth transition.
Reference: The CHAR and VARCHAR Types
One concept to be aware of, although unlikely to affect real-world applications, is that utf8mb4 characters can take up to 33% more space.
In storage terms, databases need to know what’s the maximum limit of the data they handle. This means that even if a string will take the same space both in utf8mb3
and utf8mb4
, MySQL needs to know what’s the maximum space it can take.
The InnoDB index limit is 3072 bytes in MySQL 8.0; generally speaking, this is large enough not to care.
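A quick, made-up demonstration of the principle: a column width that is indexable with the 3-byte charset exceeds the limit with utf8mb4, because MySQL must reserve 4 bytes per character (800 * 4 = 3200 > 3072):
CREATE TABLE keylimit3 (c VARCHAR(800) CHARACTER SET utf8, KEY (c));
-- OK: 800 * 3 = 2400 bytes
CREATE TABLE keylimit4 (c VARCHAR(800) CHARACTER SET utf8mb4, KEY (c));
-- ERROR 1071 (42000): Specified key was too long; max key length is 3072 bytes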
Remember!:
- [VAR]CHAR(n) refers to the number of characters; therefore, the maximum requirement is 4 * n bytes, but
- TEXT fields refer to the number of bytes.

Reference: The INFORMATION_SCHEMA STATISTICS Table
Up to MySQL 5.7, information_schema statistics are updated in real time. In MySQL 8.0, statistics are cached, and updated only every 24 hours (by default).
In web applications, this affects only very specific use cases, but it’s important to know if one’s application is subject to this new behavior (our application was).
Let’s see the effects of this:
CREATE TABLE ainc (id INT AUTO_INCREMENT PRIMARY KEY);
-- On the first query, the statistics are generated.
--
SELECT TABLE_NAME, AUTO_INCREMENT FROM information_schema.tables WHERE table_name = 'ainc';
-- +------------+----------------+
-- | TABLE_NAME | AUTO_INCREMENT |
-- +------------+----------------+
-- | ainc | NULL |
-- +------------+----------------+
INSERT INTO ainc VALUES ();
SELECT TABLE_NAME, AUTO_INCREMENT FROM information_schema.tables WHERE table_name = 'ainc';
-- +------------+----------------+
-- | TABLE_NAME | AUTO_INCREMENT |
-- +------------+----------------+
-- | ainc | NULL |
-- +------------+----------------+
Ouch! The cached values are returned.
How about SHOW CREATE TABLE
?
SHOW CREATE TABLE ainc\G
-- CREATE TABLE `ainc` (
-- `id` int NOT NULL AUTO_INCREMENT,
-- PRIMARY KEY (`id`)
-- ) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
This command is always up to date.
How to update the statistics? By using ANALYZE TABLE
:
ANALYZE TABLE ainc;
SELECT TABLE_NAME, AUTO_INCREMENT FROM information_schema.tables WHERE table_name = 'ainc';
-- +------------+----------------+
-- | TABLE_NAME | AUTO_INCREMENT |
-- +------------+----------------+
-- | ainc | 2 |
-- +------------+----------------+
There you go. Let’s find out the related setting:
SHOW GLOBAL VARIABLES LIKE '%stat%exp%';
-- +---------------------------------+-------+
-- | Variable_name | Value |
-- +---------------------------------+-------+
-- | information_schema_stats_expiry | 86400 |
-- +---------------------------------+-------+
Developers who absolutely need to revert to the pre-8.0 behavior can set this value to 0.
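For example (the session-scoped variant is handy for one-off administration sessions):
SET SESSION information_schema_stats_expiry = 0; -- current session only
SET PERSIST information_schema_stats_expiry = 0; -- global, and persisted across restarts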
Up to MySQL 5.7, GROUP BY's result was implicitly sorted.
This was unnecessary - optimization-seeking developers used ORDER BY NULL in order to spare the sort - however, accidentally or not, some relied on it.
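The fix itself is trivial - where the implicit ordering was relied upon, it just needs to be made explicit (sketch with made-up names):
-- 5.7: implicitly ordered by `col1`; 8.0: no ordering guarantee
SELECT col1, COUNT(*) FROM mytable GROUP BY col1;
-- 8.0: request the ordering explicitly
SELECT col1, COUNT(*) FROM mytable GROUP BY col1 ORDER BY col1;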
Those who relied on it are unfortunately required to scan the codebase. There isn't a one-size-fits-all solution, and in this case, writing an automated fix may not be worth it compared to manually inspecting the occurrences; however, this doesn't prevent the Unix tools from helping 😄
Let’s simulate a coding standard where ORDER BY
is always on the line after GROUP BY
, if present:
cat > /tmp/test_groupby_1 << SQL
GROUP BY col1
-- ends here
GROUP BY col2
ORDER BY col2
GROUP BY col3
-- ends here
GROUP BY col4
SQL
cat > /tmp/test_groupby_2 << SQL
GROUP BY col5
ORDER BY col5
SQL
A basic version would be a simple grep scan, printing 1 line After each GROUP BY match:
$ grep -A 1 'GROUP BY' /tmp/test_groupby_*
/tmp/test_groupby_1: GROUP BY col1
/tmp/test_groupby_1- -- ends here
--
/tmp/test_groupby_1: GROUP BY col2
/tmp/test_groupby_1- ORDER BY col2
--
/tmp/test_groupby_1: GROUP BY col3
/tmp/test_groupby_1- -- ends here
--
/tmp/test_groupby_1: GROUP BY col4
--
/tmp/test_groupby_2: GROUP BY col5
/tmp/test_groupby_2- ORDER BY col5
However, with some basic scripting, we can display only the GROUP BYs matching the criteria:
# First, we make Perl speak english: `-MEnglish`, which enables `$ARG` (among the other things).
#
# The logic is simple: we print the current line if the previous line matched /GROUP BY/, and the
# current doesn't match /ORDER BY/; after, we store the current line as `$previous`.
#
perl -MEnglish -ne 'print "$ARGV: $previous $ARG" if $previous =~ /GROUP BY/ && !/ORDER BY/; $previous = $ARG' /tmp/test_groupby_*
# As next step, we automatically open all the files matching the criteria, in an editor:
#
# - `-l`: adds the newline automatically;
# - `$ARGV`: is the filename (which we print instead of the match);
# - `uniq`: if a file has more matches, the filename will be printed more than once - with
#   `uniq`, we remove the (adjacent) duplicates; this is optional though, as editors open each
#   file(name) only once;
# - `xargs`: send the filenames as parameters to the command (in this case, `code`, from Visual Studio
# Code).
#
perl -MEnglish -lne 'print $ARGV if $previous =~ /GROUP BY/ && !/ORDER BY/; $previous = $ARG' /tmp/test_groupby_* | uniq | xargs code
There is another approach: an inverted regular expression match:
# Match lines with `GROUP BY`, followed by a line _not_ matching `ORDER BY`.
# Reference: https://stackoverflow.com/a/406408.
#
grep -zP 'GROUP BY .+\n((?!ORDER BY ).)*\n' /tmp/test_groupby_*
This is, however, freaky, and, as regular expressions in general, has a high risk of hair-pulling (of course, this is up to the developer's judgement). It will be the subject of a future article, though, because I find it a very interesting case.
This is an easily missed problem! Some tools may not support MySQL 8.0.
There’s a known showstopper bug on the latest Gh-ost release, which prevents operations from succeeding on MySQL 8.0.
As a workaround, one can use trigger-based tools, like pt-online-schema-change v3.1.1 or v3.0.x (but v3.1.0 is broken!), or Facebook's OnlineSchemaChange.
When MySQL is installed via Homebrew (as of January 2020), the default collation is utf8mb4_general_ci
.
There are a couple of solutions to this problem.
A simple thing to do is to correct the Homebrew formula, and recompile the binaries.
For illustrative purposes, as part of this solution, I use the so-called “flip-flop” operator, which is something frowned upon… by people not using it 😉. As one can observe in fact, for the target use cases, it’s very convenient.
# Find out the formula location
#
$ mysql_formula_filename=$(brew formula mysql)
# Out of curiosity, let's print the relevant section.
#
# Flip-flop operator (`<condition> .. <condition>`): it matches *everything* between lines matching two conditions, in this case:
#
# - start: a line matching `/args = /`;
# - end: a line matching `/\]/` (a closing square bracket, which needs to be escaped, since it's a regex metacharacter).
#
$ perl -ne 'print if /args = / .. /\]/' "$mysql_formula_filename"
args = %W[
-DFORCE_INSOURCE_BUILD=1
-DCOMPILATION_COMMENT=Homebrew
-DDEFAULT_CHARSET=utf8mb4
-DDEFAULT_COLLATION=utf8mb4_general_ci
-DINSTALL_DOCDIR=share/doc/#{name}
-DINSTALL_INCLUDEDIR=include/mysql
-DINSTALL_INFODIR=share/info
-DINSTALL_MANDIR=share/man
-DINSTALL_MYSQLSHAREDIR=share/mysql
-DINSTALL_PLUGINDIR=lib/plugin
-DMYSQL_DATADIR=#{datadir}
-DSYSCONFDIR=#{etc}
-DWITH_BOOST=boost
-DWITH_EDITLINE=system
-DWITH_SSL=yes
-DWITH_PROTOBUF=system
-DWITH_UNIT_TESTS=OFF
-DENABLED_LOCAL_INFILE=1
-DWITH_INNODB_MEMCACHED=ON
]
# Fix it!
#
$ perl -i.bak -ne 'print unless /CHARSET|COLLATION/' "$mysql_formula_filename"
# Now recompile and install the formula
#
$ brew install --build-from-source mysql
An alternative solution is for the server to ignore the client encoding on handshake.
When configured this way, the server will impose the default character set/collation on the clients.
In order to apply this solution, add character-set-client-handshake = OFF
to the server configuration.
A very good practice when performing (major/minor) upgrades is to compare the system variables, in order to spot differences that may have an impact.
The MySQL Parameters website gives a visual overview of the differences between versions.
For example, the URL https://mysql-params.tmtms.net/mysqld/?vers=5.7.29,8.0.19&diff=true shows the differences between the system variables of v5.7.29 and v8.0.19.
The migration to MySQL 8.0 at Ticketsolve has been one of the smoothest, historically speaking.
This is a bit of a paradox, because we never had to rewrite our entire database for an upgrade, however, with sufficient knowledge of what to expect, we didn’t hit any significant bump (in particular, nothing unexpected in the optimizer department, which is usually critical).
Considering the main issues and their migration requirements:
the conclusion is that the preparation work can be entirely done before the upgrade, and the upgrade itself can subsequently be performed with a reasonable expectation of low risk.
Happy migration 😄
]]>Trailing spaces are a surprising (not in a good way), but also widely covered, topic. This article gives a short overview, and relates it to how trailing spaces affect people upgrading to MySQL 8.0.
Contents:
In this article I’m going to analyze only the VARCHAR
data type behavior, as I’d like to keep the article concise. Interested readers can find information in the links provided.
As of MySQL 8.0, utf8
is an alias to utf8mb3
(MySQL 5.7’s underlying standard); using utf8
/utf8mb3
will generate warnings when running some statements on an 8.0 server, which can be ignored in the context of this article.
The reader needs to have an idea of what a collation is (in short: a set of rules for comparing strings).
The MySQL version used, and required to run the article content, is 8.0.
The comparison (=) predicate (1)
The comparison (=) predicate specification is defined independently of its context, therefore, it behaves the same both in the select list (SELECT ...) and the search condition (WHERE ...).
Let’s start observing the MySQL 5.7 typical behavior:
CREATE TABLE test_comparison_ps (
id INT PRIMARY KEY AUTO_INCREMENT,
str VARCHAR(10) CHARSET utf8
);
INSERT INTO test_comparison_ps (str) VALUES(''), (' ');
SET NAMES utf8 COLLATE utf8_general_ci; # set the connection charset/collation
SELECT id, CONCAT('<', str, '>') `qstr`, str = '' , str = ' ' FROM test_comparison_ps;
# +----+------+----------+-----------+
# | id | qstr | str = '' | str = ' ' |
# +----+------+----------+-----------+
# | 1 | <> | 1 | 1 |
# | 2 | < > | 1 | 1 |
# +----+------+----------+-----------+
They’re all equal! This matches the typical outlook that “MySQL removes all the trailing spaces”.
But why so? Who’s responsible?
According to the SQL standard, trailing spaces are not removed on storage and retrieval. In MySQL, this is a responsibility of the storage engine, in this case InnoDB; from the related manpage, we read:
Trailing spaces are not truncated from VARCHAR columns.
It turns out, the culprit is the collation. In this case, utf8_general_ci, the default collation of the default MySQL 5.7 charset, pads the strings during comparison, so trailing spaces are insignificant.
How do we know how comparisons behave in relation to padding? Let's ask the information schema:
SELECT COLLATION_NAME, PAD_ATTRIBUTE FROM information_schema.collations WHERE COLLATION_NAME RLIKE 'utf8(mb4)?_(general|0900_ai)_ci';
/*
+--------------------+---------------+
| COLLATION_NAME | PAD_ATTRIBUTE |
+--------------------+---------------+
| utf8_general_ci | PAD SPACE | # 5.7 default
| utf8mb4_general_ci | PAD SPACE | # utf8mb4 default in MySQL 5.7
| utf8mb4_0900_ai_ci | NO PAD | # 8.0 default
+--------------------+---------------+
*/
From the manpages page 1 and page 2:
The pad attribute determines how trailing spaces are treated for comparison of nonbinary strings (CHAR, VARCHAR, and TEXT values):
- For PAD SPACE collations, trailing spaces are insignificant in comparisons; strings are compared without regard to any trailing spaces.
- NO PAD collations treat spaces at the end of strings like any other character.
The following are the formal rules from the SQL (2003) standard (section 8.2):
3) The comparison of two character strings is determined as follows:
a) Let CS be the collation as determined by Subclause 9.13, “Collation determination”, for the declared types of the two character strings.
b) If the length in characters of X is not equal to the length in characters of Y, then the shorter string is effectively replaced, for the purposes of comparison, with a copy of itself that has been extended to the length of the longer string by concatenation on the right of one or more pad characters, where the pad character is chosen based on CS. If CS has the NO PAD characteristic, then the pad character is an implementation-dependent character different from any character in the character set of X and Y that collates less than any string under CS. Otherwise, the pad character is a space.
c) The result of the comparison of X and Y is given by the collation CS.
d) Depending on the collation, two strings may compare as equal even if they are of different lengths or contain different sequences of characters. When any of the operations MAX, MIN, and DISTINCT reference a grouping column, and the UNION, EXCEPT, and INTERSECT operators refer to character strings, the specific value selected by these operations from a set of such equal values is implementation- dependent.
The crucial point is b).
The comparison (=) predicate (2)
Now we can go back, and observe a different collation - utf8mb4_0900_ai_ci, the MySQL 8.0 default:
CREATE TABLE test_comparison_np (
id INT PRIMARY KEY AUTO_INCREMENT,
str VARCHAR(10) CHARSET utf8mb4
);
INSERT INTO test_comparison_np (str) VALUES(''), (' ');
SET NAMES utf8mb4 COLLATE utf8mb4_0900_ai_ci; # behave like a standard MySQL 8.0 installation
SELECT id, CONCAT('<', str, '>') `qstr`, str = '' , str = ' ' FROM test_comparison_np;
/*
+----+------+----------+-----------+
| id | qstr | str = '' | str = ' ' |
+----+------+----------+-----------+
| 1 | <> | 1 | 0 |
| 2 | < > | 0 | 1 |
+----+------+----------+-----------+
*/
… so MySQL doesn’t “remove all the trailing spaces” after all.
The LIKE predicate
Let's see how the LIKE predicate behaves:
CREATE TABLE test_like (
id INT PRIMARY KEY AUTO_INCREMENT,
str VARCHAR(10) CHARSET utf8
);
INSERT INTO test_like (str) VALUES(''), (' ');
SET NAMES utf8 COLLATE utf8_general_ci;
SELECT id, CONCAT('<', str, '>') `qstr`, str LIKE '' , str LIKE ' ' FROM test_like;
/*
+----+------+-------------+--------------+
| id | qstr | str LIKE '' | str LIKE ' ' |
+----+------+-------------+--------------+
| 1 | <> | 1 | 0 |
| 2 | < > | 0 | 1 |
+----+------+-------------+--------------+
*/
Yikes! LIKE does not perform padding, even on a PAD SPACE collation such as utf8_general_ci.
LIKE has some semantic differences from =, which are confusing (for example, when dealing with JSON), however, they're expected.
Therefore, as long as we keep in mind that LIKE differs from =, we are less likely to make mistakes.
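A minimal side-by-side (my own example), run under a PAD SPACE collation such as MySQL 5.7's default:
SET NAMES utf8 COLLATE utf8_general_ci;
SELECT 'a' = 'a ' `equals`, 'a' LIKE 'a ' `like`;
-- equals: 1 (the trailing space is padded away)
-- like:   0 (LIKE treats the trailing space as significant)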
Let’s see how unique indexes behave:
CREATE TABLE test_unique_index (
id INT PRIMARY KEY AUTO_INCREMENT,
str_ps VARCHAR(10) CHARSET utf8 COLLATE utf8_general_ci,
str_np VARCHAR(10) CHARSET utf8mb4 COLLATE utf8mb4_0900_ai_ci
);
INSERT INTO test_unique_index (str_ps, str_np) VALUES('', ''), (' ', ' ');
ALTER TABLE test_unique_index ADD UNIQUE (str_ps);
-- ERROR 1062 (23000): Duplicate entry '' for key 'str_ps'
ALTER TABLE test_unique_index ADD UNIQUE (str_np);
-- Query OK, 0 rows affected (0,02 sec)
Unique indexes behave like the comparison predicate; this makes sense, since comparison is the core operation they’re associated to.
The DISTINCT predicate
Let's see the effects of the DISTINCT predicate:
CREATE TABLE test_distinct (
id INT PRIMARY KEY AUTO_INCREMENT,
str VARCHAR(10) CHARSET utf8
);
INSERT INTO test_distinct (str) VALUES(''), (' ');
SET NAMES utf8 COLLATE utf8_general_ci;
SELECT DISTINCT str FROM test_distinct;
/*
+------+
| str |
+------+
| | # ''
| | # ' '
+------+
*/
Very confusing: DISTINCT
does not perform padding.
This is something to keep in mind.
The GROUP BY clause
Finally, the GROUP BY clause:
CREATE TABLE group_by (
id INT PRIMARY KEY AUTO_INCREMENT,
str VARCHAR(10) CHARSET utf8
);
INSERT INTO group_by (str) VALUES(''), (' ');
SET NAMES utf8 COLLATE utf8_general_ci;
SELECT str FROM group_by GROUP BY str;
/*
+------+
| str |
+------+
| | # ''
| | # ' '
+------+
*/
Very confusing, again, although in a way, we could have expected this, since RDBMSs, in some cases, can process DISTINCT
and GROUP BY
the same way.
All in all, the padding rules in MySQL are not so confusing, but one needs to be aware of them - and I haven’t even explored the CHAR
data type.
In my opinion, they’re not worth the hassle, so MySQL 8.0’s behavior is a very welcome simplification. Time to update the database! 😄
]]>The operation itself is simple, however, if we want to script the operation, using text processing in a sharp way, it’s not immediate what the best solution is.
In this post I’ll explore the process of looking for a satisfying solution, going through grep, perl, and awk.
Contents:
For simplicity, we assume that the filenames returned by the mysqld commands, and the user home path, don't require quoting (i.e. they don't contain spaces).
Finding the configuration files is a simple operation:
$ mysqld --verbose --help
This yields a pages-long text, with all the command lines parameter and the server configuration; the relevant section is:
# ...
Default options are read from the following files in the given order:
/etc/my.cnf /etc/mysql/my.cnf /usr/local/mysql/etc/my.cnf ~/.my.cnf
# ...
A generic, manual, approach is to use grep to isolate the text:
$ mysqld --verbose --help | grep -A 1 "^Default options"
Default options are read from the following files in the given order:
/etc/my.cnf /etc/mysql/my.cnf /usr/local/mysql/etc/my.cnf ~/.my.cnf
Using the option -A
(--after-context
), we tell grep to print the given number of lines after the match.
Now we isolate the options line:
$ mysqld --verbose --help | grep -A 1 "^Default options" | tail -n 1
/etc/my.cnf /etc/mysql/my.cnf /usr/local/mysql/etc/my.cnf ~/.my.cnf
Standard approach - we use tail -n 1
in order to print the last 1 line(s).
There’s a problem now; we need to expand the tilde (~
).
Since the string ~/.my.cnf
is the output of a command, it’s not expanded by the subshell; this simplified example fails:
$ ls -l $(echo '~/.my.cnf')
ls: cannot access '~/.my.cnf': No such file or directory
We’ll try search/replace the tilde with the home path ($HOME
in any shell) via Perl:
$ mysqld --verbose --help | grep -A 1 "^Default options" | tail -n 1 | perl -pe "s/~/$HOME/g"
Unknown regexp modifier "/h" at -e line 1, at end of line
syntax error at -e line 1, at EOF
Execution of -e aborted due to compilation errors.
Yikes! What happened?
The problem is that $HOME, in my case /home/saverio, contains slashes, which are interpolated by the shell, and ultimately interpreted by Perl as part of the regex syntax; this is the simplified example:
$ echo perl -pe "s/~/$HOME/g"
perl -pe s/~//home/saverio/g
$ echo | perl -pe 's/~//home/saverio/g'
Unknown regexp modifier "/h" at -e line 1, at end of line
Execution of -e aborted due to compilation errors.
which causes the error previously raised.
Perl can access environment variables - this comes to our rescue:
$ echo '~/.my.cnf' | perl -pe 's/~/$ENV{"HOME"}/'
/home/saverio/.my.cnf
We now have the building blocks of a fully functional command:
$ mysqld --verbose --help | grep -A 1 "^Default options" | tail -n 1 | perl -pe 's/~/$ENV{"HOME"}/g'
/etc/my.cnf /etc/mysql/my.cnf /usr/local/mysql/etc/my.cnf /home/saverio/.my.cnf
Don’t forget the /g
regex modifier! It tells Perl to replace all the occurrences of a pattern in each matching line, if there’s more than one match (per line).
Our task is now accomplished. Can we do better?
While the last revision of the command works, it contains way too many commands. Does the GNU toolbox have better tools?
Let’s see what awk offers.
Awk is a (Turing-complete!) programming language, dedicated to text-processing; hopefully, it includes built-in functions relevant to our task.
The ugliest part right now is to isolate the options string from the entire mysqld
help. The logic required is:
with grep, unfortunately we can’t just print the line below without printing the matching line. But we can with awk!:
$ mysqld --verbose --help | awk '/^Default options/ { getline; print }'
/etc/my.cnf /etc/mysql/my.cnf /usr/local/mysql/etc/my.cnf ~/.my.cnf
Awk’s language is fortunately fairly intuitive.
We use pattern matching /<pattern>/
to match the intended line, and for the matches we execute a block ({ ... }
) that goes to the next line (getline
) and then prints the current one (print
).
Now, in the current revision, we still have two commands, awk
and perl
:
mysqld --verbose --help | awk '/^Default options/ { getline; print }' | perl -pe 's/~/$ENV{"HOME"}/g'
Let’s merge them! We use awk’s search and replace, and environment variables access:
$ mysqld --verbose --help | awk '/^Default options/ { getline; gsub("~", ENVIRON["HOME"]); print }'
/etc/my.cnf /etc/mysql/my.cnf /usr/local/mysql/etc/my.cnf /home/saverio/.my.cnf
Here we use the search and replace function (gsub(source[, destination[, how]])
; how
is not relevant to this article) and associative arrays applied to environment variables (ENVIRON[<variable_name>]
).
Note that gsub
is the global version of search/replace; it replaces all the occurrence in a string, like perl /g
regex modifier.
As extra step, we want to use the output. Say, let’s add a comment to the [mysqld]
block:
$ perl -i -pe 's/^(\[mysqld\]\n)/# Server configuration group follows:\n$1/' $(mysqld --verbose --help | awk '/^Default options/ { getline; gsub("~", ENVIRON["HOME"]); print }') 2> /dev/null
We just ignore the errors (due to file(s) not found), by sending them to /dev/null
.
Long ago, I thought that one could improve their text processing skills with a straight read of educational material. Nowadays, I find it much more effective (and pleasant) to find out, when I have the opportunity, which are the most effective tools to accomplish a given task.
In this article we’ve done an iterative search of the best text processing tools for the given use case; we’ve found that awk compactly, yet intuitively, satisfies the requirements, and we’ve explored a few, interesting and useful, features along the way.
]]>utf8 to utf8mb4, and since we had the conversion in plan anyway, we anticipated it and performed it as a preliminary step for the upgrade.
This post describes in depth the overall experience, including tooling and pitfalls, and related subjects.
Contents:
utf8mb4
is the MySQL encoding that fully covers the UTF-8 standard. Up to MySQL 5.7, the default encoding is utf8
; the name is somewhat misleading, as this is a variant with a maximum width of 3 bytes.
Although there’s no practical purpose nowadays in using 3-bytes rather than 4-bytes UTF-8, this choice was originally made for performance reasons.
From a practical perspective, not all the applications will benefit from the extra byte of width, whose most common use cases include emojis and mathematical letters, however, conforming to standards is a routine task in software engineering.
Since utf8mb4
is a superset of utf8
, the conversion is relatively painless, however, it’s crucial to be aware of the implications of the procedure.
It’s impossible to make a general plan, due to the different requirements of any use case; high traffic applications may for example require that no locking should be involved (ie. no ALTER TABLE
), while low traffic/size applications may just do with a few ALTER TABLE
s.
However, I’ll trace a granular set of steps that should cover the vast majority of the cases; GitHub’s gh-ost is used, therefore, there’s no table locking during the data conversion step.
The setup is assumed to be single-master; there are generally sophisticated multi-master strategies for schema updates, however, they are outside the scope of this article.
The only migration constraint set is that until the end of the migration, the user should not allow 4-byte characters into the database; this gives the certainty that any implicit conversion performed before the end of the migration will succeed.
Users can certainly lift this constraint, however, they must thoroughly analyze the application data flows, in order to be 100% sure that utf8mb4
strings including 4-byte characters won’t mingle with utf8
strings, as this will cause errors.
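For reference, this is the kind of error the constraint protects against - sketched here with a throwaway table (connection charset utf8mb4, strict SQL mode; the error message is approximate):
CREATE TEMPORARY TABLE smp_test (c CHAR(1) CHARACTER SET utf8);
INSERT INTO smp_test VALUES ('🍕');
-- ERROR 1366 (HY000): Incorrect string value: '\xF0\x9F\x8D\x95' for column 'c' at row 1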
MySQL 8.0 changed the utf8mb4
default collation from utf8mb4_general_ci
to utf8mb4_0900_ai_ci
(for details, see here and here).
This has a very significant impact - if the utf8
update is performed on a MySQL 5.7 server, without specifying the collation, and then the server is upgraded to v8.0, the collation of all the data structures will not match the default.
Of course, in such case it’s possible to leave the system as is, however, it won’t be the standard (and the settings will need to be set accordingly, in order to ensure that new tables/columns will be created with the intended collation).
It’s crucial to be aware of this, because most of the online information about the utf8
conversion has been written when MySQL 8.0 was not released yet, so it holds the outdated assumption that the default utf8mb4
collation is utf8mb4_general_ci
.
In the following sections, I’ll point out which configuration parameters are required, when performing the conversion on a 5.7 server.
The character set [from now on abbreviated as charset
] and collation of a given string or database object (ultimately, a column), and the operation performed, are determined by one or more settings/properties at different levels:
For example:
and so on.
Additionally, MySQL server attempts to use a compatible combination charset+collation for incompatible charsets, overriding the configuration/settings.
In order to view the connection and database server settings, we can use this handy query:
SHOW VARIABLES WHERE Variable_name RLIKE '^(character_set|collation)_' AND Variable_name NOT RLIKE '_(database|filesystem|system)$';
some settings are skipped, as they’re unrelated or deprecated.
This is a table of the relevant entries:
Setting | New value | Notes | Server setting | Client setting
---|---|---|---|---
character_set_client | utf8mb4 | data sent by the client | | ✓
character_set_connection | utf8mb4 | server converts client data into this charset for processing | | ✓
collation_connection | utf8mb4_0900_ai_ci | server uses this collation for processing | | ✓
character_set_results | utf8mb4 | data and metadata sent by the server | | ✓
character_set_server | utf8mb4 | default (and fallback) charset for objects | ✓ |
collation_server | utf8mb4_0900_ai_ci | default (and fallback) collation for objects | ✓ |
Server settings are defined at the server level, and as such, they’re typically set in the server configuration file - this is required if we’re operating on MySQL 5.7 (since it uses utf8
by default).
Client settings are specified by the client on connection; typically, they’re set via the SET NAMES <charset> [COLLATE <collation>]
statement.
This command is invoked when the encoding/collation are configured by the application framework; in the case of Rails, the parameters are in database.yml
:
# Typical structure
login:
  encoding: utf8mb4
  collation: utf8mb4_0900_ai_ci
  # ...
In Django, we add the following to settings.py
:
# Typical structure
DATABASES = {
    'default': {
        'OPTIONS': {'charset': 'utf8mb4'},
        # ...
    }
}
The changes above will cause the following statement to be issued on the first connection:
SET NAMES utf8mb4 COLLATE utf8mb4_0900_ai_ci # Rails also sets other variables here.
Based on a brief look at the source code, there is no collation option in Django, so the COLLATE utf8mb4_0900_ai_ci
won’t be specified in the SQL statement.
This step can be performed at the beginning or the end of the migration; the reason is explained in the next subsection.
During the migration, with either utf8
or utf8mb4
connection settings, we’ll find data belonging to the other charset. Is this a problem?
First, an introduction to the charset/collation settings is required.
Over the course of a database connection, the data (flow) is processed in several steps:
- received as character_set_client
- converted to character_set_connection (and compared using the collation_connection)
- sent back as character_set_results
settings, so we can really think of all of them as a single entity.
So, the core question is: for client data in a given format (utf8
or utf8mb4
), will processing (comparison or storage) always succeed?
Fortunately, in our context, the answer is always yes.
When it comes to storage, the matter is pretty simple; MySQL will take care of “converting” the format. We’re safe here because by using 3-byte characters, we can convert without any problem from and to the other charset.
However, in this context, strings manipulation is not only about storage - comparison is the other aspect to consider. It’s time to introduce the concept of collation and the related rules.
Strings are compared according to a "collation", which defines how the data is sorted and compared. Each charset has a default collation, which in MySQL is the case-insensitive one (utf8_general_ci and utf8mb4_general_ci/utf8mb4_0900_ai_ci).
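The defaults can be checked directly (on a v8.0 server, utf8mb4 reports the new 0900 collation):
SELECT CHARACTER_SET_NAME, DEFAULT_COLLATE_NAME
FROM information_schema.CHARACTER_SETS
WHERE CHARACTER_SET_NAME LIKE 'utf8%';
-- utf8    -> utf8_general_ci
-- utf8mb4 -> utf8mb4_0900_ai_ci (utf8mb4_general_ci on 5.7)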
Now, when collating strings of mixed type, will the operation succeed? The answer is… no, but yes!
The reason for the no is that, unlike storage, we can’t use a collation for two different charsets. However, MySQL comes to the rescue.
MySQL has a set of coercibility rules, which determine which collation to use in a given operation (or if an error should be raised).
The rules are quite a few, however, they’re consistently defined, so they’re easy to understand.
We’ll see a few relevant examples, where we’ll also introduce a few interesting SQL clauses:
First example:
CREATE TEMPORARY TABLE test_table (
utf8col CHAR(1) CHARACTER SET utf8 COLLATE utf8_bin
)
SELECT _utf8'ä' `utf8col`;
SELECT utf8col < _utf8mb4'🍕' COLLATE utf8mb4_bin `result` FROM test_table;
# +--------+
# | result |
# +--------+
# | 1 |
# +--------+
The relevant rules are:
An explicit COLLATE clause has a coercibility of 0 (not coercible at all)
The collation of a column or a stored routine parameter or local variable has a coercibility of 2
which determine the collation to be utf8mb4_bin. Shouldn't the utf8col value fail, due to being an utf8 value, which is not handled by the winning collation?
No! MySQL will automatically convert the value, making it compatible. This is equivalent to:
SELECT _utf8mb4'ä' < _utf8mb4'🍕' COLLATE utf8mb4_bin `result` FROM test_table;
Second example:
SET NAMES utf8mb4;
CREATE TEMPORARY TABLE test_table (
utf8col CHAR(1) CHARACTER SET utf8 COLLATE utf8_bin
)
SELECT _utf8'ä' `utf8col`;
SELECT utf8col < 'ë' `result` FROM test_table;
# +--------+
# | result |
# +--------+
# | 1 |
# +--------+
The relevant rules are:
The collation of a column or a stored routine parameter or local variable has a coercibility of 2
The collation of a literal has a coercibility of 4
The collation will be utf8_bin
. Since ë
can be converted, there’s no problem.
Equivalent statement:
SELECT _utf8'ä' COLLATE utf8_bin < _utf8mb4'ë' `result` FROM test_table;
Final example:
CREATE TEMPORARY TABLE test_table (
utf8col CHAR(1) CHARACTER SET utf8 COLLATE utf8_bin
)
SELECT _utf8'ä' `utf8col`;
SELECT utf8col < _utf8mb4'🍕' `result` FROM test_table;
ERROR 1267 (HY000): Illegal mix of collations (utf8_bin,IMPLICIT) and (utf8mb4_0900_ai_ci,COERCIBLE) for operation '<'
Error! What happened here?
The relevant rules and chosen collation are the same as the previous example, however, in this case, the pizza emoji (🍕
) can’t be converted to utf8
, therefore, the operation fails.
The conclusion is that as long as we use utf8
characters only during the migration, we’ll have no problem, as the only relevant case is the second example.
The ALTER statements
In this step we'll prepare all the ALTER statements that will change the schema/table metadata, and the data.
The operations are performed on a development database with the same structure as production.
First, we convert the database default charset (both production and development):
ALTER SCHEMA production_schema CHARACTER SET=utf8mb4;
data is not changed - only the metadata.
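The change can be verified via the information schema (only the schema-level default is affected at this point):
SELECT DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME
FROM information_schema.SCHEMATA
WHERE SCHEMA_NAME = 'production_schema';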
Then, we convert all the table charset to utf8mb4
:
mysqldump "$updating_schema" |
perl -ne 'print "ALTER TABLE $1 CHARACTER SET utf8mb4;\n" if /CREATE TABLE (.*) /' |
mysql "$updating_schema"
again, data is not changed. This operation will cause all the columns that don’t match the new charset (supposedly, all the existing character columns), to show the former (utf8
) charset in their definition:
# before (simplified)
CREATE TABLE mytable (
intcol INT,
strcol CHAR(1),
strcol2 CHAR(1)
);
# after
CREATE TABLE mytable (
intcol INT,
strcol CHAR(1) CHARACTER SET utf8,
strcol2 CHAR(1) CHARACTER SET utf8
) DEFAULT CHARSET=utf8mb4;
This allows us to write a straight conversion command:
mysqldump --no-data --skip-triggers "$updating_schema" |
egrep '^CREATE TABLE|CHARACTER SET utf8\b' |
perl -0777 -pe 's/(CREATE TABLE [^\n]+ \(\n)+CREATE/CREATE/g' | # remove tables without entries
perl -0777 -pe 's/,?\n(CREATE|$)/;\n$1/g' | # change comma of each last column def to semicolon (or add it)
perl -pe 's/(CHARACTER SET utf8\b)/$1mb4/' | # change charset
perl -pe 's/ `/ MODIFY `/' | # add `MODIFY`
perl -pe 's/^CREATE TABLE (.*) \(/ALTER TABLE $1/' # convert `CREATE TABLE ... (` to `ALTER TABLE`
The output will consist of all the required ALTER TABLES
, for example:
ALTER TABLE `mytable`
MODIFY `strcol` char(1) CHARACTER SET utf8mb4 DEFAULT NULL,
MODIFY `strcol2` char(1) CHARACTER SET utf8mb4 DEFAULT NULL;
A database engine needs to know the maximum length of the stored data, in this case, text, because the data structures are subject to limits.
In relation to the utf8 migration, the two related limits are:
In practice, something that may happen is that a table defined as such:
CREATE TABLE mytable (
longcol varchar(21844) CHARACTER SET utf8
);
will cause an error when converting to utf8mb4:
ALTER TABLE mytable MODIFY longcol varchar(21844) CHARACTER SET utf8mb4;
ERROR 1074 (42000): Column length too big for column 'longcol' (max = 16383); use BLOB or TEXT instead
because of MySQL's restriction of 65535 (2^16 - 1) bytes on the combined size of all the columns.
A comparable restriction applies to index prefixes, although in this case there are two limits, 767 and 3072 bytes, depending on the row format and the large prefix option.
The restriction specifications can be found in the MySQL manual.
If reducing the column width is not an option, the column will need to be converted to a TEXT
data type.
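In that case, the workaround is simply (my own sketch, reusing the names above):
ALTER TABLE mytable MODIFY longcol TEXT CHARACTER SET utf8mb4;
-- TEXT columns count only a few bytes towards the 65535-byte row limit, as the data is
-- stored separately.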
Note that using very long character columns should be carefully evaluated. Advanced DBAs know the implications, however it’s worth mentioning that in relation to the topic of internal temporary tables, character columns larger than 512 characters cause on-disk tables to be used; large object columns (BLOB
/TEXT
) don’t have this problem from version 8.0.3 onwards (see MySQL manual).
Therefore, large object columns are suitable for a larger amount of use cases than they were in the past.
Triggers and functions also require review.
Since they are executed outside the context of a connection, they carry their charset settings:
SHOW TRIGGERS\G
# [...]
# character_set_client: utf8
# collation_connection: utf8_general_ci
# Database Collation: utf8_general_ci
On one hand, those properties can be executed at any point of the migration, as they act exactly as described in the connection configurations section.
On the other hand, we need to take care of explicit COLLATE
clauses involving columns being converted, if present.
Suppose we have this statement:
SET @column_updated := OLD.strcol <=> NEW.strcol COLLATE utf8_bin;
If we migrate the column to utf8mb4, as soon as the ALTER TABLE completes, any operation associated to the trigger (eg. INSERT) will always fail, because the utf8_bin collation is not compatible with the new utf8mb4 charset.
The solution is fairly simple - the trigger needs to be dropped before the ALTER TABLE
, and recreated after. This of course, can be a serious challenge for high-traffic websites.
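Sketched with the statement above and made-up object names (the real procedure also needs to account for the events missed while the trigger doesn't exist):
DROP TRIGGER IF EXISTS mytable_track_update;

-- ... the ALTER TABLE/gh-ost conversion of `mytable` runs here ...

DELIMITER $$
CREATE TRIGGER mytable_track_update AFTER UPDATE ON mytable
FOR EACH ROW
BEGIN
  -- the explicit collation now matches the new charset
  SET @column_updated := OLD.strcol <=> NEW.strcol COLLATE utf8mb4_bin;
END$$
DELIMITER ;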
Inevitably, some tables will be converted before others; even assuming parallel conversion, it’s not possible (without locking) to synchronize the end of the conversion of a set of given tables.
This creates a problem for a specific case: JOINs between columns of heterogeneous charsets - in practice, between a utf8
column and an utf8mb4
one.
In theory, this shouldn’t be a problem in itself. Let’s see what MySQL does in this case; let’s create a couple of tables:
CREATE TABLE utf8_table (
mb3col CHAR(1) CHARACTER SET utf8,
KEY `mb3idx` (mb3col)
);
INSERT INTO utf8_table
VALUES ('a'), ('b'), ('c'), ('d'), ('e'), ('f'), ('g'), ('h'), ('i'), ('j'), ('k'), ('l'), ('m'),
('n'), ('o'), ('p'), ('q'), ('r'), ('s'), ('t'), ('u'), ('v'), ('w'), ('x'), ('y'), ('z');
CREATE TABLE utf8mb4_table (
mb4col CHAR(1) CHARACTER SET utf8mb4,
KEY `mb4idx` (mb4col)
);
INSERT INTO utf8mb4_table
VALUES ('a'), ('b'), ('c'), ('d'), ('e'), ('f'), ('g'), ('h'), ('i'), ('j'), ('k'), ('l'), ('m'),
('n'), ('o'), ('p'), ('q'), ('r'), ('s'), ('t'), ('u'), ('v'), ('w'), ('x'), ('y'), ('z'),
('🍕');
First, let’s see what happen for simple index scans.
EXPLAIN SELECT COUNT(*) FROM utf8mb4_table WHERE mb4col = _utf8'n';
# +----+-------------+---------------+------------+------+---------------+--------+---------+-------+------+----------+-------------+
# | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
# +----+-------------+---------------+------------+------+---------------+--------+---------+-------+------+----------+-------------+
# | 1 | SIMPLE | utf8mb4_table | NULL | ref | mb4idx | mb4idx | 5 | const | 1 | 100.00 | Using index |
# +----+-------------+---------------+------------+------+---------------+--------+---------+-------+------+----------+-------------+
SHOW WARNINGS\G
# [...]
# Message: /* select#1 */ select count(0) AS `COUNT(*)` from `db`.`utf8mb4_table` where (`db`.`utf8mb4_table`.`mb4col` = 'n')
Interestingly, it seems that MySQL converts the data before it reaches the optimizer; this is valuable knowledge, because with the current constraint(s), we can rely on the indexes as much as before the migration start.
What happens with JOINs?
EXPLAIN SELECT COUNT(*) FROM utf8_table JOIN utf8mb4_table ON mb3col = mb4col;
# +----+-------------+---------------+------------+-------+---------------+--------+---------+------+------+----------+--------------------------+
# | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
# +----+-------------+---------------+------------+-------+---------------+--------+---------+------+------+----------+--------------------------+
# | 1 | SIMPLE | utf8_table | NULL | index | NULL | mb3idx | 4 | NULL | 26 | 100.00 | Using index |
# | 1 | SIMPLE | utf8mb4_table | NULL | ref | mb4idx | mb4idx | 5 | func | 1 | 100.00 | Using where; Using index |
# +----+-------------+---------------+------------+-------+---------------+--------+---------+------+------+----------+--------------------------+
What’s func
?
SHOW WARNINGS\G
# Message: /* select#1 */ select count(0) AS `COUNT(*)` from `db`.`utf8_table` join `db`.`utf8mb4_table` where (convert(`db`.`utf8_table`.`mb3col` using utf8mb4) = `db`.`utf8mb4_table`.`mb4col`)
Very interesting; we see what MySQL does in this case: it iterates utf8_table.mb3col (specifically, it iterates the index mb3idx), and for each value, it converts it to utf8mb4, so that it can be sought in the utf8mb4_table.mb4idx index.
Note that this is a simple case; more complex JOINs in the app should still be carefully reviewed.
Now we can proceed to alter the production schema.
The schema encoding can be changed without any worry, as it’s not a locking operation (up to v5.7, database properties are stored in a separate file, db.opt
).
The table changes are the “big deal”: we need to perform them without locking, and with an awareness of the implications.
In order to avoid table locking, we use gh-ost, which is easy to use and well-documented.
Generally speaking, each ALTER TABLE
of the list generated in the previous step must be converted to a gh-ost
command and executed.
For example, this DDL statement:
ALTER TABLE `mytable`
MODIFY `strcol` char(1) CHARACTER SET utf8mb4 DEFAULT NULL,
MODIFY `strcol2` char(1) CHARACTER SET utf8mb4 DEFAULT NULL;
needs to be performed as [simplified form]:
gh-ost --database="$production_schema" --table="mytable" --alter="
CHARACTER SET utf8mb4,
MODIFY `strcol` char(1) CHARACTER SET utf8mb4 DEFAULT NULL,
MODIFY `strcol2` char(1) CHARACTER SET utf8mb4 DEFAULT NULL
"
This is a fairly simple procedure. Don’t forget to run ANALYZE TABLE
on each table after it’s been rebuilt.
The problem that some users will have is triggers; gh-ost doesn’t support tables with triggers, so an alternative procedure needs to be applied by high-traffic websites using this functionality.
Little gotchas to be aware of!
Don’t forget to convert the other schemas as well!
In particular, if you’re on AWS, the schema tmp
will need to be converted. Forgetting to do so may cause errors if this database is used for temporary data operations that involve the main production database.
ANALYZE TABLE
It’s crucial to always run an ANALYZE TABLE
for each table rebuilt. Gh-ost builds tables via successive insert, and it’s good (MySQL) DBA practice to:
run ANALYZE TABLE after loading substantial data into an InnoDB table, or creating a new index for one
See the MySQL manual for more informations.
DROP TABLE
Gh-ost doesn’t delete the old table after replacing it - it only renames it. Be very careful when deleting it; a straight DROP TABLE
may flood the server with I/O.
Internally, we have a script for dropping large tables that first drops the indexes one by one, then deletes the records in chunks, and only at the end drops the (now empty) table.
There’s a popular post about the same subject, by a V8 developer (Mathias Bynens).
A couple of concepts are worth considering:
# For each table
REPAIR TABLE table_name;
OPTIMIZE TABLE table_name;
From this, it can be deduced that the author uses MyISAM, as InnoDB doesn’t support REPAIR TABLE
(see the MySQL manual).
make sure to repair and optimize all databases and tables […] ran into some weird bugs where UPDATE statements didn’t have any effect, even though no errors were thrown
this is very likely a bug, and based on the previous point, it may be MyISAM related (or related to ALTER TABLE
). MyISAM has been essentially abandoned for a long time, and we’ve experienced buggy behaviors as well (although not in the context of charsets), so it wouldn’t be a surprise; the post is also very old (2012).
We’re entirely on InnoDB, and we didn’t experience any issue when changing the charset via ALTER TABLE
(small tables in our model have been done this way). It’s also worth considering that gh-ost alters tables by creating an empty table and slowly filling it, which is different from issuing an ALTER TABLE
.
If somebody still wanted to do a rebuild of the table, note that OPTIMIZE TABLE
performs a full rebuild followed by ANALYZE TABLE
, so it’s not required to run the latter statement separately.
Considering that migrating a database to utf8mb4
implies literally rebuilding the entire database’s data, it’s been a ride with relatively few bumps.
The core issue is handling JOINs between columns being migrated; it may not be a trivial matter, but it’s possible to get deterministic behavior with a thorough analysis.
Projects planning to move to MySQL 8.0 are encouraged to perform this step ahead, to shift as many possible changes related to the upgrade ahead of the upgrade itself.
All in all, migrating to utf8mb4
is a very significant change, but knowing where to look at, it’s possible to perform it smoothly.
¹ Very likely, partial indexes are a fit solution to this problem, but they’re not supported by MySQL.
]]>On modern MySQL setups, dropping a column doesn’t lock the table (it does, actually, but for a relatively short time), however, we wanted to improve a very typical Rails migration scenario in a few ways:
I’ll give the Gh-ost tool a brief introduction, and show how to fulfill the above requirements in a simple way, by using this tool and an ActiveRecord flag.
This workflow can be applied to almost any table alteration scenario.
Contents:
Gh-ost is a relatively recent tool by GitHub, which allows online table modifications without locking.
Tools like gh-ost existed before - the first being mk-online-schema-change
(now pt-online-schema-change
), developed by Percona.
The Percona tool relies on triggers in order to achieve the objective, which is a good enough, stable, solution. However, there are a variety of reasons that (can) make the tool inadequate for high-load conditions.
Gh-ost introduced the novel idea of reading from the binary log (which logs all the write operation) in order to reproduce the writes on the temporary table.
Gh-ost can be run in different setups; this article will show the simplest one.
Let’s assume the following table:
CREATE TABLE `customers` (
--- column definitions
`source_id` int(11) NOT NULL,
-- index definitions
KEY `index_customers_on_source_id` (`source_id`)
);
with the corresponding model:
class Customer < ApplicationRecord
# model content
end
and migration:
class DropCustomersSourceId < ActiveRecord::Migration
def change
remove_column :customers, :source_id
end
end
First, we tackle point #2. Let’s have a look at the stages of a typical deploy with migrations:
ALTER TABLE
statement, which will take a long time;The problem is that between the stages 2. and 3. (and also, depending on the app server configuration, during the processes restart), the app servers will have in memory the old version of the codebase, which expects customers.source_id
to be present.
Although this time is relatively short, on a high-load environment, if a Customer
instance is saved, the operation will fail, because ActiveRecord will include the column in the underlying INSERT.
In systems engineering, schema-aware code strategy is sometimes applied: essentially, writing code in the form “if the schema is foo
, do bar
, otherwise, do baz
”.
In the case of a column drop, we have at our disposal a “cheap” schema-aware strategy: ignored_columns
(see the Rails PR).
This directive makes ActiveRecord entirely ignore a column, so that the column can disappear at any time, without ActiveRecord noticing.
Let’s update the model:
class Customer < ApplicationRecord
self.ignored_columns = %w(source_id)
# model content
end
and the migration:
class DropCustomersSourceId < ActiveRecord::Migration[5.2]
def change
remove_column :customers, :source_id unless is_production_environment?
end
def is_production_environment?
# choose strategy
end
end
We can now perform the deploy; this time, the table column will not be dropped. After the deploy, we will use gh-ost, as outlined in the next section.
Gh-ost is pretty straightforward to use. In this context it’s used in the simplest way possible, that is, running directly on master.
Note that there are many options available, including:
A summary document is available here; gh-ost has good documentation.
The sample command we use is:
$ GHOST_TABLE="customers"
$ GHOST_ALTER="DROP source_id"
$ gh-ost \
--user="$GHOST_USER" --password="$GHOST_PASSWORD" --host="$GHOST_HOST" \
--database="$GHOST_SCHEMA" --table="$GHOST_TABLE" --alter="$GHOST_ALTER" \
--allow-on-master --exact-rowcount --verbose --execute
The options are clear; --exact-rowcount
will trade a little execution time for more accurate progress estimation.
Gh-ost will create a temporary (in a logical, not SQL, sense) table, slowly fill it and update with original table updates, then swap (with negligible locking time) them.
A crucial detail is that gh-ost will leave the original table in the database, renamed (in this case, _customers_del
).
Although there is an option to drop the table automatically, do not enable it, and do not attempt to do it manually: dropping a large table creates a large amount of I/O, due to MySQL freeing the buffer pool pages, which will likely bring the database system to a grinding halt for some time. Instead, one should follow a progressive table drop workflow:
Between each drop/deletion, SLEEP
calls should be performed, in order to ensure that the writes are fully flushed.
Internally, we have a script for this, and it’s advised to find or develop something similar.
Of course, SLEEP
can be replaced with sophisticated strategies (eg. relying on the server statistics to track the I/O), however, in our system, SLEEP
is a perfectly adequate while simple strategy.
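A (heavily simplified) sketch of the idea, using the example table; the real script loops the chunked delete until the table is empty, and tunes chunk size/sleeps to the server load:
ALTER TABLE _customers_del DROP KEY index_customers_on_source_id; -- one index at a time
DO SLEEP(5);
DELETE FROM _customers_del LIMIT 10000; -- repeat (with sleeps) until no rows are left
DO SLEEP(5);
DROP TABLE _customers_del; -- now cheap: the table is empty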
Remove ignored_columns and redeploy
At this point, in production, Rails will be completely unaware of the existence (or not) of the column (being) dropped.
After the column is dropped, we can remove the Customer.ignored_columns directive, and deploy any time (or even wait for the next deploy).
We’ve been using gh-ost for a long time by now, and we’ve developed a surrounding tooling ecosystem.
Once one gets used to such workflows, it's actually satisfying to perform "push-button" table alterations without any locking or performance drop in general, instead of being worried about the impact of (relatively) large-scale db operations.
Paraphrasing the typical joke:
;-)
]]>