UPDATE [...] SET mycol = (@myvar := EXPRESSION(@myvar, mycol))
pattern.
This pattern though doesn’t play well with the optimizer (leading to non-deterministic behavior), so it has been deprecated. This left a sort of void, since the (relatively) sophisticated logic is now harder to reproduce, at least with the same simplicity.
In this article, I’ll have a look at two ways to apply such logic: using, canonically, window functions, and, a bit more creatively, using recursive CTEs.
Although CTEs are fairly intuitive, I advise those unfamiliar with them to read my previous post on the subject.
The same applies to window functions: I will break the query and concepts down, but it helps to have at least a basic idea of how they work. There is a vast amount of literature about window functions (which is the reason why I haven’t written about them until now); pretty much all the tutorials use corporate budgets or populations/countries as examples. Here, instead, I’ll use a real-world case.
In relation to the software, MySQL 8.0.19 is convenient (but not required). All the statements need to be run in the same console, due to the reuse of @venue_id.
There is always an architectural dilemma between placing logic at the application level as opposed to the database level. Although this is a worthwhile debate, in this context the underlying assumption is that the logic must stay at the database level; a typical requirement leading to this is speed, which has actually been our case.
In this problem, we manage venue (theater) seats.
As a business requirement, we need to assign a “grouping”: an additional number attached to each seat, which increases by 1 for adjacent seats and by 2 whenever a new group starts (a gap after the last seat, or a new row).
In order to set the grouping value, in pseudocode:
current_grouping = 0

for each row:
    for each number:
        if (is_there_a_space_after_last_seat or is_a_new_row) and is_not_the_first_seat:
            current_grouping += 2
        else:
            current_grouping += 1

        seat.grouping = current_grouping
In practice, we want the setup on the left to have the corresponding values on the right:
x→ 0 1 2 0 1 2
y ╭───┬───┬───╮ ╭───┬───┬───╮
↓ 0 │ x │ x │ │ │ 1 │ 2 │ │
├───┼───┼───┤ ├───┼───┼───┤
1 │ x │ │ x │ │ 4 │ │ 6 │
├───┼───┼───┤ ├───┼───┼───┤
2 │ x │ │ │ │ 8 │ │ │
╰───┴───┴───╯ ╰───┴───┴───╯
Let’s use a minimalist design for the underlying table:
CREATE TABLE seats (
id INT AUTO_INCREMENT PRIMARY KEY,
venue_id INT,
y INT,
x INT,
`row` VARCHAR(16),
number INT,
`grouping` INT,
UNIQUE venue_id_y_x (venue_id, y, x)
);
We won’t need the row/number columns; on the other hand, we don’t want to use a table whose records are fully contained in an index, in order to be closer to a real-world setting.
Based on the diagram of the previous section, the seat coordinates, in the form (y, x), are: (0, 0), (0, 1), (1, 0), (1, 2), (2, 0).
Note that we’re using y as the first coordinate, because it makes it easier to reason in terms of rows.
We’re going to load a large enough number of records, in order to make sure the optimizer doesn’t take unexpected shortcuts. We use recursive CTEs, of course 😉:
INSERT INTO seats(venue_id, y, x, `row`, number)
WITH RECURSIVE venue_ids (id) AS
(
SELECT 0
UNION ALL
SELECT id + 1 FROM venue_ids WHERE id + 1 < 100000
)
SELECT /*+ SET_VAR(cte_max_recursion_depth = 1M) */
v.id,
c.y, c.x,
CHAR(ORD('A') + FLOOR(RAND() * 3) USING ASCII) `row`,
FLOOR(RAND() * 3) `number`
FROM venue_ids v
JOIN (
VALUES
ROW(0, 0),
ROW(0, 1),
ROW(1, 0),
ROW(1, 2),
ROW(2, 0)
) c (y, x)
;
ANALYZE TABLE seats;
A couple of notes:

- we use the VALUES ROW()... construct in order to represent a (joinable) table without actually creating it;
- we don’t care about the exact row/number data, as they’re filler.

The old-school solution is very straightforward:
SET @venue_id = 5000; -- arbitrary venue id; any (stored) id will do
SET @grouping = -1;
SET @y = -1;
SET @x = -1;
WITH seat_groupings (id, y, x, `grouping`, tmp_y, tmp_x) AS
(
SELECT
id, y, x,
@grouping := @grouping + 1 + (seats.x > @x + 1 OR seats.y != @y),
@y := seats.y,
@x := seats.x
FROM seats
WHERE venue_id = @venue_id
ORDER BY y, x
)
UPDATE
seats s
JOIN seat_groupings sg USING (id)
SET s.grouping = sg.grouping
;
-- Query OK, 5 rows affected, 3 warnings (0,00 sec)
Nice and easy (but keep in mind the warnings)!
A little side note: I’m taking advantage of boolean arithmetic properties here; specifically, the following statements are equivalent:
SELECT seats.x > @x + 1 OR seats.y != @y `increment`;
SELECT IF (
seats.x > @x + 1 OR seats.y != @y,
1,
0
) `increment`;
Some people find it intuitive, some don’t - it’s a matter of taste; now that it’s been clarified, I will use the compact form for the rest of the article.
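As a quick standalone illustration (not part of the article’s workflow), a boolean expression evaluates to 1 or 0 when used in a numeric context:

SELECT (2 > 1) `is_true`, (1 > 2) `is_false`, 10 + (2 > 1) `sum`;
-- +---------+----------+------+
-- | is_true | is_false | sum  |
-- +---------+----------+------+
-- |       1 |        0 |   11 |
-- +---------+----------+------+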
Let’s see the outcome:
SELECT id, y, x, `grouping` FROM seats WHERE venue_id = @venue_id ORDER BY y, x;
-- +-------+------+------+----------+
-- | id | y | x | grouping |
-- +-------+------+------+----------+
-- | 24887 | 0 | 0 | 1 |
-- | 27186 | 0 | 1 | 2 |
-- | 29485 | 1 | 0 | 4 |
-- | 31784 | 1 | 2 | 6 |
-- | 34083 | 2 | 0 | 8 |
-- +-------+------+------+----------+
This approach is ideal!
It has just a “small” defect: it may work… or not.
The reason is that the query optimizer doesn’t necessarily evaluate left to right, so the assignment operations (:=
) may be evaluated out of order, causing the result to be wrong. This is a problem typically experienced after MySQL upgrades.
As of MySQL 8.0, this functionality is indeed deprecated:
-- To be run immediately after the UPDATE.
--
SHOW WARNINGS\G
-- *************************** 1. row ***************************
-- Level: Warning
-- Code: 1287
-- Message: Setting user variables within expressions is deprecated and will be removed in a future release. Consider alternatives: 'SET variable=expression, ...', or 'SELECT expression(s) INTO variables(s)'.
-- [...]
Let’s fix this!
Window functions have been a long-awaited functionality in the MySQL world.
Generally speaking, the “rolling” nature of window functions fits accumulating functions very well. However, some complex accumulating functions require the result of the latest expression to be available, which is something window functions don’t support, since they work on a column basis.
This doesn’t mean that the problem can’t be solved; rather, it needs to be re-thought.
In this case, we split the problem into two concepts; we think of the grouping value for each seat as the sum of two values: the seat’s sequence number within the (ordered) venue, and the cumulative sum of the increments caused by gaps and new rows up to that seat.
Those familiar with window functions will recognize the patterns here 🙂
The sequence number of each seat is a built-in function:
ROW_NUMBER() OVER <window>
The cumulative value is where things get interesting. In order to accomplish this task, we perform two steps: first, we compute the increment introduced by each seat relative to the previous one; then, we compute the running sum of those increments.
Let’s see the SQL:
WITH
increments (id, increment) AS
(
SELECT
id,
x > LAG(x, 1, x - 1) OVER tzw + 1 OR y != LAG(y, 1, y) OVER tzw
FROM seats
WHERE venue_id = @venue_id
WINDOW tzw AS (ORDER BY y, x)
)
SELECT
s.id, y, x,
ROW_NUMBER() OVER tzw + SUM(increment) OVER tzw `grouping`
FROM seats s
JOIN increments i USING (id)
WINDOW tzw AS (ORDER BY y, x)
;
-- +-------+---+---+----------+
-- | id | y | x | grouping |
-- +-------+---+---+----------+
-- | 24887 | 0 | 0 | 1 |
-- | 27186 | 0 | 1 | 2 |
-- | 29485 | 1 | 0 | 4 |
-- | 31784 | 1 | 2 | 6 |
-- | 34083 | 2 | 0 |        8 |
-- +-------+---+---+----------+
Nice!
(Note that for simplicity, I’ll omit the UPDATE from now on.)
Let’s review the query.
The CTE (edited):
SELECT
id,
x > LAG(x, 1, x - 1) OVER tzw + 1 OR y != LAG(y, 1, y) OVER tzw `increment`
FROM seats
WHERE venue_id = @venue_id
WINDOW tzw AS (ORDER BY y, x)
;
-- +-------+-----------+
-- | id | increment |
-- +-------+-----------+
-- | 24887 | 0 |
-- | 27186 | 0 |
-- | 29485 | 1 |
-- | 31784 | 1 |
-- | 34083 | 1 |
-- +-------+-----------+
calculates the increment for each seat, compared to the previous one (more on LAG() later). It works purely on each record and the previous one; it’s not cumulative.
Now, in order to calculate the cumulative increments, we just use a window function to compute the sum, for and up to each seat:
-- (CTE here...)
SELECT
s.id, y, x,
ROW_NUMBER() OVER tzw `pos.`,
SUM(increment) OVER tzw `cum.incr.`
FROM seats s
JOIN increments i USING (id)
WINDOW tzw AS (ORDER BY y, x);
-- +-------+---+---+------+-----------+
-- | id | y | x | pos. | cum.incr. | (grouping)
-- +-------+---+---+------+-----------+
-- | 24887 | 0 | 0 | 1 | 0 | = 1 + 0 (curr.)
-- | 27186 | 0 | 1 | 2 | 0 | = 2 + 0 (#24887) + 0 (curr.)
-- | 29485 | 1 | 0 | 3 | 1 | = 3 + 0 (#24887) + 0 (#27186) + 1 (curr.)
-- | 31784 | 1 | 2 | 4 | 2 | = 4 + 0 (#24887) + 0 (#27186) + 1 (#29485) + 1 (curr.)
-- | 34083 | 2 | 1 | 5 | 3 | = 5 + 0 (#24887) + 0 (#27186) + 1 (#29485) + 1 (#31784)↵
-- +-------+---+---+------+-----------+ + 1 (curr.)
The LAG() window function

The LAG function, in its simplest form (LAG(x)), returns the previous value of the given column. A typical nuisance of window functions is dealing with the first record(s) in the window - since there is no previous record, they return NULL. With LAG, we can specify the value we want as the third parameter:
LAG(x, 1, x - 1) -- defaults to `x - 1`
LAG(y, 1, y)     -- defaults to `y`
By specifying the defaults above, we make sure that the very first seat in the window will be treated by the logic as adjacent to the previous one (x - 1) and in the same row (y).
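A minimal standalone illustration of the effect of the third parameter (the two-row derived table here is hypothetical, not part of the article’s schema):

SELECT x, LAG(x) OVER w `lag_plain`, LAG(x, 1, x - 1) OVER w `lag_default`
FROM (SELECT 0 `x` UNION ALL SELECT 1) t
WINDOW w AS (ORDER BY x);
-- +---+-----------+-------------+
-- | x | lag_plain | lag_default |
-- +---+-----------+-------------+
-- | 0 |      NULL |          -1 |
-- | 1 |         0 |           0 |
-- +---+-----------+-------------+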
The alternative to defaults is typically IFNULL, which is very intrusive, especially considering the relative complexity of the expression:
-- Both valid. And both ugly!
--
IFNULL(x > LAG(x) OVER tzw + 1 OR y != LAG(y) OVER tzw, 0)
IFNULL(x > LAG(x) OVER tzw + 1, FALSE) OR IFNULL(y != LAG(y) OVER tzw, FALSE)
The second LAG() parameter is the number of positions to go back in the window; 1 is the previous, which is also the default value.
In this query, we’re using the same window multiple times. The following queries are formally equivalent:
SELECT
id,
x > LAG(x, 1, x - 1) OVER tzw + 1
OR y != LAG(y, 1, y) OVER tzw
FROM seats
WHERE venue_id = @venue_id
WINDOW tzw AS (ORDER BY y, x);
SELECT
id,
x > LAG(x, 1, x - 1) OVER (ORDER BY y, x) + 1
OR y != LAG(y, 1, y) OVER (ORDER BY y, x)
FROM seats
WHERE venue_id = @venue_id;
However, the latter may cause a suboptimal plan (which I’ve experienced, at least in the past); the optimizer may treat the windows as independent, and iterate them separately.
For this reason, I advise always using named windows, at least when the same window is repeated.
The PARTITION BY clause

Typically, window functions are executed over a partition, which in this case would be:
SELECT
id,
x > LAG(x, 1, x - 1) OVER tzw + 1
OR y != LAG(y, 1, y) OVER tzw
FROM seats
WHERE venue_id = @venue_id
WINDOW tzw AS (PARTITION BY venue_id ORDER BY y, x); -- here!
Since the window matches the full set of records (which is filtered by the WHERE condition), we don’t need to specify it.
If we had to run this query over the whole seats table, then we’d need it, so that the window is reset for each venue_id.
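For reference, a sketch of that whole-table variant (same columns as before; the WHERE clause is simply dropped, and the partition takes over its role):

SELECT
  id,
  x > LAG(x, 1, x - 1) OVER tzw + 1 OR y != LAG(y, 1, y) OVER tzw `increment`
FROM seats
WINDOW tzw AS (PARTITION BY venue_id ORDER BY y, x);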
In the query, the ORDER BY is specified at the window level:
SELECT
id,
x > LAG(x, 1, x - 1) OVER tzw + 1
OR y != LAG(y, 1, y) OVER tzw
FROM seats
WHERE venue_id = @venue_id
WINDOW tzw AS (ORDER BY y, x)
The window ordering is separate from the SELECT one. This is crucial! The behavior of this query:
SELECT
id,
x > LAG(x, 1, x - 1) OVER tzw + 1
OR y != LAG(y, 1, y) OVER tzw
FROM seats
WHERE venue_id = @venue_id
WINDOW tzw AS ()
ORDER BY y, x
is unspecified. Let’s have a look at the manpage:
Query result rows are determined from the FROM clause, after WHERE, GROUP BY, and HAVING processing, and windowing execution occurs before ORDER BY, LIMIT, and SELECT DISTINCT.
Abstractly speaking, in order to solve this class of problems, instead of representing each entry as a function of the previous one, we calculate the state change for each entry, then sum the changes up.
Although more complex than the functionality it replaces, this solution is very solid. This approach, though, may not always be possible, or at least easy; that’s where the recursive CTE solution comes into play.
This approach requires a workaround due to a limitation in MySQL’s CTE functionality, but, on the other hand, it’s a generic, direct, solution, and as such, it doesn’t require any rethinking of the approach.
Let’s start from a simplified version of the final query:
-- `p_` is for `Previous`, in order to make the conditions a bit more intuitive.
--
WITH RECURSIVE groupings (p_id, p_venue_id, p_y, p_x, p_grouping) AS
(
(
SELECT id, venue_id, y, x, 1
FROM seats
WHERE venue_id = @venue_id
ORDER BY y, x
LIMIT 1
)
UNION ALL
SELECT
s.id, s.venue_id, s.y, s.x,
p_grouping + 1 + (s.x > p_x + 1 OR s.y != p_y)
FROM groupings, seats s
WHERE s.venue_id = p_venue_id AND (s.y, s.x) > (p_y, p_x)
ORDER BY s.venue_id, s.y, s.x
LIMIT 1
)
SELECT * FROM groupings;
Bingo! This query is (relatively) simple, but most importantly, it expresses the grouping accumulating function in the simplest possible way:
p_grouping + 1 + (s.x > p_x + 1 OR s.y != p_y)
-- the above is equivalent to:
@grouping := @grouping + 1 + (seats.x > @x + 1 OR seats.y != @y),
@y := seats.y,
@x := seats.x
Even for those who are not accustomed to CTEs, the logic is simple.
The initial row is the first seat of the venue, in order:
SELECT id, venue_id, y, x, 1
FROM seats
WHERE venue_id = @venue_id
ORDER BY y, x
LIMIT 1
In the recursive part, we proceed with the iteration:
SELECT
s.id, s.venue_id, s.y, s.x,
p_grouping + 1 + (s.x > p_x + 1 OR s.y != p_y)
FROM groupings, seats s
WHERE s.venue_id = p_venue_id AND (s.y, s.x) > (p_y, p_x)
ORDER BY s.venue_id, s.y, s.x
LIMIT 1
The WHERE condition, along with the ORDER BY and LIMIT clauses, simply finds the next seat: the one seat with the same venue id which, in order of (venue_id, y, x), has greater (y, x) coordinates.
The s.venue_id part of the ordering is crucial! This allows us to use the index.
The SELECT clause takes care of:

- computing the new grouping value (based on p_grouping);
- passing the current seat’s values (s.id, s.venue_id, s.y, s.x) on to the next cycle.

We select FROM groupings so that we fulfill the requirements for the CTE to be recursive.
What’s interesting here is that we use the recursive CTE essentially as an iterator, by selecting from the groupings table in the recursive subquery, while joining with seats in order to find the data to work on.
The JOIN is formally a cross join; however, only one record is returned, due to the LIMIT clause.
Unfortunately, the above query doesn’t work, because the ORDER BY clause is currently not supported in the recursive subquery; additionally, the semantics of the LIMIT as used here are not the intended ones, as it applies to the outermost query:

LIMIT is now supported […] The effect on the result set is the same as when using LIMIT in the outermost SELECT
However, it’s not a significant problem. Let’s have a look at the working version:
WITH RECURSIVE groupings (p_id, p_venue_id, p_y, p_x, p_grouping) AS
(
(
SELECT id, venue_id, y, x, 1
FROM seats
WHERE venue_id = @venue_id
ORDER BY y, x
LIMIT 1
)
UNION ALL
SELECT
s.id, s.venue_id, s.y, s.x,
p_grouping + 1 + (s.x > p_x + 1 OR s.y != p_y)
FROM groupings, seats s WHERE s.id = (
SELECT si.id
FROM seats si
WHERE si.venue_id = p_venue_id AND (si.y, si.x) > (p_y, p_x)
ORDER BY si.venue_id, si.y, si.x
LIMIT 1
)
)
SELECT * FROM groupings;
-- +-------+------+------+------------+
-- | p_id | p_y | p_x | p_grouping |
-- +-------+------+------+------------+
-- | 24887 | 0 | 0 | 1 |
-- | 27186 | 0 | 1 | 2 |
-- | 29485 | 1 | 0 | 4 |
-- | 31784 | 1 | 2 | 6 |
-- | 34083 | 2 | 0 | 8 |
-- +-------+------+------+------------+
It’s a bit of a shame having to use a subquery, but it works, and the boilerplate is minimal, as several clauses are required anyway.
Here, instead of performing the ordering and limiting on the relation resulting from the join of groupings and seats, we do it in a subquery, and pass the result to the outer query, which consequently selects only the target record.
Let’s have a look at the query plan, using the EXPLAIN ANALYZE functionality:
mysql> EXPLAIN ANALYZE WITH RECURSIVE groupings [...]
-> Table scan on groupings (actual time=0.000..0.001 rows=5 loops=1)
-> Materialize recursive CTE groupings (actual time=0.140..0.141 rows=5 loops=1)
-> Limit: 1 row(s) (actual time=0.019..0.019 rows=1 loops=1)
-> Index lookup on seats using venue_id_y_x (venue_id=(@venue_id)) (cost=0.75 rows=5) (actual time=0.018..0.018 rows=1 loops=1)
-> Repeat until convergence
-> Nested loop inner join (cost=3.43 rows=2) (actual time=0.017..0.053 rows=2 loops=2)
-> Scan new records on groupings (cost=2.73 rows=2) (actual time=0.001..0.001 rows=2 loops=2)
-> Filter: (s.id = (select #5)) (cost=0.30 rows=1) (actual time=0.020..0.020 rows=1 loops=5)
-> Single-row index lookup on s using PRIMARY (id=(select #5)) (cost=0.30 rows=1) (actual time=0.014..0.014 rows=1 loops=5)
-> Select #5 (subquery in condition; dependent)
-> Limit: 1 row(s) (actual time=0.007..0.008 rows=1 loops=9)
-> Filter: ((si.y,si.x) > (groupings.p_y,groupings.p_x)) (cost=0.75 rows=5) (actual time=0.007..0.007 rows=1 loops=9)
-> Index lookup on si using venue_id_y_x (venue_id=groupings.p_venue_id) (cost=0.75 rows=5) (actual time=0.006..0.006 rows=4 loops=9)
The plan is very much as expected. The foundation of an optimal plan, in this case, is the index lookups:
-> Nested loop inner join (cost=3.43 rows=2) (actual time=0.017..0.053 rows=2 loops=2)
-> Single-row index lookup on s using PRIMARY (id=(select #5)) (cost=0.30 rows=1) (actual time=0.014..0.014 rows=1 loops=5)
-> Index lookup on si using venue_id_y_x (venue_id=groupings.p_venue_id) (cost=0.75 rows=5) (actual time=0.006..0.006 rows=4 loops=9)
which are paramount; if even just an index scan is performed (in short, when the index entries are scanned linearly, instead of directly finding the desired one), the performance will tank.
Therefore, the requirements for this strategy to work are that the related indexes are in place, and that the optimizer uses them efficiently.
It’s expected that, in the future, if the restrictions are lifted, not having to use the subquery will make the task considerably simpler for the optimizer.
For particular use cases where an optimal plan can’t be found, just use a temporary table:
CREATE TEMPORARY TABLE selected_seats (
id INT NOT NULL PRIMARY KEY,
y INT,
x INT,
UNIQUE (y, x)
)
SELECT id, y, x
FROM seats WHERE venue_id = @venue_id;
WITH RECURSIVE
groupings (p_id, p_y, p_x, p_grouping) AS
(
(
SELECT id, y, x, 1
FROM seats
WHERE venue_id = @venue_id
ORDER BY y, x
LIMIT 1
)
UNION ALL
SELECT
s.id, s.y, s.x,
p_grouping + 1 + (s.x > p_x + 1 OR s.y != p_y)
FROM groupings, seats s WHERE s.id = (
SELECT ss.id
FROM selected_seats ss
WHERE (ss.y, ss.x) > (p_y, p_x)
ORDER BY ss.y, ss.x
LIMIT 1
)
)
SELECT * FROM groupings;
Even if index scans are performed in this query, they’re very cheap, as the selected_seats table is very small.
I’m very pleased that a very effective but flawed workflow can be replaced with the clean (enough) functionalities brought by MySQL 8.0.
There are still new (underlying) functionalities in development in the 8.0 series, which therefore keeps proving to be a very strong release.
Happy recursion 😄
While MySQL is not there yet, it’s now possible to cover a significant use case: storing denormalized columns (or arrays in general), and accessing them via index.
In this article I’ll give some context about denormalized data and indexes, including the workaround for such functionality on MySQL 5.7, and describe how this is (rather) cleanly accomplished on MySQL 8.0.
Although B-trees are technically inverted indexes, in this context I’ll use the “inverted index” term to describe document-oriented indexes, like PostgreSQL’s GIN or InnoDB’s fulltext index, and I’ll refer to B-trees with their name.
Also, I won’t make any distinction between B-trees and B+trees, using only the “B-tree” term.
MySQL doesn’t have an array data type. This is a fundamental problem in architectures where storing denormalized rows is a requirement, for example, where MySQL is (also) used for data warehousing.
Storage and access are two sides of the same coin: missing optimal storage data structures for a certain class of data almost certainly implies the lack of optimal related algorithms; in this case, it translates to lack of (direct) indexing.
Storing arrays is not a big problem in itself: assuming simple data types, like integers, we can easily adopt the workaround of using a VARCHAR/TEXT column to store the values with an arbitrary separator (space is the most convenient); however, MySQL is (was) not designed to index this scenario.
Again, we can adopt another workaround: fulltext indexes. We can either set the InnoDB fulltext minimum token size to 1 (which has the downside of being a global setting), or pad the values (which works, although it’s suboptimal in terms of storage).
This is a working solution, if one really needs it: it comes with the downsides of InnoDB’s fulltext index support, which are not few, but it’s good enough.
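A minimal sketch of that workaround (the table and values are hypothetical; values are zero-padded to 4 characters so they reach the default InnoDB fulltext minimum token size, avoiding the global setting change):

CREATE TABLE t_pseudo_array (
  id INT PRIMARY KEY AUTO_INCREMENT,
  c_values VARCHAR(255) NOT NULL, -- e.g. '0001 0002 0003'
  FULLTEXT (c_values)
);

INSERT INTO t_pseudo_array (c_values) VALUES ('0001 0002 0003'), ('0004 0005 0006');

SELECT * FROM t_pseudo_array WHERE MATCH (c_values) AGAINST ('0002' IN BOOLEAN MODE);
-- expected to return the first row only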
MySQL can store arrays since v5.7, through the JSON data type:
-- Note how we're using v8.0.19's new `ROW()` construct for inserting multiple rows.
--
CREATE TEMPORARY TABLE t_json_arrays(
id INT PRIMARY KEY AUTO_INCREMENT,
c_array JSON NOT NULL
)
SELECT *
FROM (
VALUES
ROW("[1, 2, 3]"),
ROW(JSON_ARRAY(4, 5, 6))
) v (c_array);
SELECT * FROM t_json_arrays;
-- +----+-----------+
-- | id | c_array |
-- +----+-----------+
-- | 1 | [1, 2, 3] |
-- | 2 | [4, 5, 6] |
-- +----+-----------+
We can insert a JSON document (array) either as a string, or using the JSON_ARRAY function.
Some operators are available for accessing the data stored in the JSON document, e.g. ->:
-- Functionality for accessing JSON data
--
SELECT id, c_array -> "$[1]" `array_entry_1` FROM t_json_arrays;
-- +----+---------------+
-- | id | array_entry_1 |
-- +----+---------------+
-- | 1 | 2 |
-- | 2 | 5 |
-- +----+---------------+
However, indexing has been introduced only with v8.0.17, along with new search functionalities:
-- This is a functional index.
--
ALTER TABLE t_json_arrays ADD KEY ( (CAST(c_array -> '$' AS UNSIGNED ARRAY)) );
SELECT * FROM t_json_arrays WHERE 3 MEMBER OF (c_array);
-- +----+-----------+
-- | id | c_array |
-- +----+-----------+
-- | 1 | [1, 2, 3] |
-- +----+-----------+
EXPLAIN FORMAT=TREE SELECT * FROM t_json_arrays WHERE 3 MEMBER OF (c_array -> '$');
-- -> Filter: json'3' member of (cast(json_extract(t_json_arrays.c_array,_utf8mb4'$') as unsigned array)) (cost=1.10 rows=1)
-- -> Index lookup on t_json_arrays using functional_index (cast(json_extract(t_json_arrays.c_array,_utf8mb4'$') as unsigned array)=json'3') (cost=1.10 rows=1)
Note how the WHERE condition must replicate exactly the functional key part (in this case, c_array -> '$').
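For comparison, a hypothetical counter-check (not part of the original example): if the expression doesn’t replicate the key part, we’d expect the multi-valued index not to be used:

EXPLAIN FORMAT=TREE SELECT * FROM t_json_arrays WHERE 3 MEMBER OF (c_array);
-- expected: a filter over a table scan, rather than an index lookup on functional_index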
According to the functionality worklog, the index is a slightly modified B-tree:
In general, multi-valued index is a regular functional index, with the exception that it requires additional handling under the hood on INSERT/UPDATE for multi-valued key parts.
SHOW INDEXES FROM t_json_arrays WHERE Key_name NOT LIKE 'PRIMARY'\G
-- *************************** 1. row ***************************
-- Table: t_json_arrays
-- Key_name: functional_index
-- Index_type: BTREE
-- [...]
Using a simple B-tree for this purpose has the mirror-opposite advantages and disadvantages of inverted indexes, the crucial difference being that the cost of operations increases linearly with the size of the stored array.
This is because B-trees don’t have optimizations for large/batch insertions (inverted indexes are document-oriented, so insertions are expected to be large); each array entry is one key in the index.
On the other hand, the DML cost is constant¹; there are no spikes caused by maintenance operations (i.e. index merging).
An interesting point is that:
Only one multi-valued key part is allowed per index, to avoid exponential explosion. E.g if there would be two multi-valued key parts, and server would provide 10 values for each, SE would have to store 100 index records.
Why is that? Because there are no convenient data structures for optimizing such a case.
With the current data structure, the tuple [1, 2], [4, 5] would generate the index keys: (1, 4), (1, 5), (2, 4), (2, 5).
Suppose that we tackled the problem by reducing the keys to a composition of each value of the first array with the whole second array: (1, 4, 5), (2, 4, 5). Then we couldn’t efficiently search in both arrays, since the index is usable only on the first element; for example, searching on (1, 4) could only look up the 1 entries, not the 4 ones.
Sound familiar? This is essentially the leftmost string prefix search problem.
The arrays of each tuple can still be indexed independently; such a configuration could probably lead to the index merge intersection optimization.
We’ve played with array storage and indexing; how about creating a column of the UNSIGNED ARRAY data type?
CREATE TEMPORARY TABLE t_json_arrays(
id INT PRIMARY KEY AUTO_INCREMENT,
c_array UNSIGNED ARRAY NOT NULL
);
-- ERROR 1064 (42000): You have an error in your SQL syntax [...] near 'UNSIGNED ARRAY NOT NULL
Ouch! There is currently no such data type. Internally, everything is done via JSON; the worklog explains this:
[…] server creates virtual generated column using the typed array field (instead of a regular field) for a function for which is_returns_array() method returns true. This WL adds one such function - CAST(… AS … ARRAY).
The typed array field (Field_typed_array class) essentially is a JSON field, a descendant of Field_json, but it reports itself as a regular field which type is typed array element’s type. […]
Adding a new data type would require a considerable amount of work; the team’s resources are evidently focused on other functionalities, so they released a good-enough feature which, in my opinion, is a balanced choice.
We’re very excited by the introduction of this functionality, and we’re in the process of migrating the fulltext indexes used for pseudo-arrays to JSON-based array columns/indexes; I think this is a very significant step in making MySQL a well-rounded RDBMS, and it covers an important use case in applications of a certain size.
¹: Insertion cost in B-trees is not constant, however, the maintenance cost (rebalancing) is negligible in this context.
Although this is not a strictly new concept in the MySQL world (indexed generated columns provided the same functionality), I find it worth reviewing, through some applications, notes and considerations.
All in all, I’m not 100% bought into functional indexes (as opposed to indexed generated columns); I’ll elaborate on this over the course of the article.
As a natural fit, generated columns are included in the article; additionally, some constructs build on my previous article, in relation to the subject of CTEs.
Updated on 12/Mar/2020: Found another bug.
In this article I’ll use the term “functional index” to refer to indexes both with an explicit underlying generated column (5.7) and without one (8.0).
Where I need to refer to the 8.0 version, I’ll use the term “Functional key part” (even if it may not be entirely appropriate).
Before explaining functional indexes, I’ll give a brief introduction to generated columns, since the former are built on top of the latter.
A generated column is a column whose content is a function of another column.
Virtual generated columns - the default type - take no storage; the alternative type, “stored”, actually stores the data. In this article I’ll refer exclusively to the virtual ones.
The syntax is simple: in the most minimal form, the definition is <column_name> <data_type> AS (<function>).
This is a sample table:
CREATE TEMPORARY TABLE t_generated_column
(
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
parameters JSON NOT NULL,
parameter_serial CHAR(4) AS (parameters ->> '$.serial')
);
INSERT INTO t_generated_column (parameters)
VALUES
('{"serial": "foo0", "reserved": true}'),
('{"serial": "bar1", "reserved": false}'),
('{"serial": "baz2", "reserved": false}');
There are a few interesting concepts here.
First, the fact that a JSON column is used to store documents; we’re using MySQL as a rudimentary document store.
This is an interesting use case for generated columns (and likely, the original driver). In a complex enough application, at some point documents may need to be stored; if their usage is not sophisticated enough to require an external storage engine, MySQL can act as a good-enough tool for the job, keeping the system architecture as simple as possible.
The way generated columns are defined, and work, is simple. In this case, the ->> operator (JSON inline path) is used, which is a shorthand for JSON_UNQUOTE(JSON_EXTRACT()). By default, JSON_EXTRACT includes quotes in the result (for strings), which we don’t need in this context.
Finally, we can’t specify a NOT NULL constraint on the generated column - attempting to do so will return a syntax error.
Let’s have a look at how the data looks on SELECTion:
SELECT * FROM t_generated_column;
-- +----+---------------------------------------+------------------+
-- | id | parameters | parameter_serial |
-- +----+---------------------------------------+------------------+
-- | 1 | {"serial": "foo0", "reserved": true} | foo0 |
-- | 2 | {"serial": "bar1", "reserved": false} | bar1 |
-- | 3 | {"serial": "baz2", "reserved": false} | baz2 |
-- +----+---------------------------------------+------------------+
Nice!
Storing the data with the intention of unindexed access definitely has use cases; however, in applications where a significant part of the access to this data is performed at the DB layer, indexing will be crucial.
Generated columns can be indexed like any other column - in MySQL 5.7, this was the only way to build a functional index.
This is the previous table, with the index added and sample data:
CREATE TEMPORARY TABLE t_indexed_generated_column
(
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
parameters JSON NOT NULL,
parameter_serial CHAR(4) AS (parameters ->> '$.serial'),
KEY (parameter_serial)
)
WITH RECURSIVE counter (n) AS
(
SELECT 0
UNION ALL
SELECT n + 1 FROM counter WHERE n + 1 < 100000
)
SELECT /*+ SET_VAR(cte_max_recursion_depth = 1M) */
CONCAT('{"serial": "', HEX(RANDOM_BYTES(2)), '"}') `parameters`
FROM counter;
ANALYZE TABLE t_indexed_generated_column;
Now we have a means to address the JSON document via an index (of course, limited to the specific field):
EXPLAIN FORMAT=TREE SELECT COUNT(*) FROM t_indexed_generated_column WHERE parameter_serial = 'CAFE';
-- -> Aggregate: count(0)
-- -> Index lookup on t_indexed_generated_column using parameter_serial (parameter_serial='CAFE') (cost=1.10 rows=1)
The functionality above applies also to MySQL versions prior to 8.0, however, the latest version lifted a restriction: the backing generated column is not required anymore. A specific name is also given: “Functional key parts”, because indexes can now be composed of both functions and column references.
Behind the scenes, there’s nothing really new; appropriately, the engineers recycled the existing functionality, so that functional indexes are backed by a hidden generated column.
Let’s create the table without the generated column, and fill it with random strings:
CREATE TEMPORARY TABLE t_functional_index
(
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
parameters JSON NOT NULL,
KEY ( (CAST(parameters ->> '$.serial' AS CHAR(4))) )
);
INSERT INTO t_functional_index (parameters)
WITH RECURSIVE counter (n) AS
(
SELECT 0
UNION ALL
SELECT n + 1 FROM counter WHERE n + 1 < 100000
)
SELECT /*+ SET_VAR(cte_max_recursion_depth = 1M) */
CONCAT('{"serial": "', HEX(RANDOM_BYTES(2)), '"}') `parameters`
FROM counter;
ANALYZE TABLE t_functional_index;
The syntax is conceptually the same as generated columns - the function is wrapped by round brackets (the surrounding spaces are cosmetic).
Note that in this case, we must CAST the extracted value to CHAR, because we “Cannot create a functional index on an expression that returns a BLOB or TEXT”: the return type of the implicit JSON_UNQUOTE function is LONGTEXT.
We’re also hitting a limitation of functional indexes - while with normal indexes we could specify an index prefix (thus, converting the LONGTEXT into a (VAR)CHAR), this is not possible with functional indexes.
Now let’s test the index:
EXPLAIN FORMAT=TREE SELECT COUNT(*) FROM t_functional_index WHERE parameters ->> '$.serial' = 'CAFE';
-- -> Aggregate: count(0)
-- -> Filter: (json_unquote(json_extract(t_functional_index.parameters,'$.serial')) = 'CAFE') (cost=10384.20 rows=100312)
-- -> Table scan on t_functional_index (cost=10384.20 rows=100312)
Nuts! A table scan. What happened?
I’ll summarize here a few gotchas with JSON functional indexes. While the expression exactness is obvious, the other two aren’t [so much 😉].
When using functional indexes, the match condition must be exact in order for the index to be used. This is because MySQL needs to evaluate expressions in a general form, and, although some expressions can certainly be transformed (and some actually are, by the optimizer), a sensible design choice is to shift the burden to the developer in some cases, including this one.
Let’s use a condition with the same function as the index definition:
EXPLAIN FORMAT=TREE SELECT COUNT(*) FROM t_functional_index WHERE CAST(parameters ->> '$.serial' AS CHAR(4)) = 'CAFE';
-- -> Aggregate: count(0)
-- -> Index lookup on t_functional_index using functional_index (cast(json_unquote(json_extract(t_functional_index.parameters,_utf8mb4'$.serial')) as char(4) charset utf8mb4)='CAFE') (cost=1.10 rows=1)
Even a minor change will make the optimizer discard the index:
EXPLAIN FORMAT=TREE SELECT COUNT(*) FROM t_functional_index WHERE CAST(parameters ->> '$.serial' AS CHAR(5)) = 'CAFE';
-- -> Aggregate: count(0)
-- -> Filter: (cast(json_unquote(json_extract(t_functional_index.parameters,'$.serial')) as char(5) charset utf8mb4) = 'CAFE') (cost=10384.20 rows=100312)
-- -> Table scan on t_functional_index (cost=10384.20 rows=100312)
Interestingly, if we use the generated-column-with-index form in place of the functional index, the index will be used:
EXPLAIN FORMAT=TREE SELECT COUNT(*) FROM t_indexed_generated_column WHERE parameters ->> '$.serial' = 'CAFE';
-- -> Aggregate: count(0)
-- -> Index lookup on t_indexed_generated_column using parameter_serial (parameter_serial='CAFE') (cost=1.10 rows=1)
So there is an inconsistency between a functional index and its generated-column-plus-index equivalent.
Let’s review the table definitions:
CREATE TEMPORARY TABLE t_indexed_generated_column
(
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
parameters JSON NOT NULL,
parameter_serial CHAR(4) AS (parameters ->> '$.serial'),
KEY (parameter_serial)
);
CREATE TEMPORARY TABLE t_functional_index
(
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
parameters JSON NOT NULL,
KEY ( (CAST(parameters ->> '$.serial' AS CHAR(4))) )
);
There is no obvious reason for the optimizer not to use the functional index; this would definitely be a worthwhile improvement, and would make functional indexes a more solid choice.
The combination of CAST and JSON_UNQUOTE required in the context of functional indexes/generated columns also has another unintended effect: different results, depending on the collation chosen by the query structure.
Let’s create a table with a generated column and an index:
CREATE TEMPORARY TABLE t_encoding_test
(
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
parameters JSON NOT NULL,
parameters_serial CHAR(4) AS (CAST(parameters ->> '$.serial' AS CHAR(4))),
KEY (parameters_serial)
)
SELECT '{"serial": "CAFE"}' `parameters`;
If a query uses the index indirectly (here we query on parameters, but the optimizer automatically uses the index on parameters_serial), we get a case-insensitive search:
SELECT COUNT(*) FROM t_encoding_test WHERE parameters ->> '$.serial' = 'CAFe';
-- +----------+
-- | COUNT(*) |
-- +----------+
-- | 1 |
-- +----------+
This happens because the CAST function used to build the index is associated with the system collation, which is case insensitive (by default, utf8mb4_0900_ai_ci).
However, if the index is not used:
SELECT COUNT(*) FROM t_encoding_test USE INDEX () WHERE parameters ->> '$.serial' = 'CAFe';
-- +----------+
-- | COUNT(*) |
-- +----------+
-- | 0 |
-- +----------+
the record is not matched! This is because the ->> operator uses JSON_UNQUOTE, whose hardcoded collation is utf8mb4_bin, which is case sensitive.
For more details, see the MySQL manpage or even the worklog.
Let’s take another example, and test the index:
CREATE TEMPORARY TABLE date_functional_index
(
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
created_at DATETIME NOT NULL,
INDEX ( (DATE(created_at)) )
);
INSERT INTO date_functional_index (created_at)
WITH RECURSIVE sequence (n) AS
(
SELECT 0
UNION ALL
SELECT n + 1 FROM sequence WHERE n + 1 < 100000
)
SELECT /*+ SET_VAR(cte_max_recursion_depth = 100K) */
NOW() - INTERVAL (90 * RAND()) DAY `created_at`
FROM sequence;
ANALYZE TABLE date_functional_index;
(There are two issues in relation to this test; the details are given below)
Let’s test the index access:
EXPLAIN FORMAT=TREE SELECT COUNT(*) FROM date_functional_index WHERE DATE(created_at) = CURDATE();
-- -> Aggregate: count(0)
-- -> Index lookup on date_functional_index using functional_index (cast(date_functional_index.created_at as date)=curdate()) (cost=668.80 rows=608)
Works as expected; with this data type, we don’t need to deal with BLOBs and/or collations.
How about joins?
EXPLAIN FORMAT=TREE
WITH RECURSIVE dates_range (d) AS
(
SELECT CURDATE() - INTERVAL 90 DAY
UNION ALL
SELECT d + INTERVAL 1 DAY FROM dates_range WHERE d + INTERVAL 1 day <= CURDATE()
)
SELECT d, COUNT(id)
FROM
dates_range
LEFT JOIN date_functional_index ON d = DATE(created_at)
GROUP BY d;
-- -> Table scan on <temporary>
-- -> Aggregate using temporary table
-- -> Nested loop left join
-- -> Table scan on dates_range
-- -> [...]
-- -> Filter: (dates_range.d = cast(date_functional_index.created_at as date)) (cost=3429.97 rows=100649)
-- -> Table scan on date_functional_index (cost=3429.97 rows=100649)
Ouch! The index is not used; this is definitely something that needs to be considered.
Indexes on generated columns exhibit the same behavior, however, we can perform the join against the generated column, whose index is then used by the optimizer:
CREATE TEMPORARY TABLE date_generated_column_functional_index
(
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
created_at DATETIME NOT NULL,
created_at_date DATE AS (DATE(created_at)),
INDEX (created_at_date)
)
WITH RECURSIVE sequence (n) AS
(
SELECT 0
UNION ALL
SELECT n + 1 FROM sequence WHERE n + 1 < 100000
)
SELECT /*+ SET_VAR(cte_max_recursion_depth = 100K) */
NOW() - INTERVAL (90 * RAND()) DAY `created_at`
FROM sequence;
ANALYZE TABLE date_generated_column_functional_index;
EXPLAIN FORMAT=TREE
WITH RECURSIVE dates_range (d) AS
(
SELECT CURDATE() - INTERVAL 90 DAY
UNION ALL
SELECT d + INTERVAL 1 DAY FROM dates_range WHERE d + INTERVAL 1 day <= CURDATE()
)
SELECT d, COUNT(id)
FROM
dates_range
LEFT JOIN date_generated_column_functional_index ON d = created_at_date
GROUP BY d;
-- -> Table scan on <temporary>
-- -> Aggregate using temporary table
-- -> Nested loop left join
-- -> Table scan on dates_range
-- -> [...]
-- -> Index lookup on date_generated_column_functional_index using created_at_date (created_at_date=dates_range.d) (cost=36.18 rows=1026)
Therefore, it’s not possible to use functional key parts with JOINs at all, while it’s possible with indexed generated columns. This makes functional key parts less appealing, when considering the overall design.
I’ve filed this as a feature request.
CREATE TABLE ... SELECT

In some of the previous queries I’ve used CREATE TABLE + INSERT instead of CREATE TABLE ... SELECT. Why?
Because of a bug:
CREATE TEMPORARY TABLE bug_functional_index (
sold_on DATETIME NOT NULL,
INDEX sold_on_date ((DATE(sold_on)))
)
SELECT NOW() `sold_on`;
-- ERROR 3105 (HY000): The value specified for generated column '3351ae78dcbae4f473d53aebdc350681' in table 'bug_functional_index' is not allowed.
The above should work, considering that the split form works fine:
CREATE TEMPORARY TABLE bug_functional_index (
sold_on DATETIME NOT NULL,
INDEX sold_on_date ((DATE(sold_on)))
);
INSERT INTO bug_functional_index VALUES (NOW());
-- Query OK, 1 row affected (0,00 sec)
I’ve reported this to the MySQL bug tracker.
LOAD DATA INFILE

There is also an additional bug: LOAD DATA INFILE statements will fail if the columns are not explicitly specified:
echo '[]' > /tmp/test_data.csv
mysql <<'SQL'
CREATE SCHEMA IF NOT EXISTS tmp;
CREATE TEMPORARY TABLE tmp.issue_load_data_on_functional_index
(
json_col JSON,
KEY json_col ( (CAST(json_col -> '$' AS UNSIGNED ARRAY)) )
);
LOAD DATA INFILE '/tmp/test_data.csv' INTO TABLE tmp.issue_load_data_on_functional_index;
SQL
# ERROR 1261 (01000) at line 9: Row 1 doesn't contain data for all columns
The workaround is to explicitly specify the columns:
LOAD DATA INFILE '/tmp/test_data.csv' INTO TABLE tmp.issue_load_data_on_functional_index (json_col);
I’ve reported this bug as well.
I’m not bought into functional key parts.
While I find functional indexes an important feature of solid, modern RDBMSs, I think that the functional key parts feature itself needs some time to mature, especially considering that indexed generated columns can do the same work (with some exceptions, e.g. multi-valued indexing).
Now moving on to another new 8.0 interesting feature (window functions!) 😄
As of MySQL 8.0, this functionality is still not supported in a general sense; however, it’s now possible to generate a sequence to be used within a single query.
In this article, I’ll give a brief introduction to CTEs, and explain how to build different sequence generators; additionally, I’ll introduce the new (cool) MySQL 8.0 query hint SET_VAR
, and a pinch of virtual columns and functional indexes (“functional key parts”, another MySQL 8.0 feature).
Roughly, Common Table Expressions (CTEs) can be thought of as ephemeral views or temporary tables.
CTEs bring very significant advantages, one of the most important being recursion, which, barring hacks, wasn’t supported before.
The simplest syntax is:
WITH <cte_name> (<columns>) AS
(
<cte_query>
)
<main_query>
for example¹:
CREATE TABLE line_items(
item_number INT UNSIGNED PRIMARY KEY,
item_total DECIMAL(8,2) NOT NULL,
order_number INT UNSIGNED NOT NULL
);
INSERT INTO line_items VALUES
(1, 10, 1),
(2, 10, 1),
(3, 15, 2)
;
WITH order_totals(order_number, order_total) AS
(
SELECT order_number, SUM(item_total) `order_total`
FROM line_items
GROUP BY order_number
)
SELECT item_number, item_total, order_number, order_total
FROM line_items
JOIN order_totals USING (order_number)
;
-- +-------------+------------+--------------+-------------+
-- | item_number | item_total | order_number | order_total |
-- +-------------+------------+--------------+-------------+
-- | 1 | 10.00 | 1 | 20.00 |
-- | 2 | 10.00 | 1 | 20.00 |
-- | 3 | 15.00 | 2 | 15.00 |
-- +-------------+------------+--------------+-------------+
The syntax is intuitive; in this example, it’s used very much like a temporary table, with the advantage that no cleanup (DROP TEMPORARY TABLE) is needed.
If one has to create a table filled with integers, say, as an example for a blog post 😉, the common approach is to use extended INSERTs (the form that stores multiple rows in one statement).
We can accomplish this more elegantly with a CTE, specifically, with a recursive one.
The syntax of recursive CTEs is:
WITH RECURSIVE <cte_name> (<columns>) AS
(
<base_case_query>
UNION ALL
<recursive_step_query> -- invoke the CTE here!
)
<main_query>
The concept we apply here is to simulate iteration via recursion (more on this later).
Straight to the generator!:
-- Create a table with the integers in the range [0, 10].
--
CREATE TABLE int_sequence
WITH RECURSIVE sequence (n) AS
(
SELECT 0
UNION ALL
SELECT n + 1 FROM sequence WHERE n + 1 <= 10
)
SELECT n
FROM sequence;
The table creation syntax may be slightly odd - one may expect CREATE TABLE to be below the WITH clause - but the working is straightforward.
When the SELECT invokes the CTE, the base case query is executed first (SELECT 0); then the recursive step query is executed repeatedly, each iteration referencing the rows produced by the previous one, until no new rows are generated.
This is, all in all, simple. However, something important to pay attention to is the termination condition: WHERE n + 1 <= 10. Why not use WHERE n <= ...?
Because this is a part where it’s easy to make a fencepost error. Let’s see the wrong case:
-- Attempt to select the integers in the range [0, 10], the wrong way.
--
WITH RECURSIVE sequence (n) AS
(
SELECT 0
UNION ALL
SELECT n + 1 FROM sequence WHERE n <= 10
)
SELECT n
FROM sequence;
What happens here is that one confuses the returned row with the last verified condition. On the last two steps:

- n = 10: the condition (10 <= 10) holds, so SELECT n + 1 is executed, returning 11;
- n = 11: the condition fails and the recursion stops, but 11 has already been added to the result.

Now, two alternatives are the conditions WHERE n <= 9 or WHERE n < 10; while they are correct, they may be less intuitive than WHERE n + 1 <= 10, which mimics the SELECTed expression.
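A quick way to verify the off-by-one (a standalone check, not from the original article): the “wrong” version returns 12 rows, 0 through 11.

WITH RECURSIVE sequence (n) AS
(
  SELECT 0
  UNION ALL
  SELECT n + 1 FROM sequence WHERE n <= 10
)
SELECT COUNT(*) `rows`, MAX(n) `max_n` FROM sequence;
-- +------+-------+
-- | rows | max_n |
-- +------+-------+
-- |   12 |    11 |
-- +------+-------+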
I’ll conclude with two final notes.
First, we’re using recursion as a way of performing iteration; this is subject to the same criticism of teaching recursion via Fibonacci series: it can arguably be considered as an overengineered/underperforming solution to a problem.
I don’t take any position in this case; however, my personal order of increasing elegance for filling a table with a series of numbers is:

1. extended INSERTs,
2. recursive CTEs,
3. a native sequence generator.

Since MySQL doesn’t provide 3., I’m happy to use 2. 😬.
The second note is more interesting, and I’ll highlight it with a dedicated section.
MySQL limits the number of recursions to 1000 by default, via the cte_max_recursion_depth sysvar.
Now, if we want to generate a long sequence, we should raise cte_max_recursion_depth, run the query, then restore the variable to its previous value. This procedure consists of three statements, which is of course inconvenient. What do we do?
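Spelled out, the inconvenient version would look something like this (a sketch; the depth value is arbitrary):

SET SESSION cte_max_recursion_depth = 1000000;
-- ... run the recursive CTE query here ...
SET SESSION cte_max_recursion_depth = DEFAULT;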
Enter the per-statement variables setting.
This is a lesser-known new MySQL 8.0 feature, which comes in very handy where needed.
In short, SET_VAR is a query hint that allows one or more variables to be set exclusively within the scope of a statement.
In this case, if we want to generate a sequence of 1M numbers, we set cte_max_recursion_depth:
-- Select the integers in the range [0, 1000000].
--
WITH RECURSIVE sequence (n) AS
(
SELECT 0
UNION ALL
SELECT n + 1 FROM sequence WHERE n + 1 <= 1000000
)
SELECT /*+ SET_VAR(cte_max_recursion_depth = 1M) */
n
FROM sequence;
(I’ve actually opened a bug report suggesting to mention this hint in the CTE manpage.)
If we want to create random numbers, we use RAND()² and SELECT only the associated expression:
-- Create a table with 1000 random integers in the range [0, 65536).
--
CREATE TABLE random_int_sequence
WITH RECURSIVE sequence (n) AS
(
SELECT 0
UNION ALL
SELECT n + 1 FROM sequence WHERE n + 1 < 1000
)
SELECT FLOOR(65536 * RAND()) `rand_n`
FROM sequence;
Nothing prohibits us from generating a sequence of characters; in this case, we’ll use the CHAR() and ORD() functions to increment the current value:
CREATE TABLE random_char_sequence
WITH RECURSIVE sequence (c) AS
(
SELECT 'A'
UNION ALL
SELECT CHAR(ORD(c) + 1 USING ASCII) FROM sequence WHERE CHAR(ORD(c) + 1 USING ASCII) <= 'Z'
)
SELECT c
FROM sequence;
Finally, we’ll generate a range of dates.
In this section, it’s worth mentioning an interesting usage. Suppose one is reporting monthly sales. Is this query correct?:
-- Underlying table structure.
--
-- CREATE TABLE line_items(
-- id INT UNSIGNED PRIMARY KEY,
-- total DECIMAL(8,2) NOT NULL,
-- sold_on DATETIME NOT NULL
-- );
SELECT YEAR(sold_on) `sale_year`, MONTH(sold_on) `sale_month`, SUM(total) `month_sales`
FROM line_items
GROUP BY sale_year, sale_month;
The answer is: it depends on the requirements.
If the requirement is that all the months must be displayed, one may miss rows for months when there are no sales.
A solution is to use a sequence with all the months in the required interval, and (left) join the CTE with the table.
Let’s prepare some data (via CTE, of course! 😉), for a few months (except the current):
CREATE TABLE line_items(
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
total DECIMAL(8,2) NOT NULL,
sold_on DATETIME NOT NULL,
sold_on_date DATE AS (DATE(sold_on)),
KEY (sold_on_date)
)
WITH RECURSIVE sequence (n) AS
(
SELECT 0
UNION ALL
SELECT n + 1 FROM sequence WHERE n + 1 < 100000
)
SELECT /*+ SET_VAR(cte_max_recursion_depth = 1M) */
CAST(20 * RAND() AS DECIMAL) `total`,
NOW() - INTERVAL DAYOFMONTH(CURDATE()) DAY - INTERVAL (100 * RAND()) DAY `sold_on`
FROM sequence;
There are a couple of interesting concepts here.
The first is that by using NOW() - INTERVAL DAYOFMONTH(CURDATE()) DAY as a base, we ensure that we don’t store any sales for the current month.
The second is that, in order to perform an efficient left join, a functional index is required; there are a few considerations about this subject, which I’ll leave to a separate article.
Additionally, note that floating-point INTERVALs are rounded (but it’s irrelevant in this context).
Now we can query!
WITH RECURSIVE dates_range (d) AS
(
SELECT CURDATE() - INTERVAL 124 DAY
UNION ALL
SELECT d + INTERVAL 1 DAY FROM dates_range WHERE d + INTERVAL 1 day <= CURDATE()
)
SELECT YEAR(d) `sales_year`, MONTH(d) `sales_month`, SUM(total) `month_total_sales`
FROM
dates_range
LEFT JOIN line_items ON d = sold_on_date
GROUP BY sales_year, sales_month
ORDER BY sales_year, sales_month;
-- +------------+-------------+-------------------+
-- | sales_year | sales_month | month_total_sales |
-- +------------+-------------+-------------------+
-- | 2019 | 11 | 27895.00 |
-- | 2019 | 12 | 331700.00 |
-- | 2020 | 1 | 335775.00 |
-- | 2020 | 2 | 306289.00 |
-- | 2020 | 3 | NULL |
-- +------------+-------------+-------------------+
Excellent. The current month is displayed, as intended, even if it has no sales.
Let’s check the optimizer plan (note that I’ve removed the ORDER BY clause for simplicity):
EXPLAIN FORMAT=TREE
WITH RECURSIVE dates_range (d) AS
(
SELECT CURDATE() - INTERVAL 124 DAY
UNION ALL
SELECT d + INTERVAL 1 DAY FROM dates_range WHERE d + INTERVAL 1 day <= CURDATE()
)
SELECT YEAR(d) `sales_year`, MONTH(d) `sales_month`, SUM(total) `month_total_sales`
FROM
dates_range
LEFT JOIN line_items ON d = sold_on_date
GROUP BY sales_year, sales_month\G
-- *************************** 1. row ***************************
-- EXPLAIN: -> Table scan on <temporary>
-- -> Aggregate using temporary table
-- -> Nested loop left join
-- -> Table scan on dates_range
-- -> Materialize recursive CTE dates_range
-- -> Rows fetched before execution
-- -> Repeat until convergence
-- -> Filter: ((dates_range.d + interval 1 day) <= <cache>(curdate())) (cost=2.73 rows=2)
-- -> Scan new records on dates_range (cost=2.73 rows=2)
-- -> Index lookup on line_items using sold_on_date (sold_on_date=dates_range.d) (cost=0.28 rows=1)
The plan has a few interesting points, but they are left to the reader, since they are out of the scope of this article.
MySQL 8.0 brought many very interesting features. Although sequences/generators are still not fully supported, we can use the (very flexible) CTEs to cover a part of the use cases.
Happy querying with MySQL 8.0!
¹: Please note that real-world schemas are generally designed differently, and this example has been written with simplicity in mind instead.
²: Remember that RAND() is not a cryptographically secure function.
I’ve already published two posts on two specific issues; in this article, I’ll give the complete picture.
As usual, I’ll use this post to introduce tooling concepts that may be useful in generic system administration.
The presentation code is hosted on a GitHub repository (including the source files and the output slides in PDF format), and on Slideshare.
The following are the basic issues to handle when migrating - most notably, the move to the utf8mb4 charset and the utf8mb4_0900_ai_ci collation.
Of course, the larger the scale, the more aspects will need to be considered; for example, large-scale write-bound systems may need to handle further concerns, beyond those covered here.
In this article, I’ll only deal with what can be reasonably considered the lowest common denominator of all the migrations.
All the SQL examples are executed on MySQL 8.0.
utf8mb4/utf8mb4_0900_ai_ci

(Reference: converting from the utf8 to the utf8mb4 charset.)

MySQL introduces a new collation - utf8mb4_0900_ai_ci. Why?
Basically, it’s an improved version of the general_ci version - it supports Unicode 9.0, it irons out a few issues, and it’s faster.
The collation utf8(mb4)_general_ci wasn’t entirely correct; a typical example is Å:
-- Å = U+212B
SELECT "sÅverio" = "saverio" COLLATE utf8mb4_general_ci;
-- +--------+
-- | result |
-- +--------+
-- | 0 |
-- +--------+
SELECT "sÅverio" = "saverio"; -- Default (COLLATE utf8mb4_0900_ai_ci);
-- +--------+
-- | result |
-- +--------+
-- | 1 |
-- +--------+
From this, you can also guess what ai_ci means: accent insensitive / case insensitive.
So, what’s the problem?
Legacy.
Technically, utf8mb4 has been available in MySQL for a long time. At least a part of the industry started the migration long ago, and publicly documented the process.
However, at that time, only utf8mb4_general_ci was available. Therefore, a vast amount of documentation around suggests moving to that collation.
While this is not an issue per se, it is a big issue when considering that the two collations are incompatible.
For people who like (and frequently use) them, regular expressions are a fundamental tool.
In particular when performing administration tasks (using them in an application for data matching is a different topic), they can streamline some queries, avoiding lengthy concatenations of conditions.
In particular, I find them practical as a sophisticated SHOW <object> supplement.
SHOW <object>, in MySQL, supports LIKE; however, it’s fairly limited in functionality, for example:
SHOW GLOBAL VARIABLES LIKE 'character_set%'
-- +--------------------------+-------------------------------------------------------------------------+
-- | Variable_name | Value |
-- +--------------------------+-------------------------------------------------------------------------+
-- | character_set_client | utf8mb4 |
-- | character_set_connection | utf8mb4 |
-- | character_set_database | utf8mb4 |
-- | character_set_filesystem | binary |
-- | character_set_results | utf8mb4 |
-- | character_set_server | utf8mb4 |
-- | character_set_system | utf8 |
-- | character_sets_dir | /home/saverio/local/mysql-8.0.19-linux-glibc2.12-x86_64/share/charsets/ |
-- +--------------------------+-------------------------------------------------------------------------+
Let’s turbocharge it!
Let’s get all the meaningful charset-related variables, but not one more, in a single swoop:
SHOW GLOBAL VARIABLES WHERE Variable_name RLIKE '^(character_set|collation)_' AND Variable_name NOT RLIKE 'system|data';
-- +--------------------------+--------------------+
-- | Variable_name | Value |
-- +--------------------------+--------------------+
-- | character_set_client | utf8mb4 |
-- | character_set_connection | utf8mb4 |
-- | character_set_results | utf8mb4 |
-- | character_set_server | utf8mb4 |
-- | collation_connection | utf8mb4_general_ci |
-- | collation_server | utf8mb4_general_ci |
-- +--------------------------+--------------------+
Nice. The first regex reads: “string starting with (^) either character_set or collation, followed by _”. Note that if we don’t group character_set and collation (via (…)), the ^ metacharacter applies only to the first.
Character set and collation are a very big deal, because changing them in this case requires literally (in a literal sense 😉) rebuilding the entire database - all the records (and related indexes) including strings will need to be rebuilt.
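For reference, the rebuild boils down to a statement like the following, per table (the table name is hypothetical; on large production tables this is typically run through an online schema change tool):

ALTER TABLE mytable CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;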
In order to understand the concepts, let’s have a look at the MySQL server settings again; I’ll reorder and explain them.
Literals sent by the client are assumed to be in the following charset:

- character_set_client (default: utf8mb4)

after which they’re converted and processed by the server, using:

- character_set_connection (default: utf8mb4)
- collation_connection (default: utf8mb4_0900_ai_ci)

The above settings are crucial, as literals are a foundation for exchanging data with the server. For example, when an ORM inserts data in a database, it creates an INSERT with a set of literals.
When the database system sends the results, it sends them in the following charset:

- character_set_results (default: utf8mb4)

Literals are not the only foundation. Database objects are the other side of the coin. Base defaults for database objects (e.g. the databases) use:

- character_set_server (default: utf8mb4)
- collation_server (default: utf8mb4_0900_ai_ci)

Some developers would define a string as a stream of bytes; this is not entirely correct.
To be exact, a string is a stream of bytes associated with a character set.
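A tiny standalone illustration of this (the character is arbitrary): the same character corresponds to different byte streams depending on the charset.

SELECT HEX(CONVERT('ä' USING utf8mb4)) `utf8mb4_bytes`, HEX(CONVERT('ä' USING latin1)) `latin1_bytes`;
-- +---------------+--------------+
-- | utf8mb4_bytes | latin1_bytes |
-- +---------------+--------------+
-- | C3A4          | E4           |
-- +---------------+--------------+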
Now, this concept applies to strings in isolation. How about operations on sets of strings, e.g. comparisons?
In a similar way, we need another concept: the “collation”.
A collation is a set of rules that defines how strings are sorted, which is required to perform comparisons.
In a database system, a collation is associated to objects and literals, both through system and specific defaults: a column, for example, will have its own collation, while a literal will use the default, if not specified.
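As a minimal illustration (my own example, not from the original), the same literals compare differently under the default accent/case-insensitive collation and an explicit binary one:
SELECT 'abc' = 'ABC' `default_ci`, 'abc' = 'ABC' COLLATE utf8mb4_bin `explicit_bin`;
-- default_ci:   1 (the default utf8mb4_0900_ai_ci is case-insensitive)
-- explicit_bin: 0 (the binary collation compares code points, so case matters)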
But when comparing two strings with different collations, how is it decided which collation to use?
Enter the “Collation coercibility”.
general <> 0900_ai
Reference: Collation Coercibility in Expressions
Coercibility is a property associated to each value in an expression, which defines the priority of its collation in the context of a comparison.
MySQL has seven coercibility values:
0: An explicit COLLATE clause (not coercible at all)
1: The concatenation of two strings with different collations
2: The collation of a column or a stored routine parameter or local variable
3: A "system constant" (the string returned by functions such as USER() or VERSION())
4: The collation of a literal
5: The collation of a numeric or temporal value
6: NULL or an expression that is derived from NULL
It's not necessary to know them by heart, since their ordering makes sense, but it's important to know how the main ones work in the context of a migration.
What we want to know is what happens in the workflow of a migration, in particular, if we:
Let’s create a table with all the related collations:
CREATE TABLE chartest (
c3_gen CHAR(1) CHARACTER SET utf8mb3 COLLATE utf8mb3_general_ci,
c4_gen CHAR(1) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
c4_900 CHAR(1) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci
);
INSERT INTO chartest VALUES('ä', 'ä', 'ä');
Note how we insert characters in the Basic Multilingual Plane (BMP, essentially, the one supported by utf8mb3) - we're simulating a database where we only changed the defaults, not the data.
Let's compare with a BMP utf8mb4 literal:
SELECT c3_gen = 'ä' `result` FROM chartest;
-- +--------+
-- | result |
-- +--------+
-- | 1 |
-- +--------+
Nice; it works. Coercion values: the column has coercibility 2, the literal 4, so the column's collation wins, and the literal is converted (successfully, since it's a BMP character).
More critical: we compare against a character in the Supplementary Multilingual Plane (SMP, essentially, one added by utf8mb4), with an explicit collation:
SELECT c3_gen = '🍕' COLLATE utf8mb4_0900_ai_ci `result` FROM chartest;
-- +--------+
-- | result |
-- +--------+
-- | 0 |
-- +--------+
Coercion values: the explicit COLLATE clause has coercibility 0, while the column has 2; therefore, MySQL converts the first value and uses the explicit collation.
Most critical: compare against a character in the SMP, without implicit collation:
SELECT c3_gen = '🍕' `result` FROM chartest;
ERROR 1267 (HY000): Illegal mix of collations (utf8_general_ci,IMPLICIT) and (utf8mb4_general_ci,COERCIBLE) for operation '='
WAT!!
Weird?
Well, this is because:
MySQL tries to coerce the charset/collation to the column’s one, and fails!
This gives a clear indication to the migration: do not allow SMP characters in the system, until the entire dataset has been migrated.
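A quick way to verify the constraint is to look for 4-byte sequences directly. This is a sketch of mine (the table/column names are hypothetical), based on the fact that, in valid UTF-8, only 4-byte sequences start with a 0xF0-0xF4 byte:
SELECT id
FROM comments
WHERE HEX(description) RLIKE '^(..)*F[0-4]';
-- Any row returned contains at least one SMP character, which needs to be dealt with
-- before the migration completes.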
Now, let’s see what happens between columns!
SELECT COUNT(*) FROM chartest a JOIN chartest b ON a.c3_gen = b.c4_gen;
-- +----------+
-- | COUNT(*) |
-- +----------+
-- | 1 |
-- +----------+
SELECT COUNT(*) FROM chartest a JOIN chartest b ON a.c3_gen = b.c4_900;
-- +----------+
-- | COUNT(*) |
-- +----------+
-- | 1 |
-- +----------+
SELECT COUNT(*) FROM chartest a JOIN chartest b ON a.c4_gen = b.c4_900;
ERROR 1267 (HY000): Illegal mix of collations (utf8mb4_general_ci,IMPLICIT) and (utf8mb4_0900_ai_ci,IMPLICIT) for operation '='
Ouch. BIG OUCH!
Why?
This is what happens to people who migrated, referring to obsolete documentation, to utf8mb4_general_ci
- they can’t easily migrate to the new collation.
The migration path outlined:
is viable for production systems.
There’s another unexpected property of the new collation.
Let’s simulate MySQL 5.7:
-- Not exact, but close enough
--
SELECT '' = _utf8' ' COLLATE utf8_general_ci;
-- +---------------------------------------+
-- | '' = _utf8' ' COLLATE utf8_general_ci |
-- +---------------------------------------+
-- | 1 |
-- +---------------------------------------+
How does this work on MySQL 8.0?:
-- Current (8.0):
--
SELECT '' = ' ';
-- +----------+
-- | '' = ' ' |
-- +----------+
-- | 0 |
-- +----------+
Ouch!
Where does this behavior come from? Let’s get some more info from the collations (with a regular expression, of course 😉):
SHOW COLLATION WHERE Collation RLIKE 'utf8mb4_general_ci|utf8mb4_0900_ai_ci';
-- +--------------------+---------+-----+---------+----------+---------+---------------+
-- | Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute |
-- +--------------------+---------+-----+---------+----------+---------+---------------+
-- | utf8mb4_0900_ai_ci | utf8mb4 | 255 | Yes | Yes | 0 | NO PAD |
-- | utf8mb4_general_ci | utf8mb4 | 45 | | Yes | 1 | PAD SPACE |
-- +--------------------+---------+-----+---------+----------+---------+---------------+
Hmmmm 🤔. Let’s have a look at the formal rules from the SQL (2003) standard (section 8.2):
3) The comparison of two character strings is determined as follows:
a) Let CS be the collation […]
b) If the length in characters of X is not equal to the length in characters of Y, then the shorter string is effectively replaced, for the purposes of comparison, with a copy of itself that has been extended to the length of the longer string by concatenation on the right of one or more pad characters, where the pad character is chosen based on CS. If CS has the NO PAD characteristic, then the pad character is an implementation-dependent character different from any character in the character set of X and Y that collates less than any string under CS. Otherwise, the pad character is a space.
In other words: the new collation does not pad.
This is not a big deal. Just trim the data before migrating, and make 100% sure that new trailing spaces are not introduced by the application before the migration is completed.
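A minimal sketch of the cleanup (my own, with hypothetical table/column names); the search intentionally uses LIKE, since, unlike =, it doesn't pad, so it can find the offending rows while the old PAD SPACE collation is still in place:
UPDATE comments
SET description = TRIM(TRAILING ' ' FROM description)
WHERE description LIKE '% ';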
Triggers are fairly easy to handle, as they can be dropped/rebuilt with the new settings - just make sure to consider comparisons inside the trigger body.
Sample of a trigger (edited):
SHOW CREATE TRIGGER enqueue_comments_update_instance_event\G
-- SQL Original Statement:
CREATE TRIGGER `enqueue_comments_update_instance_event`
AFTER UPDATE ON `comments`
FOR EACH ROW
trigger_body: BEGIN
SET @changed_fields := NULL;
IF NOT (OLD.description <=> NEW.description COLLATE utf8_bin AND CHAR_LENGTH(OLD.description) <=> CHAR_LENGTH(NEW.description)) THEN
SET @changed_fields := CONCAT_WS(',', @changed_fields, 'description');
END IF;
IF @changed_fields IS NOT NULL THEN
SET @old_values := NULL;
SET @new_values := NULL;
INSERT INTO instance_events(created_at, instance_type, instance_id, operation, changed_fields, old_values, new_values)
VALUES(NOW(), 'Comment', NEW.id, 'UPDATE', @changed_fields, @old_values, @new_values);
END IF;
END
-- character_set_client: utf8mb4
-- collation_connection: utf8mb4_0900_ai_ci
-- Database Collation: utf8mb4_0900_ai_ci
As you see, a trigger has associated charset/collation settings. This is because, unlike a statement, it's not sent by a client, so it needs to carry its own settings.
In the trigger above, dropping/recreating in the context of a system with the new default works, however, it’s not enough - there’s a comparison in the body!
Conclusion: don’t forget to look inside the triggers. Or better, make sure you have a solid test suite 😉.
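As a starting point for the review, one can grep the trigger bodies via the information schema; this is just a sketch (the regex only catches the collations mentioned in this article):
SELECT TRIGGER_SCHEMA, TRIGGER_NAME
FROM information_schema.TRIGGERS
WHERE ACTION_STATEMENT RLIKE 'utf8(mb3)?_(bin|general_ci)';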
We’ve been long time users of MySQL triggers. They make a wonderful callback system.
When a system grows, it’s increasingly hard (tipping into the unmaintainable) to maintain application-level callbacks. Triggers will never miss any database update, and with a logic like the above, a queue processor can process the database changes.
Now that we’ve examined the compatibility, let’s examine the performance aspect.
Indexes are still usable cross-charset, due to automatic conversion performed by MySQL. The point to be aware of is that the values are converted after being read from the index.
Let’s create test tables:
CREATE TABLE indextest3 (
c3 CHAR(1) CHARACTER SET utf8,
KEY (c3)
);
INSERT INTO indextest3 VALUES ('a'), ('b'), ('c'), ('d'), ('e'), ('f'), ('g'), ('h'), ('i'), ('j'), ('k'), ('l'), ('m');
CREATE TABLE indextest4 (
c4 CHAR(1) CHARACTER SET utf8mb4,
KEY (c4)
);
INSERT INTO indextest4 SELECT * FROM indextest3;
Querying against a constant yields interesting results:
EXPLAIN FORMAT=TREE SELECT COUNT(*) FROM indextest4 WHERE c4 = _utf8'n'\G
-- -> Aggregate: count(0)
-- -> Filter: (indextest4.c4 = 'n') (cost=0.35 rows=1)
-- -> Index lookup on indextest4 using c4 (c4='n') (cost=0.35 rows=1)
MySQL recognizes that n
is a valid utf8mb4 character, and matches it directly.
Against a column with index:
EXPLAIN SELECT COUNT(*) FROM indextest3 JOIN indextest4 ON c3 = c4;
-- +----+-------------+------------+------------+-------+---------------+------+---------+------+------+----------+--------------------------+
-- | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
-- +----+-------------+------------+------------+-------+---------------+------+---------+------+------+----------+--------------------------+
-- | 1 | SIMPLE | indextest3 | NULL | index | NULL | c3 | 4 | NULL | 13 | 100.00 | Using index |
-- | 1 | SIMPLE | indextest4 | NULL | ref | c4 | c4 | 5 | func | 1 | 100.00 | Using where; Using index |
-- +----+-------------+------------+------------+-------+---------------+------+---------+------+------+----------+--------------------------+
EXPLAIN FORMAT=TREE SELECT COUNT(*) FROM indextest3 JOIN indextest4 ON c3 = c4\G
-- -> Aggregate: count(0)
-- -> Nested loop inner join (cost=6.10 rows=13)
-- -> Index scan on indextest3 using c3 (cost=1.55 rows=13)
-- -> Filter: (convert(indextest3.c3 using utf8mb4) = indextest4.c4) (cost=0.26 rows=1)
-- -> Index lookup on indextest4 using c4 (c4=convert(indextest3.c3 using utf8mb4)) (cost=0.26 rows=1)
MySQL is using the index, so all good. However, what's the func?
It simply tells us that the value used against the index is the result of a function. In this case, MySQL is converting the charset for us (convert(indextest3.c3 using utf8mb4)).
This is another crucial consideration for a migration - indexes will still be effective. Of course, (very) complex queries will need to be carefully examined, but there are the grounds for a smooth transition.
Reference: The CHAR and VARCHAR Types
One concept to be aware of, although unlikely to affect real-world applications, is that utf8mb4 characters can take up to 33% more space.
In storage terms, databases need to know what’s the maximum limit of the data they handle. This means that even if a string will take the same space both in utf8mb3
and utf8mb4
, MySQL needs to know what’s the maximum space it can take.
The InnoDB index limit is 3072 bytes in MySQL 8.0; generally speaking, this is large enough not to care.
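A quick, made-up demonstration of the principle: a column width that is indexable with the 3-byte charset exceeds the limit with utf8mb4, because MySQL must reserve 4 bytes per character (800 * 4 = 3200 > 3072):
CREATE TABLE keylimit3 (c VARCHAR(800) CHARACTER SET utf8, KEY (c));
-- OK: 800 * 3 = 2400 bytes
CREATE TABLE keylimit4 (c VARCHAR(800) CHARACTER SET utf8mb4, KEY (c));
-- ERROR 1071 (42000): Specified key was too long; max key length is 3072 bytes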
Remember!:
- [VAR]CHAR(n) refers to the number of characters; therefore, the maximum requirement is 4 * n bytes, but
- TEXT fields refer to the number of bytes.

Reference: The INFORMATION_SCHEMA STATISTICS Table
Up to MySQL 5.7, information_schema statistics are updated in real time. In MySQL 8.0, statistics are cached, and updated only every 24 hours (by default).
In web applications, this affects only very specific use cases, but it’s important to know if one’s application is subject to this new behavior (our application was).
Let’s see the effects of this:
CREATE TABLE ainc (id INT AUTO_INCREMENT PRIMARY KEY);
-- On the first query, the statistics are generated.
--
SELECT TABLE_NAME, AUTO_INCREMENT FROM information_schema.tables WHERE table_name = 'ainc';
-- +------------+----------------+
-- | TABLE_NAME | AUTO_INCREMENT |
-- +------------+----------------+
-- | ainc | NULL |
-- +------------+----------------+
INSERT INTO ainc VALUES ();
SELECT TABLE_NAME, AUTO_INCREMENT FROM information_schema.tables WHERE table_name = 'ainc';
-- +------------+----------------+
-- | TABLE_NAME | AUTO_INCREMENT |
-- +------------+----------------+
-- | ainc | NULL |
-- +------------+----------------+
Ouch! The cached values are returned.
How about SHOW CREATE TABLE
?
SHOW CREATE TABLE ainc\G
-- CREATE TABLE `ainc` (
-- `id` int NOT NULL AUTO_INCREMENT,
-- PRIMARY KEY (`id`)
-- ) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
This command is always up to date.
How to update the statistics? By using ANALYZE TABLE
:
ANALYZE TABLE ainc;
SELECT TABLE_NAME, AUTO_INCREMENT FROM information_schema.tables WHERE table_name = 'ainc';
-- +------------+----------------+
-- | TABLE_NAME | AUTO_INCREMENT |
-- +------------+----------------+
-- | ainc | 2 |
-- +------------+----------------+
There you go. Let’s find out the related setting:
SHOW GLOBAL VARIABLES LIKE '%stat%exp%';
-- +---------------------------------+-------+
-- | Variable_name | Value |
-- +---------------------------------+-------+
-- | information_schema_stats_expiry | 86400 |
-- +---------------------------------+-------+
Developers who absolutely need to revert to the pre-8.0 behavior can set this value to 0.
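For example (the session-scoped variant is handy for one-off administration sessions):
SET SESSION information_schema_stats_expiry = 0; -- current session only
SET PERSIST information_schema_stats_expiry = 0; -- global, and persisted across restarts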
Up to MySQL 5.7, GROUP BY's result was implicitly sorted.
This was unnecessary - optimization-seeking developers used ORDER BY NULL in order to spare the sort - however, accidentally or not, some relied on it.
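The fix itself is trivial - where the implicit ordering was relied upon, it just needs to be made explicit (sketch with made-up names):
-- 5.7: implicitly ordered by `col1`; 8.0: no ordering guarantee
SELECT col1, COUNT(*) FROM mytable GROUP BY col1;
-- 8.0: request the ordering explicitly
SELECT col1, COUNT(*) FROM mytable GROUP BY col1 ORDER BY col1;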
Those who relied on it are unfortunately required to scan the codebase. There isn't a one-size-fits-all solution, and in this case, writing an automated fix may not be worth it compared to manually inspecting the occurrences; however, this doesn't prevent the Unix tools from helping 😄
Let’s simulate a coding standard where ORDER BY
is always on the line after GROUP BY
, if present:
cat > /tmp/test_groupby_1 << SQL
GROUP BY col1
-- ends here
GROUP BY col2
ORDER BY col2
GROUP BY col3
-- ends here
GROUP BY col4
SQL
cat > /tmp/test_groupby_2 << SQL
GROUP BY col5
ORDER BY col5
SQL
A basic version would be a simple grep scan, printing 1 line After each GROUP BY match:
$ grep -A 1 'GROUP BY' /tmp/test_groupby_*
/tmp/test_groupby_1: GROUP BY col1
/tmp/test_groupby_1- -- ends here
--
/tmp/test_groupby_1: GROUP BY col2
/tmp/test_groupby_1- ORDER BY col2
--
/tmp/test_groupby_1: GROUP BY col3
/tmp/test_groupby_1- -- ends here
--
/tmp/test_groupby_1: GROUP BY col4
--
/tmp/test_groupby_2: GROUP BY col5
/tmp/test_groupby_2- ORDER BY col5
However, with some basic scripting, we can display only the GROUP BYs matching the criteria:
# First, we make Perl speak english: `-MEnglish`, which enables `$ARG` (among the other things).
#
# The logic is simple: we print the current line if the previous line matched /GROUP BY/, and the
# current doesn't match /ORDER BY/; after, we store the current line as `$previous`.
#
perl -MEnglish -ne 'print "$ARGV: $previous $ARG" if $previous =~ /GROUP BY/ && !/ORDER BY/; $previous = $ARG' /tmp/test_groupby_*
# As next step, we automatically open all the files matching the criteria, in an editor:
#
# - `-l`: adds the newline automatically;
# - `$ARGV`: is the filename (which we print instead of the match);
# - `uniq`: if a file has more matches, the filename will be printed more than once - with
#   `uniq`, we remove the (adjacent) duplicates; this is optional though, as editors open each
#   file(name) only once;
# - `xargs`: send the filenames as parameters to the command (in this case, `code`, from Visual Studio
# Code).
#
perl -MEnglish -lne 'print $ARGV if $previous =~ /GROUP BY/ && !/ORDER BY/; $previous = $ARG' /tmp/test_groupby_* | uniq | xargs code
There is another approach: an inverted regular expression match:
# Match lines with `GROUP BY`, followed by a line _not_ matching `ORDER BY`.
# Reference: https://stackoverflow.com/a/406408.
#
grep -zP 'GROUP BY .+\n((?!ORDER BY ).)*\n' /tmp/test_groupby_*
This is, however, freaky, and, as regular expressions in general, has a high risk of hair-pulling (of course, this is up to the developer's judgement). It will be the subject of a future article, though, because I find it a very interesting case.
This is an easily missed problem! Some tools may not support MySQL 8.0.
There’s a known showstopper bug on the latest Gh-ost release, which prevents operations from succeeding on MySQL 8.0.
As a workaround, one can use trigger-based tools, like pt-online-schema-change v3.1.1 or v3.0.x (but v3.1.0 is broken!), or Facebook's OnlineSchemaChange.
When MySQL is installed via Homebrew (as of January 2020), the default collation is utf8mb4_general_ci
.
There are a couple of solutions to this problem.
A simple thing to do is to correct the Homebrew formula, and recompile the binaries.
For illustrative purposes, as part of this solution, I use the so-called “flip-flop” operator, which is something frowned upon… by people not using it 😉. As one can observe in fact, for the target use cases, it’s very convenient.
# Find out the formula location
#
$ mysql_formula_filename=$(brew formula mysql)
# Out of curiosity, let's print the relevant section.
#
# Flip-flop operator (`<condition> .. <condition>`): it matches *everything* between lines matching two conditions, in this case:
#
# - start: a line matching `/args = /`;
# - end: a line matching `/\]/` (a closing square bracket, which needs to be escaped, since it's a regex metacharacter).
#
$ perl -ne 'print if /args = / .. /\]/' "$mysql_formula_filename"
args = %W[
-DFORCE_INSOURCE_BUILD=1
-DCOMPILATION_COMMENT=Homebrew
-DDEFAULT_CHARSET=utf8mb4
-DDEFAULT_COLLATION=utf8mb4_general_ci
-DINSTALL_DOCDIR=share/doc/#{name}
-DINSTALL_INCLUDEDIR=include/mysql
-DINSTALL_INFODIR=share/info
-DINSTALL_MANDIR=share/man
-DINSTALL_MYSQLSHAREDIR=share/mysql
-DINSTALL_PLUGINDIR=lib/plugin
-DMYSQL_DATADIR=#{datadir}
-DSYSCONFDIR=#{etc}
-DWITH_BOOST=boost
-DWITH_EDITLINE=system
-DWITH_SSL=yes
-DWITH_PROTOBUF=system
-DWITH_UNIT_TESTS=OFF
-DENABLED_LOCAL_INFILE=1
-DWITH_INNODB_MEMCACHED=ON
]
# Fix it!
#
$ perl -i.bak -ne 'print unless /CHARSET|COLLATION/' "$mysql_formula_filename"
# Now recompile and install the formula
#
$ brew install --build-from-source mysql
An alternative solution is for the server to ignore the client encoding on handshake.
When configured this way, the server will impose the default character set/collation on the clients.
In order to apply this solution, add character-set-client-handshake = OFF
to the server configuration.
A very good practice when performing (major/minor) upgrades is to compare the system variables, in order to spot differences that may have an impact.
The MySQL Parameters website gives a visual overview of the differences between versions.
For example, the URL https://mysql-params.tmtms.net/mysqld/?vers=5.7.29,8.0.19&diff=true shows the differences between the system variables of v5.7.29 and v8.0.19.
The migration to MySQL 8.0 at Ticketsolve has been one of the smoothest, historically speaking.
This is a bit of a paradox, because we never had to rewrite our entire database for an upgrade, however, with sufficient knowledge of what to expect, we didn’t hit any significant bump (in particular, nothing unexpected in the optimizer department, which is usually critical).
Considering the main issues and their migration requirements:
the conclusion is that the preparation work can be entirely done before the upgrade, and the upgrade itself can subsequently be performed with a reasonable expectation of low risk.
Happy migration 😄
]]>Trailing spaces are a surprising (not in a good way), but also widely covered, topic. This article gives a short overview, and relates it to how trailing spaces affect people upgrading to MySQL 8.0.
Contents:
In this article I’m going to analyze only the VARCHAR
data type behavior, as I’d like to keep the article concise. Interested readers can find information in the links provided.
As of MySQL 8.0, utf8
is an alias to utf8mb3
(MySQL 5.7’s underlying standard); using utf8
/utf8mb3
will generate warnings when running some statements on an 8.0 server, which can be ignored in the context of this article.
The reader needs to have an idea of what a collation is (in short: a set of rules for comparing strings).
The MySQL version used, and required to run the article content, is 8.0.
The comparison (=) predicate (1)
The comparison (=) predicate specification is defined independently of its context, therefore, it behaves the same both in the select list (SELECT ...) and the search condition (WHERE ...).
Let’s start observing the MySQL 5.7 typical behavior:
CREATE TABLE test_comparison_ps (
id INT PRIMARY KEY AUTO_INCREMENT,
str VARCHAR(10) CHARSET utf8
);
INSERT INTO test_comparison_ps (str) VALUES(''), (' ');
SET NAMES utf8 COLLATE utf8_general_ci; # set the connection charset/collation
SELECT id, CONCAT('<', str, '>') `qstr`, str = '' , str = ' ' FROM test_comparison_ps;
# +----+------+----------+-----------+
# | id | qstr | str = '' | str = ' ' |
# +----+------+----------+-----------+
# | 1 | <> | 1 | 1 |
# | 2 | < > | 1 | 1 |
# +----+------+----------+-----------+
They’re all equal! This matches the typical outlook that “MySQL removes all the trailing spaces”.
But why so? Who’s responsible?
According to the SQL standard, trailing spaces are not removed on storage and retrieval. In MySQL, this is a responsibility of the storage engine, in this case InnoDB; from the related manpage, we read:
Trailing spaces are not truncated from VARCHAR columns.
It turns out, the culprit is the collation. In this case, utf8_general_ci, the default collation of the default MySQL 5.7 charset, pads the strings during comparison, so trailing spaces are insignificant.
How do we know how comparisons behave in relation to padding? Let's ask the information schema:
SELECT COLLATION_NAME, PAD_ATTRIBUTE FROM information_schema.collations WHERE COLLATION_NAME RLIKE 'utf8(mb4)?_(general|0900_ai)_ci';
/*
+--------------------+---------------+
| COLLATION_NAME | PAD_ATTRIBUTE |
+--------------------+---------------+
| utf8_general_ci | PAD SPACE | # 5.7 default
| utf8mb4_general_ci | PAD SPACE | # utf8mb4 default in MySQL 5.7
| utf8mb4_0900_ai_ci | NO PAD | # 8.0 default
+--------------------+---------------+
*/
From the manpages page 1 and page 2:
The pad attribute determines how trailing spaces are treated for comparison of nonbinary strings (CHAR, VARCHAR, and TEXT values):
- For PAD SPACE collations, trailing spaces are insignificant in comparisons; strings are compared without regard to any trailing spaces.
- NO PAD collations treat spaces at the end of strings like any other character.
The following are the formal rules from the SQL (2003) standard (section 8.2):
3) The comparison of two character strings is determined as follows:
a) Let CS be the collation as determined by Subclause 9.13, “Collation determination”, for the declared types of the two character strings.
b) If the length in characters of X is not equal to the length in characters of Y, then the shorter string is effectively replaced, for the purposes of comparison, with a copy of itself that has been extended to the length of the longer string by concatenation on the right of one or more pad characters, where the pad character is chosen based on CS. If CS has the NO PAD characteristic, then the pad character is an implementation-dependent character different from any character in the character set of X and Y that collates less than any string under CS. Otherwise, the pad character is a space.
c) The result of the comparison of X and Y is given by the collation CS.
d) Depending on the collation, two strings may compare as equal even if they are of different lengths or contain different sequences of characters. When any of the operations MAX, MIN, and DISTINCT reference a grouping column, and the UNION, EXCEPT, and INTERSECT operators refer to character strings, the specific value selected by these operations from a set of such equal values is implementation- dependent.
The crucial point is b).
The comparison (=) predicate (2)
Now we can go back, and observe a different collation - utf8mb4_0900_ai_ci, the MySQL 8.0 default:
CREATE TABLE test_comparison_np (
id INT PRIMARY KEY AUTO_INCREMENT,
str VARCHAR(10) CHARSET utf8mb4
);
INSERT INTO test_comparison_np (str) VALUES(''), (' ');
SET NAMES utf8mb4 COLLATE utf8mb4_0900_ai_ci; # behave like a standard MySQL 8.0 installation
SELECT id, CONCAT('<', str, '>') `qstr`, str = '' , str = ' ' FROM test_comparison_np;
/*
+----+------+----------+-----------+
| id | qstr | str = '' | str = ' ' |
+----+------+----------+-----------+
| 1 | <> | 1 | 0 |
| 2 | < > | 0 | 1 |
+----+------+----------+-----------+
*/
… so MySQL doesn’t “remove all the trailing spaces” after all.
The LIKE predicate
Let's see how the LIKE predicate behaves:
CREATE TABLE test_like (
id INT PRIMARY KEY AUTO_INCREMENT,
str VARCHAR(10) CHARSET utf8
);
INSERT INTO test_like (str) VALUES(''), (' ');
SET NAMES utf8 COLLATE utf8_general_ci;
SELECT id, CONCAT('<', str, '>') `qstr`, str LIKE '' , str LIKE ' ' FROM test_like;
/*
+----+------+-------------+--------------+
| id | qstr | str LIKE '' | str LIKE ' ' |
+----+------+-------------+--------------+
| 1 | <> | 1 | 0 |
| 2 | < > | 0 | 1 |
+----+------+-------------+--------------+
*/
Yikes! LIKE does not perform padding, even on a PAD SPACE collation such as utf8_general_ci.
LIKE has some semantic differences from =, which are confusing (for example, when dealing with JSON), however, they're expected.
Therefore, as long as we keep in mind that LIKE differs from =, we are less likely to make mistakes.
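A minimal side-by-side (my own example), run under a PAD SPACE collation such as MySQL 5.7's default:
SET NAMES utf8 COLLATE utf8_general_ci;
SELECT 'a' = 'a ' `equals`, 'a' LIKE 'a ' `like`;
-- equals: 1 (the trailing space is padded away)
-- like:   0 (LIKE treats the trailing space as significant)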
Let’s see how unique indexes behave:
CREATE TABLE test_unique_index (
id INT PRIMARY KEY AUTO_INCREMENT,
str_ps VARCHAR(10) CHARSET utf8 COLLATE utf8_general_ci,
str_np VARCHAR(10) CHARSET utf8mb4 COLLATE utf8mb4_0900_ai_ci
);
INSERT INTO test_unique_index (str_ps, str_np) VALUES('', ''), (' ', ' ');
ALTER TABLE test_unique_index ADD UNIQUE (str_ps);
-- ERROR 1062 (23000): Duplicate entry '' for key 'str_ps'
ALTER TABLE test_unique_index ADD UNIQUE (str_np);
-- Query OK, 0 rows affected (0,02 sec)
Unique indexes behave like the comparison predicate; this makes sense, since comparison is the core operation they’re associated to.
The DISTINCT predicate
Let's see the effects of the DISTINCT predicate:
CREATE TABLE test_distinct (
id INT PRIMARY KEY AUTO_INCREMENT,
str VARCHAR(10) CHARSET utf8
);
INSERT INTO test_distinct (str) VALUES(''), (' ');
SET NAMES utf8 COLLATE utf8_general_ci;
SELECT DISTINCT str FROM test_distinct;
/*
+------+
| str |
+------+
| | # ''
| | # ' '
+------+
*/
Very confusing: DISTINCT
does not perform padding.
This is something to keep in mind.
The GROUP BY clause
Finally, the GROUP BY clause:
CREATE TABLE group_by (
id INT PRIMARY KEY AUTO_INCREMENT,
str VARCHAR(10) CHARSET utf8
);
INSERT INTO group_by (str) VALUES(''), (' ');
SET NAMES utf8 COLLATE utf8_general_ci;
SELECT str FROM group_by GROUP BY str;
/*
+------+
| str |
+------+
| | # ''
| | # ' '
+------+
*/
Very confusing, again, although in a way, we could have expected this, since RDBMSs, in some cases, can process DISTINCT
and GROUP BY
the same way.
All in all, the padding rules in MySQL are not so confusing, but one needs to be aware of them - and I haven’t even explored the CHAR
data type.
In my opinion, they’re not worth the hassle, so MySQL 8.0’s behavior is a very welcome simplification. Time to update the database! 😄
]]>The operation itself is simple, however, if we want to script the operation, using text processing in a sharp way, it’s not immediate what the best solution is.
In this post I’ll explore the process of looking for a satisfying solution, going through grep, perl, and awk.
Contents:
For simplicity, we assume that the filenames returned by the mysqld commands, and the user home path, don't require quoting (i.e. they don't contain spaces).
Finding the configuration files is a simple operation:
$ mysqld --verbose --help
This yields a pages-long text, with all the command lines parameter and the server configuration; the relevant section is:
# ...
Default options are read from the following files in the given order:
/etc/my.cnf /etc/mysql/my.cnf /usr/local/mysql/etc/my.cnf ~/.my.cnf
# ...
A generic, manual, approach is to use grep to isolate the text:
$ mysqld --verbose --help | grep -A 1 "^Default options"
Default options are read from the following files in the given order:
/etc/my.cnf /etc/mysql/my.cnf /usr/local/mysql/etc/my.cnf ~/.my.cnf
Using the option -A
(--after-context
), we tell grep to print the given number of lines after the match.
Now we isolate the options line:
$ mysqld --verbose --help | grep -A 1 "^Default options" | tail -n 1
/etc/my.cnf /etc/mysql/my.cnf /usr/local/mysql/etc/my.cnf ~/.my.cnf
Standard approach - we use tail -n 1
in order to print the last 1 line(s).
There’s a problem now; we need to expand the tilde (~
).
Since the string ~/.my.cnf
is the output of a command, it’s not expanded by the subshell; this simplified example fails:
$ ls -l $(echo '~/.my.cnf')
ls: cannot access '~/.my.cnf': No such file or directory
We’ll try search/replace the tilde with the home path ($HOME
in any shell) via Perl:
$ mysqld --verbose --help | grep -A 1 "^Default options" | tail -n 1 | perl -pe "s/~/$HOME/g"
Unknown regexp modifier "/h" at -e line 1, at end of line
syntax error at -e line 1, at EOF
Execution of -e aborted due to compilation errors.
Yikes! What happened?
The problem is that $HOME, in my case /home/saverio, contains slashes, which are interpolated by the shell, and ultimately interpreted by Perl as part of the regex syntax; this is the simplified example:
$ echo perl -pe "s/~/$HOME/g"
perl -pe s/~//home/saverio/g
$ echo | perl -pe 's/~//home/saverio/g'
Unknown regexp modifier "/h" at -e line 1, at end of line
Execution of -e aborted due to compilation errors.
which causes the error previously raised.
Perl can access environment variables - this comes to our rescue:
$ echo '~/.my.cnf' | perl -pe 's/~/$ENV{"HOME"}/'
/home/saverio/.my.cnf
We now have the building blocks of a fully functional command:
$ mysqld --verbose --help | grep -A 1 "^Default options" | tail -n 1 | perl -pe 's/~/$ENV{"HOME"}/g'
/etc/my.cnf /etc/mysql/my.cnf /usr/local/mysql/etc/my.cnf /home/saverio/.my.cnf
Don’t forget the /g
regex modifier! It tells Perl to replace all the occurrences of a pattern in each matching line, if there’s more than one match (per line).
Our task is now accomplished. Can we do better?
While the last revision of the command works, it contains way too many commands. Does the GNU toolbox have better tools?
Let’s see what awk offers.
Awk is a (Turing-complete!) programming language, dedicated to text-processing; hopefully, it includes built-in functions relevant to our task.
The ugliest part right now is to isolate the options string from the entire mysqld
help. The logic required is:
with grep, unfortunately we can’t just print the line below without printing the matching line. But we can with awk!:
$ mysqld --verbose --help | awk '/^Default options/ { getline; print }'
/etc/my.cnf /etc/mysql/my.cnf /usr/local/mysql/etc/my.cnf ~/.my.cnf
Awk’s language is fortunately fairly intuitive.
We use pattern matching /<pattern>/
to match the intended line, and for the matches we execute a block ({ ... }
) that goes to the next line (getline
) and then prints the current one (print
).
Now, in the current revision, we still have two commands, awk
and perl
:
mysqld --verbose --help | awk '/^Default options/ { getline; print }' | perl -pe 's/~/$ENV{"HOME"}/g'
Let’s merge them! We use awk’s search and replace, and environment variables access:
$ mysqld --verbose --help | awk '/^Default options/ { getline; gsub("~", ENVIRON["HOME"]); print }'
/etc/my.cnf /etc/mysql/my.cnf /usr/local/mysql/etc/my.cnf /home/saverio/.my.cnf
Here we use the search and replace function (gsub(source[, destination[, how]])
; how
is not relevant to this article) and associative arrays applied to environment variables (ENVIRON[<variable_name>]
).
Note that gsub
is the global version of search/replace; it replaces all the occurrence in a string, like perl /g
regex modifier.
As extra step, we want to use the output. Say, let’s add a comment to the [mysqld]
block:
$ perl -i -pe 's/^(\[mysqld\]\n)/# Server configuration group follows:\n$1/' $(mysqld --verbose --help | awk '/^Default options/ { getline; gsub("~", ENVIRON["HOME"]); print }') 2> /dev/null
We just ignore the errors (due to file(s) not found), by sending them to /dev/null
.
Long ago, I thought that one could improve their text processing skills with a straight read of educational material. Nowadays, I find it much more effective (and pleasant) to find out, when I have the opportunity, which are the most effective tools to accomplish a given task.
In this article we’ve done an iterative search of the best text processing tools for the given use case; we’ve found that awk compactly, yet intuitively, satisfies the requirements, and we’ve explored a few, interesting and useful, features along the way.
]]>utf8 to utf8mb4, and since we had the conversion in plan anyway, we anticipated it and performed it as a preliminary step for the upgrade.
This post describes in depth the overall experience, including tooling and pitfalls, and related subjects.
Contents:
utf8mb4
is the MySQL encoding that fully covers the UTF-8 standard. Up to MySQL 5.7, the default encoding is utf8
; the name is somewhat misleading, as this is a variant with a maximum width of 3 bytes.
Although there’s no practical purpose nowadays in using 3-bytes rather than 4-bytes UTF-8, this choice was originally made for performance reasons.
From a practical perspective, not all the applications will benefit from the extra byte of width, whose most common use cases include emojis and mathematical letters, however, conforming to standards is a routine task in software engineering.
Since utf8mb4
is a superset of utf8
, the conversion is relatively painless, however, it’s crucial to be aware of the implications of the procedure.
It’s impossible to make a general plan, due to the different requirements of any use case; high traffic applications may for example require that no locking should be involved (ie. no ALTER TABLE
), while low traffic/size applications may just do with a few ALTER TABLE
s.
However, I’ll trace a granular set of steps that should cover the vast majority of the cases; GitHub’s gh-ost is used, therefore, there’s no table locking during the data conversion step.
The setup is assumed to be single-master; there are generally sophisticated multi-master strategies for schema updates, however, they are outside the scope of this article.
The only migration constraint set is that until the end of the migration, the user should not allow 4-byte characters into the database; this gives the certainty that any implicit conversion performed before the end of the migration will succeed.
Users can certainly lift this constraint, however, they must thoroughly analyze the application data flows, in order to be 100% sure that utf8mb4
strings including 4-byte characters won’t mingle with utf8
strings, as this will cause errors.
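For reference, this is the kind of error the constraint protects against - sketched here with a throwaway table (connection charset utf8mb4, strict SQL mode; the error message is approximate):
CREATE TEMPORARY TABLE smp_test (c CHAR(1) CHARACTER SET utf8);
INSERT INTO smp_test VALUES ('🍕');
-- ERROR 1366 (HY000): Incorrect string value: '\xF0\x9F\x8D\x95' for column 'c' at row 1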
MySQL 8.0 changed the utf8mb4
default collation from utf8mb4_general_ci
to utf8mb4_0900_ai_ci
(for details, see here and here).
This has a very significant impact - if the utf8
update is performed on a MySQL 5.7 server, without specifying the collation, and then the server is upgraded to v8.0, the collation of all the data structures will not match the default.
Of course, in such case it’s possible to leave the system as is, however, it won’t be the standard (and the settings will need to be set accordingly, in order to ensure that new tables/columns will be created with the intended collation).
It’s crucial to be aware of this, because most of the online information about the utf8
conversion has been written when MySQL 8.0 was not released yet, so it holds the outdated assumption that the default utf8mb4
collation is utf8mb4_general_ci
.
In the following sections, I’ll point out which configuration parameters are required, when performing the conversion on a 5.7 server.
The character set [from now on abbreviated as charset
] and collation of a given string or database object (ultimately, a column), and the operation performed, are determined by one or more settings/properties at different levels:
For example:
and so on.
Additionally, MySQL server attempts to use a compatible combination charset+collation for incompatible charsets, overriding the configuration/settings.
In order to view the connection and database server settings, we can use this handy query:
SHOW VARIABLES WHERE Variable_name RLIKE '^(character_set|collation)_' AND Variable_name NOT RLIKE '_(database|filesystem|system)$';
some settings are skipped, as they’re unrelated or deprecated.
This is a table of the relevant entries:
Setting | New value | Notes | Server setting | Client setting
---|---|---|---|---
character_set_client | utf8mb4 | data sent by the client | | ✓
character_set_connection | utf8mb4 | server converts client data into this charset for processing | | ✓
collation_connection | utf8mb4_0900_ai_ci | server uses this collation for processing | | ✓
character_set_results | utf8mb4 | data and metadata sent by the server | | ✓
character_set_server | utf8mb4 | default (and fallback) charset for objects | ✓ |
collation_server | utf8mb4_0900_ai_ci | default (and fallback) collation for objects | ✓ |
Server settings are defined at the server level, and as such, they’re typically set in the server configuration file - this is required if we’re operating on MySQL 5.7 (since it uses utf8
by default).
Client settings are specified by the client on connection; typically, they’re set via the SET NAMES <charset> [COLLATE <collation>]
statement.
This command is invoked when the encoding/collation are configured by the application framework; in the case of Rails, the parameters are in database.yml
:
# Typical structure
login:
  encoding: utf8mb4
  collation: utf8mb4_0900_ai_ci
  # ...
In Django, we add the following to settings.py
:
# Typical structure
DATABASES = {
    'default': {
        'OPTIONS': {'charset': 'utf8mb4'},
        # ...
    }
}
The changes above will cause the following statement to be issued on the first connection:
SET NAMES utf8mb4 COLLATE utf8mb4_0900_ai_ci # Rails also sets other variables here.
Based on a brief look at the source code, there is no collation option in Django, so the COLLATE utf8mb4_0900_ai_ci
won’t be specified in the SQL statement.
This step can be performed at the beginning or the end of the migration; the reason is explained in the next subsection.
During the migration, with either utf8
or utf8mb4
connection settings, we’ll find data belonging to the other charset. Is this a problem?
First, an introduction to the charset/collation settings is required.
Over the course of a database connection, the data (flow) is processed in several steps:
- received as character_set_client
- converted to character_set_connection (and compared using the collation_connection)
- sent back as character_set_results
settings, so we can really think of all of them as a single entity.
So, the core question is: for client data in a given format (utf8
or utf8mb4
), will processing (comparison or storage) always succeed?
Fortunately, in our context, the answer is always yes.
When it comes to storage, the matter is pretty simple; MySQL will take care of “converting” the format. We’re safe here because by using 3-byte characters, we can convert without any problem from and to the other charset.
However, in this context, strings manipulation is not only about storage - comparison is the other aspect to consider. It’s time to introduce the concept of collation and the related rules.
Strings are compared according to a "collation", which defines how the data is sorted and compared. Each charset has a default collation, which in MySQL is the case-insensitive one (utf8_general_ci and utf8mb4_general_ci/utf8mb4_0900_ai_ci).
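The defaults can be checked directly (on a v8.0 server, utf8mb4 reports the new 0900 collation):
SELECT CHARACTER_SET_NAME, DEFAULT_COLLATE_NAME
FROM information_schema.CHARACTER_SETS
WHERE CHARACTER_SET_NAME LIKE 'utf8%';
-- utf8    -> utf8_general_ci
-- utf8mb4 -> utf8mb4_0900_ai_ci (utf8mb4_general_ci on 5.7)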
Now, when collating strings of mixed type, will the operation succeed? The answer is… no, but yes!
The reason for the no is that, unlike storage, we can’t use a collation for two different charsets. However, MySQL comes to the rescue.
MySQL has a set of coercibility rules, which determine which collation to use in a given operation (or if an error should be raised).
The rules are quite a few, however, they’re consistently defined, so they’re easy to understand.
We’ll see a few relevant examples, where we’ll also introduce a few interesting SQL clauses:
First example:
CREATE TEMPORARY TABLE test_table (
utf8col CHAR(1) CHARACTER SET utf8 COLLATE utf8_bin
)
SELECT _utf8'ä' `utf8col`;
SELECT utf8col < _utf8mb4'🍕' COLLATE utf8mb4_bin `result` FROM test_table;
# +--------+
# | result |
# +--------+
# | 1 |
# +--------+
The relevant rules are:
An explicit COLLATE clause has a coercibility of 0 (not coercible at all)
The collation of a column or a stored routine parameter or local variable has a coercibility of 2
which determine the collation to be utf8mb4_bin. Shouldn't the utf8col value fail, due to being an utf8 value, which is not handled by the winning collation?
No! MySQL will automatically convert the value, making it compatible. This is equivalent to:
SELECT _utf8mb4'ä' < _utf8mb4'🍕' COLLATE utf8mb4_bin `result` FROM test_table;
Second example:
SET NAMES utf8mb4;
CREATE TEMPORARY TABLE test_table (
utf8col CHAR(1) CHARACTER SET utf8 COLLATE utf8_bin
)
SELECT _utf8'ä' `utf8col`;
SELECT utf8col < 'ë' `result` FROM test_table;
# +--------+
# | result |
# +--------+
# | 1 |
# +--------+
The relevant rules are:
The collation of a column or a stored routine parameter or local variable has a coercibility of 2
The collation of a literal has a coercibility of 4
The collation will be utf8_bin
. Since ë
can be converted, there’s no problem.
Equivalent statement:
SELECT _utf8'ä' COLLATE utf8_bin < _utf8mb4'ë' `result` FROM test_table;
Final example:
CREATE TEMPORARY TABLE test_table (
utf8col CHAR(1) CHARACTER SET utf8 COLLATE utf8_bin
)
SELECT _utf8'ä' `utf8col`;
SELECT utf8col < _utf8mb4'🍕' `result` FROM test_table;
ERROR 1267 (HY000): Illegal mix of collations (utf8_bin,IMPLICIT) and (utf8mb4_0900_ai_ci,COERCIBLE) for operation '<'
Error! What happened here?
The relevant rules and chosen collation are the same as the previous example, however, in this case, the pizza emoji (🍕
) can’t be converted to utf8
, therefore, the operation fails.
The conclusion is that as long as we use utf8
characters only during the migration, we’ll have no problem, as the only relevant case is the second example.
The ALTER statements
In this step we'll prepare all the ALTER statements that will change the schema/table metadata, and the data.
The operations are performed on a development database with the same structure as production.
First, we convert the database default charset (both production and development):
ALTER SCHEMA production_schema CHARACTER SET=utf8mb4;
data is not changed - only the metadata.
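The change can be verified via the information schema (only the schema-level default is affected at this point):
SELECT DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME
FROM information_schema.SCHEMATA
WHERE SCHEMA_NAME = 'production_schema';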
Then, we convert all the table charset to utf8mb4
:
mysqldump "$updating_schema" |
perl -ne 'print "ALTER TABLE $1 CHARACTER SET utf8mb4;\n" if /CREATE TABLE (.*) /' |
mysql "$updating_schema"
again, data is not changed. This operation will cause all the columns that don’t match the new charset (supposedly, all the existing character columns), to show the former (utf8
) charset in their definition:
# before (simplified)
CREATE TABLE mytable (
intcol INT,
strcol CHAR(1),
strcol2 CHAR(1)
);
# after
CREATE TABLE mytable (
intcol INT,
strcol CHAR(1) CHARACTER SET utf8,
strcol2 CHAR(1) CHARACTER SET utf8
) DEFAULT CHARSET=utf8mb4;
This allows us to write a straight conversion command:
mysqldump --no-data --skip-triggers "$updating_schema" |
egrep '^CREATE TABLE|CHARACTER SET utf8\b' |
perl -0777 -pe 's/(CREATE TABLE [^\n]+ \(\n)+CREATE/CREATE/g' | # remove tables without entries
perl -0777 -pe 's/,?\n(CREATE|$)/;\n$1/g' | # change comma of each last column def to semicolon (or add it)
perl -pe 's/(CHARACTER SET utf8\b)/$1mb4/' | # change charset
perl -pe 's/ `/ MODIFY `/' | # add `MODIFY`
perl -pe 's/^CREATE TABLE (.*) \(/ALTER TABLE $1/' # convert `CREATE TABLE ... (` to `ALTER TABLE`
The output will consist of all the required ALTER TABLES
, for example:
ALTER TABLE `mytable`
MODIFY `strcol` char(1) CHARACTER SET utf8mb4 DEFAULT NULL,
MODIFY `strcol2` char(1) CHARACTER SET utf8mb4 DEFAULT NULL;
A database engine needs to know the maximum length of the stored data, in this case, text, because the data structures are subject to limits.
In relation to the utf8 migration, the two related limits are:
In practice, something that may happen is that a table defined as such:
CREATE TABLE mytable (
longcol varchar(21844) CHARACTER SET utf8
);
will cause an error when converting to utf8mb4:
ALTER TABLE mytable MODIFY longcol varchar(21844) CHARACTER SET utf8mb4;
ERROR 1074 (42000): Column length too big for column 'longcol' (max = 16383); use BLOB or TEXT instead
because of MySQL's restriction of 65535 (2^16 - 1) bytes on the combined size of all the columns.
A comparable restriction applies to index prefixes, although in this case there are two limits, 767 and 3072 bytes, depending on the row format and the large prefix option.
The restriction specifications can be found in the MySQL manual.
If reducing the column width is not an option, the column will need to be converted to a TEXT
data type.
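In that case, the workaround is simply (my own sketch, reusing the names above):
ALTER TABLE mytable MODIFY longcol TEXT CHARACTER SET utf8mb4;
-- TEXT columns count only a few bytes towards the 65535-byte row limit, as the data is
-- stored separately.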
Note that using very long character columns should be carefully evaluated. Advanced DBAs know the implications, however it’s worth mentioning that in relation to the topic of internal temporary tables, character columns larger than 512 characters cause on-disk tables to be used; large object columns (BLOB
/TEXT
) don’t have this problem from version 8.0.3 onwards (see MySQL manual).
Therefore, large object columns are suitable for a larger amount of use cases than they were in the past.
Triggers and functions also require review.
Since they are executed outside the context of a connection, they carry their charset settings:
SHOW TRIGGERS\G
# [...]
# character_set_client: utf8
# collation_connection: utf8_general_ci
# Database Collation: utf8_general_ci
On one hand, those properties can be executed at any point of the migration, as they act exactly as described in the connection configurations section.
On the other hand, we need to take care of explicit COLLATE
clauses involving columns being converted, if present.
Suppose we have this statement:
SET @column_updated := OLD.strcol <=> NEW.strcol COLLATE utf8_bin;
If we migrate the column to utf8mb4, as soon as the ALTER TABLE completes, any operation associated to the trigger (eg. INSERT) will always fail, because the utf8_bin collation is not compatible with the new utf8mb4 charset.
The solution is fairly simple - the trigger needs to be dropped before the ALTER TABLE
, and recreated after. This of course, can be a serious challenge for high-traffic websites.
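Sketched with the statement above and made-up object names (the real procedure also needs to account for the events missed while the trigger doesn't exist):
DROP TRIGGER IF EXISTS mytable_track_update;

-- ... the ALTER TABLE/gh-ost conversion of `mytable` runs here ...

DELIMITER $$
CREATE TRIGGER mytable_track_update AFTER UPDATE ON mytable
FOR EACH ROW
BEGIN
  -- the explicit collation now matches the new charset
  SET @column_updated := OLD.strcol <=> NEW.strcol COLLATE utf8mb4_bin;
END$$
DELIMITER ;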
Inevitably, some tables will be converted before others; even assuming parallel conversion, it’s not possible (without locking) to synchronize the end of the conversion of a set of given tables.
This creates a problem for a specific case: JOINs between columns of heterogeneous charsets - in practice, between a utf8
column and an utf8mb4
one.
In theory, this shouldn’t be a problem in itself. Let’s see what MySQL does in this case; let’s create a couple of tables:
CREATE TABLE utf8_table (
mb3col CHAR(1) CHARACTER SET utf8,
KEY `mb3idx` (mb3col)
);
INSERT INTO utf8_table
VALUES ('a'), ('b'), ('c'), ('d'), ('e'), ('f'), ('g'), ('h'), ('i'), ('j'), ('k'), ('l'), ('m'),
('n'), ('o'), ('p'), ('q'), ('r'), ('s'), ('t'), ('u'), ('v'), ('w'), ('x'), ('y'), ('z');
CREATE TABLE utf8mb4_table (
mb4col CHAR(1) CHARACTER SET utf8mb4,
KEY `mb4idx` (mb4col)
);
INSERT INTO utf8mb4_table
VALUES ('a'), ('b'), ('c'), ('d'), ('e'), ('f'), ('g'), ('h'), ('i'), ('j'), ('k'), ('l'), ('m'),
('n'), ('o'), ('p'), ('q'), ('r'), ('s'), ('t'), ('u'), ('v'), ('w'), ('x'), ('y'), ('z'),
('🍕');
First, let’s see what happen for simple index scans.
EXPLAIN SELECT COUNT(*) FROM utf8mb4_table WHERE mb4col = _utf8'n';
# +----+-------------+---------------+------------+------+---------------+--------+---------+-------+------+----------+-------------+
# | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
# +----+-------------+---------------+------------+------+---------------+--------+---------+-------+------+----------+-------------+
# | 1 | SIMPLE | utf8mb4_table | NULL | ref | mb4idx | mb4idx | 5 | const | 1 | 100.00 | Using index |
# +----+-------------+---------------+------------+------+---------------+--------+---------+-------+------+----------+-------------+
SHOW WARNINGS\G
# [...]
# Message: /* select#1 */ select count(0) AS `COUNT(*)` from `db`.`utf8mb4_table` where (`db`.`utf8mb4_table`.`mb4col` = 'n')
Interestingly, it seems that MySQL converts the data before it reaches the optimizer; this is valuable knowledge, because with the current constraint(s), we can rely on the indexes as much as before the migration start.
What happens with JOINs?
EXPLAIN SELECT COUNT(*) FROM utf8_table JOIN utf8mb4_table ON mb3col = mb4col;
# +----+-------------+---------------+------------+-------+---------------+--------+---------+------+------+----------+--------------------------+
# | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
# +----+-------------+---------------+------------+-------+---------------+--------+---------+------+------+----------+--------------------------+
# | 1 | SIMPLE | utf8_table | NULL | index | NULL | mb3idx | 4 | NULL | 26 | 100.00 | Using index |
# | 1 | SIMPLE | utf8mb4_table | NULL | ref | mb4idx | mb4idx | 5 | func | 1 | 100.00 | Using where; Using index |
# +----+-------------+---------------+------------+-------+---------------+--------+---------+------+------+----------+--------------------------+
What’s func
?
SHOW WARNINGS\G
# Message: /* select#1 */ select count(0) AS `COUNT(*)` from `db`.`utf8_table` join `db`.`utf8mb4_table` where (convert(`db`.`utf8_table`.`mb3col` using utf8mb4) = `db`.`utf8mb4_table`.`mb4col`)
Very interesting; we see what MySQL does in this case: it iterates utf8_table.mb3col (specifically, it iterates the index mb3idx), and for each value, it converts it to utf8mb4, so that it can be sought in the utf8mb4_table.mb4idx index.
Note that this is a simple case; more complex JOINs in the app should still be carefully reviewed.
Now we can proceed to alter the production schema.
The schema encoding can be changed without any worry, as it’s not a locking operation (up to v5.7, database properties are stored in a separate file, db.opt
).
The table changes are the “big deal”: we need to perform them without locking, and with an awareness of the implications.
In order to avoid table locking, we use gh-ost, which is easy to use and well-documented.
Generally speaking, each ALTER TABLE
of the list generated in the previous step must be converted to a gh-ost
command and executed.
For example, this DDL statement:
ALTER TABLE `mytable`
MODIFY `strcol` char(1) CHARACTER SET utf8mb4 DEFAULT NULL,
MODIFY `strcol2` char(1) CHARACTER SET utf8mb4 DEFAULT NULL;
needs to be performed as [simplified form]:
gh-ost --database="$production_schema" --table="mytable" --alter="
CHARACTER SET utf8mb4,
MODIFY `strcol` char(1) CHARACTER SET utf8mb4 DEFAULT NULL,
MODIFY `strcol2` char(1) CHARACTER SET utf8mb4 DEFAULT NULL
"
This is a fairly simple procedure. Don’t forget to run ANALYZE TABLE
on each table after it’s been rebuilt.
The problem that some users will have is triggers; gh-ost doesn’t support tables with triggers, so an alternative procedure needs to be applied by high-traffic websites using this functionality.
Little gotchas to be aware of!
Don’t forget to convert the other schemas as well!
In particular, if you’re on AWS, the schema tmp
will need to be converted. Forgetting to do so may cause errors if this database is used for temporary data operations that involve the main production database.
ANALYZE TABLE
It’s crucial to always run an ANALYZE TABLE
for each table rebuilt. Gh-ost builds tables via successive insert, and it’s good (MySQL) DBA practice to:
run ANALYZE TABLE after loading substantial data into an InnoDB table, or creating a new index for one
See the MySQL manual for more informations.
DROP TABLE
Gh-ost doesn’t delete the old table after replacing it - it only renames it. Be very careful when deleting it; a straight DROP TABLE
may flood the server with I/O.
Internally, we have a script for dropping large tables that first drops the indexes one by one, then deletes the records in chunks, and only at the end drops the (now empty) table.
There’s a popular post about the same subject, by a V8 developer (Mathias Bynens).
A couple of concepts are worth considering:
# For each table
REPAIR TABLE table_name;
OPTIMIZE TABLE table_name;
From this, it can be deduced that the author uses MyISAM, as InnoDB doesn’t support REPAIR TABLE
(see the MySQL manual).
make sure to repair and optimize all databases and tables […] ran into some weird bugs where UPDATE statements didn’t have any effect, even though no errors were thrown
this is very likely a bug, and based on the previous point, it may be MyISAM related (or related to ALTER TABLE
). MyISAM has been essentially abandoned for a long time, and we’ve experienced buggy behaviors as well (although not in the context of charsets), so it wouldn’t be a surprise; the post is also very old (2012).
We’re entirely on InnoDB, and we didn’t experience any issue when changing the charset via ALTER TABLE
(small tables in our model have been done this way). It’s also worth considering that gh-ost alters tables by creating an empty table and slowly filling it, which is different from issuing an ALTER TABLE
.
If somebody still wanted to do a rebuild of the table, note that OPTIMIZE TABLE
performs a full rebuild followed by ANALYZE TABLE
, so it’s not required to run the latter statement separately.
Considering that migrating a database to utf8mb4
implies literally rebuilding the entire database’s data, it’s been a ride with relatively few bumps.
The core issue is handling JOINs between columns being migrated; it may not be a trivial matter, but it’s possible to get deterministic behavior with a thorough analysis.
Projects planning to move to MySQL 8.0 are encouraged to perform this step ahead, to shift as many possible changes related to the upgrade ahead of the upgrade itself.
All in all, migrating to utf8mb4
is a very significant change, but knowing where to look at, it’s possible to perform it smoothly.
¹ Very likely, partial indexes are a fit solution to this problem, but they’re not supported by MySQL.
]]>On modern MySQL setups, dropping a column doesn’t lock the table (it does, actually, but for a relatively short time), however, we wanted to improve a very typical Rails migration scenario in a few ways:
I’ll give the Gh-ost tool a brief introduction, and show how to fulfill the above requirements in a simple way, by using this tool and an ActiveRecord flag.
This workflow can be applied to almost any table alteration scenario.
Contents:
Gh-ost is a relatively recent tool by GitHub, which allows online table modifications without locking.
Tools like gh-ost existed before - the first being mk-online-schema-change
(now pt-online-schema-change
), developed by Percona.
The Percona tool relies on triggers in order to achieve the objective, which is a good enough, stable, solution. However, there are a variety of reasons that (can) make the tool inadequate for high-load conditions.
Gh-ost introduced the novel idea of reading from the binary log (which logs all the write operation) in order to reproduce the writes on the temporary table.
Gh-ost can be run in different setups; this article will show the simplest one.
Let’s assume the following table:
CREATE TABLE `customers` (
--- column definitions
`source_id` int(11) NOT NULL,
-- index definitions
KEY `index_customers_on_source_id` (`source_id`)
);
with the corresponding model:
class Customer < ApplicationRecord
# model content
end
and migration:
class DropCustomersSourceId < ActiveRecord::Migration
def change
remove_column :customers, :source_id
end
end
First, we tackle point #2. Let’s have a look at the stages of a typical deploy with migrations:
ALTER TABLE
statement, which will take a long time;The problem is that between the stages 2. and 3. (and also, depending on the app server configuration, during the processes restart), the app servers will have in memory the old version of the codebase, which expects customers.source_id
to be present.
Although this time is relatively short, on a high-load environment, if a Customer
instance is saved, the operation will fail, because ActiveRecord will include the column in the underlying INSERT.
In systems engineering, schema-aware code strategy is sometimes applied: essentially, writing code in the form “if the schema is foo
, do bar
, otherwise, do baz
”.
In the case of a column drop, we have at our disposal a “cheap” schema-aware strategy: ignored_columns
(see the Rails PR).
This directive makes ActiveRecord entirely ignore a column, so that the column can disappear at any time, without ActiveRecord noticing.
Let’s update the model:
class Customer < ApplicationRecord
self.ignored_columns = %w(source_id)
# model content
end
and the migration:
class DropCustomersSourceId < ActiveRecord::Migration[5.2]
def change
remove_column :customers, :source_id unless is_production_environment?
end
def is_production_environment?
# choose strategy
end
end
We can now perform the deploy; this time, the table column will not be dropped. After the deploy, we will use gh-ost, as outlined in the next section.
Gh-ost is pretty straightforward to use. In this context it’s used in the simplest way possible, that is, running directly on master.
Note that there are many options available, including:
A summary document is available here; gh-ost has good documentation.
The sample command we use is:
$ GHOST_TABLE="customers"
$ GHOST_ALTER="DROP source_id"
$ gh-ost \
--user="$GHOST_USER" --password="$GHOST_PASSWORD" --host="$GHOST_HOST" \
--database="$GHOST_SCHEMA" --table="$GHOST_TABLE" --alter="$GHOST_ALTER" \
--allow-on-master --exact-rowcount --verbose --execute
The options are clear; --exact-rowcount
will trade a little execution time for more accurate progress estimation.
Gh-ost will create a temporary (in a logical, not SQL, sense) table, slowly fill it and update with original table updates, then swap (with negligible locking time) them.
A crucial detail is that gh-ost will leave the original table in the database, renamed (in this case, _customers_del
).
Although there is an option to drop the table automatically, do not enable it, and do not attempt to do it manually: dropping a large table creates a large amount of I/O, due to MySQL freeing the buffer pool pages, which will likely bring the database system to a grinding halt for some time. Instead, one should follow a progressive table drop workflow:
Between each drop/deletion, SLEEP
calls should be performed, in order to ensure that the writes are fully flushed.
Internally, we have a script for this, and it’s advised to find or develop something similar.
Of course, SLEEP
can be replaced with sophisticated strategies (eg. relying on the server statistics to track the I/O), however, in our system, SLEEP
is a perfectly adequate while simple strategy.
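A (heavily simplified) sketch of the idea, using the example table; the real script loops the chunked delete until the table is empty, and tunes chunk size/sleeps to the server load:
ALTER TABLE _customers_del DROP KEY index_customers_on_source_id; -- one index at a time
DO SLEEP(5);
DELETE FROM _customers_del LIMIT 10000; -- repeat (with sleeps) until no rows are left
DO SLEEP(5);
DROP TABLE _customers_del; -- now cheap: the table is empty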
Remove ignored_columns and redeploy
At this point, in production, Rails will be completely unaware of the existence (or not) of the column (being) dropped.
After the column is dropped, we can remove the Customer.ignored_columns directive, and deploy any time (or even wait for the next deploy).
We’ve been using gh-ost for a long time by now, and we’ve developed a surrounding tooling ecosystem.
Once one gets used to such workflows, it's actually satisfying to perform "push-button" table alterations without any locking or performance drop in general, instead of being worried about the impact of (relatively) large-scale db operations.
Paraphrasing the typical joke:
;-)
]]>