<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://saveriomiroddi.github.io/feed/mysql.xml" rel="self" type="application/atom+xml" /><link href="https://saveriomiroddi.github.io/" rel="alternate" type="text/html" /><updated>2024-09-11T12:22:11+00:00</updated><id>https://saveriomiroddi.github.io/feed/mysql.xml</id><title type="html">Saverio Miroddi | Mysql</title><subtitle>64K RAM SYSTEM &amp;nbsp;38911 BASIC BYTES FREE</subtitle><entry><title type="html">Announcement: Added a separate feed for MySQL topics</title><link href="https://saveriomiroddi.github.io/announcement-added-a-separate-feed-for-mysql-topics/" rel="alternate" type="text/html" title="Announcement: Added a separate feed for MySQL topics" /><published>2020-07-18T00:00:00+00:00</published><updated>2020-07-18T00:00:00+00:00</updated><id>https://saveriomiroddi.github.io/announcement-added-a-separate-feed-for-mysql-topics</id><content type="html" xml:base="https://saveriomiroddi.github.io/announcement-added-a-separate-feed-for-mysql-topics/"><![CDATA[<p>Yesterday (17/Jul/2020) I’ve added a separate feed for MySQL topics. It can be accessed at the address <a href="https://saveriomiroddi.github.io/feed/mysql.xml">https://saveriomiroddi.github.io/feed/mysql.xml</a> (also represented by the dolphin icon in the navigation bar of the blog).</p>]]></content><author><name></name></author><category term="mysql" /><category term="announcement" /><category term="mysql" /><summary type="html"><![CDATA[Yesterday (17/Jul/2020) I’ve added a separate feed for MySQL topics. 
It can be accessed at the address https://saveriomiroddi.github.io/feed/mysql.xml (also represented by the dolphin icon in the navigation bar of the blog).]]></summary></entry><entry><title type="html">Modern approaches to replacing accumulation user-defined variable hacks, via MySQL 8.0 Window functions and CTEs</title><link href="https://saveriomiroddi.github.io/Modern-approaches-to-replacing-accumulation-user-defined-variable-hacks-via-mysql-8.0-window-functions-and-ctes/" rel="alternate" type="text/html" title="Modern approaches to replacing accumulation user-defined variable hacks, via MySQL 8.0 Window functions and CTEs" /><published>2020-06-06T00:00:00+00:00</published><updated>2020-06-06T00:00:00+00:00</updated><id>https://saveriomiroddi.github.io/Modern-approaches-to-replacing-accumulation-user-defined-variable-hacks-via-mysql-8.0-window-functions-and-ctes</id><content type="html" xml:base="https://saveriomiroddi.github.io/Modern-approaches-to-replacing-accumulation-user-defined-variable-hacks-via-mysql-8.0-window-functions-and-ctes/"><![CDATA[<p>A common MySQL strategy to perform updates with accumulating functions is to employ user-defined variables, using the <code class="language-plaintext highlighter-rouge">UPDATE [...] SET mycol = (@myvar := EXPRESSION(@myvar, mycol))</code> pattern.</p>

<p>This pattern, though, doesn’t play well with the optimizer (leading to non-deterministic behavior), so it has been deprecated. This leaves a sort of void, since the (relatively) sophisticated logic is now harder to reproduce, at least with the same simplicity.</p>

<p>In this article, I’ll have a look at two ways to apply such logic: using, canonically, window functions, and, a bit more creatively, using recursive CTEs.</p>

<ul>
  <li><a href="/Modern-approaches-to-replacing-accumulation-user-defined-variable-hacks-via-mysql-8.0-window-functions-and-ctes#requirements-and-background">Requirements and background</a></li>
  <li><a href="/Modern-approaches-to-replacing-accumulation-user-defined-variable-hacks-via-mysql-8.0-window-functions-and-ctes#the-problem">The problem</a></li>
  <li><a href="/Modern-approaches-to-replacing-accumulation-user-defined-variable-hacks-via-mysql-8.0-window-functions-and-ctes#setup">Setup</a></li>
  <li><a href="/Modern-approaches-to-replacing-accumulation-user-defined-variable-hacks-via-mysql-8.0-window-functions-and-ctes#the-old-school-approach">The old-school approach</a></li>
  <li><a href="/Modern-approaches-to-replacing-accumulation-user-defined-variable-hacks-via-mysql-8.0-window-functions-and-ctes#modern-approach-1-window-functions">Modern approach #1: Window functions</a>
    <ul>
      <li><a href="/Modern-approaches-to-replacing-accumulation-user-defined-variable-hacks-via-mysql-8.0-window-functions-and-ctes#high-level-logic">High-level logic</a></li>
      <li><a href="/Modern-approaches-to-replacing-accumulation-user-defined-variable-hacks-via-mysql-8.0-window-functions-and-ctes#lag-window-function"><code class="language-plaintext highlighter-rouge">LAG()</code> window function</a></li>
      <li><a href="/Modern-approaches-to-replacing-accumulation-user-defined-variable-hacks-via-mysql-8.0-window-functions-and-ctes#technical-aspects">Technical aspects</a></li>
      <li><a href="/Modern-approaches-to-replacing-accumulation-user-defined-variable-hacks-via-mysql-8.0-window-functions-and-ctes#named-windows">Named windows</a></li>
      <li><a href="/Modern-approaches-to-replacing-accumulation-user-defined-variable-hacks-via-mysql-8.0-window-functions-and-ctes#partition-by-clause"><code class="language-plaintext highlighter-rouge">PARTITION BY</code> clause</a></li>
      <li><a href="/Modern-approaches-to-replacing-accumulation-user-defined-variable-hacks-via-mysql-8.0-window-functions-and-ctes#ordering">Ordering</a></li>
      <li><a href="/Modern-approaches-to-replacing-accumulation-user-defined-variable-hacks-via-mysql-8.0-window-functions-and-ctes#considerations">Considerations</a></li>
    </ul>
  </li>
  <li><a href="/Modern-approaches-to-replacing-accumulation-user-defined-variable-hacks-via-mysql-8.0-window-functions-and-ctes#modern-approach-2-recursive-cte">Modern approach #2: Recursive CTE</a>
    <ul>
      <li><a href="/Modern-approaches-to-replacing-accumulation-user-defined-variable-hacks-via-mysql-8.0-window-functions-and-ctes#working-version">Working version</a></li>
      <li><a href="/Modern-approaches-to-replacing-accumulation-user-defined-variable-hacks-via-mysql-8.0-window-functions-and-ctes#performance-considerations">Performance considerations</a></li>
      <li><a href="/Modern-approaches-to-replacing-accumulation-user-defined-variable-hacks-via-mysql-8.0-window-functions-and-ctes#alternative-for-suboptimal-plans">Alternative for suboptimal plans</a></li>
    </ul>
  </li>
  <li><a href="/Modern-approaches-to-replacing-accumulation-user-defined-variable-hacks-via-mysql-8.0-window-functions-and-ctes#conclusion">Conclusion</a></li>
</ul>

<h2 id="requirements-and-background">Requirements and background</h2>

<p>Although CTEs are fairly intuitive, I advise those unfamiliar with them to read my <a href="/Generating-sequences-ranges-via-mysql-8.0-ctes/">previous post on the subject</a>.</p>

<p>The same applies to window functions: I will break the queries and concepts down, but it’s advisable to have at least a basic idea of how they work. There is a vast amount of literature about window functions (which is the reason why I haven’t written about them until now); pretty much all the tutorials use as examples either corporate budgets or populations/countries. Here, instead, I’ll use a real-world case.</p>

<p>As for the software, MySQL 8.0.19 is convenient (but not required). All the statements need to be run in the same console, since they reuse <code class="language-plaintext highlighter-rouge">@venue_id</code>.</p>

<p>There is always an architectural dilemma between placing the logic at the application level and placing it at the database level. Although this is a legitimate debate, in this context the underlying assumption is that it’s <em>necessary</em> that the logic stays at the database level; one reason for such a requirement can be, for example, speed, which has actually been our case.</p>

<h2 id="the-problem">The problem</h2>

<p>In this problem, we manage venue (theater) seats.</p>

<p>As a business requirement, we need to assign a “grouping”: an additional number representing each seat.</p>

<p>In order to set the grouping value:</p>

<ol>
  <li>start with grouping 0, and the top left seat;</li>
  <li>if there is a space between the previous and current seat, or if it’s a new row, increase the grouping by 2 (unless it’s the very first seat); otherwise, increase it by 1;</li>
  <li>assign the grouping to the seat;</li>
  <li>move to the next seat in the same row, or to the next row (if the row is over), and iterate from point 2., until the seats are exhausted.</li>
</ol>

<p>In pseudocode:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>current_grouping = 0

for each row:
  for each number:
    if (is_there_a_space_after_last_seat or is_a_new_row) and is_not_the_first_seat:
      current_grouping += 2
    else:
      current_grouping += 1

    seat.grouping = current_grouping
</code></pre></div></div>
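<p>As a sanity check of the rules, the pseudocode can be turned into a small runnable sketch. This is plain Python, not part of the MySQL solution; the function name <code class="language-plaintext highlighter-rouge">assign_groupings</code> and the representation of seats as <code class="language-plaintext highlighter-rouge">(y, x)</code> tuples are assumptions made for illustration only.</p>

```python
def assign_groupings(seats):
    """Return {(y, x): grouping} for seats given as (y, x) tuples."""
    groupings = {}
    current_grouping = 0
    previous = None  # (y, x) of the previously visited seat, if any

    # Sorting (y, x) tuples visits the seats row by row, left to right.
    for y, x in sorted(seats):
        is_first_seat = previous is None
        is_new_row = not is_first_seat and y != previous[0]
        has_gap = not is_first_seat and x > previous[1] + 1

        # A gap or a new row adds 2; the first seat and adjacent seats add 1.
        current_grouping += 2 if (is_new_row or has_gap) else 1
        groupings[(y, x)] = current_grouping
        previous = (y, x)

    return groupings


# The five seats used as the example layout in this article.
seats = [(0, 0), (0, 1), (1, 0), (1, 2), (2, 0)]
result = assign_groupings(seats)
print(result)
```

<p>For the five example seats, this yields the groupings 1, 2, 4, 6 and 8.</p>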

<p>In practice, we want the setup on the left to have the corresponding values on the right:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  x→  0   1   2        0   1   2
y   ╭───┬───┬───╮    ╭───┬───┬───╮
↓ 0 │ x │ x │   │    │ 1 │ 2 │   │
    ├───┼───┼───┤    ├───┼───┼───┤
  1 │ x │   │ x │    │ 4 │   │ 6 │
    ├───┼───┼───┤    ├───┼───┼───┤
  2 │ x │   │   │    │ 8 │   │   │
    ╰───┴───┴───╯    ╰───┴───┴───╯
</code></pre></div></div>

<h2 id="setup">Setup</h2>

<p>Let’s use a minimalist design for the underlying table:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">seats</span> <span class="p">(</span>
  <span class="n">id</span>         <span class="nb">INT</span> <span class="n">AUTO_INCREMENT</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
  <span class="n">venue_id</span>   <span class="nb">INT</span><span class="p">,</span>
  <span class="n">y</span>          <span class="nb">INT</span><span class="p">,</span>
  <span class="n">x</span>          <span class="nb">INT</span><span class="p">,</span>
  <span class="nv">`row`</span>      <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">16</span><span class="p">),</span>
  <span class="n">number</span>     <span class="nb">INT</span><span class="p">,</span>
  <span class="nv">`grouping`</span> <span class="nb">INT</span><span class="p">,</span>
  <span class="k">UNIQUE</span> <span class="n">venue_id_y_x</span> <span class="p">(</span><span class="n">venue_id</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span>
<span class="p">);</span>
</code></pre></div></div>

<p>We won’t need the <code class="language-plaintext highlighter-rouge">row</code>/<code class="language-plaintext highlighter-rouge">number</code> columns; however, we don’t want to use a table whose records are fully contained in an index, in order to be closer to a real-world setting.</p>

<p>Based on the diagram of the previous section, the seat coordinates are, in the form <code class="language-plaintext highlighter-rouge">(y, x)</code>:</p>

<ul>
  <li>(0, 0), (0, 1)</li>
  <li>(1, 0), (1, 2)</li>
  <li>(2, 0)</li>
</ul>

<p>Note that we’re using <code class="language-plaintext highlighter-rouge">y</code> as first coordinate, because it makes it easier to reason in terms of rows.</p>

<p>We’re going to load a large enough number of records, in order to make sure the optimizer doesn’t take unexpected shortcuts. We use recursive CTEs, of course 😉:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">seats</span><span class="p">(</span><span class="n">venue_id</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="nv">`row`</span><span class="p">,</span> <span class="n">number</span><span class="p">)</span>
<span class="k">WITH</span> <span class="k">RECURSIVE</span> <span class="n">venue_ids</span> <span class="p">(</span><span class="n">id</span><span class="p">)</span> <span class="k">AS</span>
<span class="p">(</span>
  <span class="k">SELECT</span> <span class="mi">0</span>
  <span class="k">UNION</span> <span class="k">ALL</span>
  <span class="k">SELECT</span> <span class="n">id</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">FROM</span> <span class="n">venue_ids</span> <span class="k">WHERE</span> <span class="n">id</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">&lt;</span> <span class="mi">100000</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="cm">/*+ SET_VAR(cte_max_recursion_depth = 1M) */</span>
  <span class="n">v</span><span class="p">.</span><span class="n">id</span><span class="p">,</span>
  <span class="k">c</span><span class="p">.</span><span class="n">y</span><span class="p">,</span> <span class="k">c</span><span class="p">.</span><span class="n">x</span><span class="p">,</span>
  <span class="nb">CHAR</span><span class="p">(</span><span class="n">ORD</span><span class="p">(</span><span class="s1">'A'</span><span class="p">)</span> <span class="o">+</span> <span class="n">FLOOR</span><span class="p">(</span><span class="n">RAND</span><span class="p">()</span> <span class="o">*</span> <span class="mi">3</span><span class="p">)</span> <span class="k">USING</span> <span class="n">ASCII</span><span class="p">)</span> <span class="nv">`row`</span><span class="p">,</span>
  <span class="n">FLOOR</span><span class="p">(</span><span class="n">RAND</span><span class="p">()</span> <span class="o">*</span> <span class="mi">3</span><span class="p">)</span> <span class="nv">`number`</span>
<span class="k">FROM</span> <span class="n">venue_ids</span> <span class="n">v</span>
     <span class="k">JOIN</span> <span class="p">(</span>
       <span class="k">VALUES</span>
         <span class="k">ROW</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span>
         <span class="k">ROW</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
         <span class="k">ROW</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span>
         <span class="k">ROW</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span>
         <span class="k">ROW</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
     <span class="p">)</span> <span class="k">c</span> <span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span>
<span class="p">;</span>

<span class="k">ANALYZE</span> <span class="k">TABLE</span> <span class="n">seats</span><span class="p">;</span>
</code></pre></div></div>

<p>A couple of notes:</p>

<ol>
  <li>we’re using the CTEs in a (hopefully!) interesting way - each cycle represents a venue id, but since we want multiple seats to be generated for each venue (cycle), we cross join with a table including the seats data;</li>
  <li>we’re using the v8.0.19’s row constructor (<code class="language-plaintext highlighter-rouge">VALUES ROW()...</code>) in order to represent a (joinable) table without actually creating it;</li>
  <li>we generate random <code class="language-plaintext highlighter-rouge">row</code>/<code class="language-plaintext highlighter-rouge">number</code> data, as they’re filler;</li>
  <li>for simplicity, no tweaks have been applied (e.g. data types are wider than needed, the indexes are added before the records are inserted, etc.).</li>
</ol>

<h2 id="the-old-school-approach">The old-school approach</h2>

<p>The old-school solution is very straightforward:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SET</span> <span class="o">@</span><span class="n">venue_id</span> <span class="o">=</span> <span class="mi">5000</span><span class="p">;</span> <span class="c1">-- arbitrary venue id; any (stored) id will do</span>

<span class="k">SET</span> <span class="o">@</span><span class="k">grouping</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="k">SET</span> <span class="o">@</span><span class="n">y</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="k">SET</span> <span class="o">@</span><span class="n">x</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>

<span class="k">WITH</span> <span class="n">seat_groupings</span> <span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="nv">`grouping`</span><span class="p">,</span> <span class="n">tmp_y</span><span class="p">,</span> <span class="n">tmp_x</span><span class="p">)</span> <span class="k">AS</span>
<span class="p">(</span>
  <span class="k">SELECT</span>
    <span class="n">id</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span>
    <span class="o">@</span><span class="k">grouping</span> <span class="p">:</span><span class="o">=</span> <span class="o">@</span><span class="k">grouping</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">+</span> <span class="p">(</span><span class="n">seats</span><span class="p">.</span><span class="n">x</span> <span class="o">&gt;</span> <span class="o">@</span><span class="n">x</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">OR</span> <span class="n">seats</span><span class="p">.</span><span class="n">y</span> <span class="o">!=</span> <span class="o">@</span><span class="n">y</span><span class="p">),</span>
    <span class="o">@</span><span class="n">y</span> <span class="p">:</span><span class="o">=</span> <span class="n">seats</span><span class="p">.</span><span class="n">y</span><span class="p">,</span>
    <span class="o">@</span><span class="n">x</span> <span class="p">:</span><span class="o">=</span> <span class="n">seats</span><span class="p">.</span><span class="n">x</span>
  <span class="k">FROM</span> <span class="n">seats</span>
  <span class="k">WHERE</span> <span class="n">venue_id</span> <span class="o">=</span> <span class="o">@</span><span class="n">venue_id</span>
  <span class="k">ORDER</span> <span class="k">BY</span> <span class="n">y</span><span class="p">,</span> <span class="n">x</span>
<span class="p">)</span>
<span class="k">UPDATE</span>
  <span class="n">seats</span> <span class="n">s</span>
  <span class="k">JOIN</span> <span class="n">seat_groupings</span> <span class="n">sg</span> <span class="k">USING</span> <span class="p">(</span><span class="n">id</span><span class="p">)</span>
<span class="k">SET</span> <span class="n">s</span><span class="p">.</span><span class="k">grouping</span> <span class="o">=</span> <span class="n">sg</span><span class="p">.</span><span class="k">grouping</span>
<span class="p">;</span>

<span class="c1">-- Query OK, 5 rows affected, 3 warnings (0,00 sec)</span>
</code></pre></div></div>

<p>Nice and easy (but keep in mind the warnings)!</p>

<p>A little side note: I’m taking advantage of boolean arithmetic properties here; specifically, the following statements are equivalent:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">seats</span><span class="p">.</span><span class="n">x</span> <span class="o">&gt;</span> <span class="o">@</span><span class="n">x</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">OR</span> <span class="n">seats</span><span class="p">.</span><span class="n">y</span> <span class="o">!=</span> <span class="o">@</span><span class="n">y</span> <span class="nv">`increment`</span><span class="p">;</span>

<span class="k">SELECT</span> <span class="n">IF</span> <span class="p">(</span>
  <span class="n">seats</span><span class="p">.</span><span class="n">x</span> <span class="o">&gt;</span> <span class="o">@</span><span class="n">x</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">OR</span> <span class="n">seats</span><span class="p">.</span><span class="n">y</span> <span class="o">!=</span> <span class="o">@</span><span class="n">y</span><span class="p">,</span>
  <span class="mi">1</span><span class="p">,</span>
  <span class="mi">0</span>
<span class="p">)</span> <span class="nv">`increment`</span><span class="p">;</span>
</code></pre></div></div>

<p>Some people find it intuitive, some don’t - it’s a matter of taste; now that it’s been clarified, I will use it for the rest of the article, for compactness.</p>
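<p>To make the boolean arithmetic concrete, here is a minimal Python sketch (an illustration, not MySQL) that replays the accumulation of the old-school query on the five example seats. The Python variables mirror <code class="language-plaintext highlighter-rouge">@grouping</code>, <code class="language-plaintext highlighter-rouge">@y</code> and <code class="language-plaintext highlighter-rouge">@x</code>, including their starting value of -1; Python, like MySQL, treats a true condition as 1 and a false one as 0 in arithmetic.</p>

```python
seats = [(0, 0), (0, 1), (1, 0), (1, 2), (2, 0)]  # (y, x), already ordered

# Same starting values as the @grouping, @y and @x variables.
grouping, prev_y, prev_x = -1, -1, -1
groupings = []

for y, x in seats:
    condition = x > prev_x + 1 or y != prev_y
    # A boolean used in arithmetic counts as 1 (true) or 0 (false),
    # so it is interchangeable with the explicit IF(..., 1, 0) form.
    assert condition + 0 == (1 if condition else 0)
    grouping = grouping + 1 + condition
    prev_y, prev_x = y, x
    groupings.append(grouping)

print(groupings)
```

<p>Starting from -1 is what makes the very first seat land on grouping 1: its row differs from the initial row value, so it receives the extra increment.</p>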

<p>Let’s see the outcome:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="nv">`grouping`</span> <span class="k">FROM</span> <span class="n">seats</span> <span class="k">WHERE</span> <span class="n">venue_id</span> <span class="o">=</span> <span class="o">@</span><span class="n">venue_id</span> <span class="k">ORDER</span> <span class="k">BY</span> <span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">;</span>

<span class="c1">-- +-------+------+------+----------+</span>
<span class="c1">-- | id    | y    | x    | grouping |</span>
<span class="c1">-- +-------+------+------+----------+</span>
<span class="c1">-- | 24887 |    0 |    0 |        1 |</span>
<span class="c1">-- | 27186 |    0 |    1 |        2 |</span>
<span class="c1">-- | 29485 |    1 |    0 |        4 |</span>
<span class="c1">-- | 31784 |    1 |    2 |        6 |</span>
<span class="c1">-- | 34083 |    2 |    0 |        8 |</span>
<span class="c1">-- +-------+------+------+----------+</span>
</code></pre></div></div>

<p>This approach is ideal!</p>

<p>It has just a “small” defect: it may work… or not.</p>

<p>The reason is that the query optimizer doesn’t necessarily evaluate left to right, so the assignment operations (<code class="language-plaintext highlighter-rouge">:=</code>) may be evaluated out of order, causing the result to be wrong. This is a problem typically experienced after MySQL upgrades.</p>

<p>As of MySQL 8.0, this functionality is indeed deprecated:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- To be run immediately after the UPDATE.</span>
<span class="c1">--</span>
<span class="k">SHOW</span> <span class="n">WARNINGS</span><span class="err">\</span><span class="k">G</span>
<span class="c1">-- *************************** 1. row ***************************</span>
<span class="c1">--   Level: Warning</span>
<span class="c1">--    Code: 1287</span>
<span class="c1">-- Message: Setting user variables within expressions is deprecated and will be removed in a future release. Consider alternatives: 'SET variable=expression, ...', or 'SELECT expression(s) INTO variables(s)'.</span>
<span class="c1">-- [...]</span>
</code></pre></div></div>

<p>Let’s fix this!</p>

<h2 id="modern-approach-1-window-functions">Modern approach #1: Window functions</h2>

<p>Window functions have been a long-awaited functionality in the MySQL world.</p>

<p>Generally speaking, the “rolling” nature of window functions fits accumulating functions very well. However, some complex accumulating functions require the result of the expression computed for the previous row to be available, which is something window functions don’t support, since they work on a column basis.</p>

<p>This doesn’t mean that the problem can’t be solved; rather, that it needs to be rethought.</p>

<p>In this case, we split the problem into two concepts; we think of the grouping value for each seat as the sum of two values:</p>

<ul>
  <li>the sequence number of each seat, and</li>
  <li>the cumulative value of the increments of all the seats up to the current one.</li>
</ul>

<p>Those familiar with window functions will recognize the patterns here 🙂</p>

<p>The sequence number of each seat is provided by a built-in function:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ROW_NUMBER</span><span class="p">()</span> <span class="n">OVER</span> <span class="o">&lt;</span><span class="k">window</span><span class="o">&gt;</span>
</code></pre></div></div>

<p>The cumulative value is where things get interesting. In order to accomplish this task, we perform two steps:</p>

<ol>
  <li>we calculate each seat increment, and put it on a table (or CTE),</li>
  <li>then, for each seat, we use a window function to sum the increments up to that seat.</li>
</ol>
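<p>Before looking at the SQL, the decomposition can be sketched outside the database. The Python snippet below is only an illustration of the idea (position plus running total of the increments), using the five example seats; the variable names are arbitrary.</p>

```python
from itertools import accumulate

seats = [(0, 0), (0, 1), (1, 0), (1, 2), (2, 0)]  # (y, x), ordered by y, x

# Step 1: the increment of each seat, compared with the previous one only.
# The first seat has no predecessor, so its increment is 0.
increments = [0]
for (prev_y, prev_x), (y, x) in zip(seats, seats[1:]):
    increments.append(int(x > prev_x + 1 or y != prev_y))

# Step 2: the running total of the increments up to each seat.
cumulative = list(accumulate(increments))

# The grouping is the 1-based position plus the cumulative increment.
positions = range(1, len(seats) + 1)
groupings = [pos + cum for pos, cum in zip(positions, cumulative)]

print(increments)
print(cumulative)
print(groupings)
```

<p>Neither step needs the grouping computed for the previous seat, which is what makes the logic expressible with window functions.</p>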

<p>Let’s see the SQL:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WITH</span>
<span class="n">increments</span> <span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="k">increment</span><span class="p">)</span> <span class="k">AS</span>
<span class="p">(</span>
  <span class="k">SELECT</span>
    <span class="n">id</span><span class="p">,</span>
    <span class="n">x</span> <span class="o">&gt;</span> <span class="n">LAG</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">x</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="n">OVER</span> <span class="n">tzw</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">OR</span> <span class="n">y</span> <span class="o">!=</span> <span class="n">LAG</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span> <span class="n">OVER</span> <span class="n">tzw</span>
  <span class="k">FROM</span> <span class="n">seats</span>
  <span class="k">WHERE</span> <span class="n">venue_id</span> <span class="o">=</span> <span class="o">@</span><span class="n">venue_id</span>
  <span class="k">WINDOW</span> <span class="n">tzw</span> <span class="k">AS</span> <span class="p">(</span><span class="k">ORDER</span> <span class="k">BY</span> <span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">SELECT</span>
  <span class="n">s</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span>
  <span class="n">ROW_NUMBER</span><span class="p">()</span> <span class="n">OVER</span> <span class="n">tzw</span> <span class="o">+</span> <span class="k">SUM</span><span class="p">(</span><span class="k">increment</span><span class="p">)</span> <span class="n">OVER</span> <span class="n">tzw</span> <span class="nv">`grouping`</span>
<span class="k">FROM</span> <span class="n">seats</span> <span class="n">s</span>
     <span class="k">JOIN</span> <span class="n">increments</span> <span class="n">i</span> <span class="k">USING</span> <span class="p">(</span><span class="n">id</span><span class="p">)</span>
<span class="k">WINDOW</span> <span class="n">tzw</span> <span class="k">AS</span> <span class="p">(</span><span class="k">ORDER</span> <span class="k">BY</span> <span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span>
<span class="p">;</span>

<span class="c1">-- +-------+---+---+----------+</span>
<span class="c1">-- | id    | y | x | grouping |</span>
<span class="c1">-- +-------+---+---+----------+</span>
<span class="c1">-- | 24887 | 0 | 0 |        1 |</span>
<span class="c1">-- | 27186 | 0 | 1 |        2 |</span>
<span class="c1">-- | 29485 | 1 | 0 |        4 |</span>
<span class="c1">-- | 31784 | 1 | 2 |        6 |</span>
<span class="c1">-- | 34083 | 2 | 0 |        8 |</span>
<span class="c1">-- +-------+---+---+----------+</span>
</code></pre></div></div>

<p>Nice!</p>

<p>(Note that for simplicity, I’ll omit the <code class="language-plaintext highlighter-rouge">UPDATE</code> from now on.)</p>

<p>Let’s review the query.</p>

<h3 id="high-level-logic">High-level logic</h3>

<p>The CTE (edited):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
  <span class="n">id</span><span class="p">,</span>
  <span class="n">x</span> <span class="o">&gt;</span> <span class="n">LAG</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">x</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="n">OVER</span> <span class="n">tzw</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">OR</span> <span class="n">y</span> <span class="o">!=</span> <span class="n">LAG</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span> <span class="n">OVER</span> <span class="n">tzw</span> <span class="nv">`increment`</span>
<span class="k">FROM</span> <span class="n">seats</span>
<span class="k">WHERE</span> <span class="n">venue_id</span> <span class="o">=</span> <span class="o">@</span><span class="n">venue_id</span>
<span class="k">WINDOW</span> <span class="n">tzw</span> <span class="k">AS</span> <span class="p">(</span><span class="k">ORDER</span> <span class="k">BY</span> <span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span>
<span class="p">;</span>

<span class="c1">-- +-------+-----------+</span>
<span class="c1">-- | id    | increment |</span>
<span class="c1">-- +-------+-----------+</span>
<span class="c1">-- | 24887 |         0 |</span>
<span class="c1">-- | 27186 |         0 |</span>
<span class="c1">-- | 29485 |         1 |</span>
<span class="c1">-- | 31784 |         1 |</span>
<span class="c1">-- | 34083 |         1 |</span>
<span class="c1">-- +-------+-----------+</span>
</code></pre></div></div>

<p>calculates the increments for each seat, compared to the previous (more on <code class="language-plaintext highlighter-rouge">LAG()</code> later). It works purely on each record and the previous; it’s not cumulative.</p>

<p>Now, in order to calculate the cumulative increments, we just use a window function to compute the sum, for and up to each seat:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- (CTE here...)</span>
<span class="k">SELECT</span>
  <span class="n">s</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span>
  <span class="n">ROW_NUMBER</span><span class="p">()</span> <span class="n">OVER</span> <span class="n">tzw</span> <span class="nv">`pos.`</span><span class="p">,</span>
  <span class="k">SUM</span><span class="p">(</span><span class="k">increment</span><span class="p">)</span> <span class="n">OVER</span> <span class="n">tzw</span> <span class="nv">`cum.incr.`</span>
<span class="k">FROM</span> <span class="n">seats</span> <span class="n">s</span>
     <span class="k">JOIN</span> <span class="n">increments</span> <span class="n">i</span> <span class="k">USING</span> <span class="p">(</span><span class="n">id</span><span class="p">)</span>
<span class="k">WINDOW</span> <span class="n">tzw</span> <span class="k">AS</span> <span class="p">(</span><span class="k">ORDER</span> <span class="k">BY</span> <span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">);</span>

<span class="c1">-- +-------+---+---+------+-----------+</span>
<span class="c1">-- | id    | y | x | pos. | cum.incr. | (grouping)</span>
<span class="c1">-- +-------+---+---+------+-----------+</span>
<span class="c1">-- | 24887 | 0 | 0 |    1 |         0 | = 1 + 0 (curr.)</span>
<span class="c1">-- | 27186 | 0 | 1 |    2 |         0 | = 2 + 0 (#24887) + 0 (curr.)</span>
<span class="c1">-- | 29485 | 1 | 0 |    3 |         1 | = 3 + 0 (#24887) + 0 (#27186) + 1 (curr.)</span>
<span class="c1">-- | 31784 | 1 | 2 |    4 |         2 | = 4 + 0 (#24887) + 0 (#27186) + 1 (#29485) + 1 (curr.)</span>
<span class="c1">-- | 34083 | 2 | 1 |    5 |         3 | = 5 + 0 (#24887) + 0 (#27186) + 1 (#29485) + 1 (#31784)↵</span>
<span class="c1">-- +-------+---+---+------+-----------+     + 1 (curr.)</span>
</code></pre></div></div>

<h3 id="lag-window-function"><code class="language-plaintext highlighter-rouge">LAG()</code> window function</h3>

<p>The <code class="language-plaintext highlighter-rouge">LAG</code> function, in its simplest form (<code class="language-plaintext highlighter-rouge">LAG(x)</code>), returns the value of the given column in the previous row of the window. A typical nuisance of window functions is dealing with the first record(s) in the window: since there is no previous record, they return NULL. With <code class="language-plaintext highlighter-rouge">LAG</code>, we can specify the value we want in that case as the third parameter:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">LAG</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">x</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="c1">-- defaults to `x - 1`</span>
<span class="n">LAG</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>     <span class="c1">-- defaults to `y`</span>
</code></pre></div></div>

<p>By specifying the defaults above, we make sure that the very first seat in the window will be treated by the logic as adjacent to the previous one (<code class="language-plaintext highlighter-rouge">x - 1</code>) and in the same row (<code class="language-plaintext highlighter-rouge">y</code>).</p>

<p>The alternative to defaults is typically <code class="language-plaintext highlighter-rouge">IFNULL</code>, which is very intrusive, especially considering the relative complexity of the expression:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Both valid. And both ugly!</span>
<span class="c1">--</span>
<span class="n">IFNULL</span><span class="p">(</span><span class="n">x</span> <span class="o">&gt;</span> <span class="n">LAG</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="n">OVER</span> <span class="n">tzw</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">OR</span> <span class="n">y</span> <span class="o">!=</span> <span class="n">LAG</span><span class="p">(</span><span class="n">y</span><span class="p">)</span> <span class="n">OVER</span> <span class="n">tzw</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">IFNULL</span><span class="p">(</span><span class="n">x</span> <span class="o">&gt;</span> <span class="n">LAG</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="n">OVER</span> <span class="n">tzw</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="k">FALSE</span><span class="p">)</span> <span class="k">OR</span> <span class="n">IFNULL</span><span class="p">(</span><span class="n">y</span> <span class="o">!=</span> <span class="n">LAG</span><span class="p">(</span><span class="n">y</span><span class="p">)</span> <span class="n">OVER</span> <span class="n">tzw</span><span class="p">,</span> <span class="k">FALSE</span><span class="p">)</span>
</code></pre></div></div>

<p>The second <code class="language-plaintext highlighter-rouge">LAG()</code> parameter is the number of positions to go back in the window; <code class="language-plaintext highlighter-rouge">1</code> is the previous, which is also the default value.</p>
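
<p>As a minimal sketch of how the offset and the default interact, the following standalone query runs <code class="language-plaintext highlighter-rouge">LAG()</code> over three inline values. The <code class="language-plaintext highlighter-rouge">t</code> CTE, the <code class="language-plaintext highlighter-rouge">w</code> window and the <code class="language-plaintext highlighter-rouge">-1</code> default are illustrative assumptions, not part of the seats schema:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical inline data: x = 1, 2, 3.
WITH t (x) AS (
  SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3
)
SELECT
  x,
  LAG(x)        OVER w `prev`,     -- offset 1 (implicit), no default: NULL on the first row
  LAG(x, 2, -1) OVER w `two_back`  -- offset 2, default -1 when there is no such row
FROM t
WINDOW w AS (ORDER BY x);

-- +---+------+----------+
-- | x | prev | two_back |
-- +---+------+----------+
-- | 1 | NULL |       -1 |
-- | 2 |    1 |       -1 |
-- | 3 |    2 |        1 |
-- +---+------+----------+
</code></pre></div></div>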

<h3 id="technical-aspects">Technical aspects</h3>

<h3 id="named-windows">Named windows</h3>

<p>In this query, we’re using the same window multiple times. The following queries are formally equivalent:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
  <span class="n">id</span><span class="p">,</span>
  <span class="n">x</span> <span class="o">&gt;</span> <span class="n">LAG</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">x</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="n">OVER</span> <span class="n">tzw</span> <span class="o">+</span> <span class="mi">1</span>
    <span class="k">OR</span> <span class="n">y</span> <span class="o">!=</span> <span class="n">LAG</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span> <span class="n">OVER</span> <span class="n">tzw</span>
<span class="k">FROM</span> <span class="n">seats</span>
<span class="k">WHERE</span> <span class="n">venue_id</span> <span class="o">=</span> <span class="o">@</span><span class="n">venue_id</span>
<span class="k">WINDOW</span> <span class="n">tzw</span> <span class="k">AS</span> <span class="p">(</span><span class="k">ORDER</span> <span class="k">BY</span> <span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">);</span>

<span class="k">SELECT</span>
  <span class="n">id</span><span class="p">,</span>
  <span class="n">x</span> <span class="o">&gt;</span> <span class="n">LAG</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">x</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="n">OVER</span> <span class="p">(</span><span class="k">ORDER</span> <span class="k">BY</span> <span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span>
    <span class="k">OR</span> <span class="n">y</span> <span class="o">!=</span> <span class="n">LAG</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span> <span class="n">OVER</span> <span class="p">(</span><span class="k">ORDER</span> <span class="k">BY</span> <span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span>
<span class="k">FROM</span> <span class="n">seats</span>
<span class="k">WHERE</span> <span class="n">venue_id</span> <span class="o">=</span> <span class="o">@</span><span class="n">venue_id</span><span class="p">;</span>
</code></pre></div></div>

<p>However, the latter may cause a suboptimal plan (which I’ve experienced, at least in the past); the optimizer may treat the windows as independent, and iterate them separately.<br />
For this reason, I advise always using named windows, at least when the same window is used more than once.</p>

<h3 id="partition-by-clause"><code class="language-plaintext highlighter-rouge">PARTITION BY</code> clause</h3>

<p>Typically, window functions are executed over a partition, which in this case would be:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
  <span class="n">id</span><span class="p">,</span>
  <span class="n">x</span> <span class="o">&gt;</span> <span class="n">LAG</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">x</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="n">OVER</span> <span class="n">tzw</span> <span class="o">+</span> <span class="mi">1</span>
    <span class="k">OR</span> <span class="n">y</span> <span class="o">!=</span> <span class="n">LAG</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span> <span class="n">OVER</span> <span class="n">tzw</span>
<span class="k">FROM</span> <span class="n">seats</span>
<span class="k">WHERE</span> <span class="n">venue_id</span> <span class="o">=</span> <span class="o">@</span><span class="n">venue_id</span>
<span class="k">WINDOW</span> <span class="n">tzw</span> <span class="k">AS</span> <span class="p">(</span><span class="k">PARTITION</span> <span class="k">BY</span> <span class="n">venue_id</span> <span class="k">ORDER</span> <span class="k">BY</span> <span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">);</span> <span class="c1">-- here!</span>
</code></pre></div></div>

<p>Since the window matches the full set of records (which is filtered by the <code class="language-plaintext highlighter-rouge">WHERE</code> condition), we don’t need to specify the partition.</p>

<p>If we had to run this query over the whole <code class="language-plaintext highlighter-rouge">seats</code> table, then we’d need the partition, so that the window is reset for each <code class="language-plaintext highlighter-rouge">venue_id</code>.</p>
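
<p>The reset can be observed with a small sketch over two venues. The <code class="language-plaintext highlighter-rouge">seats_sample</code> CTE and its values are made up for illustration; the increment expression is the one used above:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical data: two venues, two seats each.
WITH seats_sample (id, venue_id, y, x) AS (
  SELECT 1, 10, 0, 0 UNION ALL
  SELECT 2, 10, 0, 1 UNION ALL
  SELECT 3, 20, 0, 0 UNION ALL
  SELECT 4, 20, 0, 2
)
SELECT
  id, venue_id,
  x &gt; LAG(x, 1, x - 1) OVER tzw + 1 OR y != LAG(y, 1, y) OVER tzw `incr`
FROM seats_sample
WINDOW tzw AS (PARTITION BY venue_id ORDER BY y, x)
ORDER BY id;

-- +----+----------+------+
-- | id | venue_id | incr |
-- +----+----------+------+
-- |  1 |       10 |    0 |
-- |  2 |       10 |    0 |
-- |  3 |       20 |    0 | the LAG() defaults apply again: first seat of venue 20
-- |  4 |       20 |    1 | gap between x = 0 and x = 2
-- +----+----------+------+
</code></pre></div></div>

<p>Without the <code class="language-plaintext highlighter-rouge">PARTITION BY</code>, seat 3 would be compared with seat 2, which belongs to a different venue.</p>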

<h3 id="ordering">Ordering</h3>

<p>In the query, the <code class="language-plaintext highlighter-rouge">ORDER BY</code> is specified at the window level:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
  <span class="n">id</span><span class="p">,</span>
  <span class="n">x</span> <span class="o">&gt;</span> <span class="n">LAG</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">x</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="n">OVER</span> <span class="n">tzw</span> <span class="o">+</span> <span class="mi">1</span>
    <span class="k">OR</span> <span class="n">y</span> <span class="o">!=</span> <span class="n">LAG</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span> <span class="n">OVER</span> <span class="n">tzw</span>
<span class="k">FROM</span> <span class="n">seats</span>
<span class="k">WHERE</span> <span class="n">venue_id</span> <span class="o">=</span> <span class="o">@</span><span class="n">venue_id</span>
<span class="k">WINDOW</span> <span class="n">tzw</span> <span class="k">AS</span> <span class="p">(</span><span class="k">ORDER</span> <span class="k">BY</span> <span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span>
</code></pre></div></div>

<p>The window ordering is separate from the <code class="language-plaintext highlighter-rouge">SELECT</code> one. This is crucial! The behavior of this query:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
  <span class="n">id</span><span class="p">,</span>
  <span class="n">x</span> <span class="o">&gt;</span> <span class="n">LAG</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">x</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="n">OVER</span> <span class="n">tzw</span> <span class="o">+</span> <span class="mi">1</span>
    <span class="k">OR</span> <span class="n">y</span> <span class="o">!=</span> <span class="n">LAG</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span> <span class="n">OVER</span> <span class="n">tzw</span>
<span class="k">FROM</span> <span class="n">seats</span>
<span class="k">WHERE</span> <span class="n">venue_id</span> <span class="o">=</span> <span class="o">@</span><span class="n">venue_id</span>
<span class="k">WINDOW</span> <span class="n">tzw</span> <span class="k">AS</span> <span class="p">()</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">y</span><span class="p">,</span> <span class="n">x</span>
</code></pre></div></div>

<p>is unspecified. Let’s have a look at the <a href="https://dev.mysql.com/doc/refman/8.0/en/window-functions-usage.html">reference manual</a>:</p>

<blockquote>
  <p>Query result rows are determined from the FROM clause, after WHERE, GROUP BY, and HAVING processing, and windowing execution occurs before ORDER BY, LIMIT, and SELECT DISTINCT.</p>
</blockquote>
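
<p>To make the distinction concrete, here is a small sketch (the <code class="language-plaintext highlighter-rouge">t</code> CTE and its values are made up for illustration) where the window ordering and the final ordering differ. The positions are assigned following the window’s <code class="language-plaintext highlighter-rouge">ORDER BY y, x</code>, and only afterwards are the result rows sorted by <code class="language-plaintext highlighter-rouge">id</code>:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical data: the ids are in reverse order with respect to (y, x).
WITH t (id, y, x) AS (
  SELECT 3, 0, 0 UNION ALL
  SELECT 2, 0, 1 UNION ALL
  SELECT 1, 1, 0
)
SELECT id, y, x, ROW_NUMBER() OVER (ORDER BY y, x) `pos`
FROM t
ORDER BY id; -- applied after the window functions have been computed

-- +----+---+---+-----+
-- | id | y | x | pos |
-- +----+---+---+-----+
-- |  1 | 1 | 0 |   3 |
-- |  2 | 0 | 1 |   2 |
-- |  3 | 0 | 0 |   1 |
-- +----+---+---+-----+
</code></pre></div></div>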

<h3 id="considerations">Considerations</h3>

<p>Abstractly speaking, in order to solve this class of problems, instead of representing each entry as a function of the previous one, we calculate the state change for each entry, then sum the changes up.</p>
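
<p>Put together, the two steps can be sketched as a single standalone query. This is only an illustration under a few assumptions: the <code class="language-plaintext highlighter-rouge">seats_sample</code> CTE stands in for the filtered <code class="language-plaintext highlighter-rouge">seats</code> table, using the ids and coordinates from the sample output above, and the <code class="language-plaintext highlighter-rouge">incr</code> column name is arbitrary:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>WITH seats_sample (id, y, x) AS (     -- hypothetical stand-in for the venue's seats
  SELECT 24887, 0, 0 UNION ALL
  SELECT 27186, 0, 1 UNION ALL
  SELECT 29485, 1, 0 UNION ALL
  SELECT 31784, 1, 2 UNION ALL
  SELECT 34083, 2, 1
),
increments (id, incr) AS (            -- step 1: state change of each seat
  SELECT
    id,
    x &gt; LAG(x, 1, x - 1) OVER tzw + 1 OR y != LAG(y, 1, y) OVER tzw
  FROM seats_sample
  WINDOW tzw AS (ORDER BY y, x)
)
SELECT                                -- step 2: sum the changes up to each seat
  s.id, y, x,
  ROW_NUMBER() OVER tzw + SUM(incr) OVER tzw `grouping`
FROM seats_sample s
     JOIN increments i USING (id)
WINDOW tzw AS (ORDER BY y, x)
ORDER BY y, x;

-- +-------+---+---+----------+
-- | id    | y | x | grouping |
-- +-------+---+---+----------+
-- | 24887 | 0 | 0 |        1 |
-- | 27186 | 0 | 1 |        2 |
-- | 29485 | 1 | 0 |        4 |
-- | 31784 | 1 | 2 |        6 |
-- | 34083 | 2 | 1 |        8 |
-- +-------+---+---+----------+
</code></pre></div></div>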

<p>Although more complex than the functionality it replaces, this solution is very solid. This approach, though, may not always be possible, or at least easy, and that’s where the recursive CTE solution comes into play.</p>

<h2 id="modern-approach-2-recursive-cte">Modern approach #2: Recursive CTE</h2>

<p>This approach requires a workaround due to a limitation in MySQL’s CTE functionality, but, on the other hand, it’s a generic, direct solution, and as such, it doesn’t require any rethinking of the approach.</p>

<p>Let’s start from a simplified version of the end query:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- `p_` is for `Previous`, in order to make the conditions a bit more intuitive.</span>
<span class="c1">--</span>
<span class="k">WITH</span> <span class="k">RECURSIVE</span> <span class="n">groupings</span> <span class="p">(</span><span class="n">p_id</span><span class="p">,</span> <span class="n">p_venue_id</span><span class="p">,</span> <span class="n">p_y</span><span class="p">,</span> <span class="n">p_x</span><span class="p">,</span> <span class="n">p_grouping</span><span class="p">)</span> <span class="k">AS</span>
<span class="p">(</span>
  <span class="p">(</span>
    <span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="n">venue_id</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="mi">1</span>
    <span class="k">FROM</span> <span class="n">seats</span>
    <span class="k">WHERE</span> <span class="n">venue_id</span> <span class="o">=</span> <span class="o">@</span><span class="n">venue_id</span>
    <span class="k">ORDER</span> <span class="k">BY</span> <span class="n">y</span><span class="p">,</span> <span class="n">x</span>
    <span class="k">LIMIT</span> <span class="mi">1</span>
  <span class="p">)</span>

  <span class="k">UNION</span> <span class="k">ALL</span>

  <span class="k">SELECT</span>
    <span class="n">s</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">s</span><span class="p">.</span><span class="n">venue_id</span><span class="p">,</span> <span class="n">s</span><span class="p">.</span><span class="n">y</span><span class="p">,</span> <span class="n">s</span><span class="p">.</span><span class="n">x</span><span class="p">,</span>
    <span class="n">p_grouping</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">+</span> <span class="p">(</span><span class="n">s</span><span class="p">.</span><span class="n">x</span> <span class="o">&gt;</span> <span class="n">p_x</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">OR</span> <span class="n">s</span><span class="p">.</span><span class="n">y</span> <span class="o">!=</span> <span class="n">p_y</span><span class="p">)</span>
  <span class="k">FROM</span> <span class="n">groupings</span><span class="p">,</span> <span class="n">seats</span> <span class="n">s</span>
  <span class="k">WHERE</span> <span class="n">s</span><span class="p">.</span><span class="n">venue_id</span> <span class="o">=</span> <span class="n">p_venue_id</span> <span class="k">AND</span> <span class="p">(</span><span class="n">s</span><span class="p">.</span><span class="n">y</span><span class="p">,</span> <span class="n">s</span><span class="p">.</span><span class="n">x</span><span class="p">)</span> <span class="o">&gt;</span> <span class="p">(</span><span class="n">p_y</span><span class="p">,</span> <span class="n">p_x</span><span class="p">)</span>
  <span class="k">ORDER</span> <span class="k">BY</span> <span class="n">s</span><span class="p">.</span><span class="n">venue_id</span><span class="p">,</span> <span class="n">s</span><span class="p">.</span><span class="n">y</span><span class="p">,</span> <span class="n">s</span><span class="p">.</span><span class="n">x</span>
  <span class="k">LIMIT</span> <span class="mi">1</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">groupings</span><span class="p">;</span>
</code></pre></div></div>

<p>Bingo! This query is (relatively) simple, but most importantly, it expresses the grouping accumulating function in the simplest possible way:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">p_grouping</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">+</span> <span class="p">(</span><span class="n">s</span><span class="p">.</span><span class="n">x</span> <span class="o">&gt;</span> <span class="n">p_x</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">OR</span> <span class="n">s</span><span class="p">.</span><span class="n">y</span> <span class="o">!=</span> <span class="n">p_y</span><span class="p">)</span>

<span class="c1">-- the above is equivalent to:</span>

<span class="o">@</span><span class="k">grouping</span> <span class="p">:</span><span class="o">=</span> <span class="o">@</span><span class="k">grouping</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">+</span> <span class="p">(</span><span class="n">seats</span><span class="p">.</span><span class="n">x</span> <span class="o">&gt;</span> <span class="o">@</span><span class="n">x</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">OR</span> <span class="n">seats</span><span class="p">.</span><span class="n">y</span> <span class="o">!=</span> <span class="o">@</span><span class="n">y</span><span class="p">),</span>
<span class="o">@</span><span class="n">y</span> <span class="p">:</span><span class="o">=</span> <span class="n">seats</span><span class="p">.</span><span class="n">y</span><span class="p">,</span>
<span class="o">@</span><span class="n">x</span> <span class="p">:</span><span class="o">=</span> <span class="n">seats</span><span class="p">.</span><span class="n">x</span>
</code></pre></div></div>

<p>Even for those who are not accustomed to CTEs, the logic is simple.</p>

<p>The initial row is the first seat of the venue, in order:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="n">venue_id</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="mi">1</span>
<span class="k">FROM</span> <span class="n">seats</span>
<span class="k">WHERE</span> <span class="n">venue_id</span> <span class="o">=</span> <span class="o">@</span><span class="n">venue_id</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">y</span><span class="p">,</span> <span class="n">x</span>
<span class="k">LIMIT</span> <span class="mi">1</span>
</code></pre></div></div>

<p>In the recursive part, we proceed with the iteration:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>
  <span class="n">s</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">s</span><span class="p">.</span><span class="n">venue_id</span><span class="p">,</span> <span class="n">s</span><span class="p">.</span><span class="n">y</span><span class="p">,</span> <span class="n">s</span><span class="p">.</span><span class="n">x</span><span class="p">,</span>
  <span class="n">p_grouping</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">+</span> <span class="p">(</span><span class="n">s</span><span class="p">.</span><span class="n">x</span> <span class="o">&gt;</span> <span class="n">p_x</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">OR</span> <span class="n">s</span><span class="p">.</span><span class="n">y</span> <span class="o">!=</span> <span class="n">p_y</span><span class="p">)</span>
<span class="k">FROM</span> <span class="n">groupings</span><span class="p">,</span> <span class="n">seats</span> <span class="n">s</span>
<span class="k">WHERE</span> <span class="n">s</span><span class="p">.</span><span class="n">venue_id</span> <span class="o">=</span> <span class="n">p_venue_id</span> <span class="k">AND</span> <span class="p">(</span><span class="n">s</span><span class="p">.</span><span class="n">y</span><span class="p">,</span> <span class="n">s</span><span class="p">.</span><span class="n">x</span><span class="p">)</span> <span class="o">&gt;</span> <span class="p">(</span><span class="n">p_y</span><span class="p">,</span> <span class="n">p_x</span><span class="p">)</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">s</span><span class="p">.</span><span class="n">venue_id</span><span class="p">,</span> <span class="n">s</span><span class="p">.</span><span class="n">y</span><span class="p">,</span> <span class="n">s</span><span class="p">.</span><span class="n">x</span>
<span class="k">LIMIT</span> <span class="mi">1</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">WHERE</code> condition, along with the <code class="language-plaintext highlighter-rouge">ORDER BY</code> and <code class="language-plaintext highlighter-rouge">LIMIT</code> clauses, simply finds the next seat, that is, the first seat with the same venue id that, in <code class="language-plaintext highlighter-rouge">(venue_id, y, x)</code> order, has greater <code class="language-plaintext highlighter-rouge">(y, x)</code> coordinates.</p>
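
<p>The row comparison in the condition is lexicographic: in MySQL, <code class="language-plaintext highlighter-rouge">(a, b) &gt; (x, y)</code> is equivalent to <code class="language-plaintext highlighter-rouge">(a &gt; x) OR ((a = x) AND (b &gt; y))</code>. A quick sketch with literal <code class="language-plaintext highlighter-rouge">(y, x)</code> pairs (the values and aliases are arbitrary):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT
  (1, 0) &gt; (0, 5) `next_row`,  -- 1: a greater y wins, regardless of x
  (1, 2) &gt; (1, 0) `same_row`,  -- 1: same y, greater x
  (1, 0) &gt; (1, 0) `same_seat`; -- 0: equal tuples are not greater
</code></pre></div></div>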

<p>The <code class="language-plaintext highlighter-rouge">s.venue_id</code> part of the ordering is crucial! This allows us to use the index.</p>

<p>The <code class="language-plaintext highlighter-rouge">SELECT</code> clause takes care of:</p>

<ul>
  <li>performing the accumulation (computation of <code class="language-plaintext highlighter-rouge">(p_)grouping</code>),</li>
  <li>passing the values of the current seat (<code class="language-plaintext highlighter-rouge">s.id, s.venue_id, s.y, s.x</code>) to the next cycle.</li>
</ul>

<p>We select <code class="language-plaintext highlighter-rouge">FROM groupings</code> so that we fulfill the requirements for the CTE to be recursive.</p>

<p>What’s interesting here is that we use the recursive CTE essentially as an iterator, via selection from the <code class="language-plaintext highlighter-rouge">groupings</code> table in the recursive subquery, while joining with <code class="language-plaintext highlighter-rouge">seats</code>, in order to find the data to work on.</p>

<p>The JOIN is formally a cross join; however, only one record is returned, due to the <code class="language-plaintext highlighter-rouge">LIMIT</code> clause.</p>
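
<p>For readers new to recursive CTEs, the iterator pattern can be seen in isolation in the sketch below, which involves no tables at all. The <code class="language-plaintext highlighter-rouge">counter</code> CTE and its columns are hypothetical names; each cycle reads the row produced by the previous one and derives the next, exactly like the accumulation on <code class="language-plaintext highlighter-rouge">p_grouping</code>:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>WITH RECURSIVE counter (n, running_total) AS
(
  SELECT 1, 1                            -- initial row

  UNION ALL

  SELECT n + 1, running_total + (n + 1)  -- next row, computed from the previous one
  FROM counter
  WHERE n &lt; 5                            -- the iteration stops when no row is produced
)
SELECT * FROM counter;

-- +---+---------------+
-- | n | running_total |
-- +---+---------------+
-- | 1 |             1 |
-- | 2 |             3 |
-- | 3 |             6 |
-- | 4 |            10 |
-- | 5 |            15 |
-- +---+---------------+
</code></pre></div></div>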

<h3 id="working-version">Working version</h3>

<p>Unfortunately, the above query doesn’t work because the <code class="language-plaintext highlighter-rouge">ORDER BY</code> clause is currently not supported in the recursive subquery; additionally, the semantics of the <code class="language-plaintext highlighter-rouge">LIMIT</code> as used here are not the intended ones, as they <a href="https://dev.mysql.com/doc/refman/8.0/en/with.html#common-table-expressions-recursive-examples">apply to the outermost query</a>:</p>

<blockquote>
  <p>LIMIT is now supported […] The effect on the result set is the same as when using LIMIT in the outermost SELECT</p>
</blockquote>

<p>However, it’s not a significant problem. Let’s have a look at the working version:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WITH</span> <span class="k">RECURSIVE</span> <span class="n">groupings</span> <span class="p">(</span><span class="n">p_id</span><span class="p">,</span> <span class="n">p_venue_id</span><span class="p">,</span> <span class="n">p_y</span><span class="p">,</span> <span class="n">p_x</span><span class="p">,</span> <span class="n">p_grouping</span><span class="p">)</span> <span class="k">AS</span>
<span class="p">(</span>
  <span class="p">(</span>
    <span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="n">venue_id</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="mi">1</span>
    <span class="k">FROM</span> <span class="n">seats</span>
    <span class="k">WHERE</span> <span class="n">venue_id</span> <span class="o">=</span> <span class="o">@</span><span class="n">venue_id</span>
    <span class="k">ORDER</span> <span class="k">BY</span> <span class="n">y</span><span class="p">,</span> <span class="n">x</span>
    <span class="k">LIMIT</span> <span class="mi">1</span>
  <span class="p">)</span>

  <span class="k">UNION</span> <span class="k">ALL</span>

  <span class="k">SELECT</span>
    <span class="n">s</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">s</span><span class="p">.</span><span class="n">venue_id</span><span class="p">,</span> <span class="n">s</span><span class="p">.</span><span class="n">y</span><span class="p">,</span> <span class="n">s</span><span class="p">.</span><span class="n">x</span><span class="p">,</span>
    <span class="n">p_grouping</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">+</span> <span class="p">(</span><span class="n">s</span><span class="p">.</span><span class="n">x</span> <span class="o">&gt;</span> <span class="n">p_x</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">OR</span> <span class="n">s</span><span class="p">.</span><span class="n">y</span> <span class="o">!=</span> <span class="n">p_y</span><span class="p">)</span>
  <span class="k">FROM</span> <span class="n">groupings</span><span class="p">,</span> <span class="n">seats</span> <span class="n">s</span> <span class="k">WHERE</span> <span class="n">s</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="p">(</span>
    <span class="k">SELECT</span> <span class="n">si</span><span class="p">.</span><span class="n">id</span>
    <span class="k">FROM</span> <span class="n">seats</span> <span class="n">si</span>
    <span class="k">WHERE</span> <span class="n">si</span><span class="p">.</span><span class="n">venue_id</span> <span class="o">=</span> <span class="n">p_venue_id</span> <span class="k">AND</span> <span class="p">(</span><span class="n">si</span><span class="p">.</span><span class="n">y</span><span class="p">,</span> <span class="n">si</span><span class="p">.</span><span class="n">x</span><span class="p">)</span> <span class="o">&gt;</span> <span class="p">(</span><span class="n">p_y</span><span class="p">,</span> <span class="n">p_x</span><span class="p">)</span>
    <span class="k">ORDER</span> <span class="k">BY</span> <span class="n">si</span><span class="p">.</span><span class="n">venue_id</span><span class="p">,</span> <span class="n">si</span><span class="p">.</span><span class="n">y</span><span class="p">,</span> <span class="n">si</span><span class="p">.</span><span class="n">x</span>
    <span class="k">LIMIT</span> <span class="mi">1</span>
  <span class="p">)</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">groupings</span><span class="p">;</span>

<span class="c1">-- +-------+------+------+------------+</span>
<span class="c1">-- | p_id  | p_y  | p_x  | p_grouping |</span>
<span class="c1">-- +-------+------+------+------------+</span>
<span class="c1">-- | 24887 |    0 |    0 |          1 |</span>
<span class="c1">-- | 27186 |    0 |    1 |          2 |</span>
<span class="c1">-- | 29485 |    1 |    0 |          4 |</span>
<span class="c1">-- | 31784 |    1 |    2 |          6 |</span>
<span class="c1">-- | 34083 |    2 |    0 |          8 |</span>
<span class="c1">-- +-------+------+------+------------+</span>
</code></pre></div></div>

<p>It’s a bit of a shame having to use a subquery, but it works, and the boilerplate is minimal, as several clauses are required anyway.</p>

<p>Here, instead of performing the ordering and limiting on the relation resulting from the join of <code class="language-plaintext highlighter-rouge">groupings</code> and <code class="language-plaintext highlighter-rouge">seats</code>, we do it in a subquery, and pass its result to the outer query, which consequently selects only the target record.</p>

<h3 id="performance-considerations">Performance considerations</h3>

<p>Let’s have a look at the query plan, using the <code class="language-plaintext highlighter-rouge">EXPLAIN ANALYZE</code> functionality:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mysql&gt; EXPLAIN ANALYZE WITH RECURSIVE groupings [...]

-&gt; Table scan on groupings  (actual time=0.000..0.001 rows=5 loops=1)
    -&gt; Materialize recursive CTE groupings  (actual time=0.140..0.141 rows=5 loops=1)
        -&gt; Limit: 1 row(s)  (actual time=0.019..0.019 rows=1 loops=1)
            -&gt; Index lookup on seats using venue_id_y_x (venue_id=(@venue_id))  (cost=0.75 rows=5) (actual time=0.018..0.018 rows=1 loops=1)
        -&gt; Repeat until convergence
            -&gt; Nested loop inner join  (cost=3.43 rows=2) (actual time=0.017..0.053 rows=2 loops=2)
                -&gt; Scan new records on groupings  (cost=2.73 rows=2) (actual time=0.001..0.001 rows=2 loops=2)
                -&gt; Filter: (s.id = (select #5))  (cost=0.30 rows=1) (actual time=0.020..0.020 rows=1 loops=5)
                    -&gt; Single-row index lookup on s using PRIMARY (id=(select #5))  (cost=0.30 rows=1) (actual time=0.014..0.014 rows=1 loops=5)
                    -&gt; Select #5 (subquery in condition; dependent)
                        -&gt; Limit: 1 row(s)  (actual time=0.007..0.008 rows=1 loops=9)
                            -&gt; Filter: ((si.y,si.x) &gt; (groupings.p_y,groupings.p_x))  (cost=0.75 rows=5) (actual time=0.007..0.007 rows=1 loops=9)
                                -&gt; Index lookup on si using venue_id_y_x (venue_id=groupings.p_venue_id)  (cost=0.75 rows=5) (actual time=0.006..0.006 rows=4 loops=9)
</code></pre></div></div>

<p>The plan is very much as expected. The foundation of an optimal plan for this case is in the index lookups:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-&gt; Nested loop inner join  (cost=3.43 rows=2) (actual time=0.017..0.053 rows=2 loops=2)
-&gt; Single-row index lookup on s using PRIMARY (id=(select #5))  (cost=0.30 rows=1) (actual time=0.014..0.014 rows=1 loops=5)
-&gt; Index lookup on si using venue_id_y_x (venue_id=groupings.p_venue_id)  (cost=0.75 rows=5) (actual time=0.006..0.006 rows=4 loops=9)
</code></pre></div></div>

<p>These lookups are paramount; if even an index scan is performed (in short, when the index entries are scanned linearly, instead of directly finding the desired one), the performance will tank.</p>

<p>Therefore, the requirements for this strategy to work are that the related indexes are in place <em>and</em> are used by the optimizer very efficiently.</p>
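
<p>A minimal sketch of how these requirements can be checked, assuming the <code class="language-plaintext highlighter-rouge">seats</code> table and the <code class="language-plaintext highlighter-rouge">venue_id_y_x</code> index name shown in the plan above:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Check that the composite index exists; the Seq_in_index column shows the order of its columns.
SHOW INDEXES FROM seats WHERE Key_name = 'venue_id_y_x';

-- Then, in the EXPLAIN ANALYZE output of the recursive query, the entries on `seats`
-- should read "Index lookup" or "Single-row index lookup"; an "Index scan" or
-- "Table scan" entry on `seats` is the warning sign described above.
</code></pre></div></div>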

<p>It’s expected that, in the future, if the restrictions are lifted, not having to use the subquery will make the task considerably simpler for the optimizer.</p>

<h3 id="alternative-for-suboptimal-plans">Alternative for suboptimal plans</h3>

<p>For particular use cases where an optimal plan can’t be found, just use a temporary table:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TEMPORARY</span> <span class="k">TABLE</span> <span class="n">selected_seats</span> <span class="p">(</span>
  <span class="n">id</span> <span class="nb">INT</span> <span class="k">NOT</span> <span class="k">NULL</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
  <span class="n">y</span> <span class="nb">INT</span><span class="p">,</span>
  <span class="n">x</span> <span class="nb">INT</span><span class="p">,</span>
  <span class="k">UNIQUE</span> <span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">x</span>
<span class="k">FROM</span> <span class="n">seats</span> <span class="k">WHERE</span> <span class="n">venue_id</span> <span class="o">=</span> <span class="o">@</span><span class="n">venue_id</span><span class="p">;</span>

<span class="k">WITH</span> <span class="k">RECURSIVE</span>
<span class="n">groupings</span> <span class="p">(</span><span class="n">p_id</span><span class="p">,</span> <span class="n">p_y</span><span class="p">,</span> <span class="n">p_x</span><span class="p">,</span> <span class="n">p_grouping</span><span class="p">)</span> <span class="k">AS</span>
<span class="p">(</span>
  <span class="p">(</span>
    <span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="mi">1</span>
    <span class="k">FROM</span> <span class="n">seats</span>
    <span class="k">WHERE</span> <span class="n">venue_id</span> <span class="o">=</span> <span class="o">@</span><span class="n">venue_id</span>
    <span class="k">ORDER</span> <span class="k">BY</span> <span class="n">y</span><span class="p">,</span> <span class="n">x</span>
    <span class="k">LIMIT</span> <span class="mi">1</span>
  <span class="p">)</span>

  <span class="k">UNION</span> <span class="k">ALL</span>

  <span class="k">SELECT</span>
    <span class="n">s</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">s</span><span class="p">.</span><span class="n">y</span><span class="p">,</span> <span class="n">s</span><span class="p">.</span><span class="n">x</span><span class="p">,</span>
    <span class="n">p_grouping</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">+</span> <span class="p">(</span><span class="n">s</span><span class="p">.</span><span class="n">x</span> <span class="o">&gt;</span> <span class="n">p_x</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">OR</span> <span class="n">s</span><span class="p">.</span><span class="n">y</span> <span class="o">!=</span> <span class="n">p_y</span><span class="p">)</span>
  <span class="k">FROM</span> <span class="n">groupings</span><span class="p">,</span> <span class="n">seats</span> <span class="n">s</span> <span class="k">WHERE</span> <span class="n">s</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="p">(</span>
    <span class="k">SELECT</span> <span class="n">ss</span><span class="p">.</span><span class="n">id</span>
    <span class="k">FROM</span> <span class="n">selected_seats</span> <span class="n">ss</span>
    <span class="k">WHERE</span> <span class="p">(</span><span class="n">ss</span><span class="p">.</span><span class="n">y</span><span class="p">,</span> <span class="n">ss</span><span class="p">.</span><span class="n">x</span><span class="p">)</span> <span class="o">&gt;</span> <span class="p">(</span><span class="n">p_y</span><span class="p">,</span> <span class="n">p_x</span><span class="p">)</span>
    <span class="k">ORDER</span> <span class="k">BY</span> <span class="n">ss</span><span class="p">.</span><span class="n">y</span><span class="p">,</span> <span class="n">ss</span><span class="p">.</span><span class="n">x</span>
    <span class="k">LIMIT</span> <span class="mi">1</span>
    <span class="p">)</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">groupings</span><span class="p">;</span>
</code></pre></div></div>

<p>Even if index scans are performed in this query, they’re very cheap, as the <code class="language-plaintext highlighter-rouge">selected_seats</code> table is very small.</p>
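
<p>To see which access method is chosen on the temporary table, the inner subquery can be examined in isolation. This is only a sketch: the <code class="language-plaintext highlighter-rouge">(0, 0)</code> pair is an arbitrary stand-in for the <code class="language-plaintext highlighter-rouge">(p_y, p_x)</code> values that the recursive query would supply.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- (0, 0) is a placeholder for the (p_y, p_x) of the previous seat.
EXPLAIN FORMAT=TREE
SELECT ss.id
FROM selected_seats ss
WHERE (ss.y, ss.x) &gt; (0, 0)
ORDER BY ss.y, ss.x
LIMIT 1;

-- The temporary table is dropped automatically when the session ends; it can also be dropped explicitly.
DROP TEMPORARY TABLE selected_seats;
</code></pre></div></div>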

<h2 id="conclusion">Conclusion</h2>

<p>I’m very pleased that a very effective but flawed workflow can be replaced with clean (enough) functionalities, which have been brought by MySQL 8.0.</p>

<p>There are still new (underlying) functionalities in development in the 8.0 series, which therefore keeps proving to be a very strong release.</p>

<p>Happy recursion 😄</p>]]></content><author><name></name></author><category term="mysql" /><category term="databases" /><category term="indexes" /><category term="innodb" /><category term="mysql" /><summary type="html"><![CDATA[A common MySQL strategy to perform updates with accumulating functions is to employ user-defined variables, using the UPDATE [...] SET mycol = (@myvar := EXPRESSION(@myvar, mycol)) pattern. This pattern though doesn’t play well with the optimizer (leading to non-deterministic behavior), so it has been deprecated. This left a sort of void, since the (relatively) sophisticated logic is now harder to reproduce, at least with the same simplicity. In this article, I’ll have a look at two ways to apply such logic: using, canonically, window functions, and, a bit more creatively, using recursive CTEs. Requirements and background The problem Setup The old-school approach Modern approach #1: Window functions High-level logic LAG() window function Technical aspects Named windows PARTITION BY clause Ordering Considerations Modern approach #2: Recursive CTE Working version Performance considerations Alternative for suboptimal plans Conclusion]]></summary></entry><entry><title type="html">Storage and Indexed access of denormalized columns (arrays) on MySQL 8.0, via multi-valued indexes</title><link href="https://saveriomiroddi.github.io/Storage-and-indexed-access-of-denormalized-columns-arrays-on-mysql-8.0-via-multi-valued-indexes/" rel="alternate" type="text/html" title="Storage and Indexed access of denormalized columns (arrays) on MySQL 8.0, via multi-valued indexes" /><published>2020-03-16T00:00:00+00:00</published><updated>2020-03-16T00:00:00+00:00</updated><id>https://saveriomiroddi.github.io/Storage-and-indexed-access-of-denormalized-columns-arrays-on-mysql-8.0-via-multi-valued-indexes</id><content type="html" 
xml:base="https://saveriomiroddi.github.io/Storage-and-indexed-access-of-denormalized-columns-arrays-on-mysql-8.0-via-multi-valued-indexes/"><![CDATA[<p>Another “missing and missed” functionality in MySQL is a data type for arrays.</p>

<p>While MySQL is not there yet, it’s now possible to cover a significant use case: storing denormalized columns (or arrays in general), and accessing them via index.</p>

<p>In this article I’ll give some context about denormalized data and indexes, including the workaround for such functionality on MySQL 5.7, and describe how this is (rather) cleanly accomplished on MySQL 8.0.</p>

<ul>
  <li><a href="/Storage-and-indexed-access-of-denormalized-columns-arrays-on-mysql-8.0-via-multi-valued-indexes#terminology">Terminology</a></li>
  <li><a href="/Storage-and-indexed-access-of-denormalized-columns-arrays-on-mysql-8.0-via-multi-valued-indexes#storing-and-indexing-arrays-in-mysql-57-an-approach-and-problems">Storing and indexing arrays in MySQL 5.7: an approach, and problems</a></li>
  <li><a href="/Storage-and-indexed-access-of-denormalized-columns-arrays-on-mysql-8.0-via-multi-valued-indexes#the-mysql-80-implementation-data-type-and-index">The MySQL 8.0 implementation: data type and index</a></li>
  <li><a href="/Storage-and-indexed-access-of-denormalized-columns-arrays-on-mysql-8.0-via-multi-valued-indexes#performance-expectations">Performance expectations</a>
    <ul>
      <li><a href="/Storage-and-indexed-access-of-denormalized-columns-arrays-on-mysql-8.0-via-multi-valued-indexes#why-multiple-arrays-cant-be-indexed">Why multiple arrays can’t be indexed</a></li>
    </ul>
  </li>
  <li><a href="/Storage-and-indexed-access-of-denormalized-columns-arrays-on-mysql-8.0-via-multi-valued-indexes#how-do-i-declare-an-array-unsigned-column">How do I declare an ARRAY UNSIGNED column?</a></li>
  <li><a href="/Storage-and-indexed-access-of-denormalized-columns-arrays-on-mysql-8.0-via-multi-valued-indexes#conclusion">Conclusion</a></li>
  <li><a href="/Storage-and-indexed-access-of-denormalized-columns-arrays-on-mysql-8.0-via-multi-valued-indexes#footnotes">Footnotes</a></li>
</ul>

<h2 id="terminology">Terminology</h2>

<p>Although B-trees are technically inverted indexes, in this context I’ll use the “inverted index” term to describe document-oriented indexes, like PostgreSQL’s GIN or InnoDB’s fulltext index, and I’ll refer to B-trees with their name.</p>

<p>Also, I won’t make any distinction between B-trees and B+trees, using only the “B-tree” term.</p>

<h2 id="storing-and-indexing-arrays-in-mysql-57-an-approach-and-problems">Storing and indexing arrays in MySQL 5.7: an approach, and problems</h2>

<p>MySQL doesn’t have an array data type. This is a fundamental problem in architectures where storing denormalized rows is a requirement, for example, where MySQL is (also) used for data warehousing.</p>

<p>Storage and access are two sides of the same coin: missing optimal storage data structures for a certain class of data almost certainly implies the lack of optimal related algorithms; in this case, it translates to lack of (direct) indexing.</p>

<p>Storing arrays is not a big problem in itself: assuming simple data types, like integers, we can easily adopt the workaround of using a VARCHAR/TEXT column to store the values with an arbitrary separator (space is the most convenient); however, MySQL is (was) not designed to index this scenario.</p>

<p>Again, we can adopt another workaround: fulltext indexes. We can either set the <a href="https://dev.mysql.com/doc/refman/8.0/en/innodb-parameters.html#sysvar_innodb_ft_min_token_size">InnoDB fulltext minimum token size</a> to 1, but this has the downside of being a global setting, or pad the values, which works, although it’s suboptimal in terms of storage.</p>

<p>This is a working solution, if one really needs to: it comes with the downsides of InnoDB’s fulltext index support, which are not few, but it’s good enough.</p>
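
<p>A minimal sketch of the padding variant follows. The table name and the padding width are assumptions made for illustration; the values are zero-padded to four digits so that each token is longer than the default minimum token size (3).</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical table: the "array" is a space-separated list of zero-padded integers.
CREATE TABLE t_pseudo_arrays (
  id      INT PRIMARY KEY AUTO_INCREMENT,
  c_array TEXT NOT NULL,
  FULLTEXT KEY (c_array)
);

INSERT INTO t_pseudo_arrays (c_array)
VALUES
  ('0001 0002 0003'),
  ('0004 0005 0006');

-- Search for the rows whose pseudo-array includes the value 3.
SELECT * FROM t_pseudo_arrays WHERE MATCH (c_array) AGAINST ('0003' IN BOOLEAN MODE);
</code></pre></div></div>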

<h2 id="the-mysql-80-implementation-data-type-and-index">The MySQL 8.0 implementation: data type and index</h2>

<p>MySQL can store arrays since v5.7, through the JSON data type:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Note how we're using the v8.0.19's new `ROW()` construct for inserting multiple rows.</span>
<span class="c1">--</span>
<span class="k">CREATE</span> <span class="k">TEMPORARY</span> <span class="k">TABLE</span> <span class="n">t_json_arrays</span><span class="p">(</span>
  <span class="n">id</span>      <span class="nb">INT</span> <span class="k">PRIMARY</span> <span class="k">KEY</span> <span class="n">AUTO_INCREMENT</span><span class="p">,</span>
  <span class="n">c_array</span> <span class="n">JSON</span> <span class="k">NOT</span> <span class="k">NULL</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="o">*</span>
<span class="k">FROM</span> <span class="p">(</span>
  <span class="k">VALUES</span>
    <span class="k">ROW</span><span class="p">(</span><span class="nv">"[1, 2, 3]"</span><span class="p">),</span>
    <span class="k">ROW</span><span class="p">(</span><span class="n">JSON_ARRAY</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>
<span class="p">)</span> <span class="n">v</span> <span class="p">(</span><span class="n">c_array</span><span class="p">);</span>

<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">t_json_arrays</span><span class="p">;</span>

<span class="c1">-- +----+-----------+</span>
<span class="c1">-- | id | c_array   |</span>
<span class="c1">-- +----+-----------+</span>
<span class="c1">-- |  1 | [1, 2, 3] |</span>
<span class="c1">-- |  2 | [4, 5, 6] |</span>
<span class="c1">-- +----+-----------+</span>
</code></pre></div></div>

<p>We can insert a JSON document (array) either as a string, or using the <code class="language-plaintext highlighter-rouge">JSON_ARRAY</code> function.</p>

<p>Some operators are available for accessing the data stored in the JSON document, e.g. <code class="language-plaintext highlighter-rouge">-&gt;</code>:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Functionality for accessing JSON data</span>
<span class="c1">--</span>
<span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="n">c_array</span> <span class="o">-&gt;</span> <span class="nv">"$[1]"</span> <span class="nv">`array_entry_1`</span> <span class="k">FROM</span> <span class="n">t_json_arrays</span><span class="p">;</span>

<span class="c1">-- +----+---------------+</span>
<span class="c1">-- | id | array_entry_1 |</span>
<span class="c1">-- +----+---------------+</span>
<span class="c1">-- |  1 | 2             |</span>
<span class="c1">-- |  2 | 5             |</span>
<span class="c1">-- +----+---------------+</span>
</code></pre></div></div>

<p>However, indexing was introduced only in v8.0.17, along with new search functionalities:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- This is a functional index.</span>
<span class="c1">--</span>
<span class="k">ALTER</span> <span class="k">TABLE</span> <span class="n">t_json_arrays</span> <span class="k">ADD</span> <span class="k">KEY</span> <span class="p">(</span> <span class="p">(</span><span class="k">CAST</span><span class="p">(</span><span class="n">c_array</span> <span class="o">-&gt;</span> <span class="s1">'$'</span> <span class="k">AS</span> <span class="nb">UNSIGNED</span> <span class="n">ARRAY</span><span class="p">))</span> <span class="p">);</span>

<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">t_json_arrays</span> <span class="k">WHERE</span> <span class="mi">3</span> <span class="n">MEMBER</span> <span class="k">OF</span> <span class="p">(</span><span class="n">c_array</span><span class="p">);</span>

<span class="c1">-- +----+-----------+</span>
<span class="c1">-- | id | c_array   |</span>
<span class="c1">-- +----+-----------+</span>
<span class="c1">-- |  1 | [1, 2, 3] |</span>
<span class="c1">-- +----+-----------+</span>

<span class="k">EXPLAIN</span> <span class="n">FORMAT</span><span class="o">=</span><span class="n">TREE</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">t_json_arrays</span> <span class="k">WHERE</span> <span class="mi">3</span> <span class="n">MEMBER</span> <span class="k">OF</span> <span class="p">(</span><span class="n">c_array</span> <span class="o">-&gt;</span> <span class="s1">'$'</span><span class="p">);</span>

<span class="c1">-- -&gt; Filter: json'3' member of (cast(json_extract(t_json_arrays.c_array,_utf8mb4'$') as unsigned array))  (cost=1.10 rows=1)</span>
<span class="c1">--     -&gt; Index lookup on t_json_arrays using functional_index (cast(json_extract(t_json_arrays.c_array,_utf8mb4'$') as unsigned array)=json'3')  (cost=1.10 rows=1)</span>
</code></pre></div></div>

<p>Note how the <code class="language-plaintext highlighter-rouge">WHERE</code> condition <em>must</em> replicate exactly the functional key part (in this case, <code class="language-plaintext highlighter-rouge">c_array -&gt; '$'</code>).</p>
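
<p>According to the MySQL reference manual, <code class="language-plaintext highlighter-rouge">JSON_CONTAINS()</code> and <code class="language-plaintext highlighter-rouge">JSON_OVERLAPS()</code> are the other functions eligible for multi-valued index usage. A sketch on the same table; whether the index is actually chosen is up to the optimizer, so it’s worth checking with <code class="language-plaintext highlighter-rouge">EXPLAIN</code>:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Rows whose array contains all the given values; matches the row with id 1.
SELECT * FROM t_json_arrays WHERE JSON_CONTAINS(c_array -&gt; '$', CAST('[1, 2]' AS JSON));

-- Rows whose array shares at least one value with the given ones; matches both rows.
SELECT * FROM t_json_arrays WHERE JSON_OVERLAPS(c_array -&gt; '$', CAST('[3, 4]' AS JSON));
</code></pre></div></div>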

<h2 id="performance-expectations">Performance expectations</h2>

<p>According to the <a href="https://dev.mysql.com/worklog/task/?id=8955#tabs-8955-4">functionality worklog</a>, the index is a slightly modified B-tree:</p>

<blockquote>
  <p>In general, multi-valued index is a regular functional index, with the exception that it requires additional handling under the hood on INSERT/UPDATE for multi-valued key parts.</p>
</blockquote>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SHOW</span> <span class="n">INDEXES</span> <span class="k">FROM</span> <span class="n">t_json_arrays</span> <span class="k">WHERE</span> <span class="n">Key_name</span> <span class="k">NOT</span> <span class="k">LIKE</span> <span class="s1">'PRIMARY'</span><span class="err">\</span><span class="k">G</span>

<span class="c1">-- *************************** 1. row ***************************</span>
<span class="c1">--      Table: t_json_arrays</span>
<span class="c1">--   Key_name: functional_index</span>
<span class="c1">-- Index_type: BTREE</span>
<span class="c1">-- [...]</span>
</code></pre></div></div>

<p>Using a simple B-tree for this purpose has advantages and disadvantages that mirror those of inverted indexes, the crucial difference being that the cost of the operations increases linearly with the size of the array stored.</p>

<p>This is because B-trees don’t have optimizations for large/batch insertions (inverted indexes are document-oriented, so it’s expected for insertions to be large); each array entry is one key in the index.</p>

<p>On the other hand, the cost of DML operations is constant<a href="#footnote01">¹</a>; there are no spikes caused by maintenance operations (e.g. <a href="https://www.postgresql.org/docs/current/gin-implementation.html#GIN-FAST-UPDATE">index merging</a>).</p>

<h3 id="why-multiple-arrays-cant-be-indexed">Why multiple arrays can’t be indexed</h3>

<p>An interesting point is that:</p>

<blockquote>
  <p>Only one multi-valued key part is allowed per index, to avoid exponential explosion. E.g if there would be two multi-valued key parts, and server would provide 10 values for each, SE would have to store 100 index records.</p>
</blockquote>

<p>Why is that?</p>

<p>Because there are no convenient data structures for optimizing such a case.</p>

<p>With the current data structure, the tuple <code class="language-plaintext highlighter-rouge">[1, 2], [4, 5]</code> would generate the index keys:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">(1, 4)</code>,</li>
  <li><code class="language-plaintext highlighter-rouge">(1, 5)</code>,</li>
  <li><code class="language-plaintext highlighter-rouge">(2, 4)</code>,</li>
  <li><code class="language-plaintext highlighter-rouge">(2, 5)</code>.</li>
</ul>
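
<p>The keys above are the Cartesian product of the two arrays. As an illustration only (this is not how the storage engine generates them), the same set can be produced with <code class="language-plaintext highlighter-rouge">JSON_TABLE</code>; the aliases are arbitrary:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Each array is expanded into one row per element, then the two row sets are cross joined.
SELECT a.v AS key_part_1, b.v AS key_part_2
FROM
  JSON_TABLE('[1, 2]', '$[*]' COLUMNS (v INT PATH '$')) a,
  JSON_TABLE('[4, 5]', '$[*]' COLUMNS (v INT PATH '$')) b
ORDER BY key_part_1, key_part_2;

-- +------------+------------+
-- | key_part_1 | key_part_2 |
-- +------------+------------+
-- |          1 |          4 |
-- |          1 |          5 |
-- |          2 |          4 |
-- |          2 |          5 |
-- +------------+------------+
</code></pre></div></div>

<p>With two arrays of 10 values each, the same query would return the 100 rows mentioned in the worklog.</p>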

<p>Suppose that we tackled the problem by reducing the keys to a composition of each value of the first array with the second array:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">(1, 4, 5)</code>,</li>
  <li><code class="language-plaintext highlighter-rouge">(2, 4, 5)</code>.</li>
</ul>

<p>we couldn’t efficiently search in both arrays, since the index is only on the first element; for example, searching on:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">1, 4</code></li>
</ul>

<p>could only look up the <code class="language-plaintext highlighter-rouge">1</code> entries, not the <code class="language-plaintext highlighter-rouge">4</code> ones.</p>

<p>Sounds familiar? This is essentially the leftmost string prefix search problem.</p>

<p>The arrays of each tuple can still be independently indexed; such a configuration could probably lead to the <a href="https://dev.mysql.com/doc/refman/8.0/en/index-merge-optimization.html#index-merge-intersection">index merge intersection optimization</a>.</p>
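
<p>A sketch of such a configuration, with one multi-valued index per array; the table, column and index names are hypothetical, and which index (or combination) the optimizer picks is not guaranteed:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CREATE TABLE t_two_arrays (
  id        INT PRIMARY KEY AUTO_INCREMENT,
  c_array_1 JSON NOT NULL,
  c_array_2 JSON NOT NULL,
  -- One multi-valued key part per index, as required.
  KEY idx_array_1 ( (CAST(c_array_1 -&gt; '$' AS UNSIGNED ARRAY)) ),
  KEY idx_array_2 ( (CAST(c_array_2 -&gt; '$' AS UNSIGNED ARRAY)) )
);

INSERT INTO t_two_arrays (c_array_1, c_array_2) VALUES ('[1, 2]', '[4, 5]');

-- Inspect the plan to see which index(es) are used for the combined condition.
EXPLAIN FORMAT=TREE
SELECT * FROM t_two_arrays
WHERE 1 MEMBER OF (c_array_1 -&gt; '$') AND 4 MEMBER OF (c_array_2 -&gt; '$');
</code></pre></div></div>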

<h2 id="how-do-i-declare-an-array-unsigned-column">How do I declare an ARRAY UNSIGNED column?</h2>

<p>We’ve played with array storage and indexing; how about creating a column of UNSIGNED ARRAY data type?</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TEMPORARY</span> <span class="k">TABLE</span> <span class="n">t_json_arrays</span><span class="p">(</span>
  <span class="n">id</span>      <span class="nb">INT</span> <span class="k">PRIMARY</span> <span class="k">KEY</span> <span class="n">AUTO_INCREMENT</span><span class="p">,</span>
  <span class="n">c_array</span> <span class="nb">UNSIGNED</span> <span class="n">ARRAY</span> <span class="k">NOT</span> <span class="k">NULL</span>
<span class="p">);</span>

<span class="c1">-- ERROR 1064 (42000): You have an error in your SQL syntax [...] near 'UNSIGNED ARRAY NOT NULL</span>
</code></pre></div></div>

<p>Ouch! There is currently no such data type. Internally, everything is done via JSON; the worklog explains this:</p>

<blockquote>
  <p>[…] server creates virtual generated column using the typed array field (instead of a regular field) for a function for which is_returns_array() method returns true. This WL adds one such function - CAST(… AS … ARRAY).<br />
The typed array field (Field_typed_array class) essentially is a JSON field, a descendant of Field_json, but it reports itself as a regular field which type is typed array element’s type. […]</p>
</blockquote>

<p>Adding a new data type would require a considerable amount of work; the team’s resources are evidently focused on other functionalities, so they released a good-enough functionality, which, in my opinion, is a balanced choice.</p>

<h2 id="conclusion">Conclusion</h2>

<p>We’re very excited by the introduction of this data type, and we’re in the process of migrating the fulltext indexes used for pseudo-arrays, to JSON-based array columns/indexes; I think this is a very significant step in making MySQL a well-rounded RDBMS, and covers an important use case in applications of a certain size.</p>

<h2 id="footnotes">Footnotes</h2>

<p><a name="footnote01">¹</a>: Insertion cost in B-trees is not constant, however, the maintenance cost (rebalancing) is negligible in this context.</p>]]></content><author><name></name></author><category term="mysql" /><category term="databases" /><category term="data_types" /><category term="indexes" /><category term="innodb" /><category term="mysql" /><summary type="html"><![CDATA[Another “missing and missed” functionality in MySQL is a data type for arrays. While MySQL is not there yet, it’s now possible to cover a significant use case: storing denormalized columns (or arrays in general), and accessing them via index. In this article I’ll give some context about denormalized data and indexes, including the workaround for such functionality on MySQL 5.7, and describe how this is (rather) cleanly accomplished on MySQL 8.0. Terminology Storing and indexing arrays in MySQL 5.7: an approach, and problems The MySQL 8.0 implementation: data type and index Performance expectations Why multiple arrays can’t be indexed How do I declare an ARRAY UNSIGNED column? Conclusion Footnotes]]></summary></entry><entry><title type="html">An introduction to Functional indexes in MySQL 8.0, and their gotchas</title><link href="https://saveriomiroddi.github.io/An-introduction-to-functional-indexes-in-mysql-8.0-and-their-gotchas/" rel="alternate" type="text/html" title="An introduction to Functional indexes in MySQL 8.0, and their gotchas" /><published>2020-03-10T00:00:00+00:00</published><updated>2020-03-25T15:31:00+00:00</updated><id>https://saveriomiroddi.github.io/An-introduction-to-functional-indexes-in-mysql-8.0-and-their-gotchas</id><content type="html" xml:base="https://saveriomiroddi.github.io/An-introduction-to-functional-indexes-in-mysql-8.0-and-their-gotchas/"><![CDATA[<p>Another interesting feature released with MySQL 8.0 is full support for functional indexes.</p>

<p>Although this is not a strictly new concept in the MySQL world (indexed generated columns provided the same functionality), I find it worth reviewing, through some applications, notes and considerations.</p>

<p>All in all, I’m not 100% bought into functional indexes (as opposed to indexed generated columns); I’ll elaborate on this over the course of the article.</p>

<p>As a natural fit, generated columns are included in the article; additionally, some constructs build on <a href="/Generating-sequences-ranges-via-mysql-8.0-ctes/">my previous article</a>, in relation to the subject of CTEs.</p>

<p><em>Updated on 12/Mar/2020: Found another bug.</em></p>

<p>Contents:</p>

<ul>
  <li><a href="/An-introduction-to-functional-indexes-in-mysql-8.0-and-their-gotchas#terminology">Terminology</a></li>
  <li><a href="/An-introduction-to-functional-indexes-in-mysql-8.0-and-their-gotchas#generated-columns-and-their-application-on-json-data">Generated columns, and their application on JSON data</a></li>
  <li><a href="/An-introduction-to-functional-indexes-in-mysql-8.0-and-their-gotchas#functional-indexes">Functional indexes</a></li>
  <li><a href="/An-introduction-to-functional-indexes-in-mysql-8.0-and-their-gotchas#json-functional-index-gotchas">JSON functional index gotchas</a>
    <ul>
      <li><a href="/An-introduction-to-functional-indexes-in-mysql-8.0-and-their-gotchas#expression-exactness">Expression exactness</a></li>
      <li><a href="/An-introduction-to-functional-indexes-in-mysql-8.0-and-their-gotchas#inconsistent-behavior-between-generated-columns-with-index-and-functional-indexes">Inconsistent behavior between generated columns with index, and functional indexes</a></li>
      <li><a href="/An-introduction-to-functional-indexes-in-mysql-8.0-and-their-gotchas#encoding-inconsistency-based-on-the-index-usage">Encoding inconsistency based on the index usage</a></li>
    </ul>
  </li>
  <li><a href="/An-introduction-to-functional-indexes-in-mysql-8.0-and-their-gotchas#an-example-of-functional-index-with-dates">An example of functional index with dates</a>
    <ul>
      <li><a href="/An-introduction-to-functional-indexes-in-mysql-8.0-and-their-gotchas#gotcha-joins-dont-use-functional-key-parts">Gotcha: JOINs don’t use functional key parts</a></li>
    </ul>
  </li>
  <li><a href="/An-introduction-to-functional-indexes-in-mysql-8.0-and-their-gotchas#bugs">Bugs</a>
    <ul>
      <li><a href="/An-introduction-to-functional-indexes-in-mysql-8.0-and-their-gotchas#bug-on-create-table--select">Bug on <code class="language-plaintext highlighter-rouge">CREATE TABLE ... SELECT</code></a></li>
      <li><a href="/An-introduction-to-functional-indexes-in-mysql-8.0-and-their-gotchas#bug-on-load-data-infile">Bug on <code class="language-plaintext highlighter-rouge">LOAD DATA INFILE</code></a></li>
    </ul>
  </li>
  <li><a href="/An-introduction-to-functional-indexes-in-mysql-8.0-and-their-gotchas#conclusion">Conclusion</a></li>
</ul>

<h2 id="terminology">Terminology</h2>

<p>In this article I’ll use the term “Functional index” to refer to indexes both with (8.0) and without (5.7) underlying generated columns.</p>

<p>Where I need to refer to the 8.0 version, I’ll use the term “Functional key part” (even if it may not be entirely appropriate).</p>

<h2 id="generated-columns-and-their-application-on-json-data">Generated columns, and their application on JSON data</h2>

<p>Before explaining functional indexes, I’ll give a brief introduction to generated columns, since the former are built on top of the latter.</p>

<p>A generated column is a column whose content is a function of another column.</p>

<p>Virtual generated columns - the default type - take no storage; the alternative type, “stored”, actually stores the data. In this article I’ll refer exclusively to the virtual ones.</p>

<p>The syntax is simple: in the most minimal form, the definition is <code class="language-plaintext highlighter-rouge">&lt;column_name&gt; &lt;data_type&gt; AS (&lt;function&gt;)</code>.</p>

<p>This is a sample table:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TEMPORARY</span> <span class="k">TABLE</span> <span class="n">t_generated_column</span>
<span class="p">(</span>
  <span class="n">id</span>               <span class="nb">INT</span> <span class="k">NOT</span> <span class="k">NULL</span> <span class="n">AUTO_INCREMENT</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
  <span class="k">parameters</span>       <span class="n">JSON</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
  <span class="n">parameter_serial</span> <span class="nb">CHAR</span><span class="p">(</span><span class="mi">4</span><span class="p">)</span> <span class="k">AS</span> <span class="p">(</span><span class="k">parameters</span> <span class="o">-&gt;&gt;</span> <span class="s1">'$.serial'</span><span class="p">)</span>
<span class="p">);</span>

<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">t_generated_column</span> <span class="p">(</span><span class="k">parameters</span><span class="p">)</span>
<span class="k">VALUES</span>
  <span class="p">(</span><span class="s1">'{"serial": "foo0", "reserved": true}'</span><span class="p">),</span>
  <span class="p">(</span><span class="s1">'{"serial": "bar1", "reserved": false}'</span><span class="p">),</span>
  <span class="p">(</span><span class="s1">'{"serial": "baz2", "reserved": false}'</span><span class="p">);</span>
</code></pre></div></div>

<p>There are a few interesting concepts here.</p>

<p>First, the fact that a JSON column is used to store documents; we’re using MySQL as rudimentary document storage.<br />
This is an interesting use case for generated columns (and likely, the original driver). In a complex enough application, at some point documents may be stored; if their usage is not sophisticated enough to require an external storage engine, MySQL can act as a good enough tool for the job, in order to keep the system architecture as simple as possible.</p>

<p>The way the generated columns are defined, and work, is simple. In this case, the operator <a href="https://dev.mysql.com/doc/refman/5.7/en/json-search-functions.html#operator_json-inline-path"><code class="language-plaintext highlighter-rouge">-&gt;&gt;</code> (JSON inline path)</a> is used, which is a shorthand for <code class="language-plaintext highlighter-rouge">JSON_UNQUOTE(JSON_EXTRACT())</code>. By default, <code class="language-plaintext highlighter-rouge">JSON_EXTRACT</code> includes quotes in the result (for strings), which we don’t require (in this context).</p>
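
<p>As a small illustration of the difference, the three forms can be compared side by side. This is only a sketch against the sample table above; the column aliases are made up for the example:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT
  parameters -&gt; '$.serial'                           AS with_quotes,    -- JSON_EXTRACT(): strings keep their JSON quotes
  parameters -&gt;&gt; '$.serial'                          AS without_quotes, -- unquoting shorthand
  JSON_UNQUOTE(JSON_EXTRACT(parameters, '$.serial')) AS long_form       -- what the shorthand stands for
FROM t_generated_column;
</code></pre></div></div>

<p>For the first row, <code class="language-plaintext highlighter-rouge">with_quotes</code> should display <code class="language-plaintext highlighter-rouge">"foo0"</code> (quotes included), while the other two columns should display <code class="language-plaintext highlighter-rouge">foo0</code>.</p>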

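<p>Finally, a <code class="language-plaintext highlighter-rouge">NOT NULL</code> constraint can’t be placed between the data type and <code class="language-plaintext highlighter-rouge">AS</code> - attempting to do so will return a syntax error; if required, it must follow the generated expression.</p>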

<p>Let’s have a look at how the data looks on <code class="language-plaintext highlighter-rouge">SELECT</code>ion:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">t_generated_column</span><span class="p">;</span>

<span class="c1">-- +----+---------------------------------------+------------------+</span>
<span class="c1">-- | id | parameters                            | parameter_serial |</span>
<span class="c1">-- +----+---------------------------------------+------------------+</span>
<span class="c1">-- |  1 | {"serial": "foo0", "reserved": true}  | foo0             |</span>
<span class="c1">-- |  2 | {"serial": "bar1", "reserved": false} | bar1             |</span>
<span class="c1">-- |  3 | {"serial": "baz2", "reserved": false} | baz2             |</span>
<span class="c1">-- +----+---------------------------------------+------------------+</span>
</code></pre></div></div>

<p>Nice!</p>
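
<p>For reference, the same column can also be written in the longer form, with the optional <code class="language-plaintext highlighter-rouge">GENERATED ALWAYS</code> and <code class="language-plaintext highlighter-rouge">VIRTUAL</code>/<code class="language-plaintext highlighter-rouge">STORED</code> keywords spelled out. The following is a sketch of the stored variant; the table name <code class="language-plaintext highlighter-rouge">t_stored_generated_column</code> is made up for this example:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CREATE TEMPORARY TABLE t_stored_generated_column
(
  id               INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  parameters       JSON NOT NULL,
  -- STORED: the value is computed on write and takes storage, unlike the default VIRTUAL type.
  parameter_serial CHAR(4) GENERATED ALWAYS AS (parameters -&gt;&gt; '$.serial') STORED
);
</code></pre></div></div>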

<h2 id="functional-indexes">Functional indexes</h2>

<p>Storing the data with the intention of unindexed access definitely has use cases; however, in applications where a significant part of the access to this data is performed at the DB layer, indexing will be crucial.</p>

<p>Generated columns can be indexed like any other column - in MySQL 5.7, this was the only way to build a functional index.</p>

<p>This is the previous table, with the index added and sample data:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TEMPORARY</span> <span class="k">TABLE</span> <span class="n">t_indexed_generated_column</span>
<span class="p">(</span>
  <span class="n">id</span>               <span class="nb">INT</span> <span class="k">NOT</span> <span class="k">NULL</span> <span class="n">AUTO_INCREMENT</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
  <span class="k">parameters</span>       <span class="n">JSON</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
  <span class="n">parameter_serial</span> <span class="nb">CHAR</span><span class="p">(</span><span class="mi">4</span><span class="p">)</span> <span class="k">AS</span> <span class="p">(</span><span class="k">parameters</span> <span class="o">-&gt;&gt;</span> <span class="s1">'$.serial'</span><span class="p">),</span>
  <span class="k">KEY</span> <span class="p">(</span><span class="n">parameter_serial</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">WITH</span> <span class="k">RECURSIVE</span> <span class="n">counter</span> <span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="k">AS</span>
<span class="p">(</span>
  <span class="k">SELECT</span> <span class="mi">0</span>
  <span class="k">UNION</span> <span class="k">ALL</span>
  <span class="k">SELECT</span> <span class="n">n</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">FROM</span> <span class="n">counter</span> <span class="k">WHERE</span> <span class="n">n</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">&lt;</span> <span class="mi">100000</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="cm">/*+ SET_VAR(cte_max_recursion_depth = 1M) */</span>
  <span class="n">CONCAT</span><span class="p">(</span><span class="s1">'{"serial": "'</span><span class="p">,</span> <span class="n">HEX</span><span class="p">(</span><span class="n">RANDOM_BYTES</span><span class="p">(</span><span class="mi">2</span><span class="p">)),</span> <span class="s1">'"}'</span><span class="p">)</span> <span class="nv">`parameters`</span>
<span class="k">FROM</span> <span class="n">counter</span><span class="p">;</span>

<span class="k">ANALYZE</span> <span class="k">TABLE</span> <span class="n">t_indexed_generated_column</span><span class="p">;</span>
</code></pre></div></div>

<p>Now we have a means to address the JSON document via index (of course, limited to the specific field):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">EXPLAIN</span> <span class="n">FORMAT</span><span class="o">=</span><span class="n">TREE</span> <span class="k">SELECT</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">t_indexed_generated_column</span> <span class="k">WHERE</span> <span class="n">parameter_serial</span> <span class="o">=</span> <span class="s1">'CAFE'</span><span class="p">;</span>

<span class="c1">-- -&gt; Aggregate: count(0)</span>
<span class="c1">--     -&gt; Index lookup on t_indexed_generated_column using parameter_serial (parameter_serial='CAFE')  (cost=1.10 rows=1)</span>
</code></pre></div></div>

<p>The functionality above also applies to MySQL versions prior to 8.0; however, the latest version lifted a restriction: the backing generated column is not required anymore. A specific name is also given: “Functional key parts”, because indexes can now be composed of both functions and column references.</p>

<p>Behind the scenes, there’s nothing really new; appropriately, the engineers recycled the existing functionality, so that functional indexes are backed by a hidden generated column.</p>

<p>Let’s create the table without the generated column, and fill it with random strings:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TEMPORARY</span> <span class="k">TABLE</span> <span class="n">t_functional_index</span>
<span class="p">(</span>
  <span class="n">id</span>         <span class="nb">INT</span> <span class="k">NOT</span> <span class="k">NULL</span> <span class="n">AUTO_INCREMENT</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
  <span class="k">parameters</span> <span class="n">JSON</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
  <span class="k">KEY</span> <span class="p">(</span> <span class="p">(</span><span class="k">CAST</span><span class="p">(</span><span class="k">parameters</span> <span class="o">-&gt;&gt;</span> <span class="s1">'$.serial'</span> <span class="k">AS</span> <span class="nb">CHAR</span><span class="p">(</span><span class="mi">4</span><span class="p">)))</span> <span class="p">)</span>
<span class="p">);</span>

<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">t_functional_index</span> <span class="p">(</span><span class="k">parameters</span><span class="p">)</span>
<span class="k">WITH</span> <span class="k">RECURSIVE</span> <span class="n">counter</span> <span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="k">AS</span>
<span class="p">(</span>
  <span class="k">SELECT</span> <span class="mi">0</span>
  <span class="k">UNION</span> <span class="k">ALL</span>
  <span class="k">SELECT</span> <span class="n">n</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">FROM</span> <span class="n">counter</span> <span class="k">WHERE</span> <span class="n">n</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">&lt;</span> <span class="mi">100000</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="cm">/*+ SET_VAR(cte_max_recursion_depth = 1M) */</span>
  <span class="n">CONCAT</span><span class="p">(</span><span class="s1">'{"serial": "'</span><span class="p">,</span> <span class="n">HEX</span><span class="p">(</span><span class="n">RANDOM_BYTES</span><span class="p">(</span><span class="mi">2</span><span class="p">)),</span> <span class="s1">'"}'</span><span class="p">)</span> <span class="nv">`parameters`</span>
<span class="k">FROM</span> <span class="n">counter</span><span class="p">;</span>

<span class="k">ANALYZE</span> <span class="k">TABLE</span> <span class="n">t_functional_index</span><span class="p">;</span>
</code></pre></div></div>

<p>The syntax is conceptually the same as generated columns - the function is wrapped by round brackets (the surrounding spaces are cosmetic).</p>

<p>Note that in this case, we must <code class="language-plaintext highlighter-rouge">CAST</code> the extracted value to <code class="language-plaintext highlighter-rouge">CHAR</code>, because we <code class="language-plaintext highlighter-rouge">Cannot create a functional index on an expression that returns a BLOB or TEXT</code>: the return type of the implicit <code class="language-plaintext highlighter-rouge">JSON_UNQUOTE</code> function is <code class="language-plaintext highlighter-rouge">LONGTEXT</code>.<br />
We’re also hitting a limitation of functional indexes - while with normal indexes we could specify an index prefix (thus, converting the <code class="language-plaintext highlighter-rouge">LONGTEXT</code> into a <code class="language-plaintext highlighter-rouge">(VAR)CHAR</code>), this is not possible with functional indexes.</p>
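
<p>To make the contrast concrete, here is a sketch with a hypothetical table (<code class="language-plaintext highlighter-rouge">t_prefix_index</code> and its <code class="language-plaintext highlighter-rouge">notes</code> column are not part of the examples in this article): a regular index accepts a prefix length on a <code class="language-plaintext highlighter-rouge">LONGTEXT</code> column, while a functional key part needs the explicit <code class="language-plaintext highlighter-rouge">CAST</code>:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CREATE TEMPORARY TABLE t_prefix_index
(
  id         INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  notes      LONGTEXT NOT NULL,
  parameters JSON NOT NULL,
  KEY (notes(4)),                                          -- regular index: prefix on the first 4 characters
  KEY ( (CAST(parameters -&gt;&gt; '$.serial' AS CHAR(4))) )     -- functional key part: CAST, no prefix available
);
</code></pre></div></div>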

<p>Now let’s test the index:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">EXPLAIN</span> <span class="n">FORMAT</span><span class="o">=</span><span class="n">TREE</span> <span class="k">SELECT</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">t_functional_index</span> <span class="k">WHERE</span> <span class="k">parameters</span> <span class="o">-&gt;&gt;</span> <span class="s1">'$.serial'</span> <span class="o">=</span> <span class="s1">'CAFE'</span><span class="p">;</span>

<span class="c1">-- -&gt; Aggregate: count(0)</span>
<span class="c1">--     -&gt; Filter: (json_unquote(json_extract(t_functional_index.parameters,'$.serial')) = 'CAFE')  (cost=10384.20 rows=100312)</span>
<span class="c1">--         -&gt; Table scan on t_functional_index  (cost=10384.20 rows=100312)</span>
</code></pre></div></div>

<p>Nuts! A table scan. What happened?</p>

<h2 id="json-functional-index-gotchas">JSON functional index gotchas</h2>

<p>I’ll summarize here a few gotchas with JSON functional indexes. While the expression exactness is obvious, the other two aren’t [so much 😉].</p>

<h3 id="expression-exactness">Expression exactness</h3>

<p>When using functional indexes, the expression in the condition must match the index definition exactly, in order for the index to be used. This is because MySQL needs to evaluate expressions in a general form, and, although some expressions can certainly be transformed (and some actually are, by the optimizer), a sensible design choice is to shift the burden to the developer in some cases, including this one.</p>

<p>Let’s use a condition with the same function as the index definition:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">EXPLAIN</span> <span class="n">FORMAT</span><span class="o">=</span><span class="n">TREE</span> <span class="k">SELECT</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">t_functional_index</span> <span class="k">WHERE</span> <span class="k">CAST</span><span class="p">(</span><span class="k">parameters</span> <span class="o">-&gt;&gt;</span> <span class="s1">'$.serial'</span> <span class="k">AS</span> <span class="nb">CHAR</span><span class="p">(</span><span class="mi">4</span><span class="p">))</span> <span class="o">=</span> <span class="s1">'CAFE'</span><span class="p">;</span>

<span class="c1">-- -&gt; Aggregate: count(0)</span>
<span class="c1">--    -&gt; Index lookup on t_functional_index using functional_index (cast(json_unquote(json_extract(t_functional_index.parameters,_utf8mb4'$.serial')) as char(4) charset utf8mb4)='CAFE')  (cost=1.10 rows=1)</span>
</code></pre></div></div>

<p>Even a minor change will make the optimizer discard the index:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">EXPLAIN</span> <span class="n">FORMAT</span><span class="o">=</span><span class="n">TREE</span> <span class="k">SELECT</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">t_functional_index</span> <span class="k">WHERE</span> <span class="k">CAST</span><span class="p">(</span><span class="k">parameters</span> <span class="o">-&gt;&gt;</span> <span class="s1">'$.serial'</span> <span class="k">AS</span> <span class="nb">CHAR</span><span class="p">(</span><span class="mi">5</span><span class="p">))</span> <span class="o">=</span> <span class="s1">'CAFE'</span><span class="p">;</span>

<span class="c1">-- -&gt; Aggregate: count(0)</span>
<span class="c1">--     -&gt; Filter: (cast(json_unquote(json_extract(t_functional_index.parameters,'$.serial')) as char(5) charset utf8mb4) = 'CAFE')  (cost=10384.20 rows=100312)</span>
<span class="c1">--         -&gt; Table scan on t_functional_index  (cost=10384.20 rows=100312)</span>
</code></pre></div></div>

<h3 id="inconsistent-behavior-between-generated-columns-with-index-and-functional-indexes">Inconsistent behavior between generated columns with index, and functional indexes</h3>

<p>Interestingly, if we use an indexed generated column in place of the functional index, the index <em>will</em> be used:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">EXPLAIN</span> <span class="n">FORMAT</span><span class="o">=</span><span class="n">TREE</span> <span class="k">SELECT</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">t_indexed_generated_column</span> <span class="k">WHERE</span> <span class="k">parameters</span> <span class="o">-&gt;&gt;</span> <span class="s1">'$.serial'</span> <span class="o">=</span> <span class="s1">'CAFE'</span><span class="p">;</span>

<span class="c1">-- -&gt; Aggregate: count(0)</span>
<span class="c1">--     -&gt; Index lookup on t_indexed_generated_column using parameter_serial (parameter_serial='CAFE')  (cost=1.10 rows=1)</span>
</code></pre></div></div>
<p>There is an inconsistency between a functional index and its equivalent, a generated column with an index.</p>

<p>Let’s review the table definitions:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TEMPORARY</span> <span class="k">TABLE</span> <span class="n">t_indexed_generated_column</span>
<span class="p">(</span>
  <span class="n">id</span>                 <span class="nb">INT</span> <span class="k">NOT</span> <span class="k">NULL</span> <span class="n">AUTO_INCREMENT</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
  <span class="k">parameters</span>         <span class="n">JSON</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
  <span class="n">parameter_serial</span>   <span class="nb">CHAR</span><span class="p">(</span><span class="mi">4</span><span class="p">)</span> <span class="k">AS</span> <span class="p">(</span><span class="k">parameters</span> <span class="o">-&gt;&gt;</span> <span class="s1">'$.serial'</span><span class="p">),</span>
  <span class="k">KEY</span> <span class="p">(</span><span class="n">parameter_serial</span><span class="p">)</span>
<span class="p">);</span>

<span class="k">CREATE</span> <span class="k">TEMPORARY</span> <span class="k">TABLE</span> <span class="n">t_functional_index</span>
<span class="p">(</span>
  <span class="n">id</span>         <span class="nb">INT</span> <span class="k">NOT</span> <span class="k">NULL</span> <span class="n">AUTO_INCREMENT</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
  <span class="k">parameters</span> <span class="n">JSON</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
  <span class="k">KEY</span> <span class="p">(</span> <span class="p">(</span><span class="k">CAST</span><span class="p">(</span><span class="k">parameters</span> <span class="o">-&gt;&gt;</span> <span class="s1">'$.serial'</span> <span class="k">AS</span> <span class="nb">CHAR</span><span class="p">(</span><span class="mi">4</span><span class="p">)))</span> <span class="p">)</span>
<span class="p">);</span>
</code></pre></div></div>

<p>There is no obvious reason for the optimizer not to use the functional index; the feature would definitely benefit from this improvement, in order for functional indexes to be a solid choice.</p>

<h3 id="encoding-inconsistency-based-on-the-index-usage">Encoding inconsistency based on the index usage</h3>

<p>The combination of the <code class="language-plaintext highlighter-rouge">CAST</code> and <code class="language-plaintext highlighter-rouge">JSON_UNQUOTE</code> required in the context of functional indexes/generated columns also has another unintended effect: different results, based on the collation chosen by the query structure.</p>

<p>Let’s create a table with a generated column and an index:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TEMPORARY</span> <span class="k">TABLE</span> <span class="n">t_encoding_test</span>
<span class="p">(</span>
  <span class="n">id</span>                <span class="nb">INT</span> <span class="k">NOT</span> <span class="k">NULL</span> <span class="n">AUTO_INCREMENT</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
  <span class="k">parameters</span>        <span class="n">JSON</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
  <span class="n">parameters_serial</span> <span class="nb">CHAR</span><span class="p">(</span><span class="mi">4</span><span class="p">)</span> <span class="k">AS</span> <span class="p">(</span><span class="k">CAST</span><span class="p">(</span><span class="k">parameters</span> <span class="o">-&gt;&gt;</span> <span class="s1">'$.serial'</span> <span class="k">AS</span> <span class="nb">CHAR</span><span class="p">(</span><span class="mi">4</span><span class="p">))),</span>
  <span class="k">KEY</span> <span class="p">(</span><span class="n">parameters_serial</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="s1">'{"serial": "CAFE"}'</span> <span class="nv">`parameters`</span><span class="p">;</span>
</code></pre></div></div>

<p>If a query uses the index indirectly (here we query on <code class="language-plaintext highlighter-rouge">parameters</code>, but the optimizer automatically uses the index on <code class="language-plaintext highlighter-rouge">parameters_serial</code>), we get a case insensitive search:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">t_encoding_test</span> <span class="k">WHERE</span> <span class="k">parameters</span> <span class="o">-&gt;&gt;</span> <span class="s1">'$.serial'</span> <span class="o">=</span> <span class="s1">'CAFe'</span><span class="p">;</span>

<span class="c1">-- +----------+</span>
<span class="c1">-- | COUNT(*) |</span>
<span class="c1">-- +----------+</span>
<span class="c1">-- |        1 |</span>
<span class="c1">-- +----------+</span>
</code></pre></div></div>

<p>This happens because the <code class="language-plaintext highlighter-rouge">CAST</code> function used to build the index is associated with the system collation, which is case insensitive (by default, <code class="language-plaintext highlighter-rouge">utf8mb4_0900_ai_ci</code>).</p>

<p>However, if the index is not used:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">t_encoding_test</span> <span class="n">USE</span> <span class="k">INDEX</span> <span class="p">()</span> <span class="k">WHERE</span> <span class="k">parameters</span> <span class="o">-&gt;&gt;</span> <span class="s1">'$.serial'</span> <span class="o">=</span> <span class="s1">'CAFe'</span><span class="p">;</span>

<span class="c1">-- +----------+</span>
<span class="c1">-- | COUNT(*) |</span>
<span class="c1">-- +----------+</span>
<span class="c1">-- |        0 |</span>
<span class="c1">-- +----------+</span>
</code></pre></div></div>

<p>the record is not matched! This is because the <code class="language-plaintext highlighter-rouge">-&gt;&gt;</code> operator uses <code class="language-plaintext highlighter-rouge">JSON_UNQUOTE</code>, whose hardcoded collation is <code class="language-plaintext highlighter-rouge">utf8mb4_bin</code>, which is case sensitive.</p>

<p>For more details, see the MySQL <a href="https://dev.mysql.com/doc/refman/8.0/en/create-index.html#create-index-functional-key-parts">reference manual</a> or even the <a href="https://dev.mysql.com/worklog/task/?id=1075#Usage_of_CAST_in_functional_index">worklog</a>.</p>
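
<p>One way to make the two access paths agree, following the approach described in the manual section linked above, is to declare the binary collation explicitly in the index expression. This is a sketch only: the table name <code class="language-plaintext highlighter-rouge">t_collation_test</code> is made up, and the resulting plan should be verified with <code class="language-plaintext highlighter-rouge">EXPLAIN</code> on the target MySQL version:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CREATE TEMPORARY TABLE t_collation_test
(
  id         INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  parameters JSON NOT NULL,
  -- COLLATE utf8mb4_bin matches the collation of the string returned by JSON_UNQUOTE().
  KEY ( (CAST(parameters -&gt;&gt; '$.serial' AS CHAR(4)) COLLATE utf8mb4_bin) )
);

INSERT INTO t_collation_test (parameters) VALUES ('{"serial": "CAFE"}');

-- The comparison is meant to be case sensitive on both access paths.
SELECT COUNT(*) FROM t_collation_test WHERE parameters -&gt;&gt; '$.serial' = 'CAFe';
</code></pre></div></div>

<p>With the binary collation on both sides, the expectation is that <code class="language-plaintext highlighter-rouge">'CAFe'</code> does not match the stored <code class="language-plaintext highlighter-rouge">'CAFE'</code>, whether or not the index is used.</p>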

<h2 id="an-example-of-functional-index-with-dates">An example of functional index with dates</h2>

<p>Let’s take another example, and test the index:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TEMPORARY</span> <span class="k">TABLE</span> <span class="n">date_functional_index</span>
<span class="p">(</span>
  <span class="n">id</span>         <span class="nb">INT</span> <span class="k">NOT</span> <span class="k">NULL</span> <span class="n">AUTO_INCREMENT</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
  <span class="n">created_at</span> <span class="nb">DATETIME</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
  <span class="k">INDEX</span> <span class="p">(</span> <span class="p">(</span><span class="nb">DATE</span><span class="p">(</span><span class="n">created_at</span><span class="p">))</span> <span class="p">)</span>
<span class="p">);</span>

<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">date_functional_index</span> <span class="p">(</span><span class="n">created_at</span><span class="p">)</span>
<span class="k">WITH</span> <span class="k">RECURSIVE</span> <span class="n">sequence</span> <span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="k">AS</span>
<span class="p">(</span>
  <span class="k">SELECT</span> <span class="mi">0</span>
  <span class="k">UNION</span> <span class="k">ALL</span>
  <span class="k">SELECT</span> <span class="n">n</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">FROM</span> <span class="n">sequence</span> <span class="k">WHERE</span> <span class="n">n</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">&lt;</span> <span class="mi">100000</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="cm">/*+ SET_VAR(cte_max_recursion_depth = 100K) */</span>
  <span class="n">NOW</span><span class="p">()</span> <span class="o">-</span> <span class="n">INTERVAL</span> <span class="p">(</span><span class="mi">90</span> <span class="o">*</span> <span class="n">RAND</span><span class="p">())</span> <span class="k">DAY</span> <span class="nv">`created_at`</span>
<span class="k">FROM</span> <span class="n">sequence</span><span class="p">;</span>

<span class="k">ANALYZE</span> <span class="k">TABLE</span> <span class="n">date_functional_index</span><span class="p">;</span>
</code></pre></div></div>

<p>(There are two issues in relation to this test; the details are given below)</p>

<p>Let’s test the index access:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">EXPLAIN</span> <span class="n">FORMAT</span><span class="o">=</span><span class="n">TREE</span> <span class="k">SELECT</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">date_functional_index</span> <span class="k">WHERE</span> <span class="nb">DATE</span><span class="p">(</span><span class="n">created_at</span><span class="p">)</span> <span class="o">=</span> <span class="n">CURDATE</span><span class="p">();</span>

<span class="c1">-- -&gt; Aggregate: count(0)</span>
<span class="c1">--     -&gt; Index lookup on date_functional_index using functional_index (cast(date_functional_index.created_at as date)=curdate())  (cost=668.80 rows=608)</span>
</code></pre></div></div>

<p>Works as expected; with this data type, we don’t need to deal with BLOBs and/or collations.</p>

<h3 id="gotcha-joins-dont-use-functional-key-parts">Gotcha: JOINs don’t use functional key parts</h3>

<p>How about joins?</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">EXPLAIN</span> <span class="n">FORMAT</span><span class="o">=</span><span class="n">TREE</span>
<span class="k">WITH</span> <span class="k">RECURSIVE</span> <span class="n">dates_range</span> <span class="p">(</span><span class="n">d</span><span class="p">)</span> <span class="k">AS</span>
<span class="p">(</span>
  <span class="k">SELECT</span> <span class="n">CURDATE</span><span class="p">()</span> <span class="o">-</span> <span class="n">INTERVAL</span> <span class="mi">90</span> <span class="k">DAY</span>
  <span class="k">UNION</span> <span class="k">ALL</span>
  <span class="k">SELECT</span> <span class="n">d</span> <span class="o">+</span> <span class="n">INTERVAL</span> <span class="mi">1</span> <span class="k">DAY</span> <span class="k">FROM</span> <span class="n">dates_range</span> <span class="k">WHERE</span> <span class="n">d</span> <span class="o">+</span> <span class="n">INTERVAL</span> <span class="mi">1</span> <span class="k">day</span> <span class="o">&lt;=</span> <span class="n">CURDATE</span><span class="p">()</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="n">d</span><span class="p">,</span> <span class="k">COUNT</span><span class="p">(</span><span class="n">id</span><span class="p">)</span>
<span class="k">FROM</span>
  <span class="n">dates_range</span>
  <span class="k">LEFT</span> <span class="k">JOIN</span> <span class="n">date_functional_index</span> <span class="k">ON</span> <span class="n">d</span> <span class="o">=</span> <span class="nb">DATE</span><span class="p">(</span><span class="n">created_at</span><span class="p">)</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="n">d</span><span class="p">;</span>

<span class="c1">-- -&gt; Table scan on &lt;temporary&gt;</span>
<span class="c1">--     -&gt; Aggregate using temporary table</span>
<span class="c1">--         -&gt; Nested loop left join</span>
<span class="c1">--             -&gt; Table scan on dates_range</span>
<span class="c1">--                 -&gt; [...]</span>
<span class="c1">--             -&gt; Filter: (dates_range.d = cast(date_functional_index.created_at as date))  (cost=3429.97 rows=100649)</span>
<span class="c1">--                 -&gt; Table scan on date_functional_index  (cost=3429.97 rows=100649)</span>
</code></pre></div></div>

<p>Ouch! The index is not used; this is definitely something that needs to be considered.</p>

<p>Indexes on generated columns exhibit the same behavior; however, we can perform the join against the generated column, whose index is then used by the optimizer:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TEMPORARY</span> <span class="k">TABLE</span> <span class="n">date_generated_column_functional_index</span>
<span class="p">(</span>
  <span class="n">id</span>              <span class="nb">INT</span> <span class="k">NOT</span> <span class="k">NULL</span> <span class="n">AUTO_INCREMENT</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
  <span class="n">created_at</span>      <span class="nb">DATETIME</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
  <span class="n">created_at_date</span> <span class="nb">DATE</span> <span class="k">AS</span> <span class="p">(</span><span class="nb">DATE</span><span class="p">(</span><span class="n">created_at</span><span class="p">)),</span>
  <span class="k">INDEX</span> <span class="p">(</span><span class="n">created_at_date</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">WITH</span> <span class="k">RECURSIVE</span> <span class="n">sequence</span> <span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="k">AS</span>
<span class="p">(</span>
  <span class="k">SELECT</span> <span class="mi">0</span>
  <span class="k">UNION</span> <span class="k">ALL</span>
  <span class="k">SELECT</span> <span class="n">n</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">FROM</span> <span class="n">sequence</span> <span class="k">WHERE</span> <span class="n">n</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">&lt;</span> <span class="mi">100000</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="cm">/*+ SET_VAR(cte_max_recursion_depth = 100K) */</span>
  <span class="n">NOW</span><span class="p">()</span> <span class="o">-</span> <span class="n">INTERVAL</span> <span class="p">(</span><span class="mi">90</span> <span class="o">*</span> <span class="n">RAND</span><span class="p">())</span> <span class="k">DAY</span> <span class="nv">`created_at`</span>
<span class="k">FROM</span> <span class="n">sequence</span><span class="p">;</span>

<span class="k">ANALYZE</span> <span class="k">TABLE</span> <span class="n">date_generated_column_functional_index</span><span class="p">;</span>

<span class="k">EXPLAIN</span> <span class="n">FORMAT</span><span class="o">=</span><span class="n">TREE</span>
<span class="k">WITH</span> <span class="k">RECURSIVE</span> <span class="n">dates_range</span> <span class="p">(</span><span class="n">d</span><span class="p">)</span> <span class="k">AS</span>
<span class="p">(</span>
  <span class="k">SELECT</span> <span class="n">CURDATE</span><span class="p">()</span> <span class="o">-</span> <span class="n">INTERVAL</span> <span class="mi">90</span> <span class="k">DAY</span>
  <span class="k">UNION</span> <span class="k">ALL</span>
  <span class="k">SELECT</span> <span class="n">d</span> <span class="o">+</span> <span class="n">INTERVAL</span> <span class="mi">1</span> <span class="k">DAY</span> <span class="k">FROM</span> <span class="n">dates_range</span> <span class="k">WHERE</span> <span class="n">d</span> <span class="o">+</span> <span class="n">INTERVAL</span> <span class="mi">1</span> <span class="k">day</span> <span class="o">&lt;=</span> <span class="n">CURDATE</span><span class="p">()</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="n">d</span><span class="p">,</span> <span class="k">COUNT</span><span class="p">(</span><span class="n">id</span><span class="p">)</span>
<span class="k">FROM</span>
  <span class="n">dates_range</span>
  <span class="k">LEFT</span> <span class="k">JOIN</span> <span class="n">date_generated_column_functional_index</span> <span class="k">ON</span> <span class="n">d</span> <span class="o">=</span> <span class="n">created_at_date</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="n">d</span><span class="p">;</span>

<span class="c1">-- -&gt; Table scan on &lt;temporary&gt;</span>
<span class="c1">--     -&gt; Aggregate using temporary table</span>
<span class="c1">--         -&gt; Nested loop left join</span>
<span class="c1">--             -&gt; Table scan on dates_range</span>
<span class="c1">--                 -&gt; [...]</span>
<span class="c1">--             -&gt; Index lookup on date_generated_column_functional_index using created_at_date (created_at_date=dates_range.d)  (cost=36.18 rows=1026)</span>
</code></pre></div></div>

<p>Therefore, it’s not possible to use functional key parts with JOINs at all, while it’s possible with indexed generated columns. This makes functional key parts less appealing, when considering the overall design.</p>

<p>I’ve filed this as a <a href="https://bugs.mysql.com/bug.php?id=98937">feature request</a>.</p>

<h2 id="bugs">Bugs</h2>

<h3 id="bug-on-create-table--select">Bug on <code class="language-plaintext highlighter-rouge">CREATE TABLE ... SELECT</code></h3>

<p>In some of the previous queries I’ve used <code class="language-plaintext highlighter-rouge">CREATE TABLE</code> + <code class="language-plaintext highlighter-rouge">INSERT</code> instead of <code class="language-plaintext highlighter-rouge">CREATE TABLE ... SELECT</code>. Why?</p>

<p>Because of a bug:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TEMPORARY</span> <span class="k">TABLE</span> <span class="n">bug_functional_index</span> <span class="p">(</span>
  <span class="n">sold_on</span> <span class="nb">DATETIME</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
  <span class="k">INDEX</span> <span class="n">sold_on_date</span> <span class="p">((</span><span class="nb">DATE</span><span class="p">(</span><span class="n">sold_on</span><span class="p">)))</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="n">NOW</span><span class="p">()</span> <span class="nv">`sold_on`</span><span class="p">;</span>

<span class="c1">-- ERROR 3105 (HY000): The value specified for generated column '3351ae78dcbae4f473d53aebdc350681' in table 'bug_functional_index' is not allowed.</span>
</code></pre></div></div>

<p>The above should work, considering that the split form works fine:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TEMPORARY</span> <span class="k">TABLE</span> <span class="n">bug_functional_index</span> <span class="p">(</span>
  <span class="n">sold_on</span> <span class="nb">DATETIME</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
  <span class="k">INDEX</span> <span class="n">sold_on_date</span> <span class="p">((</span><span class="nb">DATE</span><span class="p">(</span><span class="n">sold_on</span><span class="p">)))</span>
<span class="p">);</span>

<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">bug_functional_index</span> <span class="k">VALUES</span> <span class="p">(</span><span class="n">NOW</span><span class="p">());</span>

<span class="c1">-- Query OK, 1 row affected (0,00 sec)</span>
</code></pre></div></div>

<p>I’ve <a href="https://bugs.mysql.com/bug.php?id=98896">reported this</a> to the MySQL bug tracker.</p>

<h3 id="bug-on-load-data-infile">Bug on <code class="language-plaintext highlighter-rouge">LOAD DATA INFILE</code></h3>

<p>There is an additional bug: <code class="language-plaintext highlighter-rouge">LOAD DATA INFILE</code> statements fail if the columns are not explicitly specified:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">echo</span> <span class="s1">'[]'</span> <span class="o">&gt;</span> /tmp/test_data.csv

mysql <span class="o">&lt;&lt;</span><span class="sh">'</span><span class="no">SQL</span><span class="sh">'
  CREATE SCHEMA IF NOT EXISTS tmp;

  CREATE TEMPORARY TABLE tmp.issue_load_data_on_functional_index
  (
    json_col JSON,
    KEY json_col ( (CAST(json_col -&gt; '</span><span class="nv">$'</span><span class="sh"> AS UNSIGNED ARRAY)) )
  );

  LOAD DATA INFILE '/tmp/test_data.csv' INTO TABLE tmp.issue_load_data_on_functional_index;
</span><span class="no">SQL

</span><span class="c"># ERROR 1261 (01000) at line 9: Row 1 doesn't contain data for all columns</span>
</code></pre></div></div>

<p>The workaround is to explicitly specify the columns:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">LOAD</span> <span class="k">DATA</span> <span class="n">INFILE</span> <span class="s1">'/tmp/test_data.csv'</span> <span class="k">INTO</span> <span class="k">TABLE</span> <span class="n">tmp</span><span class="p">.</span><span class="n">issue_load_data_on_functional_index</span> <span class="p">(</span><span class="n">json_col</span><span class="p">);</span>
</code></pre></div></div>

<p>I’ve <a href="https://bugs.mysql.com/bug.php?id=98925">reported this bug</a> as well.</p>

<h2 id="conclusion">Conclusion</h2>

<p>I’m not bought into functional key parts.</p>

<p>While I find functional indexes an important functionality of solid, modern, RDBMSs, I think that the functional key parts feature itself needs some time to mature, especially considering that indexed generated columns can do the same work (with some exceptions, e.g. multi-valued indexing).</p>

<p>Now moving on to another new 8.0 interesting feature (window functions!) 😄</p>]]></content><author><name></name></author><category term="mysql" /><category term="databases" /><category term="indexes" /><category term="mysql" /><summary type="html"><![CDATA[Another interesting feature released with MySQL 8.0 is full support for functional indexes. Although this is not a strictly new concept in the MySQL world (indexed generated columns provided the same functionality), I find it worth reviewing, through some applications, notes and considerations. All in all, I’m not 100% bought into functional indexes (as opposed to indexed generated columns); I’ll elaborate on this over the course of the article. As a natural fit, generated columns are included in the article; additionally, some constructs build on my previous article, in relation to the subject of CTEs. Updated on 12/Mar/2020: Found another bug. Contents: Terminology Generated columns, and their application on JSON data Functional indexes JSON functional index gotchas Expression exactness Inconsistent behavior between generated columns with index, and functional indexes Encoding inconsistency based on the index usage An example of functional index with dates Gotcha: JOINs don’t use functional key parts Bugs Bug on CREATE TABLE ... 
SELECT Bug on LOAD DATA INFILE Conclusion]]></summary></entry><entry><title type="html">Generating sequences/ranges, via MySQL 8.0’s Common Table Expressions (CTEs)</title><link href="https://saveriomiroddi.github.io/Generating-sequences-ranges-via-mysql-8.0-ctes/" rel="alternate" type="text/html" title="Generating sequences/ranges, via MySQL 8.0’s Common Table Expressions (CTEs)" /><published>2020-03-09T00:00:00+00:00</published><updated>2020-03-09T00:00:00+00:00</updated><id>https://saveriomiroddi.github.io/Generating-sequences-ranges-via-mysql-8.0-ctes</id><content type="html" xml:base="https://saveriomiroddi.github.io/Generating-sequences-ranges-via-mysql-8.0-ctes/"><![CDATA[<p>A long-time missing (and missed) functionality in MySQL, is sequences/ranges.</p>

<p>As of MySQL 8.0, this functionality is still not supported in a general sense, however, it’s now possible to generate a sequence to be used within a single query.</p>

<p>In this article, I’ll give a brief introduction to CTEs, and explain how to build different sequence generators; additionally, I’ll introduce the new (cool) MySQL 8.0 query hint <code class="language-plaintext highlighter-rouge">SET_VAR</code>, and a pinch of virtual columns and functional indexes (“functional key parts”, another MySQL 8.0 feature).</p>

<p>Contents:</p>

<ul>
  <li><a href="/Generating-sequences-ranges-via-mysql-8.0-ctes#a-brief-introduction-to-common-table-expressions-ctes">A brief introduction to Common Table Expressions (CTEs)</a></li>
  <li><a href="/Generating-sequences-ranges-via-mysql-8.0-ctes#recursive-ctes-and-generating-a-linear-sequence-of-integers">Recursive CTEs, and generating a linear sequence of integers</a>
    <ul>
      <li><a href="/Generating-sequences-ranges-via-mysql-8.0-ctes#per-statement-variables-setting">Per-statement variables setting</a></li>
    </ul>
  </li>
  <li><a href="/Generating-sequences-ranges-via-mysql-8.0-ctes#generating-a-sequence-of-random-integers">Generating a sequence of random integers</a></li>
  <li><a href="/Generating-sequences-ranges-via-mysql-8.0-ctes#generating-a-characters-interval">Generating a characters interval</a></li>
  <li><a href="/Generating-sequences-ranges-via-mysql-8.0-ctes#generating-a-dates-interval">Generating a dates interval</a></li>
  <li><a href="/Generating-sequences-ranges-via-mysql-8.0-ctes#conclusion">Conclusion</a></li>
  <li><a href="/Generating-sequences-ranges-via-mysql-8.0-ctes#footnotes">Footnotes</a></li>
</ul>

<h2 id="a-brief-introduction-to-common-table-expressions-ctes">A brief introduction to Common Table Expressions (CTEs)</h2>

<p>Roughly, Common Table Expressions (<code class="language-plaintext highlighter-rouge">CTE</code>s) can be thought of as ephemeral views or temporary tables.</p>

<p>CTEs bring very significant advantages, one of the most important being recursion, which, barring hacks, wasn’t supported before.</p>

<p>The simplest syntax is:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WITH</span> <span class="o">&lt;</span><span class="n">cte_name</span><span class="o">&gt;</span> <span class="p">(</span><span class="o">&lt;</span><span class="n">columns</span><span class="o">&gt;</span><span class="p">)</span> <span class="k">AS</span>
<span class="p">(</span>
  <span class="o">&lt;</span><span class="n">cte_query</span><span class="o">&gt;</span>
<span class="p">)</span>
<span class="o">&lt;</span><span class="n">main_query</span><span class="o">&gt;</span>
</code></pre></div></div>

<p>for example<a href="#footnote01">¹</a>:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">line_items</span><span class="p">(</span>
  <span class="n">item_number</span> <span class="nb">INT</span> <span class="nb">UNSIGNED</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
  <span class="n">item_total</span>  <span class="nb">DECIMAL</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span><span class="mi">2</span><span class="p">)</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
  <span class="n">order_number</span> <span class="nb">INT</span> <span class="nb">UNSIGNED</span> <span class="k">NOT</span> <span class="k">NULL</span>
<span class="p">);</span>

<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">line_items</span> <span class="k">VALUES</span>
  <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
  <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
  <span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="p">;</span>

<span class="k">WITH</span> <span class="n">order_totals</span><span class="p">(</span><span class="n">order_number</span><span class="p">,</span> <span class="n">order_total</span><span class="p">)</span> <span class="k">AS</span>
<span class="p">(</span>
  <span class="k">SELECT</span> <span class="n">order_number</span><span class="p">,</span> <span class="k">SUM</span><span class="p">(</span><span class="n">item_total</span><span class="p">)</span> <span class="nv">`order_total`</span>
  <span class="k">FROM</span> <span class="n">line_items</span>
  <span class="k">GROUP</span> <span class="k">BY</span> <span class="n">order_number</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="n">item_number</span><span class="p">,</span> <span class="n">item_total</span><span class="p">,</span> <span class="n">order_number</span><span class="p">,</span> <span class="n">order_total</span>
<span class="k">FROM</span> <span class="n">line_items</span>
     <span class="k">JOIN</span> <span class="n">order_totals</span> <span class="k">USING</span> <span class="p">(</span><span class="n">order_number</span><span class="p">)</span>
<span class="p">;</span>

<span class="c1">-- +-------------+------------+--------------+-------------+</span>
<span class="c1">-- | item_number | item_total | order_number | order_total |</span>
<span class="c1">-- +-------------+------------+--------------+-------------+</span>
<span class="c1">-- |           1 |      10.00 |            1 |       20.00 |</span>
<span class="c1">-- |           2 |      10.00 |            1 |       20.00 |</span>
<span class="c1">-- |           3 |      15.00 |            2 |       15.00 |</span>
<span class="c1">-- +-------------+------------+--------------+-------------+</span>
</code></pre></div></div>

<p>The syntax is intuitive; in this example, it’s used very much like a temporary table, with the advantage that no cleanup (<code class="language-plaintext highlighter-rouge">DROP TEMPORARY TABLE</code>) is needed.</p>
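
<p>For comparison, here is a minimal sketch of the same report written with an explicit temporary table. It assumes the <code class="language-plaintext highlighter-rouge">line_items</code> table created above, and reuses the name <code class="language-plaintext highlighter-rouge">order_totals</code> for the temporary table purely for illustration:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TEMPORARY</span> <span class="k">TABLE</span> <span class="n">order_totals</span>
<span class="k">SELECT</span> <span class="n">order_number</span><span class="p">,</span> <span class="k">SUM</span><span class="p">(</span><span class="n">item_total</span><span class="p">)</span> <span class="nv">`order_total`</span>
<span class="k">FROM</span> <span class="n">line_items</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="n">order_number</span><span class="p">;</span>

<span class="k">SELECT</span> <span class="n">item_number</span><span class="p">,</span> <span class="n">item_total</span><span class="p">,</span> <span class="n">order_number</span><span class="p">,</span> <span class="n">order_total</span>
<span class="k">FROM</span> <span class="n">line_items</span>
     <span class="k">JOIN</span> <span class="n">order_totals</span> <span class="k">USING</span> <span class="p">(</span><span class="n">order_number</span><span class="p">);</span>

<span class="c1">-- The cleanup step that the CTE form makes unnecessary.</span>
<span class="k">DROP</span> <span class="k">TEMPORARY</span> <span class="k">TABLE</span> <span class="n">order_totals</span><span class="p">;</span>
</code></pre></div></div>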

<h2 id="recursive-ctes-and-generating-a-linear-sequence-of-integers">Recursive CTEs, and generating a linear sequence of integers</h2>

<p>If one has to create a table filled with integers, say, as an example for a blog post 😉, the common approach is to use extended <code class="language-plaintext highlighter-rouge">INSERT</code>s (the form that stores multiple rows in one statement).</p>
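
<p>As a small sketch of that approach (the table name <code class="language-plaintext highlighter-rouge">int_sequence_manual</code> is hypothetical), the integers from 0 to 10 could be stored like this:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">int_sequence_manual</span> <span class="p">(</span><span class="n">n</span> <span class="nb">INT</span> <span class="nb">UNSIGNED</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">);</span>

<span class="c1">-- One statement, one tuple per row; every value is typed by hand.</span>
<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">int_sequence_manual</span> <span class="k">VALUES</span>
  <span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="p">(</span><span class="mi">2</span><span class="p">),</span> <span class="p">(</span><span class="mi">3</span><span class="p">),</span> <span class="p">(</span><span class="mi">4</span><span class="p">),</span> <span class="p">(</span><span class="mi">5</span><span class="p">),</span> <span class="p">(</span><span class="mi">6</span><span class="p">),</span> <span class="p">(</span><span class="mi">7</span><span class="p">),</span> <span class="p">(</span><span class="mi">8</span><span class="p">),</span> <span class="p">(</span><span class="mi">9</span><span class="p">),</span> <span class="p">(</span><span class="mi">10</span><span class="p">);</span>
</code></pre></div></div>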

<p>We can accomplish this more elegantly with a CTE, specifically, with a recursive one.</p>

<p>The syntax of recursive CTEs is:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WITH</span> <span class="k">RECURSIVE</span> <span class="o">&lt;</span><span class="n">cte_name</span><span class="o">&gt;</span> <span class="p">(</span><span class="o">&lt;</span><span class="n">columns</span><span class="o">&gt;</span><span class="p">)</span> <span class="k">AS</span>
<span class="p">(</span>
  <span class="o">&lt;</span><span class="n">base_case_query</span><span class="o">&gt;</span>
  <span class="k">UNION</span> <span class="k">ALL</span>
  <span class="o">&lt;</span><span class="n">recursive_step_query</span><span class="o">&gt;</span> <span class="c1">-- invoke the CTE here!</span>
<span class="p">)</span>
<span class="o">&lt;</span><span class="n">main_query</span><span class="o">&gt;</span>
</code></pre></div></div>

<p>The concept we apply here is to simulate iteration via recursion (more on this later).</p>

<p>Straight to the generator!:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Create a table with the integers in the range [0, 10].</span>
<span class="c1">--</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">int_sequence</span>
<span class="k">WITH</span> <span class="k">RECURSIVE</span> <span class="n">sequence</span> <span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="k">AS</span>
<span class="p">(</span>
  <span class="k">SELECT</span> <span class="mi">0</span>
  <span class="k">UNION</span> <span class="k">ALL</span>
  <span class="k">SELECT</span> <span class="n">n</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">FROM</span> <span class="n">sequence</span> <span class="k">WHERE</span> <span class="n">n</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">&lt;=</span> <span class="mi">10</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="n">n</span>
<span class="k">FROM</span> <span class="n">sequence</span><span class="p">;</span>
</code></pre></div></div>

<p>The table creation syntax may look slightly odd - one may expect <code class="language-plaintext highlighter-rouge">CREATE TABLE</code> to be below the <code class="language-plaintext highlighter-rouge">WITH</code> clause - but the way it works is straightforward.</p>

<p>When the <code class="language-plaintext highlighter-rouge">SELECT</code> invokes the CTE:</p>

<ul>
  <li>the first row returned is the base case (<code class="language-plaintext highlighter-rouge">SELECT 0</code>);</li>
  <li>from the second onward, one row for each recursive step is returned.</li>
</ul>

<p>This is, all in all, simple. However, something important to pay attention to is the termination condition: <code class="language-plaintext highlighter-rouge">WHERE n + 1 &lt;= 10</code>. Why not use <code class="language-plaintext highlighter-rouge">WHERE n &lt;= ...</code>?</p>

<p>Because this is a part where it’s easy to make a fencepost error. Let’s see the wrong case:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Attempt to select the integers in the range [0, 10], the wrong way.</span>
<span class="c1">--</span>
<span class="k">WITH</span> <span class="k">RECURSIVE</span> <span class="n">sequence</span> <span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="k">AS</span>
<span class="p">(</span>
  <span class="k">SELECT</span> <span class="mi">0</span>
  <span class="k">UNION</span> <span class="k">ALL</span>
  <span class="k">SELECT</span> <span class="n">n</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">FROM</span> <span class="n">sequence</span> <span class="k">WHERE</span> <span class="n">n</span> <span class="o">&lt;=</span> <span class="mi">10</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="n">n</span>
<span class="k">FROM</span> <span class="n">sequence</span><span class="p">;</span>
</code></pre></div></div>

<p>What happens here is that one confuses the <em>returned row</em> with the <em>last verified condition</em>. In the last two steps:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">n = 10</code>;</li>
  <li>the condition is verified;</li>
  <li><code class="language-plaintext highlighter-rouge">SELECT n + 1</code> is executed, returning <code class="language-plaintext highlighter-rouge">11</code>;</li>
  <li><code class="language-plaintext highlighter-rouge">n = 11</code>;</li>
  <li>the condition is <em>not</em> verified;</li>
  <li>recursion terminates.</li>
</ul>
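
<p>A quick way to see the off-by-one, sketched here with aggregates instead of the full row list (the column aliases are only illustrative), is to summarize what the wrong query produces:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WITH</span> <span class="k">RECURSIVE</span> <span class="n">sequence</span> <span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="k">AS</span>
<span class="p">(</span>
  <span class="k">SELECT</span> <span class="mi">0</span>
  <span class="k">UNION</span> <span class="k">ALL</span>
  <span class="k">SELECT</span> <span class="n">n</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">FROM</span> <span class="n">sequence</span> <span class="k">WHERE</span> <span class="n">n</span> <span class="o">&lt;=</span> <span class="mi">10</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="k">MIN</span><span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="nv">`first_n`</span><span class="p">,</span> <span class="k">MAX</span><span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="nv">`last_n`</span><span class="p">,</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="nv">`rows_count`</span>
<span class="k">FROM</span> <span class="n">sequence</span><span class="p">;</span>

<span class="c1">-- last_n is 11, and rows_count is 12: one more than the 11 values in [0, 10].</span>
</code></pre></div></div>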

<p>Now, two alternatives are the conditions <code class="language-plaintext highlighter-rouge">WHERE n &lt;= 9</code> or <code class="language-plaintext highlighter-rouge">WHERE n &lt; 10</code>; while they are correct, they may be less intuitive than <code class="language-plaintext highlighter-rouge">WHERE n + 1 &lt;= 10</code>, which mimics the <code class="language-plaintext highlighter-rouge">SELECT</code>ed expression.</p>

<p>I’ll conclude with two final notes.</p>

<p>First, we’re using recursion as a way of performing iteration; this is subject to the same criticism as teaching recursion via the Fibonacci series: it can arguably be considered an overengineered/underperforming solution to a problem.</p>

<p>I don’t take any position in this case, however, my personal order of increasing elegance for filling a table with a series of numbers is:</p>

<ol>
  <li>using an extended <code class="language-plaintext highlighter-rouge">INSERT</code>,</li>
  <li>using a recursive CTE,</li>
  <li>using a sequence generator.</li>
</ol>

<p>Since MySQL doesn’t provide 3., I’m happy to use 2. 😬.</p>

<p>The second note is more interesting, and I’ll highlight it with a dedicated section.</p>

<h3 id="per-statement-variables-setting">Per-statement variables setting</h3>

<p>By default, MySQL limits the number of recursions to 1000, via the <a href="https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_cte_max_recursion_depth"><code class="language-plaintext highlighter-rouge">cte_max_recursion_depth</code> sysvar</a>.</p>

<p>Now, if we want to generate a long sequence, we should:</p>

<ol>
  <li>set the variable,</li>
  <li>execute the statement,</li>
  <li>reset the variable.</li>
</ol>
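
<p>As a sketch, assuming the change is made at session level and that restoring the session value from the global one (via <code class="language-plaintext highlighter-rouge">DEFAULT</code>) is acceptable, the three steps could look like this:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- 1. Raise the limit for the current session.</span>
<span class="k">SET</span> <span class="k">SESSION</span> <span class="n">cte_max_recursion_depth</span> <span class="o">=</span> <span class="mi">1000000</span><span class="p">;</span>

<span class="c1">-- 2. Execute the statement (the sequence is kept well below the limit).</span>
<span class="k">WITH</span> <span class="k">RECURSIVE</span> <span class="n">sequence</span> <span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="k">AS</span>
<span class="p">(</span>
  <span class="k">SELECT</span> <span class="mi">0</span>
  <span class="k">UNION</span> <span class="k">ALL</span>
  <span class="k">SELECT</span> <span class="n">n</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">FROM</span> <span class="n">sequence</span> <span class="k">WHERE</span> <span class="n">n</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">&lt;=</span> <span class="mi">100000</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">sequence</span><span class="p">;</span>

<span class="c1">-- 3. Restore the session value from the global one.</span>
<span class="k">SET</span> <span class="k">SESSION</span> <span class="n">cte_max_recursion_depth</span> <span class="o">=</span> <span class="k">DEFAULT</span><span class="p">;</span>
</code></pre></div></div>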

<p>This procedure consists of three statements, which is of course inconvenient. What do we do?</p>

<p>Enter the <a href="https://dev.mysql.com/doc/refman/8.0/en/optimizer-hints.html#optimizer-hints-set-var">per-statement variables setting</a>.</p>

<p>This is a lesser-known new MySQL 8.0 feature that comes in very handy where needed.</p>

<p>In short, <code class="language-plaintext highlighter-rouge">SET_VAR</code> is a query hint, that allows one or more variables to be set exclusively within the scope of a statement.</p>

<p>In this case, if we want to generate a 1M numbers sequence, we set <code class="language-plaintext highlighter-rouge">cte_max_recursion_depth</code>:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Select the integers in the range [0, 1000000].</span>
<span class="c1">--</span>
<span class="k">WITH</span> <span class="k">RECURSIVE</span> <span class="n">sequence</span> <span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="k">AS</span>
<span class="p">(</span>
  <span class="k">SELECT</span> <span class="mi">0</span>
  <span class="k">UNION</span> <span class="k">ALL</span>
  <span class="k">SELECT</span> <span class="n">n</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">FROM</span> <span class="n">sequence</span> <span class="k">WHERE</span> <span class="n">n</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">&lt;=</span> <span class="mi">1000000</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="cm">/*+ SET_VAR(cte_max_recursion_depth = 1M) */</span>
  <span class="n">n</span>
<span class="k">FROM</span> <span class="n">sequence</span><span class="p">;</span>
</code></pre></div></div>

<p>(I’ve actually <a href="https://bugs.mysql.com/bug.php?id=98881">opened a bug</a> suggesting to include this function in the CTE manpage.)</p>
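
<p>To check that the hint doesn’t leak outside the statement, one can read the variable with and without it; this is a sketch, assuming the session value hasn’t been changed from the default:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Inside the hinted statement, the raised value is visible.</span>
<span class="k">SELECT</span> <span class="cm">/*+ SET_VAR(cte_max_recursion_depth = 1M) */</span> <span class="o">@@</span><span class="n">cte_max_recursion_depth</span><span class="p">;</span>

<span class="c1">-- In the next statement, the session value (1000 by default) is back.</span>
<span class="k">SELECT</span> <span class="o">@@</span><span class="n">cte_max_recursion_depth</span><span class="p">;</span>
</code></pre></div></div>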

<h2 id="generating-a-sequence-of-random-integers">Generating a sequence of random integers</h2>

<p>If we want to create random numbers, we use <code class="language-plaintext highlighter-rouge">RAND()</code><a href="#footnote02">²</a> and <code class="language-plaintext highlighter-rouge">SELECT</code> only the associated expression:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Create a table with 1000 random integers in the range [0, 65536).</span>
<span class="c1">--</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">random_int_sequence</span>
<span class="k">WITH</span> <span class="k">RECURSIVE</span> <span class="n">sequence</span> <span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="k">AS</span>
<span class="p">(</span>
  <span class="k">SELECT</span> <span class="mi">0</span>
  <span class="k">UNION</span> <span class="k">ALL</span>
  <span class="k">SELECT</span> <span class="n">n</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">FROM</span> <span class="n">sequence</span> <span class="k">WHERE</span> <span class="n">n</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">&lt;</span> <span class="mi">1000</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="n">FLOOR</span><span class="p">(</span><span class="mi">65536</span> <span class="o">*</span> <span class="n">RAND</span><span class="p">())</span> <span class="nv">`rand_n`</span>
<span class="k">FROM</span> <span class="n">sequence</span><span class="p">;</span>
</code></pre></div></div>

<h2 id="generating-a-characters-interval">Generating a characters interval</h2>

<p>Nothing prohibits us from generating a sequence of characters; in this case, we’ll use the <code class="language-plaintext highlighter-rouge">CHAR()</code> and <code class="language-plaintext highlighter-rouge">ORD()</code> functions to increment the current value:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">random_char_sequence</span>
<span class="k">WITH</span> <span class="k">RECURSIVE</span> <span class="n">sequence</span> <span class="p">(</span><span class="k">c</span><span class="p">)</span> <span class="k">AS</span>
<span class="p">(</span>
  <span class="k">SELECT</span> <span class="s1">'A'</span>
  <span class="k">UNION</span> <span class="k">ALL</span>
  <span class="k">SELECT</span> <span class="nb">CHAR</span><span class="p">(</span><span class="n">ORD</span><span class="p">(</span><span class="k">c</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">USING</span> <span class="n">ASCII</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">sequence</span> <span class="k">WHERE</span> <span class="nb">CHAR</span><span class="p">(</span><span class="n">ORD</span><span class="p">(</span><span class="k">c</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">USING</span> <span class="n">ASCII</span><span class="p">)</span> <span class="o">&lt;=</span> <span class="s1">'Z'</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="k">c</span>
<span class="k">FROM</span> <span class="n">sequence</span><span class="p">;</span>
</code></pre></div></div>

<h2 id="generating-a-dates-interval">Generating a dates interval</h2>

<p>Finally, we’ll generate a dates interval.</p>

<p>In this section, it’s worth mentioning an interesting usage. Suppose one is reporting monthly sales. Is this query correct?:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Underlying table structure.</span>
<span class="c1">--</span>
<span class="c1">-- CREATE TABLE line_items(</span>
<span class="c1">--   id INT    UNSIGNED PRIMARY KEY,</span>
<span class="c1">--   total     DECIMAL(8,2) NOT NULL,</span>
<span class="c1">--   sold_on   DATETIME NOT NULL </span>
<span class="c1">-- );</span>

<span class="k">SELECT</span> <span class="nb">YEAR</span><span class="p">(</span><span class="n">sold_on</span><span class="p">)</span> <span class="nv">`sale_year`</span><span class="p">,</span> <span class="k">MONTH</span><span class="p">(</span><span class="n">sold_on</span><span class="p">)</span> <span class="nv">`sale_month`</span><span class="p">,</span> <span class="k">SUM</span><span class="p">(</span><span class="n">total</span><span class="p">)</span> <span class="nv">`month_sales`</span>
<span class="k">FROM</span> <span class="n">line_items</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="n">sale_year</span><span class="p">,</span> <span class="n">sale_month</span><span class="p">;</span>
</code></pre></div></div>

<p>The answer is: it depends on the requirements.</p>

<p>If the requirement is that <em>all</em> the months must be displayed, one may miss rows for months when there are no sales.</p>

<p>A solution is to use a sequence with all the months in the required interval, and (left) join the CTE with the table.</p>

<p>Let’s prepare some data (via CTE, of course! 😉), for a few months (except the current):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">line_items</span><span class="p">(</span>
  <span class="n">id</span> <span class="nb">INT</span>       <span class="nb">UNSIGNED</span> <span class="n">AUTO_INCREMENT</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
  <span class="n">total</span>        <span class="nb">DECIMAL</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span><span class="mi">2</span><span class="p">)</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
  <span class="n">sold_on</span>      <span class="nb">DATETIME</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
  <span class="n">sold_on_date</span> <span class="nb">DATE</span> <span class="k">AS</span> <span class="p">(</span><span class="nb">DATE</span><span class="p">(</span><span class="n">sold_on</span><span class="p">)),</span>
  <span class="k">KEY</span> <span class="p">(</span><span class="n">sold_on_date</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">WITH</span> <span class="k">RECURSIVE</span> <span class="n">sequence</span> <span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="k">AS</span>
<span class="p">(</span>
  <span class="k">SELECT</span> <span class="mi">0</span>
  <span class="k">UNION</span> <span class="k">ALL</span>
  <span class="k">SELECT</span> <span class="n">n</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">FROM</span> <span class="n">sequence</span> <span class="k">WHERE</span> <span class="n">n</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">&lt;</span> <span class="mi">100000</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="cm">/*+ SET_VAR(cte_max_recursion_depth = 1M) */</span>
  <span class="k">CAST</span><span class="p">(</span><span class="mi">20</span> <span class="o">*</span> <span class="n">RAND</span><span class="p">()</span> <span class="k">AS</span> <span class="nb">DECIMAL</span><span class="p">)</span> <span class="nv">`total`</span><span class="p">,</span>
  <span class="n">NOW</span><span class="p">()</span> <span class="o">-</span> <span class="n">INTERVAL</span> <span class="n">DAYOFMONTH</span><span class="p">(</span><span class="n">CURDATE</span><span class="p">())</span> <span class="k">DAY</span> <span class="o">-</span> <span class="n">INTERVAL</span> <span class="p">(</span><span class="mi">100</span> <span class="o">*</span> <span class="n">RAND</span><span class="p">())</span> <span class="k">DAY</span> <span class="nv">`sold_on`</span>
<span class="k">FROM</span> <span class="n">sequence</span><span class="p">;</span>
</code></pre></div></div>

<p>There are a couple of interesting concepts here:</p>

<p>The first is that by using <code class="language-plaintext highlighter-rouge">NOW() - INTERVAL DAYOFMONTH(CURDATE()) DAY</code> as base, we ensure that we don’t store any sales for the current month.</p>

<p>The second is that, in order to perform an efficient left join, a functional index is required; there are a few considerations about this subject, which I’ll leave to a separate article.</p>

<p>Additionally, note that float <code class="language-plaintext highlighter-rouge">INTERVAL</code>s are rounded (but it’s irrelevant in this context).</p>
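
<p>As a quick check of the first point, here is a minimal sketch (the column aliases are purely illustrative): subtracting the day of the month from the current date lands on the last day of the previous month, so the two expressions below should return the same date.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Both columns are expected to hold the last day of the previous month.
SELECT
  CURDATE() - INTERVAL DAYOFMONTH(CURDATE()) DAY `base_date`,
  LAST_DAY(CURDATE() - INTERVAL 1 MONTH)         `last_day_of_previous_month`;
</code></pre></div></div>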

<p>Now we can query!</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WITH</span> <span class="k">RECURSIVE</span> <span class="n">dates_range</span> <span class="p">(</span><span class="n">d</span><span class="p">)</span> <span class="k">AS</span>
<span class="p">(</span>
  <span class="k">SELECT</span> <span class="n">CURDATE</span><span class="p">()</span> <span class="o">-</span> <span class="n">INTERVAL</span> <span class="mi">124</span> <span class="k">DAY</span>
  <span class="k">UNION</span> <span class="k">ALL</span>
  <span class="k">SELECT</span> <span class="n">d</span> <span class="o">+</span> <span class="n">INTERVAL</span> <span class="mi">1</span> <span class="k">DAY</span> <span class="k">FROM</span> <span class="n">dates_range</span> <span class="k">WHERE</span> <span class="n">d</span> <span class="o">+</span> <span class="n">INTERVAL</span> <span class="mi">1</span> <span class="k">day</span> <span class="o">&lt;=</span> <span class="n">CURDATE</span><span class="p">()</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="nb">YEAR</span><span class="p">(</span><span class="n">d</span><span class="p">)</span> <span class="nv">`sales_year`</span><span class="p">,</span> <span class="k">MONTH</span><span class="p">(</span><span class="n">d</span><span class="p">)</span> <span class="nv">`sales_month`</span><span class="p">,</span> <span class="k">SUM</span><span class="p">(</span><span class="n">total</span><span class="p">)</span> <span class="nv">`month_total_sales`</span>
<span class="k">FROM</span>
  <span class="n">dates_range</span>
  <span class="k">LEFT</span> <span class="k">JOIN</span> <span class="n">line_items</span> <span class="k">ON</span> <span class="n">d</span> <span class="o">=</span> <span class="n">sold_on_date</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="n">sales_year</span><span class="p">,</span> <span class="n">sales_month</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">sales_year</span><span class="p">,</span> <span class="n">sales_month</span><span class="p">;</span>

<span class="c1">-- +------------+-------------+-------------------+</span>
<span class="c1">-- | sales_year | sales_month | month_total_sales |</span>
<span class="c1">-- +------------+-------------+-------------------+</span>
<span class="c1">-- |       2019 |          11 |          27895.00 |</span>
<span class="c1">-- |       2019 |          12 |         331700.00 |</span>
<span class="c1">-- |       2020 |           1 |         335775.00 |</span>
<span class="c1">-- |       2020 |           2 |         306289.00 |</span>
<span class="c1">-- |       2020 |           3 |              NULL |</span>
<span class="c1">-- +------------+-------------+-------------------+</span>
</code></pre></div></div>

<p>Excellent. The current month is displayed, as intended, even if it has no sales.</p>
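
<p>If a numeric zero is preferred over <code class="language-plaintext highlighter-rouge">NULL</code> for the months without sales, a possible variation (a sketch, not part of the original query) is to wrap the aggregate in <code class="language-plaintext highlighter-rouge">COALESCE()</code>:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>WITH RECURSIVE dates_range (d) AS
(
  SELECT CURDATE() - INTERVAL 124 DAY
  UNION ALL
  SELECT d + INTERVAL 1 DAY FROM dates_range WHERE d + INTERVAL 1 DAY &lt;= CURDATE()
)
-- COALESCE() returns its first non-NULL argument, so months without sales show 0.
SELECT YEAR(d) `sales_year`, MONTH(d) `sales_month`, COALESCE(SUM(total), 0) `month_total_sales`
FROM
  dates_range
  LEFT JOIN line_items ON d = sold_on_date
GROUP BY sales_year, sales_month
ORDER BY sales_year, sales_month;
</code></pre></div></div>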

<p>Let’s check the optimizer plan (note that I’ve removed the <code class="language-plaintext highlighter-rouge">ORDER BY</code> clause for simplicity):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">EXPLAIN</span> <span class="n">FORMAT</span><span class="o">=</span><span class="n">TREE</span>
<span class="k">WITH</span> <span class="k">RECURSIVE</span> <span class="n">dates_range</span> <span class="p">(</span><span class="n">d</span><span class="p">)</span> <span class="k">AS</span>
<span class="p">(</span>
  <span class="k">SELECT</span> <span class="n">CURDATE</span><span class="p">()</span> <span class="o">-</span> <span class="n">INTERVAL</span> <span class="mi">124</span> <span class="k">DAY</span>
  <span class="k">UNION</span> <span class="k">ALL</span>
  <span class="k">SELECT</span> <span class="n">d</span> <span class="o">+</span> <span class="n">INTERVAL</span> <span class="mi">1</span> <span class="k">DAY</span> <span class="k">FROM</span> <span class="n">dates_range</span> <span class="k">WHERE</span> <span class="n">d</span> <span class="o">+</span> <span class="n">INTERVAL</span> <span class="mi">1</span> <span class="k">day</span> <span class="o">&lt;=</span> <span class="n">CURDATE</span><span class="p">()</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="nb">YEAR</span><span class="p">(</span><span class="n">d</span><span class="p">)</span> <span class="nv">`sales_year`</span><span class="p">,</span> <span class="k">MONTH</span><span class="p">(</span><span class="n">d</span><span class="p">)</span> <span class="nv">`sales_month`</span><span class="p">,</span> <span class="k">SUM</span><span class="p">(</span><span class="n">total</span><span class="p">)</span> <span class="nv">`month_total_sales`</span>
<span class="k">FROM</span>
  <span class="n">dates_range</span>
  <span class="k">LEFT</span> <span class="k">JOIN</span> <span class="n">line_items</span> <span class="k">ON</span> <span class="n">d</span> <span class="o">=</span> <span class="n">sold_on_date</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="n">sales_year</span><span class="p">,</span> <span class="n">sales_month</span><span class="err">\</span><span class="k">G</span>

<span class="c1">-- *************************** 1. row ***************************</span>
<span class="c1">-- EXPLAIN: -&gt; Table scan on &lt;temporary&gt;</span>
<span class="c1">--     -&gt; Aggregate using temporary table</span>
<span class="c1">--         -&gt; Nested loop left join</span>
<span class="c1">--             -&gt; Table scan on dates_range</span>
<span class="c1">--                 -&gt; Materialize recursive CTE dates_range</span>
<span class="c1">--                     -&gt; Rows fetched before execution</span>
<span class="c1">--                     -&gt; Repeat until convergence</span>
<span class="c1">--                         -&gt; Filter: ((dates_range.d + interval 1 day) &lt;= &lt;cache&gt;(curdate()))  (cost=2.73 rows=2)</span>
<span class="c1">--                             -&gt; Scan new records on dates_range  (cost=2.73 rows=2)</span>
<span class="c1">--             -&gt; Index lookup on line_items using sold_on_date (sold_on_date=dates_range.d)  (cost=0.28 rows=1)</span>
</code></pre></div></div>

<p>The plan has a few interesting points, but they are left to the reader, since they are out of the scope of this article.</p>

<h2 id="conclusion">Conclusion</h2>

<p>MySQL 8.0 brought many very interesting features. Although sequences/generators are still not fully supported, we can use the (very flexible) CTEs to cover a part of the use cases.</p>

<p>Happy querying with MySQL 8.0!</p>

<h2 id="footnotes">Footnotes</h2>

<p><a name="footnote01">¹</a>: Please note that real-world schemas are generally designed differently, and this example has been written with simplicity in mind instead.
<a name="footnote02">²</a>: Remember that <code class="language-plaintext highlighter-rouge">RAND()</code> is not a cryptographically secure function.</p>]]></content><author><name></name></author><category term="mysql" /><category term="databases" /><category term="mysql" /><summary type="html"><![CDATA[A long-time missing (and missed) functionality in MySQL, is sequences/ranges. As of MySQL 8.0, this functionality is still not supported in a general sense, however, it’s now possible to generate a sequence to be used within a single query. In this article, I’ll give a brief introduction to CTEs, and explain how to build different sequence generators; additionally, I’ll introduce the new (cool) MySQL 8.0 query hint SET_VAR, and a pinch of virtual columns and functional indexes (“functional key parts”, another MySQL 8.0 feature). Contents: A brief introduction to Common Table Expressions (CTEs) Recursive CTEs, and generating a linear sequence of integers Per-statement variables setting Generating a sequence of random integers Generating a characters interval Generating a dates interval Conclusion Footnotes]]></summary></entry><entry><title type="html">PreFOSDEM talk: Upgrading from MySQL 5.7 to MySQL 8.0</title><link href="https://saveriomiroddi.github.io/Pre-fosdem-talk-upgrading-from-mysql-5.7-to-8.0/" rel="alternate" type="text/html" title="PreFOSDEM talk: Upgrading from MySQL 5.7 to MySQL 8.0" /><published>2020-02-23T00:00:00+00:00</published><updated>2020-02-23T00:00:00+00:00</updated><id>https://saveriomiroddi.github.io/Pre-fosdem-talk-upgrading-from-mysql-5.7-to-8.0</id><content type="html" xml:base="https://saveriomiroddi.github.io/Pre-fosdem-talk-upgrading-from-mysql-5.7-to-8.0/"><![CDATA[<p>In this post I’ll expand on the subject of my MySQL pre-FOSDEM talk: what dbadmins need to know and do, when upgrading from MySQL 5.7 to 8.0.</p>

<p>I’ve already published <a href="/An-in-depth-dbas-guide-to-migrating-a-mysql-database-from-the-utf8-to-the-utf8mb4-charset/">two</a> <a href="/Summary-of-trailing-spaces-handling-in-MySQL-with-version-8.0-upgrade-considerations/">posts</a> on two specific issues; in this article, I’ll give the complete picture.</p>

<p>As usual, I’ll use this post to introduce tooling concepts that may be useful in generic system administration.</p>

<p>The presentation code is hosted on a <a href="https://github.com/saveriomiroddi/prefosdem-2020-presentation">GitHub repository</a> (including <a href="https://github.com/saveriomiroddi/prefosdem-2020-presentation/tree/master/sources">the source files</a> and the output slides <a href="https://github.com/saveriomiroddi/prefosdem-2020-presentation/blob/master/slides/slides.pdf">in PDF format</a>), and on <a href="https://www.slideshare.net/SaverioM/friends-let-real-friends-use-mysql-80">Slideshare</a>.</p>

<p>Contents:</p>

<ul>
  <li><a href="/Pre-fosdem-talk-upgrading-from-mysql-5.7-to-8.0#summary-of-issues-and-scope">Summary of issues, and scope</a></li>
  <li><a href="/Pre-fosdem-talk-upgrading-from-mysql-5.7-to-8.0#requirements">Requirements</a></li>
  <li><a href="/Pre-fosdem-talk-upgrading-from-mysql-5.7-to-8.0#the-new-default-character-setcollation-utf8mb4utf8mb4_0900_ai_ci">The new default character set/collation: utf8mb4/utf8mb4_0900_ai_ci</a>
    <ul>
      <li><a href="/Pre-fosdem-talk-upgrading-from-mysql-5.7-to-8.0#summary">Summary</a></li>
      <li><a href="/Pre-fosdem-talk-upgrading-from-mysql-5.7-to-8.0#tooling-mysql-rlike">Tooling: MySQL RLIKE</a></li>
      <li><a href="/Pre-fosdem-talk-upgrading-from-mysql-5.7-to-8.0#how-the-charset-parameters-work">How the charset parameters work</a></li>
      <li><a href="/Pre-fosdem-talk-upgrading-from-mysql-5.7-to-8.0#string-and-comparison-properties">String, and comparison, properties</a></li>
      <li><a href="/Pre-fosdem-talk-upgrading-from-mysql-5.7-to-8.0#collation-coercion-and-issues-general--0900_ai">Collation coercion, and issues <code class="language-plaintext highlighter-rouge">general</code> &lt;&gt; <code class="language-plaintext highlighter-rouge">0900_ai</code></a>
        <ul>
          <li><a href="/Pre-fosdem-talk-upgrading-from-mysql-5.7-to-8.0#comparisons-utf8_general_ci-column--literals">Comparisons utf8_general_ci column &lt;&gt; literals</a></li>
          <li><a href="/Pre-fosdem-talk-upgrading-from-mysql-5.7-to-8.0#comparisons-utf8_general_ci-column--columns">Comparisons utf8_general_ci column &lt;&gt; columns</a></li>
          <li><a href="/Pre-fosdem-talk-upgrading-from-mysql-5.7-to-8.0#summary-of-the-migration-path">Summary of the migration path</a></li>
        </ul>
      </li>
      <li><a href="/Pre-fosdem-talk-upgrading-from-mysql-5.7-to-8.0#the-new-collation-doesnt-pad-anymore">The new collation doesn’t pad anymore</a></li>
      <li><a href="/Pre-fosdem-talk-upgrading-from-mysql-5.7-to-8.0#triggers">Triggers</a>
        <ul>
          <li><a href="/Pre-fosdem-talk-upgrading-from-mysql-5.7-to-8.0#sort-of-related-suggestion">Sort-of-related suggestion</a></li>
        </ul>
      </li>
      <li><a href="/Pre-fosdem-talk-upgrading-from-mysql-5.7-to-8.0#behavior-with-indexes">Behavior with indexes</a></li>
      <li><a href="/Pre-fosdem-talk-upgrading-from-mysql-5.7-to-8.0#consequences-of-the-increase-in-potential-size-of-char-columns">Consequences of the increase in (potential) size of char columns</a></li>
    </ul>
  </li>
  <li><a href="/Pre-fosdem-talk-upgrading-from-mysql-5.7-to-8.0#information-schema-statistics-caching">Information schema statistics caching</a></li>
  <li><a href="/Pre-fosdem-talk-upgrading-from-mysql-5.7-to-8.0#group-by-not-sorted-anymore-by-default-tooling">GROUP BY not sorted anymore by default (+tooling)</a></li>
  <li><a href="/Pre-fosdem-talk-upgrading-from-mysql-5.7-to-8.0#schema-migration-tools-support">Schema migration tools support</a></li>
  <li><a href="/Pre-fosdem-talk-upgrading-from-mysql-5.7-to-8.0#obsolete-mac-homebrew-default-collation">Obsolete Mac Homebrew default collation</a>
    <ul>
      <li><a href="/Pre-fosdem-talk-upgrading-from-mysql-5.7-to-8.0#modify-the-formula-and-recompile-the-binaries">Modify the formula, and recompile the binaries</a></li>
      <li><a href="/Pre-fosdem-talk-upgrading-from-mysql-5.7-to-8.0#ignore-the-client-encoding-on-handshake">Ignore the client encoding on handshake</a></li>
    </ul>
  </li>
  <li><a href="/Pre-fosdem-talk-upgrading-from-mysql-5.7-to-8.0#good-practice-for-majorminor-upgrades-comparing-the-system-variables">Good practice for (major/minor) upgrades: comparing the system variables</a></li>
  <li><a href="/Pre-fosdem-talk-upgrading-from-mysql-5.7-to-8.0#conclusion">Conclusion</a></li>
</ul>

<h2 id="summary-of-issues-and-scope">Summary of issues, and scope</h2>

<p>The following are the basic issues to handle when migrating:</p>

<ul>
  <li>the new charset/collation <code class="language-plaintext highlighter-rouge">utf8mb4</code>/<code class="language-plaintext highlighter-rouge">utf8mb4_0900_ai_ci</code>;</li>
  <li>the trailing whitespace is handled differently;</li>
  <li>GROUP BY is not sorted anymore by default;</li>
  <li>the information schema is now cached (by default);</li>
  <li>incompatibility with schema migration tools.</li>
</ul>

<p>Of course, the larger the scale, the more aspects will need to be considered; for example, large-scale write-bound systems may need to handle:</p>

<ul>
  <li>changes in dirty page cleaning parameters and design;</li>
  <li>(new) data dictionary contention;</li>
  <li>and so on.</li>
</ul>

<p>In this article, I’ll only deal with what can be reasonably considered the lowest common denominator of all the migrations.</p>

<h2 id="requirements">Requirements</h2>

<p>All the SQL examples are executed on MySQL 8.0.</p>

<h2 id="the-new-default-character-setcollation-utf8mb4utf8mb4_0900_ai_ci">The new default character set/collation: <code class="language-plaintext highlighter-rouge">utf8mb4</code>/<code class="language-plaintext highlighter-rouge">utf8mb4_0900_ai_ci</code></h2>

<h3 id="summary">Summary</h3>

<p>References:</p>

<ul>
  <li><a href="/An-in-depth-dbas-guide-to-migrating-a-mysql-database-from-the-utf8-to-the-utf8mb4-charset/">An in depth DBA’s guide to migrating a MySQL database from the <code class="language-plaintext highlighter-rouge">utf8</code> to the <code class="language-plaintext highlighter-rouge">utf8mb4</code> charset</a></li>
  <li><a href="https://mysqlserverteam.com/mysql-8-0-collations-the-devil-is-in-the-details">MySQL 8.0 Collations: The devil is in the details.</a></li>
  <li><a href="http://mysqlserverteam.com/new-collations-in-mysql-8-0-0">New collations in MySQL 8.0.0</a></li>
</ul>

<p>MySQL 8.0 introduces a new collation - <code class="language-plaintext highlighter-rouge">utf8mb4_0900_ai_ci</code>. Why?</p>

<p>Basically, it’s an improved version of the <code class="language-plaintext highlighter-rouge">general_ci</code> collation: it supports Unicode 9.0, it irons out a few issues, and it’s faster.</p>

<p>The collation <code class="language-plaintext highlighter-rouge">utf8(mb4)_general_ci</code> wasn’t entirely correct; a typical example is <code class="language-plaintext highlighter-rouge">Å</code>:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Å = U+212B</span>
<span class="k">SELECT</span> <span class="nv">"sÅverio"</span> <span class="o">=</span> <span class="nv">"saverio"</span> <span class="k">COLLATE</span> <span class="n">utf8mb4_general_ci</span><span class="p">;</span>
<span class="c1">-- +--------+</span>
<span class="c1">-- | result |</span>
<span class="c1">-- +--------+</span>
<span class="c1">-- |      0 |</span>
<span class="c1">-- +--------+</span>

<span class="k">SELECT</span> <span class="nv">"sÅverio"</span> <span class="o">=</span> <span class="nv">"saverio"</span><span class="p">;</span> <span class="c1">-- Default (COLLATE utf8mb4_0900_ai_ci);</span>
<span class="c1">-- +--------+</span>
<span class="c1">-- | result |</span>
<span class="c1">-- +--------+</span>
<span class="c1">-- |      1 |</span>
<span class="c1">-- +--------+</span>
</code></pre></div></div>

<p>From this, you can also guess what <code class="language-plaintext highlighter-rouge">ai_ci</code> means: <code class="language-plaintext highlighter-rouge">a</code>ccent <code class="language-plaintext highlighter-rouge">i</code>nsensitive/<code class="language-plaintext highlighter-rouge">c</code>ase <code class="language-plaintext highlighter-rouge">i</code>nsensitive.</p>

<p>So, what’s the problem?</p>

<p>Legacy.</p>

<p>Technically, <code class="language-plaintext highlighter-rouge">utf8mb4</code> has been available in MySQL for a long time. At least a part of the industry started the migration long before MySQL 8.0, and publicly documented the process.</p>

<p>However, at that time, <code class="language-plaintext highlighter-rouge">utf8mb4_0900_ai_ci</code> didn’t exist, and <code class="language-plaintext highlighter-rouge">utf8mb4_general_ci</code> was the default collation of the charset. Therefore, a vast amount of the documentation around suggests moving to that collation.</p>

<p>While this is not an issue per se, it is a big issue when considering that the two collations are incompatible.</p>

<h3 id="tooling-mysql-rlike">Tooling: MySQL RLIKE</h3>

<p>For people who like (and frequently use) them, regular expressions are a fundamental tool.</p>

<p>In particular when performing administration tasks (using them in an application for data matching is a different topic), they can streamline some queries, avoiding lengthy concatenations of conditions.</p>

<p>I find them especially practical as a sophisticated supplement to <code class="language-plaintext highlighter-rouge">SHOW &lt;object&gt;</code>.</p>

<p><code class="language-plaintext highlighter-rouge">SHOW &lt;object&gt;</code>, in MySQL, supports <code class="language-plaintext highlighter-rouge">LIKE</code>; however, it’s fairly limited in functionality. For example:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SHOW</span> <span class="k">GLOBAL</span> <span class="n">VARIABLES</span> <span class="k">LIKE</span> <span class="s1">'character_set%'</span><span class="p">;</span>
<span class="c1">-- +--------------------------+-------------------------------------------------------------------------+</span>
<span class="c1">-- | Variable_name            | Value                                                                   |</span>
<span class="c1">-- +--------------------------+-------------------------------------------------------------------------+</span>
<span class="c1">-- | character_set_client     | utf8mb4                                                                 |</span>
<span class="c1">-- | character_set_connection | utf8mb4                                                                 |</span>
<span class="c1">-- | character_set_database   | utf8mb4                                                                 |</span>
<span class="c1">-- | character_set_filesystem | binary                                                                  |</span>
<span class="c1">-- | character_set_results    | utf8mb4                                                                 |</span>
<span class="c1">-- | character_set_server     | utf8mb4                                                                 |</span>
<span class="c1">-- | character_set_system     | utf8                                                                    |</span>
<span class="c1">-- | character_sets_dir       | /home/saverio/local/mysql-8.0.19-linux-glibc2.12-x86_64/share/charsets/ |</span>
<span class="c1">-- +--------------------------+-------------------------------------------------------------------------+</span>
</code></pre></div></div>

<p>Let’s turbocharge it!</p>

<p>Let’s get all the meaningful charset-related variables, but not one more, in a single swoop:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SHOW</span> <span class="k">GLOBAL</span> <span class="n">VARIABLES</span> <span class="k">WHERE</span> <span class="n">Variable_name</span> <span class="n">RLIKE</span> <span class="s1">'^(character_set|collation)_'</span> <span class="k">AND</span> <span class="n">Variable_name</span> <span class="k">NOT</span> <span class="n">RLIKE</span> <span class="s1">'system|data'</span><span class="p">;</span>
<span class="c1">-- +--------------------------+--------------------+</span>
<span class="c1">-- | Variable_name            | Value              |</span>
<span class="c1">-- +--------------------------+--------------------+</span>
<span class="c1">-- | character_set_client     | utf8mb4            |</span>
<span class="c1">-- | character_set_connection | utf8mb4            |</span>
<span class="c1">-- | character_set_results    | utf8mb4            |</span>
<span class="c1">-- | character_set_server     | utf8mb4            |</span>
<span class="c1">-- | collation_connection     | utf8mb4_general_ci |</span>
<span class="c1">-- | collation_server         | utf8mb4_general_ci |</span>
<span class="c1">-- +--------------------------+--------------------+</span>
</code></pre></div></div>

<p>Nice. The first regex reads: “string starting with (<code class="language-plaintext highlighter-rouge">^</code>) either <code class="language-plaintext highlighter-rouge">character_set</code> or <code class="language-plaintext highlighter-rouge">collation</code>, and followed by <code class="language-plaintext highlighter-rouge">_</code>”. Note that if we don’t group <code class="language-plaintext highlighter-rouge">character_set</code> and <code class="language-plaintext highlighter-rouge">collation</code> (via <code class="language-plaintext highlighter-rouge">(</code>…<code class="language-plaintext highlighter-rouge">)</code>), the <code class="language-plaintext highlighter-rouge">^</code> metacharacter applies only to the first alternative.</p>
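
<p>The effect of the grouping can be seen with a small sketch; the sample string and the column aliases are just for illustration. Without the parentheses, the pattern means “starts with <code class="language-plaintext highlighter-rouge">character_set</code>, or contains <code class="language-plaintext highlighter-rouge">collation_</code> anywhere”, so a name that merely contains <code class="language-plaintext highlighter-rouge">collation_</code> matches:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT
  'default_collation_for_utf8mb4' RLIKE '^character_set|collation_'   `ungrouped`,
  'default_collation_for_utf8mb4' RLIKE '^(character_set|collation)_' `grouped`;
-- +-----------+---------+
-- | ungrouped | grouped |
-- +-----------+---------+
-- |         1 |       0 |
-- +-----------+---------+
</code></pre></div></div>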

<h3 id="how-the-charset-parameters-work">How the charset parameters work</h3>

<p>Character set and collation are a <em>very</em> big deal, because changing them, in this case, requires literally (in a literal sense 😉) rebuilding the entire database: all the records (and related indexes) that include strings will need to be rebuilt.</p>

<p>In order to understand the concepts, let’s have a look at the MySQL server settings again; I’ll reorder and explain them.</p>

<p>Literals sent by the client are assumed to be in the following charset:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">character_set_client</code> (default: <code class="language-plaintext highlighter-rouge">utf8mb4</code>)</li>
</ul>

<p>Then, the server converts them, for processing, to:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">character_set_connection</code> (default: <code class="language-plaintext highlighter-rouge">utf8mb4</code>)</li>
  <li><code class="language-plaintext highlighter-rouge">collation_connection</code> (default: <code class="language-plaintext highlighter-rouge">utf8mb4_0900_ai_ci</code>)</li>
</ul>

<p>The above settings are crucial, as literals are a foundation for exchanging data with the server. For example, when an ORM inserts data in a database, it creates an <code class="language-plaintext highlighter-rouge">INSERT</code> with a set of literals.</p>

<p>When the database system sends the results, it sends them in the following charset:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">character_set_results</code> (default: <code class="language-plaintext highlighter-rouge">utf8mb4</code>)</li>
</ul>

<p>Literals are not the only foundation. Database objects are the other side of the coin. Base defaults for database objects (e.g. the databases) use:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">character_set_server</code> (default: <code class="language-plaintext highlighter-rouge">utf8mb4</code>)</li>
  <li><code class="language-plaintext highlighter-rouge">collation_server</code> (default: <code class="language-plaintext highlighter-rouge">utf8mb4_0900_ai_ci</code>)</li>
</ul>
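
<p>To see these settings from the point of view of a connection, a minimal sketch follows (the aliases are illustrative). <code class="language-plaintext highlighter-rouge">SET NAMES</code> sets the client, connection and results charsets, plus the connection collation, in a single statement; the values returned will of course depend on the client and server configuration.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Sets character_set_client, character_set_connection, character_set_results
-- and collation_connection for the current session.
SET NAMES utf8mb4 COLLATE utf8mb4_0900_ai_ci;

SELECT
  @@character_set_client     `client`,
  @@character_set_connection `connection`,
  @@collation_connection     `connection_collation`,
  @@character_set_results    `results`;
</code></pre></div></div>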

<h3 id="string-and-comparison-properties">String, and comparison, properties</h3>

<p>Some developers would define a string as a stream of bytes; this is not <em>entirely</em> correct.</p>

<p>To be exact, a string is a stream of bytes <em>associated with a character set</em>.</p>

<p>Now, this concept applies to strings in isolation. How about operations on sets of strings, e.g. comparisons?</p>

<p>In a similar way, we need another concept: the “collation”.</p>

<p>A collation is a set of rules that defines how strings are sorted, which is required to perform comparisons.</p>

<p>In a database system, a collation is associated with objects and literals, both through system and specific defaults: a column, for example, will have its own collation, while a literal will use the default, if not specified.</p>

<p>But when comparing two strings with different collations, how is it decided which collation to use?</p>

<p>Enter the “Collation coercibility”.</p>

<h3 id="collation-coercion-and-issues-general--0900_ai">Collation coercion, and issues <code class="language-plaintext highlighter-rouge">general</code> &lt;&gt; <code class="language-plaintext highlighter-rouge">0900_ai</code></h3>

<p>Reference: <a href="https://dev.mysql.com/doc/refman/8.0/en/charset-collation-coercibility.html">Collation Coercibility in Expressions</a></p>

<p>Coercibility is a property of collations, which defines the priority of collations in the context of a comparison.</p>

<p>MySQL has seven coercibility values:</p>

<blockquote>
  <ul>
    <li>0: An explicit COLLATE clause (not coercible at all)</li>
    <li>1: The concatenation of two strings with different collations</li>
    <li>2: The collation of a column or a stored routine parameter or local variable</li>
    <li>3: A “system constant” (the string returned by functions such as USER() or VERSION())</li>
    <li>4: The collation of a literal</li>
    <li>5: The collation of a numeric or temporal value</li>
    <li>6: NULL or an expression that is derived from NULL</li>
  </ul>
</blockquote>

<p>It’s not necessary to know them by heart, since their ordering makes sense (the lowest value takes precedence), but it’s important to know how the main ones work in the context of a migration:</p>

<ul>
  <li>how columns will compare against literals;</li>
  <li>how columns will compare against each other.</li>
</ul>
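
<p>The coercibility of an expression can be inspected directly via the <code class="language-plaintext highlighter-rouge">COERCIBILITY()</code> function. The following is a small sketch, assuming a connection using the <code class="language-plaintext highlighter-rouge">utf8mb4</code> charset (otherwise, the explicit <code class="language-plaintext highlighter-rouge">COLLATE</code> clause is not valid for the literal); the comments report the values listed above:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT COERCIBILITY('abc' COLLATE utf8mb4_general_ci); -- 0: explicit COLLATE clause
SELECT COERCIBILITY(USER());                           -- 3: system constant
SELECT COERCIBILITY('abc');                            -- 4: literal
SELECT COERCIBILITY(1000);                             -- 5: numeric value
</code></pre></div></div>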

<p>What we want to know is what happens in the workflow of a migration, in particular, if we:</p>

<ul>
  <li>start migrating the charset/collation defaults;</li>
  <li>then, we slowly migrate the columns.</li>
</ul>
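
<p>While the columns are being migrated, it can be handy to list the ones still using another collation. A possible sketch, based on the information schema, follows; <code class="language-plaintext highlighter-rouge">mydb</code> is a hypothetical schema name, to be replaced with the real one:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'mydb'                       -- hypothetical schema name
  AND COLLATION_NAME IS NOT NULL                  -- skip the non-string columns
  AND COLLATION_NAME &lt;&gt; 'utf8mb4_0900_ai_ci' -- keep only the columns not yet migrated
ORDER BY TABLE_NAME, ORDINAL_POSITION;
</code></pre></div></div>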

<h4 id="comparisons-utf8_general_ci-column--literals">Comparisons utf8_general_ci column &lt;&gt; literals</h4>

<p>Let’s create a table with all the related collations:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">chartest</span> <span class="p">(</span>
  <span class="n">c3_gen</span> <span class="nb">CHAR</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="nb">CHARACTER</span> <span class="k">SET</span> <span class="n">utf8mb3</span> <span class="k">COLLATE</span> <span class="n">utf8mb3_general_ci</span><span class="p">,</span>
  <span class="n">c4_gen</span> <span class="nb">CHAR</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="nb">CHARACTER</span> <span class="k">SET</span> <span class="n">utf8mb4</span> <span class="k">COLLATE</span> <span class="n">utf8mb4_general_ci</span><span class="p">,</span>
  <span class="n">c4_900</span> <span class="nb">CHAR</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="nb">CHARACTER</span> <span class="k">SET</span> <span class="n">utf8mb4</span> <span class="k">COLLATE</span> <span class="n">utf8mb4_0900_ai_ci</span>
<span class="p">);</span>

<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">chartest</span> <span class="k">VALUES</span><span class="p">(</span><span class="s1">'ä'</span><span class="p">,</span> <span class="s1">'ä'</span><span class="p">,</span> <span class="s1">'ä'</span><span class="p">);</span>
</code></pre></div></div>

<p>Note how we insert characters in the Basic Multilingual Plane (<code class="language-plaintext highlighter-rouge">BMP</code>, essentially, the one supported by <code class="language-plaintext highlighter-rouge">utf8mb3</code>) - we’re simulating a database where we only changed the defaults, not the data.</p>

<p>Let’s compare against a BMP character, sent as a <code class="language-plaintext highlighter-rouge">utf8mb4</code> literal:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">c3_gen</span> <span class="o">=</span> <span class="s1">'ä'</span> <span class="nv">`result`</span> <span class="k">FROM</span> <span class="n">chartest</span><span class="p">;</span>
<span class="c1">-- +--------+</span>
<span class="c1">-- | result |</span>
<span class="c1">-- +--------+</span>
<span class="c1">-- |      1 |</span>
<span class="c1">-- +--------+</span>
</code></pre></div></div>

<p>Nice; it works. Coercion values:</p>

<ul>
  <li>column:           2  # =&gt; wins</li>
  <li>literal implicit: 4</li>
</ul>
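
<p>The two values can be double-checked on the test table with <code class="language-plaintext highlighter-rouge">COERCIBILITY()</code>; this is a sketch with illustrative aliases, and the expected values in the comment are the ones from the reference list above:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT
  COERCIBILITY(c3_gen) `column_coercibility`,
  COERCIBILITY('ä')    `literal_coercibility`
FROM chartest;
-- Expected, according to the coercibility list: 2 for the column, 4 for the literal.
</code></pre></div></div>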

<p>More critical: we compare against a character in the Supplementary Multilingual Plane (<code class="language-plaintext highlighter-rouge">SMP</code>, essentially, one added by <code class="language-plaintext highlighter-rouge">utf8mb4</code>), with explicit collation:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">c3_gen</span> <span class="o">=</span> <span class="s1">'🍕'</span> <span class="k">COLLATE</span> <span class="n">utf8mb4_0900_ai_ci</span> <span class="nv">`result`</span> <span class="k">FROM</span> <span class="n">chartest</span><span class="p">;</span>
<span class="c1">-- +--------+</span>
<span class="c1">-- | result |</span>
<span class="c1">-- +--------+</span>
<span class="c1">-- |      0 |</span>
<span class="c1">-- +--------+</span>
</code></pre></div></div>

<p>Coercion values:</p>

<ul>
  <li>column:           2</li>
  <li>literal explicit: 0  # =&gt; wins</li>
</ul>

<p>MySQL converts the first value and uses the explicit collation.</p>

<p>Most critical: compare against a character in the SMP, without explicit collation:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">c3_gen</span> <span class="o">=</span> <span class="s1">'🍕'</span> <span class="nv">`result`</span> <span class="k">FROM</span> <span class="n">chartest</span><span class="p">;</span>
<span class="n">ERROR</span> <span class="mi">1267</span> <span class="p">(</span><span class="n">HY000</span><span class="p">):</span> <span class="n">Illegal</span> <span class="n">mix</span> <span class="k">of</span> <span class="n">collations</span> <span class="p">(</span><span class="n">utf8_general_ci</span><span class="p">,</span><span class="k">IMPLICIT</span><span class="p">)</span> <span class="k">and</span> <span class="p">(</span><span class="n">utf8mb4_general_ci</span><span class="p">,</span><span class="n">COERCIBLE</span><span class="p">)</span> <span class="k">for</span> <span class="k">operation</span> <span class="s1">'='</span>
</code></pre></div></div>

<p>WAT!!</p>

<p>Weird?</p>

<p>Well, this is because:</p>

<ul>
  <li>column:           2  # =&gt; wins</li>
  <li>literal implicit: 4</li>
</ul>

<p>MySQL tries to coerce the charset/collation to the column’s one, and fails!</p>

<p>This gives a clear indication to the migration: <em>do not</em> allow SMP characters in the system, until the entire dataset has been migrated.</p>

<h4 id="comparisons-utf8_general_ci-column--columns">Comparisons utf8_general_ci column &lt;&gt; columns</h4>

<p>Now, let’s see what happens between columns!</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">chartest</span> <span class="n">a</span> <span class="k">JOIN</span> <span class="n">chartest</span> <span class="n">b</span> <span class="k">ON</span> <span class="n">a</span><span class="p">.</span><span class="n">c3_gen</span> <span class="o">=</span> <span class="n">b</span><span class="p">.</span><span class="n">c4_gen</span><span class="p">;</span>
<span class="c1">-- +----------+</span>
<span class="c1">-- | COUNT(*) |</span>
<span class="c1">-- +----------+</span>
<span class="c1">-- |        1 |</span>
<span class="c1">-- +----------+</span>

<span class="k">SELECT</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">chartest</span> <span class="n">a</span> <span class="k">JOIN</span> <span class="n">chartest</span> <span class="n">b</span> <span class="k">ON</span> <span class="n">a</span><span class="p">.</span><span class="n">c3_gen</span> <span class="o">=</span> <span class="n">b</span><span class="p">.</span><span class="n">c4_900</span><span class="p">;</span>
<span class="c1">-- +----------+</span>
<span class="c1">-- | COUNT(*) |</span>
<span class="c1">-- +----------+</span>
<span class="c1">-- |        1 |</span>
<span class="c1">-- +----------+</span>

<span class="k">SELECT</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">chartest</span> <span class="n">a</span> <span class="k">JOIN</span> <span class="n">chartest</span> <span class="n">b</span> <span class="k">ON</span> <span class="n">a</span><span class="p">.</span><span class="n">c4_gen</span> <span class="o">=</span> <span class="n">b</span><span class="p">.</span><span class="n">c4_900</span><span class="p">;</span>
<span class="n">ERROR</span> <span class="mi">1267</span> <span class="p">(</span><span class="n">HY000</span><span class="p">):</span> <span class="n">Illegal</span> <span class="n">mix</span> <span class="k">of</span> <span class="n">collations</span> <span class="p">(</span><span class="n">utf8mb4_general_ci</span><span class="p">,</span><span class="k">IMPLICIT</span><span class="p">)</span> <span class="k">and</span> <span class="p">(</span><span class="n">utf8mb4_0900_ai_ci</span><span class="p">,</span><span class="k">IMPLICIT</span><span class="p">)</span> <span class="k">for</span> <span class="k">operation</span> <span class="s1">'='</span>
</code></pre></div></div>

<p>Ouch. BIG OUCH!</p>

<p>Why?</p>

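<p>Both operands are columns, so they have the same coercibility (2, implicit); since they share the charset but not the collation, MySQL has no rule to pick a winner. This is what happens to people who migrated, referring to obsolete documentation, to <code class="language-plaintext highlighter-rouge">utf8mb4_general_ci</code> - they can’t easily migrate to the new collation.</p>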
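
<p>For individual queries, a possible stopgap is to make one side explicit, so that it wins the coercibility comparison. This is only a sketch on the <code class="language-plaintext highlighter-rouge">chartest</code> table above, not a substitute for converting the columns; also, an index on the column whose collation is overridden is unlikely to be usable for the comparison:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- The explicit COLLATE (coercibility 0) beats the implicit column collation (2),
-- so the comparison should no longer raise error 1267.
--
SELECT COUNT(*)
FROM chartest a
JOIN chartest b ON a.c4_gen = b.c4_900 COLLATE utf8mb4_0900_ai_ci;
</code></pre></div></div>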

<h4 id="summary-of-the-migration-path">Summary of the migration path</h4>

<p>The migration path outlined:</p>

<ul>
  <li>update the defaults to the new charset/collation;</li>
  <li>don’t allow SMP characters in the application;</li>
  <li>gradually convert the tables/columns;</li>
  <li>now allow everything you want 😄.</li>
</ul>

<p>is viable for production systems.</p>

<h3 id="the-new-collation-doesnt-pad-anymore">The new collation doesn’t pad anymore</h3>

<p>There’s another unexpected property of the new collation.</p>

<p>Let’s simulate MySQL 5.7:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Not exact, but close enough</span>
<span class="c1">--</span>
<span class="k">SELECT</span> <span class="s1">''</span> <span class="o">=</span> <span class="n">_utf8</span><span class="s1">' '</span> <span class="k">COLLATE</span> <span class="n">utf8_general_ci</span><span class="p">;</span>
<span class="c1">-- +---------------------------------------+</span>
<span class="c1">-- | '' = _utf8' ' COLLATE utf8_general_ci |</span>
<span class="c1">-- +---------------------------------------+</span>
<span class="c1">-- |                                     1 |</span>
<span class="c1">-- +---------------------------------------+</span>
</code></pre></div></div>

<p>How does this work on MySQL 8.0?:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Current (8.0):</span>
<span class="c1">--</span>
<span class="k">SELECT</span> <span class="s1">''</span> <span class="o">=</span> <span class="s1">' '</span><span class="p">;</span>
<span class="c1">-- +----------+</span>
<span class="c1">-- | '' = ' ' |</span>
<span class="c1">-- +----------+</span>
<span class="c1">-- |        0 |</span>
<span class="c1">-- +----------+</span>
</code></pre></div></div>

<p>Ouch!</p>

<p>Where does this behavior come from? Let’s get some more info from the collations (with a regular expression, of course 😉):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SHOW</span> <span class="k">COLLATION</span> <span class="k">WHERE</span> <span class="k">Collation</span> <span class="n">RLIKE</span> <span class="s1">'utf8mb4_general_ci|utf8mb4_0900_ai_ci'</span><span class="p">;</span>
<span class="c1">-- +--------------------+---------+-----+---------+----------+---------+---------------+</span>
<span class="c1">-- | Collation          | Charset | Id  | Default | Compiled | Sortlen | Pad_attribute |</span>
<span class="c1">-- +--------------------+---------+-----+---------+----------+---------+---------------+</span>
<span class="c1">-- | utf8mb4_0900_ai_ci | utf8mb4 | 255 | Yes     | Yes      |       0 | NO PAD        |</span>
<span class="c1">-- | utf8mb4_general_ci | utf8mb4 |  45 |         | Yes      |       1 | PAD SPACE     |</span>
<span class="c1">-- +--------------------+---------+-----+---------+----------+---------+---------------+</span>
</code></pre></div></div>

<p>Hmmmm 🤔. Let’s have a look at the formal rules from the SQL (2003) standard (section 8.2):</p>

<blockquote>
  <p>3) The comparison of two character strings is determined as follows:</p>

  <p>a) Let CS be the collation […]</p>

  <p>b) <u>If the length in characters of X is not equal to the length in characters of Y, then the shorter string is
   effectively replaced, for the purposes of comparison, with a copy of itself that has been extended to
   the length of the longer string by concatenation on the right of one or more pad characters</u>, where the
   pad character is chosen based on CS. <u>If CS has the NO PAD characteristic, then the pad character is
   an implementation-dependent character</u> different from any character in the character set of X and Y
   that collates less than any string under CS. Otherwise, the pad character is a space.</p>
</blockquote>

<p>In other words: the new collation does <strong>not</strong> pad.</p>

<p>This is not a big deal. Just, before migrating, trim the data, and make 100% sure that new instances are not introduced by the application before the migration is completed.</p>
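
<p>A minimal sketch of the check and the cleanup follows; <code class="language-plaintext highlighter-rouge">mytable</code> and <code class="language-plaintext highlighter-rouge">mycol</code> are hypothetical names. It relies on <code class="language-plaintext highlighter-rouge">LIKE</code>, for which trailing spaces are significant regardless of the collation’s pad attribute (an equality test against the trimmed value would not work under a <code class="language-plaintext highlighter-rouge">PAD SPACE</code> collation):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Count the values ending with (at least) one space.
--
SELECT COUNT(*) FROM mytable WHERE mycol LIKE '% ';

-- Remove the trailing spaces from those values only.
--
UPDATE mytable
SET mycol = TRIM(TRAILING ' ' FROM mycol)
WHERE mycol LIKE '% ';
</code></pre></div></div>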

<h3 id="triggers">Triggers</h3>

<p>Triggers are fairly easy to handle, as they can be dropped/rebuilt with the new settings - just make sure to consider comparisons <em>inside</em> the trigger body.</p>

<p>Sample of a trigger (edited):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SHOW</span> <span class="k">CREATE</span> <span class="k">TRIGGER</span> <span class="n">enqueue_comments_update_instance_event</span><span class="err">\</span><span class="k">G</span>

<span class="c1">-- SQL Original Statement:</span>
<span class="k">CREATE</span> <span class="k">TRIGGER</span> <span class="nv">`enqueue_comments_update_instance_event`</span>
<span class="k">AFTER</span> <span class="k">UPDATE</span> <span class="k">ON</span> <span class="nv">`comments`</span>
<span class="k">FOR</span> <span class="k">EACH</span> <span class="k">ROW</span>
<span class="n">trigger_body</span><span class="p">:</span> <span class="k">BEGIN</span>
  <span class="k">SET</span> <span class="o">@</span><span class="n">changed_fields</span> <span class="p">:</span><span class="o">=</span> <span class="k">NULL</span><span class="p">;</span>

  <span class="n">IF</span> <span class="k">NOT</span> <span class="p">(</span><span class="k">OLD</span><span class="p">.</span><span class="n">description</span> <span class="o">&lt;=&gt;</span> <span class="k">NEW</span><span class="p">.</span><span class="n">description</span> <span class="k">COLLATE</span> <span class="n">utf8_bin</span> <span class="k">AND</span> <span class="k">CHAR_LENGTH</span><span class="p">(</span><span class="k">OLD</span><span class="p">.</span><span class="n">description</span><span class="p">)</span> <span class="o">&lt;=&gt;</span> <span class="k">CHAR_LENGTH</span><span class="p">(</span><span class="k">NEW</span><span class="p">.</span><span class="n">description</span><span class="p">))</span> <span class="k">THEN</span>
    <span class="k">SET</span> <span class="o">@</span><span class="n">changed_fields</span> <span class="p">:</span><span class="o">=</span> <span class="n">CONCAT_WS</span><span class="p">(</span><span class="s1">','</span><span class="p">,</span> <span class="o">@</span><span class="n">changed_fields</span><span class="p">,</span> <span class="s1">'description'</span><span class="p">);</span>
  <span class="k">END</span> <span class="n">IF</span><span class="p">;</span>

  <span class="n">IF</span> <span class="o">@</span><span class="n">changed_fields</span> <span class="k">IS</span> <span class="k">NOT</span> <span class="k">NULL</span> <span class="k">THEN</span>
    <span class="k">SET</span> <span class="o">@</span><span class="n">old_values</span> <span class="p">:</span><span class="o">=</span> <span class="k">NULL</span><span class="p">;</span>
    <span class="k">SET</span> <span class="o">@</span><span class="n">new_values</span> <span class="p">:</span><span class="o">=</span> <span class="k">NULL</span><span class="p">;</span>

    <span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">instance_events</span><span class="p">(</span><span class="n">created_at</span><span class="p">,</span> <span class="n">instance_type</span><span class="p">,</span> <span class="n">instance_id</span><span class="p">,</span> <span class="k">operation</span><span class="p">,</span> <span class="n">changed_fields</span><span class="p">,</span> <span class="n">old_values</span><span class="p">,</span> <span class="n">new_values</span><span class="p">)</span>
    <span class="k">VALUES</span><span class="p">(</span><span class="n">NOW</span><span class="p">(),</span> <span class="s1">'Comment'</span><span class="p">,</span> <span class="k">NEW</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="s1">'UPDATE'</span><span class="p">,</span> <span class="o">@</span><span class="n">changed_fields</span><span class="p">,</span> <span class="o">@</span><span class="n">old_values</span><span class="p">,</span> <span class="o">@</span><span class="n">new_values</span><span class="p">);</span>
  <span class="k">END</span> <span class="n">IF</span><span class="p">;</span>
<span class="k">END</span>
<span class="c1">--   character_set_client: utf8mb4</span>
<span class="c1">--   collation_connection: utf8mb4_0900_ai_ci</span>
<span class="c1">--     Database Collation: utf8mb4_0900_ai_ci</span>
</code></pre></div></div>

<p>As you see, a trigger has associated charset/collation settings. This is because, unlike a statement, it’s not sent by a client, so it needs to keep its own settings.</p>

<p>Dropping and recreating the trigger above on a system with the new defaults works, but it’s not enough: the body contains a comparison with an explicit <code class="language-plaintext highlighter-rouge">COLLATE utf8_bin</code>, which is a collation of the old charset, and needs to be reviewed as well!</p>

<p>Conclusion: don’t forget to look inside the triggers. Or better, make sure you have a solid test suite 😉.</p>
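
<p>As a starting point for the review, the trigger metadata can be queried from <code class="language-plaintext highlighter-rouge">information_schema</code>. This is a rough sketch: the filter conditions are assumptions about what is worth inspecting, and the <code class="language-plaintext highlighter-rouge">LIKE</code> on the body may be case-sensitive, depending on the column’s collation:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- List the triggers still created under a non-utf8mb4 connection charset,
-- or with an explicit COLLATE clause in the body.
--
SELECT TRIGGER_SCHEMA, TRIGGER_NAME, CHARACTER_SET_CLIENT, COLLATION_CONNECTION, DATABASE_COLLATION
FROM information_schema.TRIGGERS
WHERE TRIGGER_SCHEMA &lt;&gt; 'sys'
  AND (CHARACTER_SET_CLIENT &lt;&gt; 'utf8mb4' OR ACTION_STATEMENT LIKE '%COLLATE%');
</code></pre></div></div>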

<h4 id="sort-of-related-suggestion">Sort-of-related suggestion</h4>

<p>We’ve been long-time users of MySQL triggers. They make a wonderful callback system.</p>

<p>When a system grows, application-level callbacks become increasingly hard to maintain (tipping into the unmaintainable). Triggers will <em>never</em> miss any database update, and with logic like the above, a queue processor can process the database changes.</p>

<h3 id="behavior-with-indexes">Behavior with indexes</h3>

<p>Now that we’ve examined the compatibility, let’s examine the performance aspect.</p>

<p>Indexes are still usable cross-charset, due to automatic conversion performed by MySQL. The point to be aware of is that the values are converted after being read from the index.</p>

<p>Let’s create test tables:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">indextest3</span> <span class="p">(</span>
  <span class="n">c3</span> <span class="nb">CHAR</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="nb">CHARACTER</span> <span class="k">SET</span> <span class="n">utf8</span><span class="p">,</span>
  <span class="k">KEY</span> <span class="p">(</span><span class="n">c3</span><span class="p">)</span>
<span class="p">);</span>

<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">indextest3</span> <span class="k">VALUES</span> <span class="p">(</span><span class="s1">'a'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'b'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'c'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'d'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'e'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'f'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'g'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'h'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'i'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'j'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'k'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'l'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'m'</span><span class="p">);</span>

<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">indextest4</span> <span class="p">(</span>
  <span class="n">c4</span> <span class="nb">CHAR</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="nb">CHARACTER</span> <span class="k">SET</span> <span class="n">utf8mb4</span><span class="p">,</span>
  <span class="k">KEY</span> <span class="p">(</span><span class="n">c4</span><span class="p">)</span>
<span class="p">);</span>

<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">indextest4</span> <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">indextest3</span><span class="p">;</span>
</code></pre></div></div>

<p>Querying against a constant yields interesting results:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">EXPLAIN</span> <span class="n">FORMAT</span><span class="o">=</span><span class="n">TREE</span> <span class="k">SELECT</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">indextest4</span> <span class="k">WHERE</span> <span class="n">c4</span> <span class="o">=</span> <span class="n">_utf8</span><span class="s1">'n'</span><span class="err">\</span><span class="k">G</span>
<span class="c1">-- -&gt; Aggregate: count(0)</span>
<span class="c1">--     -&gt; Filter: (indextest4.c4 = 'n')  (cost=0.35 rows=1)</span>
<span class="c1">--         -&gt; Index lookup on indextest4 using c4 (c4='n')  (cost=0.35 rows=1)</span>
</code></pre></div></div>

<p>MySQL recognizes that <code class="language-plaintext highlighter-rouge">n</code> is a valid utf8mb4 character, and matches it directly.</p>

<p>Against a column with index:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">EXPLAIN</span> <span class="k">SELECT</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">indextest3</span> <span class="k">JOIN</span> <span class="n">indextest4</span> <span class="k">ON</span> <span class="n">c3</span> <span class="o">=</span> <span class="n">c4</span><span class="p">;</span>
<span class="c1">-- +----+-------------+------------+------------+-------+---------------+------+---------+------+------+----------+--------------------------+</span>
<span class="c1">-- | id | select_type | table      | partitions | type  | possible_keys | key  | key_len | ref  | rows | filtered | Extra                    |</span>
<span class="c1">-- +----+-------------+------------+------------+-------+---------------+------+---------+------+------+----------+--------------------------+</span>
<span class="c1">-- |  1 | SIMPLE      | indextest3 | NULL       | index | NULL          | c3   | 4       | NULL |   13 |   100.00 | Using index              |</span>
<span class="c1">-- |  1 | SIMPLE      | indextest4 | NULL       | ref   | c4            | c4   | 5       | func |    1 |   100.00 | Using where; Using index |</span>
<span class="c1">-- +----+-------------+------------+------------+-------+---------------+------+---------+------+------+----------+--------------------------+</span>

<span class="k">EXPLAIN</span> <span class="n">FORMAT</span><span class="o">=</span><span class="n">TREE</span> <span class="k">SELECT</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">indextest3</span> <span class="k">JOIN</span> <span class="n">indextest4</span> <span class="k">ON</span> <span class="n">c3</span> <span class="o">=</span> <span class="n">c4</span><span class="err">\</span><span class="k">G</span>
<span class="c1">--  -&gt; Aggregate: count(0)</span>
<span class="c1">--     -&gt; Nested loop inner join  (cost=6.10 rows=13)</span>
<span class="c1">--         -&gt; Index scan on indextest3 using c3  (cost=1.55 rows=13)</span>
<span class="c1">--         -&gt; Filter: (convert(indextest3.c3 using utf8mb4) = indextest4.c4)  (cost=0.26 rows=1)</span>
<span class="c1">--             -&gt; Index lookup on indextest4 using c4 (c4=convert(indextest3.c3 using utf8mb4))  (cost=0.26 rows=1)</span>
</code></pre></div></div>

<p>MySQL is using the index, so all good. However, what’s the <code class="language-plaintext highlighter-rouge">func</code>?</p>

<p>It simply tells us that the value used against the index is the result of a function. In this case, MySQL is converting the charset for us (<code class="language-plaintext highlighter-rouge">convert(indextest3.c3 using utf8mb4)</code>).</p>

<p>This is another crucial consideration for a migration - indexes will still be effective. Of course, (very) complex queries will need to be carefully examined, but there are the grounds for a smooth transition.</p>

<h3 id="consequences-of-the-increase-in-potential-size-of-char-columns">Consequences of the increase in (potential) size of char columns</h3>

<p>Reference: <a href="https://dev.mysql.com/doc/refman/8.0/en/char.html">The CHAR and VARCHAR Types</a></p>

<p>One concept to be aware of, although unlikely to hit real-world applications, is that utf8mb4 characters can take up to 33% more space (a maximum of 4 bytes each, instead of 3).</p>

<p>In storage terms, databases need to know the maximum size of the data they handle. This means that even if a string takes the same space both in <code class="language-plaintext highlighter-rouge">utf8mb3</code> and <code class="language-plaintext highlighter-rouge">utf8mb4</code>, MySQL needs to account for the maximum space it can take.</p>

<p>The InnoDB index key limit is 3072 bytes in MySQL 8.0 (with the <code class="language-plaintext highlighter-rouge">DYNAMIC</code> and <code class="language-plaintext highlighter-rouge">COMPRESSED</code> row formats); generally speaking, this is large enough not to care.</p>

<p>Remember!:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">[VAR]CHAR(n)</code> refers to the number of characters; therefore, the maximum requirement is <code class="language-plaintext highlighter-rouge">4 * n</code> bytes, but</li>
  <li><code class="language-plaintext highlighter-rouge">TEXT</code> fields refer to the number of bytes.</li>
</ul>
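
<p>As a worked example of the arithmetic, the limit above translates to 1024 indexable characters under <code class="language-plaintext highlighter-rouge">utf8mb3</code>, and 768 under <code class="language-plaintext highlighter-rouge">utf8mb4</code>. The second query is a rough sketch to spot candidates for trouble; it assumes single-column, full-length indexes (it ignores prefix and multi-column indexes), so treat the result as a list to inspect, not a verdict:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Maximum number of characters fitting a 3072 bytes index key.
--
SELECT 3072 DIV 3 `utf8mb3_chars`, 3072 DIV 4 `utf8mb4_chars`;
-- 1024, 768

-- Indexed utf8mb3 columns that may exceed the limit once converted to utf8mb4.
--
SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, CHARACTER_MAXIMUM_LENGTH
FROM information_schema.COLUMNS
WHERE CHARACTER_SET_NAME IN ('utf8', 'utf8mb3')
  AND COLUMN_KEY &lt;&gt; ''
  AND DATA_TYPE IN ('char', 'varchar')
  AND CHARACTER_MAXIMUM_LENGTH * 4 &gt; 3072;
</code></pre></div></div>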

<h2 id="information-schema-statistics-caching">Information schema statistics caching</h2>

<p>Reference: <a href="https://dev.mysql.com/doc/refman/8.0/en/statistics-table.html">The INFORMATION_SCHEMA STATISTICS Table</a></p>

<p>Up to MySQL 5.7, <code class="language-plaintext highlighter-rouge">information_schema</code> statistics were updated in real time. In MySQL 8.0, statistics are cached, and updated only every 24 hours (by default).</p>

<p>In web applications, this affects only very specific use cases, but it’s important to know if one’s application is subject to this new behavior (our application was).</p>

<p>Let’s see the effects of this:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">ainc</span> <span class="p">(</span><span class="n">id</span> <span class="nb">INT</span> <span class="n">AUTO_INCREMENT</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">);</span>

<span class="c1">-- On the first query, the statistics are generated.</span>
<span class="c1">--</span>
<span class="k">SELECT</span> <span class="k">TABLE_NAME</span><span class="p">,</span> <span class="n">AUTO_INCREMENT</span> <span class="k">FROM</span> <span class="n">information_schema</span><span class="p">.</span><span class="n">tables</span> <span class="k">WHERE</span> <span class="k">table_name</span> <span class="o">=</span> <span class="s1">'ainc'</span><span class="p">;</span>
<span class="c1">-- +------------+----------------+</span>
<span class="c1">-- | TABLE_NAME | AUTO_INCREMENT |</span>
<span class="c1">-- +------------+----------------+</span>
<span class="c1">-- | ainc       |           NULL |</span>
<span class="c1">-- +------------+----------------+</span>

<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">ainc</span> <span class="k">VALUES</span> <span class="p">();</span>

<span class="k">SELECT</span> <span class="k">TABLE_NAME</span><span class="p">,</span> <span class="n">AUTO_INCREMENT</span> <span class="k">FROM</span> <span class="n">information_schema</span><span class="p">.</span><span class="n">tables</span> <span class="k">WHERE</span> <span class="k">table_name</span> <span class="o">=</span> <span class="s1">'ainc'</span><span class="p">;</span>
<span class="c1">-- +------------+----------------+</span>
<span class="c1">-- | TABLE_NAME | AUTO_INCREMENT |</span>
<span class="c1">-- +------------+----------------+</span>
<span class="c1">-- | ainc       |           NULL |</span>
<span class="c1">-- +------------+----------------+</span>
</code></pre></div></div>

<p>Ouch! The cached values are returned.</p>

<p>How about <code class="language-plaintext highlighter-rouge">SHOW CREATE TABLE</code>?</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SHOW</span> <span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">ainc</span><span class="err">\</span><span class="k">G</span>
<span class="c1">-- CREATE TABLE `ainc` (</span>
<span class="c1">--   `id` int NOT NULL AUTO_INCREMENT,</span>
<span class="c1">--   PRIMARY KEY (`id`)</span>
<span class="c1">-- ) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;</span>
</code></pre></div></div>

<p>This command is always up to date.</p>

<p>How to update the statistics? By using <code class="language-plaintext highlighter-rouge">ANALYZE TABLE</code>:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">ANALYZE</span> <span class="k">TABLE</span> <span class="n">ainc</span><span class="p">;</span>

<span class="k">SELECT</span> <span class="k">TABLE_NAME</span><span class="p">,</span> <span class="n">AUTO_INCREMENT</span> <span class="k">FROM</span> <span class="n">information_schema</span><span class="p">.</span><span class="n">tables</span> <span class="k">WHERE</span> <span class="k">table_name</span> <span class="o">=</span> <span class="s1">'ainc'</span><span class="p">;</span>
<span class="c1">-- +------------+----------------+</span>
<span class="c1">-- | TABLE_NAME | AUTO_INCREMENT |</span>
<span class="c1">-- +------------+----------------+</span>
<span class="c1">-- | ainc       |              2 |</span>
<span class="c1">-- +------------+----------------+</span>
</code></pre></div></div>

<p>There you go. Let’s find out the related setting:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SHOW</span> <span class="k">GLOBAL</span> <span class="n">VARIABLES</span> <span class="k">LIKE</span> <span class="s1">'%stat%exp%'</span><span class="p">;</span>
<span class="c1">-- +---------------------------------+-------+</span>
<span class="c1">-- | Variable_name                   | Value |</span>
<span class="c1">-- +---------------------------------+-------+</span>
<span class="c1">-- | information_schema_stats_expiry | 86400 |</span>
<span class="c1">-- +---------------------------------+-------+</span>
</code></pre></div></div>

<p>Developers who absolutely need to revert to the pre-8.0 behavior can set this value to 0.</p>
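
<p>A short sketch of the options follows. The variable is dynamic and has both global and session scope, so the change can be limited to the connections that need it; changing the global value requires the appropriate privileges:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Only for the current connection:
--
SET SESSION information_schema_stats_expiry = 0;

-- For the whole server, persisted across restarts:
--
SET PERSIST information_schema_stats_expiry = 0;
</code></pre></div></div>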

<h2 id="group-by-not-sorted-anymore-by-default-tooling">GROUP BY not sorted anymore by default (+tooling)</h2>

<p>Up to MySQL 5.7, <code class="language-plaintext highlighter-rouge">GROUP BY</code>’s result was sorted.</p>

<p>This was unnecessary - optimization-seeking developers used <code class="language-plaintext highlighter-rouge">ORDER BY NULL</code> in order to spare the sort. However, accidentally or not, some relied on the implicit ordering.</p>
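
<p>The fix itself is trivial: state the ordering explicitly. A minimal illustration, with the hypothetical names <code class="language-plaintext highlighter-rouge">mytable</code> and <code class="language-plaintext highlighter-rouge">col1</code>:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Up to 5.7: implicitly sorted by `col1`. On 8.0: no guaranteed order.
--
SELECT col1, COUNT(*) FROM mytable GROUP BY col1;

-- Explicit ordering; behaves the same on both versions.
--
SELECT col1, COUNT(*) FROM mytable GROUP BY col1 ORDER BY col1;
</code></pre></div></div>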

<p>Those who relied on it are unfortunately required to scan the codebase. There isn’t a one-size-fits-all solution, and in this case, writing an automated solution may take more time than manually inspecting the occurrences; however, this doesn’t prevent the Unix tools from helping 😄</p>

<p>Let’s simulate a coding standard where <code class="language-plaintext highlighter-rouge">ORDER BY</code> is always on the line after <code class="language-plaintext highlighter-rouge">GROUP BY</code>, if present:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cat</span> <span class="o">&gt;</span> /tmp/test_groupby_1 <span class="o">&lt;&lt;</span> <span class="no">SQL</span><span class="sh">
  GROUP BY col1
  -- ends here

  GROUP BY col2
  ORDER BY col2

  GROUP BY col3
  -- ends here

  GROUP BY col4
</span><span class="no">SQL

</span><span class="nb">cat</span> <span class="o">&gt;</span> /tmp/test_groupby_2 <span class="o">&lt;&lt;</span> <span class="no">SQL</span><span class="sh">

  GROUP BY col5
  ORDER BY col5
</span><span class="no">SQL
</span></code></pre></div></div>

<p>A basic version would be a simple grep scan with <code class="language-plaintext highlighter-rouge">1</code> line <code class="language-plaintext highlighter-rouge">A</code>fter each <code class="language-plaintext highlighter-rouge">GROUP BY</code> match:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">grep</span> <span class="nt">-A</span> 1 <span class="s1">'GROUP BY'</span> /tmp/test_groupby_<span class="k">*</span>
/tmp/test_groupby_1:  GROUP BY col1
/tmp/test_groupby_1-  <span class="nt">--</span> ends here
<span class="nt">--</span>
/tmp/test_groupby_1:  GROUP BY col2
/tmp/test_groupby_1-  ORDER BY col2
<span class="nt">--</span>
/tmp/test_groupby_1:  GROUP BY col3
/tmp/test_groupby_1-  <span class="nt">--</span> ends here
<span class="nt">--</span>
/tmp/test_groupby_1:  GROUP BY col4
<span class="nt">--</span>
/tmp/test_groupby_2:  GROUP BY col5
/tmp/test_groupby_2-  ORDER BY col5
</code></pre></div></div>

<p>However, with some basic scripting, we can display only the <code class="language-plaintext highlighter-rouge">GROUP BY</code>s matching the criteria:</p>

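<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># First, we make Perl speak English: `-MEnglish`, which enables `$ARG` (among other things).</span>
<span class="c">#</span>
<span class="c"># The logic is simple: we print the filename, the previous line and the current line if the previous</span>
<span class="c"># line matched /GROUP BY/ and the current doesn't match /ORDER BY/; afterwards, we store the current</span>
<span class="c"># line as `$previous`.</span>
<span class="c">#</span>
<span class="c"># Note: `$previous` is not reset between files, so a `GROUP BY` on the last line of a file is checked</span>
<span class="c"># against the first line of the next file.</span>
<span class="c">#</span>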
perl <span class="nt">-MEnglish</span> <span class="nt">-ne</span> <span class="s1">'print "$ARGV: $previous $ARG" if $previous =~ /GROUP BY/ &amp;&amp; !/ORDER BY/; $previous = $ARG'</span> /tmp/test_groupby_<span class="k">*</span>

<span class="c"># As the next step, we automatically open all the files matching the criteria, in an editor:</span>
<span class="c">#</span>
<span class="c"># - `-l`: adds the newline automatically;</span>
<span class="c"># - `$ARGV`: is the filename (which we print instead of the match);</span>
<span class="c"># - `uniq`: if a file has multiple matches, the filename will be printed more than once - with</span>
<span class="c">#    `uniq`, we remove duplicates; this is optional though, as editors open each file(name) only</span>
<span class="c">#    once;</span>
<span class="c"># - `xargs`: sends the filenames as parameters to the command (in this case, `code`, from Visual Studio</span>
<span class="c">#    Code).</span>
<span class="c">#</span>
perl <span class="nt">-MEnglish</span> <span class="nt">-lne</span> <span class="s1">'print $ARGV if $previous =~ /GROUP BY/ &amp;&amp; !/ORDER BY/; $previous = $ARG'</span> /tmp/test_groupby_<span class="k">*</span> | <span class="nb">uniq</span> | xargs code
</code></pre></div></div>
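
<p>One caveat of the one-liners above: <code class="language-plaintext highlighter-rouge">$previous</code> is never reset, so a <code class="language-plaintext highlighter-rouge">GROUP BY</code> on the last line of a file is compared against the first line of the next file. The following is a minimal sketch of a variant that handles file boundaries, relying on Perl’s <code class="language-plaintext highlighter-rouge">eof(ARGV)</code>, which is true on the last line of each input file. The <code class="language-plaintext highlighter-rouge">find_unsorted_group_by</code> function name and the sample files are assumptions, made up for illustration:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical sample files, only for illustration; the first one ends with a `GROUP BY`.
#
tmpdir=$(mktemp -d)
printf 'SELECT 1\n  GROUP BY col1\n  -- ends here\n  GROUP BY col2\n  ORDER BY col2\n  GROUP BY col4\n' &gt; "$tmpdir/test_groupby_1"
printf 'SELECT 2\n  GROUP BY col5\n  ORDER BY col5\n' &gt; "$tmpdir/test_groupby_2"

# Same logic as the first one-liner, with two differences:
#
# - a `GROUP BY` on the last line of a file is printed straight away (nothing can follow it);
# - `$previous` is cleared on the last line of each file, so it never leaks into the next file.
#
find_unsorted_group_by() {
  perl -MEnglish -ne '
    print "$ARGV: $previous" if $previous =~ /GROUP BY/ &amp;&amp; !/ORDER BY/;
    print "$ARGV: $ARG" if eof(ARGV) &amp;&amp; /GROUP BY/;
    $previous = eof(ARGV) ? "" : $ARG;
  ' "$@"
}

find_unsorted_group_by "$tmpdir"/test_groupby_*
</code></pre></div></div>

<p>With the sample files, this should print the <code class="language-plaintext highlighter-rouge">GROUP BY col1</code> and <code class="language-plaintext highlighter-rouge">GROUP BY col4</code> lines of the first file, and nothing for the second.</p>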

<p>There is another approach: an inverted regular expression match:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Match lines with `GROUP BY`, followed by a line _not_ matching `ORDER BY`.</span>
<span class="c"># Reference: https://stackoverflow.com/a/406408.</span>
<span class="c">#</span>
<span class="nb">grep</span> <span class="nt">-zP</span> <span class="s1">'GROUP BY .+\n((?!ORDER BY ).)*\n'</span> /tmp/test_groupby_<span class="k">*</span>
</code></pre></div></div>

<p>This is, however, freaky and, as with regular expressions in general, carries a high risk of hairpulling (of course, this is up to the developer’s judgement). It will be the subject of a future article, though, because I find it a very interesting case.</p>

<h2 id="schema-migration-tools-incompatibility">Schema migration tools incompatibility</h2>

<p>This is an easily missed problem! Some tools may not support MySQL 8.0.</p>

<p>There’s a known <a href="https://github.com/github/gh-ost/issues/687">showstopper bug</a> on the latest Gh-ost release, which prevents operations from succeeding on MySQL 8.0.</p>

<p>As a workaround, one can use trigger-based tools, like <a href="https://www.percona.com/downloads/percona-toolkit/LATEST/"><code class="language-plaintext highlighter-rouge">pt-online-schema-change</code></a> v3.1.1 or v3.0.x (but <strong>v3.1.0 is broken!</strong>) or <a href="https://github.com/facebookincubator/OnlineSchemaChange">Facebook’s OnlineSchemaChange</a>.</p>

<h2 id="obsolete-mac-homebrew-default-collation">Obsolete Mac Homebrew default collation</h2>

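<p>When MySQL is installed via Homebrew (as of January 2020), the default collation is <code class="language-plaintext highlighter-rouge">utf8mb4_general_ci</code>, rather than the MySQL 8.0 default, <code class="language-plaintext highlighter-rouge">utf8mb4_0900_ai_ci</code>.</p>

<p>There are a couple of solutions to this problem.</p>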

<h3 id="modify-the-formula-and-recompile-the-binaries">Modify the formula, and recompile the binaries</h3>

<p>A simple thing to do is to correct the Homebrew formula, and recompile the binaries.</p>

<p>For illustrative purposes, as part of this solution, I use the so-called “flip-flop” operator, which is something frowned upon… by people not using it 😉. As one can observe in fact, for the target use cases, it’s very convenient.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Find out the formula location</span>
<span class="c">#</span>
<span class="nv">$ mysql_formula_filename</span><span class="o">=</span><span class="si">$(</span>brew formula mysql<span class="si">)</span>

<span class="c"># Out of curiosity, let's print the relevant section.</span>
<span class="c">#</span>
<span class="c"># Flip-flop operator (`&lt;condition&gt; .. &lt;condition&gt;`): it matches *everything* between lines matching two conditions, in this case:</span>
<span class="c">#</span>
<span class="c"># - start: a line matching `/args = /`;</span>
<span class="c"># - end: a line matching `/\]/` (a closing square bracket, which needs to be escaped, since it's a regex metacharacter).</span>
<span class="c">#</span>
<span class="nv">$ </span>perl <span class="nt">-ne</span> <span class="s1">'print if /args = / .. /\]/'</span> <span class="s2">"</span><span class="nv">$mysql_formula_filename</span><span class="s2">"</span>
   args <span class="o">=</span> %W[
     <span class="nt">-DFORCE_INSOURCE_BUILD</span><span class="o">=</span>1
     <span class="nt">-DCOMPILATION_COMMENT</span><span class="o">=</span>Homebrew
     <span class="nt">-DDEFAULT_CHARSET</span><span class="o">=</span>utf8mb4
     <span class="nt">-DDEFAULT_COLLATION</span><span class="o">=</span>utf8mb4_general_ci
     <span class="nt">-DINSTALL_DOCDIR</span><span class="o">=</span>share/doc/#<span class="o">{</span>name<span class="o">}</span>
     <span class="nt">-DINSTALL_INCLUDEDIR</span><span class="o">=</span>include/mysql
     <span class="nt">-DINSTALL_INFODIR</span><span class="o">=</span>share/info
     <span class="nt">-DINSTALL_MANDIR</span><span class="o">=</span>share/man
     <span class="nt">-DINSTALL_MYSQLSHAREDIR</span><span class="o">=</span>share/mysql
     <span class="nt">-DINSTALL_PLUGINDIR</span><span class="o">=</span>lib/plugin
     <span class="nt">-DMYSQL_DATADIR</span><span class="o">=</span><span class="c">#{datadir}</span>
     <span class="nt">-DSYSCONFDIR</span><span class="o">=</span><span class="c">#{etc}</span>
     <span class="nt">-DWITH_BOOST</span><span class="o">=</span>boost
     <span class="nt">-DWITH_EDITLINE</span><span class="o">=</span>system
     <span class="nt">-DWITH_SSL</span><span class="o">=</span><span class="nb">yes</span>
     <span class="nt">-DWITH_PROTOBUF</span><span class="o">=</span>system
     <span class="nt">-DWITH_UNIT_TESTS</span><span class="o">=</span>OFF
     <span class="nt">-DENABLED_LOCAL_INFILE</span><span class="o">=</span>1
     <span class="nt">-DWITH_INNODB_MEMCACHED</span><span class="o">=</span>ON
   <span class="o">]</span>

<span class="c"># Fix it!</span>
<span class="c">#</span>
<span class="nv">$ </span>perl <span class="nt">-i</span>.bak <span class="nt">-ne</span> <span class="s1">'print unless /CHARSET|COLLATION/'</span> <span class="s2">"</span><span class="nv">$mysql_formula_filename</span><span class="s2">"</span>

<span class="c"># Now recompile and install the formula</span>
<span class="c">#</span>
<span class="nv">$ </span>brew <span class="nb">install</span> <span class="nt">--build-from-source</span> mysql
</code></pre></div></div>

<h3 id="ignore-the-client-encoding-on-handshake">Ignore the client encoding on handshake</h3>

<p>An alternative solution is for the server to ignore the client encoding on handshake.</p>

<p>When configured this way, the server will impose the default character set/collation on the clients.</p>

<p>In order to apply this solution, add <code class="language-plaintext highlighter-rouge">character-set-client-handshake = OFF</code> to the server configuration.</p>

<h2 id="good-practice-for-majorminor-upgrades-comparing-the-system-variables">Good practice for (major/minor) upgrades: comparing the system variables</h2>

<p>A very good practice when performing (major/minor) upgrades is to compare the system variables, in order to spot differences that may have an impact.</p>

<p>The <a href="https://mysql-params.tmtms.net">MySQL Parameters website</a> gives a visual overview of the differences between versions.</p>

<p>For example, the URL https://mysql-params.tmtms.net/mysqld/?vers=5.7.29,8.0.19&amp;diff=true shows the differences between the system variables of v5.7.29 and v8.0.19.</p>
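
<p>The same comparison can also be scripted against the actual servers, which has the advantage of including any custom configuration. The following is a minimal sketch, assuming that each server’s variables have been dumped to a sorted, tab-separated file (for example via <code class="language-plaintext highlighter-rouge">mysql -BNe 'SHOW GLOBAL VARIABLES' | LC_ALL=C sort</code>); the <code class="language-plaintext highlighter-rouge">compare_variables</code> function, the file names and the sample values are made up for illustration:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical, hand-written dumps standing in for the output of the two servers.
#
tmpdir=$(mktemp -d)
printf 'default_authentication_plugin\tmysql_native_password\nexplicit_defaults_for_timestamp\tOFF\nmax_connections\t151\n' | LC_ALL=C sort &gt; "$tmpdir/variables_57.tsv"
printf 'default_authentication_plugin\tcaching_sha2_password\nexplicit_defaults_for_timestamp\tON\nmax_connections\t151\n' | LC_ALL=C sort &gt; "$tmpdir/variables_80.tsv"

# Join the two dumps on the variable name (the first field), and print only the variables whose
# value differs. Variables present in only one of the two dumps are not reported by this sketch.
#
compare_variables() {
  LC_ALL=C join -t "$(printf '\t')" "$1" "$2" | awk -F '\t' '$2 != $3 { print $1 ": " $2 " -&gt; " $3 }'
}

compare_variables "$tmpdir/variables_57.tsv" "$tmpdir/variables_80.tsv"
</code></pre></div></div>

<p>With the sample dumps, only <code class="language-plaintext highlighter-rouge">default_authentication_plugin</code> and <code class="language-plaintext highlighter-rouge">explicit_defaults_for_timestamp</code> should be listed, since <code class="language-plaintext highlighter-rouge">max_connections</code> is the same in both.</p>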

<h2 id="conclusion">Conclusion</h2>

<p>The migration to MySQL 8.0 at Ticketsolve has been one of the smoothest, historically speaking.</p>

<p>This is a bit of a paradox, because we had never had to rewrite our entire database for an upgrade before; however, with sufficient knowledge of what to expect, we didn’t hit any significant bump (in particular, nothing unexpected in the optimizer department, which is usually critical).</p>

<p>Considering the main issues and their migration requirements:</p>

<ul>
  <li>the new charset/collation defaults are not mandatory, and the migration can be performed ahead of time and in stages;</li>
  <li>the trailing whitespace just requires the data to be checked and cleaned;</li>
  <li>the GROUP BY clauses can be inspected and updated ahead of time;</li>
  <li>the information schema caching is regulated by a setting;</li>
  <li>Gh-ost may be missed, but in the worst case, there are valid comparable tools.</li>
</ul>

<p>the conclusion is that the preparation work can be done entirely before the upgrade, which can then be performed with a reasonable expectation of low risk.</p>

<p>Happy migration 😄</p>]]></content><author><name></name></author><category term="mysql" /><category term="databases" /><category term="innodb" /><category term="linux" /><category term="mysql" /><category term="shell_scripting" /><category term="sysadmin" /><summary type="html"><![CDATA[In this post I’ll expand on the subject of my MySQL pre-FOSDEM talk: what dbadmins need to know and do, when upgrading from MySQL 5.7 to 8.0. I’ve already published two posts on two specific issues; in this article, I’ll give the complete picture. As usual, I’ll use this post to introduce tooling concepts that may be useful in generic system administration. The presentation code is hosted on a GitHub repository (including the the source files and the output slides in PDF format), and on Slideshare. Contents: Summary of issues, and scope Requirements The new default character set/collation: utf8mb4/utf8mb4_0900_ai_ci Summary Tooling: MySQL RLIKE How the charset parameters work String, and comparison, properties Collation coercion, and issues general &lt;&gt; 0900_ai Comparisons utf8_general_ci column &lt;&gt; literals Comparisons utf8_general_ci column &lt;&gt; columns Summary of the migration path The new collation doesn’t pad anymore Triggers Sort-of-related suggestion Behavior with indexes Consequences of the increase in (potential) size of char columns Information schema statistics caching GROUP BY not sorted anymore by default (+tooling) Schema migration tools support Obsolete Mac Homebrew default collation Modify the formula, and recompile the binaries Ignore the client encoding on handshake Good practice for (major/minor) upgrades: comparing the system variables Conclusion]]></summary></entry><entry><title type="html">Summary of trailing spaces handling in MySQL, with version 8.0 upgrade considerations</title><link href="https://saveriomiroddi.github.io/Summary-of-trailing-spaces-handling-in-MySQL-with-version-8.0-upgrade-considerations/" rel="alternate" type="text/html" 
title="Summary of trailing spaces handling in MySQL, with version 8.0 upgrade considerations" /><published>2019-07-09T00:00:00+00:00</published><updated>2019-07-09T20:40:00+00:00</updated><id>https://saveriomiroddi.github.io/Summary-of-trailing-spaces-handling-in-MySQL-with-version-8.0-upgrade-considerations</id><content type="html" xml:base="https://saveriomiroddi.github.io/Summary-of-trailing-spaces-handling-in-MySQL-with-version-8.0-upgrade-considerations/"><![CDATA[<p>Fairly recently, we’ve upgraded to MySQL 8; it’s been a relatively smooth transition, however, some minor differences needed to be handled. One of them is the behavior of trailing spaces.</p>

<p>Trailing spaces are a surprising (not in a good way), but also widely covered, topic. This article gives a short overview, and relates it to upgrading to MySQL 8.0.</p>

<p>Contents:</p>

<ul>
  <li><a href="/Summary-of-trailing-spaces-handling-in-MySQL-with-version-8.0-upgrade-considerations#premisesrequirements">Premises/Requirements</a></li>
  <li><a href="/Summary-of-trailing-spaces-handling-in-MySQL-with-version-8.0-upgrade-considerations#behavior-in-different-contexts">Behavior in different contexts</a>
    <ul>
      <li><a href="/Summary-of-trailing-spaces-handling-in-MySQL-with-version-8.0-upgrade-considerations#comparison--predicate-1">Comparison (<code class="language-plaintext highlighter-rouge">=</code>) predicate (1)</a>
        <ul>
          <li><a href="/Summary-of-trailing-spaces-handling-in-MySQL-with-version-8.0-upgrade-considerations#inspecting-the-collations">Inspecting the collations</a></li>
        </ul>
      </li>
      <li><a href="/Summary-of-trailing-spaces-handling-in-MySQL-with-version-8.0-upgrade-considerations#comparison--predicate-2">Comparison (<code class="language-plaintext highlighter-rouge">=</code>) predicate (2)</a></li>
      <li><a href="/Summary-of-trailing-spaces-handling-in-MySQL-with-version-8.0-upgrade-considerations#like-predicate"><code class="language-plaintext highlighter-rouge">LIKE</code> predicate</a></li>
      <li><a href="/Summary-of-trailing-spaces-handling-in-MySQL-with-version-8.0-upgrade-considerations#unique-indexes">Unique indexes</a></li>
      <li><a href="/Summary-of-trailing-spaces-handling-in-MySQL-with-version-8.0-upgrade-considerations#distinct-predicate"><code class="language-plaintext highlighter-rouge">DISTINCT</code> predicate</a></li>
      <li><a href="/Summary-of-trailing-spaces-handling-in-MySQL-with-version-8.0-upgrade-considerations#group-by-clause"><code class="language-plaintext highlighter-rouge">GROUP BY</code> clause</a></li>
    </ul>
  </li>
  <li><a href="/Summary-of-trailing-spaces-handling-in-MySQL-with-version-8.0-upgrade-considerations#conclusion">Conclusion</a></li>
</ul>

<h2 id="premisesrequirements">Premises/Requirements</h2>

<p>In this article I’m going to analyze only the <code class="language-plaintext highlighter-rouge">VARCHAR</code> data type behavior, as I’d like to keep the article concise. Interested readers can find information in the links provided.</p>

<p>As of MySQL 8.0, <code class="language-plaintext highlighter-rouge">utf8</code> is an alias to <code class="language-plaintext highlighter-rouge">utf8mb3</code> (MySQL 5.7’s underlying standard); using <code class="language-plaintext highlighter-rouge">utf8</code>/<code class="language-plaintext highlighter-rouge">utf8mb3</code> will generate warnings when running some statements on an 8.0 server, which can be ignored in the context of this article.</p>

<p>The reader needs to have an idea of what a <a href="https://dev.mysql.com/doc/refman/en/charset-general.html">collation</a> is (in short: a set of rules for comparing strings).</p>

<p>The MySQL version used, and required to run the article content, is 8.0.</p>

<h2 id="behavior-in-different-contexts">Behavior in different contexts</h2>

<h3 id="comparison--predicate-1">Comparison (<code class="language-plaintext highlighter-rouge">=</code>) predicate (1)</h3>

<p>The comparison (<code class="language-plaintext highlighter-rouge">=</code>) predicate specification is defined independently of its context, therefore, it behaves the same both in the select list (<code class="language-plaintext highlighter-rouge">SELECT ...</code>) and the search condition (<code class="language-plaintext highlighter-rouge">WHERE ...</code>).</p>

<p>Let’s start by observing the typical MySQL 5.7 behavior:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">test_comparison_ps</span> <span class="p">(</span>
  <span class="n">id</span> <span class="nb">INT</span> <span class="k">PRIMARY</span> <span class="k">KEY</span> <span class="n">AUTO_INCREMENT</span><span class="p">,</span>
  <span class="n">str</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span> <span class="n">CHARSET</span> <span class="n">utf8</span>
<span class="p">);</span>

<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">test_comparison_ps</span> <span class="p">(</span><span class="n">str</span><span class="p">)</span> <span class="k">VALUES</span><span class="p">(</span><span class="s1">''</span><span class="p">),</span> <span class="p">(</span><span class="s1">' '</span><span class="p">);</span>

<span class="k">SET</span> <span class="k">NAMES</span> <span class="n">utf8</span> <span class="k">COLLATE</span> <span class="n">utf8_general_ci</span><span class="p">;</span> <span class="o">#</span> <span class="k">set</span> <span class="n">the</span> <span class="k">connection</span> <span class="n">charset</span><span class="o">/</span><span class="k">collation</span>

<span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="n">CONCAT</span><span class="p">(</span><span class="s1">'&lt;'</span><span class="p">,</span> <span class="n">str</span><span class="p">,</span> <span class="s1">'&gt;'</span><span class="p">)</span> <span class="nv">`qstr`</span><span class="p">,</span> <span class="n">str</span> <span class="o">=</span> <span class="s1">''</span> <span class="p">,</span> <span class="n">str</span> <span class="o">=</span> <span class="s1">' '</span> <span class="k">FROM</span> <span class="n">test_comparison_ps</span><span class="p">;</span>

<span class="o">#</span> <span class="o">+</span><span class="c1">----+------+----------+-----------+</span>
<span class="o">#</span> <span class="o">|</span> <span class="n">id</span> <span class="o">|</span> <span class="n">qstr</span> <span class="o">|</span> <span class="n">str</span> <span class="o">=</span> <span class="s1">''</span> <span class="o">|</span> <span class="n">str</span> <span class="o">=</span> <span class="s1">' '</span> <span class="o">|</span>
<span class="o">#</span> <span class="o">+</span><span class="c1">----+------+----------+-----------+</span>
<span class="o">#</span> <span class="o">|</span>  <span class="mi">1</span> <span class="o">|</span> <span class="o">&lt;&gt;</span>   <span class="o">|</span>        <span class="mi">1</span> <span class="o">|</span>         <span class="mi">1</span> <span class="o">|</span>
<span class="o">#</span> <span class="o">|</span>  <span class="mi">2</span> <span class="o">|</span> <span class="o">&lt;</span> <span class="o">&gt;</span>  <span class="o">|</span>        <span class="mi">1</span> <span class="o">|</span>         <span class="mi">1</span> <span class="o">|</span>
<span class="o">#</span> <span class="o">+</span><span class="c1">----+------+----------+-----------+</span>
</code></pre></div></div>

<p>They’re all equal! This matches the common belief that “MySQL removes all the trailing spaces”.</p>

<p>But why so? Who’s responsible?</p>

<h4 id="inspecting-the-collations">Inspecting the collations</h4>

<p>According to the SQL standard, trailing spaces are not removed on storage and retrieval. In MySQL, this is a responsibility of the storage engine, in this case InnoDB; from the related <a href="https://dev.mysql.com/doc/refman/en/innodb-row-format.html#innodb-row-format-compact">manpage</a>, we read:</p>

<blockquote>
  <p>Trailing spaces are not truncated from VARCHAR columns.</p>
</blockquote>

<p>It turns out that the collation is responsible. In this case, <code class="language-plaintext highlighter-rouge">utf8_general_ci</code>, the default collation of the default MySQL 5.7 charset, pads the strings during comparison, which makes trailing spaces insignificant.</p>

<p>How do we know how comparisons behave in relation to padding? Let’s ask the information schema:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="k">COLLATION_NAME</span><span class="p">,</span> <span class="n">PAD_ATTRIBUTE</span> <span class="k">FROM</span> <span class="n">information_schema</span><span class="p">.</span><span class="n">collations</span> <span class="k">WHERE</span> <span class="k">COLLATION_NAME</span> <span class="n">RLIKE</span> <span class="s1">'utf8(mb4)?_(general|0900_ai)_ci'</span><span class="p">;</span>
<span class="cm">/*
+--------------------+---------------+
| COLLATION_NAME     | PAD_ATTRIBUTE |
+--------------------+---------------+
| utf8_general_ci    | PAD SPACE     | # 5.7 default
| utf8mb4_general_ci | PAD SPACE     | # utf8mb4 default in MySQL 5.7
| utf8mb4_0900_ai_ci | NO PAD        | # 8.0 default
+--------------------+---------------+
*/</span>
</code></pre></div></div>

<p>From the manpages <a href="https://dev.mysql.com/doc/refman/en/charset-unicode-sets.html#charset-unicode-sets-pad-attributes">page 1</a> and <a href="https://dev.mysql.com/doc/refman/en/charset-binary-collations.html#charset-binary-collations-trailing-space-comparisons">page 2</a>:</p>

<blockquote>
  <p>The pad attribute determines how trailing spaces are treated for comparison of nonbinary strings (CHAR, VARCHAR, and TEXT values):</p>

  <ul>
    <li>For PAD SPACE collations, trailing spaces are insignificant in comparisons; strings are compared without regard to any trailing spaces.</li>
    <li>NO PAD collations treat spaces at the end of strings like any other character.</li>
  </ul>
</blockquote>

<p>The following are the formal rules from the SQL (2003) standard (section 8.2):</p>

<blockquote>
  <p>3) The comparison of two character strings is determined as follows:</p>

  <p>a) Let CS be the collation as determined by Subclause 9.13, “Collation determination”, for the declared
   types of the two character strings.</p>

  <p>b) If the length in characters of X is not equal to the length in characters of Y, then the shorter string is
   effectively replaced, for the purposes of comparison, with a copy of itself that has been extended to
   the length of the longer string by concatenation on the right of one or more pad characters, where the
   pad character is chosen based on CS. If CS has the NO PAD characteristic, then the pad character is
   an implementation-dependent character different from any character in the character set of X and Y
   that collates less than any string under CS. Otherwise, the pad character is a &lt;space&gt;.</p>

  <p>c) The result of the comparison of X and Y is given by the collation CS.</p>

  <p>d) Depending on the collation, two strings may compare as equal even if they are of different lengths or
   contain different sequences of characters. When any of the operations MAX, MIN, and DISTINCT
   reference a grouping column, and the UNION, EXCEPT, and INTERSECT operators refer to character
   strings, the specific value selected by these operations from a set of such equal values is implementation-
   dependent.</p>
</blockquote>

<p>The crucial point is b).</p>
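
<p>To make rule b) more concrete, here is a toy model in shell. It only simulates the padding step on plain strings, ignoring everything else a real collation does (case and accent insensitivity, for instance); the two function names are made up for illustration:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code># PAD SPACE: the shorter string is extended with spaces up to the length of the longer one, then
# the two strings are compared.
#
pad_space_equal() {
  a=$1
  b=$2
  while [ "${#a}" -lt "${#b}" ]; do a="$a "; done
  while [ "${#b}" -lt "${#a}" ]; do b="$b "; done
  [ "$a" = "$b" ]
}

# NO PAD: the pad character differs from any character of the strings, so two strings of different
# length can never be equal; this is the same as comparing the strings as they are.
#
no_pad_equal() {
  [ "$1" = "$2" ]
}

pad_space_equal '' ' ' &amp;&amp; echo "PAD SPACE: '' equals ' '"
no_pad_equal '' ' ' || echo "NO PAD: '' differs from ' '"
</code></pre></div></div>

<p>Both messages are printed: the empty string and the single space are equal under the first model, and different under the second.</p>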

<h3 id="comparison--predicate-2">Comparison (<code class="language-plaintext highlighter-rouge">=</code>) predicate (2)</h3>

<p>Now we can go back, and observe a different collation - <code class="language-plaintext highlighter-rouge">utf8mb4_0900_ai_ci</code>, the MySQL 8.0 default:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">test_comparison_np</span> <span class="p">(</span>
  <span class="n">id</span> <span class="nb">INT</span> <span class="k">PRIMARY</span> <span class="k">KEY</span> <span class="n">AUTO_INCREMENT</span><span class="p">,</span>
  <span class="n">str</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span> <span class="n">CHARSET</span> <span class="n">utf8mb4</span>
<span class="p">);</span>

<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">test_comparison_np</span> <span class="p">(</span><span class="n">str</span><span class="p">)</span> <span class="k">VALUES</span><span class="p">(</span><span class="s1">''</span><span class="p">),</span> <span class="p">(</span><span class="s1">' '</span><span class="p">);</span>

<span class="k">SET</span> <span class="k">NAMES</span> <span class="n">utf8mb4</span> <span class="k">COLLATE</span> <span class="n">utf8mb4_0900_ai_ci</span><span class="p">;</span> <span class="o">#</span> <span class="n">behave</span> <span class="k">like</span> <span class="n">a</span> <span class="n">standard</span> <span class="n">MySQL</span> <span class="mi">8</span><span class="p">.</span><span class="mi">0</span> <span class="n">installation</span>

<span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="n">CONCAT</span><span class="p">(</span><span class="s1">'&lt;'</span><span class="p">,</span> <span class="n">str</span><span class="p">,</span> <span class="s1">'&gt;'</span><span class="p">)</span> <span class="nv">`qstr`</span><span class="p">,</span> <span class="n">str</span> <span class="o">=</span> <span class="s1">''</span> <span class="p">,</span> <span class="n">str</span> <span class="o">=</span> <span class="s1">' '</span> <span class="k">FROM</span> <span class="n">test_comparison_np</span><span class="p">;</span>
<span class="cm">/*
+----+------+----------+-----------+
| id | qstr | str = '' | str = ' ' |
+----+------+----------+-----------+
|  1 | &lt;&gt;   |        1 |         0 |
|  2 | &lt; &gt;  |        0 |         1 |
+----+------+----------+-----------+
*/</span>
</code></pre></div></div>

<p>… so MySQL doesn’t “remove all the trailing spaces” after all.</p>

<h3 id="like-predicate"><code class="language-plaintext highlighter-rouge">LIKE</code> predicate</h3>

<p>Let’s see how the <code class="language-plaintext highlighter-rouge">LIKE</code> predicate behaves:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">test_like</span> <span class="p">(</span>
  <span class="n">id</span> <span class="nb">INT</span> <span class="k">PRIMARY</span> <span class="k">KEY</span> <span class="n">AUTO_INCREMENT</span><span class="p">,</span>
  <span class="n">str</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span> <span class="n">CHARSET</span> <span class="n">utf8</span>
<span class="p">);</span>

<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">test_like</span> <span class="p">(</span><span class="n">str</span><span class="p">)</span> <span class="k">VALUES</span><span class="p">(</span><span class="s1">''</span><span class="p">),</span> <span class="p">(</span><span class="s1">' '</span><span class="p">);</span>

<span class="k">SET</span> <span class="k">NAMES</span> <span class="n">utf8</span> <span class="k">COLLATE</span> <span class="n">utf8_general_ci</span><span class="p">;</span>

<span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="n">CONCAT</span><span class="p">(</span><span class="s1">'&lt;'</span><span class="p">,</span> <span class="n">str</span><span class="p">,</span> <span class="s1">'&gt;'</span><span class="p">)</span> <span class="nv">`qstr`</span><span class="p">,</span> <span class="n">str</span> <span class="k">LIKE</span> <span class="s1">''</span> <span class="p">,</span> <span class="n">str</span> <span class="k">LIKE</span> <span class="s1">' '</span> <span class="k">FROM</span> <span class="n">test_like</span><span class="p">;</span>
<span class="cm">/*
+----+------+-------------+--------------+
| id | qstr | str LIKE '' | str LIKE ' ' |
+----+------+-------------+--------------+
|  1 | &lt;&gt;   |           1 |            0 |
|  2 | &lt; &gt;  |           0 |            1 |
+----+------+-------------+--------------+
*/</span>
</code></pre></div></div>

<p>Yikes! <code class="language-plaintext highlighter-rouge">LIKE</code> does not perform padding, even on a <code class="language-plaintext highlighter-rouge">PAD SPACE</code> collation such as <code class="language-plaintext highlighter-rouge">utf8_general_ci</code>.</p>

<p><code class="language-plaintext highlighter-rouge">LIKE</code> has some semantic differences from <code class="language-plaintext highlighter-rouge">=</code>, which can be confusing (for example, when dealing with JSON); however, they’re expected.</p>

<p>Therefore, as long as we keep in mind that <code class="language-plaintext highlighter-rouge">LIKE</code> differs from <code class="language-plaintext highlighter-rouge">=</code>, we are less likely to make mistakes.</p>

<h3 id="unique-indexes">Unique indexes</h3>

<p>Let’s see how unique indexes behave:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">test_unique_index</span> <span class="p">(</span>
  <span class="n">id</span> <span class="nb">INT</span> <span class="k">PRIMARY</span> <span class="k">KEY</span> <span class="n">AUTO_INCREMENT</span><span class="p">,</span>
  <span class="n">str_ps</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span> <span class="n">CHARSET</span> <span class="n">utf8</span> <span class="k">COLLATE</span> <span class="n">utf8_general_ci</span><span class="p">,</span>
  <span class="n">str_np</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span> <span class="n">CHARSET</span> <span class="n">utf8mb4</span> <span class="k">COLLATE</span> <span class="n">utf8mb4_0900_ai_ci</span>
<span class="p">);</span>

<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">test_unique_index</span> <span class="p">(</span><span class="n">str_ps</span><span class="p">,</span> <span class="n">str_np</span><span class="p">)</span> <span class="k">VALUES</span><span class="p">(</span><span class="s1">''</span><span class="p">,</span> <span class="s1">''</span><span class="p">),</span> <span class="p">(</span><span class="s1">' '</span><span class="p">,</span> <span class="s1">' '</span><span class="p">);</span>

<span class="k">ALTER</span> <span class="k">TABLE</span> <span class="n">test_unique_index</span> <span class="k">ADD</span> <span class="k">UNIQUE</span> <span class="p">(</span><span class="n">str_ps</span><span class="p">);</span>

<span class="c1">-- ERROR 1062 (23000): Duplicate entry '' for key 'str_ps'</span>

<span class="k">ALTER</span> <span class="k">TABLE</span> <span class="n">test_unique_index</span> <span class="k">ADD</span> <span class="k">UNIQUE</span> <span class="p">(</span><span class="n">str_np</span><span class="p">);</span>

<span class="c1">-- Query OK, 0 rows affected (0,02 sec)</span>
</code></pre></div></div>

<p>Unique indexes behave like the comparison predicate; this makes sense, since comparison is the core operation they’re associated with.</p>

<h3 id="distinct-predicate"><code class="language-plaintext highlighter-rouge">DISTINCT</code> predicate</h3>

<p>Let’s see the effects of the <code class="language-plaintext highlighter-rouge">DISTINCT</code> predicate:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">test_distinct</span> <span class="p">(</span>
  <span class="n">id</span> <span class="nb">INT</span> <span class="k">PRIMARY</span> <span class="k">KEY</span> <span class="n">AUTO_INCREMENT</span><span class="p">,</span>
  <span class="n">str</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span> <span class="n">CHARSET</span> <span class="n">utf8</span>
<span class="p">);</span>

<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">test_distinct</span> <span class="p">(</span><span class="n">str</span><span class="p">)</span> <span class="k">VALUES</span><span class="p">(</span><span class="s1">''</span><span class="p">),</span> <span class="p">(</span><span class="s1">' '</span><span class="p">);</span>

<span class="k">SET</span> <span class="k">NAMES</span> <span class="n">utf8</span> <span class="k">COLLATE</span> <span class="n">utf8_general_ci</span><span class="p">;</span>

<span class="k">SELECT</span> <span class="k">DISTINCT</span> <span class="n">str</span> <span class="k">FROM</span> <span class="n">test_distinct</span><span class="p">;</span>
<span class="cm">/*
+------+
| str  |
+------+
|      | # ''
|      | # ' '
+------+
*/</span>
</code></pre></div></div>

<p>Very confusing: <code class="language-plaintext highlighter-rouge">DISTINCT</code> does not perform padding.</p>

<p>This is something to keep in mind.</p>

<h3 id="group-by-clause"><code class="language-plaintext highlighter-rouge">GROUP BY</code> clause</h3>

<p>Finally, the <code class="language-plaintext highlighter-rouge">GROUP BY</code> clause:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">group_by</span> <span class="p">(</span>
  <span class="n">id</span> <span class="nb">INT</span> <span class="k">PRIMARY</span> <span class="k">KEY</span> <span class="n">AUTO_INCREMENT</span><span class="p">,</span>
  <span class="n">str</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span> <span class="n">CHARSET</span> <span class="n">utf8</span>
<span class="p">);</span>

<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">group_by</span> <span class="p">(</span><span class="n">str</span><span class="p">)</span> <span class="k">VALUES</span><span class="p">(</span><span class="s1">''</span><span class="p">),</span> <span class="p">(</span><span class="s1">' '</span><span class="p">);</span>

<span class="k">SET</span> <span class="k">NAMES</span> <span class="n">utf8</span> <span class="k">COLLATE</span> <span class="n">utf8_general_ci</span><span class="p">;</span>

<span class="k">SELECT</span> <span class="n">str</span> <span class="k">FROM</span> <span class="n">group_by</span> <span class="k">GROUP</span> <span class="k">BY</span> <span class="n">str</span><span class="p">;</span>

<span class="cm">/*
+------+
| str  |
+------+
|      | # ''
|      | # ' '
+------+
*/</span>
</code></pre></div></div>

<p>Very confusing again, although in a way we could have expected this, since RDBMSs can, in some cases, process <code class="language-plaintext highlighter-rouge">DISTINCT</code> and <code class="language-plaintext highlighter-rouge">GROUP BY</code> the same way.</p>

<h2 id="conclusion">Conclusion</h2>

<p>All in all, the padding rules in MySQL are not <em>so</em> confusing, but one needs to be aware of them - and I haven’t even explored the <code class="language-plaintext highlighter-rouge">CHAR</code> data type.</p>

<p>In my opinion, they’re not worth the hassle, so MySQL 8.0’s behavior is a very welcome simplification. Time to update the database! 😄</p>]]></content><author><name></name></author><category term="mysql" /><category term="data_types" /><category term="databases" /><category term="mysql" /><summary type="html"><![CDATA[Fairly recently, we’ve upgraded to MySQL 8; it’s been a relatively smooth transition, however, some minor differences needed to be handled. One of them is the behavior of trailing spaces. Trailing spaces are a (not in a good way) surprising, but also widely covered argument. This article gives a short overview, and relates it to how this affects people upgrading to MySQL 8.0. Contents: Premises/Requirements Behavior in different contexts Comparison (=) predicate (1) Inspecting the collations Comparison (=) predicate (2) LIKE predicate Unique indexes DISTINCT predicate GROUP BY clause Conclusion]]></summary></entry><entry><title type="html">Text processing experiments for finding the MySQL configuration files</title><link href="https://saveriomiroddi.github.io/Text-processing-experiments-for-finding-the-mysql-configuration-files/" rel="alternate" type="text/html" title="Text processing experiments for finding the MySQL configuration files" /><published>2019-06-12T00:00:00+00:00</published><updated>2019-06-12T00:00:00+00:00</updated><id>https://saveriomiroddi.github.io/Text-processing-experiments-for-finding-the-mysql-configuration-files</id><content type="html" xml:base="https://saveriomiroddi.github.io/Text-processing-experiments-for-finding-the-mysql-configuration-files/"><![CDATA[<p>When it comes to configuring MySQL, a fundamental step is to find out which configuration files the MySQL server reads.</p>

<p>The operation itself is simple; however, if we want to script it with concise text processing, it’s not immediately obvious what the best solution is.</p>

<p>In this post I’ll explore the process of looking for a satisfying solution, going through grep, perl, and awk.</p>

<p>Contents:</p>

<ul>
  <li><a href="/Text-processing-experiments-for-finding-the-mysql-configuration-files#assumptions">Assumptions</a></li>
  <li><a href="/Text-processing-experiments-for-finding-the-mysql-configuration-files#input-data-finding-the-configuration-files-read-by-mysql">Input data (finding the configuration files read by MySQL)</a></li>
  <li><a href="/Text-processing-experiments-for-finding-the-mysql-configuration-files#first-step-greptail">First step: grep+tail</a></li>
  <li><a href="/Text-processing-experiments-for-finding-the-mysql-configuration-files#second-step-expanding-the-tilde">Second step: expanding the tilde</a></li>
  <li><a href="/Text-processing-experiments-for-finding-the-mysql-configuration-files#final-step-awks-super-powers">Final step: awk’s super powers</a></li>
  <li><a href="/Text-processing-experiments-for-finding-the-mysql-configuration-files#extra-step-using-the-output">Extra step: using the output</a></li>
  <li><a href="/Text-processing-experiments-for-finding-the-mysql-configuration-files#conclusion">Conclusion</a></li>
</ul>

<h2 id="assumptions">Assumptions</h2>

<p>For simplicity, we assume that the filenames returned by the <code class="language-plaintext highlighter-rouge">mysqld</code> command, and the user home path, don’t require quoting (e.g. because they contain spaces).</p>

<h2 id="input-data-finding-the-configuration-files-read-by-mysql">Input data (finding the configuration files read by MySQL)</h2>

<p>Finding the configuration files is a simple operation:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>mysqld <span class="nt">--verbose</span> <span class="nt">--help</span>
</code></pre></div></div>

<p>This yields a pages-long text, with all the command-line parameters and the server configuration; the relevant section is:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># ...
Default options are read from the following files in the given order:
/etc/my.cnf /etc/mysql/my.cnf /usr/local/mysql/etc/my.cnf ~/.my.cnf
# ...
</code></pre></div></div>

<h2 id="first-step-greptail">First step: grep+tail</h2>

<p>A generic, manual, approach is to use grep to isolate the text:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>mysqld <span class="nt">--verbose</span> <span class="nt">--help</span> | <span class="nb">grep</span> <span class="nt">-A</span> 1 <span class="s2">"^Default options"</span>
Default options are read from the following files in the given order:
/etc/my.cnf /etc/mysql/my.cnf /usr/local/mysql/etc/my.cnf ~/.my.cnf
</code></pre></div></div>

<p>Using the option <code class="language-plaintext highlighter-rouge">-A</code> (<code class="language-plaintext highlighter-rouge">--after-context</code>), we tell grep to print the given number of lines after the match.</p>

<p>Now we isolate the options line:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>mysqld <span class="nt">--verbose</span> <span class="nt">--help</span> | <span class="nb">grep</span> <span class="nt">-A</span> 1 <span class="s2">"^Default options"</span> | <span class="nb">tail</span> <span class="nt">-n</span> 1
/etc/my.cnf /etc/mysql/my.cnf /usr/local/mysql/etc/my.cnf ~/.my.cnf
</code></pre></div></div>

<p>Standard approach - we use <code class="language-plaintext highlighter-rouge">tail -n 1</code> in order to print only the last line.</p>
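<p>To experiment without a MySQL installation, the same pipeline can be run against a saved sample. This is a minimal sketch: <code class="language-plaintext highlighter-rouge">sample_help</code> is a hypothetical, heavily shortened stand-in for the real <code class="language-plaintext highlighter-rouge">mysqld --verbose --help</code> output, and <code class="language-plaintext highlighter-rouge">config_files</code> is just an illustrative variable name.</p>

```shell
# Hypothetical, shortened stand-in for the output of `mysqld --verbose --help`.
sample_help='Usage: mysqld [OPTIONS]

Default options are read from the following files in the given order:
/etc/my.cnf /etc/mysql/my.cnf /usr/local/mysql/etc/my.cnf ~/.my.cnf
The following groups are read: mysqld server'

# grep -A 1 prints the matching line plus one line after it; tail keeps only the last of the two.
config_files=$(printf '%s\n' "$sample_help" | grep -A 1 "^Default options" | tail -n 1)

printf '%s\n' "$config_files"
```

<p>Storing the result in a variable also makes it easy to inspect before passing it to other commands.</p>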

<h2 id="second-step-expanding-the-tilde">Second step: expanding the tilde</h2>

<p>There’s a problem now; we need to expand the tilde (<code class="language-plaintext highlighter-rouge">~</code>).</p>

<p>Since the string <code class="language-plaintext highlighter-rouge">~/.my.cnf</code> is the output of a command, the shell doesn’t expand it: tilde expansion is not applied to the result of a command substitution; this simplified example fails:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">ls</span> <span class="nt">-l</span> <span class="si">$(</span><span class="nb">echo</span> <span class="s1">'~/.my.cnf'</span><span class="si">)</span>
<span class="nb">ls</span>: cannot access <span class="s1">'~/.my.cnf'</span>: No such file or directory
</code></pre></div></div>
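<p>The difference can be observed directly by comparing a tilde typed literally with one coming from a command substitution. This is a small sketch; the variable names <code class="language-plaintext highlighter-rouge">expanded</code> and <code class="language-plaintext highlighter-rouge">unexpanded</code> are only illustrative.</p>

```shell
# Unquoted tilde at the start of the assigned value: the shell expands it to the home path.
expanded=~/.my.cnf

# Tilde produced by a command substitution: it is left as it is.
unexpanded=$(echo '~/.my.cnf')

printf 'literal:      %s\n' "$expanded"
printf 'from command: %s\n' "$unexpanded"
```

<p>The first line shows a path under the home directory, while the second still starts with the tilde.</p>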

<p>We’ll try to search/replace the tilde with the home path (<code class="language-plaintext highlighter-rouge">$HOME</code> in any POSIX shell) via Perl:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>mysqld <span class="nt">--verbose</span> <span class="nt">--help</span> | <span class="nb">grep</span> <span class="nt">-A</span> 1 <span class="s2">"^Default options"</span> | <span class="nb">tail</span> <span class="nt">-n</span> 1 | perl <span class="nt">-pe</span> <span class="s2">"s/~/</span><span class="nv">$HOME</span><span class="s2">/g"</span>
Unknown regexp modifier <span class="s2">"/h"</span> at <span class="nt">-e</span> line 1, at end of line
syntax error at <span class="nt">-e</span> line 1, at EOF
Execution of <span class="nt">-e</span> aborted due to compilation errors.
</code></pre></div></div>

<p>Yikes! What happened?</p>

<p>The problem is that <code class="language-plaintext highlighter-rouge">$HOME</code>, in my case <code class="language-plaintext highlighter-rouge">/home/saverio</code>, contains slashes; the shell interpolates the variable into the string before Perl sees it, and Perl then interprets those slashes as delimiters of the substitution operator; this is the simplified example:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">echo </span>perl <span class="nt">-pe</span> <span class="s2">"s/~/</span><span class="nv">$HOME</span><span class="s2">/g"</span>
perl <span class="nt">-pe</span> s/~//home/saverio/g

<span class="nv">$ </span><span class="nb">echo</span> | perl <span class="nt">-pe</span> <span class="s1">'s/~//home/saverio/g'</span>
Unknown regexp modifier <span class="s2">"/h"</span> at <span class="nt">-e</span> line 1, at end of line
Execution of <span class="nt">-e</span> aborted due to compilation errors.
</code></pre></div></div>

<p>which causes the error previously raised.</p>
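<p>One way to sidestep the delimiter clash is to pick a delimiter that doesn’t occur in the path. The sketch below uses <code class="language-plaintext highlighter-rouge">sed</code> as a stand-in, since its <code class="language-plaintext highlighter-rouge">s</code> command accepts alternative delimiters just like Perl’s; <code class="language-plaintext highlighter-rouge">sample_home</code> is a hypothetical value used in place of <code class="language-plaintext highlighter-rouge">$HOME</code> to keep the output predictable. It assumes that the path contains neither <code class="language-plaintext highlighter-rouge">|</code> nor <code class="language-plaintext highlighter-rouge">&amp;</code>, so it only moves the problem rather than solving it.</p>

```shell
# Hypothetical home path, standing in for $HOME.
sample_home=/home/saverio

# With `|` as the delimiter, the slashes in the path are ordinary characters.
expanded=$(echo '~/.my.cnf' | sed "s|~|$sample_home|g")

printf '%s\n' "$expanded"
```

<p>The value is still interpolated by the shell into the program text, so any special character in it can break the command again.</p>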

<p>Perl can access environment variables - this comes to our rescue:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">echo</span> <span class="s1">'~/.my.cnf'</span> | perl <span class="nt">-pe</span> <span class="s1">'s/~/$ENV{"HOME"}/'</span>
/home/saverio/.my.cnf
</code></pre></div></div>

<p>We now have the building blocks of a fully functional command:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>mysqld <span class="nt">--verbose</span> <span class="nt">--help</span> | <span class="nb">grep</span> <span class="nt">-A</span> 1 <span class="s2">"^Default options"</span> | <span class="nb">tail</span> <span class="nt">-n</span> 1 | perl <span class="nt">-pe</span> <span class="s1">'s/~/$ENV{"HOME"}/g'</span>
/etc/my.cnf /etc/mysql/my.cnf /usr/local/mysql/etc/my.cnf /home/saverio/.my.cnf
</code></pre></div></div>

<p>Don’t forget the <code class="language-plaintext highlighter-rouge">/g</code> regex modifier! It tells Perl to replace all the occurrences of the pattern in each line, rather than only the first one.</p>

<p>Our task is now accomplished. Can we do better?</p>

<h2 id="final-step-awks-super-powers">Final step: awk’s super powers</h2>

<p>While the last revision of the command works, it contains way too many commands. Does the GNU toolbox have better tools?</p>

<p>Let’s see what awk offers.</p>

<p>Awk is a (Turing-complete!) programming language, dedicated to text-processing; hopefully, it includes built-in functions relevant to our task.</p>

<p>The ugliest part right now is to isolate the options string from the entire <code class="language-plaintext highlighter-rouge">mysqld</code> help. The logic required is:</p>

<ul>
  <li>find a matching line</li>
  <li>print the line below</li>
</ul>

<p>With grep, unfortunately, we can’t print the line below without also printing the matching line. But we can with awk:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>mysqld <span class="nt">--verbose</span> <span class="nt">--help</span> | <span class="nb">awk</span> <span class="s1">'/^Default options/ { getline; print }'</span>
/etc/my.cnf /etc/mysql/my.cnf /usr/local/mysql/etc/my.cnf ~/.my.cnf
</code></pre></div></div>

<p>Awk’s language is fortunately fairly intuitive.<br />
We use pattern matching <code class="language-plaintext highlighter-rouge">/&lt;pattern&gt;/</code> to match the intended line, and for the matches we execute a block (<code class="language-plaintext highlighter-rouge">{ ... }</code>) that goes to the next line (<code class="language-plaintext highlighter-rouge">getline</code>) and then prints the current one (<code class="language-plaintext highlighter-rouge">print</code>).</p>
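<p>One detail worth knowing: if the matching line is the last one, <code class="language-plaintext highlighter-rouge">getline</code> has nothing to read and leaves the current line unchanged, so <code class="language-plaintext highlighter-rouge">print</code> outputs the matching line itself. The sketch below compares the plain version with a guarded one; the <code class="language-plaintext highlighter-rouge">truncated</code> input is a hypothetical edge case, not something <code class="language-plaintext highlighter-rouge">mysqld</code> is expected to produce.</p>

```shell
# Hypothetical truncated input: the header is the last line, so nothing follows it.
truncated='Some text
Default options are read from the following files in the given order:'

# Unguarded: getline fails at the end of the input, so the header itself is printed.
unguarded=$(printf '%s\n' "$truncated" | awk '/^Default options/ { getline; print }')

# Guarded: print only if getline actually read a line (it returns 1 on success).
guarded=$(printf '%s\n' "$truncated" | awk '/^Default options/ { if ((getline line) > 0) print line }')

printf 'unguarded: [%s]\n' "$unguarded"
printf 'guarded:   [%s]\n' "$guarded"
```

<p>For the use case of this article the plain version is enough; the guard matters only when the input may be cut short.</p>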

<p>Now, in the current revision, we still have two commands, <code class="language-plaintext highlighter-rouge">awk</code> and <code class="language-plaintext highlighter-rouge">perl</code>:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mysqld <span class="nt">--verbose</span> <span class="nt">--help</span> | <span class="nb">awk</span> <span class="s1">'/^Default options/ { getline; print }'</span> | perl <span class="nt">-pe</span> <span class="s1">'s/~/$ENV{"HOME"}/g'</span>
</code></pre></div></div>

<p>Let’s merge them! We use awk’s search and replace, and environment variables access:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>mysqld <span class="nt">--verbose</span> <span class="nt">--help</span> | <span class="nb">awk</span> <span class="s1">'/^Default options/ { getline; gsub("~", ENVIRON["HOME"]); print }'</span>
/etc/my.cnf /etc/mysql/my.cnf /usr/local/mysql/etc/my.cnf /home/saverio/.my.cnf
</code></pre></div></div>

<p>Here we use the search and replace function (<code class="language-plaintext highlighter-rouge">gsub(regexp, replacement[, target])</code>; <code class="language-plaintext highlighter-rouge">target</code> defaults to the current line, which is what we want here) and associative arrays applied to environment variables (<code class="language-plaintext highlighter-rouge">ENVIRON[&lt;variable_name&gt;]</code>).</p>

<p>Note that <code class="language-plaintext highlighter-rouge">gsub</code> is the global version of search/replace; it replaces all the occurrences in a string, like Perl’s <code class="language-plaintext highlighter-rouge">/g</code> regex modifier.</p>
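<p>The replacement can be tried in isolation on a single line. In this sketch, <code class="language-plaintext highlighter-rouge">HOME</code> is overridden for the <code class="language-plaintext highlighter-rouge">awk</code> command only, with a hypothetical path, so that the output is predictable on any machine.</p>

```shell
# HOME is set only for this awk invocation; /home/saverio is a hypothetical value.
expanded=$(echo '/etc/my.cnf ~/.my.cnf' | HOME=/home/saverio awk '{ gsub("~", ENVIRON["HOME"]); print }')

printf '%s\n' "$expanded"
```

<p>Since the home path is read by awk from the environment, rather than interpolated by the shell into the program, slashes in it cause no trouble. One caveat: in the replacement string of <code class="language-plaintext highlighter-rouge">gsub</code>, the character <code class="language-plaintext highlighter-rouge">&amp;</code> stands for the matched text, so a home path containing it would need escaping.</p>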

<h2 id="extra-step-using-the-output">Extra step: using the output</h2>

<p>As an extra step, we want to use the output. Say, let’s add a comment to the <code class="language-plaintext highlighter-rouge">[mysqld]</code> block:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>perl <span class="nt">-i</span> <span class="nt">-pe</span> <span class="s1">'s/^(\[mysqld\]\n)/# Server configuration group follows:\n$1/'</span> <span class="si">$(</span>mysqld <span class="nt">--verbose</span> <span class="nt">--help</span> | <span class="nb">awk</span> <span class="s1">'/^Default options/ { getline; gsub("~", ENVIRON["HOME"]); print }'</span><span class="si">)</span> 2&gt; /dev/null
</code></pre></div></div>

<p>We just ignore the errors (due to file(s) not found), by sending them to <code class="language-plaintext highlighter-rouge">/dev/null</code>.</p>
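<p>An alternative to discarding the errors is to keep only the files that exist before processing them. This is a sketch under the article’s assumption that paths don’t need quoting; <code class="language-plaintext highlighter-rouge">candidates</code> is a hypothetical list standing in for the output of the awk pipeline, built on a temporary directory so that the example is self-contained.</p>

```shell
# Hypothetical candidate list: one existing file and one missing file in a temporary directory.
work_dir=$(mktemp -d)
touch "$work_dir/my.cnf"
candidates="$work_dir/my.cnf $work_dir/missing.cnf"

existing=""
# Word splitting on $candidates is intentional: paths are assumed not to contain spaces.
for file in $candidates; do
  if [ -f "$file" ]; then
    existing="$existing $file"
  fi
done
existing=${existing# }   # Strip the leading space.

printf '%s\n' "$existing"
rm -r "$work_dir"
```

<p>With this approach, errors other than missing files (e.g. permission problems) stay visible instead of being silenced.</p>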

<h2 id="conclusion">Conclusion</h2>

<p>Long ago, I thought that one could improve one’s text processing skills with a straight read of educational material. Nowadays, I find it much more effective (and pleasant) to find out, when I have the opportunity, which tools are the most effective to accomplish a task.</p>

<p>In this article we’ve done an iterative search of the best text processing tools for the given use case; we’ve found that awk compactly, yet intuitively, satisfies the requirements, and we’ve explored a few, interesting and useful, features along the way.</p>]]></content><author><name></name></author><category term="mysql" /><category term="linux" /><category term="sysadmin" /><category term="mysql" /><category term="awk" /><category term="perl" /><category term="text_processing" /><summary type="html"><![CDATA[When it comes to configuring MySQL, a fundamental step is to find out which configuration files the MySQL server reads. The operation itself is simple, however, if we want to script the operation, using text processing in a sharp way, it’s not immediate what the best solution is. In this post I’ll explore the process of looking for a satisfying solution, going through grep, perl, and awk. Contents: Assumptions Input data (finding the configuration files read by MySQL) First step: grep+tail Second step: expanding the tilde Final step: awk’s super powers Extra step: using the output Conclusion]]></summary></entry><entry><title type="html">An in depth DBA’s guide to migrating a MySQL database from the `utf8` to the `utf8mb4` charset</title><link href="https://saveriomiroddi.github.io/An-in-depth-dbas-guide-to-migrating-a-mysql-database-from-the-utf8-to-the-utf8mb4-charset/" rel="alternate" type="text/html" title="An in depth DBA’s guide to migrating a MySQL database from the `utf8` to the `utf8mb4` charset" /><published>2019-03-25T00:00:00+00:00</published><updated>2020-02-03T10:29:00+00:00</updated><id>https://saveriomiroddi.github.io/An-in-depth-dbas-guide-to-migrating-a-mysql-database-from-the-utf8-to-the-utf8mb4-charset</id><content type="html" xml:base="https://saveriomiroddi.github.io/An-in-depth-dbas-guide-to-migrating-a-mysql-database-from-the-utf8-to-the-utf8mb4-charset/"><![CDATA[<p>We’re in the process of upgrading our MySQL databases from v5.7 to 
v8.0; since one of the differences in v8.0 is that the default encoding is now <code class="language-plaintext highlighter-rouge">utf8mb4</code>, and we had planned the conversion from <code class="language-plaintext highlighter-rouge">utf8</code> anyway, we brought it forward and performed it as a preliminary step for the upgrade.</p>

<p>This post describes in depth the overall experience, including tooling and pitfalls, and related subjects.</p>

<p>Contents:</p>

<ul>
  <li><a href="/An-in-depth-dbas-guide-to-migrating-a-mysql-database-from-the-utf8-to-the-utf8mb4-charset#introduction">Introduction</a></li>
  <li><a href="/An-in-depth-dbas-guide-to-migrating-a-mysql-database-from-the-utf8-to-the-utf8mb4-charset#migration-plan-overview-and-considerations">Migration plan: overview and considerations</a>
    <ul>
      <li><a href="/An-in-depth-dbas-guide-to-migrating-a-mysql-database-from-the-utf8-to-the-utf8mb4-charset#-collation-warning-">!! COLLATION WARNING !!</a></li>
      <li><a href="/An-in-depth-dbas-guide-to-migrating-a-mysql-database-from-the-utf8-to-the-utf8mb4-charset#free-step-connection-configuration">Free step: connection configuration</a>
        <ul>
          <li><a href="/An-in-depth-dbas-guide-to-migrating-a-mysql-database-from-the-utf8-to-the-utf8mb4-charset#how-do-charset-settings-affect-database-operations">How do charset settings affect database operations?</a></li>
        </ul>
      </li>
      <li><a href="/An-in-depth-dbas-guide-to-migrating-a-mysql-database-from-the-utf8-to-the-utf8mb4-charset#step-2-preparing-the-the-alter-statements">Step 2: Preparing the <code class="language-plaintext highlighter-rouge">ALTER</code> statements</a>
        <ul>
          <li><a href="/An-in-depth-dbas-guide-to-migrating-a-mysql-database-from-the-utf8-to-the-utf8mb4-charset#issue-columnindex-size-limits">Issue: Column/index size limits</a></li>
          <li><a href="/An-in-depth-dbas-guide-to-migrating-a-mysql-database-from-the-utf8-to-the-utf8mb4-charset#issue-triggersfunctions">Issue: Triggers/Functions</a></li>
          <li><a href="/An-in-depth-dbas-guide-to-migrating-a-mysql-database-from-the-utf8-to-the-utf8mb4-charset#issue-joins-between-columns-with-heterogeneous-charsets">Issue: Joins between columns with heterogeneous charsets</a></li>
        </ul>
      </li>
      <li><a href="/An-in-depth-dbas-guide-to-migrating-a-mysql-database-from-the-utf8-to-the-utf8mb4-charset#step-3-altering-the-schema-and-tables">Step 3: Altering the schema and tables</a></li>
      <li><a href="/An-in-depth-dbas-guide-to-migrating-a-mysql-database-from-the-utf8-to-the-utf8mb4-charset#warnings">Warnings</a>
        <ul>
          <li><a href="/An-in-depth-dbas-guide-to-migrating-a-mysql-database-from-the-utf8-to-the-utf8mb4-charset#other-schemas">Other schemas</a></li>
          <li><a href="/An-in-depth-dbas-guide-to-migrating-a-mysql-database-from-the-utf8-to-the-utf8mb4-charset#always-run-analyze-table">Always run <code class="language-plaintext highlighter-rouge">ANALYZE TABLE</code></a></li>
          <li><a href="/An-in-depth-dbas-guide-to-migrating-a-mysql-database-from-the-utf8-to-the-utf8mb4-charset#dont-rush-the-drop-table">Don’t rush the <code class="language-plaintext highlighter-rouge">DROP TABLE</code></a></li>
        </ul>
      </li>
    </ul>
  </li>
  <li><a href="/An-in-depth-dbas-guide-to-migrating-a-mysql-database-from-the-utf8-to-the-utf8mb4-charset#notes-about-mathias-bynens-post-on-the-same-subject">Notes about Mathias Bynens’ post on the same subject</a></li>
  <li><a href="/An-in-depth-dbas-guide-to-migrating-a-mysql-database-from-the-utf8-to-the-utf8mb4-charset#conclusion">Conclusion</a></li>
  <li><a href="/An-in-depth-dbas-guide-to-migrating-a-mysql-database-from-the-utf8-to-the-utf8mb4-charset#footnotes">Footnotes</a></li>
</ul>

<h2 id="introduction">Introduction</h2>

<p><code class="language-plaintext highlighter-rouge">utf8mb4</code> is the MySQL encoding that fully covers the UTF-8 standard. Up to MySQL 5.7, the UTF-8 encoding typically used is <code class="language-plaintext highlighter-rouge">utf8</code>; the name is somewhat misleading, as this is a variant with a maximum width of 3 bytes per character.</p>

<p>Although there’s no practical purpose nowadays in using 3-byte rather than 4-byte UTF-8, this choice was originally made <a href="https://mysqlserverteam.com/mysql-8-0-when-to-use-utf8mb3-over-utf8mb4">for performance reasons</a>.</p>

<p>From a practical perspective, not all applications will benefit from the extra byte of width, whose most common use cases include <a href="https://stackoverflow.com/questions/5567249/what-are-the-most-common-non-bmp-unicode-characters-in-actual-use">emojis and mathematical letters</a>; however, conforming to standards is a routine task in software engineering.</p>

<p>Since <code class="language-plaintext highlighter-rouge">utf8mb4</code> is a superset of <code class="language-plaintext highlighter-rouge">utf8</code>, the conversion is relatively painless; however, it’s crucial to be aware of the implications of the procedure.</p>

<h2 id="migration-plan-overview-and-considerations">Migration plan: overview and considerations</h2>

<p>It’s impossible to make a general plan, due to the different requirements of each use case; high traffic applications may, for example, require that no locking be involved (i.e. no <code class="language-plaintext highlighter-rouge">ALTER TABLE</code>), while low traffic/size applications may just do with a few <code class="language-plaintext highlighter-rouge">ALTER TABLE</code>s.</p>

<p>However, I’ll trace a granular set of steps that should cover the vast majority of the cases; GitHub’s gh-ost is used, therefore, there’s no table locking during the data conversion step.</p>

<p>The setup is assumed to be single-master; there are generally sophisticated multi-master strategies for schema updates, however, they are outside the scope of this article.</p>

<p>The only migration constraint set is that until the end of the migration, the user should not allow 4-byte characters into the database; this gives the certainty that any implicit conversion performed before the end of the migration will succeed.</p>

<p>Users can certainly lift this constraint, however, they must thoroughly analyze the application data flows, in order to be 100% sure that <code class="language-plaintext highlighter-rouge">utf8mb4</code> strings including 4-byte characters won’t mingle with <code class="language-plaintext highlighter-rouge">utf8</code> strings, as this will cause errors.</p>

<h3 id="-collation-warning-">!! COLLATION WARNING !!</h3>

<p>MySQL 8.0 changed the <code class="language-plaintext highlighter-rouge">utf8mb4</code> default collation from <code class="language-plaintext highlighter-rouge">utf8mb4_general_ci</code> to <code class="language-plaintext highlighter-rouge">utf8mb4_0900_ai_ci</code> (for details, see <a href="https://mysqlserverteam.com/mysql-8-0-collations-the-devil-is-in-the-details">here</a> and <a href="http://mysqlserverteam.com/new-collations-in-mysql-8-0-0">here</a>).</p>

<p>This has a very significant impact - if the <code class="language-plaintext highlighter-rouge">utf8</code> conversion is performed on a MySQL 5.7 server, without specifying the collation, and then the server is upgraded to v8.0, the collation of all the data structures will not match the default.<br />
Of course, in such case it’s possible to leave the system as is, however, it won’t be the standard (and the settings will need to be set accordingly, in order to ensure that new tables/columns will be created with the intended collation).</p>

<p>It’s crucial to be aware of this, because most of the online information about the <code class="language-plaintext highlighter-rouge">utf8</code> conversion has been written when MySQL 8.0 was not released yet, so it holds the outdated assumption that the default <code class="language-plaintext highlighter-rouge">utf8mb4</code> collation is <code class="language-plaintext highlighter-rouge">utf8mb4_general_ci</code>.</p>

<p>In the following sections, I’ll point out which configuration parameters are required, when performing the conversion on a 5.7 server.</p>

<h3 id="free-step-connection-configuration">Free step: connection configuration</h3>

<p>The character set [from now on abbreviated as <code class="language-plaintext highlighter-rouge">charset</code>] and collation of a given string or database object (ultimately, a column), and the operation performed, are determined by one or more settings/properties at different levels:</p>

<ol>
  <li>connection (set by the database client, which in turn can be set by the application framework) settings;</li>
  <li>database server settings;</li>
  <li>trigger settings;</li>
  <li>database -&gt; table -&gt; column properties;</li>
</ol>

<p>For example:</p>

<ul>
  <li>when creating a database, the charset is defaulted to the one set in the database server configuration,</li>
  <li>when creating a trigger, the connection will determine the charset,</li>
</ul>

<p>and so on.</p>

<p>Additionally, MySQL server attempts to use a compatible combination charset+collation for incompatible charsets, overriding the configuration/settings.</p>

<p>In order to view the connection and database server settings, we can use this handy query:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SHOW</span> <span class="n">VARIABLES</span> <span class="k">WHERE</span> <span class="n">Variable_name</span> <span class="n">RLIKE</span> <span class="s1">'^(character_set|collation)_'</span> <span class="k">AND</span> <span class="n">Variable_name</span> <span class="k">NOT</span> <span class="n">RLIKE</span> <span class="s1">'_(database|filesystem|system)$'</span><span class="p">;</span>
</code></pre></div></div>

<p>Some settings are skipped, as they’re unrelated or deprecated.</p>

<p>This is a table of the relevant entries:</p>

<table>
  <thead>
    <tr>
      <th>Setting</th>
      <th>New value</th>
      <th>Notes</th>
      <th>Server setting</th>
      <th>Client setting</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">character_set_client</code></td>
      <td><code class="language-plaintext highlighter-rouge">utf8mb4</code></td>
      <td>data sent by the client</td>
      <td> </td>
      <td>✓</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">character_set_connection</code></td>
      <td><code class="language-plaintext highlighter-rouge">utf8mb4</code></td>
      <td>server converts client data into this charset for processing</td>
      <td> </td>
      <td>✓</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">collation_connection</code></td>
      <td><code class="language-plaintext highlighter-rouge">utf8mb4_0900_ai_ci</code></td>
      <td>server uses this collation for processing</td>
      <td> </td>
      <td>✓</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">character_set_results</code></td>
      <td><code class="language-plaintext highlighter-rouge">utf8mb4</code></td>
      <td>data and metadata sent by the server</td>
      <td> </td>
      <td>✓</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">character_set_server</code></td>
      <td><code class="language-plaintext highlighter-rouge">utf8mb4</code></td>
      <td>default (and fallback) charset for objects</td>
      <td>✓</td>
      <td> </td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">collation_server</code></td>
      <td><code class="language-plaintext highlighter-rouge">utf8mb4_0900_ai_ci</code></td>
      <td>default (and fallback) collation for objects</td>
      <td>✓</td>
      <td> </td>
    </tr>
  </tbody>
</table>

<p>Server settings are defined at the server level, and as such, they’re typically set in the server configuration file - this is required if we’re operating on MySQL 5.7 (since it doesn’t use <code class="language-plaintext highlighter-rouge">utf8mb4</code> by default).</p>

<p>Client settings are specified by the client on connection; typically, they’re set via the <a href="https://dev.mysql.com/doc/refman/8.0/en/set-names.html"><code class="language-plaintext highlighter-rouge">SET NAMES &lt;charset&gt; [COLLATE &lt;collation&gt;]</code></a> statement.<br />
This command is invoked when the encoding/collation are configured by the application framework; in the case of Rails, the parameters are in <code class="language-plaintext highlighter-rouge">database.yml</code>:</p>

<div class="language-yml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Typical structure</span>
<span class="na">login</span><span class="pi">:</span>
  <span class="na">encoding</span><span class="pi">:</span> <span class="s">utf8mb4</span>
  <span class="na">collation</span><span class="pi">:</span> <span class="s">utf8mb4_0900_ai_ci</span>
  <span class="c1"># ...</span>
</code></pre></div></div>

<p>In Django, we add the following to <code class="language-plaintext highlighter-rouge">settings.py</code>:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Typical structure
</span><span class="n">DATABASES</span> <span class="o">=</span> <span class="p">{</span>
  <span class="s">'default'</span><span class="p">:</span> <span class="p">{</span>
    <span class="s">'OPTIONS'</span><span class="p">:</span> <span class="p">{</span><span class="s">'charset'</span><span class="p">:</span> <span class="s">'utf8mb4'</span><span class="p">},</span>
    <span class="c1"># ...
</span>  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The changes above will cause the following statement to be issued on the first connection:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SET</span> <span class="k">NAMES</span> <span class="n">utf8mb4</span> <span class="k">COLLATE</span> <span class="n">utf8mb4_0900_ai_ci</span> <span class="o">#</span> <span class="n">Rails</span> <span class="n">also</span> <span class="k">sets</span> <span class="n">other</span> <span class="n">variables</span> <span class="n">here</span><span class="p">.</span>
</code></pre></div></div>

<p>Based on a brief look at the source code, there is no collation option in Django, so the <code class="language-plaintext highlighter-rouge">COLLATE utf8mb4_0900_ai_ci</code> won’t be specified in the SQL statement.</p>

<p>This step can be performed at the beginning or the end of the migration; the reason is explained in the next subsection.</p>

<h4 id="how-do-charset-settings-affect-database-operations">How do charset settings affect database operations?</h4>

<p>During the migration, with either <code class="language-plaintext highlighter-rouge">utf8</code> or <code class="language-plaintext highlighter-rouge">utf8mb4</code> connection settings, we’ll find data belonging to the other charset. Is this a problem?</p>

<p>First, an introduction to the <a href="https://dev.mysql.com/doc/refman/5.7/en/charset-connection.html">charset/collation settings</a> is required.</p>

<p>Over the course of a database connection, the data (flow) is processed in several steps:</p>

<ul>
  <li>client data sent: it’s assumed to be in the format defined by <code class="language-plaintext highlighter-rouge">character_set_client</code></li>
  <li>server processing: converted to the format defined by <code class="language-plaintext highlighter-rouge">character_set_connection</code> (and compared using the <code class="language-plaintext highlighter-rouge">collation_connection</code>)</li>
  <li>server results: sent in the format defined by <code class="language-plaintext highlighter-rouge">character_set_results</code></li>
</ul>

<p>All the above settings (unless explicitly set) are set automatically, according to the <code class="language-plaintext highlighter-rouge">character_set_client</code> settings, so we can really think of all of them as a single entity.</p>
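
<p>This can be checked on a live session. The following is a minimal sketch, where the choice of <code class="language-plaintext highlighter-rouge">utf8mb4</code> is only an example, and the collation displayed depends on the server version:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SET NAMES utf8mb4;

# List the connection-related settings; after the statement above, they all refer to utf8mb4.
SHOW SESSION VARIABLES
WHERE Variable_name IN (
  'character_set_client',
  'character_set_connection',
  'character_set_results',
  'collation_connection'
);
</code></pre></div></div>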

<p>So, the core question is: for client data in a given format (<code class="language-plaintext highlighter-rouge">utf8</code> or <code class="language-plaintext highlighter-rouge">utf8mb4</code>), will processing (comparison or storage) always succeed?</p>

<p>Fortunately, in our context, the answer is always yes.</p>

<p>When it comes to storage, the matter is pretty simple; MySQL will take care of “converting” the format. We’re safe here because by using 3-byte characters, we can convert without any problem from and to the other charset.</p>

<p>However, in this context, string manipulation is not only about storage - comparison is the other aspect to consider. It’s time to introduce the concept of collation and the related rules.</p>

<p>Strings are compared according to a “collation”, which defines how the data is sorted and compared. Each charset has a default collation, which in MySQL is the case-insensitive one (<code class="language-plaintext highlighter-rouge">utf8_general_ci</code> and <code class="language-plaintext highlighter-rouge">utf8mb4_general_ci</code>/<code class="language-plaintext highlighter-rouge">utf8mb4_0900_ai_ci</code>).</p>

<p>Now, when collating strings of mixed type, will the operation succeed? The answer is… no, but yes!</p>

<p>The reason for the no is that, unlike storage, we can’t use a collation for two different charsets. However, MySQL comes to the rescue.</p>

<p>MySQL has a set of <a href="https://dev.mysql.com/doc/refman/5.7/en/charset-collation-coercibility.html">coercibility rules</a>, which determine which collation to use in a given operation (or if an error should be raised).</p>

<p>There are quite a few rules; however, they’re consistently defined, so they’re easy to understand.</p>
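
<p>The coercibility value assigned to an expression can be inspected with the <code class="language-plaintext highlighter-rouge">COERCIBILITY()</code> function. This is a small sketch with arbitrary literals; the values in the comments are those stated by the manual rules:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code># The introducers make the statement independent of the connection charset.
SELECT
  COERCIBILITY(_utf8mb4'a' COLLATE utf8mb4_bin) `explicit_collate`, # 0 (explicit COLLATE clause)
  COERCIBILITY(_utf8mb4'a') `literal`;                              # 4 (literal)
</code></pre></div></div>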

<p>We’ll see a few relevant examples, where we’ll also introduce a few interesting SQL clauses:</p>

<ul>
  <li>we define a default collation for a column;</li>
  <li>we use an <a href="https://dev.mysql.com/doc/refman/5.7/en/charset-introducer.html">“introducer”</a> on a string literal;</li>
  <li>we override the default collation of a string literal.</li>
</ul>

<p>First example:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TEMPORARY</span> <span class="k">TABLE</span> <span class="n">test_table</span> <span class="p">(</span>
  <span class="n">utf8col</span> <span class="nb">CHAR</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="nb">CHARACTER</span> <span class="k">SET</span> <span class="n">utf8</span> <span class="k">COLLATE</span> <span class="n">utf8_bin</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="n">_utf8</span><span class="s1">'ä'</span> <span class="nv">`utf8col`</span><span class="p">;</span>

<span class="k">SELECT</span> <span class="n">utf8col</span> <span class="o">&lt;</span> <span class="n">_utf8mb4</span><span class="s1">'🍕'</span> <span class="k">COLLATE</span> <span class="n">utf8mb4_bin</span> <span class="nv">`result`</span> <span class="k">FROM</span> <span class="n">test_table</span><span class="p">;</span>
<span class="o">#</span> <span class="o">+</span><span class="c1">--------+</span>
<span class="o">#</span> <span class="o">|</span> <span class="k">result</span> <span class="o">|</span>
<span class="o">#</span> <span class="o">+</span><span class="c1">--------+</span>
<span class="o">#</span> <span class="o">|</span>      <span class="mi">1</span> <span class="o">|</span>
<span class="o">#</span> <span class="o">+</span><span class="c1">--------+</span>
</code></pre></div></div>

<p>The relevant rules are:</p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">An explicit COLLATE clause has a coercibility of 0 (not coercible at all)</code></li>
  <li><code class="language-plaintext highlighter-rouge">The collation of a column or a stored routine parameter or local variable has a coercibility of 2</code></li>
</ol>

<p>which select <code class="language-plaintext highlighter-rouge">utf8mb4_bin</code> as the collation, since the side with the lowest coercibility value wins. Shouldn’t the <code class="language-plaintext highlighter-rouge">utf8col</code> value fail, due to being a <code class="language-plaintext highlighter-rouge">utf8</code> value, which is not handled by the winning collation?</p>

<p>No! MySQL will automatically convert the value, making it compatible. This is equivalent to:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">_utf8mb4</span><span class="s1">'ä'</span> <span class="o">&lt;</span> <span class="n">_utf8mb4</span><span class="s1">'🍕'</span> <span class="k">COLLATE</span> <span class="n">utf8mb4_bin</span> <span class="nv">`result`</span> <span class="k">FROM</span> <span class="n">test_table</span><span class="p">;</span>
</code></pre></div></div>

<p>Second example:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SET</span> <span class="k">NAMES</span> <span class="n">utf8mb4</span><span class="p">;</span>

<span class="k">CREATE</span> <span class="k">TEMPORARY</span> <span class="k">TABLE</span> <span class="n">test_table</span> <span class="p">(</span>
  <span class="n">utf8col</span> <span class="nb">CHAR</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="nb">CHARACTER</span> <span class="k">SET</span> <span class="n">utf8</span> <span class="k">COLLATE</span> <span class="n">utf8_bin</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="n">_utf8</span><span class="s1">'ä'</span> <span class="nv">`utf8col`</span><span class="p">;</span>

<span class="k">SELECT</span> <span class="n">utf8col</span> <span class="o">&lt;</span> <span class="s1">'ë'</span> <span class="nv">`result`</span> <span class="k">FROM</span> <span class="n">test_table</span><span class="p">;</span>
<span class="o">#</span> <span class="o">+</span><span class="c1">--------+</span>
<span class="o">#</span> <span class="o">|</span> <span class="k">result</span> <span class="o">|</span>
<span class="o">#</span> <span class="o">+</span><span class="c1">--------+</span>
<span class="o">#</span> <span class="o">|</span>      <span class="mi">1</span> <span class="o">|</span>
<span class="o">#</span> <span class="o">+</span><span class="c1">--------+</span>
</code></pre></div></div>

<p>The relevant rules are:</p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">The collation of a column or a stored routine parameter or local variable has a coercibility of 2</code></li>
  <li><code class="language-plaintext highlighter-rouge">The collation of a literal has a coercibility of 4</code></li>
</ol>

<p>The collation will be <code class="language-plaintext highlighter-rouge">utf8_bin</code>. Since <code class="language-plaintext highlighter-rouge">ë</code> can be converted, there’s no problem.</p>

<p>Equivalent statement:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">_utf8</span><span class="s1">'ä'</span> <span class="k">COLLATE</span> <span class="n">utf8_bin</span> <span class="o">&lt;</span> <span class="n">_utf8mb4</span><span class="s1">'ë'</span> <span class="nv">`result`</span> <span class="k">FROM</span> <span class="n">test_table</span><span class="p">;</span>
</code></pre></div></div>

<p>Final example:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TEMPORARY</span> <span class="k">TABLE</span> <span class="n">test_table</span> <span class="p">(</span>
  <span class="n">utf8col</span> <span class="nb">CHAR</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="nb">CHARACTER</span> <span class="k">SET</span> <span class="n">utf8</span> <span class="k">COLLATE</span> <span class="n">utf8_bin</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="n">_utf8</span><span class="s1">'ä'</span> <span class="nv">`utf8col`</span><span class="p">;</span>

<span class="k">SELECT</span> <span class="n">utf8col</span> <span class="o">&lt;</span> <span class="n">_utf8mb4</span><span class="s1">'🍕'</span> <span class="nv">`result`</span> <span class="k">FROM</span> <span class="n">test_table</span><span class="p">;</span>
<span class="n">ERROR</span> <span class="mi">1267</span> <span class="p">(</span><span class="n">HY000</span><span class="p">):</span> <span class="n">Illegal</span> <span class="n">mix</span> <span class="k">of</span> <span class="n">collations</span> <span class="p">(</span><span class="n">utf8_bin</span><span class="p">,</span><span class="k">IMPLICIT</span><span class="p">)</span> <span class="k">and</span> <span class="p">(</span><span class="n">utf8mb4_0900_ai_ci</span><span class="p">,</span><span class="n">COERCIBLE</span><span class="p">)</span> <span class="k">for</span> <span class="k">operation</span> <span class="s1">'&lt;'</span>
</code></pre></div></div>

<p>Error! What happened here?</p>

<p>The relevant rules and chosen collation are the same as the previous example, however, in this case, the pizza emoji (<code class="language-plaintext highlighter-rouge">🍕</code>) can’t be converted to <code class="language-plaintext highlighter-rouge">utf8</code>, therefore, the operation fails.</p>

<p>The conclusion is that as long as we use <code class="language-plaintext highlighter-rouge">utf8</code> characters only during the migration, we’ll have no problem, as the only relevant case is the second example.</p>

<h3 id="step-2-preparing-the-the-alter-statements">Step 2: Preparing the <code class="language-plaintext highlighter-rouge">ALTER</code> statements</h3>

<p>In this step we’ll prepare all the <code class="language-plaintext highlighter-rouge">ALTER</code> statements that will change the schema/table metadata, and the data.</p>

<p>The operations are performed on a development database with the same structure as production.</p>

<p>First, we convert the database default charset (both production and development):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">ALTER</span> <span class="k">SCHEMA</span> <span class="n">production_schema</span> <span class="nb">CHARACTER</span> <span class="k">SET</span><span class="o">=</span><span class="n">utf8mb4</span><span class="p">;</span>
</code></pre></div></div>

<p>data is not changed - only the metadata.</p>
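
<p>The new schema default can be verified via the information schema; this is a sketch, assuming the schema name used in the statement above:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Show the default charset/collation applied to newly created tables.
SELECT SCHEMA_NAME, DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME
FROM information_schema.SCHEMATA
WHERE SCHEMA_NAME = 'production_schema';
</code></pre></div></div>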

<p>Then, we convert the default charset of all the tables to <code class="language-plaintext highlighter-rouge">utf8mb4</code>:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mysqldump <span class="s2">"</span><span class="nv">$updating_schema</span><span class="s2">"</span> |
  perl <span class="nt">-ne</span> <span class="s1">'print "ALTER TABLE $1 CHARACTER SET utf8mb4;\n" if /CREATE TABLE (.*) /'</span> |
  mysql <span class="s2">"</span><span class="nv">$updating_schema</span><span class="s2">"</span>
</code></pre></div></div>

<p>again, data is not changed. This operation will cause all the columns that don’t match the new charset (supposedly, all the existing character columns), to show the former (<code class="language-plaintext highlighter-rouge">utf8</code>) charset in their definition:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">#</span> <span class="k">before</span> <span class="p">(</span><span class="n">simplified</span><span class="p">)</span>

<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">mytable</span> <span class="p">(</span>
  <span class="n">intcol</span> <span class="nb">INT</span><span class="p">,</span>
  <span class="n">strcol</span> <span class="nb">CHAR</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span>
  <span class="n">strcol2</span> <span class="nb">CHAR</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="p">);</span>

<span class="o">#</span> <span class="k">after</span>

<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">mytable</span> <span class="p">(</span>
  <span class="n">intcol</span> <span class="nb">INT</span><span class="p">,</span>
  <span class="n">strcol</span> <span class="nb">CHAR</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="nb">CHARACTER</span> <span class="k">SET</span> <span class="n">utf8</span><span class="p">,</span>
  <span class="n">strcol2</span> <span class="nb">CHAR</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="nb">CHARACTER</span> <span class="k">SET</span> <span class="n">utf8</span>
<span class="p">)</span> <span class="k">DEFAULT</span> <span class="n">CHARSET</span><span class="o">=</span><span class="n">utf8mb4</span><span class="p">;</span>
</code></pre></div></div>

<p>This allows us to write a straight conversion command:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mysqldump <span class="nt">--no-data</span> <span class="nt">--skip-triggers</span> <span class="s2">"</span><span class="nv">$updating_schema</span><span class="s2">"</span> |
  egrep <span class="s1">'^CREATE TABLE|CHARACTER SET utf8\b'</span> |
  perl <span class="nt">-0777</span> <span class="nt">-pe</span> <span class="s1">'s/(CREATE TABLE [^\n]+ \(\n)+CREATE/CREATE/g'</span> | <span class="c"># remove tables without entries</span>
  perl <span class="nt">-0777</span> <span class="nt">-pe</span> <span class="s1">'s/,?\n(CREATE|$)/;\n$1/g'</span>  |                    <span class="c"># change comma of each last column def to semicolon (or add it)</span>
  perl <span class="nt">-pe</span> <span class="s1">'s/(CHARACTER SET utf8\b)/$1mb4/'</span> |                    <span class="c"># change charset</span>
  perl <span class="nt">-pe</span> <span class="s1">'s/  `/  MODIFY `/'</span> |                                  <span class="c"># add `MODIFY`</span>
  perl <span class="nt">-pe</span> <span class="s1">'s/^CREATE TABLE (.*) \(/ALTER TABLE $1/'</span>              <span class="c"># convert `CREATE TABLE ... (` to `ALTER TABLE`</span>
</code></pre></div></div>

<p>The output will consist of all the required <code class="language-plaintext highlighter-rouge">ALTER TABLE</code> statements, for example:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">ALTER</span> <span class="k">TABLE</span> <span class="nv">`mytable`</span>
  <span class="k">MODIFY</span> <span class="nv">`strcol`</span> <span class="nb">char</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="nb">CHARACTER</span> <span class="k">SET</span> <span class="n">utf8mb4</span> <span class="k">DEFAULT</span> <span class="k">NULL</span><span class="p">,</span>
  <span class="k">MODIFY</span> <span class="nv">`strcol2`</span> <span class="nb">char</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="nb">CHARACTER</span> <span class="k">SET</span> <span class="n">utf8mb4</span> <span class="k">DEFAULT</span> <span class="k">NULL</span><span class="p">;</span>
</code></pre></div></div>
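
<p>As a cross-check of the generated statements, the columns still using the old charset can be listed from the information schema. This is a sketch, assuming the schema is named <code class="language-plaintext highlighter-rouge">production_schema</code>; recent MySQL versions report the old charset as <code class="language-plaintext highlighter-rouge">utf8mb3</code>, so both names are matched:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Each row returned should have a corresponding MODIFY clause in the generated statements.
SELECT TABLE_NAME, COLUMN_NAME, COLUMN_TYPE, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'production_schema'
  AND CHARACTER_SET_NAME IN ('utf8', 'utf8mb3')
ORDER BY TABLE_NAME, ORDINAL_POSITION;
</code></pre></div></div>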

<h4 id="issue-columnindex-size-limits">Issue: Column/index size limits</h4>

<p>A database engine needs to know the maximum length of the stored data, in this case, text, because the data structures are subject to limits.</p>

<p>In relation to the utf8mb4 migration, the two related limits are:</p>

<ul>
  <li>the maximum length of a character column;</li>
  <li>the number of prefix characters stored in an index.</li>
</ul>

<p>In practice, something that may happen is that a table defined as such:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">mytable</span> <span class="p">(</span>
  <span class="n">longcol</span> <span class="nb">varchar</span><span class="p">(</span><span class="mi">21844</span><span class="p">)</span> <span class="nb">CHARACTER</span> <span class="k">SET</span> <span class="n">utf8</span>
<span class="p">);</span>
</code></pre></div></div>

<p>will cause an error when converting to utf8mb4:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">ALTER</span> <span class="k">TABLE</span> <span class="n">mytable</span> <span class="k">MODIFY</span> <span class="n">longcol</span> <span class="nb">varchar</span><span class="p">(</span><span class="mi">21844</span><span class="p">)</span> <span class="nb">CHARACTER</span> <span class="k">SET</span> <span class="n">utf8mb4</span><span class="p">;</span>
<span class="n">ERROR</span> <span class="mi">1074</span> <span class="p">(</span><span class="mi">42000</span><span class="p">):</span> <span class="k">Column</span> <span class="k">length</span> <span class="n">too</span> <span class="n">big</span> <span class="k">for</span> <span class="k">column</span> <span class="s1">'longcol'</span> <span class="p">(</span><span class="k">max</span> <span class="o">=</span> <span class="mi">16383</span><span class="p">);</span> <span class="n">use</span> <span class="nb">BLOB</span> <span class="k">or</span> <span class="nb">TEXT</span> <span class="k">instead</span>
</code></pre></div></div>

<p>because of the MySQL restriction of 65535 (2^16 - 1) bytes on the combined size of all the columns:</p>

<ul>
  <li>utf8:    21844 * 3 = 65532</li>
  <li>utf8mb4: 21844 * 4 = 87376 # too much</li>
  <li>utf8mb4: 16383 * 4 = 65532</li>
</ul>
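
<p>Columns that are certain to hit this error can be found ahead of time. The query below is a sketch, assuming the schema is named <code class="language-plaintext highlighter-rouge">production_schema</code>; since the limit applies to the combined size of all the columns, shorter columns may also fail, so an empty result is not a guarantee:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code># VARCHAR columns whose length exceeds the utf8mb4 maximum (16383 characters).
SELECT TABLE_NAME, COLUMN_NAME, CHARACTER_MAXIMUM_LENGTH
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'production_schema'
  AND DATA_TYPE = 'varchar'
  AND CHARACTER_SET_NAME IN ('utf8', 'utf8mb3')
  AND CHARACTER_MAXIMUM_LENGTH &gt; 16383;
</code></pre></div></div>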

<p>A similar restriction applies to index prefixes, although in this case there are two possible limits, 767 and 3072 bytes, depending on the row format and the long prefix option.</p>
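
<p>Indexed columns at risk can be searched in a similar way. The following is a sketch under a few assumptions: the schema is named <code class="language-plaintext highlighter-rouge">production_schema</code>, the 767 bytes limit is the one in effect (replace it with 3072 where appropriate), and each indexed column is evaluated on its own:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code># SUB_PART is the prefix length in characters; it's NULL when the whole column is indexed.
SELECT s.TABLE_NAME, s.INDEX_NAME, s.COLUMN_NAME,
       COALESCE(s.SUB_PART, c.CHARACTER_MAXIMUM_LENGTH) `indexed_chars`
FROM information_schema.STATISTICS s
     JOIN information_schema.COLUMNS c USING (TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME)
WHERE s.TABLE_SCHEMA = 'production_schema'
  AND c.CHARACTER_SET_NAME IN ('utf8', 'utf8mb3')
  AND COALESCE(s.SUB_PART, c.CHARACTER_MAXIMUM_LENGTH) * 4 &gt; 767; # 4 bytes per utf8mb4 character
</code></pre></div></div>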

<p>The restriction specifications can be found in the <a href="https://dev.mysql.com/doc/refman/5.7/en/innodb-restrictions.html#innodb-maximums-minimums">MySQL manual</a>.</p>

<p>If reducing the column width is not an option, the column will need to be converted to a <code class="language-plaintext highlighter-rouge">TEXT</code> data type.</p>
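
<p>For the table in the example above, the conversion could look like the following sketch; other column attributes (e.g. nullability) are omitted here, and need to be carried over from the real definition:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Change the data type and the charset in a single operation.
ALTER TABLE mytable MODIFY longcol TEXT CHARACTER SET utf8mb4;
</code></pre></div></div>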

<p>Note that using very long character columns should be carefully evaluated. Advanced DBAs know the implications, however it’s worth mentioning that in relation to the topic of internal temporary tables, character columns larger than 512 characters cause on-disk tables to be used; large object columns (<code class="language-plaintext highlighter-rouge">BLOB</code>/<code class="language-plaintext highlighter-rouge">TEXT</code>) don’t have this problem from version 8.0.3 onwards (see <a href="https://dev.mysql.com/doc/refman/8.0/en/internal-temporary-tables.html">MySQL manual</a>).<br />
Therefore, large object columns are suitable for a larger number of use cases than they were in the past.</p>

<h4 id="issue-triggersfunctions">Issue: Triggers/Functions</h4>

<p>Triggers and functions also require review.</p>

<p>Since they are executed outside the context of a connection, they carry their charset settings:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SHOW</span> <span class="n">TRIGGERS</span><span class="err">\</span><span class="k">G</span>
<span class="o">#</span> <span class="p">[...]</span>
<span class="o">#</span> <span class="n">character_set_client</span><span class="p">:</span> <span class="n">utf8</span>
<span class="o">#</span> <span class="n">collation_connection</span><span class="p">:</span> <span class="n">utf8_general_ci</span>
<span class="o">#</span>   <span class="k">Database</span> <span class="k">Collation</span><span class="p">:</span> <span class="n">utf8_general_ci</span>
</code></pre></div></div>

<p>On one hand, those properties can be updated at any point of the migration, as they act exactly as described in the <a href="#the-flexible-step-connection-configurations">connection configurations section</a>.</p>

<p>On the other hand, we need to take care of explicit <code class="language-plaintext highlighter-rouge">COLLATE</code> clauses involving columns being converted, if present.</p>
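
<p>Candidate triggers can be searched in the information schema. This is a rough sketch, assuming the schema is named <code class="language-plaintext highlighter-rouge">production_schema</code> and that the clause is written with a single space; the matches still need to be reviewed manually:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code># The escaped underscore matches `COLLATE utf8_...`, but not `COLLATE utf8mb4_...`.
SELECT TRIGGER_NAME, EVENT_OBJECT_TABLE, ACTION_STATEMENT
FROM information_schema.TRIGGERS
WHERE TRIGGER_SCHEMA = 'production_schema'
  AND ACTION_STATEMENT LIKE '%COLLATE utf8\_%';
</code></pre></div></div>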

<p>Suppose we have this statement:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="k">SET</span> <span class="o">@</span><span class="n">column_updated</span> <span class="p">:</span><span class="o">=</span> <span class="k">OLD</span><span class="p">.</span><span class="n">strcol</span> <span class="o">&lt;=&gt;</span> <span class="k">NEW</span><span class="p">.</span><span class="n">strcol</span> <span class="k">COLLATE</span> <span class="n">utf8_bin</span><span class="p">;</span>
</code></pre></div></div>

<p>If we migrate the column to <code class="language-plaintext highlighter-rouge">utf8mb4</code>, as soon as the <code class="language-plaintext highlighter-rouge">ALTER TABLE</code> completes, any operation associated with the trigger (e.g. <code class="language-plaintext highlighter-rouge">INSERT</code>) will <strong>always</strong> fail, because the <code class="language-plaintext highlighter-rouge">utf8_bin</code> collation is not compatible with the new <code class="language-plaintext highlighter-rouge">utf8mb4</code> charset.</p>

<p>The solution is fairly simple - the trigger needs to be dropped before the <code class="language-plaintext highlighter-rouge">ALTER TABLE</code>, and recreated after. This, of course, can be a serious challenge for high-traffic websites.</p>
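
<p>The sequence can be sketched as follows; the trigger name and timing are hypothetical, and the body is reduced to the statement shown above, with the collation updated to the <code class="language-plaintext highlighter-rouge">utf8mb4</code> counterpart:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code># `mytable_before_update` is a made-up name, used only for illustration.
DROP TRIGGER IF EXISTS mytable_before_update;

ALTER TABLE mytable MODIFY strcol CHAR(1) CHARACTER SET utf8mb4;

# Between the DROP and this statement, the trigger logic is not applied.
CREATE TRIGGER mytable_before_update BEFORE UPDATE ON mytable
FOR EACH ROW
  SET @column_updated := OLD.strcol &lt;=&gt; NEW.strcol COLLATE utf8mb4_bin;
</code></pre></div></div>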

<h4 id="issue-joins-between-columns-with-heterogeneous-charsets">Issue: Joins between columns with heterogeneous charsets</h4>

<p>Inevitably, some tables will be converted before others; even assuming parallel conversion, it’s not possible (without locking) to synchronize the end of the conversion of a set of given tables.</p>

<p>This creates a problem for a specific case: JOINs between columns of heterogeneous charsets - in practice, between a <code class="language-plaintext highlighter-rouge">utf8</code> column and an <code class="language-plaintext highlighter-rouge">utf8mb4</code> one.</p>

<p>In theory, this shouldn’t be a problem in itself. Let’s see what MySQL does in this case; let’s create a couple of tables:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">utf8_table</span> <span class="p">(</span>
  <span class="n">mb3col</span> <span class="nb">CHAR</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="nb">CHARACTER</span> <span class="k">SET</span> <span class="n">utf8</span><span class="p">,</span>
  <span class="k">KEY</span> <span class="nv">`mb3idx`</span> <span class="p">(</span><span class="n">mb3col</span><span class="p">)</span>
<span class="p">);</span>

<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">utf8_table</span>
<span class="k">VALUES</span> <span class="p">(</span><span class="s1">'a'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'b'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'c'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'d'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'e'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'f'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'g'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'h'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'i'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'j'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'k'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'l'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'m'</span><span class="p">),</span>
       <span class="p">(</span><span class="s1">'n'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'o'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'p'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'q'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'r'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'s'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'t'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'u'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'v'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'w'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'x'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'y'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'z'</span><span class="p">);</span>

<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">utf8mb4_table</span> <span class="p">(</span>
  <span class="n">mb4col</span> <span class="nb">CHAR</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="nb">CHARACTER</span> <span class="k">SET</span> <span class="n">utf8mb4</span><span class="p">,</span>
  <span class="k">KEY</span> <span class="nv">`mb4idx`</span> <span class="p">(</span><span class="n">mb4col</span><span class="p">)</span>
<span class="p">);</span>

<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">utf8mb4_table</span>
<span class="k">VALUES</span> <span class="p">(</span><span class="s1">'a'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'b'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'c'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'d'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'e'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'f'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'g'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'h'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'i'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'j'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'k'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'l'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'m'</span><span class="p">),</span>
       <span class="p">(</span><span class="s1">'n'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'o'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'p'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'q'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'r'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'s'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'t'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'u'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'v'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'w'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'x'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'y'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'z'</span><span class="p">),</span>
       <span class="p">(</span><span class="s1">'🍕'</span><span class="p">);</span>
</code></pre></div></div>

<p>First, let’s see what happens for simple index scans.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">EXPLAIN</span> <span class="k">SELECT</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">utf8mb4_table</span> <span class="k">WHERE</span> <span class="n">mb4col</span> <span class="o">=</span> <span class="n">_utf8</span><span class="s1">'n'</span><span class="p">;</span>
<span class="o">#</span> <span class="o">+</span><span class="c1">----+-------------+---------------+------------+------+---------------+--------+---------+-------+------+----------+-------------+</span>
<span class="o">#</span> <span class="o">|</span> <span class="n">id</span> <span class="o">|</span> <span class="n">select_type</span> <span class="o">|</span> <span class="k">table</span>         <span class="o">|</span> <span class="n">partitions</span> <span class="o">|</span> <span class="k">type</span> <span class="o">|</span> <span class="n">possible_keys</span> <span class="o">|</span> <span class="k">key</span>    <span class="o">|</span> <span class="n">key_len</span> <span class="o">|</span> <span class="k">ref</span>   <span class="o">|</span> <span class="k">rows</span> <span class="o">|</span> <span class="n">filtered</span> <span class="o">|</span> <span class="n">Extra</span>       <span class="o">|</span>
<span class="o">#</span> <span class="o">+</span><span class="c1">----+-------------+---------------+------------+------+---------------+--------+---------+-------+------+----------+-------------+</span>
<span class="o">#</span> <span class="o">|</span>  <span class="mi">1</span> <span class="o">|</span> <span class="k">SIMPLE</span>      <span class="o">|</span> <span class="n">utf8mb4_table</span> <span class="o">|</span> <span class="k">NULL</span>       <span class="o">|</span> <span class="k">ref</span>  <span class="o">|</span> <span class="n">mb4idx</span>        <span class="o">|</span> <span class="n">mb4idx</span> <span class="o">|</span> <span class="mi">5</span>       <span class="o">|</span> <span class="n">const</span> <span class="o">|</span>    <span class="mi">1</span> <span class="o">|</span>   <span class="mi">100</span><span class="p">.</span><span class="mi">00</span> <span class="o">|</span> <span class="k">Using</span> <span class="k">index</span> <span class="o">|</span>
<span class="o">#</span> <span class="o">+</span><span class="c1">----+-------------+---------------+------------+------+---------------+--------+---------+-------+------+----------+-------------+</span>

<span class="k">SHOW</span> <span class="n">WARNINGS</span><span class="err">\</span><span class="k">G</span>
<span class="o">#</span> <span class="p">[...]</span>
<span class="o">#</span> <span class="n">Message</span><span class="p">:</span> <span class="cm">/* select#1 */</span> <span class="k">select</span> <span class="k">count</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="k">AS</span> <span class="nv">`COUNT(*)`</span> <span class="k">from</span> <span class="nv">`db`</span><span class="p">.</span><span class="nv">`utf8mb4_table`</span> <span class="k">where</span> <span class="p">(</span><span class="nv">`db`</span><span class="p">.</span><span class="nv">`utf8mb4_table`</span><span class="p">.</span><span class="nv">`mb4col`</span> <span class="o">=</span> <span class="s1">'n'</span><span class="p">)</span>
</code></pre></div></div>

<p>Interestingly, it seems that MySQL converts the string literal before it reaches the optimizer; this is valuable knowledge, because with the current constraint(s), we can rely on the indexes as much as we did before the migration started.</p>

<p>What happens with JOINs?</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">EXPLAIN</span> <span class="k">SELECT</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">FROM</span> <span class="n">utf8_table</span> <span class="k">JOIN</span> <span class="n">utf8mb4_table</span> <span class="k">ON</span> <span class="n">mb3col</span> <span class="o">=</span> <span class="n">mb4col</span><span class="p">;</span>
<span class="o">#</span> <span class="o">+</span><span class="c1">----+-------------+---------------+------------+-------+---------------+--------+---------+------+------+----------+--------------------------+</span>
<span class="o">#</span> <span class="o">|</span> <span class="n">id</span> <span class="o">|</span> <span class="n">select_type</span> <span class="o">|</span> <span class="k">table</span>         <span class="o">|</span> <span class="n">partitions</span> <span class="o">|</span> <span class="k">type</span>  <span class="o">|</span> <span class="n">possible_keys</span> <span class="o">|</span> <span class="k">key</span>    <span class="o">|</span> <span class="n">key_len</span> <span class="o">|</span> <span class="k">ref</span>  <span class="o">|</span> <span class="k">rows</span> <span class="o">|</span> <span class="n">filtered</span> <span class="o">|</span> <span class="n">Extra</span>                    <span class="o">|</span>
<span class="o">#</span> <span class="o">+</span><span class="c1">----+-------------+---------------+------------+-------+---------------+--------+---------+------+------+----------+--------------------------+</span>
<span class="o">#</span> <span class="o">|</span>  <span class="mi">1</span> <span class="o">|</span> <span class="k">SIMPLE</span>      <span class="o">|</span> <span class="n">utf8_table</span>    <span class="o">|</span> <span class="k">NULL</span>       <span class="o">|</span> <span class="k">index</span> <span class="o">|</span> <span class="k">NULL</span>          <span class="o">|</span> <span class="n">mb3idx</span> <span class="o">|</span> <span class="mi">4</span>       <span class="o">|</span> <span class="k">NULL</span> <span class="o">|</span>   <span class="mi">26</span> <span class="o">|</span>   <span class="mi">100</span><span class="p">.</span><span class="mi">00</span> <span class="o">|</span> <span class="k">Using</span> <span class="k">index</span>              <span class="o">|</span>
<span class="o">#</span> <span class="o">|</span>  <span class="mi">1</span> <span class="o">|</span> <span class="k">SIMPLE</span>      <span class="o">|</span> <span class="n">utf8mb4_table</span> <span class="o">|</span> <span class="k">NULL</span>       <span class="o">|</span> <span class="k">ref</span>   <span class="o">|</span> <span class="n">mb4idx</span>        <span class="o">|</span> <span class="n">mb4idx</span> <span class="o">|</span> <span class="mi">5</span>       <span class="o">|</span> <span class="n">func</span> <span class="o">|</span>    <span class="mi">1</span> <span class="o">|</span>   <span class="mi">100</span><span class="p">.</span><span class="mi">00</span> <span class="o">|</span> <span class="k">Using</span> <span class="k">where</span><span class="p">;</span> <span class="k">Using</span> <span class="k">index</span> <span class="o">|</span>
<span class="o">#</span> <span class="o">+</span><span class="c1">----+-------------+---------------+------------+-------+---------------+--------+---------+------+------+----------+--------------------------+</span>
</code></pre></div></div>

<p>What’s <code class="language-plaintext highlighter-rouge">func</code>?</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SHOW</span> <span class="n">WARNINGS</span><span class="err">\</span><span class="k">G</span>
<span class="o">#</span> <span class="n">Message</span><span class="p">:</span> <span class="cm">/* select#1 */</span> <span class="k">select</span> <span class="k">count</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="k">AS</span> <span class="nv">`COUNT(*)`</span> <span class="k">from</span> <span class="nv">`db`</span><span class="p">.</span><span class="nv">`utf8_table`</span> <span class="k">join</span> <span class="nv">`db`</span><span class="p">.</span><span class="nv">`utf8mb4_table`</span> <span class="k">where</span> <span class="p">(</span><span class="k">convert</span><span class="p">(</span><span class="nv">`db`</span><span class="p">.</span><span class="nv">`utf8_table`</span><span class="p">.</span><span class="nv">`mb3col`</span> <span class="k">using</span> <span class="n">utf8mb4</span><span class="p">)</span> <span class="o">=</span> <span class="nv">`db`</span><span class="p">.</span><span class="nv">`utf8mb4_table`</span><span class="p">.</span><span class="nv">`mb4col`</span><span class="p">)</span>
</code></pre></div></div>

<p>Very interesting; we see what MySQL does in this case: it iterates <code class="language-plaintext highlighter-rouge">utf8_table.mb3col</code> (specifically, it iterates the index <code class="language-plaintext highlighter-rouge">mb3idx</code>), and it converts each value to <code class="language-plaintext highlighter-rouge">utf8mb4</code>, so that it can be sought in the <code class="language-plaintext highlighter-rouge">utf8mb4_table.mb4idx</code> index.</p>

<p>Note that this is a simple case; more complex JOINs in the app should still be carefully reviewed.</p>
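
<p>As for the direction of the conversion seen in the JOIN above (from <code class="language-plaintext highlighter-rouge">utf8</code> to <code class="language-plaintext highlighter-rouge">utf8mb4</code>, not the other way around): every character that fits in the 3-byte <code class="language-plaintext highlighter-rouge">utf8</code> charset is also valid <code class="language-plaintext highlighter-rouge">utf8mb4</code>, while the opposite doesn’t hold. The following is a minimal Ruby sketch of this property; the helper name <code class="language-plaintext highlighter-rouge">fits_utf8mb3?</code> is hypothetical, and the check only looks at the encoded byte length of each character:</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper: true if no character of the string needs 4 bytes in
# UTF-8, i.e. if the string is representable in MySQL's 3-byte `utf8` charset.
def fits_utf8mb3?(string)
  string.encode("UTF-8").each_char.none? { |char| char.bytesize == 4 }
end

puts fits_utf8mb3?("n")  # true: 1 byte
puts fits_utf8mb3?("🍕") # false: 4 bytes
</code></pre></div></div>

<p>A check of this kind can also be handy for spotting application-side strings that could not have been stored before the migration.</p>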

<h3 id="step-3-altering-the-schema-and-tables">Step 3: Altering the schema and tables</h3>

<p>Now we can proceed to alter the production schema.</p>

<p>The schema encoding can be changed without any worry, as it’s not a locking operation (up to v5.7, database properties are stored in a separate file, <code class="language-plaintext highlighter-rouge">db.opt</code>).</p>

<p>The table changes are the “big deal”: we need to perform them without locking, and with an awareness of the implications.</p>

<p>In order to avoid table locking, we use <a href="https://github.com/github/gh-ost">gh-ost</a>, which is easy to use and well-documented.</p>

<p>Generally speaking, each <code class="language-plaintext highlighter-rouge">ALTER TABLE</code> of the list generated in <a href="#step-2-preparing-the-the-alter-statements">the previous step</a> must be converted to a <code class="language-plaintext highlighter-rouge">gh-ost</code> command and executed.</p>

<p>For example, this DDL statement:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">ALTER</span> <span class="k">TABLE</span> <span class="nv">`mytable`</span>
  <span class="k">MODIFY</span> <span class="nv">`strcol`</span> <span class="nb">char</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="nb">CHARACTER</span> <span class="k">SET</span> <span class="n">utf8mb4</span> <span class="k">DEFAULT</span> <span class="k">NULL</span><span class="p">,</span>
  <span class="k">MODIFY</span> <span class="nv">`strcol2`</span> <span class="nb">char</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="nb">CHARACTER</span> <span class="k">SET</span> <span class="n">utf8mb4</span> <span class="k">DEFAULT</span> <span class="k">NULL</span><span class="p">;</span>
</code></pre></div></div>

<p>needs to be performed as [simplified form]:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gh-ost <span class="nt">--database</span><span class="o">=</span><span class="s2">"</span><span class="nv">$production_schema</span><span class="s2">"</span> <span class="nt">--table</span><span class="o">=</span><span class="s2">"mytable"</span> <span class="nt">--alter</span><span class="o">=</span><span class="s2">"
  CHARACTER SET utf8mb4,
  MODIFY </span><span class="sb">`</span>strcol<span class="sb">`</span><span class="s2"> char(1) CHARACTER SET utf8mb4 DEFAULT NULL,
  MODIFY </span><span class="sb">`</span>strcol2<span class="sb">`</span><span class="s2"> char(1) CHARACTER SET utf8mb4 DEFAULT NULL
"</span>
</code></pre></div></div>

<p>This is a fairly simple procedure. Don’t forget to run <code class="language-plaintext highlighter-rouge">ANALYZE TABLE</code> on each table after it’s been rebuilt.</p>
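
<p>When the list of statements is long, the conversion from <code class="language-plaintext highlighter-rouge">ALTER TABLE</code> statement to gh-ost invocation can be scripted. What follows is a minimal Ruby sketch, not the tooling we actually use: the helper name <code class="language-plaintext highlighter-rouge">ghost_command</code> and the sample schema name are hypothetical, the statement is assumed to have the simple form shown above, and only the gh-ost flags shown above are generated (connection and execution flags are left out):</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code>require "shellwords"

# Hypothetical helper: converts an `ALTER TABLE` statement into the argument
# list of a gh-ost invocation. Connection and execution flags are not included.
def ghost_command(schema, alter_statement, charset: "utf8mb4")
  match = alter_statement.strip.match(/\AALTER\s+TABLE\s+`([^`]+)`\s+(.+?);?\z/mi)
  raise ArgumentError, "Unsupported statement: #{alter_statement}" if match.nil?

  table, clauses = match.captures
  # The table default charset is changed together with the columns.
  alter = "CHARACTER SET #{charset}, #{clauses.gsub(/\s+/, ' ')}"

  ["gh-ost", "--database=#{schema}", "--table=#{table}", "--alter=#{alter}"]
end

statement = "ALTER TABLE `mytable`\n" \
            "  MODIFY `strcol` char(1) CHARACTER SET utf8mb4 DEFAULT NULL,\n" \
            "  MODIFY `strcol2` char(1) CHARACTER SET utf8mb4 DEFAULT NULL;"

# Prints the command with shell escaping applied (e.g. to the backticks).
puts ghost_command("production_schema", statement).shelljoin
</code></pre></div></div>

<p>Returning an array rather than a string makes it possible to pass the arguments directly to <code class="language-plaintext highlighter-rouge">system</code>, without going through shell quoting at all.</p>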

<p>The problem that some users will have is triggers; gh-ost doesn’t support tables with triggers, so an alternative procedure needs to be applied by high-traffic websites using this functionality.</p>

<h3 id="warnings">Warnings</h3>

<p>Little gotchas to be aware of!</p>

<h4 id="other-schemas">Other schemas</h4>

<p>Don’t forget to convert the other schemas as well!</p>

<p>In particular, if you’re on AWS, the schema <code class="language-plaintext highlighter-rouge">tmp</code> will need to be converted. Forgetting to do so may cause errors if this database is used for temporary data operations that involve the main production database.</p>

<h4 id="always-run-analyze-table">Always run <code class="language-plaintext highlighter-rouge">ANALYZE TABLE</code></h4>

<p>It’s crucial to always run an <code class="language-plaintext highlighter-rouge">ANALYZE TABLE</code> for each table rebuilt. Gh-ost builds tables via successive inserts, and it’s good (MySQL) DBA practice to:</p>

<blockquote>
  <p>run ANALYZE TABLE after loading substantial data into an InnoDB table, or creating a new index for one</p>
</blockquote>

<p>See the <a href="https://dev.mysql.com/doc/refman/5.7/en/analyze-table.html">MySQL manual</a> for more information.</p>

<h4 id="dont-rush-the-drop-table">Don’t rush the <code class="language-plaintext highlighter-rouge">DROP TABLE</code></h4>

<p>Gh-ost doesn’t delete the old table after replacing it - it only renames it. Be very careful when deleting it; a straight <code class="language-plaintext highlighter-rouge">DROP TABLE</code> may flood the server with I/O.</p>

<p>Internally, we have a script for dropping large tables that first drops the indexes one by one, then deletes the records in chunks, and only at the end drops the (now empty) table.</p>

<h2 id="notes-about-mathias-bynens-post-on-the-same-subject">Notes about Mathias Bynens’ post on the same subject</h2>

<p>There’s <a href="https://mathiasbynens.be/notes/mysql-utf8mb4#utf8-to-utf8mb4">a popular post about the same subject</a>, by a V8 developer (Mathias Bynens).</p>

<p>A couple of concepts are worth considering:</p>

<blockquote>
  <p># For each table<br />
REPAIR TABLE table_name;<br />
OPTIMIZE TABLE table_name;</p>
</blockquote>

<p>From this, it can be deduced that the author uses MyISAM, as InnoDB doesn’t support <code class="language-plaintext highlighter-rouge">REPAIR TABLE</code> (see the <a href="https://dev.mysql.com/doc/refman/5.7/en/repair-table.html">MySQL manual</a>).</p>

<blockquote>
  <p>make sure to repair and optimize all databases and tables […] ran into some weird bugs where UPDATE statements didn’t have any effect, even though no errors were thrown</p>
</blockquote>

<p>This is very likely a bug, and based on the previous point, it may be MyISAM-related (or related to <code class="language-plaintext highlighter-rouge">ALTER TABLE</code>). MyISAM has been essentially abandoned for a long time, and we’ve experienced buggy behaviors as well (although not in the context of charsets), so it wouldn’t be a surprise; the post is also very old (2012).</p>

<p>We’re entirely on InnoDB, and we didn’t experience any issue when changing the charset via <code class="language-plaintext highlighter-rouge">ALTER TABLE</code> (small tables in our model have been done this way). It’s also worth considering that gh-ost alters tables by creating an empty table and slowly filling it, which is different from issuing an <code class="language-plaintext highlighter-rouge">ALTER TABLE</code>.</p>

<p>If somebody still wanted to do a rebuild of the table, note that <code class="language-plaintext highlighter-rouge">OPTIMIZE TABLE</code> performs a full rebuild followed by <code class="language-plaintext highlighter-rouge">ANALYZE TABLE</code>, so it’s not required to run the latter statement separately.</p>

<h2 id="conclusion">Conclusion</h2>

<p>Considering that migrating a database to <code class="language-plaintext highlighter-rouge">utf8mb4</code> implies literally rebuilding the entire database’s data, it’s been a ride with relatively few bumps.</p>

<p>The core issue is handling JOINs between columns being migrated; it may not be a trivial matter, but it’s possible to get deterministic behavior with a thorough analysis.</p>

<p>Projects planning to move to MySQL 8.0 are encouraged to perform this step in advance, so that as many upgrade-related changes as possible are out of the way before the upgrade itself.</p>

<p>All in all, migrating to <code class="language-plaintext highlighter-rouge">utf8mb4</code> is a very significant change, but by knowing where to look, it’s possible to perform it smoothly.</p>

<h2 id="footnotes">Footnotes</h2>

<p><a name="footnote01">¹</a> Very likely, partial indexes are a fit solution to this problem, but they’re not supported by MySQL.</p>]]></content><author><name></name></author><category term="mysql" /><category term="databases" /><category term="mysql" /><category term="sysadmin" /><summary type="html"><![CDATA[We’re in the process of upgrading our MySQL databases from v5.7 to v8.0; since one of the differences in v8.0 is that the default encoding changed from utf8 to utf8mb4, and we had the conversion in plan anyway, we anticipated it and performed it as preliminary step for the upgrade. This post describes in depth the overall experience, including tooling and pitfalls, and related subjects. Contents: Introduction Migration plan: overview and considerations !! COLLATION WARNING !! Free step: connection configuration How do charset settings affect database operations? Step 2: Preparing the the ALTER statements Issue: Column/index size limits Issue: Triggers/Functions Issue: Joins between columns with heterogeneous charsets Step 3: Altering the schema and tables Warnings Other schemas Always run ANALYZE TABLE Don’t rush the DROP TABLE Notes about Mathias Bynens’ post on the same subject Conclusion Footnotes]]></summary></entry><entry><title type="html">Dropping a database column in production without waiting time and/or schema-aware code, on a MySQL/Rails setup</title><link href="https://saveriomiroddi.github.io/Dropping-a-database-column-in-production-without-waiting-time-and-or-schema-aware-code-on-a-mysql-rails-setup/" rel="alternate" type="text/html" title="Dropping a database column in production without waiting time and/or schema-aware code, on a MySQL/Rails setup" /><published>2019-02-12T00:00:00+00:00</published><updated>2019-02-12T00:00:00+00:00</updated><id>https://saveriomiroddi.github.io/Dropping-a-database-column-in-production-without-waiting-time-and-or-schema-aware-code-on-a-mysql-rails-setup</id><content type="html" 
xml:base="https://saveriomiroddi.github.io/Dropping-a-database-column-in-production-without-waiting-time-and-or-schema-aware-code-on-a-mysql-rails-setup/"><![CDATA[<p>We recently had to drop a column in production, from a relatively large (order of 10⁷ records) table.</p>

<p>On modern MySQL setups, dropping a column doesn’t lock the table (it does, actually, but for a relatively short time); however, we wanted to improve a very typical Rails migration scenario in a few ways:</p>

<ol>
  <li>offloading the column dropping time from the deploy;</li>
  <li>ensuring that, between the moment the column is dropped and the moment the app servers are restarted, the app doesn’t raise errors due to the expectation that the column is present;</li>
  <li>not overloading the database with I/O.</li>
</ol>

<p>I’ll give the Gh-ost tool a brief introduction, and show how to fulfill the above requirements in a simple way, by using this tool and an ActiveRecord flag.</p>

<p>This workflow can be applied to almost any table alteration scenario.</p>

<p>Contents:</p>

<ul>
  <li><a href="/Dropping-a-database-column-in-production-without-waiting-time-and-or-schema-aware-code-on-a-mysql-rails-setup#gh-ost">Gh-ost</a></li>
  <li><a href="/Dropping-a-database-column-in-production-without-waiting-time-and-or-schema-aware-code-on-a-mysql-rails-setup#setup-and-workflow">Setup and workflow</a>
    <ul>
      <li><a href="/Dropping-a-database-column-in-production-without-waiting-time-and-or-schema-aware-code-on-a-mysql-rails-setup#existing-configuration">Existing configuration</a></li>
      <li><a href="/Dropping-a-database-column-in-production-without-waiting-time-and-or-schema-aware-code-on-a-mysql-rails-setup#configure-activerecord-for-ignoring-the-column-and-performing-the-deploy">Configure ActiveRecord for ignoring the column, and performing the deploy</a></li>
      <li><a href="/Dropping-a-database-column-in-production-without-waiting-time-and-or-schema-aware-code-on-a-mysql-rails-setup#using-gh-ost-to-drop-the-column">Using gh-ost to drop the column</a></li>
      <li><a href="/Dropping-a-database-column-in-production-without-waiting-time-and-or-schema-aware-code-on-a-mysql-rails-setup#remove-the-ignored_columns-and-redeploy">Remove the <code class="language-plaintext highlighter-rouge">ignored_columns</code> and redeploy</a></li>
    </ul>
  </li>
  <li><a href="/Dropping-a-database-column-in-production-without-waiting-time-and-or-schema-aware-code-on-a-mysql-rails-setup#conclusion">Conclusion</a></li>
</ul>

<h2 id="gh-ost">Gh-ost</h2>

<p>Gh-ost is a relatively recent tool by GitHub, which allows online table modifications without locking.</p>

<p>Tools like gh-ost existed before - the first being <code class="language-plaintext highlighter-rouge">mk-online-schema-change</code> (now <code class="language-plaintext highlighter-rouge">pt-online-schema-change</code>), developed by Percona.</p>

<p>The Percona tool relies on triggers in order to achieve the objective, which is a good enough, stable, solution. However, there are a <a href="https://github.com/github/gh-ost/blob/master/doc/why-triggerless.md">variety of reasons</a> that (can) make the tool inadequate for high-load conditions.</p>

<p>Gh-ost introduced the novel idea of reading from the binary log (which logs all the write operations) in order to reproduce the writes on the temporary table.</p>

<p>Gh-ost can be run in different setups; this article will show the simplest one.</p>

<h2 id="setup-and-workflow">Setup and workflow</h2>

<h3 id="existing-configuration">Existing configuration</h3>

<p>Let’s assume the following table:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="nv">`customers`</span> <span class="p">(</span>
  <span class="c1">-- column definitions</span>
  <span class="nv">`source_id`</span> <span class="nb">int</span><span class="p">(</span><span class="mi">11</span><span class="p">)</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
  <span class="c1">-- index definitions</span>
  <span class="k">KEY</span> <span class="nv">`index_customers_on_source_id`</span> <span class="p">(</span><span class="nv">`source_id`</span><span class="p">)</span>
<span class="p">);</span>
</code></pre></div></div>

<p>with the corresponding model:</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Customer</span> <span class="o">&lt;</span> <span class="no">ApplicationRecord</span>
  <span class="c1"># model content</span>
<span class="k">end</span>
</code></pre></div></div>

<p>and migration:</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">DropCustomersSourceId</span> <span class="o">&lt;</span> <span class="no">ActiveRecord</span><span class="o">::</span><span class="no">Migration</span><span class="p">[</span><span class="mf">5.2</span><span class="p">]</span>
  <span class="k">def</span> <span class="nf">change</span>
    <span class="n">remove_column</span> <span class="ss">:customers</span><span class="p">,</span> <span class="ss">:source_id</span>
  <span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>

<h3 id="configure-activerecord-for-ignoring-the-column-and-performing-the-deploy">Configure ActiveRecord for ignoring the column, and performing the deploy</h3>

<p>First, we tackle point #2. Let’s have a look at the stages of a typical deploy with migrations:</p>

<ol>
  <li>the deploy starts: various operations are performed, including copying the new codebase to a release directory, without the app servers actually (re)loading it;</li>
  <li>the migrations are executed - in this case, with an underlying <code class="language-plaintext highlighter-rouge">ALTER TABLE</code> statement, which will take a long time;</li>
  <li>the <em>current</em> release directory is linked to the new codebase, and the app servers (processes) are restarted;</li>
  <li>other operations are performed.</li>
</ol>

<p>The problem is that between stages 2 and 3 (and also, depending on the app server configuration, while the processes restart), the app servers will have in memory the old version of the codebase, which expects <code class="language-plaintext highlighter-rouge">customers.source_id</code> to be present.</p>

<p>Although this time is relatively short, in a high-load environment, if a <code class="language-plaintext highlighter-rouge">Customer</code> instance is saved, the operation will fail, because ActiveRecord will include the column in the underlying INSERT.</p>

<p>In systems engineering, a schema-aware code strategy is sometimes applied: essentially, writing code in the form “if the schema is <code class="language-plaintext highlighter-rouge">foo</code>, do <code class="language-plaintext highlighter-rouge">bar</code>, otherwise, do <code class="language-plaintext highlighter-rouge">baz</code>”.</p>

<p>In the case of a column drop, we have at our disposal a “cheap” schema-aware strategy: <code class="language-plaintext highlighter-rouge">ignored_columns</code> (see the <a href="https://github.com/rails/rails/pull/21720">Rails PR</a>).</p>

<p>This directive makes ActiveRecord entirely ignore a column, so that the column can disappear at any time, without ActiveRecord noticing.</p>

<p>Let’s update the model:</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Customer</span> <span class="o">&lt;</span> <span class="no">ApplicationRecord</span>
  <span class="nb">self</span><span class="p">.</span><span class="nf">ignored_columns</span> <span class="o">=</span> <span class="sx">%w(source_id)</span>
  <span class="c1"># model content</span>
<span class="k">end</span>
</code></pre></div></div>

<p>and the migration:</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">DropCustomersSourceId</span> <span class="o">&lt;</span> <span class="no">ActiveRecord</span><span class="o">::</span><span class="no">Migration</span><span class="p">[</span><span class="mf">5.2</span><span class="p">]</span>
  <span class="k">def</span> <span class="nf">change</span>
    <span class="n">remove_column</span> <span class="ss">:customers</span><span class="p">,</span> <span class="ss">:source_id</span> <span class="k">unless</span> <span class="n">is_production_environment?</span>
  <span class="k">end</span>

  <span class="k">def</span> <span class="nf">is_production_environment?</span>
    <span class="c1"># choose strategy</span>
  <span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
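
<p>The body of <code class="language-plaintext highlighter-rouge">is_production_environment?</code> depends on the project. One possible strategy, shown here as a plain Ruby sketch and not necessarily the one to adopt, is to rely on the <code class="language-plaintext highlighter-rouge">RAILS_ENV</code> environment variable; the optional <code class="language-plaintext highlighter-rouge">env</code> parameter is an addition made only to keep the method testable outside of Rails:</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code># One possible strategy (an assumption): the environment is considered
# production only when RAILS_ENV is explicitly set to "production".
def is_production_environment?(env = ENV)
  env.fetch("RAILS_ENV", "development") == "production"
end

puts is_production_environment?("RAILS_ENV" =&gt; "production") # true
puts is_production_environment?({})                          # false
</code></pre></div></div>

<p>Inside a Rails application, <code class="language-plaintext highlighter-rouge">Rails.env.production?</code> is the idiomatic equivalent.</p>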

<p>We can now perform the deploy; this time, the table column will not be dropped. After the deploy, we will use gh-ost, as outlined in the next section.</p>

<h3 id="using-gh-ost-to-drop-the-column">Using gh-ost to drop the column</h3>

<p>Gh-ost is pretty straightforward to use. In this context it’s used in the simplest way possible, that is, running directly on master.</p>

<p>Note that there are many options available, including:</p>

<ul>
  <li>sharing the load with slaves,</li>
  <li>regulating the I/O load,</li>
  <li>not including the password in the command (for security reasons).</li>
</ul>

<p>A summary document is available <a href="https://github.com/github/gh-ost/blob/master/doc/cheatsheet.md">here</a>; gh-ost has good documentation.</p>

<p>The sample command we use is:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ GHOST_TABLE</span><span class="o">=</span><span class="s2">"customers"</span>
<span class="nv">$ GHOST_ALTER</span><span class="o">=</span><span class="s2">"DROP source_id"</span>

<span class="nv">$ </span>gh-ost <span class="se">\</span>
    <span class="nt">--user</span><span class="o">=</span><span class="s2">"</span><span class="nv">$GHOST_USER</span><span class="s2">"</span> <span class="nt">--password</span><span class="o">=</span><span class="s2">"</span><span class="nv">$GHOST_PASSWORD</span><span class="s2">"</span> <span class="nt">--host</span><span class="o">=</span><span class="s2">"</span><span class="nv">$GHOST_HOST</span><span class="s2">"</span> <span class="se">\</span>
    <span class="nt">--database</span><span class="o">=</span><span class="s2">"</span><span class="nv">$GHOST_SCHEMA</span><span class="s2">"</span> <span class="nt">--table</span><span class="o">=</span><span class="s2">"</span><span class="nv">$GHOST_TABLE</span><span class="s2">"</span> <span class="nt">--alter</span><span class="o">=</span><span class="s2">"</span><span class="nv">$GHOST_ALTER</span><span class="s2">"</span> <span class="se">\</span>
    <span class="nt">--allow-on-master</span> <span class="nt">--exact-rowcount</span> <span class="nt">--verbose</span> <span class="nt">--execute</span>
</code></pre></div></div>

<p>The options are clear; <code class="language-plaintext highlighter-rouge">--exact-rowcount</code> will trade a little execution time for more accurate progress estimation.</p>

<p>Gh-ost will create a temporary (in a logical, not SQL, sense) table, slowly fill it while applying the writes performed on the original table in the meantime, then swap the two (with negligible locking time).</p>

<p>A crucial detail is that gh-ost will leave the original table in the database, renamed (in this case, <code class="language-plaintext highlighter-rouge">_customers_del</code>).</p>

<p>Although there is an option to drop the table automatically, <strong>do not enable it, and do not attempt to drop the table manually either</strong>: dropping a large table creates a large amount of I/O, due to MySQL freeing the buffer pool pages, which will likely grind the database system to a halt for some time. Instead, one should follow a progressive table drop workflow:</p>

<ul>
  <li>drop the indexes (optionally, individually);</li>
  <li>delete the records in batches;</li>
  <li>drop the (now empty) table.</li>
</ul>

<p>Between each drop/deletion, <code class="language-plaintext highlighter-rouge">SLEEP</code> calls should be performed, in order to ensure that the writes are fully flushed.</p>

<p>Internally, we have a script for this, and it’s advised to find or develop something similar.</p>

<p>Of course, <code class="language-plaintext highlighter-rouge">SLEEP</code> can be replaced with more sophisticated strategies (e.g. relying on the server statistics to track the I/O); however, in our system, <code class="language-plaintext highlighter-rouge">SLEEP</code> is a simple yet perfectly adequate strategy.</p>
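
<p>The progressive drop workflow described above can be sketched as follows. This is not our internal script, only a minimal Ruby outline under a few assumptions: the helper name <code class="language-plaintext highlighter-rouge">progressive_drop</code> is hypothetical, <code class="language-plaintext highlighter-rouge">execute</code> is any callable that runs a statement and returns the number of affected rows (for example, a thin wrapper around a MySQL client), and <code class="language-plaintext highlighter-rouge">pause</code> is a callable that waits between statements:</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper: progressively drops a large table.
# `execute` runs a SQL statement and returns the number of affected rows;
# `pause` is invoked after each drop/deletion (e.g. `lambda { sleep 5 }`).
def progressive_drop(table, indexes:, batch_size:, execute:, pause:)
  # 1. Drop the indexes, one at a time.
  indexes.each do |index|
    execute.call("ALTER TABLE `#{table}` DROP INDEX `#{index}`")
    pause.call
  end

  # 2. Delete the records in batches, until a batch deletes nothing.
  loop do
    deleted = execute.call("DELETE FROM `#{table}` LIMIT #{batch_size}")
    pause.call
    break if deleted.zero?
  end

  # 3. Drop the (now empty) table.
  execute.call("DROP TABLE `#{table}`")
end
</code></pre></div></div>

<p>Injecting <code class="language-plaintext highlighter-rouge">execute</code> and <code class="language-plaintext highlighter-rouge">pause</code> keeps the sketch independent of any specific client library, and makes it possible to do a dry run by passing a callable that only prints the statements and returns 0.</p>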

<h3 id="remove-the-ignored_columns-and-redeploy">Remove the <code class="language-plaintext highlighter-rouge">ignored_columns</code> and redeploy</h3>

<p>At this point, in production, Rails will be completely unaware of the existence (or not) of the column (being) dropped.</p>

<p>After the column is dropped, we can remove the <code class="language-plaintext highlighter-rouge">Customer.ignored_columns</code> directive, and deploy any time (or even wait for the next deploy).</p>

<h2 id="conclusion">Conclusion</h2>

<p>We’ve been using gh-ost for a long time by now, and we’ve developed a surrounding tooling ecosystem.</p>

<p>Once one gets used to such workflows, it’s actually satisfying to perform “push-button” table alterations without any locking or performance drop in general, instead of being worried about the impact of (relatively) large-scale db operations.</p>

<p>Paraphrasing the typical joke:</p>

<ul>
  <li>Did you notice the downtime today during the migration?</li>
  <li>WHAT!?! NO!</li>
  <li>Exactly.</li>
</ul>

<p>;-)</p>]]></content><author><name></name></author><category term="mysql" /><category term="ruby" /><category term="rails" /><category term="mysql" /><category term="databases" /><summary type="html"><![CDATA[We recently had to drop a column in production, from a relatively large (order of 10⁷ records) table. On modern MySQL setups, dropping a column doesn’t lock the table (it does, actually, but for a relatively short time), however, we wanted to improve a very typical Rails migration scenario in a few ways: offloading the column dropping time from the deploy; ensuring that in the time between the column is dropped and the app servers restarted, the app doesn’t raise errors due to the expectation that the column is present; not overloading the database with I/O. I’ll give the Gh-ost tool a brief introduction, and show how to fulfill the above requirements in a simple way, by using this tool and an ActiveRecord flag. This workflow can be applied to almost any table alteration scenario. Contents: Gh-ost Setup and workflow Existing configuration Configure ActiveRecord for ignoring the column, and performing the deploy Using gh-ost to drop the column Remove the ignored_columns and redeploy Conclusion]]></summary></entry></feed>