Friday, April 26, 2024
HomeJavaUtilizing native JSON variables to emulate window features in MySQL 5.7

Utilizing native JSON variables to emulate window features in MySQL 5.7


One in every of MySQL 8’s greatest enhancements is the assist of window features. As I all the time mentioned in conferences, there’s SQL earlier than window features and SQL after window features. When you begin utilizing them, you’ll use them all over the place.

A few of you poor souls are unlucky sufficient to be caught on MySQL 5.7, both of your individual selecting, or since you’re utilizing a clone / fork that’s nonetheless 5.7 suitable. Whereas for most individuals, this weblog publish is simply on your amusement, or nostalgia, for a few of you this publish will probably be fairly helpful.

Utilizing native variables

Loads of Stack Overflow questions or weblog posts on the market present the identical outdated trick utilizing native variables. In a procedural context, native variables make excellent sense. For instance, this assertion batch.

SET @c = (SELECT COUNT(*) FROM information_schema.tables);
-- Extra processing
-- Return the end result:
SELECT @c;

A bit hairier is the truth that these native variables will be declared inside a question, and incremented procedurally inside a question:

SELECT
  a, 

  -- Use and increment your variable in SELECT
  @rn := @rn + 1
FROM 
  (
    SELECT 3 AS a UNION ALL
    SELECT 4 AS a    
  ) AS t, 

  -- Declare your variable in FROM
  (SELECT @rn := 0) r
ORDER BY a;

And growth, you’ve a ROW_NUMBER() OVER (ORDER BY a) window operate! The end result being:

|a  |@rn := @rn + 1|
|---|--------------|
|3  |1             |
|4  |2             |

This works fairly by the way, as a result of the expression incrementing the row quantity “occurs to” be evaluated within the desired order, row by row, due to the question’s ORDER BY a clause. Revert it:

SELECT
  a, @rn := @rn + 1
FROM (
  SELECT 3 AS a UNION ALL
  SELECT 4 AS a    
) AS t, (SELECT @rn := 0) r
ORDER BY a DESC;

And you continue to get the specified end result:

|a  |@rn := @rn + 1|
|---|--------------|
|4  |1             |
|3  |2             |

That is actually bushy, as a result of it violates the concept of SQL’s logical order of operations, which most RDBMS agree upon. It assumes ORDER BY “occurs earlier than” SELECT, simply because the optimiser chooses to do issues this fashion. You’ll be able to tamper with the optimiser and break the “characteristic” simply, e.g. by including DISTINCT:

SELECT DISTINCT
  a, @rn := @rn + 1
FROM (
  SELECT 3 AS a UNION ALL
  SELECT 4 AS a    
) AS t, (SELECT @rn := 0) r
ORDER BY a DESC;

Now the result’s now not what we needed (how might it probably be?):

|a  |@rn := @rn + 1|
|---|--------------|
|4  |2             |
|3  |1             |

The reason being that DISTINCT is often applied utilizing a form or a hashmap, each is not going to protect any ordering, and in response to the aforementioned logical order of operations, that is completely nice, as a result of ORDER BY is purported to “occur after” SELECT and after DISTINCT, not less than logically.

However when you’re cautious, and canopy every part with sufficient assessments, you possibly can nonetheless use this trick. In any case, being caught with MySQL 5.7 is already painful sufficient, so why not deal with your self to an “virtually window operate”.

Observe: Simply to point how a lot of a nasty concept relying on this incidental characteristic is, MySQL 8.x now points a deprecation warning:

Setting person variables inside expressions is deprecated and will probably be eliminated in a future launch. Take into account options: ‘SET variable=expression, …’, or ‘SELECT expression(s) INTO variables(s)’.

The principle cause I’ve seen this syntax getting used on Stack Overflow to this point is to emulate ROW_NUMBER, so, I’d say, good riddance (now that MySQL 8 has window operate assist)

PARTITION BY utilizing ORDER BY

What I haven’t seen a lot on Stack Overflow or in blogs, is PARTITION BY assist. Most options I’ve seen use ORDER BY to implement partitioning, which is okay. For instance:

SELECT
  a, b,
  ROW_NUMBER() OVER (PARTITION BY a ORDER BY b DESC) AS rn1,
  IF (
    @prev = a, 
    @rn := @rn + 1, 
    CASE WHEN (@prev := a) IS NOT NULL OR TRUE THEN @rn := 1 END
  ) AS rn2
FROM (
  SELECT 1 AS a, 3 AS b UNION ALL
  SELECT 2 AS a, 4 AS b UNION ALL
  SELECT 1 AS a, 5 AS b UNION ALL
  SELECT 2 AS a, 6 AS b
) AS t, (SELECT @rn := 0, @prev := NULL) r
ORDER BY a, b DESC;

Producing:

|a  |b  |rn1|rn2|
|---|---|---|---|
|1  |5  |1  |1  |
|1  |3  |2  |2  |
|2  |6  |1  |1  |
|2  |4  |2  |2  |

Just a few notes:

  • The specified PARTITION BY and ORDER BY clauses each must be mirrored within the prime degree question. If you happen to solely needed to ORDER BY b DESC, not ORDER BY a as effectively, robust luck. (If you wish to mess around with this, strive eradicating the ROW_NUMBER() operate, which additionally orders stuff by a, implicitly)
  • I’ve tried to place all of the variable project logic right into a single expression as a way to keep away from any further columns being generated. This makes the expression a bit extra ugly than it wanted to be.

PARTITION BY utilizing JSON

A extra sturdy, however maybe slower strategy to emulating PARTITION BY could be to keep up a JSON object that retains monitor of every partition key’s ROW_NUMBER(), as a result of why not?

Behold this magnificence:

SELECT
  a, b,
  ROW_NUMBER() OVER (PARTITION BY a ORDER BY b DESC) AS rn1,
  json_extract(
    @rn := json_set(
      @rn, @path := concat('$."', a, '"'), 
      (coalesce(json_extract(@rn, @path), 0) + 1)
    ), 
    @path
  ) AS rn2,
  @rn AS debug -- Added for debugging functions solely
FROM (
  SELECT 1 AS a, 3 AS b UNION ALL
  SELECT 2 AS a, 4 AS b UNION ALL
  SELECT 1 AS a, 5 AS b UNION ALL
  SELECT 2 AS a, 6 AS b
) AS t, (SELECT @rn := '{}') r
ORDER BY b DESC;

Take a look at the outcomes:

|a  |b  |rn1|rn2|debug               |
|---|---|---|---|--------------------|
|2  |6  |1  |1.0|{"1": 2.0, "2": 1.0}|
|1  |5  |1  |1.0|{"1": 1.0}          |
|2  |4  |2  |2.0|{"1": 2.0, "2": 2.0}|
|1  |3  |2  |2.0|{"1": 2.0}          |

You’ll be able to do this on MySQL 5.7 (eradicating the ROW_NUMBER(), after all), and also you’ll see this works completely nice! How does it work?

  • We begin with an empty object {} within the FROM clause.
  • On each row that’s by the way ordered by the ORDER BY b DESC clause, we’ll extract the row quantity worth for the partition key PARTITION BY a. That is performed with a dynamically created JSON path expression concat('$."', a, '"'). For instance: $."1" or $."2".
  • At first, this worth is NULL, after all, so we flip it to zero with COALESCE(<expr>, 0).
  • We add 1 to it
  • Then we JSON_SET the worth again into the article, assigning the end result again to @rn.
  • Then, we re-extract the worth we’ve simply calculated

This could possibly be simplified a bit if it wasn’t only a single expression, however since I’m considering of implementing this emulation in jOOQ, I needed to do the train of retaining the projection unchanged (think about, the jOOQ person writes ROW_NUMBER() with jOOQ, and desires this to “simply work”).

Caveats:

  • If the PARTITION BY clause has a number of expressions, then the composite worth must be used as a key, e.g. utilizing some “not possible” concatenation token (a token that may’t seem within the information set), or a hash worth (risking collisions, after all), or an extra lookup, making issues fairly sophisticated.
  • The concat('$."', a, '"') expression doesn’t correctly quote a but, in case it comprises double quotes.
  • If a number of distinct window operate calculations with distinct ORDER BY clauses are required, then this strategy received’t work as simply. It could be attainable to calculate issues with one derived desk nest degree per window operate (?). Nonetheless, a number of distinct PARTITION BY clauses are nice. Simply generate a separate @rn variable per distinct PARTITOIN BY clause.
  • The JSON doc would possibly lose information sort info. For instance, in JSON, numbers could also be represented as floats, so when you require decimal precision, maybe you need to work with JSON strings as an alternative, and forged issues forwards and backwards, all the time valuing correctness over efficiency.

Do you assume you’ve seen every part? Let’s do one thing much more bushy:

DENSE_RANK with PARTITION BY

We received’t cease right here, as a result of as soon as we’ve chosen this loopy path, we’d as effectively see it to the top. Let’s emulate DENSE_RANK(), which is a bit tougher, making the SQL extra “stunning”:

SELECT
  a, b,
  DENSE_RANK() OVER (PARTITION BY a ORDER BY b DESC) AS rn1,
  json_extract(
    @rn := json_set(@rn, 
      @rnpath := concat('$."rn-', a, '"'), 
      (coalesce(json_extract(@rn, @rnpath), 0) + IF (
        json_extract(@rn, @prepath := concat('$."pre-v-', a, '"')) = b, 
        0, 1
      )),
      @prepath,
      b
    ), 
    @rnpath
  ) AS rn2,
  @rn AS debug
FROM (
  SELECT 1 AS a, 3 AS b UNION ALL
  SELECT 1 AS a, 5 AS b UNION ALL
  SELECT 1 AS a, 5 AS b UNION ALL
  SELECT 2 AS a, 6 AS b
) AS t, (SELECT @rn := '{}') r
ORDER BY b DESC;

Right here’s the end result:

|a  |b  |rn1|rn2|debug                                                 |
|---|---|---|---|------------------------------------------------------|
|2  |6  |1  |1.0|{"rn-1": 2.0, "rn-2": 1.0, "pre-v-1": 3, "pre-v-2": 6}|
|1  |5  |1  |1.0|{"rn-1": 1.0, "pre-v-1": 5}                           |
|1  |5  |1  |1.0|{"rn-1": 1.0, "pre-v-1": 5}                           |
|1  |3  |2  |2.0|{"rn-1": 2.0, "pre-v-1": 3}                           |

How does it differ?

  • We now have to recollect not simply the earlier row quantity worth per partition ("rn-1", "rn-2"), but additionally the earlier worth of b (the ORDER BY standards) per partition ("pre-v-1", "pre-v-2").
  • Then, we increment the row quantity per partition provided that the earlier worth is totally different from the present worth

Caveats:

  • There can nonetheless be a number of PARTITION BY expressions, in addition to path escaping issues, see caveats of ROW_NUMBER for particulars.
  • If there are a number of ORDER BY columns, the "pre-v-n" values must keep in mind their composite worth, e.g. by nesting a JSON object. This can be a bit less complicated to consider than a number of PARTITION BY expressions

Furry sufficient? Let’s go deeper

RANK with PARTITION BY

Who would have thought that RANK is tougher than DENSE_RANK (see this text for a direct comparability of the features). Now, along with remembering the earlier ordering worth per partition, we additionally want to recollect the earlier rank per partition (all of the whereas persevering with to rely up the row quantity).

Observe, you possibly can refactor this to one thing extra readable when you take away the jOOQ imposed single expression restriction, however the place’s the problem in that, proper? Right here it’s, bow earlier than it in awe (or terror):

SELECT
  a, b,
  RANK() OVER (PARTITION BY a ORDER BY b DESC) AS rn1,
  coalesce(
    json_extract(
      @rn := json_set(@rn, 
        @rnpath := concat('$."rn-', a, '"'), 
        @currn := coalesce(json_extract(@rn, @rnpath), 0) + 1,
        @prevpath := concat('$."pre-v-', a, '"'),
        b,
        @prernpath := concat('$."pre-rn-', a, '"'),
        IF (json_extract(@rn, @prevpath) = b, 
          coalesce(json_extract(@rn, @prernpath), @currn) div 1,
          @currn
        )
      ), 
      @prernpath
    ), 
    @currn
  ) AS rn2,
  @rn AS debug
FROM (
  SELECT 1 AS a, 3 AS b UNION ALL
  SELECT 1 AS a, 5 AS b UNION ALL
  SELECT 1 AS a, 5 AS b UNION ALL
  SELECT 2 AS a, 6 AS b
) AS t, (SELECT @rn := '{}') r
ORDER BY b DESC;

It produces:

|a  |b  |rn1|rn2|debug                                                                                   |
|---|---|---|---|----------------------------------------------------------------------------------------|
|2  |6  |1  |1.0|{"rn-1": 3.0, "rn-2": 1.0, "pre-v-1": 3, "pre-v-2": 6, "pre-rn-1": 3.0, "pre-rn-2": 1.0}|
|1  |5  |1  |1.0|{"rn-1": 1.0, "pre-v-1": 5, "pre-rn-1": 1.0}                                            |
|1  |5  |1  |1.0|{"rn-1": 2.0, "pre-v-1": 5, "pre-rn-1": 1.0}                                            |
|1  |3  |3  |3.0|{"rn-1": 3.0, "pre-v-1": 3, "pre-rn-1": 3.0}                                            |

How does it work? “Merely”:

Caveats:

PERCENT_RANK and CUME_DIST

I’m not satisfied that these will be emulated with the native variable primarily based strategy. In precept:

  • PERCENT_RANK() OVER w is simply (RANK() OVER w - 1) / (COUNT(*) OVER () - 1)
  • CUME_DIST() OVER w is simply (RANK() OVER w) / (COUNT(*) OVER ())

However as we’ll see beneath, it’s not attainable (I feel?) to emulate COUNT(*) OVER () utilizing this native variable primarily based strategy. You possibly can, possibly, do one other spherical of calculations when wrapping issues in a derived desk, although.

LEAD, LAG, and so forth.

A few of these can be emulated with the above method, particularly those which can be “backward trying”.

  • LAG: For instance, with LAG, we simply have to recollect once more the "pre-v-x" for every partition, and produce it once more on the present row. Relying on the LAG‘s OFFSET, we’d have to preserve round a JSON array of values, all the time appending the present worth to the array, and eradicating the primary worth, like in a FIFO queue.
  • LEAD: The ahead trying features simply must reverse the ORDER BY clause. For instance, all LEAD features will be applied with LAG as effectively.
  • FIRST_VALUE: This can be a bit less complicated than LAG, as we don’t must preserve a whole JSON array of values. We simply keep in mind the primary one, after which preserve reproducing this.
  • LAST_VALUE is once more simply the inverse of FIRST_VALUE with reversed ORDER BY clause.
  • NTH_VALUE wants a counter per partition, to make sure we catch the Nth worth. Alternatively, we are able to once more retailer every part in a JSON array till it reaches dimension N.
  • IGNORE NULLS will be applied by skipping all of the NULL values from being entered into the aforementioned FIFO queue
  • Issues get a bit trickier when there’s a RANGE or ROWS clause, in case of which the JSON array / FIFO queue needs to be shifted. This impacts FIRST_VALUE greater than LEAD, I’d say.

The precise implementation is left as an train to the person. (In all probability about time to contemplate upgrading to MySQL 8, by now!)

Mixture features

All SQL combination features will be became window features by appending OVER (). For instance:

  • SUM(x) is an combination operate, aggregating information per group generated by the GROUP BY clause, shared by your complete question.
  • SUM(x) OVER () is the corresponding window operate, aggregating information per partition generated by the PARTITION BY clause per window operate (or fairly, per specific or implicit window specification)

Since these beforehand mentioned native variable primarily based approaches are row-by-row primarily based calculations, I don’t assume it’s attainable to emulate partition extensive combination window features, as a result of these require with the ability to take a look at your complete partition, together with rows that haven’t but been projected.

Nonetheless (by no means hand over!), some window frames will be emulated additionally for combination features, particularly the backward trying ones. For simplicity, I’ll simply strive emulating this:

SUM(b) OVER (
  PARTITION BY a
  ORDER BY b DESC
  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)

Observe: with out specific window body, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is implicit, and that signifies that as a way to embody tied rows within the sum, we’d must once more be ahead trying, which I don’t assume is feasible with the native variable row-by-row primarily based strategy.

Nonetheless, it could be attainable to emulate various backward trying ROWS frames. That train is once more left to the reader.

So, let’s do that:

SELECT
  a, b,
  SUM(b) OVER w AS sum1,
  json_extract(
    @w := json_set(@w, 
      @spath := concat('$."s-', a, '"'), 
      (coalesce(json_extract(@w, @spath), 0) + b),
      @cpath := concat('$."c-', a, '"'), 
      (coalesce(json_extract(@w, @cpath), 0) + 1)
    ), 
    @spath
  ) AS sum2,
  COUNT(*) OVER w AS cnt1,
  json_extract(@w, @cpath) AS cnt2,
  AVG(b) OVER w AS avg1,
  json_extract(@w, @spath) / json_extract(@w, @cpath) AS avg2,
  @w AS debug
FROM (
  SELECT 1 AS a, 3 AS b UNION ALL
  SELECT 1 AS a, 5 AS b UNION ALL
  SELECT 1 AS a, 5 AS b UNION ALL
  SELECT 2 AS a, 6 AS b
) AS t, (SELECT @w := '{}') r
WINDOW w AS (
  PARTITION BY a 
  ORDER BY b DESC 
  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)
ORDER BY b DESC;

The output appears appropriate:

|a  |b  |sum1|sum2|cnt1|cnt2|avg1  |avg2        |debug                                            |
|---|---|----|----|----|----|------|------------|-------------------------------------------------|
|2  |6  |6   |6.0 |1   |1.0 |6     |6           |{"c-1": 3.0, "c-2": 1.0, "s-1": 13.0, "s-2": 6.0}|
|1  |5  |5   |5.0 |1   |1.0 |5     |5           |{"c-1": 1.0, "s-1": 5.0}                         |
|1  |5  |10  |10.0|2   |2.0 |5     |5           |{"c-1": 2.0, "s-1": 10.0}                        |
|1  |3  |13  |13.0|3   |3.0 |4.3333|4.3333333333|{"c-1": 3.0, "s-1": 13.0}                        |

Notes:

  • I’ve saved all of the window calculations in the identical JSON object, assuming all of them share the identical window specification, to allow them to reuse their values (e.g. AVG(x) = SUM(x) / COUNT(x), and thus AVG(x) OVER w = SUM(x) OVER w / COUNT(x) OVER w)
  • Aside from that, issues work just about identical to for ROW_NUMBER()

Conclusion

This has been a enjoyable weblog publish to write down. I hope it was useful to you both as an train to consider what window features actually do, or within the worst case, that can assist you poor soul really implement issues this fashion on MySQL 5.7.

There have been plenty of caveats. This emulation strategy doesn’t all the time work and makes (heavy) assumptions about your question. For instance:

  • You’ll be able to’t use DISTINCT
  • You’ll be able to’t use arbitrary ORDER BY clauses that don’t match the window operate’s
  • You’ll be able to’t use a number of window features with totally different window specs
  • You’ll be able to’t use ahead trying window frames (together with frameless combination window features that combination your complete partition)

There are most likely extra caveats that haven’t been mentioned right here. If you happen to’re diligent, and take a look at issues closely, nonetheless, you would possibly be capable to pull off utilizing these approaches. Good luck (and don’t blame me 😅)

RELATED ARTICLES

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Most Popular

Recent Comments