You are here
Databases are standardized but in detail they behave different
For a fancy application we want to query a chunk of rows from a table and therefore we need the minimum and the maximum of the Primary Key of these rows.
Because InnoDB is an Index Organized Table or Index Clustered Table we know that this access will use the Primary Key. But to be sure and to be compliant with the standard (and compatible) we use and ORDER BY
on the Primary Key.
MySQL 5.7
First we create some test data:
mysql> CREATE TABLE t_my ( ID CHAR(32) NOT NULL PRIMARY KEY ) ENGINE = InnoDB; mysql> INSERT INTO t_my SELECT MD5(RAND()) FROM t_my; ... create more than 10 rows mysql> SELECT id FROM t_my ORDER BY id LIMIT 11; +----------------------------------+ | id | +----------------------------------+ | 01a6e76643c83c91867636ce90a8def5 | | 0ea1b1670343b4e70dd449207c720957 | | 141ec92e809c1d6af83d27e8a3e74fe7 | | 1605890e2c0244b019e6f66cc94790f2 | | 19826d67b6013ed3bc1105b9708959c4 | | 1a9ffd320187831df939d596c9a50aa1 | | 24ae3a883803f5ae8416754593cd881c | | 27e614f1b4490a6db1b26364e467d361 | | 285e3d84b81d97a40d66049d2f30071f | | 2db85e2f2639d637ee21888ca34334d7 | | 2f0e944ca977826730c352a1920cda1f | +----------------------------------+ 11 rows in set (0.00 sec)
Now comes the interesting part: We want the minimum and the maximum of the first chunk:
mysql> SELECT MIN(id), MAX(id) FROM t_my ORDER BY id LIMIT 10; +----------------------------------+----------------------------------+ | MIN(id) | MAX(id) | +----------------------------------+----------------------------------+ | 01a6e76643c83c91867636ce90a8def5 | f685b7269d76f47e7517cdd5fc4253bf | +----------------------------------+----------------------------------+
And this is completely not expected (aka wrong?)! Instead of the highest Primary Key of the chunk MySQL returns the highest Primary Key of the whole table:
mysql> SELECT MIN(id), MAX(id) FROM t_my; +----------------------------------+----------------------------------+ | MIN(id) | MAX(id) | +----------------------------------+----------------------------------+ | 01a6e76643c83c91867636ce90a8def5 | f685b7269d76f47e7517cdd5fc4253bf | +----------------------------------+----------------------------------+
If we look at how this is executed we see that the MySQL Optimizer took some kind of short cut:
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+------------------------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+------------------------------+ | 1 | SIMPLE | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Select tables optimized away | +----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+------------------------------+
And from this result it does a LIMIT 10
on just one tuple which ends up in the result above. If this is a bug or not I cannot say. But it is at least not what I want.
Now we want to know how other RDBMS are dealing with this problem:
PostgreSQL 11
shell> su - postgres shell> psql postgres=# CREATE TABLE t_pg ( id char(32) PRIMARY KEY ); postgres=# INSERT INTO t_pg SELECT MD5(RANDOM()::TEXT) FROM t_pg; ... create more that 10 rows postgres=# SELECT id FROM t_pg LIMIT 11; id ---------------------------------- 9433b59ec7e14b1232743b3bdcc745a0 56c59c2ce35e79b8f4141160b6dbcb69 dbfe35456b12b741c7e20a973a65fcac f5bbd52a92c7c0f63b5bdf14e0b1b020 e9b11d9243c701155f43506f7da95076 aba50026e35562d867398ddb5e1ffc37 586b98cfb8d7b19bf09f32e611298be5 3fd768fda852972d096a015be675233c 8c33a72edf0479298093b83a2d53ad59 98dfbe2979df25d8169747ee15bced07 5f7594d8b9de2694b4438d62579d658d (11 rows)
Now comes the interesting part:
postgres=# SELECT MIN(id), MAX(id) FROM t_pg ORDER BY id LIMIT 10; ERROR: column "t_pg.id" must appear in the GROUP BY clause or be used in an aggregate function LINE 1: SELECT MIN(id), MAX(id) FROM t_pg ORDER BY id LIMIT 10; ^
PostgreSQL does NOT even allow this query. And complains about the id
in the ORDER BY
clause. If we write the query "correctly" we get a completely different result. Which is not usable for us:
postgres=# SELECT MIN(id), MAX(id) FROM t_pg GROUP BY id ORDER BY id LIMIT 10; min | max ----------------------------------+---------------------------------- 03581a898bcedb0fb2bbb842be2fdaf5 | 03581a898bcedb0fb2bbb842be2fdaf5 277ea3f40d4431cb9f41ac37848605f0 | 277ea3f40d4431cb9f41ac37848605f0 3d007edf4cb9b9ffed10d74ef30f6a4b | 3d007edf4cb9b9ffed10d74ef30f6a4b 3fd768fda852972d096a015be675233c | 3fd768fda852972d096a015be675233c 56c59c2ce35e79b8f4141160b6dbcb69 | 56c59c2ce35e79b8f4141160b6dbcb69 586b98cfb8d7b19bf09f32e611298be5 | 586b98cfb8d7b19bf09f32e611298be5 591ab5d306827cc7a8a3f5d9ee780edc | 591ab5d306827cc7a8a3f5d9ee780edc 5c33d18b907638956469d54630307b9d | 5c33d18b907638956469d54630307b9d 5f7594d8b9de2694b4438d62579d658d | 5f7594d8b9de2694b4438d62579d658d 6245ee76fbbe48d99b359deda7e38c0a | 6245ee76fbbe48d99b359deda7e38c0a (10 rows)
If we do it this way, which is not what we want, we get, similar to MySQL the "wrong" result.
postgres=# SELECT MIN(id), MAX(id) FROM t_pg LIMIT 10; min | max ----------------------------------+---------------------------------- 03581a898bcedb0fb2bbb842be2fdaf5 | f5bbd52a92c7c0f63b5bdf14e0b1b020
This result is the same like the maximum Primary Key:
postgres=# SELECT MIN(id), MAX(id) FROM t_pg; min | max ----------------------------------+---------------------------------- 03581a898bcedb0fb2bbb842be2fdaf5 | f5bbd52a92c7c0f63b5bdf14e0b1b020
If PostgreSQL does it like this, it at least according to the standards, I hope. But NOT what I want.
SQLite 3
The third candidate we have chosen is SQLite3:
sqlite> CREATE TABLE t_sl ( id TEXT PRIMARY KEY );
Here was the challenging part to make the MD5()
function work. Luckily we found a nearly perfect solution on StackOverflow.
sqlite> INSERT INTO t_sl SELECT HEX(MD5(RANDOM())) FROM t_sl; ... create more that 10 rows sqlite> SELECT id FROM t_sl LIMIT 11; 467FE9B4EC744D1B4C21C1405936E863 F7A2E0BF53EA5243A734A1FAACCD1D28 3DE4FC5680C9F2E3AB9E7EA4BE7F6D69 0878C0298916B1FBFE7808263CA1703D 56332C0BC2EBCB3D960167CF475B9581 9F8661DE560EF8040B205A58224A2251 5F8A7807F56E604DC8BB595FE0F579B4 A015A9539F3966930F17EE4B545271F6 714E544157E871CE826E5923F84AA096 5FD9F4F71739AB75BD60B94F303973AA 96113F12CDDFEC20E98BA621783E0A6C
Same results as MySQL and PostgreSQL and still not what I want:
sqlite> SELECT MIN(id), MAX(id) FROM t_sl ORDER BY id LIMIT 10; 0249461D8D3516D513F18DE3BC4CE677|FB75ECCB85BDDA88E5B8D48CF056B1CC sqlite> SELECT MIN(id), MAX(id) FROM t_sl; 0249461D8D3516D513F18DE3BC4CE677|FB75ECCB85BDDA88E5B8D48CF056B1CC
Oracle 18
Now let us have a look how Oracle does it:
sqlplus> CREATE TABLE t_ora ( id CHAR(32) NOT NULL PRIMARY KEY ); sqlplus> INSERT INTO t_ora SELECT STANDARD_HASH(dbms_random.random, 'MD5') FROM dual; ... create more that 10 rows sqlplus> SELECT DISTINCT id FROM t_ora; 264270A59D9EE04668C8F298DF3DF184 CCE8CA725CD633FC1AC2C73C32F0EAF4 9B4C28001530BA8FA8F682597576B88C 215D03A9E409D99C6EB9EAD11CD722CA 770F99E6D2A4929DEE4D54214D1C99E4 39A48517FB5B58403C85317F02AFC167 7E2D9C597602637634C7164811C3FA15 1826607210B081381254F7D1D061B25E 6984D425379806E16AA81424438003BA EBBCAE0E0B263A223D70E94D1020CCE9 7E854FFBAA1EAB1D7E18ADB085975C40 06215B1756DA5926D27BDB0BC47DDEA9 D3CC548842BD1F978326CAED25518533 E80B9F42585E42F1AAE3726EC49FEE7F 62FA08D4EF65DEBEC08F219C1DC4F583 7ADECA10EACA57F9D75AD1CDB4E7C965
Here we get completely confusing results with ROWNUM
:
sqlplus> SELECT MIN(id), MAX(id) FROM t_ora WHERE ROWNUM <= 10; MIN(ID) MAX(ID) -------------------------------- -------------------------------- 1826607210B081381254F7D1D061B25E EBBCAE0E0B263A223D70E94D1020CCE9 sqlplus> SELECT MIN(id), MAX(id) FROM t_ora WHERE ROWNUM <= 5; MIN(ID) MAX(ID) -------------------------------- -------------------------------- 06FDD744640D35C1E527AB05B8379349 C5EA7113397CF161A951ECBB80E1DFEF sqlplus> SELECT MIN(id), MAX(id) FROM t_ora WHERE ROWNUM <= 6; MIN(ID) MAX(ID) -------------------------------- -------------------------------- 06FDD744640D35C1E527AB05B8379349 E6692DFBF57EB110CAE2169A4204C37B sqlplus> SELECT MIN(id), MAX(id) FROM t_ora WHERE ROWNUM <= 7; MIN(ID) MAX(ID) -------------------------------- -------------------------------- 06FDD744640D35C1E527AB05B8379349 E6692DFBF57EB110CAE2169A4204C37B sqlplus> SELECT MIN(id), MAX(id) FROM t_ora WHERE ROWNUM <= 8; MIN(ID) MAX(ID) -------------------------------- -------------------------------- 06FDD744640D35C1E527AB05B8379349 E6692DFBF57EB110CAE2169A4204C37B sqlplus> SELECT MIN(id), MAX(id) FROM t_ora WHERE ROWNUM <= 9; MIN(ID) MAX(ID) -------------------------------- -------------------------------- 06FDD744640D35C1E527AB05B8379349 EF34D8DBC0832BFD813939FB1840804D sqlplus> SELECT MIN(id), MAX(id) FROM t_ora WHERE ROWNUM <= 2 order by id ; MIN(ID) MAX(ID) -------------------------------- -------------------------------- B54068FF7AC56D6D8200F4E44410DCC6 C5EA7113397CF161A951ECBB80E1DFEF sqlplus> SELECT MIN(id), MAX(id) FROM t_ora WHERE ROWNUM <= 3 order by id ; MIN(ID) MAX(ID) -------------------------------- -------------------------------- 5D7FB42D47FB5F6F220B3872B48AC8ED C5EA7113397CF161A951ECBB80E1DFEF sqlplus> SELECT MIN(id), MAX(id) FROM t_ora WHERE ROWNUM <= 4 order by id; MIN(ID) MAX(ID) -------------------------------- -------------------------------- 1CF93D49926CC3D9F83FC639B423A7F7 C5EA7113397CF161A951ECBB80E1DFEF
This is explainable according to my contact because Oracle does not guarantee ROWNUM
and it is a new set of data (ROWNUM
Pseudocolumn).
But rewriting the query would at least give us the right result.
sqlplus> SELECT MIN(id), MAX(id) FROM (SELECT id FROM t_ora ORDER BY id) WHERE ROWNUM <= 5 ; MIN(ID) MAX(ID) -------------------------------- -------------------------------- 06FDD744640D35C1E527AB05B8379349 32ABDC3657FAE00B8EEE6EB2C42C12F1 sqlplus> SELECT MIN(id), MAX(id) FROM (SELECT id FROM t_ora ORDER BY id) WHERE ROWNUM <= 6 ; MIN(ID) MAX(ID) -------------------------------- -------------------------------- 06FDD744640D35C1E527AB05B8379349 38FC2D4348D7A44C834D008F5B7BBD5E
I am pretty sure that this Query Execution Plan on Oracle would also look pretty bad and will not perform on huge data sets as we plan to have for our fancy application.
And in this case Oracle behaves like the other databases:
sqlplus> SELECT MIN(id), MAX(id) FROM t_ora FETCH FIRST 10 ROWS ONLY ; MIN(ID) MAX(ID) -------------------------------- -------------------------------- 06FDD744640D35C1E527AB05B8379349 F5D8266D3646C57F6AB3E6001D305A2C
Thanks to Markus R. for assistance on Oracle!
SQL Server 2019
And finally the last candidate to test is Microsoft SQL Server 2019 on Ubuntu 18.04:
shell> sqlcmd -S localhost -U SA mssql> CREATE TABLE t_ms ( ID CHAR(32) NOT NULL PRIMARY KEY ) GO mssql> INSERT INTO t_ms SELECT CONVERT(VARCHAR(32), HashBytes('MD5', STR(RAND(), 25, 20)), 2) GO ... create more than 10 rows mssql> SELECT id FROM t_ms ORDER BY id OFFSET 0 ROWS FETCH NEXT 11 ROWS ONLY GO id -------------------------------- 3203A25102554923CA11BD80C99D2728 3A0D79AA2466AE0EA580295FD5C81145 3DFE9A0A1FDD2654C6BBB24680D13B15 6A5CF25DDFE0278674EE98E02B5C4B38 8CAC11376C31E22E4AEF8214F31AA36B 96C5362832286577A3FB72A840855DFA A0CA369CAAE540A2A3E92317DC5B939F A5543C6D1244357CA89DD8B16A85E9EF B98148167FC627C0EDF632C981E9296B C50B5BC636BC29D2EDB13C5C1749F7D9 E633FD894FBFD6CAA20C7C4182D8EEBC (11 rows affected)
Similar error like PostgreSQL:
mssql> SELECT MIN(id), MAX(id) FROM t_ms ORDER BY id OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY GO Msg 8127, Level 16, State 1, Server ubuntu1804, Line 10 Column "t_ms.ID" is invalid in the ORDER BY clause because it is not contained in either an aggregate function or the GROUP BY clause.
And unusable results as with the other RDBMS:
mssql> SELECT MIN(id), MAX(id) FROM t_ms GROUP BY id ORDER BY id OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY GO -------------------------------- -------------------------------- 3203A25102554923CA11BD80C99D2728 3203A25102554923CA11BD80C99D2728 3A0D79AA2466AE0EA580295FD5C81145 3A0D79AA2466AE0EA580295FD5C81145 3DFE9A0A1FDD2654C6BBB24680D13B15 3DFE9A0A1FDD2654C6BBB24680D13B15 6A5CF25DDFE0278674EE98E02B5C4B38 6A5CF25DDFE0278674EE98E02B5C4B38 8CAC11376C31E22E4AEF8214F31AA36B 8CAC11376C31E22E4AEF8214F31AA36B 96C5362832286577A3FB72A840855DFA 96C5362832286577A3FB72A840855DFA A0CA369CAAE540A2A3E92317DC5B939F A0CA369CAAE540A2A3E92317DC5B939F A5543C6D1244357CA89DD8B16A85E9EF A5543C6D1244357CA89DD8B16A85E9EF B98148167FC627C0EDF632C981E9296B B98148167FC627C0EDF632C981E9296B C50B5BC636BC29D2EDB13C5C1749F7D9 C50B5BC636BC29D2EDB13C5C1749F7D9 (10 rows affected) mssql> SELECT MIN(id), MAX(id) FROM t_ms GO -------------------------------- -------------------------------- 3203A25102554923CA11BD80C99D2728 F8FFC36AB5AE80748CE6248EE9C4ACD8 (1 rows affected)
Microsoft SQL Server seems to behave similar to PostgreSQL. And IMHO Microsoft SQL Server is really unhandy on the CLI.
Some help to make MS SQL Server work:
- Quickstart: Install SQL Server and create a database on Ubuntu
- Reset SA password on SQL Server on Linux
I do not know why it did not accept the password in the first run.
Solution
The "correct" (aka wanted) result we get like this which is a similar solution as Markus proposed above:
mysql> SELECT MIN(id), MAX(id) FROM (SELECT id FROM t_my ORDER BY id LIMIT 10) AS x; +----------------------------------+----------------------------------+ | MIN(id) | MAX(id) | +----------------------------------+----------------------------------+ | 01a6e76643c83c91867636ce90a8def5 | 2db85e2f2639d637ee21888ca34334d7 | +----------------------------------+----------------------------------+
The MySQL Query Execution Plan does not really look too cool but is acceptable:
+----+-------------+------------+------------+-------+---------------+---------+---------+------+------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+------------+------------+-------+---------------+---------+---------+------+------+----------+-------------+ | 1 | PRIMARY | <derived2> | NULL | ALL | NULL | NULL | NULL | NULL | 10 | 100.00 | NULL | | 2 | DERIVED | t_my | NULL | index | NULL | PRIMARY | 32 | NULL | 10 | 100.00 | Using index | +----+-------------+------------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
Final thoughts
Linux (Debian and Ubuntu) is really cool because we were capable to test 4 of the these 5 different database products within a very short time (2 hours from scratch install). There is just one product which causes a bit more headache. But for this product one has friends...
And finally we found a completely different and much more generic approach for the problem to solve in our fancy application...
- Shinguz's blog
- Log in or register to post comments
Comments
Window functions
Some of my colleagues mentioned Window Functions. Maybe it works. But I fear that window functions are not fast because they do some materialization in between? I have to test...