Building MariaDB Server from the sources

Shinguz - Fri, 2024-04-05 08:47

Recently I had to test a new MariaDB feature that was developed at our request (MDEV-33782). To test this feature I had to build the MariaDB server myself from source, which I have not done for a long time. So a new challenge, especially with CMake...

I followed the MariaDB documentation Get, Build and Test Latest MariaDB the Lazy Way to build the server.

On Ubuntu 22.04 it did not work for me, for reasons unknown to me. So I cloned an Ubuntu 23.04 (Lunar Lobster) LXC container and built the MariaDB server in it.
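
Setting up such a container with LXD might look roughly like this (a sketch; the container name is made up, and the author actually cloned an existing container rather than launching a fresh one):

shell> lxc launch ubuntu:lunar mariadb-build
shell> lxc exec mariadb-build -- bash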

To make the whole thing work, however, the package sources had to be added to the file /etc/apt/sources.list in the container first:

deb-src http://de.archive.ubuntu.com/ubuntu lunar main restricted universe multiverse
deb-src http://de.archive.ubuntu.com/ubuntu lunar-updates main restricted universe multiverse
deb-src http://de.archive.ubuntu.com/ubuntu lunar-security main restricted universe multiverse
deb-src http://de.archive.ubuntu.com/ubuntu lunar-backports main restricted universe multiverse
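
After adding the deb-src entries, the package index has to be refreshed, otherwise apt build-dep cannot resolve the source packages:

shell> apt update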

Then we proceeded according to the instructions:

shell> apt install build-essential bison
shell> apt build-dep mariadb-server

The corresponding branch was cloned:

shell> # git clone https://github.com/andremralves/server.git mariadb-MDEV-33782
shell> # git branch --all
shell> git clone --branch MDEV-33782 --single-branch https://github.com/andremralves/server.git mariadb-MDEV-33782
shell> cd mariadb-MDEV-33782
shell> # git checkout 11.5

and then the server was built. This took about 20 minutes on my old machine. CMake initially ran into an error, which was solved by installing the missing package (MDEV-33815):

shell> apt install libgnutls28-dev
shell> cmake . -DBUILD_CONFIG=mysql_release && make -j8

The tests were executed:

shell> cd mysql-test
shell> ./mtr rpl.rpl_create_drop_event
Logging: ./mtr rpl.rpl_create_drop_event
VS config:
vardir: /root/mariadb-MDEV-33782/mysql-test/var
Checking leftover processes...
Removing old var directory...
Creating var directory '/root/mariadb-MDEV-33782/mysql-test/var'...
Checking supported features...
MariaDB Version 11.5.0-MariaDB
 - SSL connections supported
 - binaries built with wsrep patch
Collecting tests...
Installing system database...

==============================================================================

TEST                                      RESULT   TIME (ms) or COMMENT
--------------------------------------------------------------------------

worker[01] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019
worker[01] mysql-test-run: WARNING: running this script as _root_ will cause some tests to be skipped
rpl.rpl_create_drop_event 'mix'           [ pass ]    522
rpl.rpl_create_drop_event 'row'           [ pass ]    525
rpl.rpl_create_drop_event 'stmt'          [ pass ]    525
--------------------------------------------------------------------------
The servers were restarted 2 times
Spent 1.572 of 14 seconds executing testcases

Completed: All 3 tests were successful.
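
This ran only a single test case. To be more thorough, mtr can also run a whole suite in parallel; this is an optional extra beyond the original procedure:

shell> ./mtr --suite=rpl --parallel=4 --force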

And then a binary tarball was built for further testing.

shell> make package
Run CPack packaging tool...
CPack: Create package using TGZ
CPack: Install projects
CPack: - Run preinstall target for: MariaDB
CPack: - Install project: MariaDB []
CPack: Create package
CPack: - package: /root/mariadb-MDEV-33782/mariadb-11.5.0-linux-x86_64.tar.gz generated.
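
Such a generic tarball can then be copied to a test machine and unpacked, roughly like this (a sketch; the target directory is arbitrary):

shell> tar xzf mariadb-11.5.0-linux-x86_64.tar.gz -C /opt
shell> /opt/mariadb-11.5.0-linux-x86_64/bin/mariadbd --version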


MaxScale configuration synchronisation

Shinguz - Thu, 2024-04-04 09:53

Overview

A feature that I recently discovered while browsing is the MaxScale configuration synchronisation functionality.

This is not primarily about a MariaDB replication cluster or a MariaDB Galera cluster, but about a cluster consisting of two or more MaxScale nodes. Or more precisely, the exchange of the configuration between these MaxScale nodes.

Pon Suresh Pandian has already written a blog article about this feature in 2022, which is even more detailed than this post here.

Preparations

An LXD container environment was prepared, consisting of 3 database containers (deb12-n1 (10.139.158.33), deb12-n2 (10.139.158.178), deb12-n3 (10.139.158.39)) and 2 MaxScale containers (deb12-mxs1 (10.139.158.66), deb12-mxs2 (10.139.158.174)). The database version is a MariaDB 10.11.6 from the Debian repository and MaxScale was downloaded in version 22.08.5 from the MariaDB plc website.

The database configuration looks similar for all 3 nodes (server_id and the binary log basename are of course adjusted per node):

#
# /etc/mysql/mariadb.conf.d/99-fromdual.cnf
#
[server]
server_id = 1
log_bin = deb12-n1-binlog
binlog_format = row
bind_address = *
proxy_protocol_networks = ::1, 10.139.158.0/24, localhost
gtid_strict_mode = on
log_slave_updates = on
skip_name_resolve = on

The MaxScale nodes were built as described in the article Sharding with MariaDB MaxScale.

The maxscale_admin user has exactly the same rights as described there; the maxscale_monitor user has the following rights:

RELOAD, SUPER, REPLICATION SLAVE, READ_ONLY ADMIN

See also here: Required Grants.
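
Expressed as SQL, granting the monitor user these rights could look like this (a sketch; host pattern and password are assumptions following the conventions of the article above):

SQL> CREATE USER 'maxscale_monitor'@'%' IDENTIFIED BY 'secret';
SQL> GRANT RELOAD, SUPER, REPLICATION SLAVE, READ_ONLY ADMIN ON *.* TO 'maxscale_monitor'@'%';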

The MaxScale start configuration looks like this:

#
# /etc/maxscale.cnf
#
[maxscale]
threads = auto
admin_gui = false

[deb12-n1]
type = server
address = 10.139.158.33
port = 3306
proxy_protocol = true

[deb12-n2]
type = server
address = 10.139.158.178
port = 3306
proxy_protocol = true

[Replication-Monitor]
type = monitor
module = mariadbmon
servers = deb12-n1,deb12-n2
user = maxscale_monitor
password = secret
monitor_interval = 500ms
auto_failover = true
auto_rejoin = true
enforce_read_only_slaves = true
replication_user = replication
replication_password = secret
cooperative_monitoring_locks = majority_of_running

[WriteListener]
type = listener
service = WriteService
port = 3306

[WriteService]
type = service
router = readwritesplit
servers = deb12-n1,deb12-n2
user = maxscale_admin
password = secret
transaction_replay = true
transaction_replay_timeout = 30s

Important: The configuration should look the same on all MaxScale nodes!

And then a few more checks were done to be sure that everything is correct:

shell> maxctrl list listeners
┌───────────────┬──────┬──────┬─────────┬──────────────┐
│ Name          │ Port │ Host │ State   │ Service      │
├───────────────┼──────┼──────┼─────────┼──────────────┤
│ WriteListener │ 3306 │ ::   │ Running │ WriteService │
└───────────────┴──────┴──────┴─────────┴──────────────┘

shell> maxctrl list services
┌──────────────┬────────────────┬─────────────┬───────────────────┬────────────────────┐
│ Service      │ Router         │ Connections │ Total Connections │ Targets            │
├──────────────┼────────────────┼─────────────┼───────────────────┼────────────────────┤
│ WriteService │ readwritesplit │ 0           │ 0                 │ deb12-n1, deb12-n2 │
└──────────────┴────────────────┴─────────────┴───────────────────┴────────────────────┘

shell> maxctrl list servers
┌──────────┬────────────────┬──────┬─────────────┬─────────────────┬────────┬─────────────────────┐
│ Server   │ Address        │ Port │ Connections │ State           │ GTID   │ Monitor             │
├──────────┼────────────────┼──────┼─────────────┼─────────────────┼────────┼─────────────────────┤
│ deb12-n1 │ 10.139.158.33  │ 3306 │ 0           │ Master, Running │ 0-1-19 │ Replication-Monitor │
├──────────┼────────────────┼──────┼─────────────┼─────────────────┼────────┼─────────────────────┤
│ deb12-n2 │ 10.139.158.178 │ 3306 │ 0           │ Slave, Running  │ 0-1-19 │ Replication-Monitor │
└──────────┴────────────────┴──────┴─────────────┴─────────────────┴────────┴─────────────────────┘

SQL> SELECT @@hostname, test.* FROM test.test;
+------------+----+-----------+---------------------+
| @@hostname | id | data      | ts                  |
+------------+----+-----------+---------------------+
| deb12-n2   |  1 | Some data | 2024-03-26 09:40:21 |
+------------+----+-----------+---------------------+

SQL> SELECT @@hostname, test.* FROM test.test FOR UPDATE;
+------------+----+-----------+---------------------+
| @@hostname | id | data      | ts                  |
+------------+----+-----------+---------------------+
| deb12-n1   |  1 | Some data | 2024-03-26 09:40:21 |
+------------+----+-----------+---------------------+

And another test to verify whether MaxScale really executes the failover correctly:

shell> systemctl stop mariadb

2024-03-26 16:27:05 error  : Monitor was unable to connect to server deb12-n2[10.139.158.178:3306] : 'Can't connect to server on '10.139.158.178' (115)'
2024-03-26 16:27:05 notice : Server changed state: deb12-n2[10.139.158.178:3306]: master_down. [Master, Running] -> [Down]
2024-03-26 16:27:05 warning: [mariadbmon] Primary has failed. If primary does not return in 4 monitor tick(s), failover begins.
2024-03-26 16:27:07 notice : [mariadbmon] Selecting a server to promote and replace 'deb12-n2'. Candidates are: 'deb12-n1'.
2024-03-26 16:27:07 notice : [mariadbmon] Selected 'deb12-n1'.
2024-03-26 16:27:07 notice : [mariadbmon] Performing automatic failover to replace failed primary 'deb12-n2'.
2024-03-26 16:27:07 notice : [mariadbmon] Failover 'deb12-n2' -> 'deb12-n1' performed.
2024-03-26 16:27:07 notice : Server changed state: deb12-n1[10.139.158.33:3306]: new_master. [Slave, Running] -> [Master, Running]

shell> systemctl start mariadb

2024-03-26 16:28:03 notice : Server changed state: deb12-n2[10.139.158.178:3306]: server_up. [Down] -> [Running]
2024-03-26 16:28:03 notice : [mariadbmon] Directing standalone server 'deb12-n2' to replicate from 'deb12-n1'.
2024-03-26 16:28:03 notice : [mariadbmon] Replica connection from deb12-n2 to [10.139.158.33]:3306 created and started.
2024-03-26 16:28:03 notice : [mariadbmon] 1 server(s) redirected or rejoined the cluster.
2024-03-26 16:28:03 notice : Server changed state: deb12-n2[10.139.158.178:3306]: new_slave. [Running] -> [Slave, Running]
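
Besides this automatic failover, a controlled switchover can also be triggered manually via the mariadbmon module. This was not part of the test above, but the command would look like this:

shell> maxctrl call command mariadbmon switchover Replication-Monitor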

Which MaxScale node is currently responsible for monitoring and failover (cooperative_monitoring) can be determined as follows:

shell> maxctrl show monitor Replication-Monitor | grep -e 'Diagnostics' -e '"primary"' -e 'lock_held' | uniq
│ Monitor Diagnostics │ {                        │
│                     │     "primary": true,     │
│                     │     "lock_held": true,   │

Make sure that everything works properly up to this point; otherwise the next steps serve no real purpose.

Activate MaxScale configuration synchronisation

A separate database user with the following rights is required for configuration synchronisation:

SQL> CREATE USER 'maxscale_confsync'@'%' IDENTIFIED BY 'secret';
SQL> GRANT SELECT, INSERT, UPDATE, CREATE ON `mysql`.`maxscale_config` TO maxscale_confsync@'%';

MaxScale must then be configured accordingly (on both MaxScale nodes) so that configuration synchronisation is activated. This configuration takes place in the global MaxScale section:

#
# /etc/maxscale.cnf
#
[maxscale]
config_sync_cluster = Replication-Monitor
config_sync_user = maxscale_confsync
config_sync_password = secret

The MaxScale nodes are then restarted:

shell> systemctl restart maxscale

MaxScale configuration synchronisation can also be activated and deactivated dynamically:

shell> maxctrl show maxscale | grep config_sync
│ │ "config_sync_cluster": null,      │
│ │ "config_sync_db": "mysql",        │
│ │ "config_sync_interval": "5000ms", │
│ │ "config_sync_password": null,     │
│ │ "config_sync_timeout": "10000ms", │
│ │ "config_sync_user": null,         │

Here it is important to keep the correct order of the 3 commands, otherwise there will be an error: synchronisation presumably starts as soon as config_sync_cluster is set, so user and password must already be in place at that point:

shell> MAXCTRL_WARNINGS=0 maxctrl alter maxscale config_sync_user='maxscale_confsync'
shell> MAXCTRL_WARNINGS=0 maxctrl alter maxscale config_sync_password='secret'
shell> MAXCTRL_WARNINGS=0 maxctrl alter maxscale config_sync_cluster='Replication-Monitor'
Change MaxScale parameters

As a first test, we focussed on the MaxScale monitor variable monitor_interval, which in this case even differs between the two MaxScale nodes:

shell> maxctrl show monitor Replication-Monitor | grep monitor_interval
│ │ "monitor_interval": "750ms",

shell> maxctrl show monitor Replication-Monitor | grep monitor_interval
│ │ "monitor_interval": "1000ms",

The variable can now be set on a MaxScale node with the alter monitor command:

shell> MAXCTRL_WARNINGS=0 maxctrl alter monitor Replication-Monitor monitor_interval=500ms
OK

which, on the one hand, can be seen in the MaxScale error log:

2024-03-26 14:09:16 notice : (ConfigManager); Updating to configuration version 1

On the other hand, the value should be propagated to the second MaxScale node within 5 seconds (config_sync_interval), which can be checked with the above command.

Add a new slave and make it known to MaxScale

A new slave (deb12-n3) is first created and added to the MariaDB replication cluster by hand. The slave is then made known to a MaxScale node:

shell> maxctrl create server deb12-n3 10.139.158.39

shell> MAXCTRL_WARNINGS=0 maxctrl link monitor Replication-Monitor deb12-n3
OK

shell> MAXCTRL_WARNINGS=0 maxctrl link service WriteService deb12-n3
OK

shell> maxctrl list servers
┌──────────┬────────────────┬──────┬─────────────┬─────────────────┬────────────┬─────────────────────┐
│ Server   │ Address        │ Port │ Connections │ State           │ GTID       │ Monitor             │
├──────────┼────────────────┼──────┼─────────────┼─────────────────┼────────────┼─────────────────────┤
│ deb12-n1 │ 10.139.158.33  │ 3306 │ 3           │ Slave, Running  │ 0-2-479618 │ Replication-Monitor │
├──────────┼────────────────┼──────┼─────────────┼─────────────────┼────────────┼─────────────────────┤
│ deb12-n2 │ 10.139.158.178 │ 3306 │ 3           │ Master, Running │ 0-2-479618 │ Replication-Monitor │
├──────────┼────────────────┼──────┼─────────────┼─────────────────┼────────────┼─────────────────────┤
│ deb12-n3 │ 10.139.158.39  │ 3306 │ 1           │ Slave, Running  │ 0-2-479618 │ Replication-Monitor │
└──────────┴────────────────┴──────┴─────────────┴─────────────────┴────────────┴─────────────────────┘
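
For reference, the manual step of attaching deb12-n3 to the replication cluster could look roughly like this (a sketch; the replication credentials come from the monitor configuration above, and deb12-n2 is assumed to be the current master):

SQL> CHANGE MASTER TO master_host='10.139.158.178', master_port=3306, master_user='replication', master_password='secret', master_use_gtid=slave_pos;
SQL> START SLAVE;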
Remove an old slave and make it known to MaxScale

Before a slave can be decommissioned, it should be removed from the replication cluster on one of the MaxScale nodes:

shell> maxctrl destroy server deb12-n1 --force
OK

shell> maxctrl list servers
┌──────────┬────────────────┬──────┬─────────────┬─────────────────┬────────────┬─────────────────────┐
│ Server   │ Address        │ Port │ Connections │ State           │ GTID       │ Monitor             │
├──────────┼────────────────┼──────┼─────────────┼─────────────────┼────────────┼─────────────────────┤
│ deb12-n2 │ 10.139.158.178 │ 3306 │ 3           │ Master, Running │ 0-2-493034 │ Replication-Monitor │
├──────────┼────────────────┼──────┼─────────────┼─────────────────┼────────────┼─────────────────────┤
│ deb12-n3 │ 10.139.158.39  │ 3306 │ 1           │ Slave, Running  │ 0-2-493032 │ Replication-Monitor │
└──────────┴────────────────┴──────┴─────────────┴─────────────────┴────────────┴─────────────────────┘

The slave can then be removed.
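
On the removed node itself, replication could then be stopped and the instance shut down (a minimal sketch):

SQL> STOP SLAVE;
SQL> RESET SLAVE ALL;
shell> systemctl stop mariadb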

How is the configuration synchronised?

The configuration of the two MaxScale nodes is synchronised via the database, which I personally consider an unfortunate design decision, as a configuration change could potentially cause chaos if the master fails or if network problems occur between the database nodes...

The configuration is stored in the table mysql.maxscale_config, which looks like this:

CREATE TABLE `maxscale_config` (
  `cluster` varchar(256) NOT NULL,
  `version` bigint(20) NOT NULL,
  `config` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL CHECK (json_valid(`config`)),
  `origin` varchar(254) NOT NULL,
  `nodes` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL CHECK (json_valid(`nodes`)),
  PRIMARY KEY (`cluster`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci

This table has approximately the following content:

SQL> SELECT cluster, version, CONCAT(SUBSTR(config, 1, 32), ' ... ', SUBSTR(config, -32)) AS config, origin, nodes
       FROM mysql.maxscale_config;
+---------------------+---------+------------------------------------------------------------------------+------------+------------------------------------------+
| cluster             | version | config                                                                 | origin     | nodes                                    |
+---------------------+---------+------------------------------------------------------------------------+------------+------------------------------------------+
| Replication-Monitor |       2 | {"config":[{"id":"deb12-n1","typ ... ter_name":"Replication-Monitor"}  | deb12-mxs1 | {"deb12-mxs1": "OK", "deb12-mxs2": "OK"} |
+---------------------+---------+------------------------------------------------------------------------+------------+------------------------------------------+

A local copy is kept on each node as a fallback:

shell> cut -b-32 /var/lib/maxscale/maxscale-config.json
{"config":[{"id":"deb12-n2","typ
What happens in the event of a conflict?

See also: Error Handling in Configuration Synchronization

If the configuration is changed simultaneously (within config_sync_interval?) on two different MaxScale nodes, we receive the following error message:

Error: Server at http://127.0.0.1:8989 responded with 400 Bad Request to `PATCH monitors/Replication-Monitor`
{
    "errors": [
        {
            "detail": "Cannot start configuration change: Configuration conflict detected: version stored in the cluster (3) is not the same as the local version (2), MaxScale is out of sync."
        }
    ]
}

The following command may help to recognise the problem in the event of major faults:

shell> maxctrl show maxscale | grep -A9 'Config Sync'
│ Config Sync │ {                                                           │
│             │     "checksum": "0052fe6f775168bf00778abbe37775f6f642adc7", │
│             │     "nodes": {                                              │
│             │         "deb12-mxs1": "OK",                                 │
│             │         "deb12-mxs2": "OK"                                  │
│             │     },                                                      │
│             │     "origin": "deb12-mxs2",                                 │
│             │     "status": "OK",                                         │
│             │     "version": 3                                            │
│             │ }                                                           │
Tests

All tests were also carried out under load. The following tests ran in parallel:

  • insert_test.php
  • insert_test.sh (a rough equivalent is sketched after this list)
  • mixed_test.php
  • while [ true ] ; do mariadb -s --user=app --host=10.139.158.174 --port=3306 --password=secret --execute='SELECT @@hostname, COUNT(*) FROM test.test GROUP BY @@hostname' ; sleep 0.5 ; done
  • while [ true ] ; do mariadb -s --user=app --host=10.139.158.174 --port=3306 --password=secret --execute='SELECT @@hostname, COUNT(*) FROM test.test GROUP BY @@hostname FOR UPDATE' ; sleep 0.5 ; done
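
The insert and mixed test scripts themselves are not shown in the post; a rough bash equivalent of such an insert load (hypothetical, modeled on the read loops above and assuming the id and ts columns fill themselves in) could be:

shell> while [ true ] ; do mariadb --user=app --host=10.139.158.174 --port=3306 --password=secret --execute="INSERT INTO test.test (data) VALUES ('Some data')" ; sleep 0.1 ; done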

All tests ran flawlessly and without problems throughout all of these manipulations.

Deactivate MaxScale configuration synchronisation again

Execute the following command on both MaxScale nodes to end configuration synchronisation:

shell> MAXCTRL_WARNINGS=0 maxctrl alter maxscale config_sync_cluster=''


MaxScale Konfigurations-Synchronisation

Oli Sennhauser - Tue, 2024-04-02 17:43
Inhaltsverzeichnis
Übersicht

Ein Feature, welches ich beim Stöbern kürzlich entdecke habe, ist die MaxScale Konfigurations-Synchronisation Funktionalität.

Es geht hier nicht primär um einen MariaDB Replikations-Cluster oder einen MariaDB Galera Cluster sondern um einen Cluster bestehend aus zwei oder mehreren MaxScale Knoten. Bzw. etwas genauer gesagt, den Austausch der Konfiguration unter diesen MaxScale Knoten.

Pon Suresh Pandian hat bereits 2022 einen Blog-Artikel über dieses Feature geschrieben, der noch etwas ausführlicher ist, als dieser Beitrag hier.

Vorbereitungen

Es wurde eine LXD Container-Umgebung vorbereitet, bestehend aus 3 Datenbank-Containern (deb12-n1 (10.139.158.33), deb12-n2 (10.139.158.178), deb12-n3(10.139.158.39)) und 2 MaxScale Containern (deb12-mxs1 (10.139.158.66), deb12-mxs2(10.139.158.174)). Die Datenbankversion ist eine MariaDB 10.11.6 aus dem Debian-Repository und MaxScale wurde in der Version 22.08.5 von der MariaDB plc Website heruntergeladen.

Die Datenbank-Konfiguration sieht für alle 3 Knoten analog aus:

# # /etc/mysql/mariadb.conf.d/99-fromdual.cnf # [server] server_id = 1 log_bin = deb12-n1-binlog binlog_format = row bind_address = * proxy_protocol_networks = ::1, 10.139.158.0/24, localhost gtid_strict_mode = on log_slave_updates = on skip_name_resolve = on

Die MaxScale Knoten wurden wie im Artikel Sharding mit MariaDB MaxScale beschrieben gebaut.

Der maxscale_admin User hat genau die selben Rechte wie dort beschrieben, der maxscale_monitor User hat folgende Rechte erhalten:

RELOAD, SUPER, REPLICATION SLAVE, READ_ONLY ADMIN

Siehe auch hier: Required Grants.

Die MaxScale Start-Konfiguration sieht wie folgt aus:

# # /etc/maxscale.cnf # [maxscale] threads = auto admin_gui = false [deb12-n1] type = server address = 10.139.158.33 port = 3306 proxy_protocol = true [deb12-n2] type = server address = 10.139.158.178 port = 3306 proxy_protocol = true [Replication-Monitor] type = monitor module = mariadbmon servers = deb12-n1,deb12-n2 user = maxscale_monitor password = secret monitor_interval = 500ms auto_failover = true auto_rejoin = true enforce_read_only_slaves = true replication_user = replication replication_password = secret cooperative_monitoring_locks = majority_of_running [WriteListener] type = listener service = WriteService port = 3306 [WriteService] type = service router = readwritesplit servers = deb12-n1,deb12-n2 user = maxscale_admin password = secret transaction_replay = true transaction_replay_timeout = 30s

Wichtig: Die Konfiguration sollte auf allen MaxScale Knoten gleich aussehen!

Und dann noch ein paar Checks um sicher zu sein, dass alles stimmt:

shell> maxctrl list listeners ┌───────────────┬──────┬──────┬─────────┬──────────────┐ │ Name │ Port │ Host │ State │ Service │ ├───────────────┼──────┼──────┼─────────┼──────────────┤ │ WriteListener │ 3306 │ :: │ Running │ WriteService │ └───────────────┴──────┴──────┴─────────┴──────────────┘ shell> maxctrl list services ┌──────────────┬────────────────┬─────────────┬───────────────────┬────────────────────┐ │ Service │ Router │ Connections │ Total Connections │ Targets │ ├──────────────┼────────────────┼─────────────┼───────────────────┼────────────────────┤ │ WriteService │ readwritesplit │ 0 │ 0 │ deb12-n1, deb12-n2 │ └──────────────┴────────────────┴─────────────┴───────────────────┴────────────────────┘ shell> maxctrl list servers ┌──────────┬────────────────┬──────┬─────────────┬─────────────────┬────────┬─────────────────────┐ │ Server │ Address │ Port │ Connections │ State │ GTID │ Monitor │ ├──────────┼────────────────┼──────┼─────────────┼─────────────────┼────────┼─────────────────────┤ │ deb12-n1 │ 10.139.158.33 │ 3306 │ 0 │ Master, Running │ 0-1-19 │ Replication-Monitor │ ├──────────┼────────────────┼──────┼─────────────┼─────────────────┼────────┼─────────────────────┤ │ deb12-n2 │ 10.139.158.178 │ 3306 │ 0 │ Slave, Running │ 0-1-19 │ Replication-Monitor │ └──────────┴────────────────┴──────┴─────────────┴─────────────────┴────────┴─────────────────────┘ SQL> SELECT @@hostname, test.* FROM test.test; +------------+----+-----------+---------------------+ | @@hostname | id | data | ts | +------------+----+-----------+---------------------+ | deb12-n2 | 1 | Some data | 2024-03-26 09:40:21 | +------------+----+-----------+---------------------+ SQL> SELECT @@hostname, test.* FROM test.test FOR UPDATE; +------------+----+-----------+---------------------+ | @@hostname | id | data | ts | +------------+----+-----------+---------------------+ | deb12-n1 | 1 | Some data | 2024-03-26 09:40:21 | +------------+----+-----------+---------------------+

Und noch ein weiterer Test ob MaxScale den Failover auch wirklich richtig ausführt:

shell> systemctl stop mariadb 2024-03-26 16:27:05 error : Monitor was unable to connect to server deb12-n2[10.139.158.178:3306] : 'Can't connect to server on '10.139.158.178' (115)' 2024-03-26 16:27:05 notice : Server changed state: deb12-n2[10.139.158.178:3306]: master_down. [Master, Running] -> [Down] 2024-03-26 16:27:05 warning: [mariadbmon] Primary has failed. If primary does not return in 4 monitor tick(s), failover begins. 2024-03-26 16:27:07 notice : [mariadbmon] Selecting a server to promote and replace 'deb12-n2'. Candidates are: 'deb12-n1'. 2024-03-26 16:27:07 notice : [mariadbmon] Selected 'deb12-n1'. 2024-03-26 16:27:07 notice : [mariadbmon] Performing automatic failover to replace failed primary 'deb12-n2'. 2024-03-26 16:27:07 notice : [mariadbmon] Failover 'deb12-n2' -> 'deb12-n1' performed. 2024-03-26 16:27:07 notice : Server changed state: deb12-n1[10.139.158.33:3306]: new_master. [Slave, Running] -> [Master, Running] shell> systemctl start mariadb 2024-03-26 16:28:03 notice : Server changed state: deb12-n2[10.139.158.178:3306]: server_up. [Down] -> [Running] 2024-03-26 16:28:03 notice : [mariadbmon] Directing standalone server 'deb12-n2' to replicate from 'deb12-n1'. 2024-03-26 16:28:03 notice : [mariadbmon] Replica connection from deb12-n2 to [10.139.158.33]:3306 created and started. 2024-03-26 16:28:03 notice : [mariadbmon] 1 server(s) redirected or rejoined the cluster. 2024-03-26 16:28:03 notice : Server changed state: deb12-n2[10.139.158.178:3306]: new_slave. [Running] -> [Slave, Running]

Welcher MaxScale Knoten gerade für das Monitoring und Failover zuständig ist (cooperatve_monitoring) kann wir folgt eruiert werden:

shell> maxctrl show monitor Replication-Monitor | grep -e 'Diagnostics' -e '"primary"' -e 'lock_held' | uniq │ Monitor Diagnostics │ { │ │ │ "primary": true, │ │ │ "lock_held": true, │

Es sollte sichergestellt sein, das bis hier hin alles sauber funktioniert. Ansonsten haben die weiteren Schritte keinen wirklichen Sinn.

MaxScale Konfigurations-Synchronisation aktivieren

Für die Konfigurations-Synchronisation braucht es einen eigenen Datenbank-User mit folgenden Rechten:

SQL> CREATE USER 'maxscale_confsync'@'%' IDENTIFIED BY 'secret'; SQL> GRANT SELECT, INSERT, UPDATE, CREATE ON `mysql`.`maxscale_config` TO maxscale_confsync@'%';

Anschliessend muss MaxScale entsprechend konfiguriert werden (auf beiden MaxScale Knoten), damit die Konfigurations-Synchronisation aktiviert wird. Diese Konfiguration erfolgt im globalen MaxScale Abschnitt:

# # /etc/maxscale.cnf # [maxscale] config_sync_cluster = Replication-Monitor config_sync_user = maxscale_confsync config_sync_password = secret

Dann erfolgt der Neustart der MaxScale Knoten:

shell> systemctl restart maxscale

Die MaxScale Konfigurations-Synchronisation kann auch dynamisch aktiviert und deaktiviert werden:

shell> maxctrl show maxscale | grep config_sync │ │ "config_sync_cluster": null, │ │ │ "config_sync_db": "mysql", │ │ │ "config_sync_interval": "5000ms", │ │ │ "config_sync_password": null, │ │ │ "config_sync_timeout": "10000ms", │ │ │ "config_sync_user": null, │

Hier ist wichtig, die richtige Reihenfolge der 3 Befehle einzuhalten, da es sonst einen Fehler gibt:

shell> MAXCTRL_WARNINGS=0 maxctrl alter maxscale config_sync_user='maxscale_confsync' shell> MAXCTRL_WARNINGS=0 maxctrl alter maxscale config_sync_password='secret' shell> MAXCTRL_WARNINGS=0 maxctrl alter maxscale config_sync_cluster='Replication-Monitor'
MaxScale Parameter ändern

Als ersten Test haben wir die MaxScale Monitor Variable monitor_interval ins Auge gefasst, die in diesem Fall auf beiden MaxScale-Knoten sogar unterschiedlich ist:

shell> maxctrl show monitor Replication-Monitor | grep monitor_interval │ │ "monitor_interval": "750ms", shell> maxctrl show monitor Replication-Monitor | grep monitor_interval │ │ "monitor_interval": "1000ms",

Mit dem alter monitor Befehl kann die Variable jetzt auf einem MaxScale Knoten gesetzt werden:

shell> MAXCTRL_WARNINGS=0 maxctrl alter monitor Replication-Monitor monitor_interval=500ms OK

was einerseits im MaxScale Error Log ersichtlich ist:

2024-03-26 14:09:16 notice : (ConfigManager); Updating to configuration version 1

anderseits sollte der Wert innerhalb von 5 Sekunden (config_sync_interval) auf den zweiten MaxScale Knoten propagiert werden, was mit dem obigen Befehl überprüft werden kann.

Neuer Slave hinzufügen und MaxScale bekannt machen

Ein neuer Slave (deb12-n3) wird zuerst erstellt und dem MariaDB Replikations-Cluster von Hand hinzugefügt. Anschliessen wird der Slave einem MaxScale Knoten bekannt gemacht:

shell> maxctrl create server deb12-n3 10.139.158.39 shell> MAXCTRL_WARNINGS=0 maxctrl link monitor Replication-Monitor deb12-n3 OK shell> MAXCTRL_WARNINGS=0 maxctrl link service WriteService deb12-n3 OK shell> maxctrl list servers ┌──────────┬────────────────┬──────┬─────────────┬─────────────────┬────────────┬─────────────────────┐ │ Server │ Address │ Port │ Connections │ State │ GTID │ Monitor │ ├──────────┼────────────────┼──────┼─────────────┼─────────────────┼────────────┼─────────────────────┤ │ deb12-n1 │ 10.139.158.33 │ 3306 │ 3 │ Slave, Running │ 0-2-479618 │ Replication-Monitor │ ├──────────┼────────────────┼──────┼─────────────┼─────────────────┼────────────┼─────────────────────┤ │ deb12-n2 │ 10.139.158.178 │ 3306 │ 3 │ Master, Running │ 0-2-479618 │ Replication-Monitor │ ├──────────┼────────────────┼──────┼─────────────┼─────────────────┼────────────┼─────────────────────┤ │ deb12-n3 │ 10.139.158.39 │ 3306 │ 1 │ Slave, Running │ 0-2-479618 │ Replication-Monitor │ └──────────┴────────────────┴──────┴─────────────┴─────────────────┴────────────┴─────────────────────┘
Alter Slave entfernen und MaxScale bekannt machen

Bevor ein Slave gelöscht werden kann, sollte er bei einem MaxScale Knoten aus dem Replikations-Cluster entfernt werden:

shell> maxctrl destroy server deb12-n1 --force OK shell> maxctrl list servers ┌──────────┬────────────────┬──────┬─────────────┬─────────────────┬────────────┬─────────────────────┐ │ Server │ Address │ Port │ Connections │ State │ GTID │ Monitor │ ├──────────┼────────────────┼──────┼─────────────┼─────────────────┼────────────┼─────────────────────┤ │ deb12-n2 │ 10.139.158.178 │ 3306 │ 3 │ Master, Running │ 0-2-493034 │ Replication-Monitor │ ├──────────┼────────────────┼──────┼─────────────┼─────────────────┼────────────┼─────────────────────┤ │ deb12-n3 │ 10.139.158.39 │ 3306 │ 1 │ Slave, Running │ 0-2-493032 │ Replication-Monitor │ └──────────┴────────────────┴──────┴─────────────┴─────────────────┴────────────┴─────────────────────┘

Anschliessend kann der Slave abgebaut werden.

Wie wird die Konfiguration synchronisiert?

Die Konfiguration der beiden MaxScale Knoten wird über die Datenbank synchronisiert, was ich persönlich als unglückliche Design-Entscheidung ansehe, da eine Konfigurationsänderung potentiell hackelt, wenn der Master kaputt geht oder Netzwerkprobleme zwischen den Datenbanknoten auftreten...

Die Konfiguration wird in der Tabelle mysql.maxscale_config abgespeichert, welche wie folgt aussieht:

CREATE TABLE `maxscale_config` ( `cluster` varchar(256) NOT NULL, `version` bigint(20) NOT NULL, `config` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL CHECK (json_valid(`config`)), `origin` varchar(254) NOT NULL, `nodes` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL CHECK (json_valid(`nodes`)), PRIMARY KEY (`cluster`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci

Diese Tabelle hat in etwa folgenden Inhalt:

SQL> SELECT cluster, version, CONCAT(SUBSTR(config, 1, 32), ' ... ', SUBSTR(config, -32)) AS config , origin, nodes FROM mysql.maxscale_config; +---------------------+---------+-----------------------------------------------------------------------+------------+------------------------------------------+ | cluster | version | config | origin | nodes | +---------------------+---------+-----------------------------------------------------------------------+------------+------------------------------------------+ | Replication-Monitor | 2 | {"config":[{"id":"deb12-n1","typ ... ter_name":"Replication-Monitor"} | deb12-mxs1 | {"deb12-mxs1": "OK", "deb12-mxs2": "OK"} | +---------------------+---------+-----------------------------------------------------------------------+------------+------------------------------------------+

Eine lokale Kopie liegt zur Sicherheit auf jedem Knoten vor:

shell> cut -b-32 /var/lib/maxscale/maxscale-config.json {"config":[{"id":"deb12-n2","typ
Was passiert im Konfliktfall?

Siehe hierzu auch: Error Handling in Configuration Synchronization

Wenn zeitgleich (innerhalb von config_sync_interval?) auf zwei unterschiedlichen MaxScale Knoten die Konfiguration geändert wird erhalten wir folgende Fehlermeldung:

Error: Server at http://127.0.0.1:8989 responded with 400 Bad Request to `PATCH monitors/Replication-Monitor` { "errors": [ { "detail": "Cannot start configuration change: Configuration conflict detected: version stored in the cluster (3) is not the same as the local version (2), MaxScale is out of sync." } ] }

Folgender Befehl hilft allenfalls bei grösseren Störungen das Problem zu erkennen:

shell> maxctrl show maxscale | grep -A9 'Config Sync' │ Config Sync │ { │ │ │ "checksum": "0052fe6f775168bf00778abbe37775f6f642adc7", │ │ │ "nodes": { │ │ │ "deb12-mxs1": "OK", │ │ │ "deb12-mxs2": "OK" │ │ │ }, │ │ │ "origin": "deb12-mxs2", │ │ │ "status": "OK", │ │ │ "version": 3 │ │ │ } │
Tests

Alle Tests wurden auch unter Last ausgeführt. Folgende Tests liefen parallel:

  • insert_test.php
  • insert_test.sh
  • mixed_test.php
  • while [ true ] ; do mariadb -s --user=app --host=10.139.158.174 --port=3306 --password=secret --execute='SELECT @@hostname, COUNT(*) FROM test.test GROUP BY @@hostname' ; sleep 0.5 ; done
  • while [ true ] ; do mariadb -s --user=app --host=10.139.158.174 --port=3306 --password=secret --execute='SELECT @@hostname, COUNT(*) FROM test.test GROUP BY @@hostname FOR UPDATE' ; sleep 0.5 ; done

Alle Tests sind bei allen Manipulationen einwandfrei und ohne Probleme durchgelaufen.

MaxScale Konfigurations-Synchronisation wieder deaktivieren

Auf beiden MaxScale Knoten ist folgender Befehl auszuführen und damit ist die Konfigurations-Synchronisation wieder beendet:

shell> MAXCTRL_WARNINGS=0 maxctrl alter maxscale config_sync_cluster=''
Literatur/Quellen
Taxonomy upgrade extras: maxscaleconfigurationclusterload balancer

MaxScale Konfigurations-Synchronisation

Oli Sennhauser - Tue, 2024-04-02 17:43
Table of contents
Overview

A feature that I recently discovered while browsing is the MaxScale configuration synchronisation functionality.

This is not primarily about a MariaDB replication cluster or a MariaDB Galera cluster, but about a cluster consisting of two or more MaxScale nodes, or, more precisely, about exchanging the configuration between these MaxScale nodes.

Pon Suresh Pandian already wrote a blog article about this feature back in 2022, which is somewhat more detailed than this post here.

Preparations

An LXD container environment was prepared, consisting of 3 database containers (deb12-n1 (10.139.158.33), deb12-n2 (10.139.158.178), deb12-n3 (10.139.158.39)) and 2 MaxScale containers (deb12-mxs1 (10.139.158.66), deb12-mxs2 (10.139.158.174)). The database version is MariaDB 10.11.6 from the Debian repository, and MaxScale version 22.08.5 was downloaded from the MariaDB plc website.

The database configuration looks analogous for all 3 nodes (server_id and the binary log name must be adapted per node):

#
# /etc/mysql/mariadb.conf.d/99-fromdual.cnf
#

[server]
server_id               = 1
log_bin                 = deb12-n1-binlog
binlog_format           = row
bind_address            = *
proxy_protocol_networks = ::1, 10.139.158.0/24, localhost
gtid_strict_mode        = on
log_slave_updates       = on
skip_name_resolve       = on

The MaxScale nodes were built as described in the article Sharding with MariaDB MaxScale.

The maxscale_admin user has exactly the same privileges as described there; the maxscale_monitor user was given the following privileges:

RELOAD, SUPER, REPLICATION SLAVE, READ_ONLY ADMIN

See also: Required Grants.
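
For reference, creating such a monitor user could look like this (a minimal sketch; the host range and password are assumptions from this test setup):

SQL> CREATE USER 'maxscale_monitor'@'10.139.158.%' IDENTIFIED BY 'secret';
SQL> GRANT RELOAD, SUPER, REPLICATION SLAVE, READ_ONLY ADMIN ON *.* TO 'maxscale_monitor'@'10.139.158.%';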

The MaxScale start configuration looks as follows:

#
# /etc/maxscale.cnf
#

[maxscale]
threads = auto
admin_gui = false

[deb12-n1]
type = server
address = 10.139.158.33
port = 3306
proxy_protocol = true

[deb12-n2]
type = server
address = 10.139.158.178
port = 3306
proxy_protocol = true

[Replication-Monitor]
type = monitor
module = mariadbmon
servers = deb12-n1,deb12-n2
user = maxscale_monitor
password = secret
monitor_interval = 500ms
auto_failover = true
auto_rejoin = true
enforce_read_only_slaves = true
replication_user = replication
replication_password = secret
cooperative_monitoring_locks = majority_of_running

[WriteListener]
type = listener
service = WriteService
port = 3306

[WriteService]
type = service
router = readwritesplit
servers = deb12-n1,deb12-n2
user = maxscale_admin
password = secret
transaction_replay = true
transaction_replay_timeout = 30s

Important: The configuration should look identical on all MaxScale nodes!

And then a few checks to be sure that everything is correct:

shell> maxctrl list listeners
┌───────────────┬──────┬──────┬─────────┬──────────────┐
│ Name          │ Port │ Host │ State   │ Service      │
├───────────────┼──────┼──────┼─────────┼──────────────┤
│ WriteListener │ 3306 │ ::   │ Running │ WriteService │
└───────────────┴──────┴──────┴─────────┴──────────────┘

shell> maxctrl list services
┌──────────────┬────────────────┬─────────────┬───────────────────┬────────────────────┐
│ Service      │ Router         │ Connections │ Total Connections │ Targets            │
├──────────────┼────────────────┼─────────────┼───────────────────┼────────────────────┤
│ WriteService │ readwritesplit │ 0           │ 0                 │ deb12-n1, deb12-n2 │
└──────────────┴────────────────┴─────────────┴───────────────────┴────────────────────┘

shell> maxctrl list servers
┌──────────┬────────────────┬──────┬─────────────┬─────────────────┬────────┬─────────────────────┐
│ Server   │ Address        │ Port │ Connections │ State           │ GTID   │ Monitor             │
├──────────┼────────────────┼──────┼─────────────┼─────────────────┼────────┼─────────────────────┤
│ deb12-n1 │ 10.139.158.33  │ 3306 │ 0           │ Master, Running │ 0-1-19 │ Replication-Monitor │
├──────────┼────────────────┼──────┼─────────────┼─────────────────┼────────┼─────────────────────┤
│ deb12-n2 │ 10.139.158.178 │ 3306 │ 0           │ Slave, Running  │ 0-1-19 │ Replication-Monitor │
└──────────┴────────────────┴──────┴─────────────┴─────────────────┴────────┴─────────────────────┘

SQL> SELECT @@hostname, test.* FROM test.test;
+------------+----+-----------+---------------------+
| @@hostname | id | data      | ts                  |
+------------+----+-----------+---------------------+
| deb12-n2   |  1 | Some data | 2024-03-26 09:40:21 |
+------------+----+-----------+---------------------+

SQL> SELECT @@hostname, test.* FROM test.test FOR UPDATE;
+------------+----+-----------+---------------------+
| @@hostname | id | data      | ts                  |
+------------+----+-----------+---------------------+
| deb12-n1   |  1 | Some data | 2024-03-26 09:40:21 |
+------------+----+-----------+---------------------+

And another test to check whether MaxScale really performs the failover correctly:

shell> systemctl stop mariadb

2024-03-26 16:27:05 error  : Monitor was unable to connect to server deb12-n2[10.139.158.178:3306] : 'Can't connect to server on '10.139.158.178' (115)'
2024-03-26 16:27:05 notice : Server changed state: deb12-n2[10.139.158.178:3306]: master_down. [Master, Running] -> [Down]
2024-03-26 16:27:05 warning: [mariadbmon] Primary has failed. If primary does not return in 4 monitor tick(s), failover begins.
2024-03-26 16:27:07 notice : [mariadbmon] Selecting a server to promote and replace 'deb12-n2'. Candidates are: 'deb12-n1'.
2024-03-26 16:27:07 notice : [mariadbmon] Selected 'deb12-n1'.
2024-03-26 16:27:07 notice : [mariadbmon] Performing automatic failover to replace failed primary 'deb12-n2'.
2024-03-26 16:27:07 notice : [mariadbmon] Failover 'deb12-n2' -> 'deb12-n1' performed.
2024-03-26 16:27:07 notice : Server changed state: deb12-n1[10.139.158.33:3306]: new_master. [Slave, Running] -> [Master, Running]

shell> systemctl start mariadb

2024-03-26 16:28:03 notice : Server changed state: deb12-n2[10.139.158.178:3306]: server_up. [Down] -> [Running]
2024-03-26 16:28:03 notice : [mariadbmon] Directing standalone server 'deb12-n2' to replicate from 'deb12-n1'.
2024-03-26 16:28:03 notice : [mariadbmon] Replica connection from deb12-n2 to [10.139.158.33]:3306 created and started.
2024-03-26 16:28:03 notice : [mariadbmon] 1 server(s) redirected or rejoined the cluster.
2024-03-26 16:28:03 notice : Server changed state: deb12-n2[10.139.158.178:3306]: new_slave. [Running] -> [Slave, Running]

Which MaxScale node is currently responsible for monitoring and failover (cooperative_monitoring) can be determined as follows:

shell> maxctrl show monitor Replication-Monitor | grep -e 'Diagnostics' -e '"primary"' -e 'lock_held' | uniq
│ Monitor Diagnostics │ {                      │
│                     │     "primary": true,   │
│                     │     "lock_held": true, │

It should be ensured that everything works properly up to this point. Otherwise, the further steps make little sense.

Activating MaxScale configuration synchronisation

Configuration synchronisation requires a dedicated database user with the following privileges:

SQL> CREATE USER 'maxscale_confsync'@'%' IDENTIFIED BY 'secret';
SQL> GRANT SELECT, INSERT, UPDATE, CREATE ON `mysql`.`maxscale_config` TO maxscale_confsync@'%';

MaxScale must then be configured accordingly (on both MaxScale nodes) to activate configuration synchronisation. This is done in the global MaxScale section:

#
# /etc/maxscale.cnf
#

[maxscale]
config_sync_cluster  = Replication-Monitor
config_sync_user     = maxscale_confsync
config_sync_password = secret

Then the MaxScale nodes are restarted:

shell> systemctl restart maxscale

MaxScale configuration synchronisation can also be activated and deactivated dynamically:

shell> maxctrl show maxscale | grep config_sync
│ │     "config_sync_cluster": null,      │
│ │     "config_sync_db": "mysql",        │
│ │     "config_sync_interval": "5000ms", │
│ │     "config_sync_password": null,     │
│ │     "config_sync_timeout": "10000ms", │
│ │     "config_sync_user": null,         │

It is important to keep the correct order of the following 3 commands, as otherwise an error occurs:

shell> MAXCTRL_WARNINGS=0 maxctrl alter maxscale config_sync_user='maxscale_confsync'
shell> MAXCTRL_WARNINGS=0 maxctrl alter maxscale config_sync_password='secret'
shell> MAXCTRL_WARNINGS=0 maxctrl alter maxscale config_sync_cluster='Replication-Monitor'

Changing MaxScale parameters

As a first test we looked at the MaxScale monitor variable monitor_interval, which in this case even differs between the two MaxScale nodes:

shell> maxctrl show monitor Replication-Monitor | grep monitor_interval
│ │     "monitor_interval": "750ms", │

shell> maxctrl show monitor Replication-Monitor | grep monitor_interval
│ │     "monitor_interval": "1000ms", │

The variable can now be set on one MaxScale node with the alter monitor command:

shell> MAXCTRL_WARNINGS=0 maxctrl alter monitor Replication-Monitor monitor_interval=500ms
OK

which, on the one hand, can be seen in the MaxScale error log:

2024-03-26 14:09:16 notice : (ConfigManager); Updating to configuration version 1

and, on the other hand, the value should be propagated to the second MaxScale node within 5 seconds (config_sync_interval), which can be verified with the command above.
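
To follow the propagation on the second MaxScale node, a simple loop such as the following can be used (a sketch, assuming the watch utility is available):

shell> watch -n 1 "maxctrl show monitor Replication-Monitor | grep monitor_interval"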

Adding a new slave and making it known to MaxScale

A new slave (deb12-n3) is first created and added to the MariaDB replication cluster by hand. The slave is then made known to one of the MaxScale nodes:

shell> maxctrl create server deb12-n3 10.139.158.39
shell> MAXCTRL_WARNINGS=0 maxctrl link monitor Replication-Monitor deb12-n3
OK
shell> MAXCTRL_WARNINGS=0 maxctrl link service WriteService deb12-n3
OK

shell> maxctrl list servers
┌──────────┬────────────────┬──────┬─────────────┬─────────────────┬────────────┬─────────────────────┐
│ Server   │ Address        │ Port │ Connections │ State           │ GTID       │ Monitor             │
├──────────┼────────────────┼──────┼─────────────┼─────────────────┼────────────┼─────────────────────┤
│ deb12-n1 │ 10.139.158.33  │ 3306 │ 3           │ Slave, Running  │ 0-2-479618 │ Replication-Monitor │
├──────────┼────────────────┼──────┼─────────────┼─────────────────┼────────────┼─────────────────────┤
│ deb12-n2 │ 10.139.158.178 │ 3306 │ 3           │ Master, Running │ 0-2-479618 │ Replication-Monitor │
├──────────┼────────────────┼──────┼─────────────┼─────────────────┼────────────┼─────────────────────┤
│ deb12-n3 │ 10.139.158.39  │ 3306 │ 1           │ Slave, Running  │ 0-2-479618 │ Replication-Monitor │
└──────────┴────────────────┴──────┴─────────────┴─────────────────┴────────────┴─────────────────────┘

Removing an old slave and making it known to MaxScale

Before a slave can be deleted, it should be removed from the replication cluster on one of the MaxScale nodes:

shell> maxctrl destroy server deb12-n1 --force
OK

shell> maxctrl list servers
┌──────────┬────────────────┬──────┬─────────────┬─────────────────┬────────────┬─────────────────────┐
│ Server   │ Address        │ Port │ Connections │ State           │ GTID       │ Monitor             │
├──────────┼────────────────┼──────┼─────────────┼─────────────────┼────────────┼─────────────────────┤
│ deb12-n2 │ 10.139.158.178 │ 3306 │ 3           │ Master, Running │ 0-2-493034 │ Replication-Monitor │
├──────────┼────────────────┼──────┼─────────────┼─────────────────┼────────────┼─────────────────────┤
│ deb12-n3 │ 10.139.158.39  │ 3306 │ 1           │ Slave, Running  │ 0-2-493032 │ Replication-Monitor │
└──────────┴────────────────┴──────┴─────────────┴─────────────────┴────────────┴─────────────────────┘

The slave can then be decommissioned.
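
A possible sequence for decommissioning the old slave could look like this (a sketch; to be adapted to your own operating procedures):

SQL> STOP SLAVE;
SQL> RESET SLAVE ALL;

shell> systemctl stop mariadb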

How is the configuration synchronised?

The configuration of the two MaxScale nodes is synchronised via the database, which I personally consider an unfortunate design decision, as a configuration change can potentially stall if the master breaks or network problems occur between the database nodes...

The configuration is stored in the table mysql.maxscale_config, which looks as follows:

CREATE TABLE `maxscale_config` (
  `cluster` varchar(256) NOT NULL,
  `version` bigint(20) NOT NULL,
  `config` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL CHECK (json_valid(`config`)),
  `origin` varchar(254) NOT NULL,
  `nodes` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL CHECK (json_valid(`nodes`)),
  PRIMARY KEY (`cluster`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci

This table has roughly the following content:

SQL> SELECT cluster, version, CONCAT(SUBSTR(config, 1, 32), ' ... ', SUBSTR(config, -32)) AS config, origin, nodes FROM mysql.maxscale_config;
+---------------------+---------+-----------------------------------------------------------------------+------------+------------------------------------------+
| cluster             | version | config                                                                | origin     | nodes                                    |
+---------------------+---------+-----------------------------------------------------------------------+------------+------------------------------------------+
| Replication-Monitor |       2 | {"config":[{"id":"deb12-n1","typ ... ter_name":"Replication-Monitor"} | deb12-mxs1 | {"deb12-mxs1": "OK", "deb12-mxs2": "OK"} |
+---------------------+---------+-----------------------------------------------------------------------+------------+------------------------------------------+

A local copy is kept on each node for safety:

shell> cut -b-32 /var/lib/maxscale/maxscale-config.json
{"config":[{"id":"deb12-n2","typ
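
Since the local copy is plain JSON, it can also be pretty-printed for inspection, for example with jq (assuming jq is installed):

shell> jq . /var/lib/maxscale/maxscale-config.json | head
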
What happens in the event of a conflict?

See also: Error Handling in Configuration Synchronization

If the configuration is changed on two different MaxScale nodes at the same time (within config_sync_interval?), we get the following error message:

Error: Server at http://127.0.0.1:8989 responded with 400 Bad Request to `PATCH monitors/Replication-Monitor`
{
  "errors": [
    {
      "detail": "Cannot start configuration change: Configuration conflict detected: version stored in the cluster (3) is not the same as the local version (2), MaxScale is out of sync."
    }
  ]
}

In the event of larger disruptions, the following command may help to identify the problem:

shell> maxctrl show maxscale | grep -A9 'Config Sync'
│ Config Sync │ {                                                           │
│             │     "checksum": "0052fe6f775168bf00778abbe37775f6f642adc7", │
│             │     "nodes": {                                              │
│             │         "deb12-mxs1": "OK",                                 │
│             │         "deb12-mxs2": "OK"                                  │
│             │     },                                                      │
│             │     "origin": "deb12-mxs2",                                 │
│             │     "status": "OK",                                         │
│             │     "version": 3                                            │
│             │ }                                                           │

Tests

All tests were also executed under load. The following tests ran in parallel:

  • insert_test.php
  • insert_test.sh
  • mixed_test.php
  • while [ true ] ; do mariadb -s --user=app --host=10.139.158.174 --port=3306 --password=secret --execute='SELECT @@hostname, COUNT(*) FROM test.test GROUP BY @@hostname' ; sleep 0.5 ; done
  • while [ true ] ; do mariadb -s --user=app --host=10.139.158.174 --port=3306 --password=secret --execute='SELECT @@hostname, COUNT(*) FROM test.test GROUP BY @@hostname FOR UPDATE' ; sleep 0.5 ; done

All tests ran flawlessly and without problems during all manipulations.

Deactivating MaxScale configuration synchronisation again

The following command must be executed on both MaxScale nodes; this terminates configuration synchronisation again:

shell> MAXCTRL_WARNINGS=0 maxctrl alter maxscale config_sync_cluster=''
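
If desired, the leftovers can then be cleaned up as well (a sketch; only do this once configuration synchronisation has been deactivated on ALL MaxScale nodes):

SQL> DROP TABLE mysql.maxscale_config;
SQL> DROP USER 'maxscale_confsync'@'%';

shell> rm /var/lib/maxscale/maxscale-config.json
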
References/Sources
Taxonomy upgrade extras: maxscale, configuration, cluster, load balancer

Sharding with MariaDB MaxScale

Shinguz - Tue, 2024-03-19 17:02
Table of contents
Overview

This feature should more or less work with MariaDB MaxScale 6.x.y, 22.08.x, 23.02.x, 23.08.x and 24.02.x. We tested it with the latest MaxScale version 23.08.5, as we encountered problems with an older version (MXS-5026).

shell> maxscale --version
MaxScale 23.08.5

We used MariaDB 10.11 as the database backend (shards).

Less than approx. 2% of all MariaDB installations known to us are what we technically understand by multi-tenant systems (each customer in its own database, also called a schema).

This MariaDB MaxScale feature is therefore used relatively rarely and there is an increased risk of encountering bugs that no-one has come across before!

This feature is called SchemaRouter in MariaDB MaxScale and is still declared as beta quality (MXS-5025):

maxctrl> show module schemarouter
┌─────────────┬────────────────────────────────────────────────┐
│ Module      │ schemarouter                                   │
├─────────────┼────────────────────────────────────────────────┤
│ Type        │ Router                                         │
├─────────────┼────────────────────────────────────────────────┤
│ Version     │ V1.0.0                                         │
├─────────────┼────────────────────────────────────────────────┤
│ Maturity    │ Beta                                           │
├─────────────┼────────────────────────────────────────────────┤
│ Description │ A database sharding router for simple sharding │
├─────────────┼────────────────────────────────────────────────┤
│ ...

The target topology should look like this: Each customer (client, tenant) is located in its own database (= schema). The databases are distributed across several MariaDB instances (shards). So that the application can access the database transparently, a pair of MaxScale load balancers is connected in front of it, which knows where the customer is located and forwards the traffic to the shard accordingly. To ensure that the MaxScale load balancers are designed for high availability, a virtual IP (VIP) is connected upstream, e.g. using Keepalived. If this is still too simple for you, you can design each individual shard as a master/slave or Galera cluster construct...
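
For completeness, a minimal Keepalived sketch for such a VIP could look like this (interface, virtual_router_id and VIP address are assumptions and must be adapted to your environment):

#
# /etc/keepalived/keepalived.conf (sketch)
#

vrrp_instance VI_MAXSCALE {
    state MASTER              # BACKUP on the second MaxScale node
    interface eth0            # assumption: adapt to your NIC
    virtual_router_id 42      # assumption: must be unique per VRRP segment
    priority 100              # use a lower priority on the BACKUP node
    advert_int 1
    virtual_ipaddress {
        10.139.158.200/24     # assumption: the VIP the application connects to
    }
}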


Preparation of the shards (MariaDB database instances)

The first problem we had with this PoC was with the test database. By deleting the test database on all shards, the problem disappeared. Alternatively, you can run mariadb-secure-installation, which you should do on production systems anyway, or you can use the MaxScale configuration parameters ignore_tables or ignore_tables_regex to allow the same tables in different shards (MXS-5027).

See also: MaxScale Router Parameters.
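
A minimal sketch of how this could look in the SchemaRouter service section (the regular expression is an assumption for ignoring the test schema on all shards):

[Sharded-Service]
type = service
router = schemarouter
ignore_tables_regex = ^test\.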

Create test data

So that we have something to play with, we have created test data:

-- On shard 1: 2 customers

SQL> CREATE DATABASE customer_0010;
SQL> CREATE TABLE customer_0010.address (id INT UNSIGNED, name VARCHAR(255));
SQL> CREATE TABLE customer_0010.sales (id INT UNSIGNED, product VARCHAR(255), sales TINYINT, amount DECIMAL(6, 2), total_amount DECIMAL(6, 2));
SQL> INSERT INTO customer_0010.address VALUES (1, 'Customer 10 GmbH');
SQL> INSERT INTO customer_0010.sales VALUES (1, 'Apples', 5, 1.2, 6), (2, 'Pears', 2, 0.9, 1.8), (3, 'Bread', 1, 2.5, 2.5);

SQL> CREATE DATABASE customer_0011;
SQL> CREATE TABLE customer_0011.address (id INT UNSIGNED, name VARCHAR(255));
SQL> CREATE TABLE customer_0011.sales (id INT UNSIGNED, product VARCHAR(255), sales TINYINT, amount DECIMAL(6, 2), total_amount DECIMAL(6, 2));
SQL> INSERT INTO customer_0011.address VALUES (1, 'Customer 11 SE');
SQL> INSERT INTO customer_0011.sales VALUES (1, 'Oranges', 2, 1.7, 3.4), (2, 'Salad', 5, 1.2, 6);

-- On shard 2: 3 customers

SQL> CREATE DATABASE customer_0020;
SQL> CREATE TABLE customer_0020.address (id INT UNSIGNED, name VARCHAR(255));
SQL> CREATE TABLE customer_0020.sales (id INT UNSIGNED, product VARCHAR(255), sales TINYINT, amount DECIMAL(6, 2), total_amount DECIMAL(6, 2));
SQL> INSERT INTO customer_0020.address VALUES (1, 'Customer 20 AG');
SQL> INSERT INTO customer_0020.sales VALUES (1, 'Oranges', 2, 1.7, 3.4), (2, 'Salad', 5, 1.2, 6);

SQL> CREATE DATABASE customer_0021;
SQL> CREATE TABLE customer_0021.address (id INT UNSIGNED, name VARCHAR(255));
SQL> CREATE TABLE customer_0021.sales (id INT UNSIGNED, product VARCHAR(255), sales TINYINT, amount DECIMAL(6, 2), total_amount DECIMAL(6, 2));
SQL> INSERT INTO customer_0021.address VALUES (1, 'Customer 21 GmbH');
SQL> INSERT INTO customer_0021.sales VALUES (1, 'Oranges', 2, 1.7, 3.4), (2, 'Salad', 5, 1.2, 6);

SQL> CREATE DATABASE customer_0022;
SQL> CREATE TABLE customer_0022.address (id INT UNSIGNED, name VARCHAR(255));
SQL> CREATE TABLE customer_0022.sales (id INT UNSIGNED, product VARCHAR(255), sales TINYINT, amount DECIMAL(6, 2), total_amount DECIMAL(6, 2));
SQL> INSERT INTO customer_0022.address VALUES (1, 'Customer 22 Gebr.');
SQL> INSERT INTO customer_0022.sales VALUES (1, 'Oranges', 2, 1.7, 3.4), (2, 'Salad', 5, 1.2, 6);

-- On shard 3: 1 customer

SQL> CREATE DATABASE customer_0030;
SQL> CREATE TABLE customer_0030.address (id INT UNSIGNED, name VARCHAR(255));
SQL> CREATE TABLE customer_0030.sales (id INT UNSIGNED, product VARCHAR(255), sales TINYINT, amount DECIMAL(6, 2), total_amount DECIMAL(6, 2));
SQL> INSERT INTO customer_0030.address VALUES (1, 'Customer 30 GmbH');
SQL> INSERT INTO customer_0030.sales VALUES (1, 'Pickles', 2, 2.2, 4.4), (2, 'Salad', 1, 3.1, 3.1), (3, 'Pudding', 5, 2.2, 11.0), (4, 'Asparagus', 12, .3, 3.6);

Create roles and users

Since in a sharded system, in contrast to a Galera cluster for example, the individual database instances do not know anything about each other and do not communicate with each other, we have to create the roles and users or accounts individually on EACH shard.

MariaDB MaxScale needs a user for the SchemaRouter service and the monitor (on each shard).

As the name suggests, the monitor user is responsible for monitoring and the SchemaRouter service user is responsible for collecting the user account information from the sharding backends and forwarding the queries to the correct shard.

Since a redundant system typically works with at least two MaxScale routers and we wanted to prevent the privileges of the accounts from diverging, we work with roles for both the MaxScale users and the application users.

MaxScale Monitor User

SQL> CREATE ROLE maxscale_monitor_role;
SQL> GRANT SELECT ON mysql.user TO 'maxscale_monitor_role';
SQL> GRANT REPLICATION CLIENT ON *.* TO 'maxscale_monitor_role';
SQL> GRANT SLAVE MONITOR ON *.* TO 'maxscale_monitor_role';
SQL> GRANT FILE ON *.* TO 'maxscale_monitor_role';
SQL> GRANT CONNECTION ADMIN ON *.* TO 'maxscale_monitor_role';

SQL> SHOW GRANTS FOR maxscale_monitor_role;
+-----------------------------------------------------------------------------------------------+
| Grants for maxscale_monitor_role                                                               |
+-----------------------------------------------------------------------------------------------+
| GRANT FILE, BINLOG MONITOR, CONNECTION ADMIN, SLAVE MONITOR ON *.* TO `maxscale_monitor_role` |
| GRANT SELECT ON `mysql`.`user` TO `maxscale_monitor_role`                                     |
+-----------------------------------------------------------------------------------------------+

SQL> CREATE USER maxscale_monitor@'10.139.158.210' IDENTIFIED BY 'secret';
SQL> CREATE USER maxscale_monitor@'10.139.158.211' IDENTIFIED BY 'secret';
SQL> GRANT maxscale_monitor_role TO maxscale_monitor@'10.139.158.210';
SQL> GRANT maxscale_monitor_role TO maxscale_monitor@'10.139.158.211';
SQL> SET DEFAULT ROLE maxscale_monitor_role FOR maxscale_monitor@'10.139.158.210';
SQL> SET DEFAULT ROLE maxscale_monitor_role FOR maxscale_monitor@'10.139.158.211';

SQL> SELECT user, host, is_role, default_role FROM mysql.user WHERE user LIKE 'maxscale_monitor%';
+-----------------------+----------------+---------+-----------------------+
| User                  | Host           | is_role | default_role          |
+-----------------------+----------------+---------+-----------------------+
| maxscale_monitor_role |                | Y       |                       |
| maxscale_monitor      | 10.139.158.210 | N       | maxscale_monitor_role |
| maxscale_monitor      | 10.139.158.211 | N       | maxscale_monitor_role |
+-----------------------+----------------+---------+-----------------------+

SQL> SHOW GRANTS FOR maxscale_monitor@'10.139.158.211';
+------------------------------------------------------------------------------------------------------------------------------+
| Grants for maxscale_monitor@10.139.158.211                                                                                    |
+------------------------------------------------------------------------------------------------------------------------------+
| GRANT `maxscale_monitor_role` TO `maxscale_monitor`@`10.139.158.211`                                                          |
| GRANT USAGE ON *.* TO `maxscale_monitor`@`10.139.158.211` IDENTIFIED BY PASSWORD '*14E65567ABDB5135D0CFD9A70B3032C179A49EE7' |
| SET DEFAULT ROLE `maxscale_monitor_role` FOR `maxscale_monitor`@`10.139.158.211`                                              |
+------------------------------------------------------------------------------------------------------------------------------+

MaxScale Admin User

SQL> CREATE ROLE maxscale_admin_role;
SQL> GRANT SHOW DATABASES ON *.* TO 'maxscale_admin_role';
SQL> GRANT SELECT ON mysql.user TO 'maxscale_admin_role';
SQL> GRANT SELECT ON mysql.db TO 'maxscale_admin_role';
SQL> GRANT SELECT ON mysql.tables_priv TO 'maxscale_admin_role';
SQL> GRANT SELECT ON mysql.columns_priv TO 'maxscale_admin_role';
SQL> GRANT SELECT ON mysql.proxies_priv TO 'maxscale_admin_role';
SQL> GRANT SELECT ON mysql.roles_mapping TO 'maxscale_admin_role';
SQL> GRANT SELECT ON mysql.procs_priv TO 'maxscale_admin_role';

SQL> SHOW GRANTS FOR maxscale_admin_role;
+------------------------------------------------------------------+
| Grants for maxscale_admin_role                                   |
+------------------------------------------------------------------+
| GRANT USAGE ON *.* TO `maxscale_admin_role`                      |
| GRANT SELECT ON `mysql`.`user` TO `maxscale_admin_role`          |
| GRANT SELECT ON `mysql`.`roles_mapping` TO `maxscale_admin_role` |
| GRANT SELECT ON `mysql`.`tables_priv` TO `maxscale_admin_role`   |
| GRANT SELECT ON `mysql`.`procs_priv` TO `maxscale_admin_role`    |
| GRANT SELECT ON `mysql`.`db` TO `maxscale_admin_role`            |
| GRANT SELECT ON `mysql`.`columns_priv` TO `maxscale_admin_role`  |
| GRANT SELECT ON `mysql`.`proxies_priv` TO `maxscale_admin_role`  |
+------------------------------------------------------------------+

SQL> CREATE USER maxscale_admin@'10.139.158.210' IDENTIFIED BY 'secret';
SQL> CREATE USER maxscale_admin@'10.139.158.211' IDENTIFIED BY 'secret';
SQL> GRANT maxscale_admin_role TO maxscale_admin@'10.139.158.210';
SQL> GRANT maxscale_admin_role TO maxscale_admin@'10.139.158.211';
SQL> SET DEFAULT ROLE maxscale_admin_role FOR maxscale_admin@'10.139.158.210';
SQL> SET DEFAULT ROLE maxscale_admin_role FOR maxscale_admin@'10.139.158.211';

SQL> SELECT user, host, is_role, default_role FROM mysql.user WHERE user LIKE 'maxscale_admin%';
+---------------------+----------------+---------+---------------------+
| User                | Host           | is_role | default_role        |
+---------------------+----------------+---------+---------------------+
| maxscale_admin_role |                | Y       |                     |
| maxscale_admin      | 10.139.158.210 | N       | maxscale_admin_role |
| maxscale_admin      | 10.139.158.211 | N       | maxscale_admin_role |
+---------------------+----------------+---------+---------------------+

SQL> SHOW GRANTS FOR maxscale_admin@'10.139.158.211';
+----------------------------------------------------------------------------------------------------------------------------+
| Grants for maxscale_admin@10.139.158.211                                                                                    |
+----------------------------------------------------------------------------------------------------------------------------+
| GRANT `maxscale_admin_role` TO `maxscale_admin`@`10.139.158.211`                                                            |
| GRANT USAGE ON *.* TO `maxscale_admin`@`10.139.158.211` IDENTIFIED BY PASSWORD '*14E65567ABDB5135D0CFD9A70B3032C179A49EE7' |
| SET DEFAULT ROLE `maxscale_admin_role` FOR `maxscale_admin`@`10.139.158.211`                                                |
+----------------------------------------------------------------------------------------------------------------------------+

See also: SchemaRouter Configuration

Create application role and accounts

The application also requires a user, which we create as follows on every shard:

SQL> CREATE ROLE app_role;
SQL> GRANT SELECT, INSERT, UPDATE, DELETE ON `customer_%`.* TO 'app_role';
SQL> GRANT SHOW DATABASES ON *.* TO 'app_role';
SQL> GRANT CREATE, DROP, ALTER ON *.* TO 'app_role';   -- For creating new tenant databases

SQL> SHOW GRANTS FOR app_role;
+----------------------------------------------------------------------+
| Grants for app_role                                                  |
+----------------------------------------------------------------------+
| GRANT SHOW DATABASES ON *.* TO `app_role`                            |
| GRANT SELECT, INSERT, UPDATE, DELETE ON `customer_%`.* TO `app_role` |
+----------------------------------------------------------------------+

SQL> CREATE USER app@'10.139.158.%' IDENTIFIED BY 'secret';
SQL> GRANT app_role TO app@'10.139.158.%';
SQL> SET DEFAULT ROLE app_role FOR app@'10.139.158.%';

SQL> SELECT user, host, is_role, default_role FROM mysql.user WHERE user LIKE 'app%';
+----------+--------------+---------+--------------+
| User     | Host         | is_role | default_role |
+----------+--------------+---------+--------------+
| app_role |              | Y       |              |
| app      | 10.139.158.% | N       | app_role     |
+----------+--------------+---------+--------------+

SQL> SHOW GRANTS FOR app@'10.139.158.%';
+---------------------------------------------------------------------------------------------------------------+
| Grants for app@10.139.158.%                                                                                   |
+---------------------------------------------------------------------------------------------------------------+
| GRANT `app_role` TO `app`@`10.139.158.%`                                                                      |
| GRANT USAGE ON *.* TO `app`@`10.139.158.%` IDENTIFIED BY PASSWORD '*14E65567ABDB5135D0CFD9A70B3032C179A49EE7' |
| SET DEFAULT ROLE `app_role` FOR `app`@`10.139.158.%`                                                          |
+---------------------------------------------------------------------------------------------------------------+

Proxy protocol

Load balancers and proxies replace the client's IP address with their own IP address. On the one hand, this means that the database can no longer see where the client originally came from; on the other hand, access authorisations can no longer be assigned per user and IP, as the load balancer's IP is always checked.

These two problems can be solved using the proxy protocol.

To do this, both the database and the load balancer, in this case MaxScale, must have the proxy protocol activated.

On the database side, the proxy protocol is activated as follows:

#
# my.cnf
#

[mariadbd]
proxy_protocol_networks = ::1, 10.139.158.0/24, localhost

and on the MaxScale side with:

#
# /etc/maxscale.cnf
#

[shard]
type = server
proxy_protocol = true

You can check the two settings with:

SQL> SHOW GLOBAL VARIABLES LIKE 'proxy%';
+-------------------------+---------------------------------+
| Variable_name           | Value                           |
+-------------------------+---------------------------------+
| proxy_protocol_networks | ::1, 10.139.158.0/24, localhost |
+-------------------------+---------------------------------+

shell> maxctrl show server shard1 | grep proxy
│ │     "proxy_protocol": true, │

Sources:


MaxScale SchemaRouter configuration

Next, we prepare the MaxScale configuration for sharding. The file recommended by MariaDB is /etc/maxscale.cnf. Whether it makes more sense to create a separate configuration file under /etc/maxscale.cnf.d/ or even to configure the entire MaxScale dynamically (/var/lib/maxscale/maxscale.cnf.d/*.cnf) remains to be seen in the long term. See also warnings below. The configuration file for this sharding PoC looks like this:

#
# /etc/maxscale.cnf
#

[maxscale]
threads = auto
admin_gui = false

[shard1]
type = server
address = 10.139.158.1
port = 3363
proxy_protocol = true

[shard2]
type = server
address = 10.139.158.1
port = 3364
proxy_protocol = true

[shard3]
type = server
address = 10.139.158.1
port = 3365
proxy_protocol = true

[Sharding-Monitor]
type = monitor
module = galeramon
servers = shard1,shard2,shard3
user = maxscale_monitor
password = secret
monitor_interval = 1s

[Sharded-Service-Listener]
type = listener
service = Sharded-Service
protocol = MariaDBClient
port = 3306

[Sharded-Service]
type = service
router = schemarouter
servers = shard1,shard2,shard3
user = maxscale_admin
password = secret
auth_all_servers = true

Note: Recommendation of the MaxScale developer: "One workaround might be to actually use galeramon to monitor the nodes instead of mariadbmon."

Starting and stopping the MaxScale Load Balancer

MaxScale is started and stopped as usual via SystemD:

shell> systemctl restart maxscale
shell> systemctl status maxscale
● maxscale.service - MariaDB MaxScale Database Proxy
     Loaded: loaded (/lib/systemd/system/maxscale.service; enabled; vendor preset: enabled)
    Drop-In: /run/systemd/system/service.d
             └─zzz-lxc-service.conf
     Active: active (running) since Tue 2024-02-27 09:52:57 UTC; 39s ago
    Process: 187 ExecStart=/usr/bin/maxscale (code=exited, status=0/SUCCESS)
   Main PID: 188 (maxscale)
      Tasks: 10 (limit: 18663)
     Memory: 4.6M
        CPU: 150ms
     CGroup: /system.slice/maxscale.service
             └─188 /usr/bin/maxscale

systemd[1]: Starting MariaDB MaxScale Database Proxy...
maxscale[188]: Module 'galeramon' loaded from '/usr/lib/x86_64-linux-gnu/maxscale/libgaleramon.so'.
maxscale[188]: Module 'schemarouter' loaded from '/usr/lib/x86_64-linux-gnu/maxscale/libschemarouter.so'.
maxscale[188]: Using up to 2.3GiB of memory for query classifier cache
systemd[1]: Started MariaDB MaxScale Database Proxy.

If there were errors or warnings, you can see them in the MaxScale error log:

shell> grep -v notice /var/log/maxscale/maxscale.log
2024-02-13 16:47:22   MariaDB MaxScale is shut down.
----------------------------------------------------
MariaDB MaxScale  /var/log/maxscale/maxscale.log  Tue Feb 13 16:47:22 2024
----------------------------------------------------------------------------
2024-02-27 09:52:56   warning: Discarding journal file '/var/lib/maxscale/Sharding-Monitor_journal.json'. File is for module 'mariadbmon'. Current module is 'galeramon'.
2024-02-27 09:52:56   warning: [galeramon] Invalid 'wsrep_local_index' on server 'shard1': 18446744073709551615

Application tests
Simple application tests

shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 --execute='show databases'
+--------------------+
| Database           |
+--------------------+
| customer_0010      |
| customer_0011      |
| customer_0020      |
| customer_0021      |
| customer_0022      |
| customer_0030      |
| information_schema |
| mysql              |
| performance_schema |
| sys                |
+--------------------+

New command show shards

shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 customer_0030 --execute='show shards' | grep customer_00.* | sort | column -t
customer_0010.address  shard1
customer_0010.sales    shard1
customer_0010.         shard1
customer_0011.address  shard1
customer_0011.sales    shard1
customer_0011.         shard1
customer_0020.address  shard2
customer_0020.sales    shard2
customer_0020.         shard2
customer_0021.address  shard2
customer_0021.sales    shard2
customer_0021.         shard2
customer_0022.address  shard2
customer_0022.sales    shard2
customer_0022.         shard2
customer_0030.address  shard3
customer_0030.sales    shard3
customer_0030.         shard3

New databases are not displayed immediately, but only when the cached data has been updated (refresh_interval (300s / 5 min)).
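
If you do not want to wait for the cache to expire, the database map cache can also be invalidated manually, as described in more detail below:

shell> maxctrl call command schemarouter invalidate Sharded-Service
OK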

See also: Custom SQL commands

More general test

As a reminder:

Shard   Port   Customer          State
#1      3363   customer_001<n>   Running
#2      3364   customer_002<n>   Running
#3      3365   customer_003<n>   Running
#4      3366   customer_004<n>   Running

shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 --execute='SELECT @@port'
+--------+
| @@port |
+--------+
|   3363 |
+--------+

shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 --database=customer_0010 --execute='SELECT @@port'
+--------+
| @@port |
+--------+
|   3363 |
+--------+

shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 --database=customer_0020 --execute='SELECT @@port'
+--------+
| @@port |
+--------+
|   3364 |
+--------+

shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 --execute='use customer_0020; SELECT @@port'
+--------+
| @@port |
+--------+
|   3364 |
+--------+

shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 customer_0010 --execute='SELECT @@port'
+--------+
| @@port |
+--------+
|   3363 |
+--------+

shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 customer_0020 --execute='SELECT @@port'
+--------+
| @@port |
+--------+
|   3364 |
+--------+

shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 customer_0030 --execute='SELECT @@port'
+--------+
| @@port |
+--------+
|   3365 |
+--------+

Less simple (backup) test

shell> mariadb-dump --user=app --password=secret --host=10.139.158.211 --port=3306 --single-transaction customer_0010 > /tmp/customer_0010.sql
shell> echo $?
0
shell> mariadb-dump --user=app --password=secret --host=10.139.158.211 --port=3306 --single-transaction customer_0020 > /tmp/customer_0020.sql
shell> echo $?
0
shell> mariadb-dump --user=app --password=secret --host=10.139.158.211 --port=3306 --single-transaction customer_0030 > /tmp/customer_0030.sql
shell> echo $?
0
shell> mariadb-dump --user=app --password=secret --host=10.139.158.211 --port=3306 --single-transaction --databases customer_0011 > /tmp/customer_0011.sql
shell> echo $?
0
shell> mariadb-dump --user=app --password=secret --host=10.139.158.211 --port=3306 --single-transaction --databases customer_0021 > /tmp/customer_0021.sql
shell> echo $?
0
shell> mariadb-dump --user=app --password=secret --host=10.139.158.211 --port=3306 --single-transaction --databases customer_0030 > /tmp/customer_0030.sql
shell> echo $?
0

shell> ll /tmp/customer_00*sql
-rw-rw-r-- 1 oli oli 2738 Mar 18 12:07 /tmp/customer_0010.sql
-rw-rw-r-- 1 oli oli 2904 Mar 18 12:08 /tmp/customer_0011.sql
-rw-rw-r-- 1 oli oli 2712 Mar 18 12:08 /tmp/customer_0020.sql
-rw-rw-r-- 1 oli oli 2906 Mar 18 12:08 /tmp/customer_0021.sql
-rw-rw-r-- 1 oli oli 2964 Mar 18 12:08 /tmp/customer_0030.sql

shell> tail -n 1 /tmp/customer_*.sql
==> /tmp/customer_0010.sql <==
-- Dump completed on 2024-02-13 14:39:21

==> /tmp/customer_0011.sql <==
-- Dump completed on 2024-02-13 14:39:35

==> /tmp/customer_0020.sql <==
-- Dump completed on 2024-02-13 14:40:15

==> /tmp/customer_0021.sql <==
-- Dump completed on 2024-02-13 14:40:42

==> /tmp/customer_0030.sql <==
-- Dump completed on 2024-02-13 14:40:52

shell> cat /tmp/customer_00*sql | grep -A1 -i insert
INSERT INTO `address` VALUES (1,'Customer 10 GmbH');
--
INSERT INTO `sales` VALUES (1,'Apples',5,1.20,6.00),
--
INSERT INTO `address` VALUES (1,'Customer 11 SE');
--
INSERT INTO `sales` VALUES (1,'Oranges',2,1.70,3.40),
--
INSERT INTO `address` VALUES (1,'Customer 20 AG');
--
INSERT INTO `sales` VALUES (1,'Oranges',2,1.70,3.40),
--
INSERT INTO `address` VALUES (1,'Customer 21 GmbH');
--
INSERT INTO `sales` VALUES (1,'Oranges',2,1.70,3.40),
--
INSERT INTO `address` VALUES (1,'Customer 30 GmbH');
--
INSERT INTO `sales` VALUES (1,'Pickles',2,2.20,4.40),

In MaxScale 23.08.4 there was a pretty bad bug: A return value of 0 but no data in the backup!!! See also the tickets: MXS-4966: mariadb-dump gets an error dumping schemas and MXS-4947: Tables in information_schema are treated as a normal tables. Symptoms of the bug look like this:

Error: Couldn't read status information for table address ()
Error: Couldn't read status information for table sales ()

We therefore strongly recommend upgrading to MaxScale 23.08.5!

More complex application tests

We have created a somewhat more complex test (./sharding_test.php) that processes the following queries:

SET NAMES utf8mb4
SHOW DATABASES
use customer_
START TRANSACTION;
SELECT MIN(id) AS first, MAX(id) AS last FROM `sales`
INSERT INTO sales (id, product, sales, amount, total_amount) VALUES (%d, '%s', %f, %f, %f)
UPDATE sales SET product = 'Prepare to delete' WHERE id = %d
DELETE FROM sales WHERE id = %d
COMMIT

This test ran flawlessly. The corresponding control query:

SQL> SELECT * FROM customer_0021.sales WHERE id >= (SELECT MAX(id) - 10 FROM customer_0021.sales);

Various load scenarios can also be tested with db_bench or the Acronis perfkit. For more information, see here.

Cross-shard tests

In any case, you might come up with the idea of running cross-shard queries. This will NOT work, which should not really be surprising: firstly, because it is not easy to implement, and secondly, because it is documented here:

"Note: As the sharding solution in MaxScale is relatively simple, cross-database queries between two or more shards are not supported."

Source: Simple Sharding with Two Servers

and

"USE db1 is routed to the server with db1. If the database is divided to multiple servers, only one server will get the command."

Source: SchemaRouter.

Here is a test with UNION:

SQL> use customer_0030
Database changed
SQL> SELECT * FROM customer_0020.sales UNION SELECT * FROM customer_0030.sales;
ERROR 1146 (42S02): Table 'customer_0020.sales' doesn't exist

And here is the proof to the contrary:

SQL> use customer_0020
Database changed
SQL> SELECT * FROM customer_0020.sales UNION SELECT * FROM customer_0030.sales;
ERROR 1146 (42S02): Table 'customer_0030.sales' doesn't exist

And here is the test with JOIN:

SQL> use customer_0020
SQL> SELECT * FROM customer_0020.sales a JOIN customer_0030.sales b ON a.id = b.id WHERE a.sales > 1;
ERROR 1146 (42S02): Table 'customer_0030.sales' doesn't exist

SQL> use customer_0030
SQL> SELECT * FROM customer_0020.sales a JOIN customer_0030.sales b ON a.id = b.id WHERE a.sales > 1;
ERROR 1146 (42S02): Table 'customer_0020.sales' doesn't exist

Operation of a MaxScale sharding system

In this chapter we discuss some points that can be useful for the operation of a MariaDB MaxScale sharding system.

Do-on-all-shards

Since it can always happen that O/S or database operations have to be executed on all shards, it would certainly make sense to create a script that executes the same command on all shards in turn:

shell> ./do-on-all-shards.sh --sql='SHOW DATABASES'

A script of this type should greatly reduce the error rate during operation. Operations such as the re-sharding of a tenant, as described below, are also sensibly scripted and executed centrally (MXS-5029).
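
A minimal sketch of such a script could look like this (the shard list and the credentials are hard-coded assumptions from this PoC):

#!/bin/bash
#
# do-on-all-shards.sh - run the same SQL statement on all shards in turn.
# Minimal sketch: adapt shard list, user and password to your environment.
#

SHARDS='10.139.158.1:3364 10.139.158.1:3365 10.139.158.1:3366'
SQL="${1#--sql=}"               # strip the --sql= prefix from the first argument

for shard in ${SHARDS} ; do
  host="${shard%:*}"
  port="${shard#*:}"
  echo "--- ${host}:${port}"
  mariadb --user=app --password=secret --host=${host} --port=${port} \
          --execute="${SQL}" || echo "ERROR on ${host}:${port}"
done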

Invalidating the database map cache

The invalidate command can be used to invalidate the database map cache of the MariaDB MaxScale SchemaRouter. This allows us to quickly update the cache after adding or removing tenants.

shell> maxctrl call command schemarouter invalidate Sharded-Service
OK

In contrast to the invalidate command, which updates the entries after the next refresh_interval, the clear command deletes the entries and a remap is executed immediately.
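
The clear variant is invoked analogously:

shell> maxctrl call command schemarouter clear Sharded-Service
OK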

If you want to invalidate the database map cache remotely with a REST API call, you can do this as follows:

shell> curl -i -X POST -u api_admin:secret http://10.139.158.211:8989/v1/maxscale/modules/schemarouter/clear?Sharded-Service
HTTP/1.1 204 No Content
Connection: close
Date: Mon, 18 Mar 24 11:49:58 GMT
X-Frame-Options: Deny
X-XSS-Protection: 1
Referrer-Policy: same-origin
Cache-Control: no-cache

Sources:


How to change SchemaRouter variables dynamically?

The refresh_interval specifies the lifetime of the entries in the SchemaRouter database map cache. The default value is 300 s (5 min). Refresh interval is therefore, in my opinion, an unfortunate term, as it does not define the interval between two mappings but the lifetime of the cache entries (lifetime? timeout?). As soon as an entry has been deleted, a new refresh of the "database map" is triggered on each shard. The command currently looks like this:

SELECT LOWER(t.table_schema), LOWER(t.table_name) FROM information_schema.tables t
UNION ALL
SELECT LOWER(s.schema_name), '' FROM information_schema.schemata s

It looks like a simple connect is enough to trigger the refresh of the database map.
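
For example, a trivial connection like the following should already trigger the refresh (an assumption based on the observation above):

shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 --execute='SELECT 1'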

The current value for refresh_interval can be queried as follows:

shell> maxctrl show service Sharded-Service | grep refresh_interval | awk -F'│' '{ print $3 }'
 "refresh_interval": "300000ms",

The following command helps to change the value dynamically:

shell> MAXCTRL_WARNINGS=0 maxctrl alter service Sharded-Service refresh_interval=10s
OK

The value should not be set too small, as all other connections are stopped during the mapping process.

Sources:


Adding and removing a tenant

Adding a new tenant to a shard is not a major problem:

SQL> CREATE DATABASE customer_0029;
SQL> use customer_0029
SQL> CREATE TABLE address LIKE customer_template.address;
SQL> CREATE TABLE sales LIKE customer_template.sales;

shell> maxctrl call command schemarouter invalidate Sharded-Service
OK

Removing a tenant from a shard, on the other hand, is somewhat more complicated and must be done in consultation with the application:

SQL> DROP DATABASE customer_0011;

shell> ./sharding_test.php
.....ERROR: Table 'customer_0011.sales' doesn't exist...ERROR: Unknown database 'customer_0011'.ERROR: Unknown database 'customer_0011'......ERROR: Unknown database 'customer_0011'...

shell> maxctrl call command schemarouter clear Sharded-Service
OK

At least I have not come up with a cleverer variant yet. See also Moving a tenant below.

Moving a tenant

The combination of adding and removing would then be moving a tenant from one shard to another shard, also known as re-sharding. This also requires a concerted action to be planned together with the application.

If this is not possible, at least the time that the application receives errors can be reduced... The following procedure can be used to move a tenant from shard 2 to shard 3:

SQL> use customer_0020; LOCK TABLES address READ, sales READ;
-- On shard 2; at best, the application will be blocked!

shell> mariadb-dump --user=app --password=secret --host=10.139.158.1 --port=3364 --single-transaction --skip-add-locks --databases customer_0020 | mariadb --user=app --password=secret --host=10.139.158.1 --port=3365
# Copy tenant 20 from shard 2 to shard 3.

SQL> DROP DATABASE customer_0020;
-- Deleting tenant 20 does not work!
ERROR 1192 (HY000): Can't execute the given command because you have active locked tables or an active transaction

SQL> UNLOCK TABLES; DROP DATABASE customer_0020;
-- This is how to delete tenant 20.

shell> maxctrl call command schemarouter clear Sharded-Service
# Update the MaxScale database map. Do it quickly!!!

Until the database map is refreshed, the following errors may occur:

error : (47621) [schemarouter] (Sharded-Service); 'customer_0020.' found on servers 'shard2','shard3' for user 'app'@'10.139.158.1'.
error : (47621) [schemarouter] (Sharded-Service); 'customer_0020.address' found on servers 'shard2','shard3' for user 'app'@'10.139.158.1'.
error : (47621) [schemarouter] (Sharded-Service); 'customer_0020.sales' found on servers 'shard2','shard3' for user 'app'@'10.139.158.1'.
error : (47621) [schemarouter] (Sharded-Service); Duplicate tables found, closing session.

And on the application side too:

ERROR: Error: duplicate tables found on two different shards
Adding or removing a shard

Moving a tenant from one shard to another shard is a small re-sharding operation. It becomes somewhat more complex if you want to add new shards or remove old shards. Subsequently (after the addition or before the removal), a large re-sharding would then take place. The first step is to add a shard to the cluster:

shell> maxctrl list servers
┌────────┬──────────────┬──────┬─────────────┬─────────┬───────────────┬──────────────────┐
│ Server │ Address      │ Port │ Connections │ State   │ GTID          │ Monitor          │
├────────┼──────────────┼──────┼─────────────┼─────────┼───────────────┼──────────────────┤
│ shard1 │ 10.139.158.1 │ 3363 │ 0           │ Running │ 0-3363-26014  │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼───────────────┼──────────────────┤
│ shard2 │ 10.139.158.1 │ 3364 │ 0           │ Running │ 0-3364-240612 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼───────────────┼──────────────────┤
│ shard3 │ 10.139.158.1 │ 3365 │ 0           │ Running │ 0-3365-289873 │ Sharding-Monitor │
└────────┴──────────────┴──────┴─────────────┴─────────┴───────────────┴──────────────────┘

The prepared shard is made known to MaxScale:

shell> maxctrl create server shard4 10.139.158.1 3366
OK

shell> maxctrl list servers
┌────────┬──────────────┬──────┬─────────────┬─────────┬──────────────┬──────────────────┐
│ Server │ Address      │ Port │ Connections │ State   │ GTID         │ Monitor          │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard1 │ 10.139.158.1 │ 3363 │ 1           │ Running │ 0-3363-23676 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard2 │ 10.139.158.1 │ 3364 │ 1           │ Running │ 0-3364-52321 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard3 │ 10.139.158.1 │ 3365 │ 1           │ Running │ 0-3365-39751 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard4 │ 10.139.158.1 │ 3366 │ 0           │ Down    │              │                  │
└────────┴──────────────┴──────┴─────────────┴─────────┴──────────────┴──────────────────┘

The new shard is then linked to the MaxScale Monitor and the service:

shell> MAXCTRL_WARNINGS=0 maxctrl link monitor Sharding-Monitor shard4
OK
shell> MAXCTRL_WARNINGS=0 maxctrl link service Sharded-Service shard4
OK

shell> maxctrl list servers
┌────────┬──────────────┬──────┬─────────────┬─────────┬──────────────┬──────────────────┐
│ Server │ Address      │ Port │ Connections │ State   │ GTID         │ Monitor          │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard1 │ 10.139.158.1 │ 3363 │ 1           │ Running │ 0-3363-24961 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard2 │ 10.139.158.1 │ 3364 │ 1           │ Running │ 0-3364-56215 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard3 │ 10.139.158.1 │ 3365 │ 1           │ Running │ 0-3365-45177 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard4 │ 10.139.158.1 │ 3366 │ 1           │ Running │ 0-3366-32    │ Sharding-Monitor │
└────────┴──────────────┴──────┴─────────────┴─────────┴──────────────┴──────────────────┘

Whether this second step is also absolutely necessary was not investigated.

You can follow the entire process in the MariaDB MaxScale error log:

warning: Discarding journal file '/var/lib/maxscale/Sharding-Monitor_journal.json'. Servers described in the journal are different from the ones configured on the current monitor.
warning: Saving runtime modifications to 'Sharding-Monitor' in '/var/lib/maxscale/maxscale.cnf.d/Sharding-Monitor.cnf'. The modified values will override the values found in the static configuration files.
notice : shard4 sent version string '10.11.7-MariaDB-log'. Detected type: MariaDB, version: 10.11.7.
notice : Server 'shard4' charset: latin1_swedish_ci
notice : Server changed state: shard4[10.139.158.1:3366]: server_up. [Down] -> [Running]
warning: Saving runtime modifications to 'Sharded-Service' in '/var/lib/maxscale/maxscale.cnf.d/Sharded-Service.cnf'. The modified values will override the values found in the static configuration files.
notice : Added 'shard4' to 'Sharded-Service'

What we must not forget here is to also equip the new shard with the proxy protocol:

shell> maxctrl show server shard4 | grep proxy
│ │     "proxy_protocol": false, │

shell> MAXCTRL_WARNINGS=0 maxctrl alter server shard4 proxy_protocol=true
OK

And now new tenants can be added to the new shard or old tenants can be moved to the new shard... In our setup, we want to move all tenants from shard 1 to shard 4 and also create a new tenant customer_0040 on shard 4. The individual steps required for this are listed above.

Once shard 1 has been emptied, it can be dismantled:

shell> maxctrl list servers
┌────────┬──────────────┬──────┬─────────────┬─────────┬──────────────┬──────────────────┐
│ Server │ Address      │ Port │ Connections │ State   │ GTID         │ Monitor          │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard1 │ 10.139.158.1 │ 3363 │ 1           │ Running │ 0-3363-25916 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard2 │ 10.139.158.1 │ 3364 │ 1           │ Running │ 0-3364-62887 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard3 │ 10.139.158.1 │ 3365 │ 1           │ Running │ 0-3365-54035 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard4 │ 10.139.158.1 │ 3366 │ 1           │ Running │ 0-3366-2247  │ Sharding-Monitor │
└────────┴──────────────┴──────┴─────────────┴─────────┴──────────────┴──────────────────┘

A shard is deleted with the destroy server command. Before this works, however, the shard must be removed from the monitor and the service:

shell> MAXCTRL_WARNINGS=0 maxctrl unlink service Sharded-Service shard1
OK
shell> MAXCTRL_WARNINGS=0 maxctrl unlink monitor Sharding-Monitor shard1
OK

shell> maxctrl list servers
┌────────┬──────────────┬──────┬─────────────┬─────────┬──────────────┬──────────────────┐
│ Server │ Address      │ Port │ Connections │ State   │ GTID         │ Monitor          │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard1 │ 10.139.158.1 │ 3363 │ 0           │ Running │              │                  │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard2 │ 10.139.158.1 │ 3364 │ 1           │ Running │ 0-3364-64394 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard3 │ 10.139.158.1 │ 3365 │ 1           │ Running │ 0-3365-56072 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard4 │ 10.139.158.1 │ 3366 │ 1           │ Running │ 0-3366-3267  │ Sharding-Monitor │
└────────┴──────────────┴──────┴─────────────┴─────────┴──────────────┴──────────────────┘

Once the shard has been removed from the monitor and the service, it can then be deleted:

shell> maxctrl destroy server shard1
Warning: Object 'shard1' is defined in a static configuration file and cannot be permanently deleted. If MaxScale is restarted, the object will appear again.

To hide these warnings, run:

 export MAXCTRL_WARNINGS=0

OK

shell> maxctrl list servers
┌────────┬──────────────┬──────┬─────────────┬─────────┬──────────────┬──────────────────┐
│ Server │ Address      │ Port │ Connections │ State   │ GTID         │ Monitor          │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard2 │ 10.139.158.1 │ 3364 │ 1           │ Running │ 0-3364-65018 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard3 │ 10.139.158.1 │ 3365 │ 1           │ Running │ 0-3365-56886 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard4 │ 10.139.158.1 │ 3366 │ 1           │ Running │ 0-3366-3648  │ Sharding-Monitor │
└────────┴──────────────┴──────┴─────────────┴─────────┴──────────────┴──────────────────┘

And you can follow the changes in the MaxScale Error Log:

notice : Removed 'shard1' from 'Sharded-Service'
warning: Discarding journal file '/var/lib/maxscale/Sharding-Monitor_journal.json'. Servers described in the journal are different from the ones configured on the current monitor.
notice : Destroyed server 'shard1' at 10.139.158.1:3363

Important: I was informed that with destroy server --force the unlink service and unlink monitor commands are automatically executed by MaxScale.

Source:


Customising the configuration files

During the shard operations described above we received some warnings:

Warning: Object 'shard1' is defined in a static configuration file and cannot be permanently deleted. If MaxScale is restarted, the object will appear again.

and

Warning: Saving runtime modifications to 'Sharding-Monitor' in '/var/lib/maxscale/maxscale.cnf.d/Sharding-Monitor.cnf'. The modified values will override the values found in the static configuration files.

The corresponding configuration files are automatically created by MaxScale when dynamic system changes are made:

shell> ll /var/lib/maxscale/maxscale.cnf.d/ /etc/maxscale.cnf
-rw-r--r-- 1 root root 612 Feb 13 14:23 /etc/maxscale.cnf

/var/lib/maxscale/maxscale.cnf.d/:
total 12
-rw------- 1 maxscale maxscale 187 Feb 13 16:08 Sharding-Monitor.cnf
-rw------- 1 maxscale maxscale 150 Feb 13 16:07 Sharded-Service.cnf
-rw------- 1 maxscale maxscale  52 Feb 13 15:46 shard4.cnf

shell> cat /var/lib/maxscale/maxscale.cnf.d/*
[Sharded-Service]
debug=true
refresh_interval=10000ms
auth_all_servers=true
log_debug=true
password=secret
router=schemarouter
type=service
user=maxscale_admin
targets=shard2,shard3,shard4

[Sharding-Monitor]
module=galeramon
monitor_interval=1000ms
password=secret
servers=shard2,shard3,shard4
type=monitor
user=maxscale_monitor

[shard4]
address=10.139.158.1
port=3366
type=server

The configuration files still need to be improved accordingly. You should generally consider whether you should not configure everything dynamically via commands in a highly dynamic system...

Maintenance work on the shard

If a shard is to be taken offline for maintenance work, shard2 in this example, this can be done as follows:

shell> maxctrl list servers ┌────────┬──────────────┬──────┬─────────────┬─────────┬──────────────┬──────────────────┐ │ Server │ Address │ Port │ Connections │ State │ GTID │ Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤ │ shard2 │ 10.139.158.1 │ 3364 │ 1 │ Running │ 0-3364-69817 │ Sharding-Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤ │ shard3 │ 10.139.158.1 │ 3365 │ 1 │ Running │ 0-3365-63166 │ Sharding-Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤ │ shard4 │ 10.139.158.1 │ 3366 │ 1 │ Running │ 0-3366-6902 │ Sharding-Monitor │ └────────┴──────────────┴──────┴─────────────┴─────────┴──────────────┴──────────────────┘ shell> maxctrl set server shard2 drain OK shell> maxctrl set server shard2 maintenance OK shell> maxctrl list servers ┌────────┬──────────────┬──────┬─────────────┬──────────────────────┬───────────────┬──────────────────┐ │ Server │ Address │ Port │ Connections │ State │ GTID │ Monitor │ ├────────┼──────────────┼──────┼─────────────┼──────────────────────┼───────────────┼──────────────────┤ │ shard2 │ 10.139.158.1 │ 3364 │ 0 │ Maintenance, Running │ 0-3364-240612 │ Sharding-Monitor │ ├────────┼──────────────┼──────┼─────────────┼──────────────────────┼───────────────┼──────────────────┤ │ shard3 │ 10.139.158.1 │ 3365 │ 0 │ Running │ 0-3365-289873 │ Sharding-Monitor │ ├────────┼──────────────┼──────┼─────────────┼──────────────────────┼───────────────┼──────────────────┤ │ shard4 │ 10.139.158.1 │ 3366 │ 0 │ Running │ 0-3366-119848 │ Sharding-Monitor │ └────────┴──────────────┴──────┴─────────────┴──────────────────────┴───────────────┴──────────────────┘

At this point, maintenance work can be carried out on the machine or the database...

Afterwards, BOTH statuses must be cleared again if both have been set (MXS-5028):

shell> maxctrl clear server shard2 maintenance OK shell> maxctrl clear server shard2 drain OK shell> maxctrl list servers ┌────────┬──────────────┬──────┬─────────────┬─────────┬───────────────┬──────────────────┐ │ Server │ Address │ Port │ Connections │ State │ GTID │ Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼───────────────┼──────────────────┤ │ shard2 │ 10.139.158.1 │ 3364 │ 0 │ Running │ 0-3364-240612 │ Sharding-Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼───────────────┼──────────────────┤ │ shard3 │ 10.139.158.1 │ 3365 │ 0 │ Running │ 0-3365-289873 │ Sharding-Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼───────────────┼──────────────────┤ │ shard4 │ 10.139.158.1 │ 3366 │ 0 │ Running │ 0-3366-119848 │ Sharding-Monitor │ └────────┴──────────────┴──────┴─────────────┴─────────┴───────────────┴──────────────────┘

The difference between drain and maintenance is the following: with drain, no new connections to the shard are allowed, but existing connections are allowed to continue until they are closed. With maintenance, existing connections are terminated immediately and forcibly.
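To watch the connections on a draining shard die down before starting the maintenance window, you can simply poll the server list, for example like this:

shell> watch -n 1 'maxctrl list servers'

As soon as the Connections column for the drained shard shows 0, the maintenance work can begin.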

Observation of a MariaDB MaxScale sharding system

The MaxScale CLI client maxctrl can be used to query the status of the MariaDB MaxScale load balancer. There are numerous commands for this, mainly list and show:

shell> maxctrl show module schemarouter | head -n 12 ┌─────────────┬────────────────────────────────────────────────┐ │ Module │ schemarouter │ ├─────────────┼────────────────────────────────────────────────┤ │ Type │ Router │ ├─────────────┼────────────────────────────────────────────────┤ │ Version │ V1.0.0 │ ├─────────────┼────────────────────────────────────────────────┤ │ Maturity │ Beta │ ├─────────────┼────────────────────────────────────────────────┤ │ Description │ A database sharding router for simple sharding │ ├─────────────┼────────────────────────────────────────────────┤ │ Parameters │ ... │ shell> maxctrl list servers ┌────────┬──────────────┬──────┬─────────────┬─────────┬───────────────┬──────────────────┐ │ Server │ Address │ Port │ Connections │ State │ GTID │ Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼───────────────┼──────────────────┤ │ shard2 │ 10.139.158.1 │ 3364 │ 4 │ Running │ 0-3364-290859 │ Sharding-Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼───────────────┼──────────────────┤ │ shard3 │ 10.139.158.1 │ 3365 │ 4 │ Running │ 0-3365-322671 │ Sharding-Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼───────────────┼──────────────────┤ │ shard4 │ 10.139.158.1 │ 3366 │ 4 │ Running │ 0-3366-140018 │ Sharding-Monitor │ └────────┴──────────────┴──────┴─────────────┴─────────┴───────────────┴──────────────────┘

The information in the Connections column is confusing, because in this case we actually only have 1, 1 and 2 connections open against this sharding system, not 4 on each shard.

However, if you look at the situation on the respective shard with SHOW PROCESSLIST, you can see that MaxScale establishes a connection to EACH shard for every incoming client connection. So the display above is technically correct, just not what you might expect:

SQL> SHOW PROCESSLIST; +--------+------------------+----------------------+---------------+---------+------+----------+-----------------------------------------------------------------+----------+ | Id | User | Host | db | Command | Time | State | Info | Progress | +--------+------------------+----------------------+---------------+---------+------+----------+-----------------------------------------------------------------+----------+ | 123 | root | localhost | customer_0021 | Query | 0 | starting | show processlist | 0.000 | | 68107 | maxscale_monitor | 10.139.158.211:35418 | NULL | Sleep | 0 | | NULL | 0.000 | | 113372 | app | 10.139.158.1:47548 | NULL | Sleep | 47 | | NULL | 0.000 | | 113538 | app | 10.139.158.1:49058 | NULL | Sleep | 41 | | NULL | 0.000 | | 113662 | app | 10.139.158.1:47072 | NULL | Sleep | 37 | | NULL | 0.000 | | 114789 | app | 10.139.158.1:39574 | customer_0022 | Query | 0 | Updating | UPDATE sales SET product = 'Prepare to delete' WHERE id = 15622 | 0.000 | +--------+------------------+----------------------+---------------+---------+------+----------+-----------------------------------------------------------------+----------+

This does not scale for large systems with hundreds or thousands of clients! The MariaDB thread pool feature might help in such a case.

According to the MaxScale developer, this is expected behaviour... (MXS-4977)
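If the many mostly idle backend connections become a problem on the shard side, the MariaDB thread pool mentioned above can be enabled in the database configuration. A sketch; the values must be tuned to your workload (thread_pool_size defaults to the number of CPUs and is shown here only as an example):

#
# my.cnf
#

[mariadbd]
thread_handling  = pool-of-threads
thread_pool_size = 8

Note that this only reduces the cost of idle connections on the database side; the number of connections MaxScale opens per client session remains unchanged.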

shell> maxctrl list services ┌─────────────────┬──────────────┬─────────────┬───────────────────┬────────────────────────┐ │ Service │ Router │ Connections │ Total Connections │ Targets │ ├─────────────────┼──────────────┼─────────────┼───────────────────┼────────────────────────┤ │ Sharded-Service │ schemarouter │ 4 │ 82776 │ shard2, shard3, shard4 │ └─────────────────┴──────────────┴─────────────┴───────────────────┴────────────────────────┘ shell> maxctrl list listeners ┌──────────────────────────┬──────┬──────┬─────────┬─────────────────┐ │ Name │ Port │ Host │ State │ Service │ ├──────────────────────────┼──────┼──────┼─────────┼─────────────────┤ │ Sharded-Service-Listener │ 3306 │ :: │ Running │ Sharded-Service │ └──────────────────────────┴──────┴──────┴─────────┴─────────────────┘ shell> maxctrl list monitors ┌──────────────────┬─────────┬────────────────────────┐ │ Monitor │ State │ Servers │ ├──────────────────┼─────────┼────────────────────────┤ │ Sharding-Monitor │ Running │ shard2, shard3, shard4 │ └──────────────────┴─────────┴────────────────────────┘ shell> maxctrl show server shard2 | head -n 20 ┌─────────────────────┬──────────────────────────────────────────────┐ │ Server │ shard2 │ ├─────────────────────┼──────────────────────────────────────────────┤ │ Source │ /etc/maxscale.cnf │ ├─────────────────────┼──────────────────────────────────────────────┤ │ Address │ 10.139.158.1 │ ├─────────────────────┼──────────────────────────────────────────────┤ │ Port │ 3364 │ ├─────────────────────┼──────────────────────────────────────────────┤ │ State │ Running │ ├─────────────────────┼──────────────────────────────────────────────┤ │ Version │ 10.11.7-MariaDB-log │ ├─────────────────────┼──────────────────────────────────────────────┤ │ Uptime │ 178960 │ ├─────────────────────┼──────────────────────────────────────────────┤ │ Last Event │ server_down │ ├─────────────────────┼──────────────────────────────────────────────┤ │ Triggered At │ Sun, 04 Feb 2024 07:37:17 GMT │ ├─────────────────────┼──────────────────────────────────────────────┤ │ Services │ Sharded-Service │ ├─────────────────────┼──────────────────────────────────────────────┤ │ Monitors │ Sharding-Monitor │ ├─────────────────────┼──────────────────────────────────────────────┤ ... 
├─────────────────────┼──────────────────────────────────────────────┤ │ Current Connections │ 5 │ ├─────────────────────┼──────────────────────────────────────────────┤ │ Total Connections │ 27 │ ├─────────────────────┼──────────────────────────────────────────────┤ │ Max Connections │ 5 │ shell> maxctrl show service Sharded-Service ┌─────────────────────┬──────────────────────────────────────────────────────┐ │ Service │ Sharded-Service │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Source │ /var/lib/maxscale/maxscale.cnf.d/Sharded-Service.cnf │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Router │ schemarouter │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ State │ Started │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Started At │ 3/18/2024, 1:52:30 PM │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Users Loaded At │ 3/18/2024, 1:52:30 PM │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Current Connections │ 4 │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Total Connections │ 84590 │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Max Connections │ 5 │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Cluster │ │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Servers │ shard2 │ │ │ shard3 │ │ │ shard4 │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Services │ │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Filters │ │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Parameters │ { │ │ │ "auth_all_servers": true, │ │ │ "connection_keepalive": "300000ms", │ │ │ "debug": true, │ │ │ "disable_sescmd_history": false, │ │ │ "enable_root_user": false, │ │ │ "force_connection_keepalive": false, │ │ │ "idle_session_pool_time": "-1ms", │ │ │ "ignore_tables": [], │ │ │ "ignore_tables_regex": null, │ │ │ "localhost_match_wildcard_host": true, │ │ │ "log_auth_warnings": true, │ │ │ "log_debug": true, │ │ │ "log_info": false, │ │ │ "log_notice": false, │ │ │ "log_warning": false, │ │ │ "max_connections": 0, │ │ │ "max_sescmd_history": 50, │ │ │ "max_staleness": "150000ms", │ │ │ "multiplex_timeout": "60000ms", │ │ │ "net_write_timeout": "0ms", │ │ │ "password": "*****", │ │ │ "prune_sescmd_history": true, │ │ │ "rank": "primary", │ │ │ "refresh_databases": false, │ │ │ "refresh_interval": "10000ms", │ │ │ "retain_last_statements": -1, │ │ │ "router": "schemarouter", │ │ │ "session_trace": false, │ │ │ "strip_db_esc": true, │ │ │ "type": "service", │ │ │ "user": "maxscale_admin", │ │ │ "user_accounts_file": null, │ │ │ "user_accounts_file_usage": "add_when_load_ok", │ │ │ "version_string": null, │ │ │ "wait_timeout": "0ms" │ │ │ } │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Router Diagnostics │ { │ │ │ "average_session": 0.028822357131634554, │ │ │ "longest_sescmd_chain": 4, │ │ │ "longest_session": 50, │ │ │ "queries": 761134, │ │ │ "sescmd_percentage": 44.44342257736483, │ │ │ "shard_map_hits": 84356, │ │ │ "shard_map_misses": 5, │ │ │ "shard_map_stale": 229, │ │ │ "shard_map_updates": 216, │ │ │ "shortest_session": 0, │ │ │ "times_sescmd_limit_exceeded": 0 │ │ │ } │ 
└─────────────────────┴──────────────────────────────────────────────────────┘

See also MaxScale SchemaRouter Router diagnostics.

shell> maxctrl show monitor Sharding-Monitor ┌─────────────────────┬──────────────────────────────────────────────────────────┐ │ Monitor │ Sharding-Monitor │ ├─────────────────────┼──────────────────────────────────────────────────────────┤ │ Source │ /etc/maxscale.cnf │ ├─────────────────────┼──────────────────────────────────────────────────────────┤ │ Module │ galeramon │ ├─────────────────────┼──────────────────────────────────────────────────────────┤ │ State │ Running │ ├─────────────────────┼──────────────────────────────────────────────────────────┤ │ Servers │ shard1 │ │ │ shard2 │ │ │ shard3 │ ├─────────────────────┼──────────────────────────────────────────────────────────┤ │ Parameters │ { │ │ │ "available_when_donor": false, │ │ │ "backend_connect_attempts": 1, │ │ │ "backend_connect_timeout": "3000ms", │ │ │ "backend_read_timeout": "3000ms", │ │ │ "backend_write_timeout": "3000ms", │ │ │ "disable_master_failback": false, │ │ │ "disable_master_role_setting": false, │ │ │ "disk_space_check_interval": "0ms", │ │ │ "disk_space_threshold": null, │ │ │ "events": "all,master_down,master_up,...,new_donor", │ │ │ "journal_max_age": "28800000ms", │ │ │ "module": "galeramon", │ │ │ "monitor_interval": "1000ms", │ │ │ "password": "*****", │ │ │ "root_node_as_master": false, │ │ │ "script": null, │ │ │ "script_timeout": "90000ms", │ │ │ "set_donor_nodes": false, │ │ │ "type": "monitor", │ │ │ "use_priority": false, │ │ │ "user": "maxscale_monitor" │ │ │ } │ ├─────────────────────┼──────────────────────────────────────────────────────────┤ │ Monitor Diagnostics │ { │ │ │ "disable_master_failback": false, │ │ │ "disable_master_role_setting": false, │ │ │ "root_node_as_master": false, │ │ │ "server_info": [ │ │ │ { │ │ │ "gtid_binlog_pos": "0-3363-26014", │ │ │ "gtid_current_pos": "0-3363-26014", │ │ │ "master_id": 0, │ │ │ "name": "shard1", │ │ │ "read_only": false, │ │ │ "server_id": 3363 │ │ │ }, │ │ │ { │ │ │ "gtid_binlog_pos": "0-3364-240612", │ │ │ "gtid_current_pos": "0-3364-240612", │ │ │ "master_id": 0, │ │ │ "name": "shard2", │ │ │ "read_only": false, │ │ │ "server_id": 3364 │ │ │ }, │ │ │ { │ │ │ "gtid_binlog_pos": "0-3365-289873", │ │ │ "gtid_current_pos": "0-3365-289873", │ │ │ "master_id": 0, │ │ │ "name": "shard3", │ │ │ "read_only": false, │ │ │ "server_id": 3365 │ │ │ } │ │ │ ], │ │ │ "set_donor_nodes": false, │ │ │ "use_priority": false │ │ │ } │ └─────────────────────┴──────────────────────────────────────────────────────────┘ shell> maxctrl list sessions; ┌───────┬──────┬──────────────┬───────────────────────┬───────┬─────────────────┬────────┬──────────────┐ │ Id │ User │ Host │ Connected │ Idle │ Service │ Memory │ I/O-Activity │ ├───────┼──────┼──────────────┼───────────────────────┼───────┼─────────────────┼────────┼──────────────┤ │ 87240 │ app │ 10.139.158.1 │ 3/18/2024, 2:33:54 PM │ 0 │ Sharded-Service │ 68644 │ 33 │ ├───────┼──────┼──────────────┼───────────────────────┼───────┼─────────────────┼────────┼──────────────┤ │ 72654 │ app │ 10.139.158.1 │ 3/18/2024, 2:25:27 PM │ 506.3 │ Sharded-Service │ 199328 │ 0 │ ├───────┼──────┼──────────────┼───────────────────────┼───────┼─────────────────┼────────┼──────────────┤ │ 72364 │ app │ 10.139.158.1 │ 3/18/2024, 2:25:18 PM │ 516 │ Sharded-Service │ 199328 │ 0 │ ├───────┼──────┼──────────────┼───────────────────────┼───────┼─────────────────┼────────┼──────────────┤ │ 72530 │ app │ 10.139.158.1 │ 3/18/2024, 2:25:23 PM │ 510.5 │ Sharded-Service │ 199328 │ 0 │ 
└───────┴──────┴──────────────┴───────────────────────┴───────┴─────────────────┴────────┴──────────────┘ shell> maxctrl show session 26 ┌───────────────────────┬───────────────────────────────────────┐ │ Id │ 26 │ ├───────────────────────┼───────────────────────────────────────┤ │ Service │ Sharded-Service │ ├───────────────────────┼───────────────────────────────────────┤ │ State │ Session started │ ├───────────────────────┼───────────────────────────────────────┤ │ User │ app │ ├───────────────────────┼───────────────────────────────────────┤ │ Host │ 10.139.158.1 │ ├───────────────────────┼───────────────────────────────────────┤ │ Port │ 42854 │ ├───────────────────────┼───────────────────────────────────────┤ │ Database │ │ ├───────────────────────┼───────────────────────────────────────┤ │ Connected │ 2/4/2024, 9:31:12 AM │ ├───────────────────────┼───────────────────────────────────────┤ │ Idle │ 610.4 │ ├───────────────────────┼───────────────────────────────────────┤ │ Parameters │ { │ │ │ "log_error": false, │ │ │ "log_info": false, │ │ │ "log_notice": false, │ │ │ "log_warning": false │ │ │ } │ ├───────────────────────┼───────────────────────────────────────┤ │ Client TLS Cipher │ │ ├───────────────────────┼───────────────────────────────────────┤ │ Connection attributes │ { │ │ │ "_client_name": "libmariadb", │ │ │ "_client_version": "3.3.8", │ │ │ "_os": "Linux", │ │ │ "_pid": "251037", │ │ │ "_platform": "x86_64", │ │ │ "_server_host": "10.139.158.211", │ │ │ "program_name": "mysql" │ │ │ } │ ├───────────────────────┼───────────────────────────────────────┤ │ Connections │ shard1 │ │ │ shard2 │ │ │ shard3 │ ├───────────────────────┼───────────────────────────────────────┤ │ Connection IDs │ 666 │ │ │ 139 │ │ │ 138 │ ├───────────────────────┼───────────────────────────────────────┤ │ Queries │ │ ├───────────────────────┼───────────────────────────────────────┤ │ Log │ │ ├───────────────────────┼───────────────────────────────────────┤ │ Memory │ { │ │ │ "connection_buffers": { │ │ │ "backends": { │ │ │ "shard1": { │ │ │ "misc": 678, │ │ │ "readq": 65536, │ │ │ "total": 66214, │ │ │ "writeq": 0 │ │ │ }, │ │ │ "shard2": { │ │ │ "misc": 662, │ │ │ "readq": 0, │ │ │ "total": 662, │ │ │ "writeq": 0 │ │ │ }, │ │ │ "shard3": { │ │ │ "misc": 678, │ │ │ "readq": 65536, │ │ │ "total": 66214, │ │ │ "writeq": 0 │ │ │ } │ │ │ }, │ │ │ "client": { │ │ │ "misc": 654, │ │ │ "readq": 65536, │ │ │ "total": 66190, │ │ │ "writeq": 0 │ │ │ }, │ │ │ "total": 199280 │ │ │ }, │ │ │ "exec_metadata": 0, │ │ │ "last_queries": 0, │ │ │ "sescmd_history": 48, │ │ │ "total": 199328, │ │ │ "variables": 0 │ │ │ } │ ├───────────────────────┼───────────────────────────────────────┤ │ I/O Activity │ 0 │ └───────────────────────┴───────────────────────────────────────┘ shell> maxctrl show listener Sharded-Service-Listener ┌────────────┬───────────────────────────────────────────┐ │ Name │ Sharded-Service-Listener │ ├────────────┼───────────────────────────────────────────┤ │ Source │ /etc/maxscale.cnf │ ├────────────┼───────────────────────────────────────────┤ │ Service │ Sharded-Service │ ├────────────┼───────────────────────────────────────────┤ │ Parameters │ { │ │ │ "MariaDBProtocol": { │ │ │ "allow_replication": true │ │ │ }, │ │ │ "address": "::", │ │ │ "authenticator": null, │ │ │ "authenticator_options": null, │ │ │ "connection_init_sql_file": null, │ │ │ "connection_metadata": [ │ │ │ "character_set_client=auto", │ │ │ "character_set_connection=auto", │ │ │ "character_set_results=auto", │ │ │ 
"max_allowed_packet=auto", │ │ │ "system_time_zone=auto", │ │ │ "time_zone=auto", │ │ │ "tx_isolation=auto" │ │ │ ], │ │ │ "port": 3306, │ │ │ "protocol": "MariaDBProtocol", │ │ │ "proxy_protocol_networks": null, │ │ │ "service": "Sharded-Service", │ │ │ "socket": null, │ │ │ "sql_mode": "default", │ │ │ "ssl": false, │ │ │ "ssl_ca": null, │ │ │ "ssl_cert": null, │ │ │ "ssl_cert_verify_depth": 9, │ │ │ "ssl_cipher": null, │ │ │ "ssl_crl": null, │ │ │ "ssl_key": null, │ │ │ "ssl_verify_peer_certificate": false, │ │ │ "ssl_verify_peer_host": false, │ │ │ "ssl_version": "MAX", │ │ │ "type": "listener", │ │ │ "user_mapping_file": null │ │ │ } │ └────────────┴───────────────────────────────────────────┘ shell> maxctrl show module schemarouter ┌─────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │ Module │ schemarouter │ ├─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Type │ Router │ ├─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Version │ V1.0.0 │ ├─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Maturity │ Beta │ ├─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Description │ A database sharding router for simple sharding │ ├─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Parameters │ [ │ │ │ { │ │ │ "default_value": false, │ │ │ "description": "Enable debug mode", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "debug", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": [], │ │ │ "description": "List of tables to ignore when checking for duplicates", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "ignore_tables", │ │ │ "type": "stringlist" │ │ │ }, │ │ │ { │ │ │ "description": "Regex of tables to ignore when checking for duplicates", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "ignore_tables_regex", │ │ │ "type": "regex" │ │ │ }, │ │ │ { │ │ │ "default_value": "150000ms", │ │ │ "description": "Maximum allowed staleness of a database map entry before clients block and wait for an update", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "max_staleness", │ │ │ "type": "duration", │ │ │ "unit": "ms" │ │ │ }, │ │ │ { │ │ │ "default_value": false, │ │ │ "description": "Refresh database mapping information", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "refresh_databases", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": "300000ms", │ │ │ "description": "How often to refresh the database mapping information", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "refresh_interval", │ │ │ "type": "duration", │ │ │ "unit": "ms" │ │ │ }, │ │ │ { │ │ │ "default_value": false, │ │ │ "description": "Retrieve users from all backend servers instead of only one", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "auth_all_servers", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": "300000ms", │ │ │ "description": "How ofted idle connections are pinged", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": 
"connection_keepalive", │ │ │ "type": "duration", │ │ │ "unit": "ms" │ │ │ }, │ │ │ { │ │ │ "deprecated": true, │ │ │ "description": "Alias for 'wait_timeout'", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "connection_timeout", │ │ │ "type": "duration" │ │ │ }, │ │ │ { │ │ │ "default_value": false, │ │ │ "description": "Disable session command history", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "disable_sescmd_history", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": false, │ │ │ "description": "Allow the root user to connect to this service", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "enable_root_user", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": false, │ │ │ "description": "Ping connections unconditionally", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "force_connection_keepalive", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": "-1ms", │ │ │ "description": "Put connections into pool after session has been idle for this long", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "idle_session_pool_time", │ │ │ "type": "duration", │ │ │ "unit": "ms" │ │ │ }, │ │ │ { │ │ │ "default_value": true, │ │ │ "description": "Match localhost to wildcard host", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "localhost_match_wildcard_host", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": true, │ │ │ "description": "Log a warning when client authentication fails", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "log_auth_warnings", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": false, │ │ │ "description": "Log debug messages for this service (debug builds only)", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "log_debug", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": false, │ │ │ "description": "Log info messages for this service", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "log_info", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": false, │ │ │ "description": "Log notice messages for this service", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "log_notice", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": false, │ │ │ "description": "Log warning messages for this service", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "log_warning", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": 0, │ │ │ "description": "Maximum number of connections", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "max_connections", │ │ │ "type": "count" │ │ │ }, │ │ │ { │ │ │ "default_value": 50, │ │ │ "description": "Session command history size", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "max_sescmd_history", │ │ │ "type": "count" │ │ │ }, │ │ │ { │ │ │ "default_value": "60000ms", │ │ │ "description": "How long a session can wait for a connection to become available", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "multiplex_timeout", │ │ │ "type": "duration", │ │ │ "unit": "ms" │ │ │ }, │ │ │ { │ │ │ "default_value": "0ms", │ │ │ "description": "Network write timeout", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "net_write_timeout", │ │ │ "type": "duration", │ │ │ "unit": "ms" │ │ │ }, │ │ │ { │ │ │ "description": "Password for the user used to retrieve database users", │ │ │ "mandatory": true, │ │ │ 
"modifiable": true, │ │ │ "name": "password", │ │ │ "type": "password" │ │ │ }, │ │ │ { │ │ │ "default_value": true, │ │ │ "description": "Prune old session command history if the limit is exceeded", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "prune_sescmd_history", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": "primary", │ │ │ "description": "Service rank", │ │ │ "enum_values": [ │ │ │ "primary", │ │ │ "secondary" │ │ │ ], │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "rank", │ │ │ "type": "enum" │ │ │ }, │ │ │ { │ │ │ "default_value": -1, │ │ │ "description": "Number of statements kept in memory", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "retain_last_statements", │ │ │ "type": "int" │ │ │ }, │ │ │ { │ │ │ "default_value": false, │ │ │ "description": "Enable session tracing for this service", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "session_trace", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": false, │ │ │ "deprecated": true, │ │ │ "description": "Track session state using server responses", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "session_track_trx_state", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": true, │ │ │ "deprecated": true, │ │ │ "description": "Strip escape characters from database names", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "strip_db_esc", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "description": "Username used to retrieve database users", │ │ │ "mandatory": true, │ │ │ "modifiable": true, │ │ │ "name": "user", │ │ │ "type": "string" │ │ │ }, │ │ │ { │ │ │ "description": "Load additional users from a file", │ │ │ "mandatory": false, │ │ │ "modifiable": false, │ │ │ "name": "user_accounts_file", │ │ │ "type": "path" │ │ │ }, │ │ │ { │ │ │ "default_value": "add_when_load_ok", │ │ │ "description": "When and how the user accounts file is used", │ │ │ "enum_values": [ │ │ │ "add_when_load_ok", │ │ │ "file_only_always" │ │ │ ], │ │ │ "mandatory": false, │ │ │ "modifiable": false, │ │ │ "name": "user_accounts_file_usage", │ │ │ "type": "enum" │ │ │ }, │ │ │ { │ │ │ "description": "Custom version string to use", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "version_string", │ │ │ "type": "string" │ │ │ }, │ │ │ { │ │ │ "default_value": "0ms", │ │ │ "description": "Connection idle timeout", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "wait_timeout", │ │ │ "type": "duration", │ │ │ "unit": "ms" │ │ │ } │ │ │ ] │ ├─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Commands │ [ │ │ │ { │ │ │ "attributes": { │ │ │ "arg_max": 1, │ │ │ "arg_min": 1, │ │ │ "description": "Clear schemarouter shard map cache", │ │ │ "method": "POST", │ │ │ "parameters": [ │ │ │ { │ │ │ "description": "The schemarouter service", │ │ │ "required": true, │ │ │ "type": "SERVICE" │ │ │ } │ │ │ ] │ │ │ }, │ │ │ "id": "clear", │ │ │ "links": { │ │ │ "self": "http://127.0.0.1:8989/v1/modules/schemarouter/clear/" │ │ │ }, │ │ │ "type": "module_command" │ │ │ }, │ │ │ { │ │ │ "attributes": { │ │ │ "arg_max": 1, │ │ │ "arg_min": 1, │ │ │ "description": "Invalidate schemarouter shard map cache", │ │ │ "method": "POST", │ │ │ "parameters": [ │ │ │ { │ │ │ "description": "The schemarouter service", │ │ │ "required": true, │ │ │ "type": "SERVICE" │ │ │ } │ │ │ ] │ │ │ }, │ │ │ "id": "invalidate", │ │ │ 
"links": { │ │ │ "self": "http://127.0.0.1:8989/v1/modules/schemarouter/invalidate/" │ │ │ }, │ │ │ "type": "module_command" │ │ │ } │ │ │ ] │ └─────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘ shell> maxctrl show commands schemarouter ┌────────────┬────────────┬──────────────────────────┐ │ Command │ Parameters │ Descriptions │ ├────────────┼────────────┼──────────────────────────┤ │ clear │ SERVICE │ The schemarouter service │ ├────────────┼────────────┼──────────────────────────┤ │ invalidate │ SERVICE │ The schemarouter service │ └────────────┴────────────┴──────────────────────────┘ shell> maxctrl show dbusers Sharded-Service ┌───────────────────────┬────────────────┬───────────────────────┬───────┬───────┬────────┬───────┬──────┐ │ User │ Host │ Plugin │ TLS │ Super │ Global │ Proxy │ Role │ ├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤ │ PUBLIC │ │ │ false │ false │ false │ false │ │ ├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤ │ app │ 10.139.158.% │ mysql_native_password │ false │ false │ false │ false │ │ ├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤ │ app_role │ │ │ false │ false │ false │ false │ │ ├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤ │ mariadb.sys │ localhost │ mysql_native_password │ false │ false │ false │ false │ │ ├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤ │ maxscale_admin │ 10.139.158.210 │ mysql_native_password │ false │ false │ false │ false │ │ ├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤ │ maxscale_admin │ 10.139.158.211 │ mysql_native_password │ false │ false │ false │ false │ │ ├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤ │ maxscale_admin_role │ │ │ false │ false │ false │ false │ │ ├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤ │ maxscale_monitor │ 10.139.158.210 │ mysql_native_password │ false │ false │ false │ false │ │ ├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤ │ maxscale_monitor │ 10.139.158.211 │ mysql_native_password │ false │ false │ false │ false │ │ ├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤ │ maxscale_monitor_role │ │ │ false │ false │ false │ false │ │ ├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤ │ mysql │ localhost │ mysql_native_password │ false │ true │ true │ false │ │ ├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤ │ root │ localhost │ mysql_native_password │ false │ true │ true │ false │ │ └───────────────────────┴────────────────┴───────────────────────┴───────┴───────┴────────┴───────┴──────┘ shell> maxctrl show commands mariadbmon ┌───────────────────────────┬─────────────────────────────┬───────────────────────────────────────────────────────────────────────────────┐ │ Command │ Parameters │ Descriptions │ 
├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ switchover │ MONITOR, [SERVER], [SERVER] │ Monitor name, New primary (optional), Current primary (optional) │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ switchover-force │ MONITOR, [SERVER], [SERVER] │ Monitor name, New primary (optional), Current primary (optional) │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-switchover │ MONITOR, [SERVER], [SERVER] │ Monitor name, New primary (optional), Current primary (optional) │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ failover │ MONITOR │ Monitor name │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-failover │ MONITOR │ Monitor name │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ rejoin │ MONITOR, SERVER │ Monitor name, Joining server │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-rejoin │ MONITOR, SERVER │ Monitor name, Joining server │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ reset-replication │ MONITOR, [SERVER] │ Monitor name, Primary server (optional) │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-reset-replication │ MONITOR, [SERVER] │ Monitor name, Primary server (optional) │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ release-locks │ MONITOR │ Monitor name │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-release-locks │ MONITOR │ Monitor name │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ fetch-cmd-result │ MONITOR │ Monitor name │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ cancel-cmd │ MONITOR │ Monitor name │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-cs-add-node │ MONITOR, STRING, STRING │ Monitor name, Hostname/IP of node to add to ColumnStore cluster, Timeout │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-cs-remove-node │ MONITOR, STRING, STRING │ Monitor name, Hostname/IP of node to remove from ColumnStore cluster, Timeout │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ cs-get-status │ MONITOR │ Monitor name │ 
├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-cs-get-status │ MONITOR │ Monitor name │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-cs-start-cluster │ MONITOR, STRING │ Monitor name, Timeout │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-cs-stop-cluster │ MONITOR, STRING │ Monitor name, Timeout │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-cs-set-readonly │ MONITOR, STRING │ Monitor name, Timeout │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-cs-set-readwrite │ MONITOR, STRING │ Monitor name, Timeout │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-rebuild-server │ MONITOR, SERVER, [SERVER] │ Monitor name, Target server, Source server (optional) │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-create-backup │ MONITOR, SERVER, STRING │ Monitor name, Source server, Backup name │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-restore-from-backup │ MONITOR, SERVER, STRING │ Monitor name, Target server, Backup name │ └───────────────────────────┴─────────────────────────────┴───────────────────────────────────────────────────────────────────────────────┘
Taxonomy upgrade extras: sharding, maxscale, schemarouter, load balancer, multi-tenant

Sharding with MariaDB MaxScale

Shinguz - Tue, 2024-03-19 17:02
Overview

This feature should more or less work with MariaDB MaxScale 6.x.y, 22.08.x, 23.02.x, 23.08.x and 24.02.x. We have tested it with the latest MaxScale version 23.08.5, as we encountered problems with an older version (MXS-5026).

shell> maxscale --version
MaxScale 23.08.5

We used MariaDB 10.11 as the database backend (shards).

Less than approximately 2% of all MariaDB installations known to us are what we technically understand as multi-tenant systems: each customer in its own database (also called a schema).

This MariaDB MaxScale feature is therefore used relatively rarely, and there is an increased risk of encountering bugs that no one has come across before!

This feature is called SchemaRouter in MariaDB MaxScale and is still declared as beta quality (MXS-5025):

maxctrl> show module schemarouter
┌─────────────┬────────────────────────────────────────────────┐
│ Module      │ schemarouter                                   │
├─────────────┼────────────────────────────────────────────────┤
│ Type        │ Router                                         │
├─────────────┼────────────────────────────────────────────────┤
│ Version     │ V1.0.0                                         │
├─────────────┼────────────────────────────────────────────────┤
│ Maturity    │ Beta                                           │
├─────────────┼────────────────────────────────────────────────┤
│ Description │ A database sharding router for simple sharding │
├─────────────┼────────────────────────────────────────────────┤
│ ...

The target topology should look like this: each customer (client, tenant) is located in their own database (= schema). The databases are distributed across several MariaDB instances (shards). So that the application can access the database transparently, a pair of MaxScale load balancers is placed in front of the shards; it knows where each customer is located and forwards the traffic to the corresponding shard. To make the MaxScale load balancers highly available, a virtual IP (VIP) is placed upstream, e.g. using Keepalived. If this is still too simple for you, you can design each individual shard as a master/slave or Galera cluster construct...
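For the VIP, a minimal Keepalived sketch could look as follows. Interface name, router ID, priority and the VIP itself are example values and must be adapted to your environment:

#
# /etc/keepalived/keepalived.conf
#

vrrp_instance VI_MAXSCALE {
    state MASTER                 # BACKUP on the second MaxScale node
    interface eth0
    virtual_router_id 42
    priority 100                 # use a lower priority on the BACKUP node
    advert_int 1
    virtual_ipaddress {
        10.139.158.200/24        # example VIP in front of both MaxScale nodes
    }
}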


Preparation of the shards (MariaDB database instances)

The first problem we encountered in this PoC was with the test database. Deleting the test database on all shards made the problem disappear. Alternatively, you can run mariadb-secure-installation, which you should do on production systems anyway, or you can use the MaxScale configuration parameters ignore_tables or ignore_tables_regex to allow tables of the same name in different shards (MXS-5027).

See also: MaxScale Router Parameters.
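Concretely, the cleanup could for example look like this. The first variant is run on EACH shard; the regex in the second variant is only a sketch, and the exact matching semantics should be checked against the MaxScale documentation:

SQL> DROP DATABASE IF EXISTS test;

or, in the service section of the MaxScale configuration:

[Sharded-Service]
ignore_tables_regex = ^test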

Create test data

So that we have something to play with, we created the following test data:

-- On shard 1: 2 customers

SQL> CREATE DATABASE customer_0010;
SQL> CREATE TABLE customer_0010.address (id INT UNSIGNED, name VARCHAR(255));
SQL> CREATE TABLE customer_0010.sales (id INT UNSIGNED, product VARCHAR(255), sales TINYINT, amount DECIMAL(6, 2), total_amount DECIMAL(6, 2));
SQL> INSERT INTO customer_0010.address VALUES (1, 'Customer 10 GmbH');
SQL> INSERT INTO customer_0010.sales VALUES (1, 'Apples', 5, 1.2, 6), (2, 'Pears', 2, 0.9, 1.8), (3, 'Bread', 1, 2.5, 2.5);

SQL> CREATE DATABASE customer_0011;
SQL> CREATE TABLE customer_0011.address (id INT UNSIGNED, name VARCHAR(255));
SQL> CREATE TABLE customer_0011.sales (id INT UNSIGNED, product VARCHAR(255), sales TINYINT, amount DECIMAL(6, 2), total_amount DECIMAL(6, 2));
SQL> INSERT INTO customer_0011.address VALUES (1, 'Customer 11 SE');
SQL> INSERT INTO customer_0011.sales VALUES (1, 'Oranges', 2, 1.7, 3.4), (2, 'Salad', 5, 1.2, 6);

-- On shard 2: 3 customers

SQL> CREATE DATABASE customer_0020;
SQL> CREATE TABLE customer_0020.address (id INT UNSIGNED, name VARCHAR(255));
SQL> CREATE TABLE customer_0020.sales (id INT UNSIGNED, product VARCHAR(255), sales TINYINT, amount DECIMAL(6, 2), total_amount DECIMAL(6, 2));
SQL> INSERT INTO customer_0020.address VALUES (1, 'Customer 20 AG');
SQL> INSERT INTO customer_0020.sales VALUES (1, 'Oranges', 2, 1.7, 3.4), (2, 'Salad', 5, 1.2, 6);

SQL> CREATE DATABASE customer_0021;
SQL> CREATE TABLE customer_0021.address (id INT UNSIGNED, name VARCHAR(255));
SQL> CREATE TABLE customer_0021.sales (id INT UNSIGNED, product VARCHAR(255), sales TINYINT, amount DECIMAL(6, 2), total_amount DECIMAL(6, 2));
SQL> INSERT INTO customer_0021.address VALUES (1, 'Customer 21 GmbH');
SQL> INSERT INTO customer_0021.sales VALUES (1, 'Oranges', 2, 1.7, 3.4), (2, 'Salad', 5, 1.2, 6);

SQL> CREATE DATABASE customer_0022;
SQL> CREATE TABLE customer_0022.address (id INT UNSIGNED, name VARCHAR(255));
SQL> CREATE TABLE customer_0022.sales (id INT UNSIGNED, product VARCHAR(255), sales TINYINT, amount DECIMAL(6, 2), total_amount DECIMAL(6, 2));
SQL> INSERT INTO customer_0022.address VALUES (1, 'Customer 22 Gebr.');
SQL> INSERT INTO customer_0022.sales VALUES (1, 'Oranges', 2, 1.7, 3.4), (2, 'Salad', 5, 1.2, 6);

-- On shard 3: 1 customer

SQL> CREATE DATABASE customer_0030;
SQL> CREATE TABLE customer_0030.address (id INT UNSIGNED, name VARCHAR(255));
SQL> CREATE TABLE customer_0030.sales (id INT UNSIGNED, product VARCHAR(255), sales TINYINT, amount DECIMAL(6, 2), total_amount DECIMAL(6, 2));
SQL> INSERT INTO customer_0030.address VALUES (1, 'Customer 30 GmbH');
SQL> INSERT INTO customer_0030.sales VALUES (1, 'Pickles', 2, 2.2, 4.4), (2, 'Salad', 1, 3.1, 3.1), (3, 'Pudding', 5, 2.2, 11.0), (4, 'Asparagus', 12, .3, 3.6);

Create roles and users

Since, in a sharded system, in contrast to a Galera cluster for example, the individual database instances know nothing about each other and do not communicate with each other, we have to create the roles and users (accounts) individually on EACH shard.

MariaDB MaxScale needs a user for the SchemaRouter service and the monitor (on each shard).

As the name suggests, the monitor user is responsible for monitoring and the SchemaRouter service user is responsible for collecting the user account information from the sharding backends and forwarding the queries to the correct shard.

Since a redundant system typically works with at least two MaxScale routers and we wanted to prevent the privileges of the accounts from diverging, we work with roles for both the MaxScale users and the application users.
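Since the following statements have to be executed on EVERY shard, it can be worthwhile to script this. A simple sketch, assuming the statements are collected in a (hypothetical) file create_accounts.sql:

shell> for port in 3363 3364 3365; do
           mariadb --user=root --host=10.139.158.1 --port=${port} < create_accounts.sql
       done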

MaxScale Monitor User

SQL> CREATE ROLE maxscale_monitor_role;
SQL> GRANT SELECT ON mysql.user TO 'maxscale_monitor_role';
SQL> GRANT REPLICATION CLIENT ON *.* TO 'maxscale_monitor_role';
SQL> GRANT SLAVE MONITOR ON *.* TO 'maxscale_monitor_role';
SQL> GRANT FILE ON *.* TO 'maxscale_monitor_role';
SQL> GRANT CONNECTION ADMIN ON *.* TO 'maxscale_monitor_role';

SQL> SHOW GRANTS FOR maxscale_monitor_role;
+-----------------------------------------------------------------------------------------------+
| Grants for maxscale_monitor_role                                                               |
+-----------------------------------------------------------------------------------------------+
| GRANT FILE, BINLOG MONITOR, CONNECTION ADMIN, SLAVE MONITOR ON *.* TO `maxscale_monitor_role`  |
| GRANT SELECT ON `mysql`.`user` TO `maxscale_monitor_role`                                      |
+-----------------------------------------------------------------------------------------------+

SQL> CREATE USER maxscale_monitor@'10.139.158.210' IDENTIFIED BY 'secret';
SQL> CREATE USER maxscale_monitor@'10.139.158.211' IDENTIFIED BY 'secret';
SQL> GRANT maxscale_monitor_role TO maxscale_monitor@'10.139.158.210';
SQL> GRANT maxscale_monitor_role TO maxscale_monitor@'10.139.158.211';
SQL> SET DEFAULT ROLE maxscale_monitor_role FOR maxscale_monitor@'10.139.158.210';
SQL> SET DEFAULT ROLE maxscale_monitor_role FOR maxscale_monitor@'10.139.158.211';

SQL> SELECT user, host, is_role, default_role FROM mysql.user WHERE user LIKE 'maxscale_monitor%';
+-----------------------+----------------+---------+-----------------------+
| User                  | Host           | is_role | default_role          |
+-----------------------+----------------+---------+-----------------------+
| maxscale_monitor_role |                | Y       |                       |
| maxscale_monitor      | 10.139.158.210 | N       | maxscale_monitor_role |
| maxscale_monitor      | 10.139.158.211 | N       | maxscale_monitor_role |
+-----------------------+----------------+---------+-----------------------+

SQL> SHOW GRANTS FOR maxscale_monitor@'10.139.158.211';
+------------------------------------------------------------------------------------------------------------------------------+
| Grants for maxscale_monitor@10.139.158.211                                                                                    |
+------------------------------------------------------------------------------------------------------------------------------+
| GRANT `maxscale_monitor_role` TO `maxscale_monitor`@`10.139.158.211`                                                          |
| GRANT USAGE ON *.* TO `maxscale_monitor`@`10.139.158.211` IDENTIFIED BY PASSWORD '*14E65567ABDB5135D0CFD9A70B3032C179A49EE7'  |
| SET DEFAULT ROLE `maxscale_monitor_role` FOR `maxscale_monitor`@`10.139.158.211`                                              |
+------------------------------------------------------------------------------------------------------------------------------+

MaxScale Admin User

SQL> CREATE ROLE maxscale_admin_role;
SQL> GRANT SHOW DATABASES ON *.* TO 'maxscale_admin_role';
SQL> GRANT SELECT ON mysql.user TO 'maxscale_admin_role';
SQL> GRANT SELECT ON mysql.db TO 'maxscale_admin_role';
SQL> GRANT SELECT ON mysql.tables_priv TO 'maxscale_admin_role';
SQL> GRANT SELECT ON mysql.columns_priv TO 'maxscale_admin_role';
SQL> GRANT SELECT ON mysql.proxies_priv TO 'maxscale_admin_role';
SQL> GRANT SELECT ON mysql.roles_mapping TO 'maxscale_admin_role';
SQL> GRANT SELECT ON mysql.procs_priv TO 'maxscale_admin_role';

SQL> SHOW GRANTS FOR maxscale_admin_role;
+------------------------------------------------------------------+
| Grants for maxscale_admin_role                                   |
+------------------------------------------------------------------+
| GRANT USAGE ON *.* TO `maxscale_admin_role`                      |
| GRANT SELECT ON `mysql`.`user` TO `maxscale_admin_role`          |
| GRANT SELECT ON `mysql`.`roles_mapping` TO `maxscale_admin_role` |
| GRANT SELECT ON `mysql`.`tables_priv` TO `maxscale_admin_role`   |
| GRANT SELECT ON `mysql`.`procs_priv` TO `maxscale_admin_role`    |
| GRANT SELECT ON `mysql`.`db` TO `maxscale_admin_role`            |
| GRANT SELECT ON `mysql`.`columns_priv` TO `maxscale_admin_role`  |
| GRANT SELECT ON `mysql`.`proxies_priv` TO `maxscale_admin_role`  |
+------------------------------------------------------------------+

SQL> CREATE USER maxscale_admin@'10.139.158.210' IDENTIFIED BY 'secret';
SQL> CREATE USER maxscale_admin@'10.139.158.211' IDENTIFIED BY 'secret';
SQL> GRANT maxscale_admin_role TO maxscale_admin@'10.139.158.210';
SQL> GRANT maxscale_admin_role TO maxscale_admin@'10.139.158.211';
SQL> SET DEFAULT ROLE maxscale_admin_role FOR maxscale_admin@'10.139.158.210';
SQL> SET DEFAULT ROLE maxscale_admin_role FOR maxscale_admin@'10.139.158.211';

SQL> SELECT user, host, is_role, default_role FROM mysql.user WHERE user LIKE 'maxscale_admin%';
+---------------------+----------------+---------+---------------------+
| User                | Host           | is_role | default_role        |
+---------------------+----------------+---------+---------------------+
| maxscale_admin_role |                | Y       |                     |
| maxscale_admin      | 10.139.158.210 | N       | maxscale_admin_role |
| maxscale_admin      | 10.139.158.211 | N       | maxscale_admin_role |
+---------------------+----------------+---------+---------------------+

SQL> SHOW GRANTS FOR maxscale_admin@'10.139.158.211';
+----------------------------------------------------------------------------------------------------------------------------+
| Grants for maxscale_admin@10.139.158.211                                                                                    |
+----------------------------------------------------------------------------------------------------------------------------+
| GRANT `maxscale_admin_role` TO `maxscale_admin`@`10.139.158.211`                                                            |
| GRANT USAGE ON *.* TO `maxscale_admin`@`10.139.158.211` IDENTIFIED BY PASSWORD '*14E65567ABDB5135D0CFD9A70B3032C179A49EE7'  |
| SET DEFAULT ROLE `maxscale_admin_role` FOR `maxscale_admin`@`10.139.158.211`                                                |
+----------------------------------------------------------------------------------------------------------------------------+

See also: SchemaRouter Configuration

Create application role and accounts

The application also requires a user, which we again create on EACH shard as follows:

SQL> CREATE ROLE app_role;
SQL> GRANT SELECT, INSERT, UPDATE, DELETE ON `customer_%`.* TO 'app_role';
SQL> GRANT SHOW DATABASES ON *.* TO 'app_role';
SQL> GRANT CREATE, DROP, ALTER ON *.* TO 'app_role';   -- For creating new tenant databases

SQL> SHOW GRANTS FOR app_role;
+----------------------------------------------------------------------+
| Grants for app_role                                                  |
+----------------------------------------------------------------------+
| GRANT SHOW DATABASES ON *.* TO `app_role`                            |
| GRANT SELECT, INSERT, UPDATE, DELETE ON `customer_%`.* TO `app_role` |
+----------------------------------------------------------------------+

SQL> CREATE USER app@'10.139.158.%' IDENTIFIED BY 'secret';
SQL> GRANT app_role TO app@'10.139.158.%';
SQL> SET DEFAULT ROLE app_role FOR app@'10.139.158.%';

SQL> SELECT user, host, is_role, default_role FROM mysql.user WHERE user LIKE 'app%';
+----------+--------------+---------+--------------+
| User     | Host         | is_role | default_role |
+----------+--------------+---------+--------------+
| app_role |              | Y       |              |
| app      | 10.139.158.% | N       | app_role     |
+----------+--------------+---------+--------------+

SQL> SHOW GRANTS FOR app@'10.139.158.%';
+---------------------------------------------------------------------------------------------------------------+
| Grants for app@10.139.158.%                                                                                    |
+---------------------------------------------------------------------------------------------------------------+
| GRANT `app_role` TO `app`@`10.139.158.%`                                                                       |
| GRANT USAGE ON *.* TO `app`@`10.139.158.%` IDENTIFIED BY PASSWORD '*14E65567ABDB5135D0CFD9A70B3032C179A49EE7'  |
| SET DEFAULT ROLE `app_role` FOR `app`@`10.139.158.%`                                                           |
+---------------------------------------------------------------------------------------------------------------+
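To quickly verify directly on a shard that the default role is actually active for the application account, something like the following can be run (here against shard1; the query should report app_role):

shell> mariadb --user=app --password=secret --host=10.139.158.1 --port=3363 --execute='SELECT CURRENT_USER(), CURRENT_ROLE()'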
Proxy protocol

Load balancers and proxies have the property that they replace the IP addresses of the clients with their own IP address. On the one hand, this means that you can no longer see in the database where the client originally came from; on the other hand, you can no longer tie access privileges to specific user/IP combinations, as it is always the IP of the load balancer that is checked.

These two problems can be solved using the proxy protocol.

To do this, both the database and the load balancer, in this case MaxScale, must have the proxy protocol activated.

On the database side, the proxy protocol is activated as follows:

#
# my.cnf
#

[mariadbd]
proxy_protocol_networks = ::1, 10.139.158.0/24, localhost

and on the MaxScale side with:

#
# /etc/maxscale.cnf
#

[shard]
type           = server
proxy_protocol = true

You can check the two settings with:

SQL> SHOW GLOBAL VARIABLES LIKE 'proxy%';
+-------------------------+---------------------------------+
| Variable_name           | Value                           |
+-------------------------+---------------------------------+
| proxy_protocol_networks | ::1, 10.139.158.0/24, localhost |
+-------------------------+---------------------------------+

shell> maxctrl show server shard1 | grep proxy
│              │ "proxy_protocol": true, │
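Whether the client IP actually arrives unchanged at the database can then be checked via the processlist, for example like this (the host shown naturally depends on where your client connects from):

SQL> SELECT user, host FROM information_schema.PROCESSLIST WHERE user = 'app';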

MaxScale SchemaRouter configuration

Next, we prepare the MaxScale configuration for sharding. The file recommended by MariaDB is /etc/maxscale.cnf. Whether it makes more sense to create a separate configuration file under /etc/maxscale.cnf.d/, or even to configure all of MaxScale dynamically (/var/lib/maxscale/maxscale.cnf.d/*.cnf), remains to be seen in the long term. See also the warnings below. The configuration file for this sharding PoC looks like this:

#
# /etc/maxscale.cnf
#

[maxscale]
threads   = auto
admin_gui = false

[shard1]
type           = server
address        = 10.139.158.1
port           = 3363
proxy_protocol = true

[shard2]
type           = server
address        = 10.139.158.1
port           = 3364
proxy_protocol = true

[shard3]
type           = server
address        = 10.139.158.1
port           = 3365
proxy_protocol = true

[sharding monitor]
type             = monitor
module           = galeramon
servers          = shard1,shard2,shard3
user             = maxscale_monitor
password         = secret
monitor_interval = 1s

[Sharded-Service-Listener]
type     = listener
service  = Sharded-Service
protocol = MariaDBClient
port     = 3306

[Sharded-Service]
type             = service
router           = schemarouter
servers          = shard1,shard2,shard3
user             = maxscale_admin
password         = secret
auth_all_servers = true

Note: Recommendation of the MaxScale developer: "One workaround might be to actually use galeramon to monitor the nodes instead of mariadbmon."

Starting and stopping the MaxScale Load Balancer

MaxScale is started and stopped as usual via SystemD:

shell> systemctl restart maxscale
shell> systemctl status maxscale
● maxscale.service - MariaDB MaxScale Database Proxy
     Loaded: loaded (/lib/systemd/system/maxscale.service; enabled; vendor preset: enabled)
    Drop-In: /run/systemd/system/service.d
             └─zzz-lxc-service.conf
     Active: active (running) since Tue 2024-02-27 09:52:57 UTC; 39s ago
    Process: 187 ExecStart=/usr/bin/maxscale (code=exited, status=0/SUCCESS)
   Main PID: 188 (maxscale)
      Tasks: 10 (limit: 18663)
     Memory: 4.6M
        CPU: 150ms
     CGroup: /system.slice/maxscale.service
             └─188 /usr/bin/maxscale

systemd[1]: Starting MariaDB MaxScale Database Proxy...
maxscale[188]: Module 'galeramon' loaded from '/usr/lib/x86_64-linux-gnu/maxscale/libgaleramon.so'.
maxscale[188]: Module 'schemarouter' loaded from '/usr/lib/x86_64-linux-gnu/maxscale/libschemarouter.so'.
maxscale[188]: Using up to 2.3GiB of memory for query classifier cache
systemd[1]: Started MariaDB MaxScale Database Proxy.

If there were errors or warnings, you can see them in the MaxScale error log:

shell> grep -v notice /var/log/maxscale/maxscale.log
2024-02-13 16:47:22   MariaDB MaxScale is shut down.
----------------------------------------------------
MariaDB MaxScale  /var/log/maxscale/maxscale.log  Tue Feb 13 16:47:22 2024
----------------------------------------------------------------------------
2024-02-27 09:52:56   warning: Discarding journal file '/var/lib/maxscale/Sharding-Monitor_journal.json'. File is for module 'mariadbmon'. Current module is 'galeramon'.
2024-02-27 09:52:56   warning: [galeramon] Invalid 'wsrep_local_index' on server 'shard1': 18446744073709551615

Application tests

Simple application tests

shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 --execute='show databases'
+--------------------+
| Database           |
+--------------------+
| customer_0010      |
| customer_0011      |
| customer_0020      |
| customer_0021      |
| customer_0022      |
| customer_0030      |
| information_schema |
| mysql              |
| performance_schema |
| sys                |
+--------------------+

New command show shards

shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 customer_0030 --execute='show shards' | grep customer_00.* | sort | column -t
customer_0010.address  shard1
customer_0010.sales    shard1
customer_0010.         shard1
customer_0011.address  shard1
customer_0011.sales    shard1
customer_0011.         shard1
customer_0020.address  shard2
customer_0020.sales    shard2
customer_0020.         shard2
customer_0021.address  shard2
customer_0021.sales    shard2
customer_0021.         shard2
customer_0022.address  shard2
customer_0022.sales    shard2
customer_0022.         shard2
customer_0030.address  shard3
customer_0030.sales    shard3
customer_0030.         shard3

New databases are not displayed immediately, but only after the cached data has been refreshed (refresh_interval, default 300 s = 5 min).

See also: Custom SQL commands

More general test

As a reminder:

Shard  Port  Customer         State
#1     3363  customer_001<n>  Running
#2     3364  customer_002<n>  Running
#3     3365  customer_003<n>  Running
#4     3366  customer_004<n>  Running

shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 --execute='SELECT @@port'
+--------+
| @@port |
+--------+
|   3363 |
+--------+

shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 --database=customer_0010 --execute='SELECT @@port'
+--------+
| @@port |
+--------+
|   3363 |
+--------+

shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 --database=customer_0020 --execute='SELECT @@port'
+--------+
| @@port |
+--------+
|   3364 |
+--------+

shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 --execute='use customer_0020; SELECT @@port'
+--------+
| @@port |
+--------+
|   3364 |
+--------+

shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 customer_0010 --execute='SELECT @@port'
+--------+
| @@port |
+--------+
|   3363 |
+--------+

shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 customer_0020 --execute='SELECT @@port'
+--------+
| @@port |
+--------+
|   3364 |
+--------+

shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 customer_0030 --execute='SELECT @@port'
+--------+
| @@port |
+--------+
|   3365 |
+--------+

Less simple (backup) test

shell> mariadb-dump --user=app --password=secret --host=10.139.158.211 --port=3306 --single-transaction customer_0010 > /tmp/customer_0010.sql
shell> echo $?
0
shell> mariadb-dump --user=app --password=secret --host=10.139.158.211 --port=3306 --single-transaction customer_0020 > /tmp/customer_0020.sql
shell> echo $?
0
shell> mariadb-dump --user=app --password=secret --host=10.139.158.211 --port=3306 --single-transaction customer_0030 > /tmp/customer_0030.sql
shell> echo $?
0
shell> mariadb-dump --user=app --password=secret --host=10.139.158.211 --port=3306 --single-transaction --databases customer_0011 > /tmp/customer_0011.sql
shell> echo $?
0
shell> mariadb-dump --user=app --password=secret --host=10.139.158.211 --port=3306 --single-transaction --databases customer_0021 > /tmp/customer_0021.sql
shell> echo $?
0
shell> mariadb-dump --user=app --password=secret --host=10.139.158.211 --port=3306 --single-transaction --databases customer_0030 > /tmp/customer_0030.sql
shell> echo $?
0

shell> ll /tmp/customer_00*sql
-rw-rw-r-- 1 oli oli 2738 Mar 18 12:07 /tmp/customer_0010.sql
-rw-rw-r-- 1 oli oli 2904 Mar 18 12:08 /tmp/customer_0011.sql
-rw-rw-r-- 1 oli oli 2712 Mar 18 12:08 /tmp/customer_0020.sql
-rw-rw-r-- 1 oli oli 2906 Mar 18 12:08 /tmp/customer_0021.sql
-rw-rw-r-- 1 oli oli 2964 Mar 18 12:08 /tmp/customer_0030.sql

shell> tail -n 1 /tmp/customer_*.sql
==> /tmp/customer_0010.sql <==
-- Dump completed on 2024-02-13 14:39:21

==> /tmp/customer_0011.sql <==
-- Dump completed on 2024-02-13 14:39:35

==> /tmp/customer_0020.sql <==
-- Dump completed on 2024-02-13 14:40:15

==> /tmp/customer_0021.sql <==
-- Dump completed on 2024-02-13 14:40:42

==> /tmp/customer_0030.sql <==
-- Dump completed on 2024-02-13 14:40:52

shell> cat /tmp/customer_00*sql | grep -A1 -i insert
INSERT INTO `address` VALUES (1,'Customer 10 GmbH');
--
INSERT INTO `sales` VALUES (1,'Apples',5,1.20,6.00),
--
INSERT INTO `address` VALUES (1,'Customer 11 SE');
--
INSERT INTO `sales` VALUES (1,'Oranges',2,1.70,3.40),
--
INSERT INTO `address` VALUES (1,'Customer 20 AG');
--
INSERT INTO `sales` VALUES (1,'Oranges',2,1.70,3.40),
--
INSERT INTO `address` VALUES (1,'Customer 21 GmbH');
--
INSERT INTO `sales` VALUES (1,'Oranges',2,1.70,3.40),
--
INSERT INTO `address` VALUES (1,'Customer 30 GmbH');
--
INSERT INTO `sales` VALUES (1,'Pickles',2,2.20,4.40),
--

In MaxScale 23.08.4 there was a pretty bad bug: A return value of 0 but no data in the backup!!! See also the tickets: MXS-4966: mariadb-dump gets an error dumping schemas and MXS-4947: Tables in information_schema are treated as a normal tables. Symptoms of the bug look like this:

Error: Couldn't read status information for table address ()
Error: Couldn't read status information for table sales ()

We therefore strongly recommend upgrading to MaxScale 23.08.5!
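
Which version is actually running can be checked quickly before and after the upgrade (the exact output format may vary by version):

shell> maxscale --version
shell> maxctrl --version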

More complex application tests

We have created a somewhat more complex test (./sharding_test.php) that processes the following queries:

SET NAMES utf8mb4
SHOW DATABASES
use customer_
START TRANSACTION;
SELECT MIN(id) AS first, MAX(id) AS last FROM `sales`
INSERT INTO sales (id, product, sales, amount, total_amount) VALUES (%d, '%s', %f, %f, %f)
UPDATE sales SET product = 'Prepare to delete' WHERE id = %d
DELETE FROM sales WHERE id = %d
COMMIT

This test ran flawlessly. The corresponding control query:

SQL> SELECT * FROM customer_0021.sales WHERE id >= (SELECT MAX(id) - 10 FROM customer_0021.sales);

Various load scenarios can also be tested with db_bench or the Acronis perfkit. For more information, see here.

Cross-shard tests

In any case, sooner or later you will come up with the idea of running cross-shard queries. This will NOT work, which should not really be surprising: firstly, because it is not easy to implement, and secondly, because it is documented:

"Note: As the sharding solution in MaxScale is relatively simple, cross-database queries between two or more shards are not supported."

Source: Simple Sharding with Two Servers

and

"USE db1 is routed to the server with db1. If the database is divided to multiple servers, only one server will get the command."

Source: SchemaRouter.

Here is a test with UNION:

SQL> use customer_0030
Database changed
SQL> SELECT * FROM customer_0020.sales UNION SELECT * FROM customer_0030.sales;
ERROR 1146 (42S02): Table 'customer_0020.sales' doesn't exist

And here the cross-check in the other direction:

SQL> use customer_0020
Database changed
SQL> SELECT * FROM customer_0020.sales UNION SELECT * FROM customer_0030.sales;
ERROR 1146 (42S02): Table 'customer_0030.sales' doesn't exist

And here is the test with JOIN:

SQL> use customer_0020
SQL> SELECT * FROM customer_0020.sales a JOIN customer_0030.sales b ON a.id = b.id WHERE a.sales > 1;
ERROR 1146 (42S02): Table 'customer_0030.sales' doesn't exist

SQL> use customer_0030
SQL> SELECT * FROM customer_0020.sales a JOIN customer_0030.sales b ON a.id = b.id WHERE a.sales > 1;
ERROR 1146 (42S02): Table 'customer_0020.sales' doesn't exist

Operation of a MaxScale sharding system

In this chapter we discuss some points that can be useful for the operation of a MariaDB MaxScale sharding system.

Do-on-all-shards

Since O/S or database operations will occasionally have to be executed on all shards, it certainly makes sense to create a script that executes the same command on all shards in turn:

shell> ./do-on-all-shards.sh --sql='SHOW DATABASES'

A script of this type should greatly reduce the error rate during operation. Operations such as the re-sharding of a tenant, as described below, are also sensibly scripted and executed centrally (MXS-5029).
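
A minimal sketch of such a script, using the shard host, ports and credentials of this PoC (do-on-all-shards.sh itself is our hypothetical helper, not a MaxScale tool):

#!/bin/bash
#
# do-on-all-shards.sh -- execute the same SQL command on all shards in turn.

SQL="${1#--sql=}"     # called as: ./do-on-all-shards.sh --sql='SHOW DATABASES'
HOST='10.139.158.1'

for PORT in 3363 3364 3365 ; do
  echo "### Shard ${HOST}:${PORT}"
  mariadb --user=app --password=secret --host="${HOST}" --port="${PORT}" \
          --execute="${SQL}" || { echo "ERROR on shard ${HOST}:${PORT}" ; exit 1 ; }
done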

Invalidating the database map cache

The invalidate command can be used to invalidate the database map cache of the MariaDB MaxScale SchemaRouter. This allows us to quickly update the cache after adding or removing tenants.

shell> maxctrl call command schemarouter invalidate Sharded-Service
OK

In contrast to the invalidate command, which updates the entries after the next refresh_interval, the clear command deletes the entries and a remap is executed immediately.

If you want to invalidate the database map cache remotely with a REST API call, you can do this as follows:

shell> curl -i -X POST -u api_admin:secret http://10.139.158.211:8989/v1/maxscale/modules/schemarouter/clear?Sharded-Service
HTTP/1.1 204 No Content
Connection: close
Date: Mon, 18 Mar 24 11:49:58 GMT
X-Frame-Options: Deny
X-XSS-Protection: 1
Referrer-Policy: same-origin
Cache-Control: no-cache
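
The gentler invalidate module command can be triggered via the REST API in the same way (path analogous to the clear call above):

shell> curl -i -X POST -u api_admin:secret http://10.139.158.211:8989/v1/maxscale/modules/schemarouter/invalidate?Sharded-Service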

How to change SchemaRouter variables dynamically?

The refresh_interval specifies the lifetime of the entries in the SchemaRouter database map cache. The default value is 300 s (5 min). Refresh interval is therefore, in my opinion, an unfortunate term, as it defines not the interval between two mappings but the lifetime of the cache entries (lifetime? timeout?). As soon as an entry has been deleted, a new refresh of the "database map" is triggered on each shard. The command currently looks like this:

SELECT LOWER(t.table_schema), LOWER(t.table_name)
  FROM information_schema.tables t
 UNION ALL
SELECT LOWER(s.schema_name), ''
  FROM information_schema.schemata s

It looks like a simple connect is enough to trigger the refresh of the database map.
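
A forced remap can therefore be scripted as an invalidate followed by a short connect, a sketch using the values of this setup:

shell> maxctrl call command schemarouter invalidate Sharded-Service
shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 --execute='SELECT 1' > /dev/null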

The current value for refresh_interval can be queried as follows:

shell> maxctrl show service Sharded-Service | grep refresh_interval | awk -F'│' '{ print $3 }'
    "refresh_interval": "300000ms",

The following command helps to change the value dynamically:

shell> MAXCTRL_WARNINGS=0 maxctrl alter service Sharded-Service refresh_interval=10s
OK

The value should not be set too small, as all other connections are stopped during the mapping process.

Adding and removing a tenant

Adding a new tenant to a shard is not a major problem:

SQL> CREATE DATABASE customer_0029;
SQL> use customer_0029
SQL> CREATE TABLE address LIKE customer_template.address;
SQL> CREATE TABLE sales LIKE customer_template.sales;

shell> maxctrl call command schemarouter invalidate Sharded-Service
OK
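
These steps lend themselves to a small wrapper script. A sketch under the assumptions of this PoC (add_tenant.sh is a hypothetical helper; host, credentials and the customer_template schema are taken from above):

#!/bin/bash
#
# add_tenant.sh -- create a new tenant on one shard and refresh the MaxScale cache.

TENANT="${1:?Usage: add_tenant.sh <tenant> <shard_port>}"
PORT="${2:?Usage: add_tenant.sh <tenant> <shard_port>}"

mariadb --user=app --password=secret --host=10.139.158.1 --port="${PORT}" <<EOF
CREATE DATABASE ${TENANT};
CREATE TABLE ${TENANT}.address LIKE customer_template.address;
CREATE TABLE ${TENANT}.sales LIKE customer_template.sales;
EOF

maxctrl call command schemarouter invalidate Sharded-Service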

Removing a tenant from a shard, on the other hand, is somewhat more complicated and must be done in consultation with the application:

SQL> DROP DATABASE customer_0011;

shell> ./sharding_test.php
.....ERROR: Table 'customer_0011.sales' doesn't exist...ERROR: Unknown database 'customer_0011'.ERROR: Unknown database 'customer_0011'......ERROR: Unknown database 'customer_0011'...

shell> maxctrl call command schemarouter clear Sharded-Service
OK

At least I have not come up with a cleverer variant yet. See also Moving a tenant below.

Moving a tenant

The combination of adding and removing would then be moving a tenant from one shard to another shard, also known as re-sharding. This also requires a concerted action to be planned together with the application.

If this is not possible, at least the time window in which the application receives errors can be reduced... The following procedure can be used to move a tenant from shard 2 to shard 3:

SQL> use customer_0020; LOCK TABLES address READ, sales READ;   -- On shard 2; at best, the application is blocked!

shell> mariadb-dump --user=app --password=secret --host=10.139.158.1 --port=3364 --single-transaction --skip-add-locks --databases customer_0020 | mariadb --user=app --password=secret --host=10.139.158.1 --port=3365   # Copy tenant 20 from shard 2 to shard 3

SQL> DROP DATABASE customer_0020;   -- Deleting tenant 20 does not work yet!
ERROR 1192 (HY000): Can't execute the given command because you have active locked tables or an active transaction
SQL> UNLOCK TABLES; DROP DATABASE customer_0020;   -- This is how tenant 20 is deleted.

shell> maxctrl call command schemarouter clear Sharded-Service   # Update the MaxScale database map. Do it quickly!!!

Until the database map is refreshed, the following errors may occur:

error : (47621) [schemarouter] (Sharded-Service); 'customer_0020.' found on servers 'shard2','shard3' for user 'app'@'10.139.158.1'.
error : (47621) [schemarouter] (Sharded-Service); 'customer_0020.address' found on servers 'shard2','shard3' for user 'app'@'10.139.158.1'.
error : (47621) [schemarouter] (Sharded-Service); 'customer_0020.sales' found on servers 'shard2','shard3' for user 'app'@'10.139.158.1'.
error : (47621) [schemarouter] (Sharded-Service); Duplicate tables found, closing session.

And on the application side too:

ERROR: Error: duplicate tables found on two different shards
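
To keep this error window as small as possible, the whole move is also best scripted. A sketch under the assumptions of this PoC (move_tenant.sh is a hypothetical helper; the table locking shown above is omitted here for brevity):

#!/bin/bash
#
# move_tenant.sh -- move one tenant from a source shard to a target shard and
# remap MaxScale as quickly as possible afterwards.

TENANT='customer_0020'
SRC_PORT=3364   # shard2
DST_PORT=3365   # shard3

# Copy the tenant; the application should not write to it during this step!
mariadb-dump --user=app --password=secret --host=10.139.158.1 --port=${SRC_PORT} \
             --single-transaction --skip-add-locks --databases ${TENANT} \
  | mariadb --user=app --password=secret --host=10.139.158.1 --port=${DST_PORT} || exit 1

# Drop the tenant on the source shard and update the database map immediately
mariadb --user=app --password=secret --host=10.139.158.1 --port=${SRC_PORT} \
        --execute="DROP DATABASE ${TENANT}"
maxctrl call command schemarouter clear Sharded-Service
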
Adding or removing a shard

Moving a tenant from one shard to another shard is a small re-sharding operation. It becomes somewhat more complex if you want to add new shards or remove old shards. Subsequently (after the addition or before the removal), a large re-sharding would then take place. The first step is to add a shard to the cluster:

shell> maxctrl list servers
┌────────┬──────────────┬──────┬─────────────┬─────────┬───────────────┬──────────────────┐
│ Server │ Address      │ Port │ Connections │ State   │ GTID          │ Monitor          │
├────────┼──────────────┼──────┼─────────────┼─────────┼───────────────┼──────────────────┤
│ shard1 │ 10.139.158.1 │ 3363 │ 0           │ Running │ 0-3363-26014  │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼───────────────┼──────────────────┤
│ shard2 │ 10.139.158.1 │ 3364 │ 0           │ Running │ 0-3364-240612 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼───────────────┼──────────────────┤
│ shard3 │ 10.139.158.1 │ 3365 │ 0           │ Running │ 0-3365-289873 │ Sharding-Monitor │
└────────┴──────────────┴──────┴─────────────┴─────────┴───────────────┴──────────────────┘

The prepared shard is made known to MaxScale:

shell> maxctrl create server shard4 10.139.158.1 3366
OK

shell> maxctrl list servers
┌────────┬──────────────┬──────┬─────────────┬─────────┬──────────────┬──────────────────┐
│ Server │ Address      │ Port │ Connections │ State   │ GTID         │ Monitor          │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard1 │ 10.139.158.1 │ 3363 │ 1           │ Running │ 0-3363-23676 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard2 │ 10.139.158.1 │ 3364 │ 1           │ Running │ 0-3364-52321 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard3 │ 10.139.158.1 │ 3365 │ 1           │ Running │ 0-3365-39751 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard4 │ 10.139.158.1 │ 3366 │ 0           │ Down    │              │                  │
└────────┴──────────────┴──────┴─────────────┴─────────┴──────────────┴──────────────────┘

The new shard is then linked to the MaxScale Monitor and the service:

shell> MAXCTRL_WARNINGS=0 maxctrl link monitor Sharding-Monitor shard4
OK
shell> MAXCTRL_WARNINGS=0 maxctrl link service Sharded-Service shard4
OK

shell> maxctrl list servers
┌────────┬──────────────┬──────┬─────────────┬─────────┬──────────────┬──────────────────┐
│ Server │ Address      │ Port │ Connections │ State   │ GTID         │ Monitor          │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard1 │ 10.139.158.1 │ 3363 │ 1           │ Running │ 0-3363-24961 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard2 │ 10.139.158.1 │ 3364 │ 1           │ Running │ 0-3364-56215 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard3 │ 10.139.158.1 │ 3365 │ 1           │ Running │ 0-3365-45177 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard4 │ 10.139.158.1 │ 3366 │ 1           │ Running │ 0-3366-32    │ Sharding-Monitor │
└────────┴──────────────┴──────┴─────────────┴─────────┴──────────────┴──────────────────┘

Whether this second step is also absolutely necessary was not investigated.

You can follow the entire process in the MariaDB MaxScale error log:

warning: Discarding journal file '/var/lib/maxscale/Sharding-Monitor_journal.json'. Servers described in the journal are different from the ones configured on the current monitor.
warning: Saving runtime modifications to 'Sharding-Monitor' in '/var/lib/maxscale/maxscale.cnf.d/Sharding-Monitor.cnf'. The modified values will override the values found in the static configuration files.
notice : shard4 sent version string '10.11.7-MariaDB-log'. Detected type: MariaDB, version: 10.11.7.
notice : Server 'shard4' charset: latin1_swedish_ci
notice : Server changed state: shard4[10.139.158.1:3366]: server_up. [Down] -> [Running]
warning: Saving runtime modifications to 'Sharded-Service' in '/var/lib/maxscale/maxscale.cnf.d/Sharded-Service.cnf'. The modified values will override the values found in the static configuration files.
notice : Added 'shard4' to 'Sharded-Service'

What we must not forget here is to also equip the new shard with the proxy protocol:

shell> maxctrl show server shard4 | grep proxy
│              │     "proxy_protocol": false,                 │

shell> MAXCTRL_WARNINGS=0 maxctrl alter server shard4 proxy_protocol=true
OK

And now new tenants can be added to the new shard or old tenants can be moved to the new shard... In our setup, we want to move all tenants from shard 1 to shard 4 and also create a new tenant customer_0040 on shard 4. The individual steps required for this are listed above.

Once shard 1 has been emptied, it can be dismantled:

shell> maxctrl list servers
┌────────┬──────────────┬──────┬─────────────┬─────────┬──────────────┬──────────────────┐
│ Server │ Address      │ Port │ Connections │ State   │ GTID         │ Monitor          │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard1 │ 10.139.158.1 │ 3363 │ 1           │ Running │ 0-3363-25916 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard2 │ 10.139.158.1 │ 3364 │ 1           │ Running │ 0-3364-62887 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard3 │ 10.139.158.1 │ 3365 │ 1           │ Running │ 0-3365-54035 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard4 │ 10.139.158.1 │ 3366 │ 1           │ Running │ 0-3366-2247  │ Sharding-Monitor │
└────────┴──────────────┴──────┴─────────────┴─────────┴──────────────┴──────────────────┘

A shard is deleted with the destroy server command. Before this works, however, a shard must be removed from the monitor and the service:

shell> MAXCTRL_WARNINGS=0 maxctrl unlink service Sharded-Service shard1
OK
shell> MAXCTRL_WARNINGS=0 maxctrl unlink monitor Sharding-Monitor shard1
OK

shell> maxctrl list servers
┌────────┬──────────────┬──────┬─────────────┬─────────┬──────────────┬──────────────────┐
│ Server │ Address      │ Port │ Connections │ State   │ GTID         │ Monitor          │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard1 │ 10.139.158.1 │ 3363 │ 0           │ Running │              │                  │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard2 │ 10.139.158.1 │ 3364 │ 1           │ Running │ 0-3364-64394 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard3 │ 10.139.158.1 │ 3365 │ 1           │ Running │ 0-3365-56072 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard4 │ 10.139.158.1 │ 3366 │ 1           │ Running │ 0-3366-3267  │ Sharding-Monitor │
└────────┴──────────────┴──────┴─────────────┴─────────┴──────────────┴──────────────────┘

Once the shard has been removed from the monitor and the service, it can then be deleted:

shell> maxctrl destroy server shard1
Warning: Object 'shard1' is defined in a static configuration file and cannot be permanently deleted. If MaxScale is restarted, the object will appear again.

To hide these warnings, run:

 export MAXCTRL_WARNINGS=0

OK

shell> maxctrl list servers
┌────────┬──────────────┬──────┬─────────────┬─────────┬──────────────┬──────────────────┐
│ Server │ Address      │ Port │ Connections │ State   │ GTID         │ Monitor          │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard2 │ 10.139.158.1 │ 3364 │ 1           │ Running │ 0-3364-65018 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard3 │ 10.139.158.1 │ 3365 │ 1           │ Running │ 0-3365-56886 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard4 │ 10.139.158.1 │ 3366 │ 1           │ Running │ 0-3366-3648  │ Sharding-Monitor │
└────────┴──────────────┴──────┴─────────────┴─────────┴──────────────┴──────────────────┘

And you can follow the changes in the MaxScale Error Log:

notice : Removed 'shard1' from 'Sharded-Service'
warning: Discarding journal file '/var/lib/maxscale/Sharding-Monitor_journal.json'. Servers described in the journal are different from the ones configured on the current monitor.
notice : Destroyed server 'shard1' at 10.139.158.1:3363

Important: I was informed that with destroy server --force the unlink service and unlink monitor commands are automatically executed by MaxScale.
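
With that, the dismantling of a shard can be reduced to a single call; using the names of this setup:

shell> maxctrl destroy server shard1 --force
OK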

Customising the configuration files

During the shard operations described above we received some warnings:

Warning: Object 'shard1' is defined in a static configuration file and cannot be permanently deleted. If MaxScale is restarted, the object will appear again.

and

Warning: Saving runtime modifications to 'Sharding-Monitor' in '/var/lib/maxscale/maxscale.cnf.d/Sharding-Monitor.cnf'. The modified values will override the values found in the static configuration files.

The corresponding configuration files are automatically created by MaxScale when dynamic system changes are made:

shell> ll /var/lib/maxscale/maxscale.cnf.d/ /etc/maxscale.cnf
-rw-r--r-- 1 root root 612 Feb 13 14:23 /etc/maxscale.cnf

/var/lib/maxscale/maxscale.cnf.d/:
total 12
-rw------- 1 maxscale maxscale 187 Feb 13 16:08 Sharding-Monitor.cnf
-rw------- 1 maxscale maxscale 150 Feb 13 16:07 Sharded-Service.cnf
-rw------- 1 maxscale maxscale  52 Feb 13 15:46 shard4.cnf

shell> cat /var/lib/maxscale/maxscale.cnf.d/*
[Sharded-Service]
debug=true
refresh_interval=10000ms
auth_all_servers=true
log_debug=true
password=secret
router=schemarouter
type=service
user=maxscale_admin
targets=shard2,shard3,shard4

[Sharding-Monitor]
module=galeramon
monitor_interval=1000ms
password=secret
servers=shard2,shard3,shard4
type=monitor
user=maxscale_monitor

[shard4]
address=10.139.158.1
port=3366
type=server

The configuration files would still have to be tidied up accordingly. In a highly dynamic system, you should generally consider whether it would not be better to configure everything dynamically via commands...

Maintenance work on the shard

If a shard is to be taken offline for maintenance work, here in the example shard2, this can be done as follows:

shell> maxctrl list servers
┌────────┬──────────────┬──────┬─────────────┬─────────┬──────────────┬──────────────────┐
│ Server │ Address      │ Port │ Connections │ State   │ GTID         │ Monitor          │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard2 │ 10.139.158.1 │ 3364 │ 1           │ Running │ 0-3364-69817 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard3 │ 10.139.158.1 │ 3365 │ 1           │ Running │ 0-3365-63166 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard4 │ 10.139.158.1 │ 3366 │ 1           │ Running │ 0-3366-6902  │ Sharding-Monitor │
└────────┴──────────────┴──────┴─────────────┴─────────┴──────────────┴──────────────────┘

shell> maxctrl set server shard2 drain
OK
shell> maxctrl set server shard2 maintenance
OK

shell> maxctrl list servers
┌────────┬──────────────┬──────┬─────────────┬──────────────────────┬───────────────┬──────────────────┐
│ Server │ Address      │ Port │ Connections │ State                │ GTID          │ Monitor          │
├────────┼──────────────┼──────┼─────────────┼──────────────────────┼───────────────┼──────────────────┤
│ shard2 │ 10.139.158.1 │ 3364 │ 0           │ Maintenance, Running │ 0-3364-240612 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼──────────────────────┼───────────────┼──────────────────┤
│ shard3 │ 10.139.158.1 │ 3365 │ 0           │ Running              │ 0-3365-289873 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼──────────────────────┼───────────────┼──────────────────┤
│ shard4 │ 10.139.158.1 │ 3366 │ 0           │ Running              │ 0-3366-119848 │ Sharding-Monitor │
└────────┴──────────────┴──────┴─────────────┴──────────────────────┴───────────────┴──────────────────┘

At this point, maintenance work can be carried out on the machine or the database...

Afterwards, BOTH statuses must be cleared again if both have been set (MXS-5028):

shell> maxctrl clear server shard2 maintenance
OK
shell> maxctrl clear server shard2 drain
OK

shell> maxctrl list servers
┌────────┬──────────────┬──────┬─────────────┬─────────┬───────────────┬──────────────────┐
│ Server │ Address      │ Port │ Connections │ State   │ GTID          │ Monitor          │
├────────┼──────────────┼──────┼─────────────┼─────────┼───────────────┼──────────────────┤
│ shard2 │ 10.139.158.1 │ 3364 │ 0           │ Running │ 0-3364-240612 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼───────────────┼──────────────────┤
│ shard3 │ 10.139.158.1 │ 3365 │ 0           │ Running │ 0-3365-289873 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼───────────────┼──────────────────┤
│ shard4 │ 10.139.158.1 │ 3366 │ 0           │ Running │ 0-3366-119848 │ Sharding-Monitor │
└────────┴──────────────┴──────┴─────────────┴─────────┴───────────────┴──────────────────┘

The difference between drain and maintenance is that with drain, no new connections are allowed to the shard, but existing connections wait until they are closed. With maintenance, the connections are terminated immediately by force.
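
The gentle variant can also be scripted: drain first, wait until all connections are gone, and only then switch to maintenance. A sketch, assuming maxctrl's --tsv output with the connection count in the fourth column:

#!/bin/bash
#
# Take a shard out of service gently: drain first, then maintenance.

SERVER="${1:-shard2}"

maxctrl set server "${SERVER}" drain

# Wait until no more connections are open on the drained server
while [ "$(maxctrl list servers --tsv | awk -v s="${SERVER}" '$1 == s { print $4 }')" != '0' ] ; do
  sleep 1
done

maxctrl set server "${SERVER}" maintenance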

Observation of a MariaDB MaxScale sharding system

The MaxScale CLI client maxctrl can be used to query the status of the MariaDB MaxScale load balancer. There are numerous commands for this, mainly list and show:

shell> maxctrl show module schemarouter | head -n 12
┌─────────────┬────────────────────────────────────────────────┐
│ Module      │ schemarouter                                   │
├─────────────┼────────────────────────────────────────────────┤
│ Type        │ Router                                         │
├─────────────┼────────────────────────────────────────────────┤
│ Version     │ V1.0.0                                         │
├─────────────┼────────────────────────────────────────────────┤
│ Maturity    │ Beta                                           │
├─────────────┼────────────────────────────────────────────────┤
│ Description │ A database sharding router for simple sharding │
├─────────────┼────────────────────────────────────────────────┤
│ Parameters  │ ...                                            │

shell> maxctrl list servers
┌────────┬──────────────┬──────┬─────────────┬─────────┬───────────────┬──────────────────┐
│ Server │ Address      │ Port │ Connections │ State   │ GTID          │ Monitor          │
├────────┼──────────────┼──────┼─────────────┼─────────┼───────────────┼──────────────────┤
│ shard2 │ 10.139.158.1 │ 3364 │ 4           │ Running │ 0-3364-290859 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼───────────────┼──────────────────┤
│ shard3 │ 10.139.158.1 │ 3365 │ 4           │ Running │ 0-3365-322671 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼───────────────┼──────────────────┤
│ shard4 │ 10.139.158.1 │ 3366 │ 4           │ Running │ 0-3366-140018 │ Sharding-Monitor │
└────────┴──────────────┴──────┴─────────────┴─────────┴───────────────┴──────────────────┘

The information in the Connections column is confusing, because in this case we actually only have 1, 1 and 2 client connections open against this sharding system.

However, if you look at the situation on the respective shard with SHOW PROCESSLIST, you can see that MaxScale also establishes a connection on EACH shard for each incoming connection. So the display above is actually technically correct, just not what you would expect:

SQL> SHOW PROCESSLIST;
+--------+------------------+----------------------+---------------+---------+------+----------+-----------------------------------------------------------------+----------+
| Id     | User             | Host                 | db            | Command | Time | State    | Info                                                            | Progress |
+--------+------------------+----------------------+---------------+---------+------+----------+-----------------------------------------------------------------+----------+
|    123 | root             | localhost            | customer_0021 | Query   |    0 | starting | show processlist                                                |    0.000 |
|  68107 | maxscale_monitor | 10.139.158.211:35418 | NULL          | Sleep   |    0 |          | NULL                                                            |    0.000 |
| 113372 | app              | 10.139.158.1:47548   | NULL          | Sleep   |   47 |          | NULL                                                            |    0.000 |
| 113538 | app              | 10.139.158.1:49058   | NULL          | Sleep   |   41 |          | NULL                                                            |    0.000 |
| 113662 | app              | 10.139.158.1:47072   | NULL          | Sleep   |   37 |          | NULL                                                            |    0.000 |
| 114789 | app              | 10.139.158.1:39574   | customer_0022 | Query   |    0 | Updating | UPDATE sales SET product = 'Prepare to delete' WHERE id = 15622 |    0.000 |
+--------+------------------+----------------------+---------------+---------+------+----------+-----------------------------------------------------------------+----------+

This does not scale for large systems with hundreds or thousands of clients! Perhaps the MariaDB thread pool feature would help in this case.

According to the MaxScale developer, this is expected behaviour... (MXS-4977)
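
If this fan-out of backend connections becomes a problem, the MariaDB thread pool mentioned above can be activated on each shard. A minimal sketch of the server-side settings (the values are examples, not recommendations):

#
# my.cnf on every shard
#

[mariadb]
thread_handling         = pool-of-threads
thread_pool_size        = 8       # roughly the number of CPU cores
thread_pool_max_threads = 65536   # upper limit, the MariaDB default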

shell> maxctrl list services
┌─────────────────┬──────────────┬─────────────┬───────────────────┬────────────────────────┐
│ Service         │ Router       │ Connections │ Total Connections │ Targets                │
├─────────────────┼──────────────┼─────────────┼───────────────────┼────────────────────────┤
│ Sharded-Service │ schemarouter │ 4           │ 82776             │ shard2, shard3, shard4 │
└─────────────────┴──────────────┴─────────────┴───────────────────┴────────────────────────┘

shell> maxctrl list listeners
┌──────────────────────────┬──────┬──────┬─────────┬─────────────────┐
│ Name                     │ Port │ Host │ State   │ Service         │
├──────────────────────────┼──────┼──────┼─────────┼─────────────────┤
│ Sharded-Service-Listener │ 3306 │ ::   │ Running │ Sharded-Service │
└──────────────────────────┴──────┴──────┴─────────┴─────────────────┘

shell> maxctrl list monitors
┌──────────────────┬─────────┬────────────────────────┐
│ Monitor          │ State   │ Servers                │
├──────────────────┼─────────┼────────────────────────┤
│ Sharding-Monitor │ Running │ shard2, shard3, shard4 │
└──────────────────┴─────────┴────────────────────────┘

shell> maxctrl show server shard2 | head -n 20
┌─────────────────────┬──────────────────────────────────────────────┐
│ Server              │ shard2                                       │
├─────────────────────┼──────────────────────────────────────────────┤
│ Source              │ /etc/maxscale.cnf                            │
├─────────────────────┼──────────────────────────────────────────────┤
│ Address             │ 10.139.158.1                                 │
├─────────────────────┼──────────────────────────────────────────────┤
│ Port                │ 3364                                         │
├─────────────────────┼──────────────────────────────────────────────┤
│ State               │ Running                                      │
├─────────────────────┼──────────────────────────────────────────────┤
│ Version             │ 10.11.7-MariaDB-log                          │
├─────────────────────┼──────────────────────────────────────────────┤
│ Uptime              │ 178960                                       │
├─────────────────────┼──────────────────────────────────────────────┤
│ Last Event          │ server_down                                  │
├─────────────────────┼──────────────────────────────────────────────┤
│ Triggered At        │ Sun, 04 Feb 2024 07:37:17 GMT                │
├─────────────────────┼──────────────────────────────────────────────┤
│ Services            │ Sharded-Service                              │
├─────────────────────┼──────────────────────────────────────────────┤
│ Monitors            │ Sharding-Monitor                             │
├─────────────────────┼──────────────────────────────────────────────┤
...
├─────────────────────┼──────────────────────────────────────────────┤
│ Current Connections │ 5                                            │
├─────────────────────┼──────────────────────────────────────────────┤
│ Total Connections   │ 27                                           │
├─────────────────────┼──────────────────────────────────────────────┤
│ Max Connections     │ 5                                            │

shell> maxctrl show service Sharded-Service
Service:              Sharded-Service
Source:               /var/lib/maxscale/maxscale.cnf.d/Sharded-Service.cnf
Router:               schemarouter
State:                Started
Started At:           3/18/2024, 1:52:30 PM
Users Loaded At:      3/18/2024, 1:52:30 PM
Current Connections:  4
Total Connections:    84590
Max Connections:      5
Cluster:
Servers:              shard2, shard3, shard4
Services:
Filters:
Parameters:
{
  "auth_all_servers": true,
  "connection_keepalive": "300000ms",
  "debug": true,
  "disable_sescmd_history": false,
  "enable_root_user": false,
  "force_connection_keepalive": false,
  "idle_session_pool_time": "-1ms",
  "ignore_tables": [],
  "ignore_tables_regex": null,
  "localhost_match_wildcard_host": true,
  "log_auth_warnings": true,
  "log_debug": true,
  "log_info": false,
  "log_notice": false,
  "log_warning": false,
  "max_connections": 0,
  "max_sescmd_history": 50,
  "max_staleness": "150000ms",
  "multiplex_timeout": "60000ms",
  "net_write_timeout": "0ms",
  "password": "*****",
  "prune_sescmd_history": true,
  "rank": "primary",
  "refresh_databases": false,
  "refresh_interval": "10000ms",
  "retain_last_statements": -1,
  "router": "schemarouter",
  "session_trace": false,
  "strip_db_esc": true,
  "type": "service",
  "user": "maxscale_admin",
  "user_accounts_file": null,
  "user_accounts_file_usage": "add_when_load_ok",
  "version_string": null,
  "wait_timeout": "0ms"
}
Router Diagnostics:
{
  "average_session": 0.028822357131634554,
  "longest_sescmd_chain": 4,
  "longest_session": 50,
  "queries": 761134,
  "sescmd_percentage": 44.44342257736483,
  "shard_map_hits": 84356,
  "shard_map_misses": 5,
  "shard_map_stale": 229,
  "shard_map_updates": 216,
  "shortest_session": 0,
  "times_sescmd_limit_exceeded": 0
}

See also MaxScale SchemaRouter Router diagnostics.

shell> maxctrl show monitor Sharding-Monitor
Monitor:  Sharding-Monitor
Source:   /etc/maxscale.cnf
Module:   galeramon
State:    Running
Servers:  shard1, shard2, shard3
Parameters:
{
  "available_when_donor": false,
  "backend_connect_attempts": 1,
  "backend_connect_timeout": "3000ms",
  "backend_read_timeout": "3000ms",
  "backend_write_timeout": "3000ms",
  "disable_master_failback": false,
  "disable_master_role_setting": false,
  "disk_space_check_interval": "0ms",
  "disk_space_threshold": null,
  "events": "all,master_down,master_up,...,new_donor",
  "journal_max_age": "28800000ms",
  "module": "galeramon",
  "monitor_interval": "1000ms",
  "password": "*****",
  "root_node_as_master": false,
  "script": null,
  "script_timeout": "90000ms",
  "set_donor_nodes": false,
  "type": "monitor",
  "use_priority": false,
  "user": "maxscale_monitor"
}
Monitor Diagnostics:
{
  "disable_master_failback": false,
  "disable_master_role_setting": false,
  "root_node_as_master": false,
  "server_info": [
    { "gtid_binlog_pos": "0-3363-26014", "gtid_current_pos": "0-3363-26014", "master_id": 0, "name": "shard1", "read_only": false, "server_id": 3363 },
    { "gtid_binlog_pos": "0-3364-240612", "gtid_current_pos": "0-3364-240612", "master_id": 0, "name": "shard2", "read_only": false, "server_id": 3364 },
    { "gtid_binlog_pos": "0-3365-289873", "gtid_current_pos": "0-3365-289873", "master_id": 0, "name": "shard3", "read_only": false, "server_id": 3365 }
  ],
  "set_donor_nodes": false,
  "use_priority": false
}

shell> maxctrl list sessions
┌───────┬──────┬──────────────┬───────────────────────┬───────┬─────────────────┬────────┬──────────────┐
│ Id    │ User │ Host         │ Connected             │ Idle  │ Service         │ Memory │ I/O-Activity │
├───────┼──────┼──────────────┼───────────────────────┼───────┼─────────────────┼────────┼──────────────┤
│ 87240 │ app  │ 10.139.158.1 │ 3/18/2024, 2:33:54 PM │ 0     │ Sharded-Service │ 68644  │ 33           │
│ 72654 │ app  │ 10.139.158.1 │ 3/18/2024, 2:25:27 PM │ 506.3 │ Sharded-Service │ 199328 │ 0            │
│ 72364 │ app  │ 10.139.158.1 │ 3/18/2024, 2:25:18 PM │ 516   │ Sharded-Service │ 199328 │ 0            │
│ 72530 │ app  │ 10.139.158.1 │ 3/18/2024, 2:25:23 PM │ 510.5 │ Sharded-Service │ 199328 │ 0            │
└───────┴──────┴──────────────┴───────────────────────┴───────┴─────────────────┴────────┴──────────────┘

shell> maxctrl show session 26
Id:         26
Service:    Sharded-Service
State:      Session started
User:       app
Host:       10.139.158.1
Port:       42854
Database:
Connected:  2/4/2024, 9:31:12 AM
Idle:       610.4
Parameters:
{ "log_error": false, "log_info": false, "log_notice": false, "log_warning": false }
Client TLS Cipher:
Connection attributes:
{
  "_client_name": "libmariadb",
  "_client_version": "3.3.8",
  "_os": "Linux",
  "_pid": "251037",
  "_platform": "x86_64",
  "_server_host": "10.139.158.211",
  "program_name": "mysql"
}
Connections:     shard1, shard2, shard3
Connection IDs:  666, 139, 138
Queries:
Log:
Memory:
{
  "connection_buffers": {
    "backends": {
      "shard1": { "misc": 678, "readq": 65536, "total": 66214, "writeq": 0 },
      "shard2": { "misc": 662, "readq": 0, "total": 662, "writeq": 0 },
      "shard3": { "misc": 678, "readq": 65536, "total": 66214, "writeq": 0 }
    },
    "client": { "misc": 654, "readq": 65536, "total": 66190, "writeq": 0 },
    "total": 199280
  },
  "exec_metadata": 0,
  "last_queries": 0,
  "sescmd_history": 48,
  "total": 199328,
  "variables": 0
}
I/O Activity:  0

shell> maxctrl show listener Sharded-Service-Listener
Name:     Sharded-Service-Listener
Source:   /etc/maxscale.cnf
Service:  Sharded-Service
Parameters:
{
  "MariaDBProtocol": { "allow_replication": true },
  "address": "::",
  "authenticator": null,
  "authenticator_options": null,
  "connection_init_sql_file": null,
  "connection_metadata": [
    "character_set_client=auto",
    "character_set_connection=auto",
    "character_set_results=auto",
    "max_allowed_packet=auto",
    "system_time_zone=auto",
    "time_zone=auto",
    "tx_isolation=auto"
  ],
  "port": 3306,
  "protocol": "MariaDBProtocol",
  "proxy_protocol_networks": null,
  "service": "Sharded-Service",
  "socket": null,
  "sql_mode": "default",
  "ssl": false,
  "ssl_ca": null,
  "ssl_cert": null,
  "ssl_cert_verify_depth": 9,
  "ssl_cipher": null,
  "ssl_crl": null,
  "ssl_key": null,
  "ssl_verify_peer_certificate": false,
  "ssl_verify_peer_host": false,
  "ssl_version": "MAX",
  "type": "listener",
  "user_mapping_file": null
}

shell> maxctrl show module schemarouter
Module:       schemarouter
Type:         Router
Version:      V1.0.0
Maturity:     Beta
Description:  A database sharding router for simple sharding
Parameters:
[
  { "default_value": false, "description": "Enable debug mode", "mandatory": false, "modifiable": true, "name": "debug", "type": "bool" },
  { "default_value": [], "description": "List of tables to ignore when checking for duplicates", "mandatory": false, "modifiable": true, "name": "ignore_tables", "type": "stringlist" },
  { "description": "Regex of tables to ignore when checking for duplicates", "mandatory": false, "modifiable": true, "name": "ignore_tables_regex", "type": "regex" },
  { "default_value": "150000ms", "description": "Maximum allowed staleness of a database map entry before clients block and wait for an update", "mandatory": false, "modifiable": true, "name": "max_staleness", "type": "duration", "unit": "ms" },
  { "default_value": false, "description": "Refresh database mapping information", "mandatory": false, "modifiable": true, "name": "refresh_databases", "type": "bool" },
  { "default_value": "300000ms", "description": "How often to refresh the database mapping information", "mandatory": false, "modifiable": true, "name": "refresh_interval", "type": "duration", "unit": "ms" },
  { "default_value": false, "description": "Retrieve users from all backend servers instead of only one", "mandatory": false, "modifiable": true, "name": "auth_all_servers", "type": "bool" },
  { "default_value": "300000ms", "description": "How ofted idle connections are pinged", "mandatory": false, "modifiable": true, "name": "connection_keepalive", "type": "duration", "unit": "ms" },
  { "deprecated": true, "description": "Alias for 'wait_timeout'", "mandatory": false, "modifiable": true, "name": "connection_timeout", "type": "duration" },
  { "default_value": false, "description": "Disable session command history", "mandatory": false, "modifiable": true, "name": "disable_sescmd_history", "type": "bool" },
  { "default_value": false, "description": "Allow the root user to connect to this service", "mandatory": false, "modifiable": true, "name": "enable_root_user", "type": "bool" },
  { "default_value": false, "description": "Ping connections unconditionally", "mandatory": false, "modifiable": true, "name": "force_connection_keepalive", "type": "bool" },
  { "default_value": "-1ms", "description": "Put connections into pool after session has been idle for this long", "mandatory": false, "modifiable": true, "name": "idle_session_pool_time", "type": "duration", "unit": "ms" },
  { "default_value": true, "description": "Match localhost to wildcard host", "mandatory": false, "modifiable": true, "name": "localhost_match_wildcard_host", "type": "bool" },
  { "default_value": true, "description": "Log a warning when client authentication fails", "mandatory": false, "modifiable": true, "name": "log_auth_warnings", "type": "bool" },
  { "default_value": false, "description": "Log debug messages for this service (debug builds only)", "mandatory": false, "modifiable": true, "name": "log_debug", "type": "bool" },
  { "default_value": false, "description": "Log info messages for this service", "mandatory": false, "modifiable": true, "name": "log_info", "type": "bool" },
  { "default_value": false, "description": "Log notice messages for this service", "mandatory": false, "modifiable": true, "name": "log_notice", "type": "bool" },
  { "default_value": false, "description": "Log warning messages for this service", "mandatory": false, "modifiable": true, "name": "log_warning", "type": "bool" },
  { "default_value": 0, "description": "Maximum number of connections", "mandatory": false, "modifiable": true, "name": "max_connections", "type": "count" },
  { "default_value": 50, "description": "Session command history size", "mandatory": false, "modifiable": true, "name": "max_sescmd_history", "type": "count" },
  { "default_value": "60000ms", "description": "How long a session can wait for a connection to become available", "mandatory": false, "modifiable": true, "name": "multiplex_timeout", "type": "duration", "unit": "ms" },
  { "default_value": "0ms", "description": "Network write timeout", "mandatory": false, "modifiable": true, "name": "net_write_timeout", "type": "duration", "unit": "ms" },
  { "description": "Password for the user used to retrieve database users", "mandatory": true, "modifiable": true, "name": "password", "type": "password" },
  { "default_value": true, "description": "Prune old session command history if the limit is exceeded", "mandatory": false, "modifiable": true, "name": "prune_sescmd_history", "type": "bool" },
  { "default_value": "primary", "description": "Service rank", "enum_values": [ "primary", "secondary" ], "mandatory": false, "modifiable": true, "name": "rank", "type": "enum" },
  { "default_value": -1, "description": "Number of statements kept in memory", "mandatory": false, "modifiable": true, "name": "retain_last_statements", "type": "int" },
  { "default_value": false, "description": "Enable session tracing for this service", "mandatory": false, "modifiable": true, "name": "session_trace", "type": "bool" },
  { "default_value": false, "deprecated": true, "description": "Track session state using server responses", "mandatory": false, "modifiable": true, "name": "session_track_trx_state", "type": "bool" },
  { "default_value": true, "deprecated": true, "description": "Strip escape characters from database names", "mandatory": false, "modifiable": true, "name": "strip_db_esc", "type": "bool" },
  { "description": "Username used to retrieve database users", "mandatory": true, "modifiable": true, "name": "user", "type": "string" },
  { "description": "Load additional users from a file", "mandatory": false, "modifiable": false, "name": "user_accounts_file", "type": "path" },
  { "default_value": "add_when_load_ok", "description": "When and how the user accounts file is used", "enum_values": [ "add_when_load_ok", "file_only_always" ], "mandatory": false, "modifiable": false, "name": "user_accounts_file_usage", "type": "enum" },
  { "description": "Custom version string to use", "mandatory": false, "modifiable": true, "name": "version_string", "type": "string" },
  { "default_value": "0ms", "description": "Connection idle timeout", "mandatory": false, "modifiable": true, "name": "wait_timeout", "type": "duration", "unit": "ms" }
]
Commands:
[
  { "attributes": { "arg_max": 1, "arg_min": 1, "description": "Clear schemarouter shard map cache", "method": "POST", "parameters": [ { "description": "The schemarouter service", "required": true, "type": "SERVICE" } ] }, "id": "clear", "links": { "self": "http://127.0.0.1:8989/v1/modules/schemarouter/clear/" }, "type": "module_command" },
  { "attributes": { "arg_max": 1, "arg_min": 1, "description": "Invalidate schemarouter shard map cache", "method": "POST", "parameters": [ { "description": "The schemarouter service", "required": true, "type": "SERVICE" } ] }, "id": "invalidate", "links": { "self": "http://127.0.0.1:8989/v1/modules/schemarouter/invalidate/" }, "type": "module_command" }
]

shell> maxctrl show commands schemarouter
┌────────────┬────────────┬──────────────────────────┐
│ Command    │ Parameters │ Descriptions             │
├────────────┼────────────┼──────────────────────────┤
│ clear      │ SERVICE    │ The schemarouter service │
├────────────┼────────────┼──────────────────────────┤
│ invalidate │ SERVICE    │ The schemarouter service │
└────────────┴────────────┴──────────────────────────┘

shell> maxctrl show dbusers Sharded-Service
┌───────────────────────┬────────────────┬───────────────────────┬───────┬───────┬────────┬───────┬──────┐
│ User                  │ Host           │ Plugin                │ TLS   │ Super │ Global │ Proxy │ Role │
├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤
│ PUBLIC                │                │                       │ false │ false │ false  │ false │      │
│ app                   │ 10.139.158.%   │ mysql_native_password │ false │ false │ false  │ false │      │
│ app_role              │                │                       │ false │ false │ false  │ false │      │
│ mariadb.sys           │ localhost      │ mysql_native_password │ false │ false │ false  │ false │      │
│ maxscale_admin        │ 10.139.158.210 │ mysql_native_password │ false │ false │ false  │ false │      │
│ maxscale_admin        │ 10.139.158.211 │ mysql_native_password │ false │ false │ false  │ false │      │
│ maxscale_admin_role   │                │                       │ false │ false │ false  │ false │      │
│ maxscale_monitor      │ 10.139.158.210 │ mysql_native_password │ false │ false │ false  │ false │      │
│ maxscale_monitor      │ 10.139.158.211 │ mysql_native_password │ false │ false │ false  │ false │      │
│ maxscale_monitor_role │                │                       │ false │ false │ false  │ false │      │
│ mysql                 │ localhost      │ mysql_native_password │ false │ true  │ true   │ false │      │
│ root                  │ localhost      │ mysql_native_password │ false │ true  │ true   │ false │      │
└───────────────────────┴────────────────┴───────────────────────┴───────┴───────┴────────┴───────┴──────┘

shell> maxctrl show commands mariadbmon
┌───────────────────────────┬─────────────────────────────┬───────────────────────────────────────────────────────────────────────────────┐
│ Command                   │ Parameters                  │ Descriptions                                                                  │
├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ switchover                │ MONITOR, [SERVER], [SERVER] │ Monitor name, New primary (optional), Current primary (optional)             │
│ switchover-force          │ MONITOR, [SERVER], [SERVER] │ Monitor name, New primary (optional), Current primary (optional)             │
│ async-switchover          │ MONITOR, [SERVER], [SERVER] │ Monitor name, New primary (optional), Current primary (optional)             │
│ failover                  │ MONITOR                     │ Monitor name                                                                  │
│ async-failover            │ MONITOR                     │ Monitor name                                                                  │
│ rejoin                    │ MONITOR, SERVER             │ Monitor name, Joining server                                                  │
│ async-rejoin              │ MONITOR, SERVER             │ Monitor name, Joining server                                                  │
│ reset-replication         │ MONITOR, [SERVER]           │ Monitor name, Primary server (optional)                                       │
│ async-reset-replication   │ MONITOR, [SERVER]           │ Monitor name, Primary server (optional)                                       │
│ release-locks             │ MONITOR                     │ Monitor name                                                                  │
│ async-release-locks       │ MONITOR                     │ Monitor name                                                                  │
│ fetch-cmd-result          │ MONITOR                     │ Monitor name                                                                  │
│ cancel-cmd                │ MONITOR                     │ Monitor name                                                                  │
│ async-cs-add-node         │ MONITOR, STRING, STRING     │ Monitor name, Hostname/IP of node to add to ColumnStore cluster, Timeout     │
│ async-cs-remove-node      │ MONITOR, STRING, STRING     │ Monitor name, Hostname/IP of node to remove from ColumnStore cluster, Timeout │
│ cs-get-status             │ MONITOR                     │ Monitor name                                                                  │
│ async-cs-get-status       │ MONITOR                     │ Monitor name                                                                  │
│ async-cs-start-cluster    │ MONITOR, STRING             │ Monitor name, Timeout                                                         │
│ async-cs-stop-cluster     │ MONITOR, STRING             │ Monitor name, Timeout                                                         │
│ async-cs-set-readonly     │ MONITOR, STRING             │ Monitor name, Timeout                                                         │
│ async-cs-set-readwrite    │ MONITOR, STRING             │ Monitor name, Timeout                                                         │
│ async-rebuild-server      │ MONITOR, SERVER, [SERVER]   │ Monitor name, Target server, Source server (optional)                         │
│ async-create-backup       │ MONITOR, SERVER, STRING     │ Monitor name, Source server, Backup name                                      │
│ async-restore-from-backup │ MONITOR, SERVER, STRING     │ Monitor name, Target server, Backup name                                      │
└───────────────────────────┴─────────────────────────────┴───────────────────────────────────────────────────────────────────────────────┘
Literature / Sources
Taxonomy upgrade extras: sharding, maxscale, schemarouter, load balancer, multi-tenant

Sharding with MariaDB MaxScale

Oli Sennhauser - Mon, 2024-03-18 16:08
Overview

This feature should work more or less with MariaDB MaxScale 6.x.y, 22.08.x, 23.02.x, 23.08.x and 24.02.x. We tested it with the latest MaxScale version 23.08.05, since we ran into problems with an older version (MXS-5026).

shell> maxscale --version MaxScale 23.08.5

We used MariaDB 10.11 as the database backend (shards).

Less than about 2% of all MariaDB installations known to us are what we technically understand as multi-tenant systems (each tenant (customer) in its own database, also called a schema).

This MariaDB MaxScale feature is therefore used relatively rarely, and there is an increased risk of running into bugs that nobody has hit before!

In MariaDB MaxScale this feature is called SchemaRouter and is still declared to be of beta quality (MXS-5025):

maxctrl> show module schemarouter ┌─────────────┬────────────────────────────────────────────────┐ │ Module │ schemarouter │ ├─────────────┼────────────────────────────────────────────────┤ │ Type │ Router │ ├─────────────┼────────────────────────────────────────────────┤ │ Version │ V1.0.0 │ ├─────────────┼────────────────────────────────────────────────┤ │ Maturity │ Beta │ ├─────────────┼────────────────────────────────────────────────┤ │ Description │ A database sharding router for simple sharding │ ├─────────────┼────────────────────────────────────────────────┤ │ ...

The target topology should look as follows: Each customer (tenant) lives in its own database (= schema). The databases are distributed over several MariaDB instances (shards). So that the application can access the database transparently, a pair of MaxScale load balancers is placed in front of the shards; it knows where each customer is located and routes the traffic to the corresponding shard. To make the MaxScale load balancers highly available, a virtual IP (VIP) is placed in front of them, e.g. using Keepalived. Anyone who finds this too simple can additionally build each individual shard as a master/slave or Galera cluster construct...

Preparing the shards (MariaDB database instances)

As a first step in this PoC we had problems with the test database. After dropping the test database on all shards the problem disappeared. Alternatively you can run mariadb-secure-installation, which you should do on production systems anyway, or you can use the MaxScale configuration parameters ignore_tables or ignore_tables_regex to allow tables with the same name in different shards (MXS-5027). A sketch for dropping the test database on all shards follows below.

See also: MaxScale Router Parameters.
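Since the shards know nothing about each other, the test database has to be dropped on every shard individually. A minimal sketch for this, assuming the shard ports of this PoC and a root account that can log in without a password (both assumptions):

shell> for PORT in 3363 3364 3365; do
         mariadb --user=root --host=10.139.158.1 --port=${PORT} --execute='DROP DATABASE IF EXISTS test'
       done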

Creating test data

To have something to play with, we created some test data:

-- On shard 1: 2 customers
SQL> CREATE DATABASE customer_0010;
SQL> CREATE TABLE customer_0010.address (id INT UNSIGNED, name VARCHAR(255));
SQL> CREATE TABLE customer_0010.sales (id INT UNSIGNED, product VARCHAR(255), sales TINYINT, amount DECIMAL(6, 2), total_amount DECIMAL(6, 2));
SQL> INSERT INTO customer_0010.address VALUES (1, 'Customer 10 GmbH');
SQL> INSERT INTO customer_0010.sales VALUES (1, 'Apples', 5, 1.2, 6), (2, 'Pears', 2, 0.9, 1.8), (3, 'Bread', 1, 2.5, 2.5);
SQL> CREATE DATABASE customer_0011;
SQL> CREATE TABLE customer_0011.address (id INT UNSIGNED, name VARCHAR(255));
SQL> CREATE TABLE customer_0011.sales (id INT UNSIGNED, product VARCHAR(255), sales TINYINT, amount DECIMAL(6, 2), total_amount DECIMAL(6, 2));
SQL> INSERT INTO customer_0011.address VALUES (1, 'Customer 11 SE');
SQL> INSERT INTO customer_0011.sales VALUES (1, 'Oranges', 2, 1.7, 3.4), (2, 'Salat', 5, 1.2, 6);

-- On shard 2: 3 customers
SQL> CREATE DATABASE customer_0020;
SQL> CREATE TABLE customer_0020.address (id INT UNSIGNED, name VARCHAR(255));
SQL> CREATE TABLE customer_0020.sales (id INT UNSIGNED, product VARCHAR(255), sales TINYINT, amount DECIMAL(6, 2), total_amount DECIMAL(6, 2));
SQL> INSERT INTO customer_0020.address VALUES (1, 'Customer 20 AG');
SQL> INSERT INTO customer_0020.sales VALUES (1, 'Oranges', 2, 1.7, 3.4), (2, 'Salat', 5, 1.2, 6);
SQL> CREATE DATABASE customer_0021;
SQL> CREATE TABLE customer_0021.address (id INT UNSIGNED, name VARCHAR(255));
SQL> CREATE TABLE customer_0021.sales (id INT UNSIGNED, product VARCHAR(255), sales TINYINT, amount DECIMAL(6, 2), total_amount DECIMAL(6, 2));
SQL> INSERT INTO customer_0021.address VALUES (1, 'Customer 21 GmbH');
SQL> INSERT INTO customer_0021.sales VALUES (1, 'Oranges', 2, 1.7, 3.4), (2, 'Salat', 5, 1.2, 6);
SQL> CREATE DATABASE customer_0022;
SQL> CREATE TABLE customer_0022.address (id INT UNSIGNED, name VARCHAR(255));
SQL> CREATE TABLE customer_0022.sales (id INT UNSIGNED, product VARCHAR(255), sales TINYINT, amount DECIMAL(6, 2), total_amount DECIMAL(6, 2));
SQL> INSERT INTO customer_0022.address VALUES (1, 'Customer 22 Gebr.');
SQL> INSERT INTO customer_0022.sales VALUES (1, 'Oranges', 2, 1.7, 3.4), (2, 'Salat', 5, 1.2, 6);

-- On shard 3: 1 customer
SQL> CREATE DATABASE customer_0030;
SQL> CREATE TABLE customer_0030.address (id INT UNSIGNED, name VARCHAR(255));
SQL> CREATE TABLE customer_0030.sales (id INT UNSIGNED, product VARCHAR(255), sales TINYINT, amount DECIMAL(6, 2), total_amount DECIMAL(6, 2));
SQL> INSERT INTO customer_0030.address VALUES (1, 'Customer 30 GmbH');
SQL> INSERT INTO customer_0030.sales VALUES (1, 'Pickles', 2, 2.2, 4.4), (2, 'Salat', 1, 3.1, 3.1), (3, 'Pudding', 5, 2.2, 11.0), (4, 'Asparagus', 12, .3, 3.6);
Creating roles and users

Since in a sharded system, in contrast to e.g. a Galera cluster, the individual database instances do not know anything about each other and do not communicate with each other, we have to create the roles and users (accounts) separately on EACH shard.

MariaDB MaxScale needs one user each for the SchemaRouter service and the monitor (on every shard).

The monitor user is, as the name suggests, responsible for monitoring; the SchemaRouter service user collects the user account information from the sharding backends and forwards the queries to the correct shard.

Since a redundant system typically runs with at least two MaxScale routers, and since we wanted to prevent the account privileges from drifting apart, we work with roles, both for the MaxScale users and for the application users.

MaxScale Monitor User SQL> CREATE ROLE maxscale_monitor_role; SQL> GRANT SELECT ON mysql.user TO 'maxscale_monitor_role'; SQL> GRANT REPLICATION CLIENT ON *.* TO 'maxscale_monitor_role'; SQL> GRANT SLAVE MONITOR ON *.* TO 'maxscale_monitor_role'; SQL> GRANT FILE ON *.* TO 'maxscale_monitor_role'; SQL> GRANT CONNECTION ADMIN ON *.* TO 'maxscale_monitor_role'; SQL> SHOW GRANTS FOR maxscale_monitor_role; +-----------------------------------------------------------------------------------------------+ | Grants for maxscale_monitor_role | +-----------------------------------------------------------------------------------------------+ | GRANT FILE, BINLOG MONITOR, CONNECTION ADMIN, SLAVE MONITOR ON *.* TO `maxscale_monitor_role` | | GRANT SELECT ON `mysql`.`user` TO `maxscale_monitor_role` | +-----------------------------------------------------------------------------------------------+ SQL> CREATE USER maxscale_monitor@'10.139.158.210' IDENTIFIED BY 'secret'; SQL> CREATE USER maxscale_monitor@'10.139.158.211' IDENTIFIED BY 'secret'; SQL> GRANT maxscale_monitor_role TO maxscale_monitor@'10.139.158.210'; SQL> GRANT maxscale_monitor_role TO maxscale_monitor@'10.139.158.211'; SQL> SET DEFAULT ROLE maxscale_monitor_role FOR maxscale_monitor@'10.139.158.210'; SQL> SET DEFAULT ROLE maxscale_monitor_role FOR maxscale_monitor@'10.139.158.211'; SQL> SELECT user, host, is_role, default_role FROM mysql.user WHERE user LIKE 'maxscale_monitor%'; +-----------------------+----------------+---------+-----------------------+ | User | Host | is_role | default_role | +-----------------------+----------------+---------+-----------------------+ | maxscale_monitor_role | | Y | | | maxscale_monitor | 10.139.158.210 | N | maxscale_monitor_role | | maxscale_monitor | 10.139.158.211 | N | maxscale_monitor_role | +-----------------------+----------------+---------+-----------------------+ SQL> SHOW GRANTS FOR maxscale_monitor@'10.139.158.211'; +------------------------------------------------------------------------------------------------------------------------------+ | Grants for maxscale_monitor@10.139.158.211 | +------------------------------------------------------------------------------------------------------------------------------+ | GRANT `maxscale_monitor_role` TO `maxscale_monitor`@`10.139.158.211` | | GRANT USAGE ON *.* TO `maxscale_monitor`@`10.139.158.211` IDENTIFIED BY PASSWORD '*14E65567ABDB5135D0CFD9A70B3032C179A49EE7' | | SET DEFAULT ROLE `maxscale_monitor_role` FOR `maxscale_monitor`@`10.139.158.211` | +------------------------------------------------------------------------------------------------------------------------------+
MaxScale Admin User SQL> CREATE ROLE maxscale_admin_role; SQL> GRANT SHOW DATABASES ON *.* TO 'maxscale_admin_role'; SQL> GRANT SELECT ON mysql.user TO 'maxscale_admin_role'; SQL> GRANT SELECT ON mysql.db TO 'maxscale_admin_role'; SQL> GRANT SELECT ON mysql.tables_priv TO 'maxscale_admin_role'; SQL> GRANT SELECT ON mysql.columns_priv TO 'maxscale_admin_role'; SQL> GRANT SELECT ON mysql.proxies_priv TO 'maxscale_admin_role'; SQL> GRANT SELECT ON mysql.roles_mapping TO 'maxscale_admin_role'; SQL> GRANT SELECT ON mysql.procs_priv TO 'maxscale_admin_role'; SQL> SHOW GRANTS FOR maxscale_admin_role; +------------------------------------------------------------------+ | Grants for maxscale_admin_role | +------------------------------------------------------------------+ | GRANT USAGE ON *.* TO `maxscale_admin_role` | | GRANT SELECT ON `mysql`.`user` TO `maxscale_admin_role` | | GRANT SELECT ON `mysql`.`roles_mapping` TO `maxscale_admin_role` | | GRANT SELECT ON `mysql`.`tables_priv` TO `maxscale_admin_role` | | GRANT SELECT ON `mysql`.`procs_priv` TO `maxscale_admin_role` | | GRANT SELECT ON `mysql`.`db` TO `maxscale_admin_role` | | GRANT SELECT ON `mysql`.`columns_priv` TO `maxscale_admin_role` | | GRANT SELECT ON `mysql`.`proxies_priv` TO `maxscale_admin_role` | +------------------------------------------------------------------+ SQL> CREATE USER maxscale_admin@'10.139.158.210' IDENTIFIED BY 'secret'; SQL> CREATE USER maxscale_admin@'10.139.158.211' IDENTIFIED BY 'secret'; SQL> GRANT maxscale_admin_role TO maxscale_admin@'10.139.158.210'; SQL> GRANT maxscale_admin_role TO maxscale_admin@'10.139.158.211'; SQL> SET DEFAULT ROLE maxscale_admin_role FOR maxscale_admin@'10.139.158.210'; SQL> SET DEFAULT ROLE maxscale_admin_role FOR maxscale_admin@'10.139.158.211'; SQL> SELECT user, host, is_role, default_role FROM mysql.user WHERE user LIKE 'maxscale_admin%'; +---------------------+----------------+---------+---------------------+ | User | Host | is_role | default_role | +---------------------+----------------+---------+---------------------+ | maxscale_admin_role | | Y | | | maxscale_admin | 10.139.158.210 | N | maxscale_admin_role | | maxscale_admin | 10.139.158.211 | N | maxscale_admin_role | +---------------------+----------------+---------+---------------------+ SQL> SHOW GRANTS FOR maxscale_admin@'10.139.158.211'; +----------------------------------------------------------------------------------------------------------------------------+ | Grants for maxscale_admin@10.139.158.211 | +----------------------------------------------------------------------------------------------------------------------------+ | GRANT `maxscale_admin_role` TO `maxscale_admin`@`10.139.158.211` | | GRANT USAGE ON *.* TO `maxscale_admin`@`10.139.158.211` IDENTIFIED BY PASSWORD '*14E65567ABDB5135D0CFD9A70B3032C179A49EE7' | | SET DEFAULT ROLE `maxscale_admin_role` FOR `maxscale_admin`@`10.139.158.211` | +----------------------------------------------------------------------------------------------------------------------------+

See also: SchemaRouter Configuration

Creating the application role and accounts

The application also needs a user, which we create on every shard as follows:

SQL> CREATE ROLE app_role; SQL> GRANT SELECT, INSERT, UPDATE, DELETE ON `customer_%`.* TO 'app_role'; SQL> GRANT SHOW DATABASES ON *.* TO 'app_role'; SQL> GRANT CREATE, DROP, ALTER ON *.* TO 'app_role'; -- For creating new tenant databases SQL> SHOW GRANTS FOR app_role; +----------------------------------------------------------------------+ | Grants for app_role | +----------------------------------------------------------------------+ | GRANT SHOW DATABASES ON *.* TO `app_role` | | GRANT SELECT, INSERT, UPDATE, DELETE ON `customer_%`.* TO `app_role` | +----------------------------------------------------------------------+ SQL> CREATE USER app@'10.139.158.%' IDENTIFIED BY 'secret'; SQL> GRANT app_role TO app@'10.139.158.%'; SQL> SET DEFAULT ROLE app_role FOR app@'10.139.158.%'; SQL> SELECT user, host, is_role, default_role FROM mysql.user WHERE user LIKE 'app%'; +----------+--------------+---------+--------------+ | User | Host | is_role | default_role | +----------+--------------+---------+--------------+ | app_role | | Y | | | app | 10.139.158.% | N | app_role | +----------+--------------+---------+--------------+ SQL> SHOW GRANTS FOR app@'10.139.158.%'; +---------------------------------------------------------------------------------------------------------------+ | Grants for app@10.139.158.% | +---------------------------------------------------------------------------------------------------------------+ | GRANT `app_role` TO `app`@`10.139.158.%` | | GRANT USAGE ON *.* TO `app`@`10.139.158.%` IDENTIFIED BY PASSWORD '*14E65567ABDB5135D0CFD9A70B3032C179A49EE7' | | SET DEFAULT ROLE `app_role` FOR `app`@`10.139.158.%` | +---------------------------------------------------------------------------------------------------------------+
Proxy protocol

Load balancers and proxies have the property that they replace the clients' IP addresses with their own. On the one hand this means that on the database you can no longer see where the client originally came from; on the other hand access privileges can no longer be granted per user and IP, since the check is always done against the IP of the load balancer.

Both problems can be solved using the proxy protocol.

For this, the proxy protocol must be activated on both the database and the load balancer, in this case MaxScale.

On the database side, the proxy protocol is activated as follows:

# # my.cnf # [mariadbd] proxy_protocol_networks = ::1, 10.139.158.0/24, localhost

and on the MaxScale side with:

# # /etc/maxscale.cnf # [shard<n>] type = server proxy_protocol = true

Both settings can be verified with:

SQL> SHOW GLOBAL VARIABLES LIKE 'proxy%'; +-------------------------+---------------------------------+ | Variable_name | Value | +-------------------------+---------------------------------+ | proxy_protocol_networks | ::1, 10.139.158.0/24, localhost | +-------------------------+---------------------------------+ shell> maxctrl show server shard1 | grep proxy │ │ "proxy_protocol": true, │

Sources:


MaxScale SchemaRouter configuration

Next, we prepare the MaxScale configuration for sharding. The file recommended by MariaDB is /etc/maxscale.cnf. Whether it makes more sense to create a separate configuration file under /etc/maxscale.cnf.d/ or even to configure the whole of MaxScale dynamically (/var/lib/maxscale/maxscale.cnf.d/*.cnf) will become clear in the long run. See also the warnings further below. The configuration file for this sharding PoC looks as follows:

# # /etc/maxscale.cnf # [maxscale] threads = auto admin_gui = false [shard1] type = server address = 10.139.158.1 port = 3363 proxy_protocol = true [shard2] type=server address=10.139.158.1 port=3364 proxy_protocol = true [shard3] type = server address = 10.139.158.1 port = 3365 proxy_protocol = true [Sharding-Monitor] type = monitor module = galeramon servers = shard1,shard2,shard3 user = maxscale_monitor password = secret monitor_interval = 1s [Sharded-Service-Listener] type = listener service = Sharded-Service protocol = MariaDBClient port = 3306 [Sharded-Service] type = service router = schemarouter servers = shard1,shard2,shard3 user = maxscale_admin password = secret auth_all_servers = true

Note: Recommendation of the MaxScale developer: "One workaround might be to actually use galeramon to monitor the nodes instead of mariadbmon."

Starting and stopping the MaxScale load balancer

MaxScale is started and stopped as usual via systemd:

shell> systemctl restart maxscale shell> systemctl status maxscale ● maxscale.service - MariaDB MaxScale Database Proxy Loaded: loaded (/lib/systemd/system/maxscale.service; enabled; vendor preset: enabled) Drop-In: /run/systemd/system/service.d └─zzz-lxc-service.conf Active: active (running) since Tue 2024-02-27 09:52:57 UTC; 39s ago Process: 187 ExecStart=/usr/bin/maxscale (code=exited, status=0/SUCCESS) Main PID: 188 (maxscale) Tasks: 10 (limit: 18663) Memory: 4.6M CPU: 150ms CGroup: /system.slice/maxscale.service └─188 /usr/bin/maxscale systemd[1]: Starting MariaDB MaxScale Database Proxy... maxscale[188]: Module 'galeramon' loaded from '/usr/lib/x86_64-linux-gnu/maxscale/libgaleramon.so'. maxscale[188]: Module 'schemarouter' loaded from '/usr/lib/x86_64-linux-gnu/maxscale/libschemarouter.so'. maxscale[188]: Using up to 2.3GiB of memory for query classifier cache systemd[1]: Started MariaDB MaxScale Database Proxy.

If there were errors or warnings, they can be seen in the MaxScale error log:

shell> grep -v notice /var/log/maxscale/maxscale.log 2024-02-13 16:47:22 MariaDB MaxScale is shut down. ---------------------------------------------------- MariaDB MaxScale /var/log/maxscale/maxscale.log Tue Feb 13 16:47:22 2024 ---------------------------------------------------------------------------- 2024-02-27 09:52:56 warning: Discarding journal file '/var/lib/maxscale/Sharding-Monitor_journal.json'. File is for module 'mariadbmon'. Current module is 'galeramon'. 2024-02-27 09:52:56 warning: [galeramon] Invalid 'wsrep_local_index' on server 'shard1': 18446744073709551615
Application tests
Simple application tests
shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 --execute='show databases'
+--------------------+
| Database           |
+--------------------+
| customer_0010      |
| customer_0011      |
| customer_0020      |
| customer_0021      |
| customer_0022      |
| customer_0030      |
| information_schema |
| mysql              |
| performance_schema |
| sys                |
+--------------------+
New command show shards
shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 customer_0030 --execute='show shards' | grep customer_00.* | sort | column -t
customer_0010.address  shard1
customer_0010.sales    shard1
customer_0010.         shard1
customer_0011.address  shard1
customer_0011.sales    shard1
customer_0011.         shard1
customer_0020.address  shard2
customer_0020.sales    shard2
customer_0020.         shard2
customer_0021.address  shard2
customer_0021.sales    shard2
customer_0021.         shard2
customer_0022.address  shard2
customer_0022.sales    shard2
customer_0022.         shard2
customer_0030.address  shard3
customer_0030.sales    shard3
customer_0030.         shard3

New databases are not displayed immediately, but only once the cached data has been updated (refresh_interval, 300 s / 5 min by default).
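If you do not want to wait for the cache entries to expire, the database map cache can also be invalidated manually; this is covered in more detail further below:

shell> maxctrl call command schemarouter invalidate Sharded-Service
OK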

See also: Custom SQL commands

More general tests

As a reminder:

Shard  Port  Customer         State
#1     3363  customer_001<n>  Running
#2     3364  customer_002<n>  Running
#3     3365  customer_003<n>  Running
#4     3366  customer_004<n>  Running
shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 --execute='SELECT @@port' +--------+ | @@port | +--------+ | 3363 | +--------+ shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 --database=customer_0010 --execute='SELECT @@port' +--------+ | @@port | +--------+ | 3363 | +--------+ shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 --database=customer_0020 --execute='SELECT @@port' +--------+ | @@port | +--------+ | 3364 | +--------+ shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 --execute='use customer_0020; SELECT @@port' +--------+ | @@port | +--------+ | 3364 | +--------+ shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 customer_0010 --execute='SELECT @@port' +--------+ | @@port | +--------+ | 3363 | +--------+ shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 customer_0020 --execute='SELECT @@port' +--------+ | @@port | +--------+ | 3364 | +--------+ shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 customer_0030 --execute='SELECT @@port' +--------+ | @@port | +--------+ | 3365 | +--------+
Less simple (backup) tests
shell> mariadb-dump --user=app --password=secret --host=10.139.158.211 --port=3306 --single-transaction customer_0010 > /tmp/customer_0010.sql
shell> echo $?
0
shell> mariadb-dump --user=app --password=secret --host=10.139.158.211 --port=3306 --single-transaction customer_0020 > /tmp/customer_0020.sql
shell> echo $?
0
shell> mariadb-dump --user=app --password=secret --host=10.139.158.211 --port=3306 --single-transaction customer_0030 > /tmp/customer_0030.sql
shell> echo $?
0
shell> mariadb-dump --user=app --password=secret --host=10.139.158.211 --port=3306 --single-transaction --databases customer_0011 > /tmp/customer_0011.sql
shell> echo $?
0
shell> mariadb-dump --user=app --password=secret --host=10.139.158.211 --port=3306 --single-transaction --databases customer_0021 > /tmp/customer_0021.sql
shell> echo $?
0
shell> mariadb-dump --user=app --password=secret --host=10.139.158.211 --port=3306 --single-transaction --databases customer_0030 > /tmp/customer_0030.sql
shell> echo $?
0
shell> ll /tmp/customer_00*sql
-rw-rw-r-- 1 oli oli 2738 Mar 18 12:07 /tmp/customer_0010.sql
-rw-rw-r-- 1 oli oli 2904 Mar 18 12:08 /tmp/customer_0011.sql
-rw-rw-r-- 1 oli oli 2712 Mar 18 12:08 /tmp/customer_0020.sql
-rw-rw-r-- 1 oli oli 2906 Mar 18 12:08 /tmp/customer_0021.sql
-rw-rw-r-- 1 oli oli 2964 Mar 18 12:08 /tmp/customer_0030.sql
shell> tail -n 1 /tmp/customer_*.sql
==> /tmp/customer_0010.sql <==
-- Dump completed on 2024-02-13 14:39:21
==> /tmp/customer_0011.sql <==
-- Dump completed on 2024-02-13 14:39:35
==> /tmp/customer_0020.sql <==
-- Dump completed on 2024-02-13 14:40:15
==> /tmp/customer_0021.sql <==
-- Dump completed on 2024-02-13 14:40:42
==> /tmp/customer_0030.sql <==
-- Dump completed on 2024-02-13 14:40:52
shell> cat /tmp/customer_00*sql | grep -A1 -i insert
INSERT INTO `address` VALUES (1,'Customer 10 GmbH');
--
INSERT INTO `sales` VALUES (1,'Apples',5,1.20,6.00),
--
INSERT INTO `address` VALUES (1,'Customer 11 SE');
--
INSERT INTO `sales` VALUES (1,'Oranges',2,1.70,3.40),
--
INSERT INTO `address` VALUES (1,'Customer 20 AG');
--
INSERT INTO `sales` VALUES (1,'Oranges',2,1.70,3.40),
--
INSERT INTO `address` VALUES (1,'Customer 21 GmbH');
--
INSERT INTO `sales` VALUES (1,'Oranges',2,1.70,3.40),
--
INSERT INTO `address` VALUES (1,'Customer 30 GmbH');
--
INSERT INTO `sales` VALUES (1,'Pickles',2,2.20,4.40),

MaxScale 23.08.4 contained a rather nasty bug: a return value of 0 but no data in the backup! See also the tickets MXS-4966: mariadb-dump gets an error dumping schemas and MXS-4947: Tables in information_schema are treated as a normal tables. The symptoms of the bug look as follows:

Error: Couldn't read status information for table address () Error: Couldn't read status information for table sales ()

An upgrade to MaxScale 23.08.5 is therefore strongly recommended.

More complex application tests

We also created a somewhat more complex test (./sharding_test.php) which processes the following queries:

SET NAMES utf8mb4 SHOW DATABASES use customer_<nnnn> START TRANSACTION; SELECT MIN(id) AS first, MAX(id) AS last FROM `sales` INSERT INTO sales (id, product, sales, amount, total_amount) VALUES (%d, '%s', %f, %f, %f) INSERT INTO sales (id, product, sales, amount, total_amount) VALUES (%d, '%s', %f, %f, %f) UPDATE sales SET product = 'Prepare to delete' WHERE id = %d DELETE FROM sales WHERE id = %d COMMIT
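For manual experiments, such a round trip can also be replayed directly through the MaxScale listener. A minimal sketch, assuming tenant customer_0021; the id value 999001 is made up:

shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 customer_0021 <<_EOF
SET NAMES utf8mb4;
START TRANSACTION;
SELECT MIN(id) AS first, MAX(id) AS last FROM sales;
INSERT INTO sales (id, product, sales, amount, total_amount) VALUES (999001, 'Bananas', 3, 1.10, 3.30);
UPDATE sales SET product = 'Prepare to delete' WHERE id = 999001;
DELETE FROM sales WHERE id = 999001;
COMMIT;
_EOF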

The PHP test ran through flawlessly. The corresponding control query:

SQL> SELECT * FROM customer_0021.sales WHERE id >= (SELECT MAX(id) - 10 FROM customer_0021.sales);

Various load scenarios can additionally be tested with db_bench or the Acronis perfkit. For further information see here.

Cross-shard tests

One might possibly get the idea of running cross-shard queries. This will NOT work, which should not really come as a surprise, since firstly it is not easy to implement and secondly it is documented here:

"Note: As the sharding solution in MaxScale is relatively simple, cross-database queries between two or more shards are not supported."

Source: Simple Sharding with Two Servers

and

"USE db1 is routed to the server with db1. If the database is divided to multiple servers, only one server will get the command."

Source: SchemaRouter

Here a test with UNION:

SQL> use customer_0030 Database changed SQL> SELECT * FROM customer_0020.sales UNION SELECT * FROM customer_0030.sales; ERROR 1146 (42S02): Table 'customer_0020.sales' doesn't exist

And here the counter-proof:

SQL> use customer_0020 Database changed SQL> SELECT * FROM customer_0020.sales UNION SELECT * FROM customer_0030.sales; ERROR 1146 (42S02): Table 'customer_0030.sales' doesn't exist

And here the test with a JOIN:

SQL> use customer_0020 SQL> SELECT * FROM customer_0020.sales a JOIN customer_0030.sales b ON a.id = b.id WHERE a.sales > 1 ; ERROR 1146 (42S02): Table 'customer_0030.sales' doesn't exist SQL> use customer_0030 SQL> SELECT * FROM customer_0020.sales a JOIN customer_0030.sales b ON a.id = b.id WHERE a.sales > 1 ; ERROR 1146 (42S02): Table 'customer_0020.sales' doesn't exist
Operating a MaxScale sharding system

In this chapter we discuss some points that can be useful for operating a MariaDB MaxScale sharding system.

Do-on-all-shards

Since it can happen again and again that O/S or database operations have to be executed on all shards, it certainly makes sense to create a script that executes the same command on all shards, one after the other:

shell> ./do-on-all-shards.sh --sql='SHOW DATABASES'

A script of this kind should greatly reduce the error rate in operations. Operations such as re-sharding a tenant, as described further below, are also best scripted and executed centrally (MXS-5029). A minimal sketch of such a script follows below.
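A minimal sketch, assuming the shard ports and the app account of this PoC; error handling and O/S commands are left out:

#!/bin/bash
#
# do-on-all-shards.sh - minimal sketch, not production-ready
#

# Extract the SQL statement from the --sql=... argument
SQL=''
for arg in "$@"; do
  case "${arg}" in
    --sql=*) SQL="${arg#--sql=}" ;;
  esac
done

# Run the same statement on every shard, one after the other
for PORT in 3363 3364 3365 3366; do
  echo "=== Shard on port ${PORT} ==="
  mariadb --user=app --password=secret --host=10.139.158.1 --port=${PORT} --execute="${SQL}"
done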

Invalidating the database map cache

With the invalidate command you can invalidate the database map cache of the MariaDB MaxScale SchemaRouter. This gives us the possibility to quickly bring the cache up to date again after adding or removing tenants:

shell> maxctrl call command schemarouter invalidate Sharded-Service OK

In contrast to the invalidate command, where the entries are brought up to date after the next refresh_interval, the clear command deletes the entries and a remap is executed immediately.
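The clear command is called in the same way:

shell> maxctrl call command schemarouter clear Sharded-Service
OK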

If you want to clear the database map cache remotely with a REST API call, this works as follows:

shell> curl -i -X POST -u api_admin:secret http://10.139.158.211:8989/v1/maxscale/modules/schemarouter/clear?Sharded-Service HTTP/1.1 204 No Content Connection: close Date: Mon, 18 Mar 24 11:49:58 GMT X-Frame-Options: Deny X-XSS-Protection: 1 Referrer-Policy: same-origin Cache-Control: no-cache
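The invalidate module command should be reachable via the REST API in the same way (an assumption, derived from the self links in the module output above; not verified here):

shell> curl -i -X POST -u api_admin:secret http://10.139.158.211:8989/v1/maxscale/modules/schemarouter/invalidate?Sharded-Service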

Sources:


How to change SchemaRouter variables dynamically?

The refresh_interval specifies the lifetime of the entries in the SchemaRouter database map cache. The default value is 300 s (5 min). Refresh interval is therefore, in my opinion, an unfortunate term, since it does not define the interval between two mappings but the lifetime of the cache entries (lifetime? timeout?). As soon as an entry has been deleted, a new refresh of the "database map" is triggered on every shard. The query currently looks as follows:

SELECT LOWER(t.table_schema), LOWER(t.table_name) FROM information_schema.tables t UNION ALL SELECT LOWER(s.schema_name), '' FROM information_schema.schemata s

As it looks, a simple connect is enough to trigger the refresh of the database map.

The current value of refresh_interval can be queried as follows:

shell> maxctrl show service Sharded-Service | grep refresh_interval | awk -F'│' '{ print $3 }' "refresh_interval": "300000ms",

The following command helps to change the value dynamically:

shell> MAXCTRL_WARNINGS=0 maxctrl alter service Sharded-Service refresh_interval=10s OK

The value should not be set too small, since all other connections are paused during the mapping process.

Sources:


Adding and removing a tenant

Adding a new tenant to a shard is not a big problem:

SQL> CREATE DATABASE customer_0029; SQL> use customer_0029 SQL> CREATE TABLE address LIKE customer_template.address; SQL> CREATE TABLE sales LIKE customer_template.sales; shell> maxctrl call command schemarouter invalidate Sharded-Service OK
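The customer_template database is assumed here to already exist on every shard. A sketch of how it could be created, mirroring the tenant tables from the test data above:

SQL> CREATE DATABASE customer_template;
SQL> CREATE TABLE customer_template.address (id INT UNSIGNED, name VARCHAR(255));
SQL> CREATE TABLE customer_template.sales (id INT UNSIGNED, product VARCHAR(255), sales TINYINT, amount DECIMAL(6, 2), total_amount DECIMAL(6, 2));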

Removing a tenant from a shard, on the other hand, is somewhat more complicated and must be coordinated with the application:

SQL> DROP DATABASE customer_0011; shell> ./sharding_test.php .....ERROR: Table 'customer_0011.sales' doesn't exist...ERROR: Unknown database 'customer_0011'.ERROR: Unknown database 'customer_0011'......ERROR: Unknown database 'customer_0011'... shell> maxctrl call command schemarouter clear Sharded-Service OK

At least I have not yet come up with a smarter variant. See also Moving a tenant below.

Moving a tenant

The combination of adding and removing is then moving a tenant from one shard to another shard, also called re-sharding. Here, too, a concerted action must be planned together with the application.

If this is not possible, at least the time during which the application receives errors can be reduced... The following procedure can be used to move a tenant from shard 2 to shard 3:

SQL> use customer_0020; LOCK TABLES address READ, sales READ;   -- On shard 2; the application may be blocked!
shell> mariadb-dump --user=app --password=secret --host=10.139.158.1 --port=3364 --single-transaction --skip-add-locks --databases customer_0020 | mariadb --user=app --password=secret --host=10.139.158.1 --port=3365   # Copy tenant 20 from shard 2 to shard 3
SQL> DROP DATABASE customer_0020;   -- Dropping tenant 20 does not work!
ERROR 1192 (HY000): Can't execute the given command because you have active locked tables or an active transaction
SQL> UNLOCK TABLES; DROP DATABASE customer_0020;   # This is how dropping tenant 20 works.
shell> maxctrl call command schemarouter clear Sharded-Service   # Update the MaxScale database map. Be quick!!!
Until the database map has been refreshed, the following errors can occur:
error : (47621) [schemarouter] (Sharded-Service); 'customer_0020.' found on servers 'shard2','shard3' for user 'app'@'10.139.158.1'.
error : (47621) [schemarouter] (Sharded-Service); 'customer_0020.address' found on servers 'shard2','shard3' for user 'app'@'10.139.158.1'.
error : (47621) [schemarouter] (Sharded-Service); 'customer_0020.sales' found on servers 'shard2','shard3' for user 'app'@'10.139.158.1'.
error : (47621) [schemarouter] (Sharded-Service); Duplicate tables found, closing session.

And on the application side to:

ERROR: Error: duplicate tables found on two different shards
Adding or removing a shard

Moving a tenant from one shard to another shard is the small re-sharding. It becomes somewhat more complex when you want to add new shards or remove old shards. Afterwards (after adding or before removing), a big re-sharding would take place. First, extending the cluster by one shard:

shell> maxctrl list servers ┌────────┬──────────────┬──────┬─────────────┬─────────┬───────────────┬──────────────────┐ │ Server │ Address │ Port │ Connections │ State │ GTID │ Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼───────────────┼──────────────────┤ │ shard1 │ 10.139.158.1 │ 3363 │ 0 │ Running │ 0-3363-26014 │ Sharding-Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼───────────────┼──────────────────┤ │ shard2 │ 10.139.158.1 │ 3364 │ 0 │ Running │ 0-3364-240612 │ Sharding-Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼───────────────┼──────────────────┤ │ shard3 │ 10.139.158.1 │ 3365 │ 0 │ Running │ 0-3365-289873 │ Sharding-Monitor │ └────────┴──────────────┴──────┴─────────────┴─────────┴───────────────┴──────────────────┘

The prepared shard is made known to MaxScale:

shell> maxctrl create server shard4 10.139.158.1 3366 OK shell> maxctrl list servers ┌────────┬──────────────┬──────┬─────────────┬─────────┬──────────────┬──────────────────┐ │ Server │ Address │ Port │ Connections │ State │ GTID │ Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤ │ shard1 │ 10.139.158.1 │ 3363 │ 1 │ Running │ 0-3363-23676 │ Sharding-Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤ │ shard2 │ 10.139.158.1 │ 3364 │ 1 │ Running │ 0-3364-52321 │ Sharding-Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤ │ shard3 │ 10.139.158.1 │ 3365 │ 1 │ Running │ 0-3365-39751 │ Sharding-Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤ │ shard4 │ 10.139.158.1 │ 3366 │ 0 │ Down │ │ │ └────────┴──────────────┴──────┴─────────────┴─────────┴──────────────┴──────────────────┘

Then the new shard is linked to the MaxScale monitor and the service:

shell> MAXCTRL_WARNINGS=0 maxctrl link monitor Sharding-Monitor shard4 OK shell> MAXCTRL_WARNINGS=0 maxctrl link service Sharded-Service shard4 OK shell> maxctrl list servers ┌────────┬──────────────┬──────┬─────────────┬─────────┬──────────────┬──────────────────┐ │ Server │ Address │ Port │ Connections │ State │ GTID │ Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤ │ shard1 │ 10.139.158.1 │ 3363 │ 1 │ Running │ 0-3363-24961 │ Sharding-Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤ │ shard2 │ 10.139.158.1 │ 3364 │ 1 │ Running │ 0-3364-56215 │ Sharding-Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤ │ shard3 │ 10.139.158.1 │ 3365 │ 1 │ Running │ 0-3365-45177 │ Sharding-Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤ │ shard4 │ 10.139.158.1 │ 3366 │ 1 │ Running │ 0-3366-32 │ Sharding-Monitor │ └────────┴──────────────┴──────┴─────────────┴─────────┴──────────────┴──────────────────┘

Whether this second step is also strictly necessary was not investigated.

The whole process can be followed in the MariaDB MaxScale error log:

warning: Discarding journal file '/var/lib/maxscale/Sharding-Monitor_journal.json'. Servers described in the journal are different from the ones configured on the current monitor. warning: Saving runtime modifications to 'Sharding-Monitor' in '/var/lib/maxscale/maxscale.cnf.d/Sharding-Monitor.cnf'. The modified values will override the values found in the static configuration files. notice : shard4 sent version string '10.11.7-MariaDB-log'. Detected type: MariaDB, version: 10.11.7. notice : Server 'shard4' charset: latin1_swedish_ci notice : Server changed state: shard4[10.139.158.1:3366]: server_up. [Down] -> [Running] warning: Saving runtime modifications to 'Sharded-Service' in '/var/lib/maxscale/maxscale.cnf.d/Sharded-Service.cnf'. The modified values will override the values found in the static configuration files. notice : Added 'shard4' to 'Sharded-Service'

What we must not forget here is to also enable the proxy protocol on the new shard:

shell> maxctrl show server shard4 | grep proxy │ │ "proxy_protocol": false, │ MAXCTRL_WARNINGS=0 maxctrl alter server shard4 proxy_protocol=true OK

And now new tenants can be added on the new shard or old tenants can be moved to the new shard... In our set-up we want to move all tenants from shard 1 to shard 4 and additionally create a new tenant customer_0040 on shard 4. The individual steps required for this are listed above; a sketch combining them follows below.
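A sketch of these steps combined, assuming the two tenants still on shard 1 and omitting the LOCK TABLES step shown above:

shell> for db in customer_0010 customer_0011; do
         mariadb-dump --user=app --password=secret --host=10.139.158.1 --port=3363 --single-transaction --databases ${db} | mariadb --user=app --password=secret --host=10.139.158.1 --port=3366
         mariadb --user=app --password=secret --host=10.139.158.1 --port=3363 --execute="DROP DATABASE ${db}"
       done
shell> mariadb --user=app --password=secret --host=10.139.158.1 --port=3366 --execute='CREATE DATABASE customer_0040'
shell> maxctrl call command schemarouter clear Sharded-Service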

After shard 1 has been emptied, it can be dismantled:

shell> maxctrl list servers ┌────────┬──────────────┬──────┬─────────────┬─────────┬──────────────┬──────────────────┐ │ Server │ Address │ Port │ Connections │ State │ GTID │ Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤ │ shard1 │ 10.139.158.1 │ 3363 │ 1 │ Running │ 0-3363-25916 │ Sharding-Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤ │ shard2 │ 10.139.158.1 │ 3364 │ 1 │ Running │ 0-3364-62887 │ Sharding-Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤ │ shard3 │ 10.139.158.1 │ 3365 │ 1 │ Running │ 0-3365-54035 │ Sharding-Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤ │ shard4 │ 10.139.158.1 │ 3366 │ 1 │ Running │ 0-3366-2247 │ Sharding-Monitor │ └────────┴──────────────┴──────┴─────────────┴─────────┴──────────────┴──────────────────┘

A shard is deleted with the destroy server command. But before this works, the shard must be removed from the monitor and the service:

shell> MAXCTRL_WARNINGS=0 maxctrl unlink service Sharded-Service shard1 OK shell> MAXCTRL_WARNINGS=0 maxctrl unlink monitor Sharding-Monitor shard1 OK shell> maxctrl list servers ┌────────┬──────────────┬──────┬─────────────┬─────────┬──────────────┬──────────────────┐ │ Server │ Address │ Port │ Connections │ State │ GTID │ Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤ │ shard1 │ 10.139.158.1 │ 3363 │ 0 │ Running │ │ │ ├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤ │ shard2 │ 10.139.158.1 │ 3364 │ 1 │ Running │ 0-3364-64394 │ Sharding-Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤ │ shard3 │ 10.139.158.1 │ 3365 │ 1 │ Running │ 0-3365-56072 │ Sharding-Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤ │ shard4 │ 10.139.158.1 │ 3366 │ 1 │ Running │ 0-3366-3267 │ Sharding-Monitor │ └────────┴──────────────┴──────┴─────────────┴─────────┴──────────────┴──────────────────┘

Once the shard has been detached from the monitor and the service, it can then be deleted:

shell> maxctrl destroy server shard1 Warning: Object 'shard1' is defined in a static configuration file and cannot be permanently deleted. If MaxScale is restarted, the object will appear again. To hide these warnings, run: export MAXCTRL_WARNINGS=0 OK shell> maxctrl list servers ┌────────┬──────────────┬──────┬─────────────┬─────────┬──────────────┬──────────────────┐ │ Server │ Address │ Port │ Connections │ State │ GTID │ Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤ │ shard2 │ 10.139.158.1 │ 3364 │ 1 │ Running │ 0-3364-65018 │ Sharding-Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤ │ shard3 │ 10.139.158.1 │ 3365 │ 1 │ Running │ 0-3365-56886 │ Sharding-Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤ │ shard4 │ 10.139.158.1 │ 3366 │ 1 │ Running │ 0-3366-3648 │ Sharding-Monitor │ └────────┴──────────────┴──────┴─────────────┴─────────┴──────────────┴──────────────────┘

And in the MaxScale error log you can nicely follow the changes:

notice : Removed 'shard1' from 'Sharded-Service' warning: Discarding journal file '/var/lib/maxscale/Sharding-Monitor_journal.json'. Servers described in the journal are different from the ones configured on the current monitor. notice : Destroyed server 'shard1' at 10.139.158.1:3363

Important: I was informed that with destroy server --force the unlink service and unlink monitor commands are executed automatically by MaxScale.
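If that is correct, the whole teardown should presumably be reducible to a single command (not verified in this PoC):

shell> maxctrl destroy server shard1 --force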

Source:


Adjusting the configuration files

During the shard operations described above we received some warnings:

Warning: Object 'shard1' is defined in a static configuration file and cannot be permanently deleted. If MaxScale is restarted, the object will appear again.

and

Warning: Saving runtime modifications to 'Sharding-Monitor' in '/var/lib/maxscale/maxscale.cnf.d/Sharding-Monitor.cnf'. The modified values will override the values found in the static configuration files.

The corresponding configuration files are created automatically by MaxScale on dynamic system changes:

shell> ll /var/lib/maxscale/maxscale.cnf.d/ /etc/maxscale.cnf -rw-r--r-- 1 root root 612 Feb 13 14:23 /etc/maxscale.cnf /var/lib/maxscale/maxscale.cnf.d/: total 12 -rw------- 1 maxscale maxscale 187 Feb 13 16:08 Sharding-Monitor.cnf -rw------- 1 maxscale maxscale 150 Feb 13 16:07 Sharded-Service.cnf -rw------- 1 maxscale maxscale 52 Feb 13 15:46 shard4.cnf cat /var/lib/maxscale/maxscale.cnf.d/* [Sharded-Service] debug=true refresh_interval=10000ms auth_all_servers=true log_debug=true password=secret router=schemarouter type=service user=maxscale_admin targets=shard2,shard3,shard4 [Sharding-Monitor] module=galeramon monitor_interval=1000ms password=secret servers=shard2,shard3,shard4 type=monitor user=maxscale_monitor [shard4] address=10.139.158.1 port=3366 type=server

So corresponding touch-ups to the configuration files are still needed. In general, one should consider whether, for a highly dynamic system, everything should not be configured dynamically via commands...

Maintenance work on a shard

If a shard is to be taken offline for maintenance work, in this example shard2, this can be accomplished as follows:

shell> maxctrl list servers ┌────────┬──────────────┬──────┬─────────────┬─────────┬──────────────┬──────────────────┐ │ Server │ Address │ Port │ Connections │ State │ GTID │ Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤ │ shard2 │ 10.139.158.1 │ 3364 │ 1 │ Running │ 0-3364-69817 │ Sharding-Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤ │ shard3 │ 10.139.158.1 │ 3365 │ 1 │ Running │ 0-3365-63166 │ Sharding-Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤ │ shard4 │ 10.139.158.1 │ 3366 │ 1 │ Running │ 0-3366-6902 │ Sharding-Monitor │ └────────┴──────────────┴──────┴─────────────┴─────────┴──────────────┴──────────────────┘ shell> maxctrl set server shard2 drain OK shell> maxctrl set server shard2 maintenance OK shell> maxctrl list servers ┌────────┬──────────────┬──────┬─────────────┬──────────────────────┬───────────────┬──────────────────┐ │ Server │ Address │ Port │ Connections │ State │ GTID │ Monitor │ ├────────┼──────────────┼──────┼─────────────┼──────────────────────┼───────────────┼──────────────────┤ │ shard2 │ 10.139.158.1 │ 3364 │ 0 │ Maintenance, Running │ 0-3364-240612 │ Sharding-Monitor │ ├────────┼──────────────┼──────┼─────────────┼──────────────────────┼───────────────┼──────────────────┤ │ shard3 │ 10.139.158.1 │ 3365 │ 0 │ Running │ 0-3365-289873 │ Sharding-Monitor │ ├────────┼──────────────┼──────┼─────────────┼──────────────────────┼───────────────┼──────────────────┤ │ shard4 │ 10.139.158.1 │ 3366 │ 0 │ Running │ 0-3366-119848 │ Sharding-Monitor │ └────────┴──────────────┴──────┴─────────────┴──────────────────────┴───────────────┴──────────────────┘

At this point the maintenance work on the machine or the database can be carried out...

Afterwards, BOTH states must be cleared again, provided both were set (MXS-5028):

shell> maxctrl clear server shard2 maintenance OK shell> maxctrl clear server shard2 drain OK maxctrl list servers ┌────────┬──────────────┬──────┬─────────────┬─────────┬───────────────┬──────────────────┐ │ Server │ Address │ Port │ Connections │ State │ GTID │ Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼───────────────┼──────────────────┤ │ shard2 │ 10.139.158.1 │ 3364 │ 0 │ Running │ 0-3364-240612 │ Sharding-Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼───────────────┼──────────────────┤ │ shard3 │ 10.139.158.1 │ 3365 │ 0 │ Running │ 0-3365-289873 │ Sharding-Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼───────────────┼──────────────────┤ │ shard4 │ 10.139.158.1 │ 3366 │ 0 │ Running │ 0-3366-119848 │ Sharding-Monitor │ └────────┴──────────────┴──────┴─────────────┴─────────┴───────────────┴──────────────────┘

The difference between drain and maintenance is that with drain no new connections to the shard are accepted, but existing connections are waited for until they are closed. With maintenance, the connections are forcibly terminated immediately.

Observing a MariaDB MaxScale sharding system

You can query the state of the MariaDB MaxScale load balancer with the MaxScale CLI client maxctrl. There are numerous commands for this, mainly list and show:

shell> maxctrl show module schemarouter | head -n 12 ┌─────────────┬────────────────────────────────────────────────┐ │ Module │ schemarouter │ ├─────────────┼────────────────────────────────────────────────┤ │ Type │ Router │ ├─────────────┼────────────────────────────────────────────────┤ │ Version │ V1.0.0 │ ├─────────────┼────────────────────────────────────────────────┤ │ Maturity │ Beta │ ├─────────────┼────────────────────────────────────────────────┤ │ Description │ A database sharding router for simple sharding │ ├─────────────┼────────────────────────────────────────────────┤ │ Parameters │ ... │ shell> maxctrl list servers ┌────────┬──────────────┬──────┬─────────────┬─────────┬───────────────┬──────────────────┐ │ Server │ Address │ Port │ Connections │ State │ GTID │ Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼───────────────┼──────────────────┤ │ shard2 │ 10.139.158.1 │ 3364 │ 4 │ Running │ 0-3364-290859 │ Sharding-Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼───────────────┼──────────────────┤ │ shard3 │ 10.139.158.1 │ 3365 │ 4 │ Running │ 0-3365-322671 │ Sharding-Monitor │ ├────────┼──────────────┼──────┼─────────────┼─────────┼───────────────┼──────────────────┤ │ shard4 │ 10.139.158.1 │ 3366 │ 4 │ Running │ 0-3366-140018 │ Sharding-Monitor │ └────────┴──────────────┴──────┴─────────────┴─────────┴───────────────┴──────────────────┘

The information in the Connections column is confusing, since in this case we only have 1, 1 and 2 connections open on the respective shards of this sharding system.

But if you look at the situation on the respective shard with SHOW PROCESSLIST, you can see that MaxScale opens a connection to EVERY shard for each incoming connection. So the display above is actually technically correct, just not what one would expect:

SQL> SHOW PROCESSLIST; +--------+------------------+----------------------+---------------+---------+------+----------+-----------------------------------------------------------------+----------+ | Id | User | Host | db | Command | Time | State | Info | Progress | +--------+------------------+----------------------+---------------+---------+------+----------+-----------------------------------------------------------------+----------+ | 123 | root | localhost | customer_0021 | Query | 0 | starting | show processlist | 0.000 | | 68107 | maxscale_monitor | 10.139.158.211:35418 | NULL | Sleep | 0 | | NULL | 0.000 | | 113372 | app | 10.139.158.1:47548 | NULL | Sleep | 47 | | NULL | 0.000 | | 113538 | app | 10.139.158.1:49058 | NULL | Sleep | 41 | | NULL | 0.000 | | 113662 | app | 10.139.158.1:47072 | NULL | Sleep | 37 | | NULL | 0.000 | | 114789 | app | 10.139.158.1:39574 | customer_0022 | Query | 0 | Updating | UPDATE sales SET product = 'Prepare to delete' WHERE id = 15622 | 0.000 | +--------+------------------+----------------------+---------------+---------+------+----------+-----------------------------------------------------------------+----------+

This does not scale for large systems with hundreds or thousands of tenants! Possibly the MariaDB thread pool feature could help in such a case; a sketch follows below.

According to the MaxScale developer, this is expected behaviour... (MXS-4977)
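A sketch of how the thread pool could be enabled on every shard (an assumption; this was not tested in this PoC):

#
# my.cnf
#
[mariadbd]
thread_handling = pool-of-threads
# thread_pool_size defaults to the number of CPU cores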

shell> maxctrl list services ┌─────────────────┬──────────────┬─────────────┬───────────────────┬────────────────────────┐ │ Service │ Router │ Connections │ Total Connections │ Targets │ ├─────────────────┼──────────────┼─────────────┼───────────────────┼────────────────────────┤ │ Sharded-Service │ schemarouter │ 4 │ 82776 │ shard2, shard3, shard4 │ └─────────────────┴──────────────┴─────────────┴───────────────────┴────────────────────────┘ shell> maxctrl list listeners ┌──────────────────────────┬──────┬──────┬─────────┬─────────────────┐ │ Name │ Port │ Host │ State │ Service │ ├──────────────────────────┼──────┼──────┼─────────┼─────────────────┤ │ Sharded-Service-Listener │ 3306 │ :: │ Running │ Sharded-Service │ └──────────────────────────┴──────┴──────┴─────────┴─────────────────┘ shell> maxctrl list monitors ┌──────────────────┬─────────┬────────────────────────┐ │ Monitor │ State │ Servers │ ├──────────────────┼─────────┼────────────────────────┤ │ Sharding-Monitor │ Running │ shard2, shard3, shard4 │ └──────────────────┴─────────┴────────────────────────┘ shell> maxctrl show server shard2 | head -n 20 ┌─────────────────────┬──────────────────────────────────────────────┐ │ Server │ shard2 │ ├─────────────────────┼──────────────────────────────────────────────┤ │ Source │ /etc/maxscale.cnf │ ├─────────────────────┼──────────────────────────────────────────────┤ │ Address │ 10.139.158.1 │ ├─────────────────────┼──────────────────────────────────────────────┤ │ Port │ 3364 │ ├─────────────────────┼──────────────────────────────────────────────┤ │ State │ Running │ ├─────────────────────┼──────────────────────────────────────────────┤ │ Version │ 10.11.7-MariaDB-log │ ├─────────────────────┼──────────────────────────────────────────────┤ │ Uptime │ 178960 │ ├─────────────────────┼──────────────────────────────────────────────┤ │ Last Event │ server_down │ ├─────────────────────┼──────────────────────────────────────────────┤ │ Triggered At │ Sun, 04 Feb 2024 07:37:17 GMT │ ├─────────────────────┼──────────────────────────────────────────────┤ │ Services │ Sharded-Service │ ├─────────────────────┼──────────────────────────────────────────────┤ │ Monitors │ Sharding-Monitor │ ├─────────────────────┼──────────────────────────────────────────────┤ ... 
├─────────────────────┼──────────────────────────────────────────────┤ │ Current Connections │ 5 │ ├─────────────────────┼──────────────────────────────────────────────┤ │ Total Connections │ 27 │ ├─────────────────────┼──────────────────────────────────────────────┤ │ Max Connections │ 5 │ shell> maxctrl show service Sharded-Service ┌─────────────────────┬──────────────────────────────────────────────────────┐ │ Service │ Sharded-Service │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Source │ /var/lib/maxscale/maxscale.cnf.d/Sharded-Service.cnf │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Router │ schemarouter │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ State │ Started │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Started At │ 3/18/2024, 1:52:30 PM │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Users Loaded At │ 3/18/2024, 1:52:30 PM │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Current Connections │ 4 │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Total Connections │ 84590 │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Max Connections │ 5 │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Cluster │ │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Servers │ shard2 │ │ │ shard3 │ │ │ shard4 │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Services │ │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Filters │ │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Parameters │ { │ │ │ "auth_all_servers": true, │ │ │ "connection_keepalive": "300000ms", │ │ │ "debug": true, │ │ │ "disable_sescmd_history": false, │ │ │ "enable_root_user": false, │ │ │ "force_connection_keepalive": false, │ │ │ "idle_session_pool_time": "-1ms", │ │ │ "ignore_tables": [], │ │ │ "ignore_tables_regex": null, │ │ │ "localhost_match_wildcard_host": true, │ │ │ "log_auth_warnings": true, │ │ │ "log_debug": true, │ │ │ "log_info": false, │ │ │ "log_notice": false, │ │ │ "log_warning": false, │ │ │ "max_connections": 0, │ │ │ "max_sescmd_history": 50, │ │ │ "max_staleness": "150000ms", │ │ │ "multiplex_timeout": "60000ms", │ │ │ "net_write_timeout": "0ms", │ │ │ "password": "*****", │ │ │ "prune_sescmd_history": true, │ │ │ "rank": "primary", │ │ │ "refresh_databases": false, │ │ │ "refresh_interval": "10000ms", │ │ │ "retain_last_statements": -1, │ │ │ "router": "schemarouter", │ │ │ "session_trace": false, │ │ │ "strip_db_esc": true, │ │ │ "type": "service", │ │ │ "user": "maxscale_admin", │ │ │ "user_accounts_file": null, │ │ │ "user_accounts_file_usage": "add_when_load_ok", │ │ │ "version_string": null, │ │ │ "wait_timeout": "0ms" │ │ │ } │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Router Diagnostics │ { │ │ │ "average_session": 0.028822357131634554, │ │ │ "longest_sescmd_chain": 4, │ │ │ "longest_session": 50, │ │ │ "queries": 761134, │ │ │ "sescmd_percentage": 44.44342257736483, │ │ │ "shard_map_hits": 84356, │ │ │ "shard_map_misses": 5, │ │ │ "shard_map_stale": 229, │ │ │ "shard_map_updates": 216, │ │ │ "shortest_session": 0, │ │ │ "times_sescmd_limit_exceeded": 0 │ │ │ } │ 
└─────────────────────┴──────────────────────────────────────────────────────┘

See also MaxScale SchemaRouter Router diagnostics.

shell> maxctrl show monitor Sharding-Monitor ┌─────────────────────┬──────────────────────────────────────────────────────────┐ │ Monitor │ Sharding-Monitor │ ├─────────────────────┼──────────────────────────────────────────────────────────┤ │ Source │ /etc/maxscale.cnf │ ├─────────────────────┼──────────────────────────────────────────────────────────┤ │ Module │ galeramon │ ├─────────────────────┼──────────────────────────────────────────────────────────┤ │ State │ Running │ ├─────────────────────┼──────────────────────────────────────────────────────────┤ │ Servers │ shard1 │ │ │ shard2 │ │ │ shard3 │ ├─────────────────────┼──────────────────────────────────────────────────────────┤ │ Parameters │ { │ │ │ "available_when_donor": false, │ │ │ "backend_connect_attempts": 1, │ │ │ "backend_connect_timeout": "3000ms", │ │ │ "backend_read_timeout": "3000ms", │ │ │ "backend_write_timeout": "3000ms", │ │ │ "disable_master_failback": false, │ │ │ "disable_master_role_setting": false, │ │ │ "disk_space_check_interval": "0ms", │ │ │ "disk_space_threshold": null, │ │ │ "events": "all,master_down,master_up,...,new_donor", │ │ │ "journal_max_age": "28800000ms", │ │ │ "module": "galeramon", │ │ │ "monitor_interval": "1000ms", │ │ │ "password": "*****", │ │ │ "root_node_as_master": false, │ │ │ "script": null, │ │ │ "script_timeout": "90000ms", │ │ │ "set_donor_nodes": false, │ │ │ "type": "monitor", │ │ │ "use_priority": false, │ │ │ "user": "maxscale_monitor" │ │ │ } │ ├─────────────────────┼──────────────────────────────────────────────────────────┤ │ Monitor Diagnostics │ { │ │ │ "disable_master_failback": false, │ │ │ "disable_master_role_setting": false, │ │ │ "root_node_as_master": false, │ │ │ "server_info": [ │ │ │ { │ │ │ "gtid_binlog_pos": "0-3363-26014", │ │ │ "gtid_current_pos": "0-3363-26014", │ │ │ "master_id": 0, │ │ │ "name": "shard1", │ │ │ "read_only": false, │ │ │ "server_id": 3363 │ │ │ }, │ │ │ { │ │ │ "gtid_binlog_pos": "0-3364-240612", │ │ │ "gtid_current_pos": "0-3364-240612", │ │ │ "master_id": 0, │ │ │ "name": "shard2", │ │ │ "read_only": false, │ │ │ "server_id": 3364 │ │ │ }, │ │ │ { │ │ │ "gtid_binlog_pos": "0-3365-289873", │ │ │ "gtid_current_pos": "0-3365-289873", │ │ │ "master_id": 0, │ │ │ "name": "shard3", │ │ │ "read_only": false, │ │ │ "server_id": 3365 │ │ │ } │ │ │ ], │ │ │ "set_donor_nodes": false, │ │ │ "use_priority": false │ │ │ } │ └─────────────────────┴──────────────────────────────────────────────────────────┘ shell> maxctrl list sessions; ┌───────┬──────┬──────────────┬───────────────────────┬───────┬─────────────────┬────────┬──────────────┐ │ Id │ User │ Host │ Connected │ Idle │ Service │ Memory │ I/O-Activity │ ├───────┼──────┼──────────────┼───────────────────────┼───────┼─────────────────┼────────┼──────────────┤ │ 87240 │ app │ 10.139.158.1 │ 3/18/2024, 2:33:54 PM │ 0 │ Sharded-Service │ 68644 │ 33 │ ├───────┼──────┼──────────────┼───────────────────────┼───────┼─────────────────┼────────┼──────────────┤ │ 72654 │ app │ 10.139.158.1 │ 3/18/2024, 2:25:27 PM │ 506.3 │ Sharded-Service │ 199328 │ 0 │ ├───────┼──────┼──────────────┼───────────────────────┼───────┼─────────────────┼────────┼──────────────┤ │ 72364 │ app │ 10.139.158.1 │ 3/18/2024, 2:25:18 PM │ 516 │ Sharded-Service │ 199328 │ 0 │ ├───────┼──────┼──────────────┼───────────────────────┼───────┼─────────────────┼────────┼──────────────┤ │ 72530 │ app │ 10.139.158.1 │ 3/18/2024, 2:25:23 PM │ 510.5 │ Sharded-Service │ 199328 │ 0 │ 
└───────┴──────┴──────────────┴───────────────────────┴───────┴─────────────────┴────────┴──────────────┘ shell> maxctrl show session 26 ┌───────────────────────┬───────────────────────────────────────┐ │ Id │ 26 │ ├───────────────────────┼───────────────────────────────────────┤ │ Service │ Sharded-Service │ ├───────────────────────┼───────────────────────────────────────┤ │ State │ Session started │ ├───────────────────────┼───────────────────────────────────────┤ │ User │ app │ ├───────────────────────┼───────────────────────────────────────┤ │ Host │ 10.139.158.1 │ ├───────────────────────┼───────────────────────────────────────┤ │ Port │ 42854 │ ├───────────────────────┼───────────────────────────────────────┤ │ Database │ │ ├───────────────────────┼───────────────────────────────────────┤ │ Connected │ 2/4/2024, 9:31:12 AM │ ├───────────────────────┼───────────────────────────────────────┤ │ Idle │ 610.4 │ ├───────────────────────┼───────────────────────────────────────┤ │ Parameters │ { │ │ │ "log_error": false, │ │ │ "log_info": false, │ │ │ "log_notice": false, │ │ │ "log_warning": false │ │ │ } │ ├───────────────────────┼───────────────────────────────────────┤ │ Client TLS Cipher │ │ ├───────────────────────┼───────────────────────────────────────┤ │ Connection attributes │ { │ │ │ "_client_name": "libmariadb", │ │ │ "_client_version": "3.3.8", │ │ │ "_os": "Linux", │ │ │ "_pid": "251037", │ │ │ "_platform": "x86_64", │ │ │ "_server_host": "10.139.158.211", │ │ │ "program_name": "mysql" │ │ │ } │ ├───────────────────────┼───────────────────────────────────────┤ │ Connections │ shard1 │ │ │ shard2 │ │ │ shard3 │ ├───────────────────────┼───────────────────────────────────────┤ │ Connection IDs │ 666 │ │ │ 139 │ │ │ 138 │ ├───────────────────────┼───────────────────────────────────────┤ │ Queries │ │ ├───────────────────────┼───────────────────────────────────────┤ │ Log │ │ ├───────────────────────┼───────────────────────────────────────┤ │ Memory │ { │ │ │ "connection_buffers": { │ │ │ "backends": { │ │ │ "shard1": { │ │ │ "misc": 678, │ │ │ "readq": 65536, │ │ │ "total": 66214, │ │ │ "writeq": 0 │ │ │ }, │ │ │ "shard2": { │ │ │ "misc": 662, │ │ │ "readq": 0, │ │ │ "total": 662, │ │ │ "writeq": 0 │ │ │ }, │ │ │ "shard3": { │ │ │ "misc": 678, │ │ │ "readq": 65536, │ │ │ "total": 66214, │ │ │ "writeq": 0 │ │ │ } │ │ │ }, │ │ │ "client": { │ │ │ "misc": 654, │ │ │ "readq": 65536, │ │ │ "total": 66190, │ │ │ "writeq": 0 │ │ │ }, │ │ │ "total": 199280 │ │ │ }, │ │ │ "exec_metadata": 0, │ │ │ "last_queries": 0, │ │ │ "sescmd_history": 48, │ │ │ "total": 199328, │ │ │ "variables": 0 │ │ │ } │ ├───────────────────────┼───────────────────────────────────────┤ │ I/O Activity │ 0 │ └───────────────────────┴───────────────────────────────────────┘ shell> maxctrl show listener Sharded-Service-Listener ┌────────────┬───────────────────────────────────────────┐ │ Name │ Sharded-Service-Listener │ ├────────────┼───────────────────────────────────────────┤ │ Source │ /etc/maxscale.cnf │ ├────────────┼───────────────────────────────────────────┤ │ Service │ Sharded-Service │ ├────────────┼───────────────────────────────────────────┤ │ Parameters │ { │ │ │ "MariaDBProtocol": { │ │ │ "allow_replication": true │ │ │ }, │ │ │ "address": "::", │ │ │ "authenticator": null, │ │ │ "authenticator_options": null, │ │ │ "connection_init_sql_file": null, │ │ │ "connection_metadata": [ │ │ │ "character_set_client=auto", │ │ │ "character_set_connection=auto", │ │ │ "character_set_results=auto", │ │ │ 
"max_allowed_packet=auto", │ │ │ "system_time_zone=auto", │ │ │ "time_zone=auto", │ │ │ "tx_isolation=auto" │ │ │ ], │ │ │ "port": 3306, │ │ │ "protocol": "MariaDBProtocol", │ │ │ "proxy_protocol_networks": null, │ │ │ "service": "Sharded-Service", │ │ │ "socket": null, │ │ │ "sql_mode": "default", │ │ │ "ssl": false, │ │ │ "ssl_ca": null, │ │ │ "ssl_cert": null, │ │ │ "ssl_cert_verify_depth": 9, │ │ │ "ssl_cipher": null, │ │ │ "ssl_crl": null, │ │ │ "ssl_key": null, │ │ │ "ssl_verify_peer_certificate": false, │ │ │ "ssl_verify_peer_host": false, │ │ │ "ssl_version": "MAX", │ │ │ "type": "listener", │ │ │ "user_mapping_file": null │ │ │ } │ └────────────┴───────────────────────────────────────────┘ shell> maxctrl show module schemarouter ┌─────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │ Module │ schemarouter │ ├─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Type │ Router │ ├─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Version │ V1.0.0 │ ├─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Maturity │ Beta │ ├─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Description │ A database sharding router for simple sharding │ ├─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Parameters │ [ │ │ │ { │ │ │ "default_value": false, │ │ │ "description": "Enable debug mode", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "debug", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": [], │ │ │ "description": "List of tables to ignore when checking for duplicates", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "ignore_tables", │ │ │ "type": "stringlist" │ │ │ }, │ │ │ { │ │ │ "description": "Regex of tables to ignore when checking for duplicates", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "ignore_tables_regex", │ │ │ "type": "regex" │ │ │ }, │ │ │ { │ │ │ "default_value": "150000ms", │ │ │ "description": "Maximum allowed staleness of a database map entry before clients block and wait for an update", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "max_staleness", │ │ │ "type": "duration", │ │ │ "unit": "ms" │ │ │ }, │ │ │ { │ │ │ "default_value": false, │ │ │ "description": "Refresh database mapping information", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "refresh_databases", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": "300000ms", │ │ │ "description": "How often to refresh the database mapping information", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "refresh_interval", │ │ │ "type": "duration", │ │ │ "unit": "ms" │ │ │ }, │ │ │ { │ │ │ "default_value": false, │ │ │ "description": "Retrieve users from all backend servers instead of only one", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "auth_all_servers", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": "300000ms", │ │ │ "description": "How ofted idle connections are pinged", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": 
"connection_keepalive", │ │ │ "type": "duration", │ │ │ "unit": "ms" │ │ │ }, │ │ │ { │ │ │ "deprecated": true, │ │ │ "description": "Alias for 'wait_timeout'", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "connection_timeout", │ │ │ "type": "duration" │ │ │ }, │ │ │ { │ │ │ "default_value": false, │ │ │ "description": "Disable session command history", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "disable_sescmd_history", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": false, │ │ │ "description": "Allow the root user to connect to this service", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "enable_root_user", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": false, │ │ │ "description": "Ping connections unconditionally", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "force_connection_keepalive", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": "-1ms", │ │ │ "description": "Put connections into pool after session has been idle for this long", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "idle_session_pool_time", │ │ │ "type": "duration", │ │ │ "unit": "ms" │ │ │ }, │ │ │ { │ │ │ "default_value": true, │ │ │ "description": "Match localhost to wildcard host", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "localhost_match_wildcard_host", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": true, │ │ │ "description": "Log a warning when client authentication fails", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "log_auth_warnings", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": false, │ │ │ "description": "Log debug messages for this service (debug builds only)", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "log_debug", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": false, │ │ │ "description": "Log info messages for this service", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "log_info", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": false, │ │ │ "description": "Log notice messages for this service", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "log_notice", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": false, │ │ │ "description": "Log warning messages for this service", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "log_warning", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": 0, │ │ │ "description": "Maximum number of connections", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "max_connections", │ │ │ "type": "count" │ │ │ }, │ │ │ { │ │ │ "default_value": 50, │ │ │ "description": "Session command history size", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "max_sescmd_history", │ │ │ "type": "count" │ │ │ }, │ │ │ { │ │ │ "default_value": "60000ms", │ │ │ "description": "How long a session can wait for a connection to become available", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "multiplex_timeout", │ │ │ "type": "duration", │ │ │ "unit": "ms" │ │ │ }, │ │ │ { │ │ │ "default_value": "0ms", │ │ │ "description": "Network write timeout", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "net_write_timeout", │ │ │ "type": "duration", │ │ │ "unit": "ms" │ │ │ }, │ │ │ { │ │ │ "description": "Password for the user used to retrieve database users", │ │ │ "mandatory": true, │ │ │ 
"modifiable": true, │ │ │ "name": "password", │ │ │ "type": "password" │ │ │ }, │ │ │ { │ │ │ "default_value": true, │ │ │ "description": "Prune old session command history if the limit is exceeded", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "prune_sescmd_history", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": "primary", │ │ │ "description": "Service rank", │ │ │ "enum_values": [ │ │ │ "primary", │ │ │ "secondary" │ │ │ ], │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "rank", │ │ │ "type": "enum" │ │ │ }, │ │ │ { │ │ │ "default_value": -1, │ │ │ "description": "Number of statements kept in memory", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "retain_last_statements", │ │ │ "type": "int" │ │ │ }, │ │ │ { │ │ │ "default_value": false, │ │ │ "description": "Enable session tracing for this service", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "session_trace", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": false, │ │ │ "deprecated": true, │ │ │ "description": "Track session state using server responses", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "session_track_trx_state", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": true, │ │ │ "deprecated": true, │ │ │ "description": "Strip escape characters from database names", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "strip_db_esc", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "description": "Username used to retrieve database users", │ │ │ "mandatory": true, │ │ │ "modifiable": true, │ │ │ "name": "user", │ │ │ "type": "string" │ │ │ }, │ │ │ { │ │ │ "description": "Load additional users from a file", │ │ │ "mandatory": false, │ │ │ "modifiable": false, │ │ │ "name": "user_accounts_file", │ │ │ "type": "path" │ │ │ }, │ │ │ { │ │ │ "default_value": "add_when_load_ok", │ │ │ "description": "When and how the user accounts file is used", │ │ │ "enum_values": [ │ │ │ "add_when_load_ok", │ │ │ "file_only_always" │ │ │ ], │ │ │ "mandatory": false, │ │ │ "modifiable": false, │ │ │ "name": "user_accounts_file_usage", │ │ │ "type": "enum" │ │ │ }, │ │ │ { │ │ │ "description": "Custom version string to use", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "version_string", │ │ │ "type": "string" │ │ │ }, │ │ │ { │ │ │ "default_value": "0ms", │ │ │ "description": "Connection idle timeout", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "wait_timeout", │ │ │ "type": "duration", │ │ │ "unit": "ms" │ │ │ } │ │ │ ] │ ├─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Commands │ [ │ │ │ { │ │ │ "attributes": { │ │ │ "arg_max": 1, │ │ │ "arg_min": 1, │ │ │ "description": "Clear schemarouter shard map cache", │ │ │ "method": "POST", │ │ │ "parameters": [ │ │ │ { │ │ │ "description": "The schemarouter service", │ │ │ "required": true, │ │ │ "type": "SERVICE" │ │ │ } │ │ │ ] │ │ │ }, │ │ │ "id": "clear", │ │ │ "links": { │ │ │ "self": "http://127.0.0.1:8989/v1/modules/schemarouter/clear/" │ │ │ }, │ │ │ "type": "module_command" │ │ │ }, │ │ │ { │ │ │ "attributes": { │ │ │ "arg_max": 1, │ │ │ "arg_min": 1, │ │ │ "description": "Invalidate schemarouter shard map cache", │ │ │ "method": "POST", │ │ │ "parameters": [ │ │ │ { │ │ │ "description": "The schemarouter service", │ │ │ "required": true, │ │ │ "type": "SERVICE" │ │ │ } │ │ │ ] │ │ │ }, │ │ │ "id": "invalidate", │ │ │ 
"links": { │ │ │ "self": "http://127.0.0.1:8989/v1/modules/schemarouter/invalidate/" │ │ │ }, │ │ │ "type": "module_command" │ │ │ } │ │ │ ] │ └─────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘ shell> maxctrl show commands schemarouter ┌────────────┬────────────┬──────────────────────────┐ │ Command │ Parameters │ Descriptions │ ├────────────┼────────────┼──────────────────────────┤ │ clear │ SERVICE │ The schemarouter service │ ├────────────┼────────────┼──────────────────────────┤ │ invalidate │ SERVICE │ The schemarouter service │ └────────────┴────────────┴──────────────────────────┘ shell> maxctrl show dbusers Sharded-Service ┌───────────────────────┬────────────────┬───────────────────────┬───────┬───────┬────────┬───────┬──────┐ │ User │ Host │ Plugin │ TLS │ Super │ Global │ Proxy │ Role │ ├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤ │ PUBLIC │ │ │ false │ false │ false │ false │ │ ├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤ │ app │ 10.139.158.% │ mysql_native_password │ false │ false │ false │ false │ │ ├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤ │ app_role │ │ │ false │ false │ false │ false │ │ ├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤ │ mariadb.sys │ localhost │ mysql_native_password │ false │ false │ false │ false │ │ ├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤ │ maxscale_admin │ 10.139.158.210 │ mysql_native_password │ false │ false │ false │ false │ │ ├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤ │ maxscale_admin │ 10.139.158.211 │ mysql_native_password │ false │ false │ false │ false │ │ ├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤ │ maxscale_admin_role │ │ │ false │ false │ false │ false │ │ ├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤ │ maxscale_monitor │ 10.139.158.210 │ mysql_native_password │ false │ false │ false │ false │ │ ├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤ │ maxscale_monitor │ 10.139.158.211 │ mysql_native_password │ false │ false │ false │ false │ │ ├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤ │ maxscale_monitor_role │ │ │ false │ false │ false │ false │ │ ├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤ │ mysql │ localhost │ mysql_native_password │ false │ true │ true │ false │ │ ├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤ │ root │ localhost │ mysql_native_password │ false │ true │ true │ false │ │ └───────────────────────┴────────────────┴───────────────────────┴───────┴───────┴────────┴───────┴──────┘ shell> maxctrl show commands mariadbmon ┌───────────────────────────┬─────────────────────────────┬───────────────────────────────────────────────────────────────────────────────┐ │ Command │ Parameters │ Descriptions │ 
├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ switchover │ MONITOR, [SERVER], [SERVER] │ Monitor name, New primary (optional), Current primary (optional) │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ switchover-force │ MONITOR, [SERVER], [SERVER] │ Monitor name, New primary (optional), Current primary (optional) │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-switchover │ MONITOR, [SERVER], [SERVER] │ Monitor name, New primary (optional), Current primary (optional) │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ failover │ MONITOR │ Monitor name │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-failover │ MONITOR │ Monitor name │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ rejoin │ MONITOR, SERVER │ Monitor name, Joining server │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-rejoin │ MONITOR, SERVER │ Monitor name, Joining server │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ reset-replication │ MONITOR, [SERVER] │ Monitor name, Primary server (optional) │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-reset-replication │ MONITOR, [SERVER] │ Monitor name, Primary server (optional) │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ release-locks │ MONITOR │ Monitor name │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-release-locks │ MONITOR │ Monitor name │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ fetch-cmd-result │ MONITOR │ Monitor name │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ cancel-cmd │ MONITOR │ Monitor name │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-cs-add-node │ MONITOR, STRING, STRING │ Monitor name, Hostname/IP of node to add to ColumnStore cluster, Timeout │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-cs-remove-node │ MONITOR, STRING, STRING │ Monitor name, Hostname/IP of node to remove from ColumnStore cluster, Timeout │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ cs-get-status │ MONITOR │ Monitor name │ 
├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-cs-get-status │ MONITOR │ Monitor name │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-cs-start-cluster │ MONITOR, STRING │ Monitor name, Timeout │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-cs-stop-cluster │ MONITOR, STRING │ Monitor name, Timeout │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-cs-set-readonly │ MONITOR, STRING │ Monitor name, Timeout │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-cs-set-readwrite │ MONITOR, STRING │ Monitor name, Timeout │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-rebuild-server │ MONITOR, SERVER, [SERVER] │ Monitor name, Target server, Source server (optional) │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-create-backup │ MONITOR, SERVER, STRING │ Monitor name, Source server, Backup name │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-restore-from-backup │ MONITOR, SERVER, STRING │ Monitor name, Target server, Backup name │ └───────────────────────────┴─────────────────────────────┴───────────────────────────────────────────────────────────────────────────────┘
Taxonomy upgrade extras: sharding, maxscale, schemarouter, load balancer, multi-tenant

Sharding with MariaDB MaxScale

Oli Sennhauser - Mon, 2024-03-18 16:08
Overview

This feature should work more or less with MariaDB MaxScale 6 (2.6), 6.1 to 6.4 as well as 22.08 (6.5), 23.02 (7.0) and 23.08 (7.1). We tested it with the latest MaxScale version 23.08.05, since we ran into problems with an older version.

shell> maxscale --version
MaxScale 23.08.5

As the database backend (shards) we used MariaDB 10.11.

Less than about 2% of all MariaDB installations known to us are what we technically understand as multi-tenant systems (each tenant (customer) in its own database, also called schema).

This MariaDB MaxScale feature is therefore used relatively rarely, and there is an increased risk of running into bugs that nobody has hit before!

In MariaDB MaxScale this feature is called SchemaRouter and is still declared as beta quality:

maxctrl> show module schemarouter
┌─────────────┬────────────────────────────────────────────────┐
│ Module      │ schemarouter                                   │
├─────────────┼────────────────────────────────────────────────┤
│ Type        │ Router                                         │
├─────────────┼────────────────────────────────────────────────┤
│ Version     │ V1.0.0                                         │
├─────────────┼────────────────────────────────────────────────┤
│ Maturity    │ Beta                                           │
├─────────────┼────────────────────────────────────────────────┤
│ Description │ A database sharding router for simple sharding │
├─────────────┼────────────────────────────────────────────────┤
│ ...

The target topology should look as follows: each customer (tenant) resides in its own database (= schema). The databases are distributed over several MariaDB instances (shards). So that the application can access the database transparently, a pair of MaxScale load balancers is placed in front of them; it knows where each customer resides and forwards the traffic to the corresponding shard. To make the MaxScale load balancers highly available, a virtual IP (VIP) is placed in front of them, e.g. by means of Keepalived. Whoever finds this still too simple can additionally lay out every single shard as a master/slave or Galera cluster construct...
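A minimal Keepalived sketch for such a VIP in front of the two MaxScale nodes could look as follows; the VIP 10.139.158.230, the interface name and the priorities are assumptions for illustration only:

#
# /etc/keepalived/keepalived.conf (on the first MaxScale node)
#
vrrp_instance VI_MAXSCALE {
    state MASTER                  # BACKUP on the second MaxScale node
    interface eth0                # assumption: adjust to the actual NIC
    virtual_router_id 51
    priority 100                  # lower value (e.g. 90) on the BACKUP node
    advert_int 1
    virtual_ipaddress {
        10.139.158.230/24         # hypothetical VIP the application connects to
    }
}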

Preparing the shards (MariaDB database instances)

First of all we had problems with the test database in this PoC. By dropping the test database on all shards the problem disappeared. Alternatively, you can run mariadb-secure-installation, which you should do on production systems anyway, or you use the MaxScale configuration parameters ignore_tables or ignore_tables_regex to allow identically named tables in different shards.

See also: MaxScale Router Parameters.
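As a hedged example, such an exception could be configured in the service section roughly as follows, assuming the regex is matched against the fully qualified db.table names the router finds during mapping; the pattern for a shared test schema is an assumption:

#
# /etc/maxscale.cnf
#
[Sharded-Service]
...
# Assumption: tolerate identically named objects of the test schema on all shards:
ignore_tables_regex = ^test\.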

Creating test data

To have something to play with, we created some test data:

-- On shard 1: 2 customers
SQL> CREATE DATABASE customer_0010;
SQL> CREATE TABLE customer_0010.address (id INT UNSIGNED, name VARCHAR(255));
SQL> CREATE TABLE customer_0010.sales (id INT UNSIGNED, product VARCHAR(255), sales TINYINT, amount DECIMAL(6, 2), total_amount DECIMAL(6, 2));
SQL> INSERT INTO customer_0010.address VALUES (1, 'Customer 10 GmbH');
SQL> INSERT INTO customer_0010.sales VALUES (1, 'Apples', 5, 1.2, 6), (2, 'Pears', 2, 0.9, 1.8), (3, 'Bread', 1, 2.5, 2.5);

SQL> CREATE DATABASE customer_0011;
SQL> CREATE TABLE customer_0011.address (id INT UNSIGNED, name VARCHAR(255));
SQL> CREATE TABLE customer_0011.sales (id INT UNSIGNED, product VARCHAR(255), sales TINYINT, amount DECIMAL(6, 2), total_amount DECIMAL(6, 2));
SQL> INSERT INTO customer_0011.address VALUES (1, 'Customer 11 SE');
SQL> INSERT INTO customer_0011.sales VALUES (1, 'Oranges', 2, 1.7, 3.4), (2, 'Salat', 5, 1.2, 6);

-- On shard 2: 3 customers
SQL> CREATE DATABASE customer_0020;
SQL> CREATE TABLE customer_0020.address (id INT UNSIGNED, name VARCHAR(255));
SQL> CREATE TABLE customer_0020.sales (id INT UNSIGNED, product VARCHAR(255), sales TINYINT, amount DECIMAL(6, 2), total_amount DECIMAL(6, 2));
SQL> INSERT INTO customer_0020.address VALUES (1, 'Customer 20 AG');
SQL> INSERT INTO customer_0020.sales VALUES (1, 'Oranges', 2, 1.7, 3.4), (2, 'Salat', 5, 1.2, 6);

SQL> CREATE DATABASE customer_0021;
SQL> CREATE TABLE customer_0021.address (id INT UNSIGNED, name VARCHAR(255));
SQL> CREATE TABLE customer_0021.sales (id INT UNSIGNED, product VARCHAR(255), sales TINYINT, amount DECIMAL(6, 2), total_amount DECIMAL(6, 2));
SQL> INSERT INTO customer_0021.address VALUES (1, 'Customer 21 GmbH');
SQL> INSERT INTO customer_0021.sales VALUES (1, 'Oranges', 2, 1.7, 3.4), (2, 'Salat', 5, 1.2, 6);

SQL> CREATE DATABASE customer_0022;
SQL> CREATE TABLE customer_0022.address (id INT UNSIGNED, name VARCHAR(255));
SQL> CREATE TABLE customer_0022.sales (id INT UNSIGNED, product VARCHAR(255), sales TINYINT, amount DECIMAL(6, 2), total_amount DECIMAL(6, 2));
SQL> INSERT INTO customer_0022.address VALUES (1, 'Customer 22 Gebr.');
SQL> INSERT INTO customer_0022.sales VALUES (1, 'Oranges', 2, 1.7, 3.4), (2, 'Salat', 5, 1.2, 6);

-- On shard 3: 1 customer
SQL> CREATE DATABASE customer_0030;
SQL> CREATE TABLE customer_0030.address (id INT UNSIGNED, name VARCHAR(255));
SQL> CREATE TABLE customer_0030.sales (id INT UNSIGNED, product VARCHAR(255), sales TINYINT, amount DECIMAL(6, 2), total_amount DECIMAL(6, 2));
SQL> INSERT INTO customer_0030.address VALUES (1, 'Customer 30 GmbH');
SQL> INSERT INTO customer_0030.sales VALUES (1, 'Pickles', 2, 2.2, 4.4), (2, 'Salat', 1, 3.1, 3.1), (3, 'Pudding', 5, 2.2, 11.0), (4, 'Asparagus', 12, .3, 3.6);
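If more tenants are needed than one wants to type by hand, a small loop can generate them. A sketch, assuming the hypothetical tenants customer_0023 to customer_0025 should land on shard 2 (port 3364):

shell> for i in $(seq -w 23 25) ; do
  mariadb --user=root --host=10.139.158.1 --port=3364 --execute="
    CREATE DATABASE customer_00${i};
    CREATE TABLE customer_00${i}.address (id INT UNSIGNED, name VARCHAR(255));
    CREATE TABLE customer_00${i}.sales (id INT UNSIGNED, product VARCHAR(255), sales TINYINT, amount DECIMAL(6, 2), total_amount DECIMAL(6, 2));
    INSERT INTO customer_00${i}.address VALUES (1, 'Customer ${i} Ltd.');"
done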
Creating roles and users

Since in a sharded system, in contrast to a Galera cluster for example, the individual database instances know nothing about each other and do not communicate with each other, we have to create the roles and users (accounts) individually on EACH shard.

MariaDB MaxScale needs one user each for the SchemaRouter service and for the monitor (on every shard).

The monitor user is, as the name says, responsible for monitoring; the SchemaRouter service user collects the user account information from the sharding backends and forwards the queries to the correct shard.

Since a redundant system typically works with at least two MaxScale routers and we wanted to prevent the privileges of the accounts from drifting apart, we work with roles, both for the MaxScale users and for the application users.

MaxScale Monitor User

SQL> CREATE ROLE maxscale_monitor_role;
SQL> GRANT SELECT ON mysql.user TO 'maxscale_monitor_role';
SQL> GRANT REPLICATION CLIENT ON *.* TO 'maxscale_monitor_role';
SQL> GRANT SLAVE MONITOR ON *.* TO 'maxscale_monitor_role';
SQL> GRANT FILE ON *.* TO 'maxscale_monitor_role';
SQL> GRANT CONNECTION ADMIN ON *.* TO 'maxscale_monitor_role';

SQL> SHOW GRANTS FOR maxscale_monitor_role;
+-----------------------------------------------------------------------------------------------+
| Grants for maxscale_monitor_role                                                               |
+-----------------------------------------------------------------------------------------------+
| GRANT FILE, BINLOG MONITOR, CONNECTION ADMIN, SLAVE MONITOR ON *.* TO `maxscale_monitor_role` |
| GRANT SELECT ON `mysql`.`user` TO `maxscale_monitor_role`                                     |
+-----------------------------------------------------------------------------------------------+

SQL> CREATE USER maxscale_monitor@'10.139.158.210' IDENTIFIED BY 'secret';
SQL> CREATE USER maxscale_monitor@'10.139.158.211' IDENTIFIED BY 'secret';
SQL> GRANT maxscale_monitor_role TO maxscale_monitor@'10.139.158.210';
SQL> GRANT maxscale_monitor_role TO maxscale_monitor@'10.139.158.211';
SQL> SET DEFAULT ROLE maxscale_monitor_role FOR maxscale_monitor@'10.139.158.210';
SQL> SET DEFAULT ROLE maxscale_monitor_role FOR maxscale_monitor@'10.139.158.211';

SQL> SELECT user, host, is_role, default_role FROM mysql.user WHERE user LIKE 'maxscale_monitor%';
+-----------------------+----------------+---------+-----------------------+
| User                  | Host           | is_role | default_role          |
+-----------------------+----------------+---------+-----------------------+
| maxscale_monitor_role |                | Y       |                       |
| maxscale_monitor      | 10.139.158.210 | N       | maxscale_monitor_role |
| maxscale_monitor      | 10.139.158.211 | N       | maxscale_monitor_role |
+-----------------------+----------------+---------+-----------------------+

SQL> SHOW GRANTS FOR maxscale_monitor@'10.139.158.211';
+------------------------------------------------------------------------------------------------------------------------------+
| Grants for maxscale_monitor@10.139.158.211                                                                                    |
+------------------------------------------------------------------------------------------------------------------------------+
| GRANT `maxscale_monitor_role` TO `maxscale_monitor`@`10.139.158.211`                                                          |
| GRANT USAGE ON *.* TO `maxscale_monitor`@`10.139.158.211` IDENTIFIED BY PASSWORD '*14E65567ABDB5135D0CFD9A70B3032C179A49EE7'  |
| SET DEFAULT ROLE `maxscale_monitor_role` FOR `maxscale_monitor`@`10.139.158.211`                                              |
+------------------------------------------------------------------------------------------------------------------------------+
MaxScale Admin User

SQL> CREATE ROLE maxscale_admin_role;
SQL> GRANT SHOW DATABASES ON *.* TO 'maxscale_admin_role';
SQL> GRANT SELECT ON mysql.user TO 'maxscale_admin_role';
SQL> GRANT SELECT ON mysql.db TO 'maxscale_admin_role';
SQL> GRANT SELECT ON mysql.tables_priv TO 'maxscale_admin_role';
SQL> GRANT SELECT ON mysql.columns_priv TO 'maxscale_admin_role';
SQL> GRANT SELECT ON mysql.proxies_priv TO 'maxscale_admin_role';
SQL> GRANT SELECT ON mysql.roles_mapping TO 'maxscale_admin_role';
SQL> GRANT SELECT ON mysql.procs_priv TO 'maxscale_admin_role';

SQL> SHOW GRANTS FOR maxscale_admin_role;
+------------------------------------------------------------------+
| Grants for maxscale_admin_role                                   |
+------------------------------------------------------------------+
| GRANT USAGE ON *.* TO `maxscale_admin_role`                      |
| GRANT SELECT ON `mysql`.`user` TO `maxscale_admin_role`          |
| GRANT SELECT ON `mysql`.`roles_mapping` TO `maxscale_admin_role` |
| GRANT SELECT ON `mysql`.`tables_priv` TO `maxscale_admin_role`   |
| GRANT SELECT ON `mysql`.`procs_priv` TO `maxscale_admin_role`    |
| GRANT SELECT ON `mysql`.`db` TO `maxscale_admin_role`            |
| GRANT SELECT ON `mysql`.`columns_priv` TO `maxscale_admin_role`  |
| GRANT SELECT ON `mysql`.`proxies_priv` TO `maxscale_admin_role`  |
+------------------------------------------------------------------+

SQL> CREATE USER maxscale_admin@'10.139.158.210' IDENTIFIED BY 'secret';
SQL> CREATE USER maxscale_admin@'10.139.158.211' IDENTIFIED BY 'secret';
SQL> GRANT maxscale_admin_role TO maxscale_admin@'10.139.158.210';
SQL> GRANT maxscale_admin_role TO maxscale_admin@'10.139.158.211';
SQL> SET DEFAULT ROLE maxscale_admin_role FOR maxscale_admin@'10.139.158.210';
SQL> SET DEFAULT ROLE maxscale_admin_role FOR maxscale_admin@'10.139.158.211';

SQL> SELECT user, host, is_role, default_role FROM mysql.user WHERE user LIKE 'maxscale_admin%';
+---------------------+----------------+---------+---------------------+
| User                | Host           | is_role | default_role        |
+---------------------+----------------+---------+---------------------+
| maxscale_admin_role |                | Y       |                     |
| maxscale_admin      | 10.139.158.210 | N       | maxscale_admin_role |
| maxscale_admin      | 10.139.158.211 | N       | maxscale_admin_role |
+---------------------+----------------+---------+---------------------+

SQL> SHOW GRANTS FOR maxscale_admin@'10.139.158.211';
+----------------------------------------------------------------------------------------------------------------------------+
| Grants for maxscale_admin@10.139.158.211                                                                                    |
+----------------------------------------------------------------------------------------------------------------------------+
| GRANT `maxscale_admin_role` TO `maxscale_admin`@`10.139.158.211`                                                            |
| GRANT USAGE ON *.* TO `maxscale_admin`@`10.139.158.211` IDENTIFIED BY PASSWORD '*14E65567ABDB5135D0CFD9A70B3032C179A49EE7'  |
| SET DEFAULT ROLE `maxscale_admin_role` FOR `maxscale_admin`@`10.139.158.211`                                                |
+----------------------------------------------------------------------------------------------------------------------------+

See also: SchemaRouter Configuration

Creating the application role and accounts

The application also needs a user, which we create on each shard as follows:

SQL> CREATE ROLE app_role;
SQL> GRANT SELECT, INSERT, UPDATE, DELETE ON `customer_%`.* TO 'app_role';
SQL> GRANT SHOW DATABASES ON *.* TO 'app_role';
SQL> GRANT CREATE, DROP, ALTER ON *.* TO 'app_role';   -- For creating new tenant databases

SQL> SHOW GRANTS FOR app_role;
+----------------------------------------------------------------------+
| Grants for app_role                                                  |
+----------------------------------------------------------------------+
| GRANT SHOW DATABASES ON *.* TO `app_role`                            |
| GRANT SELECT, INSERT, UPDATE, DELETE ON `customer_%`.* TO `app_role` |
+----------------------------------------------------------------------+

SQL> CREATE USER app@'10.139.158.%' IDENTIFIED BY 'secret';
SQL> GRANT app_role TO app@'10.139.158.%';
SQL> SET DEFAULT ROLE app_role FOR app@'10.139.158.%';

SQL> SELECT user, host, is_role, default_role FROM mysql.user WHERE user LIKE 'app%';
+----------+--------------+---------+--------------+
| User     | Host         | is_role | default_role |
+----------+--------------+---------+--------------+
| app_role |              | Y       |              |
| app      | 10.139.158.% | N       | app_role     |
+----------+--------------+---------+--------------+

SQL> SHOW GRANTS FOR app@'10.139.158.%';
+---------------------------------------------------------------------------------------------------------------+
| Grants for app@10.139.158.%                                                                                    |
+---------------------------------------------------------------------------------------------------------------+
| GRANT `app_role` TO `app`@`10.139.158.%`                                                                       |
| GRANT USAGE ON *.* TO `app`@`10.139.158.%` IDENTIFIED BY PASSWORD '*14E65567ABDB5135D0CFD9A70B3032C179A49EE7'  |
| SET DEFAULT ROLE `app_role` FOR `app`@`10.139.158.%`                                                           |
+---------------------------------------------------------------------------------------------------------------+
Proxy protocol

Load balancers and proxies have the characteristic that they replace the IP addresses of the clients with their own IP addresses. On the one hand this means that you can no longer see on the database where the client originally came from; on the other hand access privileges can no longer be granted per user and IP, since the check is always done against the IP of the load balancer.

Both problems can be solved by means of the proxy protocol.

For this, both the database and the load balancer, in this case MaxScale, must have the proxy protocol activated.

On the database side the proxy protocol is activated as follows:

#
# my.cnf
#
[mariadbd]
proxy_protocol_networks = ::1, 10.139.158.0/24, localhost

and on the MaxScale side with:

#
# /etc/maxscale.cnf
#
[shard<n>]
type = server
proxy_protocol = true

The two settings can be checked as follows:

SQL> SHOW GLOBAL VARIABLES LIKE 'proxy%';
+-------------------------+---------------------------------+
| Variable_name           | Value                           |
+-------------------------+---------------------------------+
| proxy_protocol_networks | ::1, 10.139.158.0/24, localhost |
+-------------------------+---------------------------------+

shell> maxctrl show server shard1 | grep proxy
│                     │     "proxy_protocol": true,                  │



MaxScale SchemaRouter configuration

Next we prepare the MaxScale configuration for sharding. The file recommended by MariaDB is /etc/maxscale.cnf. Whether it makes more sense to create a separate configuration file under /etc/maxscale.cnf.d/ or even to configure all of MaxScale dynamically (/var/lib/maxscale/maxscale.cnf.d/*.cnf) will become apparent in the long run. See also the warnings below. The configuration file for this sharding PoC looks as follows:

#
# /etc/maxscale.cnf
#
[maxscale]
threads = auto
admin_gui = false

[shard1]
type = server
address = 10.139.158.1
port = 3363
proxy_protocol = true

[shard2]
type = server
address = 10.139.158.1
port = 3364
proxy_protocol = true

[shard3]
type = server
address = 10.139.158.1
port = 3365
proxy_protocol = true

[Sharding-Monitor]
type = monitor
module = galeramon
servers = shard1,shard2,shard3
user = maxscale_monitor
password = secret
monitor_interval = 1s

[Sharded-Service-Listener]
type = listener
service = Sharded-Service
protocol = MariaDBClient
port = 3306

[Sharded-Service]
type = service
router = schemarouter
servers = shard1,shard2,shard3
user = maxscale_admin
password = secret
auth_all_servers = true

Note: recommendation of the MaxScale developer: "One workaround might be to actually use galeramon to monitor the nodes instead of mariadbmon."

Starting and stopping the MaxScale load balancer

MaxScale is started and stopped as usual via systemd:

shell> systemctl restart maxscale
shell> systemctl status maxscale
● maxscale.service - MariaDB MaxScale Database Proxy
     Loaded: loaded (/lib/systemd/system/maxscale.service; enabled; vendor preset: enabled)
    Drop-In: /run/systemd/system/service.d
             └─zzz-lxc-service.conf
     Active: active (running) since Tue 2024-02-27 09:52:57 UTC; 39s ago
    Process: 187 ExecStart=/usr/bin/maxscale (code=exited, status=0/SUCCESS)
   Main PID: 188 (maxscale)
      Tasks: 10 (limit: 18663)
     Memory: 4.6M
        CPU: 150ms
     CGroup: /system.slice/maxscale.service
             └─188 /usr/bin/maxscale

systemd[1]: Starting MariaDB MaxScale Database Proxy...
maxscale[188]: Module 'galeramon' loaded from '/usr/lib/x86_64-linux-gnu/maxscale/libgaleramon.so'.
maxscale[188]: Module 'schemarouter' loaded from '/usr/lib/x86_64-linux-gnu/maxscale/libschemarouter.so'.
maxscale[188]: Using up to 2.3GiB of memory for query classifier cache
systemd[1]: Started MariaDB MaxScale Database Proxy.

If there were errors or warnings, they can be seen in the MaxScale error log:

shell> grep -v notice /var/log/maxscale/maxscale.log
2024-02-13 16:47:22   MariaDB MaxScale is shut down.
----------------------------------------------------
MariaDB MaxScale  /var/log/maxscale/maxscale.log  Tue Feb 13 16:47:22 2024
----------------------------------------------------------------------------
2024-02-27 09:52:56   warning: Discarding journal file '/var/lib/maxscale/Sharding-Monitor_journal.json'. File is for module 'mariadbmon'. Current module is 'galeramon'.
2024-02-27 09:52:56   warning: [galeramon] Invalid 'wsrep_local_index' on server 'shard1': 18446744073709551615
Application tests

Simple application tests

shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 --execute='show databases'
+--------------------+
| Database           |
+--------------------+
| customer_0010      |
| customer_0011      |
| customer_0020      |
| customer_0021      |
| customer_0022      |
| customer_0030      |
| information_schema |
| mysql              |
| performance_schema |
| sys                |
+--------------------+
New command show shards

shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 customer_0030 --execute='show shards' | grep customer_00.* | sort | column -t
customer_0010.address  shard1
customer_0010.sales    shard1
customer_0010.         shard1
customer_0011.address  shard1
customer_0011.sales    shard1
customer_0011.         shard1
customer_0020.address  shard2
customer_0020.sales    shard2
customer_0020.         shard2
customer_0021.address  shard2
customer_0021.sales    shard2
customer_0021.         shard2
customer_0022.address  shard2
customer_0022.sales    shard2
customer_0022.         shard2
customer_0030.address  shard3
customer_0030.sales    shard3
customer_0030.         shard3

New databases are not shown immediately, but only once the cached data has been updated (refresh_interval, 300 s / 5 min).

See also: Custom SQL commands

More general tests

As a reminder:

Shard   Port   Customer          State
#1      3363   customer_001<n>   Running
#2      3364   customer_002<n>   Running
#3      3365   customer_003<n>   Running
#4      3366   customer_004<n>   Running
shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 --execute='SELECT @@port'
+--------+
| @@port |
+--------+
|   3363 |
+--------+

shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 --database=customer_0010 --execute='SELECT @@port'
+--------+
| @@port |
+--------+
|   3363 |
+--------+

shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 --database=customer_0020 --execute='SELECT @@port'
+--------+
| @@port |
+--------+
|   3364 |
+--------+

shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 --execute='use customer_0020; SELECT @@port'
+--------+
| @@port |
+--------+
|   3364 |
+--------+

shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 customer_0010 --execute='SELECT @@port'
+--------+
| @@port |
+--------+
|   3363 |
+--------+

shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 customer_0020 --execute='SELECT @@port'
+--------+
| @@port |
+--------+
|   3364 |
+--------+

shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 customer_0030 --execute='SELECT @@port'
+--------+
| @@port |
+--------+
|   3365 |
+--------+
Less simple (backup) tests

shell> mariadb-dump --user=app --password=secret --host=10.139.158.211 --port=3306 --single-transaction customer_0010 > /tmp/customer_0010.sql
shell> echo $?
0
shell> mariadb-dump --user=app --password=secret --host=10.139.158.211 --port=3306 --single-transaction customer_0020 > /tmp/customer_0020.sql
shell> echo $?
0
shell> mariadb-dump --user=app --password=secret --host=10.139.158.211 --port=3306 --single-transaction customer_0030 > /tmp/customer_0030.sql
shell> echo $?
0

shell> mariadb-dump --user=app --password=secret --host=10.139.158.211 --port=3306 --single-transaction --databases customer_0011 > /tmp/customer_0011.sql
shell> echo $?
0
shell> mariadb-dump --user=app --password=secret --host=10.139.158.211 --port=3306 --single-transaction --databases customer_0021 > /tmp/customer_0021.sql
shell> echo $?
0
shell> mariadb-dump --user=app --password=secret --host=10.139.158.211 --port=3306 --single-transaction --databases customer_0030 > /tmp/customer_0030.sql
shell> echo $?
0

shell> ll /tmp/customer_00*sql
-rw-rw-r-- 1 oli oli 2738 Mar 18 12:07 /tmp/customer_0010.sql
-rw-rw-r-- 1 oli oli 2904 Mar 18 12:08 /tmp/customer_0011.sql
-rw-rw-r-- 1 oli oli 2712 Mar 18 12:08 /tmp/customer_0020.sql
-rw-rw-r-- 1 oli oli 2906 Mar 18 12:08 /tmp/customer_0021.sql
-rw-rw-r-- 1 oli oli 2964 Mar 18 12:08 /tmp/customer_0030.sql

shell> tail -n 1 /tmp/customer_*.sql
==> /tmp/customer_0010.sql <==
-- Dump completed on 2024-02-13 14:39:21

==> /tmp/customer_0011.sql <==
-- Dump completed on 2024-02-13 14:39:35

==> /tmp/customer_0020.sql <==
-- Dump completed on 2024-02-13 14:40:15

==> /tmp/customer_0021.sql <==
-- Dump completed on 2024-02-13 14:40:42

==> /tmp/customer_0030.sql <==
-- Dump completed on 2024-02-13 14:40:52

shell> cat /tmp/customer_00*sql | grep -A1 -i insert
INSERT INTO `address` VALUES (1,'Customer 10 GmbH');
--
INSERT INTO `sales` VALUES (1,'Apples',5,1.20,6.00),
--
INSERT INTO `address` VALUES (1,'Customer 11 SE');
--
INSERT INTO `sales` VALUES (1,'Oranges',2,1.70,3.40),
--
INSERT INTO `address` VALUES (1,'Customer 20 AG');
--
INSERT INTO `sales` VALUES (1,'Oranges',2,1.70,3.40),
--
INSERT INTO `address` VALUES (1,'Customer 21 GmbH');
--
INSERT INTO `sales` VALUES (1,'Oranges',2,1.70,3.40),
--
INSERT INTO `address` VALUES (1,'Customer 30 GmbH');
--
INSERT INTO `sales` VALUES (1,'Pickles',2,2.20,4.40),

MaxScale 23.08.4 contained a pretty nasty bug: a return code of 0 but no data in the backup!!! See also the tickets MXS-4966: mariadb-dump gets an error dumping schemas and MXS-4947: Tables in information_schema are treated as normal tables. The symptoms of the bug look as follows:

Error: Couldn't read status information for table address ()
Error: Couldn't read status information for table sales ()

An upgrade to MaxScale 23.08.5 is therefore strongly recommended.

More complex application tests

We also created a somewhat more complex test (./sharding_test.php), which works through the following queries:

SET NAMES utf8mb4
SHOW DATABASES
use customer_<nnnn>
START TRANSACTION;
SELECT MIN(id) AS first, MAX(id) AS last FROM `sales`
INSERT INTO sales (id, product, sales, amount, total_amount) VALUES (%d, '%s', %f, %f, %f)
INSERT INTO sales (id, product, sales, amount, total_amount) VALUES (%d, '%s', %f, %f, %f)
UPDATE sales SET product = 'Prepare to delete' WHERE id = %d
DELETE FROM sales WHERE id = %d
COMMIT

This test ran through flawlessly. The corresponding control query:

SQL> SELECT * FROM customer_0021.sales WHERE id >= (SELECT MAX(id) - 10 FROM customer_0021.sales);
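For readers without the PHP harness, roughly the same flow can be replayed with the plain mariadb client. A sketch; the tenant customer_0021 and the id value 99991 are arbitrary assumptions:

shell> mariadb --user=app --password=secret --host=10.139.158.211 --port=3306 customer_0021 <<'_EOF'
SET NAMES utf8mb4;
START TRANSACTION;
SELECT MIN(id) AS first, MAX(id) AS last FROM `sales`;
INSERT INTO sales (id, product, sales, amount, total_amount) VALUES (99991, 'Apples', 3, 1.20, 3.60);
UPDATE sales SET product = 'Prepare to delete' WHERE id = 99991;
DELETE FROM sales WHERE id = 99991;
COMMIT;
_EOF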

Various load scenarios can additionally be tested with db_bench or the Acronis perfkit. For further information see here.

Cross-shard tests

One might possibly get the idea of running cross-shard queries. This will NOT work, which should not really come as a surprise, since firstly it is not quite trivial to implement and secondly it is documented here:

"Note: As the sharding solution in MaxScale is relatively simple, cross-database queries between two or more shards are not supported."

Source: Simple Sharding with Two Servers

and

"USE db1 is routed to the server with db1. If the database is divided to multiple servers, only one server will get the command."

Source: SchemaRouter

Here is a test with UNION:

SQL> use customer_0030
Database changed
SQL> SELECT * FROM customer_0020.sales UNION SELECT * FROM customer_0030.sales;
ERROR 1146 (42S02): Table 'customer_0020.sales' doesn't exist

And here the proof of the opposite:

SQL> use customer_0020
Database changed
SQL> SELECT * FROM customer_0020.sales UNION SELECT * FROM customer_0030.sales;
ERROR 1146 (42S02): Table 'customer_0030.sales' doesn't exist

And here the test with a JOIN:

SQL> use customer_0020
SQL> SELECT * FROM customer_0020.sales a JOIN customer_0030.sales b ON a.id = b.id WHERE a.sales > 1;
ERROR 1146 (42S02): Table 'customer_0030.sales' doesn't exist

SQL> use customer_0030
SQL> SELECT * FROM customer_0020.sales a JOIN customer_0030.sales b ON a.id = b.id WHERE a.sales > 1;
ERROR 1146 (42S02): Table 'customer_0020.sales' doesn't exist
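If such a cross-shard result set is really needed, it has to be assembled on the client side, bypassing the SchemaRouter, for example by querying the shards directly on their ports. A sketch, reusing the ports from the overview table above; sort -u emulates the duplicate elimination of UNION:

shell> ( mariadb --user=app --password=secret --host=10.139.158.1 --port=3364 --skip-column-names \
           --execute='SELECT * FROM customer_0020.sales' ; \
         mariadb --user=app --password=secret --host=10.139.158.1 --port=3365 --skip-column-names \
           --execute='SELECT * FROM customer_0030.sales' ) | sort -u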
Operating a MaxScale sharding system

In this chapter we discuss some points that can be useful for operating a MariaDB MaxScale sharding system.

Do-on-all-shards

Since it can happen time and again that O/S or database operations have to be executed on all shards, it would certainly make sense to create a script that executes the same command on all shards in turn:

shell> ./do-on-all-shards.sh --sql='SHOW DATABASES'

A script of this kind should greatly reduce the error rate in operation. Operations such as re-sharding a tenant, as described further below, are also best scripted and executed centrally.
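What follows is a minimal sketch of such a script, not the original one: the shard list (ports 3363 to 3366 on 10.139.158.1, as used in this set-up) and the app credentials are assumptions and have to be adapted:

shell> cat do-on-all-shards.sh
#!/bin/bash
# Minimal sketch: run the same SQL statement on all shards, one after the other.
# Shard list and credentials are assumptions taken from the examples in this article.

SQL="${1#--sql=}"   # accept exactly one argument of the form --sql='...'

for PORT in 3363 3364 3365 3366 ; do
  echo "=== shard at 10.139.158.1:${PORT} ==="
  mariadb --user=app --password=secret --host=10.139.158.1 --port=${PORT} --execute="${SQL}"
done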

Invalidating the database map cache

The database map cache of the MariaDB MaxScale SchemaRouter can be invalidated with the invalidate command. This gives us the possibility of quickly bringing the cache up to date again after adding or removing tenants.

shell> maxctrl call command schemarouter invalidate Sharded-Service
OK

In contrast to the invalidate command, where the entries are brought up to date after the next refresh_interval, the clear command deletes the entries and a remap is executed immediately.
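The clear command is called in the same way (it is also used in the examples further below):

shell> maxctrl call command schemarouter clear Sharded-Service
OK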

If you want to invalidate the database map cache remotely with a REST API call, this works as follows (the example uses the clear endpoint):

shell> curl -i -X POST -u api_admin:secret http://10.139.158.211:8989/v1/maxscale/modules/schemarouter/clear?Sharded-Service
HTTP/1.1 204 No Content
Connection: close
Date: Mon, 18 Mar 24 11:49:58 GMT
X-Frame-Options: Deny
X-XSS-Protection: 1
Referrer-Policy: same-origin
Cache-Control: no-cache

Sources:


How to change SchemaRouter variables dynamically?

The refresh_interval specifies the lifetime of the entries in the SchemaRouter database map cache. The default value is 300 s (5 min). Refresh interval is, in my opinion, an unfortunate term, as it does not define the interval between two mappings but the lifetime of the cache entries (lifetime? timeout?). As soon as an entry has been deleted, a new refresh of the "database map" is triggered on every shard. The query currently looks as follows:

SELECT LOWER(t.table_schema), LOWER(t.table_name) FROM information_schema.tables t
UNION ALL
SELECT LOWER(s.schema_name), '' FROM information_schema.schemata s
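To see what the SchemaRouter receives from a single shard, this mapping query can also be run manually against one of the shards (here shard2, with the connection parameters used elsewhere in this article):

shell> mariadb --user=app --password=secret --host=10.139.158.1 --port=3364 --execute="
  SELECT LOWER(t.table_schema), LOWER(t.table_name) FROM information_schema.tables t
  UNION ALL
  SELECT LOWER(s.schema_name), '' FROM information_schema.schemata s"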

As it appears, a simple connect is enough to trigger the refresh of the database map.

The current value of refresh_interval can be queried as follows:

shell> maxctrl show service Sharded-Service | grep refresh_interval | awk -F'│' '{ print $3 }'
 "refresh_interval": "300000ms",

The following command helps to change the value dynamically:

shell> MAXCTRL_WARNINGS=0 maxctrl alter service Sharded-Service refresh_interval=10s
OK

The value should not be set too small, as all other connections are halted during the mapping process.

Sources:


Adding and removing a tenant

Adding a new tenant to a shard is not a big problem:

SQL> CREATE DATABASE customer_0029;
SQL> use customer_0029
SQL> CREATE TABLE address LIKE customer_template.address;
SQL> CREATE TABLE sales LIKE customer_template.sales;

shell> maxctrl call command schemarouter invalidate Sharded-Service
OK

Removing a tenant from a shard, on the other hand, is somewhat more complicated and must be done in consultation with the application:

SQL> DROP DATABASE customer_0011;

shell> ./sharding_test.php
.....ERROR: Table 'customer_0011.sales' doesn't exist...ERROR: Unknown database 'customer_0011'.ERROR: Unknown database 'customer_0011'......ERROR: Unknown database 'customer_0011'...

shell> maxctrl call command schemarouter clear Sharded-Service
OK

At least I have not yet come up with a smarter way. See also Moving a tenant below.

Moving a tenant

The combination of adding and removing would then be moving a tenant from one shard to another shard, also called re-sharding. Here, too, a concerted action must be planned together with the application.

If this is not possible, at least the time during which the application receives errors can be reduced... The following procedure can be used to move a tenant from shard 2 to shard 3:

SQL> use customer_0020; LOCK TABLES address READ, sales READ;   -- On shard 2; the application may be blocked!

shell> mariadb-dump --user=app --password=secret --host=10.139.158.1 --port=3364 --single-transaction --skip-add-locks --databases customer_0020 | mariadb --user=app --password=secret --host=10.139.158.1 --port=3365   # Copy tenant 20 from shard 2 to shard 3.

SQL> DROP DATABASE customer_0020;   -- Deleting tenant 20 does not work yet!
ERROR 1192 (HY000): Can't execute the given command because you have active locked tables or an active transaction

SQL> UNLOCK TABLES; DROP DATABASE customer_0020;   -- This is how tenant 20 is deleted.

shell> maxctrl call command schemarouter clear Sharded-Service   # Update the MaxScale database map. Be quick!!!
Until the database map has been refreshed, the following errors can occur:

error : (47621) [schemarouter] (Sharded-Service); 'customer_0020.' found on servers 'shard2','shard3' for user 'app'@'10.139.158.1'.
error : (47621) [schemarouter] (Sharded-Service); 'customer_0020.address' found on servers 'shard2','shard3' for user 'app'@'10.139.158.1'.
error : (47621) [schemarouter] (Sharded-Service); 'customer_0020.sales' found on servers 'shard2','shard3' for user 'app'@'10.139.158.1'.
error : (47621) [schemarouter] (Sharded-Service); Duplicate tables found, closing session.

And on the application side to:

ERROR: Error: duplicate tables found on two different shards
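
Scripted, and under the assumption of the hosts, ports and credentials used above, the whole move could look roughly like this sketch (illustrative, not part of the original set-up):

shell> cat move-tenant.sh
#!/bin/bash
# Sketch: move tenant customer_0020 from shard 2 (port 3364) to shard 3 (port 3365).
# Hosts, ports and credentials are assumptions from the examples in this article.
# Note: the interactive LOCK TABLES step from the manual procedure is omitted here;
# the application must be quiesced instead.

# 1. Copy the tenant to the target shard.
mariadb-dump --user=app --password=secret --host=10.139.158.1 --port=3364 \
  --single-transaction --skip-add-locks --databases customer_0020 \
| mariadb --user=app --password=secret --host=10.139.158.1 --port=3365

# 2. Drop the tenant on the source shard.
mariadb --user=app --password=secret --host=10.139.158.1 --port=3364 \
  --execute='DROP DATABASE customer_0020'

# 3. Update the MaxScale database map as quickly as possible.
maxctrl call command schemarouter clear Sharded-Service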
Adding or removing a shard

Moving a tenant from one shard to another shard is the small re-sharding. It becomes somewhat more complex when you want to add new shards or remove old shards. Following that (after adding or before removing, respectively), a big re-sharding would take place. First, extending the cluster by one shard:

shell> maxctrl list servers
┌────────┬──────────────┬──────┬─────────────┬─────────┬───────────────┬──────────────────┐
│ Server │ Address      │ Port │ Connections │ State   │ GTID          │ Monitor          │
├────────┼──────────────┼──────┼─────────────┼─────────┼───────────────┼──────────────────┤
│ shard1 │ 10.139.158.1 │ 3363 │ 0           │ Running │ 0-3363-26014  │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼───────────────┼──────────────────┤
│ shard2 │ 10.139.158.1 │ 3364 │ 0           │ Running │ 0-3364-240612 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼───────────────┼──────────────────┤
│ shard3 │ 10.139.158.1 │ 3365 │ 0           │ Running │ 0-3365-289873 │ Sharding-Monitor │
└────────┴──────────────┴──────┴─────────────┴─────────┴───────────────┴──────────────────┘

The prepared shard is made known to MaxScale:

shell> maxctrl create server shard4 10.139.158.1 3366
OK

shell> maxctrl list servers
┌────────┬──────────────┬──────┬─────────────┬─────────┬──────────────┬──────────────────┐
│ Server │ Address      │ Port │ Connections │ State   │ GTID         │ Monitor          │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard1 │ 10.139.158.1 │ 3363 │ 1           │ Running │ 0-3363-23676 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard2 │ 10.139.158.1 │ 3364 │ 1           │ Running │ 0-3364-52321 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard3 │ 10.139.158.1 │ 3365 │ 1           │ Running │ 0-3365-39751 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard4 │ 10.139.158.1 │ 3366 │ 0           │ Down    │              │                  │
└────────┴──────────────┴──────┴─────────────┴─────────┴──────────────┴──────────────────┘

Then the new shard is linked to the MaxScale monitor and the service:

shell> MAXCTRL_WARNINGS=0 maxctrl link monitor Sharding-Monitor shard4
OK

shell> MAXCTRL_WARNINGS=0 maxctrl link service Sharded-Service shard4
OK

shell> maxctrl list servers
┌────────┬──────────────┬──────┬─────────────┬─────────┬──────────────┬──────────────────┐
│ Server │ Address      │ Port │ Connections │ State   │ GTID         │ Monitor          │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard1 │ 10.139.158.1 │ 3363 │ 1           │ Running │ 0-3363-24961 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard2 │ 10.139.158.1 │ 3364 │ 1           │ Running │ 0-3364-56215 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard3 │ 10.139.158.1 │ 3365 │ 1           │ Running │ 0-3365-45177 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard4 │ 10.139.158.1 │ 3366 │ 1           │ Running │ 0-3366-32    │ Sharding-Monitor │
└────────┴──────────────┴──────┴─────────────┴─────────┴──────────────┴──────────────────┘

Whether this second step is also strictly necessary was not investigated.

The whole process can be followed in the MariaDB MaxScale error log:

warning: Discarding journal file '/var/lib/maxscale/Sharding-Monitor_journal.json'. Servers described in the journal are different from the ones configured on the current monitor.
warning: Saving runtime modifications to 'Sharding-Monitor' in '/var/lib/maxscale/maxscale.cnf.d/Sharding-Monitor.cnf'. The modified values will override the values found in the static configuration files.
notice : shard4 sent version string '10.11.7-MariaDB-log'. Detected type: MariaDB, version: 10.11.7.
notice : Server 'shard4' charset: latin1_swedish_ci
notice : Server changed state: shard4[10.139.158.1:3366]: server_up. [Down] -> [Running]
warning: Saving runtime modifications to 'Sharded-Service' in '/var/lib/maxscale/maxscale.cnf.d/Sharded-Service.cnf'. The modified values will override the values found in the static configuration files.
notice : Added 'shard4' to 'Sharded-Service'

What we must not forget here is to also equip the new shard with the proxy protocol:

shell> maxctrl show server shard4 | grep proxy
│              │     "proxy_protocol": false,                 │

shell> MAXCTRL_WARNINGS=0 maxctrl alter server shard4 proxy_protocol=true
OK

And now new tenants can be added on the new shard, or old tenants can be moved to the new shard... In our set-up we want to move all tenants from shard 1 to shard 4 and also create a new tenant customer_0040 on shard 4. The individual steps required for this are listed above.

Once shard 1 has been emptied, it can be dismantled:

shell> maxctrl list servers
┌────────┬──────────────┬──────┬─────────────┬─────────┬──────────────┬──────────────────┐
│ Server │ Address      │ Port │ Connections │ State   │ GTID         │ Monitor          │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard1 │ 10.139.158.1 │ 3363 │ 1           │ Running │ 0-3363-25916 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard2 │ 10.139.158.1 │ 3364 │ 1           │ Running │ 0-3364-62887 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard3 │ 10.139.158.1 │ 3365 │ 1           │ Running │ 0-3365-54035 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard4 │ 10.139.158.1 │ 3366 │ 1           │ Running │ 0-3366-2247  │ Sharding-Monitor │
└────────┴──────────────┴──────┴─────────────┴─────────┴──────────────┴──────────────────┘

A shard is deleted with the destroy server command. But before this works, the shard must be removed from the monitor and from the service:

shell> MAXCTRL_WARNINGS=0 maxctrl unlink service Sharded-Service shard1
OK

shell> MAXCTRL_WARNINGS=0 maxctrl unlink monitor Sharding-Monitor shard1
OK

shell> maxctrl list servers
┌────────┬──────────────┬──────┬─────────────┬─────────┬──────────────┬──────────────────┐
│ Server │ Address      │ Port │ Connections │ State   │ GTID         │ Monitor          │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard1 │ 10.139.158.1 │ 3363 │ 0           │ Running │              │                  │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard2 │ 10.139.158.1 │ 3364 │ 1           │ Running │ 0-3364-64394 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard3 │ 10.139.158.1 │ 3365 │ 1           │ Running │ 0-3365-56072 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard4 │ 10.139.158.1 │ 3366 │ 1           │ Running │ 0-3366-3267  │ Sharding-Monitor │
└────────┴──────────────┴──────┴─────────────┴─────────┴──────────────┴──────────────────┘

Once the shard has been detached from the monitor and the service, it can then be deleted:

shell> maxctrl destroy server shard1
Warning: Object 'shard1' is defined in a static configuration file and cannot be permanently deleted. If MaxScale is restarted, the object will appear again.
To hide these warnings, run:

 export MAXCTRL_WARNINGS=0

OK

shell> maxctrl list servers
┌────────┬──────────────┬──────┬─────────────┬─────────┬──────────────┬──────────────────┐
│ Server │ Address      │ Port │ Connections │ State   │ GTID         │ Monitor          │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard2 │ 10.139.158.1 │ 3364 │ 1           │ Running │ 0-3364-65018 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard3 │ 10.139.158.1 │ 3365 │ 1           │ Running │ 0-3365-56886 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard4 │ 10.139.158.1 │ 3366 │ 1           │ Running │ 0-3366-3648  │ Sharding-Monitor │
└────────┴──────────────┴──────┴─────────────┴─────────┴──────────────┴──────────────────┘

And the changes can be nicely followed in the MaxScale error log:

notice : Removed 'shard1' from 'Sharded-Service'
warning: Discarding journal file '/var/lib/maxscale/Sharding-Monitor_journal.json'. Servers described in the journal are different from the ones configured on the current monitor.
notice : Destroyed server 'shard1' at 10.139.158.1:3363

Source:


Adjusting the configuration files

During the shard operations described above, we received some warnings:

Warning: Object 'shard1' is defined in a static configuration file and cannot be permanently deleted. If MaxScale is restarted, the object will appear again.

and

Warning: Saving runtime modifications to 'Sharding-Monitor' in '/var/lib/maxscale/maxscale.cnf.d/Sharding-Monitor.cnf'. The modified values will override the values found in the static configuration files.

The corresponding configuration files are created automatically by MaxScale during dynamic system changes:

shell> ll /var/lib/maxscale/maxscale.cnf.d/ /etc/maxscale.cnf
-rw-r--r-- 1 root root 612 Feb 13 14:23 /etc/maxscale.cnf

/var/lib/maxscale/maxscale.cnf.d/:
total 12
-rw------- 1 maxscale maxscale 187 Feb 13 16:08 Sharding-Monitor.cnf
-rw------- 1 maxscale maxscale 150 Feb 13 16:07 Sharded-Service.cnf
-rw------- 1 maxscale maxscale  52 Feb 13 15:46 shard4.cnf

shell> cat /var/lib/maxscale/maxscale.cnf.d/*
[Sharded-Service]
debug=true
refresh_interval=10000ms
auth_all_servers=true
log_debug=true
password=secret
router=schemarouter
type=service
user=maxscale_admin
targets=shard2,shard3,shard4

[Sharding-Monitor]
module=galeramon
monitor_interval=1000ms
password=secret
servers=shard2,shard3,shard4
type=monitor
user=maxscale_monitor

[shard4]
address=10.139.158.1
port=3366
type=server

Corresponding touch-ups must therefore still be made to the configuration files. In general, one should consider whether, in a highly dynamic system, everything should not be configured dynamically via commands...

Maintenance work on a shard

If a shard is to be taken offline for maintenance work, in this example shard2, this can be accomplished as follows:

shell> maxctrl list servers
┌────────┬──────────────┬──────┬─────────────┬─────────┬──────────────┬──────────────────┐
│ Server │ Address      │ Port │ Connections │ State   │ GTID         │ Monitor          │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard2 │ 10.139.158.1 │ 3364 │ 1           │ Running │ 0-3364-69817 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard3 │ 10.139.158.1 │ 3365 │ 1           │ Running │ 0-3365-63166 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼──────────────┼──────────────────┤
│ shard4 │ 10.139.158.1 │ 3366 │ 1           │ Running │ 0-3366-6902  │ Sharding-Monitor │
└────────┴──────────────┴──────┴─────────────┴─────────┴──────────────┴──────────────────┘

shell> maxctrl set server shard2 drain
OK

shell> maxctrl set server shard2 maintenance
OK

shell> maxctrl list servers
┌────────┬──────────────┬──────┬─────────────┬──────────────────────┬───────────────┬──────────────────┐
│ Server │ Address      │ Port │ Connections │ State                │ GTID          │ Monitor          │
├────────┼──────────────┼──────┼─────────────┼──────────────────────┼───────────────┼──────────────────┤
│ shard2 │ 10.139.158.1 │ 3364 │ 0           │ Maintenance, Running │ 0-3364-240612 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼──────────────────────┼───────────────┼──────────────────┤
│ shard3 │ 10.139.158.1 │ 3365 │ 0           │ Running              │ 0-3365-289873 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼──────────────────────┼───────────────┼──────────────────┤
│ shard4 │ 10.139.158.1 │ 3366 │ 0           │ Running              │ 0-3366-119848 │ Sharding-Monitor │
└────────┴──────────────┴──────┴─────────────┴──────────────────────┴───────────────┴──────────────────┘

At this point, the maintenance work on the machine or the database can be carried out...

Afterwards, BOTH states must be cleared again, provided both were set:

shell> maxctrl clear server shard2 maintenance
OK

shell> maxctrl clear server shard2 drain
OK

shell> maxctrl list servers
┌────────┬──────────────┬──────┬─────────────┬─────────┬───────────────┬──────────────────┐
│ Server │ Address      │ Port │ Connections │ State   │ GTID          │ Monitor          │
├────────┼──────────────┼──────┼─────────────┼─────────┼───────────────┼──────────────────┤
│ shard2 │ 10.139.158.1 │ 3364 │ 0           │ Running │ 0-3364-240612 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼───────────────┼──────────────────┤
│ shard3 │ 10.139.158.1 │ 3365 │ 0           │ Running │ 0-3365-289873 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼───────────────┼──────────────────┤
│ shard4 │ 10.139.158.1 │ 3366 │ 0           │ Running │ 0-3366-119848 │ Sharding-Monitor │
└────────┴──────────────┴──────┴─────────────┴─────────┴───────────────┴──────────────────┘

The difference between drain and maintenance is that with drain no new connections to the shard are allowed, while existing connections are waited on until they are closed. With maintenance, the connections are forcibly terminated immediately.
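When drain is used, a small watch loop can be helpful to wait until all connections to the shard have been closed before the actual maintenance starts. A sketch (an assumption, not part of the original procedure; field $5 corresponds to the Connections column of the maxctrl list servers output shown above):

shell> until [ "$(maxctrl list servers | awk -F'│' '/shard2/ { gsub(/ /, "", $5); print $5 }')" = "0" ]
do
  sleep 5
done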

Observing a MariaDB MaxScale sharding system

The state of the MariaDB MaxScale load balancer can be queried with the MaxScale CLI client maxctrl. There are numerous commands for this, mainly list and show:

shell> maxctrl show module schemarouter | head -n 12
┌─────────────┬────────────────────────────────────────────────┐
│ Module      │ schemarouter                                   │
├─────────────┼────────────────────────────────────────────────┤
│ Type        │ Router                                         │
├─────────────┼────────────────────────────────────────────────┤
│ Version     │ V1.0.0                                         │
├─────────────┼────────────────────────────────────────────────┤
│ Maturity    │ Beta                                           │
├─────────────┼────────────────────────────────────────────────┤
│ Description │ A database sharding router for simple sharding │
├─────────────┼────────────────────────────────────────────────┤
│ Parameters  │ ...                                            │

shell> maxctrl list servers
┌────────┬──────────────┬──────┬─────────────┬─────────┬───────────────┬──────────────────┐
│ Server │ Address      │ Port │ Connections │ State   │ GTID          │ Monitor          │
├────────┼──────────────┼──────┼─────────────┼─────────┼───────────────┼──────────────────┤
│ shard2 │ 10.139.158.1 │ 3364 │ 4           │ Running │ 0-3364-290859 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼───────────────┼──────────────────┤
│ shard3 │ 10.139.158.1 │ 3365 │ 4           │ Running │ 0-3365-322671 │ Sharding-Monitor │
├────────┼──────────────┼──────┼─────────────┼─────────┼───────────────┼──────────────────┤
│ shard4 │ 10.139.158.1 │ 3366 │ 4           │ Running │ 0-3366-140018 │ Sharding-Monitor │
└────────┴──────────────┴──────┴─────────────┴─────────┴───────────────┴──────────────────┘

The information in the Connections column is confusing, since with this sharding system we in this case actually only have 1, 1 and 2 connections open on the individual shards.

But if you look at the situation on the individual shard with SHOW PROCESSLIST, you can see that MaxScale opens a connection to EVERY shard for each incoming connection. So the display above is actually technically correct, just not what you would expect:

SQL> SHOW PROCESSLIST;
+--------+------------------+----------------------+---------------+---------+------+----------+-----------------------------------------------------------------+----------+
| Id     | User             | Host                 | db            | Command | Time | State    | Info                                                            | Progress |
+--------+------------------+----------------------+---------------+---------+------+----------+-----------------------------------------------------------------+----------+
| 123    | root             | localhost            | customer_0021 | Query   | 0    | starting | show processlist                                                | 0.000    |
| 68107  | maxscale_monitor | 10.139.158.211:35418 | NULL          | Sleep   | 0    |          | NULL                                                            | 0.000    |
| 113372 | app              | 10.139.158.1:47548   | NULL          | Sleep   | 47   |          | NULL                                                            | 0.000    |
| 113538 | app              | 10.139.158.1:49058   | NULL          | Sleep   | 41   |          | NULL                                                            | 0.000    |
| 113662 | app              | 10.139.158.1:47072   | NULL          | Sleep   | 37   |          | NULL                                                            | 0.000    |
| 114789 | app              | 10.139.158.1:39574   | customer_0022 | Query   | 0    | Updating | UPDATE sales SET product = 'Prepare to delete' WHERE id = 15622 | 0.000    |
+--------+------------------+----------------------+---------------+---------+------+----------+-----------------------------------------------------------------+----------+

This does not scale for large systems with hundreds or thousands of tenants! Possibly the MariaDB thread pool feature can be used in this case.

According to the MaxScale developer, this is desired/intended behaviour...
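How the thread pool mentioned above could be enabled on a shard is sketched here; these are standard MariaDB server options, and the size value is only an example, not part of the original set-up:

# my.cnf on the shard (sketch)
[mariadb]
thread_handling = pool-of-threads   # use the thread pool instead of one thread per connection
#thread_pool_size = 8               # defaults to the number of CPUs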

shell> maxctrl list services ┌─────────────────┬──────────────┬─────────────┬───────────────────┬────────────────────────┐ │ Service │ Router │ Connections │ Total Connections │ Targets │ ├─────────────────┼──────────────┼─────────────┼───────────────────┼────────────────────────┤ │ Sharded-Service │ schemarouter │ 4 │ 82776 │ shard2, shard3, shard4 │ └─────────────────┴──────────────┴─────────────┴───────────────────┴────────────────────────┘ shell> maxctrl list listeners ┌──────────────────────────┬──────┬──────┬─────────┬─────────────────┐ │ Name │ Port │ Host │ State │ Service │ ├──────────────────────────┼──────┼──────┼─────────┼─────────────────┤ │ Sharded-Service-Listener │ 3306 │ :: │ Running │ Sharded-Service │ └──────────────────────────┴──────┴──────┴─────────┴─────────────────┘ shell> maxctrl list monitors ┌──────────────────┬─────────┬────────────────────────┐ │ Monitor │ State │ Servers │ ├──────────────────┼─────────┼────────────────────────┤ │ Sharding-Monitor │ Running │ shard2, shard3, shard4 │ └──────────────────┴─────────┴────────────────────────┘ shell> maxctrl show server shard2 | head -n 20 ┌─────────────────────┬──────────────────────────────────────────────┐ │ Server │ shard2 │ ├─────────────────────┼──────────────────────────────────────────────┤ │ Source │ /etc/maxscale.cnf │ ├─────────────────────┼──────────────────────────────────────────────┤ │ Address │ 10.139.158.1 │ ├─────────────────────┼──────────────────────────────────────────────┤ │ Port │ 3364 │ ├─────────────────────┼──────────────────────────────────────────────┤ │ State │ Running │ ├─────────────────────┼──────────────────────────────────────────────┤ │ Version │ 10.11.7-MariaDB-log │ ├─────────────────────┼──────────────────────────────────────────────┤ │ Uptime │ 178960 │ ├─────────────────────┼──────────────────────────────────────────────┤ │ Last Event │ server_down │ ├─────────────────────┼──────────────────────────────────────────────┤ │ Triggered At │ Sun, 04 Feb 2024 07:37:17 GMT │ ├─────────────────────┼──────────────────────────────────────────────┤ │ Services │ Sharded-Service │ ├─────────────────────┼──────────────────────────────────────────────┤ │ Monitors │ Sharding-Monitor │ ├─────────────────────┼──────────────────────────────────────────────┤ ... 
├─────────────────────┼──────────────────────────────────────────────┤ │ Current Connections │ 5 │ ├─────────────────────┼──────────────────────────────────────────────┤ │ Total Connections │ 27 │ ├─────────────────────┼──────────────────────────────────────────────┤ │ Max Connections │ 5 │ shell> maxctrl show service Sharded-Service ┌─────────────────────┬──────────────────────────────────────────────────────┐ │ Service │ Sharded-Service │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Source │ /var/lib/maxscale/maxscale.cnf.d/Sharded-Service.cnf │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Router │ schemarouter │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ State │ Started │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Started At │ 3/18/2024, 1:52:30 PM │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Users Loaded At │ 3/18/2024, 1:52:30 PM │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Current Connections │ 4 │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Total Connections │ 84590 │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Max Connections │ 5 │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Cluster │ │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Servers │ shard2 │ │ │ shard3 │ │ │ shard4 │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Services │ │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Filters │ │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Parameters │ { │ │ │ "auth_all_servers": true, │ │ │ "connection_keepalive": "300000ms", │ │ │ "debug": true, │ │ │ "disable_sescmd_history": false, │ │ │ "enable_root_user": false, │ │ │ "force_connection_keepalive": false, │ │ │ "idle_session_pool_time": "-1ms", │ │ │ "ignore_tables": [], │ │ │ "ignore_tables_regex": null, │ │ │ "localhost_match_wildcard_host": true, │ │ │ "log_auth_warnings": true, │ │ │ "log_debug": true, │ │ │ "log_info": false, │ │ │ "log_notice": false, │ │ │ "log_warning": false, │ │ │ "max_connections": 0, │ │ │ "max_sescmd_history": 50, │ │ │ "max_staleness": "150000ms", │ │ │ "multiplex_timeout": "60000ms", │ │ │ "net_write_timeout": "0ms", │ │ │ "password": "*****", │ │ │ "prune_sescmd_history": true, │ │ │ "rank": "primary", │ │ │ "refresh_databases": false, │ │ │ "refresh_interval": "10000ms", │ │ │ "retain_last_statements": -1, │ │ │ "router": "schemarouter", │ │ │ "session_trace": false, │ │ │ "strip_db_esc": true, │ │ │ "type": "service", │ │ │ "user": "maxscale_admin", │ │ │ "user_accounts_file": null, │ │ │ "user_accounts_file_usage": "add_when_load_ok", │ │ │ "version_string": null, │ │ │ "wait_timeout": "0ms" │ │ │ } │ ├─────────────────────┼──────────────────────────────────────────────────────┤ │ Router Diagnostics │ { │ │ │ "average_session": 0.028822357131634554, │ │ │ "longest_sescmd_chain": 4, │ │ │ "longest_session": 50, │ │ │ "queries": 761134, │ │ │ "sescmd_percentage": 44.44342257736483, │ │ │ "shard_map_hits": 84356, │ │ │ "shard_map_misses": 5, │ │ │ "shard_map_stale": 229, │ │ │ "shard_map_updates": 216, │ │ │ "shortest_session": 0, │ │ │ "times_sescmd_limit_exceeded": 0 │ │ │ } │ 
└─────────────────────┴──────────────────────────────────────────────────────┘

On this, see also MaxScale SchemaRouter Router diagnostics.

shell> maxctrl show monitor Sharding-Monitor ┌─────────────────────┬──────────────────────────────────────────────────────────┐ │ Monitor │ Sharding-Monitor │ ├─────────────────────┼──────────────────────────────────────────────────────────┤ │ Source │ /etc/maxscale.cnf │ ├─────────────────────┼──────────────────────────────────────────────────────────┤ │ Module │ galeramon │ ├─────────────────────┼──────────────────────────────────────────────────────────┤ │ State │ Running │ ├─────────────────────┼──────────────────────────────────────────────────────────┤ │ Servers │ shard1 │ │ │ shard2 │ │ │ shard3 │ ├─────────────────────┼──────────────────────────────────────────────────────────┤ │ Parameters │ { │ │ │ "available_when_donor": false, │ │ │ "backend_connect_attempts": 1, │ │ │ "backend_connect_timeout": "3000ms", │ │ │ "backend_read_timeout": "3000ms", │ │ │ "backend_write_timeout": "3000ms", │ │ │ "disable_master_failback": false, │ │ │ "disable_master_role_setting": false, │ │ │ "disk_space_check_interval": "0ms", │ │ │ "disk_space_threshold": null, │ │ │ "events": "all,master_down,master_up,...,new_donor", │ │ │ "journal_max_age": "28800000ms", │ │ │ "module": "galeramon", │ │ │ "monitor_interval": "1000ms", │ │ │ "password": "*****", │ │ │ "root_node_as_master": false, │ │ │ "script": null, │ │ │ "script_timeout": "90000ms", │ │ │ "set_donor_nodes": false, │ │ │ "type": "monitor", │ │ │ "use_priority": false, │ │ │ "user": "maxscale_monitor" │ │ │ } │ ├─────────────────────┼──────────────────────────────────────────────────────────┤ │ Monitor Diagnostics │ { │ │ │ "disable_master_failback": false, │ │ │ "disable_master_role_setting": false, │ │ │ "root_node_as_master": false, │ │ │ "server_info": [ │ │ │ { │ │ │ "gtid_binlog_pos": "0-3363-26014", │ │ │ "gtid_current_pos": "0-3363-26014", │ │ │ "master_id": 0, │ │ │ "name": "shard1", │ │ │ "read_only": false, │ │ │ "server_id": 3363 │ │ │ }, │ │ │ { │ │ │ "gtid_binlog_pos": "0-3364-240612", │ │ │ "gtid_current_pos": "0-3364-240612", │ │ │ "master_id": 0, │ │ │ "name": "shard2", │ │ │ "read_only": false, │ │ │ "server_id": 3364 │ │ │ }, │ │ │ { │ │ │ "gtid_binlog_pos": "0-3365-289873", │ │ │ "gtid_current_pos": "0-3365-289873", │ │ │ "master_id": 0, │ │ │ "name": "shard3", │ │ │ "read_only": false, │ │ │ "server_id": 3365 │ │ │ } │ │ │ ], │ │ │ "set_donor_nodes": false, │ │ │ "use_priority": false │ │ │ } │ └─────────────────────┴──────────────────────────────────────────────────────────┘ shell> maxctrl list sessions; ┌───────┬──────┬──────────────┬───────────────────────┬───────┬─────────────────┬────────┬──────────────┐ │ Id │ User │ Host │ Connected │ Idle │ Service │ Memory │ I/O-Activity │ ├───────┼──────┼──────────────┼───────────────────────┼───────┼─────────────────┼────────┼──────────────┤ │ 87240 │ app │ 10.139.158.1 │ 3/18/2024, 2:33:54 PM │ 0 │ Sharded-Service │ 68644 │ 33 │ ├───────┼──────┼──────────────┼───────────────────────┼───────┼─────────────────┼────────┼──────────────┤ │ 72654 │ app │ 10.139.158.1 │ 3/18/2024, 2:25:27 PM │ 506.3 │ Sharded-Service │ 199328 │ 0 │ ├───────┼──────┼──────────────┼───────────────────────┼───────┼─────────────────┼────────┼──────────────┤ │ 72364 │ app │ 10.139.158.1 │ 3/18/2024, 2:25:18 PM │ 516 │ Sharded-Service │ 199328 │ 0 │ ├───────┼──────┼──────────────┼───────────────────────┼───────┼─────────────────┼────────┼──────────────┤ │ 72530 │ app │ 10.139.158.1 │ 3/18/2024, 2:25:23 PM │ 510.5 │ Sharded-Service │ 199328 │ 0 │ 
└───────┴──────┴──────────────┴───────────────────────┴───────┴─────────────────┴────────┴──────────────┘ shell> maxctrl show session 26 ┌───────────────────────┬───────────────────────────────────────┐ │ Id │ 26 │ ├───────────────────────┼───────────────────────────────────────┤ │ Service │ Sharded-Service │ ├───────────────────────┼───────────────────────────────────────┤ │ State │ Session started │ ├───────────────────────┼───────────────────────────────────────┤ │ User │ app │ ├───────────────────────┼───────────────────────────────────────┤ │ Host │ 10.139.158.1 │ ├───────────────────────┼───────────────────────────────────────┤ │ Port │ 42854 │ ├───────────────────────┼───────────────────────────────────────┤ │ Database │ │ ├───────────────────────┼───────────────────────────────────────┤ │ Connected │ 2/4/2024, 9:31:12 AM │ ├───────────────────────┼───────────────────────────────────────┤ │ Idle │ 610.4 │ ├───────────────────────┼───────────────────────────────────────┤ │ Parameters │ { │ │ │ "log_error": false, │ │ │ "log_info": false, │ │ │ "log_notice": false, │ │ │ "log_warning": false │ │ │ } │ ├───────────────────────┼───────────────────────────────────────┤ │ Client TLS Cipher │ │ ├───────────────────────┼───────────────────────────────────────┤ │ Connection attributes │ { │ │ │ "_client_name": "libmariadb", │ │ │ "_client_version": "3.3.8", │ │ │ "_os": "Linux", │ │ │ "_pid": "251037", │ │ │ "_platform": "x86_64", │ │ │ "_server_host": "10.139.158.211", │ │ │ "program_name": "mysql" │ │ │ } │ ├───────────────────────┼───────────────────────────────────────┤ │ Connections │ shard1 │ │ │ shard2 │ │ │ shard3 │ ├───────────────────────┼───────────────────────────────────────┤ │ Connection IDs │ 666 │ │ │ 139 │ │ │ 138 │ ├───────────────────────┼───────────────────────────────────────┤ │ Queries │ │ ├───────────────────────┼───────────────────────────────────────┤ │ Log │ │ ├───────────────────────┼───────────────────────────────────────┤ │ Memory │ { │ │ │ "connection_buffers": { │ │ │ "backends": { │ │ │ "shard1": { │ │ │ "misc": 678, │ │ │ "readq": 65536, │ │ │ "total": 66214, │ │ │ "writeq": 0 │ │ │ }, │ │ │ "shard2": { │ │ │ "misc": 662, │ │ │ "readq": 0, │ │ │ "total": 662, │ │ │ "writeq": 0 │ │ │ }, │ │ │ "shard3": { │ │ │ "misc": 678, │ │ │ "readq": 65536, │ │ │ "total": 66214, │ │ │ "writeq": 0 │ │ │ } │ │ │ }, │ │ │ "client": { │ │ │ "misc": 654, │ │ │ "readq": 65536, │ │ │ "total": 66190, │ │ │ "writeq": 0 │ │ │ }, │ │ │ "total": 199280 │ │ │ }, │ │ │ "exec_metadata": 0, │ │ │ "last_queries": 0, │ │ │ "sescmd_history": 48, │ │ │ "total": 199328, │ │ │ "variables": 0 │ │ │ } │ ├───────────────────────┼───────────────────────────────────────┤ │ I/O Activity │ 0 │ └───────────────────────┴───────────────────────────────────────┘ shell> maxctrl show listener Sharded-Service-Listener ┌────────────┬───────────────────────────────────────────┐ │ Name │ Sharded-Service-Listener │ ├────────────┼───────────────────────────────────────────┤ │ Source │ /etc/maxscale.cnf │ ├────────────┼───────────────────────────────────────────┤ │ Service │ Sharded-Service │ ├────────────┼───────────────────────────────────────────┤ │ Parameters │ { │ │ │ "MariaDBProtocol": { │ │ │ "allow_replication": true │ │ │ }, │ │ │ "address": "::", │ │ │ "authenticator": null, │ │ │ "authenticator_options": null, │ │ │ "connection_init_sql_file": null, │ │ │ "connection_metadata": [ │ │ │ "character_set_client=auto", │ │ │ "character_set_connection=auto", │ │ │ "character_set_results=auto", │ │ │ 
"max_allowed_packet=auto", │ │ │ "system_time_zone=auto", │ │ │ "time_zone=auto", │ │ │ "tx_isolation=auto" │ │ │ ], │ │ │ "port": 3306, │ │ │ "protocol": "MariaDBProtocol", │ │ │ "proxy_protocol_networks": null, │ │ │ "service": "Sharded-Service", │ │ │ "socket": null, │ │ │ "sql_mode": "default", │ │ │ "ssl": false, │ │ │ "ssl_ca": null, │ │ │ "ssl_cert": null, │ │ │ "ssl_cert_verify_depth": 9, │ │ │ "ssl_cipher": null, │ │ │ "ssl_crl": null, │ │ │ "ssl_key": null, │ │ │ "ssl_verify_peer_certificate": false, │ │ │ "ssl_verify_peer_host": false, │ │ │ "ssl_version": "MAX", │ │ │ "type": "listener", │ │ │ "user_mapping_file": null │ │ │ } │ └────────────┴───────────────────────────────────────────┘ shell> maxctrl show module schemarouter ┌─────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │ Module │ schemarouter │ ├─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Type │ Router │ ├─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Version │ V1.0.0 │ ├─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Maturity │ Beta │ ├─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Description │ A database sharding router for simple sharding │ ├─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Parameters │ [ │ │ │ { │ │ │ "default_value": false, │ │ │ "description": "Enable debug mode", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "debug", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": [], │ │ │ "description": "List of tables to ignore when checking for duplicates", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "ignore_tables", │ │ │ "type": "stringlist" │ │ │ }, │ │ │ { │ │ │ "description": "Regex of tables to ignore when checking for duplicates", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "ignore_tables_regex", │ │ │ "type": "regex" │ │ │ }, │ │ │ { │ │ │ "default_value": "150000ms", │ │ │ "description": "Maximum allowed staleness of a database map entry before clients block and wait for an update", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "max_staleness", │ │ │ "type": "duration", │ │ │ "unit": "ms" │ │ │ }, │ │ │ { │ │ │ "default_value": false, │ │ │ "description": "Refresh database mapping information", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "refresh_databases", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": "300000ms", │ │ │ "description": "How often to refresh the database mapping information", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "refresh_interval", │ │ │ "type": "duration", │ │ │ "unit": "ms" │ │ │ }, │ │ │ { │ │ │ "default_value": false, │ │ │ "description": "Retrieve users from all backend servers instead of only one", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "auth_all_servers", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": "300000ms", │ │ │ "description": "How ofted idle connections are pinged", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": 
"connection_keepalive", │ │ │ "type": "duration", │ │ │ "unit": "ms" │ │ │ }, │ │ │ { │ │ │ "deprecated": true, │ │ │ "description": "Alias for 'wait_timeout'", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "connection_timeout", │ │ │ "type": "duration" │ │ │ }, │ │ │ { │ │ │ "default_value": false, │ │ │ "description": "Disable session command history", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "disable_sescmd_history", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": false, │ │ │ "description": "Allow the root user to connect to this service", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "enable_root_user", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": false, │ │ │ "description": "Ping connections unconditionally", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "force_connection_keepalive", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": "-1ms", │ │ │ "description": "Put connections into pool after session has been idle for this long", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "idle_session_pool_time", │ │ │ "type": "duration", │ │ │ "unit": "ms" │ │ │ }, │ │ │ { │ │ │ "default_value": true, │ │ │ "description": "Match localhost to wildcard host", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "localhost_match_wildcard_host", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": true, │ │ │ "description": "Log a warning when client authentication fails", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "log_auth_warnings", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": false, │ │ │ "description": "Log debug messages for this service (debug builds only)", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "log_debug", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": false, │ │ │ "description": "Log info messages for this service", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "log_info", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": false, │ │ │ "description": "Log notice messages for this service", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "log_notice", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": false, │ │ │ "description": "Log warning messages for this service", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "log_warning", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": 0, │ │ │ "description": "Maximum number of connections", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "max_connections", │ │ │ "type": "count" │ │ │ }, │ │ │ { │ │ │ "default_value": 50, │ │ │ "description": "Session command history size", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "max_sescmd_history", │ │ │ "type": "count" │ │ │ }, │ │ │ { │ │ │ "default_value": "60000ms", │ │ │ "description": "How long a session can wait for a connection to become available", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "multiplex_timeout", │ │ │ "type": "duration", │ │ │ "unit": "ms" │ │ │ }, │ │ │ { │ │ │ "default_value": "0ms", │ │ │ "description": "Network write timeout", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "net_write_timeout", │ │ │ "type": "duration", │ │ │ "unit": "ms" │ │ │ }, │ │ │ { │ │ │ "description": "Password for the user used to retrieve database users", │ │ │ "mandatory": true, │ │ │ 
"modifiable": true, │ │ │ "name": "password", │ │ │ "type": "password" │ │ │ }, │ │ │ { │ │ │ "default_value": true, │ │ │ "description": "Prune old session command history if the limit is exceeded", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "prune_sescmd_history", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": "primary", │ │ │ "description": "Service rank", │ │ │ "enum_values": [ │ │ │ "primary", │ │ │ "secondary" │ │ │ ], │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "rank", │ │ │ "type": "enum" │ │ │ }, │ │ │ { │ │ │ "default_value": -1, │ │ │ "description": "Number of statements kept in memory", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "retain_last_statements", │ │ │ "type": "int" │ │ │ }, │ │ │ { │ │ │ "default_value": false, │ │ │ "description": "Enable session tracing for this service", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "session_trace", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": false, │ │ │ "deprecated": true, │ │ │ "description": "Track session state using server responses", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "session_track_trx_state", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "default_value": true, │ │ │ "deprecated": true, │ │ │ "description": "Strip escape characters from database names", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "strip_db_esc", │ │ │ "type": "bool" │ │ │ }, │ │ │ { │ │ │ "description": "Username used to retrieve database users", │ │ │ "mandatory": true, │ │ │ "modifiable": true, │ │ │ "name": "user", │ │ │ "type": "string" │ │ │ }, │ │ │ { │ │ │ "description": "Load additional users from a file", │ │ │ "mandatory": false, │ │ │ "modifiable": false, │ │ │ "name": "user_accounts_file", │ │ │ "type": "path" │ │ │ }, │ │ │ { │ │ │ "default_value": "add_when_load_ok", │ │ │ "description": "When and how the user accounts file is used", │ │ │ "enum_values": [ │ │ │ "add_when_load_ok", │ │ │ "file_only_always" │ │ │ ], │ │ │ "mandatory": false, │ │ │ "modifiable": false, │ │ │ "name": "user_accounts_file_usage", │ │ │ "type": "enum" │ │ │ }, │ │ │ { │ │ │ "description": "Custom version string to use", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "version_string", │ │ │ "type": "string" │ │ │ }, │ │ │ { │ │ │ "default_value": "0ms", │ │ │ "description": "Connection idle timeout", │ │ │ "mandatory": false, │ │ │ "modifiable": true, │ │ │ "name": "wait_timeout", │ │ │ "type": "duration", │ │ │ "unit": "ms" │ │ │ } │ │ │ ] │ ├─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Commands │ [ │ │ │ { │ │ │ "attributes": { │ │ │ "arg_max": 1, │ │ │ "arg_min": 1, │ │ │ "description": "Clear schemarouter shard map cache", │ │ │ "method": "POST", │ │ │ "parameters": [ │ │ │ { │ │ │ "description": "The schemarouter service", │ │ │ "required": true, │ │ │ "type": "SERVICE" │ │ │ } │ │ │ ] │ │ │ }, │ │ │ "id": "clear", │ │ │ "links": { │ │ │ "self": "http://127.0.0.1:8989/v1/modules/schemarouter/clear/" │ │ │ }, │ │ │ "type": "module_command" │ │ │ }, │ │ │ { │ │ │ "attributes": { │ │ │ "arg_max": 1, │ │ │ "arg_min": 1, │ │ │ "description": "Invalidate schemarouter shard map cache", │ │ │ "method": "POST", │ │ │ "parameters": [ │ │ │ { │ │ │ "description": "The schemarouter service", │ │ │ "required": true, │ │ │ "type": "SERVICE" │ │ │ } │ │ │ ] │ │ │ }, │ │ │ "id": "invalidate", │ │ │ 
"links": { │ │ │ "self": "http://127.0.0.1:8989/v1/modules/schemarouter/invalidate/" │ │ │ }, │ │ │ "type": "module_command" │ │ │ } │ │ │ ] │ └─────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘ shell> maxctrl show commands schemarouter ┌────────────┬────────────┬──────────────────────────┐ │ Command │ Parameters │ Descriptions │ ├────────────┼────────────┼──────────────────────────┤ │ clear │ SERVICE │ The schemarouter service │ ├────────────┼────────────┼──────────────────────────┤ │ invalidate │ SERVICE │ The schemarouter service │ └────────────┴────────────┴──────────────────────────┘ shell> maxctrl show dbusers Sharded-Service ┌───────────────────────┬────────────────┬───────────────────────┬───────┬───────┬────────┬───────┬──────┐ │ User │ Host │ Plugin │ TLS │ Super │ Global │ Proxy │ Role │ ├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤ │ PUBLIC │ │ │ false │ false │ false │ false │ │ ├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤ │ app │ 10.139.158.% │ mysql_native_password │ false │ false │ false │ false │ │ ├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤ │ app_role │ │ │ false │ false │ false │ false │ │ ├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤ │ mariadb.sys │ localhost │ mysql_native_password │ false │ false │ false │ false │ │ ├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤ │ maxscale_admin │ 10.139.158.210 │ mysql_native_password │ false │ false │ false │ false │ │ ├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤ │ maxscale_admin │ 10.139.158.211 │ mysql_native_password │ false │ false │ false │ false │ │ ├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤ │ maxscale_admin_role │ │ │ false │ false │ false │ false │ │ ├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤ │ maxscale_monitor │ 10.139.158.210 │ mysql_native_password │ false │ false │ false │ false │ │ ├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤ │ maxscale_monitor │ 10.139.158.211 │ mysql_native_password │ false │ false │ false │ false │ │ ├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤ │ maxscale_monitor_role │ │ │ false │ false │ false │ false │ │ ├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤ │ mysql │ localhost │ mysql_native_password │ false │ true │ true │ false │ │ ├───────────────────────┼────────────────┼───────────────────────┼───────┼───────┼────────┼───────┼──────┤ │ root │ localhost │ mysql_native_password │ false │ true │ true │ false │ │ └───────────────────────┴────────────────┴───────────────────────┴───────┴───────┴────────┴───────┴──────┘ shell> maxctrl show commands mariadbmon ┌───────────────────────────┬─────────────────────────────┬───────────────────────────────────────────────────────────────────────────────┐ │ Command │ Parameters │ Descriptions │ 
├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ switchover │ MONITOR, [SERVER], [SERVER] │ Monitor name, New primary (optional), Current primary (optional) │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ switchover-force │ MONITOR, [SERVER], [SERVER] │ Monitor name, New primary (optional), Current primary (optional) │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-switchover │ MONITOR, [SERVER], [SERVER] │ Monitor name, New primary (optional), Current primary (optional) │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ failover │ MONITOR │ Monitor name │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-failover │ MONITOR │ Monitor name │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ rejoin │ MONITOR, SERVER │ Monitor name, Joining server │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-rejoin │ MONITOR, SERVER │ Monitor name, Joining server │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ reset-replication │ MONITOR, [SERVER] │ Monitor name, Primary server (optional) │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-reset-replication │ MONITOR, [SERVER] │ Monitor name, Primary server (optional) │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ release-locks │ MONITOR │ Monitor name │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-release-locks │ MONITOR │ Monitor name │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ fetch-cmd-result │ MONITOR │ Monitor name │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ cancel-cmd │ MONITOR │ Monitor name │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-cs-add-node │ MONITOR, STRING, STRING │ Monitor name, Hostname/IP of node to add to ColumnStore cluster, Timeout │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-cs-remove-node │ MONITOR, STRING, STRING │ Monitor name, Hostname/IP of node to remove from ColumnStore cluster, Timeout │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ cs-get-status │ MONITOR │ Monitor name │ 
├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-cs-get-status │ MONITOR │ Monitor name │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-cs-start-cluster │ MONITOR, STRING │ Monitor name, Timeout │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-cs-stop-cluster │ MONITOR, STRING │ Monitor name, Timeout │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-cs-set-readonly │ MONITOR, STRING │ Monitor name, Timeout │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-cs-set-readwrite │ MONITOR, STRING │ Monitor name, Timeout │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-rebuild-server │ MONITOR, SERVER, [SERVER] │ Monitor name, Target server, Source server (optional) │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-create-backup │ MONITOR, SERVER, STRING │ Monitor name, Source server, Backup name │ ├───────────────────────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤ │ async-restore-from-backup │ MONITOR, SERVER, STRING │ Monitor name, Target server, Backup name │ └───────────────────────────┴─────────────────────────────┴───────────────────────────────────────────────────────────────────────────────┘
Literature / Sources
Taxonomy upgrade extras: sharding, maxscale, schemarouter, load balancer, multi-tenant

dbstat for MariaDB (and MySQL)

Shinguz - Thu, 2024-03-14 15:36

An idea that I have been thinking about for a long time and have now, thanks to a customer, finally tackled is dbstat for MariaDB/MySQL. The idea is based on sar/sysstat by Sebastien Godard:

sar - Collect, report, or save system activity information.

and Oracle Statspack:

Statspack is a performance tuning tool ... to quickly gather detailed analysis of the performance of that database instance.

Functionality of dbstat

Although we have had the performance schema for some time, it does not cover some points that we see as a problem in practice and that are requested by customers:

  • The table_size module collects data on the growth of tables. This allows statements to be made about the growth of individual tables, databases, future MariaDB Catalogs or the entire instance. This is interesting for users who are using multi-tenant systems or are otherwise struggling with uncontrolled growth.
  • The processlist module takes a snapshot of the process list at regular intervals and saves it. This information is useful for post-mortem analyses if the user was too slow to save his process list or to understand how a problem has built up.
  • The problem is often caused by long-running transactions, row locks or metadata locks. These are recorded and saved by the trx_and_lck and metadata_lock modules. This means that we can see problems that we did not even notice before or we can see what led to the problem after the accident (analogous to a tachograph in a vehicle).
  • Another question that we sometimes encounter in practice is: When was which database variable changed and what did it look like before? This is covered by the global_variables module. Unfortunately, it is not possible to find out who changed the variable or why. Operational processes are required for this.
  • The last module, global_status, actually covers what sar/sysstat does. It collects the values from SHOW GLOBAL STATUS and saves them for later analysis or simply for creating graphs.

How does dbstat work

dbstat uses the database's Event Scheduler as its scheduler. On MariaDB it must first be switched on (event_scheduler = ON); on MySQL it is already switched on by default. The Event Scheduler has the advantage that jobs can be scheduled at a finer granularity, for example every 10 s, which would not be possible with the crontab.

The Event Scheduler then executes SQL/PSM code to collect the data on the one hand and to delete the data on the other, so that the dbstat database does not grow immeasurably.
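
As a minimal sketch (illustrative only, not the actual dbstat code), a collect/purge event pair for the processlist module, following the schedule in the table below, could look like this:

SET GLOBAL event_scheduler = ON;    -- required on MariaDB, default on MySQL

-- Collect: snapshot the process list once per minute.
CREATE EVENT IF NOT EXISTS dbstat.processlist_collect
ON SCHEDULE EVERY 1 MINUTE
DO
  INSERT INTO dbstat.processlist
    (ts, connection_id, user, host, db, command, time, state, query)
  SELECT NOW(), ID, USER, HOST, DB, COMMAND, TIME, STATE, INFO
    FROM information_schema.PROCESSLIST;

-- Purge: delete at most 1000 rows per run that are older than 7 days.
CREATE EVENT IF NOT EXISTS dbstat.processlist_purge
ON SCHEDULE EVERY 1 MINUTE
DO
  DELETE FROM dbstat.processlist
   WHERE ts < NOW() - INTERVAL 7 DAY
   LIMIT 1000;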

The following jobs are currently planned:

+------------------+--------------+--------------------------+--------------------------------------+------------------------------------------------+
| Module           | Collect      | Delete                   | Quantity structure                   | Remarks                                        |
+------------------+--------------+--------------------------+--------------------------------------+------------------------------------------------+
| table_size       | 1/d at 02:04 | 12/h, 1000 rows, > 31 d  | 1000 tab × 31 d = 31k rows           | Should work up to 288k tables.                 |
| processlist      | 1/min        | 1/min, 1000 rows, > 7 d  | 1000 con × 1440 min × 7 d = 10M rows | Should work up to 1000 concurrent connections. |
| trx_and_lck      | 1/min        | 1/min, 1000 rows, > 7 d  | 100 lck × 1440 min × 7 d = 1M rows   | Depends very much on the application.          |
| metadata_lock    | 1/min        | 12/h, 1000 rows, > 30 d  | 100 mdl × 1440 × 30 d = 4M rows      | Depends very much on the application.          |
| global_variables | 1/min        | never                    | 1000 rows                            | Normally this table should not grow.           |
| global_status    | 1/min        | 1/min, 1000 rows, > 30 d | 1000 rows × 1440 × 30 d = 40M rows   | Can become large?                              |
+------------------+--------------+--------------------------+--------------------------------------+------------------------------------------------+
How to install dbstat

dbstat can be downloaded from GitHub and is licensed under the GPLv2.

The installation is simple: First execute the SQL file create_user_and_db.sql. Then execute the corresponding create_*.sql files for the respective modules in the dbstat database. There are currently no direct dependencies between the modules. If you want to use a different user or a different database than dbstat, you have to take care of this yourself.
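
Assuming the default dbstat user and database, the installation could look like the following sketch. The module file name create_processlist.sql is only an example of the create_*.sql pattern mentioned above:

shell> mariadb --user=root --password < create_user_and_db.sql
shell> mariadb --user=dbstat --password dbstat < create_processlist.sql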

Query dbstat

Some possible queries on the data have already been prepared. They can be found in the query_*.sql files. Here are a few examples:

table_size

SELECT `table_schema`, `table_name`, `ts`, `table_rows`, `data_length`, `index_length`
  FROM `table_size`
 WHERE `table_catalog` = 'def'
   AND `table_schema` = 'dbstat'
   AND `table_name` = 'table_size'
 ORDER BY `ts` ASC
;
+--------------+------------+---------------------+------------+-------------+--------------+
| table_schema | table_name | ts                  | table_rows | data_length | index_length |
+--------------+------------+---------------------+------------+-------------+--------------+
| dbstat       | table_size | 2024-03-09 20:01:00 |          0 |       16384 |        16384 |
| dbstat       | table_size | 2024-03-10 17:26:33 |        310 |       65536 |        16384 |
| dbstat       | table_size | 2024-03-11 08:28:12 |        622 |      114688 |        49152 |
| dbstat       | table_size | 2024-03-12 08:02:38 |        934 |      114688 |        49152 |
| dbstat       | table_size | 2024-03-13 08:08:55 |       1247 |      278528 |        81920 |
+--------------+------------+---------------------+------------+-------------+--------------+
processlist

SELECT connection_id, ts, time, state,
       SUBSTR(REGEXP_REPLACE(REPLACE(query, "\n", ' '), '\ +', ' '), 1, 64) AS query
  FROM processlist
 WHERE command != 'Sleep'
   AND connection_id = @connection_id
 ORDER BY ts ASC
 LIMIT 5
;
+---------------+---------------------+---------+---------------------------------+---------------------------------------------+
| connection_id | ts                  | time    | state                           | query                                       |
+---------------+---------------------+---------+---------------------------------+---------------------------------------------+
|         14956 | 2024-03-09 20:21:12 |  13.042 | Waiting for table metadata lock | update test set data = 'bla' where id = 100 |
|         14956 | 2024-03-09 20:22:12 |  73.045 | Waiting for table metadata lock | update test set data = 'bla' where id = 100 |
|         14956 | 2024-03-09 20:23:12 | 133.044 | Waiting for table metadata lock | update test set data = 'bla' where id = 100 |
|         14956 | 2024-03-09 20:24:12 | 193.044 | Waiting for table metadata lock | update test set data = 'bla' where id = 100 |
|         14956 | 2024-03-09 20:25:12 | 253.041 | Waiting for table metadata lock | update test set data = 'bla' where id = 100 |
+---------------+---------------------+---------+---------------------------------+---------------------------------------------+
trx_and_lck

SELECT * FROM trx_and_lck\G
*************************** 1. row ***************************
         machine_name:
        connection_id: 14815
               trx_id: 269766
                   ts: 2024-03-09 20:05:57
                 user: root
                 host: localhost
                   db: test
              command: Query
                 time: 41.000
        running_since: 2024-03-09 20:05:16
                state: Statistics
                 info: select * from test where id = 6 for update
            trx_state: LOCK WAIT
          trx_started: 2024-03-09 20:05:15
trx_requested_lock_id: 269766:821:5:7
    trx_tables_in_use: 1
    trx_tables_locked: 1
     trx_lock_structs: 2
      trx_rows_locked: 1
    trx_rows_modified: 0
            lock_mode: X
            lock_type: RECORD
    lock_table_schema: test
      lock_table_name: test
           lock_index: PRIMARY
           lock_space: 821
            lock_page: 5
             lock_rec: 7
            lock_data: 6
*************************** 2. row ***************************
         machine_name:
        connection_id: 14817
               trx_id: 269760
                   ts: 2024-03-09 20:05:57
                 user: root
                 host: localhost
                   db: test
              command: Sleep
                 time: 60.000
        running_since: 2024-03-09 20:04:57
                state:
                 info:
            trx_state: RUNNING
          trx_started: 2024-03-09 20:04:56
trx_requested_lock_id: NULL
    trx_tables_in_use: 0
    trx_tables_locked: 1
     trx_lock_structs: 2
      trx_rows_locked: 1
    trx_rows_modified: 1
            lock_mode: X
            lock_type: RECORD
    lock_table_schema: test
      lock_table_name: test
           lock_index: PRIMARY
           lock_space: 821
            lock_page: 5
             lock_rec: 7
            lock_data: 6
metadata_lock

SELECT lock_mode, ts, user, host, lock_type, table_schema, table_name, time, started, state, query
  FROM metadata_lock
 WHERE connection_id = 14347
 ORDER BY started DESC
 LIMIT 5
;
+-------------------------+---------------------+------+-----------+----------------------+--------------+------------+-------+---------------------+----------------+------------------------------------------------------+
| lock_mode               | ts                  | user | host      | lock_type            | table_schema | table_name | time  | started             | state          | query                                                |
+-------------------------+---------------------+------+-----------+----------------------+--------------+------------+-------+---------------------+----------------+------------------------------------------------------+
| MDL_SHARED_WRITE        | 2024-03-13 10:27:33 | root | localhost | Table metadata lock  | test         | test       | 1.000 | 2024-03-13 10:27:32 | Updating       | UPDATE test set data3 = MD5(id)                      |
| MDL_BACKUP_TRANS_DML    | 2024-03-13 10:27:33 | root | localhost | Backup lock          |              |            | 1.000 | 2024-03-13 10:27:32 | Updating       | UPDATE test set data3 = MD5(id)                      |
| MDL_BACKUP_ALTER_COPY   | 2024-03-13 10:22:33 | root | localhost | Backup lock          |              |            | 0.000 | 2024-03-13 10:22:33 | altering table | ALTER TABLE test DROP INDEX ts, ADD INDEX (ts, data) |
| MDL_SHARED_UPGRADABLE   | 2024-03-13 10:22:33 | root | localhost | Table metadata lock  | test         | test       | 0.000 | 2024-03-13 10:22:33 | altering table | ALTER TABLE test DROP INDEX ts, ADD INDEX (ts, data) |
| MDL_INTENTION_EXCLUSIVE | 2024-03-13 10:22:33 | root | localhost | Schema metadata lock | test         |            | 0.000 | 2024-03-13 10:22:33 | altering table | ALTER TABLE test DROP INDEX ts, ADD INDEX (ts, data) |
+-------------------------+---------------------+------+-----------+----------------------+--------------+------------+-------+---------------------+----------------+------------------------------------------------------+
global_variables

SELECT variable_name, COUNT(*) AS cnt
  FROM global_variables
 GROUP BY variable_name
HAVING COUNT(*) > 1
;
+-------------------------+-----+
| variable_name           | cnt |
+-------------------------+-----+
| innodb_buffer_pool_size |   7 |
+-------------------------+-----+

SELECT variable_name, ts, variable_value
  FROM global_variables
 WHERE variable_name = 'innodb_buffer_pool_size'
;
+-------------------------+---------------------+----------------+
| variable_name           | ts                  | variable_value |
+-------------------------+---------------------+----------------+
| innodb_buffer_pool_size | 2024-03-09 21:36:28 | 134217728      |
| innodb_buffer_pool_size | 2024-03-09 21:40:25 | 268435456      |
| innodb_buffer_pool_size | 2024-03-09 21:41:25 | 268435456      |
| innodb_buffer_pool_size | 2024-03-09 21:42:25 | 268435456      |
| innodb_buffer_pool_size | 2024-03-09 21:43:25 | 268435456      |
| innodb_buffer_pool_size | 2024-03-09 21:44:25 | 268435456      |
| innodb_buffer_pool_size | 2024-03-09 21:48:14 | 134217728      |
+-------------------------+---------------------+----------------+
global_status

SELECT s1.ts
     , s1.variable_value AS 'table_open_cache_misses'
     , s2.variable_value AS 'table_open_cache_hits'
  FROM global_status AS s1
  JOIN global_status AS s2 ON s1.ts = s2.ts
 WHERE s1.variable_name = 'table_open_cache_misses'
   AND s2.variable_name = 'table_open_cache_hits'
   AND s1.ts BETWEEN '2024-03-13 11:55:00' AND '2024-03-13 12:05:00'
 ORDER BY ts ASC
;
+---------------------+-------------------------+-----------------------+
| ts                  | table_open_cache_misses | table_open_cache_hits |
+---------------------+-------------------------+-----------------------+
| 2024-03-13 11:55:47 | 1001                    | 60711                 |
| 2024-03-13 11:56:47 | 1008                    | 61418                 |
| 2024-03-13 11:57:47 | 1015                    | 62125                 |
| 2024-03-13 11:58:47 | 1022                    | 62829                 |
| 2024-03-13 11:59:47 | 1029                    | 63533                 |
| 2024-03-13 12:00:47 | 1036                    | 64237                 |
| 2024-03-13 12:01:47 | 1043                    | 64944                 |
| 2024-03-13 12:02:47 | 1050                    | 65651                 |
| 2024-03-13 12:03:47 | 1057                    | 66355                 |
| 2024-03-13 12:04:47 | 1064                    | 67059                 |
+---------------------+-------------------------+-----------------------+
Testing

We have rolled out dbstat on our test and production systems to test it and to see whether our assumptions regarding stability and the quantity-structure calculations hold. Using it ourselves is also the best way to find out whether something is missing or the handling is impractical (eat your own dog food).

Taxonomy upgrade extras: performance, monitoring, performance monitoring, metadata lock, lock, locking, performance_schema

MariaDB/MySQL Environment MyEnv 2.1.0 has been released

Shinguz - Wed, 2024-02-28 17:22

FromDual has the pleasure to announce the release of the new version 2.1.0 of its popular MariaDB, Galera Cluster and MySQL multi-instance environment MyEnv.

The new MyEnv can be downloaded here. How to install MyEnv is described in the MyEnv Installation Guide.

In the inconceivable case that you find a bug in the MyEnv please report it to the FromDual bug tracker.

Any feedback, statements and testimonials are welcome as well! Please send them to feedback@fromdual.com.

Upgrade from 1.1.x to 2.0

Please look at the MyEnv 2.0.0 Release Notes.

Upgrade from 2.0.x to 2.1.0

shell> cd ${HOME}/product
shell> tar xf /download/myenv-2.1.0.tar.gz
shell> rm -f myenv
shell> ln -s myenv-2.1.0 myenv
Plug-ins

If you are using plug-ins for showMyEnvStatus create all the links in the new directory structure:

shell> cd ${HOME}/product/myenv
shell> ln -s ../../utl/oem_agent.php plg/showMyEnvStatus/
Upgrade of the instance directory structure

From MyEnv 1.0 to 2.0 the directory structure of instances has fundamentally changed. Nevertheless MyEnv 2.0 works fine with MyEnv 1.0 directory structures.

Changes in MyEnv 2.1.0

MyEnv
  • Removed hard coded parts for running MyEnv under O/S user mariadb.
  • Function substitute_path was refactored.
  • Branch guessing improved.
  • Warnings and errors are in color now.
  • MyEnv log file is now touched to avoid problems with O/S user root.
  • O/S user mysql removed in start/stop script.
  • Checks for DB start improved.
  • /var/run replaced by the more modern location /run.
  • Should now be completely MariaDB compatible (mariadbd vs. mysqld).
  • Wrapper mysqld_safe was extended to mariadbd-safe.
  • Replaced getVersionFromMysqld by getVersionAndBranchFromDaemon and extended functionality of this function.
  • LD_LIBRARY_PATH was set to the wrong directory.
  • Reverted commit fcc93c5 from v2.0.3 related to CDPATH; it broke commands like cd log or cd etc.
  • Database mysql_innodb_cluster_metadata is hidden now.
  • Database #innodb_redo is suppressed now as well for MySQL 8.0, and hideschema is not added to every new instance any more to not overwrite the default.
  • Bug while stopping instance with missing my.cnf fixed.
  • Function getDistribution cleaned-up.
  • MySQL should now also be detected correctly from Ubuntu repository.
  • Function my_exec rewritten.
  • Debian GNU/Linux tag added for distros.
  • Function extractBranch made better to work on Debian and Ubuntu with distribution packages.
  • Oracle Linux is considered as well now.
  • Made scripts ready for new MariaDB behaviour.
  • my.cnf template adapted to newest knowledge.
  • Directory changed from /tmp to /var/tmp, code cleaned-up and renewal, PID file code and message improved in stopInstance.
  • Distributions cleaned-up and cloudlinux, rocky linux and almalinux added as centos compatible distros.

MyEnv Installer
  • Debian 10 and 11 do not support PHP 8.0 yet, fixed.
  • Unit file is copied now correctly.
  • MyEnv instance installation can now be automated.
  • Instance creation automation added.
  • my.cnf template together with installMyenv should now work without errors or warnings for MariaDB 10.5 - 11.2 and MySQL 8.0 - 8.3.
  • Command yum replaced by dnf.
  • Command apt-get comments replaced by apt.

MyEnv Utilities
  • Client utility adapted in *monitor scripts.
  • InnoDB cluster monitor added.
  • wsrep_last_committed was added in galera_monitor.sh.
  • AWR added, sharding stuff added, lock and trx analysis scripts added.
  • Memory analysis added, NUMA maps output made ready for new variables.
  • connect_maxout utility added.

For subscriptions for commercial use of MyEnv please get in contact with us.

Taxonomy upgrade extras: MyEnv, multi-instance, virtualization, consolidation, SaaS, Operations, release, mysqld_multi

We build a data warehouse from the General Query Log

Shinguz - Wed, 2024-01-31 16:41

The design of a data warehouse differs from relational design. Data warehouses are often designed according to the concept of the star schema.

When building a data warehouse, you usually work backwards, starting from the questions you want answered:

  • What questions should my data warehouse be able to answer?
  • How do I have to design my model so that my questions can be answered easily?
  • Where do I get the data to populate the model?
  • How do I fill my model with the data?

For training purposes, we have investigated an issue that arises from time to time with our support team: a system suddenly and unexpectedly starts to behave unusually, nobody has done anything and nobody knows why. An example from a customer last week: the system starts to become unstable at 3 pm, is then hard-restarted and stabilises again from 4 pm...

The easiest thing to do in such a case would be to take a quick look at the database with SHOW PROCESSLIST, which often makes it immediately clear where the problem lies. But customers often forget this, or they are not fast enough. The General Query Log was already switched on for this customer, so this is a great case for our General Query Log data warehouse!
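
For reference, the General Query Log can be switched on at runtime as follows (be aware that it can grow very quickly on busy systems):

SET GLOBAL log_output  = 'FILE';
SET GLOBAL general_log = ON;
SHOW GLOBAL VARIABLES LIKE 'general_log%';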

What questions should my data warehouse be able to answer?

The generic question for this problem should be something like: "Who or what caused my system to behave abnormally?"

In technical terms, the question would be something like:

  • Who: Which user or account was on the database with how many connections at the time in question? What was unusual about it?
  • What: Which queries were running in which schema on the system at the time in question? Which of these queries were unusual?

What should my model look like?

We can already derive some facts and dimensions from the question:

  • User or account (user + host)
  • Time
  • Connections
  • Schema
  • Queries

And this results in the fact table query_fact and the dimension tables connection_dim, date_dim, time_dim, schema_dim and statement_dim:
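
Since the model diagram is not reproduced here, the following is a minimal sketch of the tables as they can be reconstructed from the columns used in the queries below; the data types and keys are assumptions, not the original DDL:

CREATE TABLE connection_dim (
  connection_id BIGINT UNSIGNED NOT NULL PRIMARY KEY,   -- connection id from the log
  user          VARCHAR(128)    NOT NULL,
  hostname      VARCHAR(255)    NOT NULL
);

CREATE TABLE date_dim (
  date_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  date    DATE         NOT NULL,
  UNIQUE KEY (date)
);

CREATE TABLE time_dim (
  time_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  time    TIME         NOT NULL,
  UNIQUE KEY (time)
);

CREATE TABLE schema_dim (
  schema_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  schema_name VARCHAR(64)  NOT NULL,
  UNIQUE KEY (schema_name)
);

CREATE TABLE statement_dim (
  statement_id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  command        VARCHAR(32)     NOT NULL,   -- e.g. 'Query'
  statement_text TEXT            NOT NULL    -- normalised statement
);

CREATE TABLE query_fact (
  connection_id BIGINT UNSIGNED NOT NULL,   -- FK to connection_dim
  date_id       INT UNSIGNED    NOT NULL,   -- FK to date_dim
  time_id       INT UNSIGNED    NOT NULL,   -- FK to time_dim
  schema_id     INT UNSIGNED    NOT NULL,   -- FK to schema_dim
  statement_id  BIGINT UNSIGNED NOT NULL,   -- FK to statement_dim
  KEY (date_id, time_id)
);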

Data source

Where the data comes from is relatively easy to answer in this case: the customer provides their General Query Logs, or we use the General Query Logs of our own systems for testing purposes.

How is the model populated?

Technically, this is known as an ETL process (Extract-Transform-Load). In our case, we have built a General Query Log parser that reads the General Query Log, prepares the data accordingly and saves it in the model.
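
The load step for a single parsed log line could be sketched in SQL as follows. It assumes the table sketch above with unique keys on the dimension attributes; the LAST_INSERT_ID() upsert pattern is an assumption about the implementation, not the actual parser code:

-- @ts, @connection_id, @schema_name, @command and @statement come from the parser.
INSERT INTO date_dim (date) VALUES (DATE(@ts))
    ON DUPLICATE KEY UPDATE date_id = LAST_INSERT_ID(date_id);
SET @date_id = LAST_INSERT_ID();

INSERT INTO time_dim (time) VALUES (TIME(@ts))
    ON DUPLICATE KEY UPDATE time_id = LAST_INSERT_ID(time_id);
SET @time_id = LAST_INSERT_ID();

INSERT INTO schema_dim (schema_name) VALUES (@schema_name)
    ON DUPLICATE KEY UPDATE schema_id = LAST_INSERT_ID(schema_id);
SET @schema_id = LAST_INSERT_ID();

-- Statement de-duplication is omitted for brevity.
INSERT INTO statement_dim (command, statement_text) VALUES (@command, @statement);
SET @statement_id = LAST_INSERT_ID();

INSERT INTO query_fact (connection_id, date_id, time_id, schema_id, statement_id)
VALUES (@connection_id, @date_id, @time_id, @schema_id, @statement_id);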

Checking the model

And then we come to checking the model. We used test data from one of our systems for this:

  • Which user was on the system at the time in question?
  • Which user had how many connections open at the time in question?

SELECT td.time, cd.user, COUNT(*) AS count
  FROM connection_dim cd
  JOIN query_fact AS qf ON qf.connection_id = cd.connection_id
  JOIN time_dim AS td ON td.time_id = qf.time_id
  JOIN date_dim AS dd ON dd.date_id = qf.date_id
 WHERE td.time BETWEEN '17:00' AND '18:30'
   AND dd.date = '2019-08-02'
 GROUP BY td.time, cd.user
 ORDER BY td.time ASC, cd.user
;
+----------+---------------+-------+
| time     | user          | count |
+----------+---------------+-------+
| 17:58:00 | UNKNOWN USER  |     1 |
| 17:59:00 | brman         |    58 |
| 17:59:00 | brman_catalog |    18 |
| 17:59:00 | root          |     5 |
| 18:00:00 | brman         |   296 |
| 18:00:00 | brman_catalog |     7 |
| 18:00:00 | root          |     3 |
| 18:01:00 | brman_catalog |    18 |
| 18:01:00 | root          |     3 |
| 18:06:00 | brman         |   266 |
| 18:06:00 | brman_catalog |     6 |
| 18:07:00 | brman         |    88 |
| 18:07:00 | brman_catalog |     7 |
| 18:10:00 | brman         |   211 |
| 18:10:00 | brman_catalog |    18 |
| 18:10:00 | root          |     4 |
| 18:11:00 | brman         |   141 |
| 18:11:00 | root          |     3 |
| 18:13:00 | brman         |     4 |
| 18:14:00 | brman         |   348 |
| 18:17:00 | brman         |   354 |
| 18:17:00 | brman_catalog |    12 |
| 18:17:00 | root          |     1 |
+----------+---------------+-------+
  • Which account was on the system at the time in question?
  • Which account had how many connections open at the time in question?

SELECT td.time, cd.user, cd.hostname, COUNT(*) AS count
  FROM connection_dim cd
  JOIN query_fact AS qf ON qf.connection_id = cd.connection_id
  JOIN time_dim AS td ON td.time_id = qf.time_id
  JOIN date_dim AS dd ON dd.date_id = qf.date_id
 WHERE td.time BETWEEN '17:00' AND '18:30'
   AND dd.date = '2019-08-02'
 GROUP BY td.time, cd.user, cd.hostname
 ORDER BY td.time ASC, cd.user
;
+----------+---------------+--------------+-------+
| time     | user          | hostname     | count |
+----------+---------------+--------------+-------+
| 17:58:00 | UNKNOWN USER  | UNKNOWN HOST |     1 |
| 17:59:00 | brman         | localhost    |    58 |
| 17:59:00 | brman_catalog | localhost    |    18 |
| 17:59:00 | root          | localhost    |     5 |
| 18:00:00 | brman         | localhost    |   296 |
| 18:00:00 | brman_catalog | localhost    |     7 |
| 18:00:00 | root          | localhost    |     3 |
| 18:01:00 | brman_catalog | localhost    |    18 |
| 18:01:00 | root          | localhost    |     3 |
| 18:06:00 | brman         | localhost    |   266 |
| 18:06:00 | brman_catalog | localhost    |     6 |
| 18:07:00 | brman         | localhost    |    88 |
| 18:07:00 | brman_catalog | localhost    |     7 |
| 18:10:00 | brman         | localhost    |   211 |
| 18:10:00 | brman_catalog | localhost    |    18 |
| 18:10:00 | root          | localhost    |     4 |
| 18:11:00 | brman         | localhost    |   141 |
| 18:11:00 | root          | localhost    |     3 |
| 18:13:00 | brman         | localhost    |     4 |
| 18:14:00 | brman         | localhost    |   348 |
| 18:17:00 | brman         | localhost    |   354 |
| 18:17:00 | brman_catalog | localhost    |    12 |
| 18:17:00 | root          | localhost    |     1 |
+----------+---------------+--------------+-------+
  • What was unusual about it?

SELECT cd.user, td.time, COUNT(*) AS count
  FROM connection_dim cd
  JOIN query_fact AS qf ON qf.connection_id = cd.connection_id
  JOIN time_dim AS td ON td.time_id = qf.time_id
  JOIN date_dim AS dd ON dd.date_id = qf.date_id
 WHERE td.time BETWEEN '17:00' AND '18:30'
   AND dd.date = '2019-08-02'
 GROUP BY td.time, cd.user
 ORDER BY cd.user ASC, td.time ASC
;
+---------------+----------+-------+
| user          | time     | count |
+---------------+----------+-------+
| brman         | 17:59:00 |    58 |
| brman         | 18:00:00 |   296 |
| brman         | 18:06:00 |   266 |
| brman         | 18:07:00 |    88 |
| brman         | 18:10:00 |   211 |
| brman         | 18:11:00 |   141 |
| brman         | 18:13:00 |     4 |
| brman         | 18:14:00 |   348 |
| brman         | 18:17:00 |   354 |
| brman_catalog | 17:59:00 |    18 |
| brman_catalog | 18:00:00 |     7 |
| brman_catalog | 18:01:00 |    18 |
| brman_catalog | 18:06:00 |     6 |
| brman_catalog | 18:07:00 |     7 |
| brman_catalog | 18:10:00 |    18 |
| brman_catalog | 18:17:00 |    12 |
| root          | 17:59:00 |     5 |
| root          | 18:00:00 |     3 |
| root          | 18:01:00 |     3 |
| root          | 18:10:00 |     4 |
| root          | 18:11:00 |     3 |
| root          | 18:17:00 |     1 |
| UNKNOWN USER  | 17:58:00 |     1 |
+---------------+----------+-------+

One could deduce here, for example, that the user brman had a relatively large number of open connections during the period in question. Whether this is unusual is hard to say: we either have too little data, or the observed time period is too short.
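
To make "unusual" measurable at all, one could, as a hedged sketch on top of this model, flag every minute in which a user's query count exceeds a multiple of that user's own average (the factor 3 is an arbitrary assumption):

WITH per_minute AS (
  SELECT cd.user, td.time, COUNT(*) AS cnt
    FROM connection_dim AS cd
    JOIN query_fact AS qf ON qf.connection_id = cd.connection_id
    JOIN time_dim AS td ON td.time_id = qf.time_id
   GROUP BY cd.user, td.time
)
-- Flag minutes that are far above the user's own average
SELECT pm.user, pm.time, pm.cnt
  FROM per_minute AS pm
 WHERE pm.cnt > 3 * (SELECT AVG(a.cnt) FROM per_minute AS a WHERE a.user = pm.user)
 ORDER BY pm.user, pm.time;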

  • Which queries were running on the system at the time in question and in which schema?
  • Which of these queries were unusual?

SELECT sd.schema_name, td.time, SUBSTR(std.statement_text, 1, 128) AS query
  FROM query_fact AS qf
  JOIN time_dim AS td ON td.time_id = qf.time_id
  JOIN schema_dim AS sd ON sd.schema_id = qf.schema_id
  JOIN statement_dim AS std ON std.statement_id = qf.statement_id
 WHERE td.time BETWEEN '17:00' AND '18:30'
   AND sd.schema_name = 'brman_catalog'
   AND std.command = 'Query'
 ORDER BY td.time, qf.statement_id
 LIMIT 10
;
+---------------+----------+----------------------------------------------------------------------------------------------------------------------------------+
| schema_name   | time     | query                                                                                                                            |
+---------------+----------+----------------------------------------------------------------------------------------------------------------------------------+
| brman_catalog | 17:59:00 | SET NAMES `utf8`                                                                                                                 |
| brman_catalog | 17:59:00 | SELECT COUNT ( * ) AS `cnt` FROM `information_schema` . `tables` WHERE `table_schema` = ? AND TABLE_NAME = ?                     |
| brman_catalog | 17:59:00 | SELECT COUNT ( * ) AS `cnt` FROM `information_schema` . `tables` WHERE `table_schema` = ? AND TABLE_NAME = ?                     |
| brman_catalog | 17:59:00 | CREATE TABLE `metadata` ( `id` TINYINT UNSIGNED NOT NULL AUTO_INCREMENT , `key` VARCHARACTER (?) NOT NULL , `value` VARCHARACTER |
| brman_catalog | 17:59:00 | INSERT INTO `metadata` ( `key` , `value` ) VALUES (...)                                                                         |
| brman_catalog | 17:59:00 | INSERT INTO `metadata` ( `key` , `value` ) VALUES (...)                                                                         |
| brman_catalog | 17:59:00 | CREATE TABLE `backups` ( `id` INTEGER UNSIGNED NOT NULL AUTO_INCREMENT , `instance_name` VARCHARACTER (?) NOT NULL , `start_ts`  |
| brman_catalog | 17:59:00 | CREATE TABLE `backup_details` ( `backup_id` INTEGER UNSIGNED NOT NULL , `hostname` VARCHARACTER (?) NULL , `binlog_file` VARCHAR |
| brman_catalog | 17:59:00 | CREATE TABLE `files` ( `id` INTEGER UNSIGNED NOT NULL AUTO_INCREMENT , `schema_name` VARCHARACTER (?) NULL , `original_name` VAR |
| brman_catalog | 17:59:00 | CREATE TABLE `binary_logs` ( `id` INTEGER UNSIGNED NOT NULL AUTO_INCREMENT , `filename` VARCHARACTER (?) NOT NULL , `begin_ts` I |
+---------------+----------+----------------------------------------------------------------------------------------------------------------------------------+
Suggestions for improvement

Based on this first iteration of the model, you can already see which questions the model cannot yet answer, or where it is still too imprecise. This can then be improved in a second iteration...

Examples of this are:

  • The granularity of the time dimension, at one-minute accuracy, may be too coarse. Would it make more sense to use seconds (see the sketch after this list)?
  • The question of how long a connection was open is not so easy to answer. Perhaps a further fact table would be appropriate here? For now, the duration has to be pieced together from the Connect and Quit events:

SELECT cd.connection_number, cd.user, cd.hostname
     , tdf.time AS time_from, tdt.time AS time_to
     , (UNIX_TIMESTAMP(tdt.time) - UNIX_TIMESTAMP(tdf.time)) AS duration
  FROM connection_dim AS cd
  JOIN query_fact AS qf1 ON cd.connection_id = qf1.connection_id
  JOIN time_dim AS tdf ON tdf.time_id = qf1.time_id
  JOIN statement_dim AS sdf ON sdf.statement_id = qf1.statement_id
  JOIN query_fact AS qf2 ON cd.connection_id = qf2.connection_id
  JOIN time_dim AS tdt ON tdt.time_id = qf2.time_id
  JOIN statement_dim AS sdt ON sdt.statement_id = qf2.statement_id
 WHERE tdf.time BETWEEN '17:00' AND '18:30'
   AND sdf.command = 'Connect'
   AND sdt.command = 'Quit'
   AND (UNIX_TIMESTAMP(tdt.time) - UNIX_TIMESTAMP(tdf.time)) > 0
 ORDER BY tdf.time
;
+-------------------+-------+-----------+-----------+----------+----------+
| connection_number | user  | hostname  | time_from | time_to  | duration |
+-------------------+-------+-----------+-----------+----------+----------+
|               211 | brman | localhost | 17:59:00  | 18:00:00 |       60 |
|               215 | root  | localhost | 18:00:00  | 18:17:00 |     1020 |
|               219 | brman | localhost | 18:06:00  | 18:07:00 |       60 |
|               225 | brman | localhost | 18:10:00  | 18:11:00 |       60 |
|               226 | brman | localhost | 18:13:00  | 18:14:00 |       60 |
+-------------------+-------+-----------+-----------+----------+----------+
  • Of course, it would also be exciting to let an AI loose on the problem. How do you train it correctly, and does it actually find the problem once it has been trained?
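
For the first point, the change itself would be small; the price is only a larger time dimension (86400 rows for one row per second instead of 1440 for one per minute). A sketch, assuming the time_dim layout from the DDL sketch above and MariaDB's Sequence storage engine:

-- One row per second of the day instead of one per minute
-- (seq_0_to_86399 is a virtual table provided by the Sequence engine).
INSERT INTO time_dim (time)
SELECT SEC_TO_TIME(seq) FROM seq_0_to_86399;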

So much for this little exercise in building a data warehouse...

Taxonomy upgrade extras: data warehouse, general query log
