MDEV-23888: Potential server hang on replication with InnoDB
In MDEV-21452, SAFE_MUTEX flagged an ordering problem that involved trx_t::mutex, LOCK_global_system_variables, and LOCK_commit_ordered when running ./mtr --no-reorder\ binlog.binlog_checksum,mix binlog.binlog_commit_wait,mix Because LOCK_commit_ordered is acquired by replication code before innobase_commit_ordered() is invoked, and because LOCK_commit_ordered should be below LOCK_global_system_variables in the global latching order, it turns out that we must avoid acquiring LOCK_global_system_variables in any low-level code. It also turns out that lock_rec_lock() acquires lock_sys_t::mutex and then carries on to call lock_rec_enqueue_waiting(), which may invoke THDVAR() via thd_lock_wait_timeout(). This call is problematic if THDVAR() had never been invoked in that thread earlier. innobase_trx_init(): Let us invoke THDVAR() at the start of an InnoDB transaction so that future invocations of THDVAR() will avoid LOCK_global_system_variables acquisition on the same THD. Because the first call to intern_sys_var_ptr() will initialize all session variables by not passing the offset to sync_dynamic_session_variables(), this will indeed make any future THDVAR() invocation mutex-free. There are some THDVAR() calls in other code (related to indexed virtual columns, fulltext indexes, and DDL operations). No SAFE_MUTEX warning was known for those, but there does not appear to be any replication test coverage for indexed virtual columns or fulltext indexes. DDL should be covered, and perhaps DDL code paths were already invoking THDVAR() while not holding any InnoDB mutex. Side note: MySQL should avoid this type of deadlocks since mysql/mysql-server@4d275c89954685e2ed1b368812b3b5a29ddf9389. MariaDB never defined alloc_and_copy_thd_dynamic_variables(), because we prefer to avoid overhead during connection creation. An important part of the deadlock could be the current handling of SET GLOBAL binlog_checksum=NONE; and similar assignments. In binlog_checksum_update(), we would hold LOCK_global_system_variables while potentially acquiring LOCK_commit_ordered in MYSQL_BIN_LOG::open(). Even if that code was changed later to release LOCK_global_system_variables during the write to mysql_bin_log, it could be a good idea for performance to avoid invoking the expensive code path of THDVAR() while holding any InnoDB mutexes, such as lock_sys.mutex in lock_rec_enqueue_waiting(). Thanks to Andrei Elkin for debugging the SAFE_MUTEX issue, and to Sergei Golubchik for the suggestion to invoke THDVAR() early.
Showing
Please register or sign in to comment