MDEV-28897 Wrong table.get_ref_count() upon concurrent truncate and backup stage operation

The issue was that flush_tables() didn't take an MDL lock on the cached TABLE_SHARE before calling open_table() to do a HA_EXTRA_FLUSH call. Most engines seem to have no issue with this, but apparently it conflicts with InnoDB in 10.6 when using TRUNCATE.

Fixed by taking an MDL lock before trying to open the table in flush_tables().

There is no test case, as it is hard to repeat the scheduling that causes the error. I did run the test case in MDEV-28897 to verify that the bug is fixed.

5e40934d · Monty · 02a313dc · 5e40934d
Commit 5e40934d authored Jun 28, 2022 by Monty

Showing with 34 additions and 15 deletions

sql/sql_base.cc +34 -15

--- a/sql/sql_base.cc
+++ b/sql/sql_base.cc
@@ -515,7 +515,7 @@ class flush_tables_error_handler : public Internal_error_handler
                        Sql_condition ** cond_hdl)
  {
    *cond_hdl= NULL;
-    if (sql_errno == ER_OPEN_AS_READONLY)
+    if (sql_errno == ER_OPEN_AS_READONLY || sql_errno == ER_LOCK_WAIT_TIMEOUT)
    {
      handled_errors++;
      return TRUE;
@@ -598,6 +598,23 @@ bool flush_tables(THD *thd, flush_tables_type flag)
      tc_release_table(table);
    }
    else
+    {
+      /*
+        No free TABLE instances available. We have to open a new one.
+
+        Try to take an MDL lock to ensure we can open a new table instance.
+        If the lock fails, it means that some DDL operation or flush tables
+        with read lock is ongoing.
+        In this case we cannot send the HA_EXTRA_FLUSH signal.
+      */
+
+      MDL_request mdl_request;
+      MDL_REQUEST_INIT(&mdl_request, MDL_key::TABLE,
+                       share->db.str,
+                       share->table_name.str,
+                       MDL_SHARED, MDL_EXPLICIT);
+
+      if (!thd->mdl_context.acquire_lock(&mdl_request, 0))
      {
        /*
          HA_OPEN_FOR_FLUSH is used to allow us to open the table even if
@@ -619,6 +636,8 @@ bool flush_tables(THD *thd, flush_tables_type flag)
          */
          closefrm(tmp_table);
        }
+        thd->mdl_context.release_lock(mdl_request.ticket);
+      }
    }
    tdc_release_share(share);
  }
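
The pattern in the hunk above (try to take a shared lock without waiting, do the work only if the lock was granted, then release it) can be sketched outside the server. The following is a minimal, self-contained illustration and not MariaDB code: ToyMetadataLock, ToyShare and flush_if_unlocked are hypothetical names, and a single atomic counter stands in for the real MDL subsystem.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical toy lock, NOT the MariaDB MDL implementation.
// state_ == -1 : held exclusively (analogue of a running DDL statement)
// state_ >=  0 : number of shared holders
class ToyMetadataLock {
 public:
  // Non-blocking shared acquire, analogous to acquire_lock() with timeout 0.
  bool try_lock_shared() {
    int cur = state_.load();
    while (cur >= 0) {
      // On failure compare_exchange_weak reloads cur, so the loop re-checks it.
      if (state_.compare_exchange_weak(cur, cur + 1)) return true;
    }
    return false;  // an exclusive holder is active
  }

  void unlock_shared() {
    assert(state_.load() > 0);  // must hold a shared lock to release one
    state_.fetch_sub(1);
  }

  // Non-blocking exclusive acquire; succeeds only when nobody holds the lock.
  bool try_lock_exclusive() {
    int expected = 0;
    return state_.compare_exchange_strong(expected, -1);
  }

  void unlock_exclusive() { state_.store(0); }

 private:
  std::atomic<int> state_{0};
};

// Hypothetical stand-in for a cached table share.
struct ToyShare {
  ToyMetadataLock mdl;
  int flush_count = 0;  // how often the placeholder "flush" ran
  int skipped = 0;      // how often the flush was skipped (cf. handled_errors)
};

// Flush only if the shared lock can be taken immediately; otherwise skip
// the share instead of waiting for the exclusive holder.
bool flush_if_unlocked(ToyShare &share) {
  if (!share.mdl.try_lock_shared()) {
    ++share.skipped;
    return false;
  }
  ++share.flush_count;  // placeholder for open + HA_EXTRA_FLUSH + close
  share.mdl.unlock_shared();
  return true;
}
```

In this sketch a failed lock attempt is counted and skipped rather than treated as a failure, which mirrors how the patch adds ER_LOCK_WAIT_TIMEOUT to the errors that flush_tables_error_handler counts as handled.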