MDEV-34458 wait_for_read in buf_page_get_low hurts performance

The performance regression seen while loading BP is caused by the deadlock fix given in MDEV-33543. The area of impact is wider but is more visible when BP is being loaded initially via DMLs. Specifically the response time could be impacted in DML doing pessimistic operation on index(split/merge) and the leaf pages are not found in buffer pool. It is more likely to occur with small BP size. The origin of the issue dates back to MDEV-30400 that introduced btr_cur_t::search_leaf() replacing btr_cur_search_to_nth_level() for leaf page searches. In btr_latch_prev, we use RW_NO_LATCH to get the previous page fixed in BP without latching. When the page is not in BP, we try to acquire and wait for S latch violating the latching order. This deadlock was analyzed in MDEV-33543 and fixed by using the already present wait logic in buf_page_get_gen() instead of waiting for latch. The wait logic is inferior to usual S latch wait and is simply a repeated sleep 100 of micro-sec (The actual sleep time could be more depending on platforms). The bug was seen with "change-buffering" code path and the idea was that this path should be less exercised. The judgement was not correct and the path is actually quite frequent and does impact performance when pages are not in BP and being loaded by DML expanding/shrinking large data. Fix: While trying to get a page with RW_NO_LATCH and we are attempting "out of order" latch, return from buf_page_get_gen immediately instead of waiting and follow the ordered latching path.

MDEV-34458 wait_for_read in buf_page_get_low hurts performance
The performance regression seen while loading BP is caused by the deadlock fix given in MDEV-33543. The area of impact is wider but is more visible when BP is being loaded initially via DMLs. Specifically the response time could be impacted in DML doing pessimistic operation on index(split/merge) and the leaf pages are not found in buffer pool. It is more likely to occur with small BP size. The origin of the issue dates back to MDEV-30400 that introduced btr_cur_t::search_leaf() replacing btr_cur_search_to_nth_level() for leaf page searches. In btr_latch_prev, we use RW_NO_LATCH to get the previous page fixed in BP without latching. When the page is not in BP, we try to acquire and wait for S latch violating the latching order. This deadlock was analyzed in MDEV-33543 and fixed by using the already present wait logic in buf_page_get_gen() instead of waiting for latch. The wait logic is inferior to usual S latch wait and is simply a repeated sleep 100 of micro-sec (The actual sleep time could be more depending on platforms). The bug was seen with "change-buffering" code path and the idea was that this path should be less exercised. The judgement was not correct and the path is actually quite frequent and does impact performance when pages are not in BP and being loaded by DML expanding/shrinking large data. Fix: While trying to get a page with RW_NO_LATCH and we are attempting "out of order" latch, return from buf_page_get_gen immediately instead of waiting and follow the ordered latching path.
73ad436e · mariadb-DebarunBanerjee · dcd8a648 · 73ad436e · 73ad436e · 73ad436e
Commit 73ad436e authored Jul 03, 2024 by mariadb-DebarunBanerjee
5 changed files
--- a/storage/innobase/btr/btr0btr.cc
+++ b/storage/innobase/btr/btr0btr.cc
@@ -1262,7 +1262,7 @@ void btr_drop_temporary_table(const dict_table_t &table)
  {
    if (buf_block_t *block= buf_page_get_low({SRV_TMP_SPACE_ID, index->page}, 0,
                                             RW_X_LATCH, nullptr, BUF_GET, &mtr,
-                                             nullptr, false))
+                                             nullptr, false, nullptr))
    {
      btr_free_but_not_root(block, MTR_LOG_NO_REDO);
      mtr.set_log_mode(MTR_LOG_NO_REDO);

--- a/storage/innobase/btr/btr0cur.cc
+++ b/storage/innobase/btr/btr0cur.cc
@@ -971,12 +971,21 @@ static int btr_latch_prev(buf_block_t *block, page_id_t page_id,
  buffer-fixes on both blocks will prevent eviction. */

 retry:
+  /* Pass no_wait pointer to ensure that we don't wait on the current page
+  latch while holding the next page latch to avoid latch ordering violation. */
+  bool no_wait= false;
+  int ret= 1;
+
  buf_block_t *prev= buf_page_get_gen(page_id, zip_size, RW_NO_LATCH, nullptr,
-                                      BUF_GET, mtr, err, false);
+                                      BUF_GET, mtr, err, false, &no_wait);
  if (UNIV_UNLIKELY(!prev))
+  {
+    /* Check if we had to return because we couldn't wait on latch. */
+    if (no_wait)
+      goto ordered_latch;
    return 0;
+  }

-  int ret= 1;
  static_assert(MTR_MEMO_PAGE_S_FIX == mtr_memo_type_t(BTR_SEARCH_LEAF), "");
  static_assert(MTR_MEMO_PAGE_X_FIX == mtr_memo_type_t(BTR_MODIFY_LEAF), "");

@@ -996,6 +1005,7 @@ static int btr_latch_prev(buf_block_t *block, page_id_t page_id,
  {
    ut_ad(mtr->at_savepoint(mtr->get_savepoint() - 1)->page.id() == page_id);
    mtr->release_last_page();
+ordered_latch:
    if (rw_latch == RW_S_LATCH)
      block->page.lock.s_unlock();
    else

--- a/storage/innobase/buf/buf0buf.cc
+++ b/storage/innobase/buf/buf0buf.cc
@@ -2416,6 +2416,9 @@ BUF_PEEK_IF_IN_POOL, or BUF_GET_IF_IN_POOL_OR_WATCH
 while reading the page from file
 then it makes sure that it does merging of change buffer changes while
 reading the page from file.
+@param[in,out]	no_wait			If not NULL on input, then we must not
+wait for current page latch. On output, the value is set to true if we had to
+return because we could not wait on page latch.
 @return pointer to the block or NULL */
 TRANSACTIONAL_TARGET
 buf_block_t*
@@ -2427,7 +2430,8 @@ buf_page_get_low(
 	ulint			mode,
 	mtr_t*			mtr,
 	dberr_t*		err,
-	bool			allow_ibuf_merge)
+	bool			allow_ibuf_merge,
+	bool*			no_wait)
 {
 	unsigned	access_time;
 	ulint		retries = 0;
@@ -2584,14 +2588,17 @@ buf_page_get_low(
 		in buf_page_t::read_complete() or
 		buf_pool_t::corrupted_evict(), or
 		after buf_zip_decompress() in this function. */
-		if (rw_latch != RW_NO_LATCH) {
+		if (!no_wait) {
 			block->page.lock.s_lock();
 		} else if (!block->page.lock.s_lock_try()) {
-			/* For RW_NO_LATCH, we should not try to acquire S or X
-			latch directly as we could be violating the latching
-			order resulting in deadlock. Instead we try latching the
-			page and retry in case of a failure. */
-			goto wait_for_read;
+			ut_ad(rw_latch == RW_NO_LATCH);
+			/* We should not wait trying to acquire S latch for
+			current page while holding latch for the next page.
+			It would violate the latching order resulting in
+			possible deadlock. Caller must handle the failure. */
+			block->page.unfix();
+			*no_wait= true;
+			return nullptr;
 		}
 		state = block->page.state();
 		ut_ad(state < buf_page_t::READ_FIX
@@ -2647,7 +2654,6 @@ buf_page_get_low(
 	if (UNIV_UNLIKELY(!block->page.frame)) {
 		if (!block->page.lock.x_lock_try()) {
 wait_for_unzip:
-wait_for_read:
 			/* The page is being read or written, or
 			another thread is executing buf_zip_decompress()
 			in buf_page_get_low() on it. */
@@ -2936,6 +2942,9 @@ BUF_PEEK_IF_IN_POOL, or BUF_GET_IF_IN_POOL_OR_WATCH
 @param[out]	err			DB_SUCCESS or error code
 @param[in]	allow_ibuf_merge	Allow change buffer merge while
 reading the pages from file.
+@param[in,out]	no_wait			If not NULL on input, then we must not
+wait for current page latch. On output, the value is set to true if we had to
+return because we could not wait on page latch.
 @return pointer to the block or NULL */
 buf_block_t*
 buf_page_get_gen(
@@ -2946,12 +2955,14 @@ buf_page_get_gen(
 	ulint			mode,
 	mtr_t*			mtr,
 	dberr_t*		err,
-	bool			allow_ibuf_merge)
+	bool			allow_ibuf_merge,
+	bool*			no_wait)
 {
  buf_block_t *block= recv_sys.recover(page_id);
  if (UNIV_LIKELY(!block))
    return buf_page_get_low(page_id, zip_size, rw_latch,
-                            guess, mode, mtr, err, allow_ibuf_merge);
+                            guess, mode, mtr, err, allow_ibuf_merge,
+                            no_wait);
  else if (UNIV_UNLIKELY(block == reinterpret_cast<buf_block_t*>(-1)))
  {
  corrupted:

--- a/storage/innobase/include/buf0buf.h
+++ b/storage/innobase/include/buf0buf.h
@@ -209,6 +209,9 @@ BUF_PEEK_IF_IN_POOL, or BUF_GET_IF_IN_POOL_OR_WATCH
 @param[out]	err			DB_SUCCESS or error code
 @param[in]	allow_ibuf_merge	Allow change buffer merge while
 reading the pages from file.
+@param[in,out]	no_wait			If not NULL on input, then we must not
+wait for current page latch. On output, the value is set to true if we had to
+return because we could not wait on page latch.
 @return pointer to the block or NULL */
 buf_block_t*
 buf_page_get_gen(
@@ -219,7 +222,8 @@ buf_page_get_gen(
 	ulint			mode,
 	mtr_t*			mtr,
 	dberr_t*		err = NULL,
-	bool			allow_ibuf_merge = false)
+	bool			allow_ibuf_merge = false,
+	bool*			no_wait = nullptr)
 	MY_ATTRIBUTE((nonnull(6), warn_unused_result));

 /** This is the low level function used to get access to a database page.
@@ -236,6 +240,11 @@ BUF_PEEK_IF_IN_POOL, or BUF_GET_IF_IN_POOL_OR_WATCH
 while reading the page from file
 then it makes sure that it does merging of change buffer changes while
 reading the page from file.
+@param[in]	holds_next_page_latch	True if caller holds next page latch.
+We must not wait for current page latch.
+@param[in,out]	no_wait			If not NULL on input, then we must not
+wait for current page latch. On output, the value is set to true if we had to
+return because we could not wait on page latch.
 @return pointer to the block or NULL */
 buf_block_t*
 buf_page_get_low(
@@ -246,7 +255,8 @@ buf_page_get_low(
 	ulint			mode,
 	mtr_t*			mtr,
 	dberr_t*		err,
-	bool			allow_ibuf_merge);
+	bool			allow_ibuf_merge,
+	bool*			no_wait);

 /** Initialize a page in the buffer pool. The page is usually not read
 from a file even if it cannot be found in the buffer buf_pool. This is one

--- a/storage/innobase/row/row0import.cc
+++ b/storage/innobase/row/row0import.cc
@@ -2165,7 +2165,7 @@ dberr_t PageConverter::operator()(buf_block_t* block) UNIV_NOTHROW
 	we no longer evict the pages on DISCARD TABLESPACE. */
 	buf_page_get_low(block->page.id(), get_zip_size(), RW_NO_LATCH,
 			 nullptr, BUF_PEEK_IF_IN_POOL,
-			 nullptr, nullptr, false);
+			 nullptr, nullptr, false, nullptr);

 	uint16_t page_type;