MDEV-33515 log_sys.lsn_lock causes excessive context switching

The log_sys.lsn_lock is a very contended resource with a small
critical section in log_sys.append_prepare(). On many processor
microarchitectures, replacing the system call based log_sys.lsn_lock
with a pure spin lock would fare worse during high concurrency workloads,
wasting a significant amount of CPU cycles in the spin loop.

On other microarchitectures, we would see a significant amount of time
being spent in native_queued_spin_lock_slowpath() in the Linux kernel,
plus context switching between user and kernel address space. This was
pointed out by Steve Shaw from Intel Corporation.

Depending on the workload and the hardware implementation, it may be
useful to use a pure spin lock in log_sys.append_prepare().
We will introduce a parameter. The statement

	SET GLOBAL INNODB_LOG_SPIN_WAIT_DELAY=50;

would enable a spin lock that will execute that many MY_RELAX_CPU()
operations (such as the x86 PAUSE instruction) between successive
attempts of acquiring the spin lock. The use of a system call based
log_sys.lsn_lock (which is the default setting) can be enabled by

	SET GLOBAL INNODB_LOG_SPIN_WAIT_DELAY=0;

This patch will also introduce #ifdef LOG_LATCH_DEBUG
(part of cmake -DWITH_INNODB_EXTRA_DEBUG=ON) for more accurate
tracking of log_sys.latch ownership and reorganize the fields of
log_sys to improve the locality of reference and to reduce the
chances of false sharing.

When a spin lock is being used, it will be maintained in the
most significant bit of log_sys.buf_free. This is useful, because that is
one of the fields that is covered by the lock. For IA-32 or AMD64, we
implement the spin lock specially via log_t::lsn_lock_bts(), employing the
i386 LOCK BTS instruction. A straightforward std::atomic::fetch_or() would
translate into an inefficient loop around LOCK CMPXCHG.

mtr_t::spin_wait_delay: The value of innodb_log_spin_wait_delay.

mtr_t::finisher: Pointer to the currently used mtr_t::finish_write()
implementation. This allows us to avoid introducing conditional branches.
We no longer invoke log_sys.is_pmem() at the mini-transaction level,
but we would do that in log_write_up_to().

mtr_t::finisher_update(): Update finisher when spin_wait_delay is
changed from or to 0 (the spin lock is changed to log_sys.lsn_lock or
vice versa).
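
The idea of keeping the lock in the most significant bit of the word it protects can be illustrated with a minimal, portable sketch. Everything here is an assumption made for illustration: SpinWord, relax_cpu() and spin_demo() are hypothetical names, relax_cpu() is only a compiler barrier standing in for MY_RELAX_CPU(), and the acquisition uses the plain std::atomic::fetch_or() that the message above describes as inefficient on x86, not the LOCK BTS based log_t::lsn_lock_bts().

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for MY_RELAX_CPU(): only a compiler barrier.
// A real build would emit a PAUSE (x86) or a similar hint here.
inline void relax_cpu() noexcept
{ std::atomic_signal_fence(std::memory_order_seq_cst); }

struct SpinWord
{
  // Only the most significant bit of size_t is set.
  static constexpr std::size_t LOCK= ~(~std::size_t{0} >> 1);
  std::atomic<std::size_t> word{0};

  // Spin until the lock bit is acquired; return the payload (low bits).
  std::size_t lock(unsigned delay) noexcept
  {
    for (;;)
    {
      const std::size_t old= word.fetch_or(LOCK, std::memory_order_acquire);
      if (!(old & LOCK))
        return old;
      // Poll until the bit looks clear, relaxing 'delay' times per poll.
      do
        for (unsigned i= delay; i--; )
          relax_cpu();
      while (word.load(std::memory_order_relaxed) & LOCK);
    }
  }

  // Publish a new payload and clear the lock bit with a single store.
  void unlock(std::size_t payload) noexcept
  {
    assert(!(payload & LOCK));
    word.store(payload, std::memory_order_release);
  }
};

// Several threads increment the payload while holding the lock bit.
inline std::size_t spin_demo(unsigned threads, unsigned iterations,
                             unsigned delay)
{
  SpinWord w;
  std::vector<std::thread> pool;
  for (unsigned t= 0; t < threads; t++)
    pool.emplace_back([&w, iterations, delay] {
      for (unsigned i= 0; i < iterations; i++)
        w.unlock(w.lock(delay) + 1);
    });
  for (auto &th : pool)
    th.join();
  return w.word.load(std::memory_order_relaxed);
}
```

Because the lock bit and the payload share one word, releasing the lock and publishing the new offset is a single store, which is presumably part of why the message calls the placement useful. Calling spin_demo(4, 10000, 5) should return 40000 if no increment is lost.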

Commit bf0b82d2 authored Mar 22, 2024 by Marko Mäkelä
12 changed files
--- a/extra/mariabackup/xtrabackup.cc
+++ b/extra/mariabackup/xtrabackup.cc
@@ -5320,9 +5320,10 @@ static bool xtrabackup_backup_func()
 	}
 	/* get current checkpoint_lsn */
 	{
+		log_sys.latch.wr_lock(SRW_LOCK_CALL);
 		mysql_mutex_lock(&recv_sys.mutex);
-
 		dberr_t err = recv_sys.find_checkpoint();
+		log_sys.latch.wr_unlock();

 		if (err != DB_SUCCESS) {
 			msg("Error: cannot read redo log header");

--- a/mysql-test/suite/sys_vars/r/sysvars_innodb.result
+++ b/mysql-test/suite/sys_vars/r/sysvars_innodb.result
@@ -1027,6 +1027,18 @@ NUMERIC_BLOCK_SIZE	NULL
 ENUM_VALUE_LIST	NULL
 READ_ONLY	YES
 COMMAND_LINE_ARGUMENT	REQUIRED
+VARIABLE_NAME	INNODB_LOG_SPIN_WAIT_DELAY
+SESSION_VALUE	NULL
+DEFAULT_VALUE	0
+VARIABLE_SCOPE	GLOBAL
+VARIABLE_TYPE	INT UNSIGNED
+VARIABLE_COMMENT	Delay between log buffer spin lock polls (0 to use a blocking latch)
+NUMERIC_MIN_VALUE	0
+NUMERIC_MAX_VALUE	6000
+NUMERIC_BLOCK_SIZE	0
+ENUM_VALUE_LIST	NULL
+READ_ONLY	NO
+COMMAND_LINE_ARGUMENT	OPTIONAL
 VARIABLE_NAME	INNODB_LRU_FLUSH_SIZE
 SESSION_VALUE	NULL
 DEFAULT_VALUE	32

--- a/storage/innobase/CMakeLists.txt
+++ b/storage/innobase/CMakeLists.txt
@@ -71,7 +71,7 @@ ADD_FEATURE_INFO(INNODB_ROOT_GUESS WITH_INNODB_ROOT_GUESS

 OPTION(WITH_INNODB_EXTRA_DEBUG "Enable extra InnoDB debug checks" OFF)
 IF(WITH_INNODB_EXTRA_DEBUG)
-  ADD_DEFINITIONS(-DUNIV_ZIP_DEBUG)
+  ADD_DEFINITIONS(-DUNIV_ZIP_DEBUG -DLOG_LATCH_DEBUG)
 ENDIF()
 ADD_FEATURE_INFO(INNODB_EXTRA_DEBUG WITH_INNODB_EXTRA_DEBUG "Extra InnoDB debug checks")


--- a/storage/innobase/buf/buf0flu.cc
+++ b/storage/innobase/buf/buf0flu.cc
@@ -1915,7 +1915,7 @@ inline void log_t::write_checkpoint(lsn_t end_lsn) noexcept
      {
        my_munmap(buf, file_size);
        buf= resize_buf;
-        buf_free= START_OFFSET + (get_lsn() - resizing);
+        set_buf_free(START_OFFSET + (get_lsn() - resizing));
      }
      else
 #endif
@@ -1957,9 +1957,7 @@ inline void log_t::write_checkpoint(lsn_t end_lsn) noexcept
 static bool log_checkpoint_low(lsn_t oldest_lsn, lsn_t end_lsn)
 {
  ut_ad(!srv_read_only_mode);
-#ifndef SUX_LOCK_GENERIC
-  ut_ad(log_sys.latch.is_write_locked());
-#endif
+  ut_ad(log_sys.latch_have_wr());
  ut_ad(oldest_lsn <= end_lsn);
  ut_ad(end_lsn == log_sys.get_lsn());


--- a/storage/innobase/fil/fil0fil.cc
+++ b/storage/innobase/fil/fil0fil.cc
@@ -927,9 +927,7 @@ bool fil_space_free(uint32_t id, bool x_latched)

 			log_sys.latch.wr_unlock();
 		} else {
-#ifndef SUX_LOCK_GENERIC
-			ut_ad(log_sys.latch.is_write_locked());
-#endif
+			ut_ad(log_sys.latch_have_wr());
 			if (space->max_lsn) {
 				ut_d(space->max_lsn = 0);
 				fil_system.named_spaces.remove(*space);
@@ -3036,9 +3034,7 @@ void
 fil_names_dirty(
 	fil_space_t*	space)
 {
-#ifndef SUX_LOCK_GENERIC
-	ut_ad(log_sys.latch.is_write_locked());
-#endif
+	ut_ad(log_sys.latch_have_wr());
 	ut_ad(recv_recovery_is_on());
 	ut_ad(log_sys.get_lsn() != 0);
 	ut_ad(space->max_lsn == 0);
@@ -3052,9 +3048,7 @@ fil_names_dirty(
 tablespace was modified for the first time since fil_names_clear(). */
 ATTRIBUTE_NOINLINE ATTRIBUTE_COLD void mtr_t::name_write()
 {
-#ifndef SUX_LOCK_GENERIC
-  ut_ad(log_sys.latch.is_write_locked());
-#endif
+  ut_ad(log_sys.latch_have_wr());
  ut_d(fil_space_validate_for_mtr_commit(m_user_space));
  ut_ad(!m_user_space->max_lsn);
  m_user_space->max_lsn= log_sys.get_lsn();
@@ -3078,9 +3072,7 @@ ATTRIBUTE_COLD lsn_t fil_names_clear(lsn_t lsn)
 {
 	mtr_t	mtr;

-#ifndef SUX_LOCK_GENERIC
-	ut_ad(log_sys.latch.is_write_locked());
-#endif
+	ut_ad(log_sys.latch_have_wr());
 	ut_ad(lsn);
 	ut_ad(log_sys.is_latest());
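
The repeated replacement of guarded assertions by ut_ad(log_sys.latch_have_wr()) follows a simple pattern: the preprocessor decision moves into one helper, so call sites no longer need guards. A minimal sketch of that pattern, assuming a toy latch; ToyLatch, ToyLog, set_capacity() and the TOY_LATCH_GENERIC macro are hypothetical names that only play the roles of the real latch, log_t and SUX_LOCK_GENERIC.

```cpp
#include <cassert>

// Hypothetical toy latch that remembers whether it is write-locked.
struct ToyLatch
{
  bool writer= false;
  void wr_lock() noexcept { writer= true; }
  void wr_unlock() noexcept { writer= false; }
  bool is_write_locked() const noexcept { return writer; }
};

struct ToyLog
{
  ToyLatch latch;
#ifdef TOY_LATCH_GENERIC  // hypothetical; plays the role of SUX_LOCK_GENERIC
  // Ownership cannot be queried here, so let every assertion pass.
  bool latch_have_wr() const noexcept { return true; }
#else
  bool latch_have_wr() const noexcept { return latch.is_write_locked(); }
#endif
};

// The call site carries no preprocessor guard of its own.
inline void set_capacity(ToyLog &log)
{
  assert(log.latch_have_wr());
  // ... work that requires the exclusive latch would go here ...
}
```

In the patch itself the helpers are declared in log0log.h, and they are not defined at all in non-debug builds, where ut_ad() compiles to nothing.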


--- a/storage/innobase/handler/ha_innodb.cc
+++ b/storage/innobase/handler/ha_innodb.cc
@@ -18478,6 +18478,24 @@ static void innodb_log_file_size_update(THD *thd, st_mysql_sys_var*,
  mysql_mutex_lock(&LOCK_global_system_variables);
 }

+static void innodb_log_spin_wait_delay_update(THD *thd, st_mysql_sys_var*,
+                                              void *var, const void *save)
+{
+  ut_ad(var == &mtr_t::spin_wait_delay);
+
+  unsigned delay= *static_cast<const unsigned*>(save);
+
+  if (!delay != !mtr_t::spin_wait_delay)
+  {
+    log_sys.latch.wr_lock(SRW_LOCK_CALL);
+    mtr_t::spin_wait_delay= delay;
+    mtr_t::finisher_update();
+    log_sys.latch.wr_unlock();
+  }
+  else
+    mtr_t::spin_wait_delay= delay;
+}
+
 /** Update innodb_status_output or innodb_status_output_locks,
 which control InnoDB "status monitor" output to the error log.
 @param[out]	var	current value
@@ -19312,6 +19330,12 @@ static MYSQL_SYSVAR_ULONGLONG(log_file_size, srv_log_file_size,
  nullptr, innodb_log_file_size_update,
  96 << 20, 4 << 20, std::numeric_limits<ulonglong>::max(), 4096);

+static MYSQL_SYSVAR_UINT(log_spin_wait_delay, mtr_t::spin_wait_delay,
+  PLUGIN_VAR_OPCMDARG,
+  "Delay between log buffer spin lock polls (0 to use a blocking latch)",
+  nullptr, innodb_log_spin_wait_delay_update,
+  0, 0, 6000, 0);
+
 static MYSQL_SYSVAR_UINT(old_blocks_pct, innobase_old_blocks_pct,
  PLUGIN_VAR_RQCMDARG,
  "Percentage of the buffer pool to reserve for 'old' blocks.",
@@ -19771,6 +19795,7 @@ static struct st_mysql_sys_var* innobase_system_variables[]= {
  MYSQL_SYSVAR(log_file_buffering),
 #endif
  MYSQL_SYSVAR(log_file_size),
+  MYSQL_SYSVAR(log_spin_wait_delay),
  MYSQL_SYSVAR(log_group_home_dir),
  MYSQL_SYSVAR(max_dirty_pages_pct),
  MYSQL_SYSVAR(max_dirty_pages_pct_lwm),
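
The test !delay != !mtr_t::spin_wait_delay in innodb_log_spin_wait_delay_update() is true exactly when the setting crosses between zero and nonzero, which is the only case that needs the exclusive log_sys.latch and a call to mtr_t::finisher_update(). A small sketch of that comparison; update_delay() and mode_switches are hypothetical names, and the file-scope variable only stands in for mtr_t::spin_wait_delay.

```cpp
#include <cassert>

// Hypothetical stand-ins for mtr_t::spin_wait_delay and for the work
// that the patch performs under the exclusive log_sys.latch.
static unsigned spin_wait_delay= 0;
static unsigned mode_switches= 0;

// Apply a new delay; return whether the lock implementation must change.
inline bool update_delay(unsigned delay) noexcept
{
  // !x maps 0 to true and any other value to false, so the comparison
  // holds exactly when one value is zero and the other one is not.
  const bool crossing= !delay != !spin_wait_delay;
  spin_wait_delay= delay;
  if (crossing)
    mode_switches++;  // the patch calls mtr_t::finisher_update() here
  return crossing;
}
```

Changing the delay from 50 to 100 therefore only stores the new value, while changing it from 0 to 50 or from 100 to 0 also switches the lock implementation.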

--- a/storage/innobase/include/dyn0buf.h
+++ b/storage/innobase/include/dyn0buf.h
@@ -57,11 +57,7 @@ class mtr_buf_t {
 		/**
 		Gets the number of used bytes in a block.
 		@return	number of bytes used */
-		ulint used() const
-			MY_ATTRIBUTE((warn_unused_result))
-		{
-			return(static_cast<ulint>(m_used & ~DYN_BLOCK_FULL_FLAG));
-		}
+		uint32_t used() const { return m_used; }

 		/**
 		Gets pointer to the start of data.

--- a/storage/innobase/include/log0log.h
+++ b/storage/innobase/include/log0log.h
@@ -165,60 +165,92 @@ struct log_t
  static constexpr lsn_t FIRST_LSN= START_OFFSET;

 private:
-  /** The log sequence number of the last change of durable InnoDB files */
+  /** the lock bit in buf_free */
+  static constexpr size_t buf_free_LOCK= ~(~size_t{0} >> 1);
  alignas(CPU_LEVEL1_DCACHE_LINESIZE)
+  /** first free offset within buf used;
+  the most significant bit is set by lock_lsn() to protect this field
+  as well as write_to_buf, waits */
+  std::atomic<size_t> buf_free;
+public:
+  /** number of write requests (to buf); protected by lock_lsn() or lsn_lock */
+  size_t write_to_buf;
+  /** log record buffer, written to by mtr_t::commit() */
+  byte *buf;
+private:
+  /** The log sequence number of the last change of durable InnoDB files;
+  protected by lock_lsn() or lsn_lock or latch.wr_lock() */
  std::atomic<lsn_t> lsn;
  /** the first guaranteed-durable log sequence number */
  std::atomic<lsn_t> flushed_to_disk_lsn;
-  /** log sequence number when log resizing was initiated, or 0 */
-  std::atomic<lsn_t> resize_lsn;
-  /** set when there may be need to initiate a log checkpoint.
-  This must hold if lsn - last_checkpoint_lsn > max_checkpoint_age. */
-  std::atomic<bool> need_checkpoint;
+public:
+  /** number of append_prepare_wait(); protected by lock_lsn() or lsn_lock */
+  size_t waits;
+  /** innodb_log_buffer_size (size of buf,flush_buf if !is_pmem(), in bytes) */
+  size_t buf_size;
+  /** log file size in bytes, including the header */
+  lsn_t file_size;

-#if defined(__aarch64__)
-  /* On ARM, we do more spinning */
+#ifdef LOG_LATCH_DEBUG
+  typedef srw_lock_debug log_rwlock;
+  typedef srw_mutex log_lsn_lock;
+
+  bool latch_have_wr() const { return latch.have_wr(); }
+  bool latch_have_rd() const { return latch.have_rd(); }
+  bool latch_have_any() const { return latch.have_any(); }
+#else
+# ifndef UNIV_DEBUG
+# elif defined SUX_LOCK_GENERIC
+  bool latch_have_wr() const { return true; }
+  bool latch_have_rd() const { return true; }
+  bool latch_have_any() const { return true; }
+# else
+  bool latch_have_wr() const { return latch.is_write_locked(); }
+  bool latch_have_rd() const { return latch.is_locked(); }
+  bool latch_have_any() const { return latch.is_locked(); }
+# endif
+# ifdef __aarch64__
+  /* On ARM, we spin more */
  typedef srw_spin_lock log_rwlock;
  typedef pthread_mutex_wrapper<true> log_lsn_lock;
-#else
+# else
  typedef srw_lock log_rwlock;
  typedef srw_mutex log_lsn_lock;
+# endif
 #endif
-
-public:
-  /** rw-lock protecting writes to buf; normal mtr_t::commit()
-  outside any log checkpoint is covered by a shared latch */
+  /** exclusive latch for checkpoint, shared for mtr_t::commit() to buf */
  alignas(CPU_LEVEL1_DCACHE_LINESIZE) log_rwlock latch;
-private:
-  /** mutex protecting buf_free et al, together with latch */
-  log_lsn_lock lsn_lock;
-public:
-  /** first free offset within buf use; protected by lsn_lock */
-  Atomic_relaxed<size_t> buf_free;
-  /** number of write requests (to buf); protected by lsn_lock */
-  size_t write_to_buf;
-  /** number of append_prepare_wait(); protected by lsn_lock */
-  size_t waits;
-private:
+
+  /** number of std::swap(buf, flush_buf) and writes from buf to log;
+  protected by latch.wr_lock() */
+  ulint write_to_log;
+
  /** Last written LSN */
  lsn_t write_lsn;
-public:
-  /** log record buffer, written to by mtr_t::commit() */
-  byte *buf;
+  /** recommended maximum buf_free size, after which the buffer is flushed */
+  size_t max_buf_free;
+
  /** buffer for writing data to ib_logfile0, or nullptr if is_pmem()
  In write_buf(), buf and flush_buf are swapped */
  byte *flush_buf;
-  /** number of std::swap(buf, flush_buf) and writes from buf to log;
-  protected by latch.wr_lock() */
-  ulint write_to_log;
-
+  /** set when there may be need to initiate a log checkpoint.
+  This must hold if lsn - last_checkpoint_lsn > max_checkpoint_age. */
+  std::atomic<bool> need_checkpoint;
+  /** whether a checkpoint is pending; protected by latch.wr_lock() */
+  Atomic_relaxed<bool> checkpoint_pending;
  /** Log sequence number when a log file overwrite (broken crash recovery)
  was noticed. Protected by latch.wr_lock(). */
  lsn_t overwrite_warned;

-  /** innodb_log_buffer_size (size of buf,flush_buf if !is_pmem(), in bytes) */
-  size_t buf_size;
+  /** latest completed checkpoint (protected by latch.wr_lock()) */
+  Atomic_relaxed<lsn_t> last_checkpoint_lsn;
+  /** next checkpoint LSN (protected by latch.wr_lock()) */
+  lsn_t next_checkpoint_lsn;
+  /** next checkpoint number (protected by latch.wr_lock()) */
+  ulint next_checkpoint_no;

+  /** Log file */
+  log_file_t log;
 private:
  /** Log file being constructed during resizing; protected by latch */
  log_file_t resize_log;
@@ -229,18 +261,14 @@ struct log_t
  /** Buffer for writing to resize_log; @see flush_buf */
  byte *resize_flush_buf;

-  void init_lsn_lock() {lsn_lock.init(); }
-  void lock_lsn() { lsn_lock.wr_lock(); }
-  void unlock_lsn() {lsn_lock.wr_unlock(); }
-  void destroy_lsn_lock() { lsn_lock.destroy(); }
+  /** Special implementation of lock_lsn() for IA-32 and AMD64 */
+  void lsn_lock_bts() noexcept;
+  /** Acquire a lock for updating buf_free and related fields.
+  @return the value of buf_free */
+  size_t lock_lsn() noexcept;

-public:
-  /** recommended maximum size of buf, after which the buffer is flushed */
-  size_t max_buf_free;
-
-  /** log file size in bytes, including the header */
-  lsn_t file_size;
-private:
+  /** log sequence number when log resizing was initiated, or 0 */
+  std::atomic<lsn_t> resize_lsn;
  /** the log sequence number at the start of the log file */
  lsn_t first_lsn;
 #if defined __linux__ || defined _WIN32
@@ -250,8 +278,6 @@ struct log_t
 public:
  /** format of the redo log: e.g., FORMAT_10_8 */
  uint32_t format;
-  /** Log file */
-  log_file_t log;
 #if defined __linux__ || defined _WIN32
  /** whether file system caching is enabled for the log */
  my_bool log_buffered;
@@ -279,21 +305,28 @@ struct log_t
 					/*!< this is the maximum allowed value
 					for lsn - last_checkpoint_lsn when a
 					new query step is started */
-  /** latest completed checkpoint (protected by latch.wr_lock()) */
-  Atomic_relaxed<lsn_t> last_checkpoint_lsn;
-  /** next checkpoint LSN (protected by log_sys.latch) */
-  lsn_t next_checkpoint_lsn;
-  /** next checkpoint number (protected by latch.wr_lock()) */
-  ulint next_checkpoint_no;
-  /** whether a checkpoint is pending */
-  Atomic_relaxed<bool> checkpoint_pending;

  /** buffer for checkpoint header */
  byte *checkpoint_buf;
 	/* @} */

+private:
+  /** A lock when the spin-only lock_lsn() is not being used */
+  log_lsn_lock lsn_lock;
+public:
+
  bool is_initialised() const noexcept { return max_buf_free != 0; }

+  /** whether there is capacity in the log buffer */
+  bool buf_free_ok() const noexcept
+  {
+    return (buf_free.load(std::memory_order_relaxed) & ~buf_free_LOCK) <
+      max_buf_free;
+  }
+
+  void set_buf_free(size_t f) noexcept
+  { ut_ad(f < buf_free_LOCK); buf_free.store(f, std::memory_order_relaxed); }
+
 #ifdef HAVE_PMEM
  bool is_pmem() const noexcept { return !flush_buf; }
 #else
@@ -302,7 +335,7 @@ struct log_t

  bool is_opened() const noexcept { return log.is_opened(); }

-  /** @return target write LSN to react on buf_free >= max_buf_free */
+  /** @return target write LSN to react on !buf_free_ok() */
  inline lsn_t get_write_target() const;

  /** @return LSN at which log resizing was started and is still in progress
@@ -402,9 +435,7 @@ struct log_t

  void set_recovered_lsn(lsn_t lsn) noexcept
  {
-#ifndef SUX_LOCK_GENERIC
-    ut_ad(latch.is_write_locked());
-#endif /* SUX_LOCK_GENERIC */
+    ut_ad(latch_have_wr());
    write_lsn= lsn;
    this->lsn.store(lsn, std::memory_order_relaxed);
    flushed_to_disk_lsn.store(lsn, std::memory_order_relaxed);
@@ -444,17 +475,23 @@ struct log_t

 private:
  /** Wait in append_prepare() for buffer to become available
-  @param lsn  log sequence number to write up to
-  @param ex   whether log_sys.latch is exclusively locked */
-  ATTRIBUTE_COLD void append_prepare_wait(lsn_t lsn, bool ex) noexcept;
+  @tparam spin  whether to use the spin-only lock_lsn()
+  @param b      the value of buf_free
+  @param ex     whether log_sys.latch is exclusively locked
+  @param lsn    log sequence number to write up to
+  @return the new value of buf_free */
+  template<bool spin>
+  ATTRIBUTE_COLD size_t append_prepare_wait(size_t b, bool ex, lsn_t lsn)
+    noexcept;
 public:
  /** Reserve space in the log buffer for appending data.
+  @tparam spin  whether to use the spin-only lock_lsn()
  @tparam pmem  log_sys.is_pmem()
  @param size   total length of the data to append(), in bytes
  @param ex     whether log_sys.latch is exclusively locked
  @return the start LSN and the buffer position for append() */
-  template<bool pmem>
-  inline std::pair<lsn_t,byte*> append_prepare(size_t size, bool ex) noexcept;
+  template<bool spin,bool pmem>
+  std::pair<lsn_t,byte*> append_prepare(size_t size, bool ex) noexcept;

  /** Append a string of bytes to the redo log.
  @param d     destination
@@ -462,9 +499,7 @@ struct log_t
  @param size  length of str, in bytes */
  void append(byte *&d, const void *s, size_t size) noexcept
  {
-#ifndef SUX_LOCK_GENERIC
-    ut_ad(latch.is_locked());
-#endif
+    ut_ad(latch_have_any());
    ut_ad(d + size <= buf + (is_pmem() ? file_size : buf_size));
    memcpy(d, s, size);
    d+= size;
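
The constant buf_free_LOCK and the masking in buf_free_ok() can be checked in isolation. The sketch below reuses the expression from the patch; LOCK_BIT and MiniLog are hypothetical names, and MiniLog models only the two fields that the comparison reads.

```cpp
#include <atomic>
#include <cassert>
#include <climits>
#include <cstddef>

// Same expression as buf_free_LOCK: invert "all ones shifted right by one".
constexpr std::size_t LOCK_BIT= ~(~std::size_t{0} >> 1);
static_assert(LOCK_BIT ==
              std::size_t{1} << (sizeof(std::size_t) * CHAR_BIT - 1),
              "the most significant bit");
static_assert((LOCK_BIT & (LOCK_BIT - 1)) == 0, "exactly one bit is set");

// Hypothetical miniature of the two fields involved.
struct MiniLog
{
  std::atomic<std::size_t> buf_free{0};
  std::size_t max_buf_free= 0;

  // Ignore the lock bit when comparing the offset with the limit.
  bool buf_free_ok() const noexcept
  {
    return (buf_free.load(std::memory_order_relaxed) & ~LOCK_BIT) <
      max_buf_free;
  }
};
```

Without the mask, any reader that sampled buf_free while the lock bit was set would see a huge value and wrongly conclude that the buffer is full.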

--- a/storage/innobase/include/mtr0mtr.h
+++ b/storage/innobase/include/mtr0mtr.h
@@ -700,9 +700,27 @@ struct mtr_t {
  std::pair<lsn_t,page_flush_ahead> do_write();

  /** Append the redo log records to the redo log buffer.
+  @tparam spin whether to use the spin-only log_sys.lock_lsn()
+  @tparam pmem log_sys.is_pmem()
+  @param mtr   mini-transaction
  @param len   number of bytes to write
  @return {start_lsn,flush_ahead} */
-  std::pair<lsn_t,page_flush_ahead> finish_write(size_t len);
+  template<bool spin,bool pmem> static
+  std::pair<lsn_t,page_flush_ahead> finish_writer(mtr_t *mtr, size_t len);
+
+  /** The applicable variant of finish_writer() */
+  static std::pair<lsn_t,page_flush_ahead> (*finisher)(mtr_t *, size_t);
+
+  std::pair<lsn_t,page_flush_ahead> finish_write(size_t len)
+  { return finisher(this, len); }
+public:
+  /** Poll interval in log_sys.lock_lsn(); 0 to use log_sys.lsn_lock.
+  Protected by LOCK_global_system_variables; changes to and from 0
+  are additionally protected by exclusive log_sys.latch. */
+  static unsigned spin_wait_delay;
+  /** Update finisher when spin_wait_delay is changing to or from 0. */
+  static void finisher_update();
+private:

  /** Release all latches. */
  void release();
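
The finisher pointer replaces per-call tests of the spin and pmem settings with one indirect call that is re-targeted only when the configuration changes. A reduced sketch of that dispatch, under stated assumptions: MiniMtr is a hypothetical type, the return value merely identifies which specialization ran, and finisher_update() takes the two flags as parameters here, whereas the patch declares it without parameters.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical miniature of the dispatch through a function pointer.
struct MiniMtr
{
  // One specialization per configuration; the result encodes which ran.
  template<bool spin, bool pmem>
  static int finish_writer(MiniMtr *, std::size_t)
  { return (spin ? 2 : 0) | (pmem ? 1 : 0); }

  // Points at the variant that matches the current configuration.
  static int (*finisher)(MiniMtr *, std::size_t);

  // Re-select the variant; meant to run only on configuration changes.
  static void finisher_update(bool spin, bool pmem) noexcept
  {
    if (spin)
    {
      if (pmem) finisher= finish_writer<true, true>;
      else finisher= finish_writer<true, false>;
    }
    else
    {
      if (pmem) finisher= finish_writer<false, true>;
      else finisher= finish_writer<false, false>;
    }
  }

  // Hot path: a single indirect call, no test of spin or pmem.
  int finish_write(std::size_t len) { return finisher(this, len); }
};

int (*MiniMtr::finisher)(MiniMtr *, std::size_t)=
  MiniMtr::finish_writer<false, false>;
```

The branches live in finisher_update(), which is expected to run rarely, so finish_write() itself stays free of them.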

--- a/storage/innobase/log/log0log.cc
+++ b/storage/innobase/log/log0log.cc
@@ -69,9 +69,7 @@ log_t	log_sys;

 void log_t::set_capacity()
 {
-#ifndef SUX_LOCK_GENERIC
-	ut_ad(log_sys.latch.is_write_locked());
-#endif
+	ut_ad(log_sys.latch_have_wr());
 	/* Margin for the free space in the smallest log, before a new query
 	step which modifies the database, is started */

@@ -134,7 +132,6 @@ bool log_t::create()
 #endif

  latch.SRW_LOCK_INIT(log_latch_key);
-  init_lsn_lock();

  last_checkpoint_lsn= FIRST_LSN;
  log_capacity= 0;
@@ -143,7 +140,7 @@ bool log_t::create()
  next_checkpoint_lsn= 0;
  checkpoint_pending= false;

-  buf_free= 0;
+  set_buf_free(0);

  ut_ad(is_initialised());
 #ifndef HAVE_PMEM
@@ -244,6 +241,7 @@ void log_t::attach_low(log_file_t file, os_offset_t size)
 # endif
      log_maybe_unbuffered= true;
      log_buffered= false;
+      mtr_t::finisher_update();
      return true;
    }
  }
@@ -278,6 +276,7 @@ void log_t::attach_low(log_file_t file, os_offset_t size)
                        block_size);
 #endif

+  mtr_t::finisher_update();
 #ifdef HAVE_PMEM
  checkpoint_buf= static_cast<byte*>(aligned_malloc(block_size, block_size));
  memset_aligned<64>(checkpoint_buf, 0, block_size);
@@ -313,9 +312,7 @@ void log_t::header_write(byte *buf, lsn_t lsn, bool encrypted)

 void log_t::create(lsn_t lsn) noexcept
 {
-#ifndef SUX_LOCK_GENERIC
-  ut_ad(latch.is_write_locked());
-#endif
+  ut_ad(latch_have_wr());
  ut_ad(!recv_no_log_write);
  ut_ad(is_latest());
  ut_ad(this == &log_sys);
@@ -332,12 +329,12 @@ void log_t::create(lsn_t lsn) noexcept
  {
    mprotect(buf, size_t(file_size), PROT_READ | PROT_WRITE);
    memset_aligned<4096>(buf, 0, 4096);
-    buf_free= START_OFFSET;
+    set_buf_free(START_OFFSET);
  }
  else
 #endif
  {
-    buf_free= 0;
+    set_buf_free(0);
    memset_aligned<4096>(flush_buf, 0, buf_size);
    memset_aligned<4096>(buf, 0, buf_size);
  }
@@ -813,9 +810,7 @@ ATTRIBUTE_COLD void log_t::resize_write_buf(size_t length) noexcept
 @return the current log sequence number */
 template<bool release_latch> inline lsn_t log_t::write_buf() noexcept
 {
-#ifndef SUX_LOCK_GENERIC
-  ut_ad(latch.is_write_locked());
-#endif
+  ut_ad(latch_have_wr());
  ut_ad(!is_pmem());
  ut_ad(!srv_read_only_mode);

@@ -931,7 +926,7 @@ wait and check if an already running write is covering the request.
 void log_write_up_to(lsn_t lsn, bool durable,
                     const completion_callback *callback)
 {
-  ut_ad(!srv_read_only_mode || (log_sys.buf_free < log_sys.max_buf_free));
+  ut_ad(!srv_read_only_mode || log_sys.buf_free_ok());
  ut_ad(lsn != LSN_MAX);
  ut_ad(lsn != 0);

@@ -1292,6 +1287,7 @@ log_print(
 void log_t::close()
 {
  ut_ad(this == &log_sys);
+  ut_ad(!(buf_free & buf_free_LOCK));
  if (!is_initialised()) return;
  close_file();

@@ -1309,7 +1305,6 @@ void log_t::close()
 #endif

  latch.destroy();
-  destroy_lsn_lock();

  recv_sys.close();


--- a/storage/innobase/log/log0recv.cc
+++ b/storage/innobase/log/log0recv.cc
@@ -2518,11 +2518,9 @@ recv_sys_t::parse_mtr_result recv_sys_t::parse(source &l, bool if_exists)
  noexcept
 {
 restart:
-#ifndef SUX_LOCK_GENERIC
-  ut_ad(log_sys.latch.is_write_locked() ||
+  ut_ad(log_sys.latch_have_wr() ||
        srv_operation == SRV_OPERATION_BACKUP ||
        srv_operation == SRV_OPERATION_BACKUP_NO_DEFER);
-#endif
  mysql_mutex_assert_owner(&mutex);
  ut_ad(log_sys.next_checkpoint_lsn);
  ut_ad(log_sys.is_latest());
@@ -4050,9 +4048,7 @@ static bool recv_scan_log(bool last_phase)
  lsn_t rewound_lsn= 0;
  for (ut_d(lsn_t source_offset= 0);;)
  {
-#ifndef SUX_LOCK_GENERIC
-    ut_ad(log_sys.latch.is_write_locked());
-#endif
+    ut_ad(log_sys.latch_have_wr());
 #ifdef UNIV_DEBUG
    const bool wrap{source_offset + recv_sys.len == log_sys.file_size};
 #endif
@@ -4447,9 +4443,7 @@ recv_init_crash_recovery_spaces(bool rescan, bool& missing_tablespace)
 static dberr_t recv_rename_files()
 {
  mysql_mutex_assert_owner(&recv_sys.mutex);
-#ifndef SUX_LOCK_GENERIC
-  ut_ad(log_sys.latch.is_write_locked());
-#endif
+  ut_ad(log_sys.latch_have_wr());

  dberr_t err= DB_SUCCESS;

@@ -4732,7 +4726,7 @@ dberr_t recv_recovery_from_checkpoint_start()
 				 PROT_READ | PROT_WRITE);
 #endif
 		}
-		log_sys.buf_free = recv_sys.offset;
+		log_sys.set_buf_free(recv_sys.offset);
 		if (recv_needed_recovery
 	            && srv_operation <= SRV_OPERATION_EXPORT_RESTORED) {
 			/* Write a FILE_CHECKPOINT marker as the first thing,

--- a/storage/innobase/mtr/mtr0mtr.cc
+++ b/storage/innobase/mtr/mtr0mtr.cc