1. 11 Apr, 2024 1 commit
    • MDEV-33325 Crash in flst_read_addr on corrupted data · 263932d5
      Marko Mäkelä authored
      flst_read_addr(): Remove assertions. Instead, we will check these
      conditions in the callers and avoid a crash in case of corruption.
      We will check the conditions more carefully, because the callers
      know more exact bounds for the page numbers and the byte offsets
      within pages.
      
      flst_remove(), flst_add_first(), flst_add_last(): Add a parameter
      for passing fil_space_t::free_limit. None of the lists may point to
      pages that are beyond the current initialized length of the
      tablespace.
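
As a rough illustration of the kind of bounds check a caller can perform, the sketch below validates a list-node address against the initialized length of a tablespace. The type `FileAddr`, the function `addr_is_valid()`, and the layout constants are hypothetical stand-ins chosen for this example, not the definitions used in InnoDB.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a file-list node address: (page number, byte offset).
struct FileAddr {
  uint32_t page;
  uint16_t boffset;
};

// Assumed layout constants, for illustration only.
constexpr uint32_t kPageSize = 16384;
constexpr uint32_t kPageHeader = 38;  // bytes reserved at the start of a page
constexpr uint32_t kPageTrailer = 8;  // bytes reserved at the end of a page
constexpr uint32_t kNodeSize = 12;    // assumed size of one list node

// Returns true if the address points into an initialized page (below
// free_limit) and the whole node fits between the page header and trailer.
inline bool addr_is_valid(const FileAddr &a, uint32_t free_limit) {
  if (a.page >= free_limit) return false;
  if (a.boffset < kPageHeader) return false;
  return a.boffset + kNodeSize <= kPageSize - kPageTrailer;
}
```

A caller that gets `false` here can report corruption and return an error instead of hitting an assertion.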
      
      trx_rseg_mem_restore(): Access the first page of the tablespace,
      so that we will correctly recover rseg->space->free_limit
      in case some log based recovery is pending.
      
      ibuf_remove_free_page(): Only look up the root page once, and
      validate the last page number.
      
      Reviewed by: Debarun Banerjee
      263932d5
  2. 10 Apr, 2024 2 commits
  3. 09 Apr, 2024 5 commits
    • MDEV-25731 : Assertion `mode_ == m_local' failed in void... · 33af5575
      Jan Lindström authored
      MDEV-25731 : Assertion `mode_ == m_local' failed in void wsrep::client_state::streaming_params(wsrep::streaming_context::fragment_unit, size_t)
      
      The problem was that if wsrep_load_data_splitting was used,
      streaming replication (SR) parameters were set
      for a MyISAM table. Galera does not currently support SR for
      MyISAM.
      
      The fix is to ignore the wsrep_load_data_splitting setting (with
      a warning) if the table is not an InnoDB table.
      
      This is the 10.6+ version of the fix.
      Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
      33af5575
    • MDEV-33802 Weird read view after ROLLBACK of another transaction · 4aa92911
      Marko Mäkelä authored
      Even after commit b8a67198 there
      is an anomaly where a locking read could return inconsistent results.
      If a locking read would have to wait for a record lock, then by the
      definition of a read view, the modifications made by the current lock
      holder cannot be visible in the read view. This is because the read
      view must exclude any transactions that had not been committed at the
      time when the read view was created.
      
      lock_rec_convert_impl_to_expl_for_trx(), lock_rec_convert_impl_to_expl():
      Return an unsafe-to-dereference pointer to a transaction that holds or
      held the lock, or nullptr if the lock was available.
      
      lock_clust_rec_modify_check_and_lock(),
      lock_sec_rec_read_check_and_lock(),
      lock_clust_rec_read_check_and_lock():
      Return DB_RECORD_CHANGED if innodb_strict_isolation=ON and the
      lock was being held by another transaction.
      
      The test case, which is based on a bug report by Zhuang Liu,
      covers the function lock_sec_rec_read_check_and_lock().
      
      Reviewed by: Vladislav Lesin
      4aa92911
    • MDEV-33588 buf::Block_hint is a performance hog · a4cda66e
      Marko Mäkelä authored
      In so-called optimistic buffer pool lookups, we must not
      dereference a block descriptor before we have made sure that
      it is accessible. While buf_pool_t::resize() is running,
      block descriptors could become invalid.
      
      The buf::Block_hint class was essentially duplicating
      a buf_pool.page_hash lookup that was executed in
      buf_page_optimistic_get() anyway. For better locality of
      reference, we had better execute that lookup only once.
      
      buf_page_optimistic_fix(): Prepare for buf_page_optimistic_get().
      This basically is a simpler version of buf::Block_hint.
      
      buf_page_optimistic_get(): Assume that buf_page_optimistic_fix()
      has been called and the page identifier verified. Should the block
      be evicted, the block->modify_clock will be invalidated; we do not
      need to check the block->page.id() again. It suffices to check
      the block->modify_clock after acquiring the page latch.
      
      btr_pcur_t::old_page_id: Store the expected page identifier
      for buf_page_optimistic_fix().
      
      btr_pcur_t::block_when_stored: Remove. This was duplicating
      page_cur_t::block.
      
      btr_pcur_optimistic_latch_leaves(): Remove redundant parameters.
      First, invoke buf_page_optimistic_fix() on the requested page.
      If needed, acquire a latch on the left page. Finally, acquire
      a latch on the target page and recheck the block->modify_clock.
      If the page had been freed while we were not holding a page latch,
      fall back to the slow path. Validate the FIL_PAGE_PREV after
      acquiring a latch on the current page. The block->modify_clock
      is only being incremented when records are deleted or pages
      reorganized or evicted; it does not guard against concurrent
      page splits.
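
The two-step optimistic pattern described above can be sketched roughly as follows. This is a simplified model under stated assumptions: `Block`, `SavedPosition`, `fix_if_same_page()` and `still_valid_after_latch()` are hypothetical names, and real buffer-fixing and latching are omitted.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, heavily simplified block descriptor.
struct Block {
  uint64_t page_id;
  uint64_t modify_clock;  // bumped when records are deleted or the page
                          // is reorganized or evicted
};

// What a persistent cursor might remember when it stores its position.
struct SavedPosition {
  uint64_t page_id;
  uint64_t modify_clock;
};

inline SavedPosition store_position(const Block &b) {
  return {b.page_id, b.modify_clock};
}

// Step 1: verify that the block still holds the expected page.
inline bool fix_if_same_page(const Block &b, const SavedPosition &pos) {
  return b.page_id == pos.page_id;
}

// Step 2 (after the page latch would have been acquired): only the
// modification counter needs rechecking. On mismatch, the caller falls
// back to the slow path.
inline bool still_valid_after_latch(const Block &b, const SavedPosition &pos) {
  return b.modify_clock == pos.modify_clock;
}
```

As the commit message notes, the counter does not guard against concurrent page splits, which is why the real code additionally validates FIL_PAGE_PREV.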
      
      Reviewed by: Debarun Banerjee
      a4cda66e
    • MDEV-33668: More precise dependency tracking of XA XID in parallel replication · d90a2b44
      Kristian Nielsen authored
      Keep track of each recently active XID, recording which worker it was queued
      on. If an XID might still be active, choose the same worker to queue event
      groups that refer to the same XID to avoid conflicts.
      
      Otherwise, schedule the XID freely in the next round-robin slot.
      
      This way, XA PREPARE can normally be scheduled without restrictions (unless
      duplicate XID transactions come close together). This improves scheduling
      and parallelism over the old method, where the worker thread to schedule XA
      PREPARE on was fixed based on a hash value of the XID.
      
      XA COMMIT will normally be scheduled on the same worker as XA PREPARE, but
      can be a different one if the XA PREPARE is far back in the event history.
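
A minimal sketch of this scheduling idea, assuming a simple sequence-number window as the measure of "recently active". The class `XidScheduler` and its members are hypothetical and do not mirror the server's data structures.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical scheduler sketch; not the server's actual implementation.
class XidScheduler {
 public:
  XidScheduler(size_t workers, uint64_t window)
      : workers_(workers), window_(window) {}

  // Pick a worker for an event group that refers to `xid`.
  size_t choose(const std::string &xid) {
    ++seq_;
    auto it = recent_.find(xid);
    size_t w;
    if (it != recent_.end() && seq_ - it->second.seq <= window_) {
      w = it->second.worker;  // XID may still be active: reuse its worker
    } else {
      w = next_;              // otherwise take the next round-robin slot
      next_ = (next_ + 1) % workers_;
    }
    recent_[xid] = {w, seq_};
    return w;
  }

 private:
  struct Entry {
    size_t worker;
    uint64_t seq;
  };
  size_t workers_;
  uint64_t window_;  // assumed measure of "recently active"
  uint64_t seq_ = 0;
  size_t next_ = 0;
  std::unordered_map<std::string, Entry> recent_;
};
```

With four workers and a window of two event groups, a second reference to the same XID right after the first lands on the same worker, while a reference far back in history is scheduled freely.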
      
      Testcase and code for trimming dynamic array due to Andrei.
      Reviewed-by: Andrei Elkin <andrei.elkin@mariadb.com>
      Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
      d90a2b44
    • MDEV-33668: Refactor parallel replication round-robin scheduling to use explicit FIFO · f9ecaa87
      Kristian Nielsen authored
      This is a preparatory patch to facilitate the next commit to improve
      the scheduling of XA transactions in parallel replication.
      
      When choosing the scheduling bucket for the next event group in
      rpl_parallel_entry::choose_thread(), use an explicit FIFO for the
      round-robin selection instead of a simple cyclic counter i := (i+1) % N.
      
      This makes it possible to schedule XA COMMIT/ROLLBACK dependencies explicitly without
      changing the round-robin scheduling of other event groups.
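
To illustrate why an explicit FIFO is more flexible than a cyclic counter, here is a small sketch. The class `SlotFifo` and its methods are hypothetical and only model the idea, not rpl_parallel_entry::choose_thread() itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical sketch: round-robin over worker slots kept in an explicit FIFO.
class SlotFifo {
 public:
  explicit SlotFifo(size_t n) {
    for (size_t i = 0; i < n; ++i) fifo_.push_back(i);
  }

  // Plain round-robin: take the slot at the head and requeue it at the tail.
  size_t next() {
    size_t slot = fifo_.front();
    fifo_.pop_front();
    fifo_.push_back(slot);
    return slot;
  }

  // Explicit choice (for example, a dependent event group): use `slot` now
  // and move it to the tail, leaving the relative order of the other slots
  // unchanged.
  size_t use(size_t slot) {
    auto it = std::find(fifo_.begin(), fifo_.end(), slot);
    assert(it != fifo_.end());
    fifo_.erase(it);
    fifo_.push_back(slot);
    return slot;
  }

 private:
  std::deque<size_t> fifo_;
};
```

With a counter `i := (i+1) % N`, picking a specific slot out of turn would either be ignored by the counter or would shift every later choice; with the FIFO, the remaining slots keep their order.
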
      Reviewed-by: Andrei Elkin <andrei.elkin@mariadb.com>
      Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
      f9ecaa87
  4. 08 Apr, 2024 3 commits
    • MDEV-33672: Gtid_log_event Construction from File Should Ensure Event Length When Using Extra Flags · 89c907bd
      Brandon Nesterenko authored
      A GTID event can have variable length, with contributing factors
      such as the variable length from the flags2 and optional extra flags
      fields. These fields are bitmaps, where each set bit indicates an
      additional value that should be appended to the event, e.g.
      multi-engine transactions append a number to indicate the number of
      additional engines a transaction uses. However, if a flags bit is
      set, and no additional fields are appended to the event, MDEV-33672
      reports that the server can still try to read from memory as if it
      did exist. Note, however, in debug builds, this condition is
      asserted for FL_EXTRA_MULTI_ENGINE.
      
      This patch fixes this to check that the length of the event is
      aligned with the expectation set by the flags for FL_PREPARED_XA,
      FL_COMPLETED_XA, and FL_EXTRA_MULTI_ENGINE.
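
The general shape of such a check can be sketched as follows. The flag bits, the field layout and the function `event_length_ok()` are hypothetical and chosen only for illustration; they are not the actual Gtid_log_event format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical flag bits and layout, chosen only for illustration.
constexpr uint8_t kFlagHasXid = 0x01;          // a length byte plus XID bytes follow
constexpr uint8_t kFlagHasEngineCount = 0x02;  // one extra byte follows

// Returns true only if every field announced by the flags fits within
// `len` bytes of `buf`; nothing is read past the end of the event.
inline bool event_length_ok(const uint8_t *buf, size_t len) {
  if (len < 1) return false;
  const uint8_t flags = buf[0];
  size_t pos = 1;
  if (flags & kFlagHasXid) {
    if (len - pos < 1) return false;
    const size_t xid_len = buf[pos++];
    if (len - pos < xid_len) return false;
    pos += xid_len;
  }
  if (flags & kFlagHasEngineCount) {
    if (len - pos < 1) return false;
    ++pos;
  }
  return true;
}
```

The key point is that each optional field is checked against the remaining length before it is read, so a set flag bit without the corresponding data is rejected rather than trusted.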
      
      Reviewed By
      ============
      Kristian Nielsen <knielsen@knielsen-hq.org>
      89c907bd
    • MDEV-31251 MDEV-30968 breaks running mariabackup on older mariadb (opendir(NULL)) · 11986ec6
      Alexander Barkov authored
      The problem happened when running mariabackup against a pre-MDEV-30971 server,
      i.e. not having yet the system variable @@aria_log_dir_path.
      
      As a result, backup_start() called the function backup_files_from_datadir()
      with a NULL value, which further caused a crash.
      
      Fix:
      Perform this call:
      
          backup_files_from_datadir(.., aria_log_dir_path, ..)
      
      only if aria_log_dir_path is not NULL. Otherwise,
      assume that Aria log files are in their default location,
      so they have already been copied by the previous call:
      
          backup_files_from_datadir(.., fil_path_to_mysql_datadir, ..)
      
      Thanks to Walter Doekes for a patch proposal.
      11986ec6
    • MDEV-33819 The purge of committed history is mis-parsing some log · 73291de7
      Marko Mäkelä authored
      In commit aa719b50 (part of MDEV-32050)
      a bug was introduced in the function purge_sys_t::choose_next_log(),
      which reimplements some logic that previously was part of
      trx_purge_read_undo_rec(). We must invoke trx_undo_get_first_rec()
      with the page number and offset of the undo log header, but we were
      incorrectly invoking it on the current undo page number, which caused
      us to parse undo records starting at an incorrect offset.
      
      purge_sys_t::choose_next_log(): Pass the correct parameter to
      trx_undo_page_get_first_rec().
      
      trx_undo_page_get_next_rec(), trx_undo_page_get_first_rec(),
      trx_undo_page_get_last_rec(): Add debug assertions and make the
      code more robust by returning nullptr on corruption. Should we
      detect any corrupted undo logs during the purge of committed
      transaction history, the sanest thing to do is to pretend that
      the end of an undo log was reached. If any garbage is left in
      the tables, it will be ignored by anything else than
      CHECK TABLE ... EXTENDED, and it can be removed by OPTIMIZE TABLE.
      
      Thanks to Matthias Leich for providing an "rr replay" trace where
      this bug could be found.
      
      Reviewed by: Vladislav Lesin
      73291de7
  5. 05 Apr, 2024 1 commit
    • MDEV-33757 Get rid of TrxUndoRsegs code · a202371f
      Vlad Lesin authored
      Post-push fix: the purge queue array can't be of fixed size, because the
      elements of the array are the analogue of undo logs, which must be processed in
      the order of transaction commits, and the array can contain more
      elements than trx_sys.rseg_array. It is also necessary to maintain the
      min-heap property by the trx_no of the transaction that produced the first
      non-purged undo log in all rsegs. That's why each element of the purge queue
      array must contain not only the trx_sys.rseg_array index but also the trx_no of
      the committed transaction, i.e. the pair (trx_no, trx_sys.rseg_array index),
      which is encoded as uint64_t((trx_no << 8) | (trx_sys.rseg_array index)).
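
The encoding can be illustrated with a short sketch. The helper names (`purge_key()`, `key_trx_no()`, `key_rseg_index()`) and the use of `std::priority_queue` are assumptions made for this example; only the bit layout comes from the commit message.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

// Pack (trx_no, rollback segment index) into one 64-bit key: the low
// 8 bits hold the index, the remaining bits hold trx_no.
inline uint64_t purge_key(uint64_t trx_no, uint8_t rseg_index) {
  return (trx_no << 8) | rseg_index;
}
inline uint64_t key_trx_no(uint64_t key) { return key >> 8; }
inline uint8_t key_rseg_index(uint64_t key) { return uint8_t(key & 0xff); }

// A min-heap of such keys orders elements by trx_no first, because trx_no
// occupies the most significant bits.
using PurgeQueue =
    std::priority_queue<uint64_t, std::vector<uint64_t>, std::greater<uint64_t>>;
```

Because the comparison is a plain integer comparison, the heap needs no custom comparator that unpacks the pair, and the container can grow beyond the number of rollback segments.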
      
      Reviewed by: Marko Mäkelä
      a202371f
  6. 03 Apr, 2024 2 commits
    • MDEV-33799: mysql_manager_submit Segfault at Startup Still Possible During Recovery · 9a4991a0
      Brandon Nesterenko authored
      MDEV-26473 fixed a segmentation fault at startup between the handle
      manager thread and the binlog background thread, such that the
      binlog background thread could be started and submit a job to the
      handle manager, before it had initialized. Where MDEV-26473 made it
      so the handle manager would initialize before the main thread
      started the normal binary logs, it did not account for the recovery
      case. That is, there is still a possibility of a segmentation fault
      when a server is recovering using the binary logs such that it can
      open the binary logs, start the binlog background thread, and submit
      a job to the handle manager before it is initialized.
      
      This patch fixes this by moving the initialization of the mysql
      handler manager to happen prior to recovery.
      
      Reviewed By:
      ============
      Andrei Elkin <andrei.elkin@mariadb.com>
      9a4991a0
    • MDEV-33757 Get rid of TrxUndoRsegs code · 722df777
      Vlad Lesin authored
      TrxUndoRsegs is a wrapper for a vector of trx_rseg_t*. It has two
      constructors; both initialize the vector with only one element, and they
      are used to push a transaction's rseg (singular) to the purge queue. There is
      no function to add elements to the vector. The default constructor is used
      only for the declaration of NullElement.
      
      TrxUndoRsegs was introduced in WL#6915 in MySQL 5.7. MySQL 5.7
      would unnecessarily let the purge of history parse the
      temporary undo records, and then look up the table (via a global hash
      table), and only at the point of processing the parsed undo log record
      determine that the table is a temporary table and the undo record must be
      thrown away.
      
      In MariaDB 10.2 we have two disjoint sets of rollback segments (128 for
      persistent, 128 for temporary), and purge does not even see the temporary
      tables. The only reason why temporary tables are visible to other threads
      is a SQL layer bug (MDEV-17805).
      
      purge_sys_t::choose_next_log(): merge the relevant part
      of TrxUndoRsegsIterator::set_next() to the start of
      purge_sys_t::choose_next_log().
      
      purge_sys_t::rseg_get_next_history_log(): add a tail call of
      purge_sys_t::choose_next_log() and adjust the callers, to simplify the
      control flow further.
      
      purge_sys.pq_mutex and purge_sys.purge_queue: make them private by adding
      some simple accessor functions.
      
      trx_purge_cleanse_purge_queue(): make it a member of purge_sys_t to
      have access to the private purge_sys.pq_mutex and purge_sys.purge_queue, and
      simplify the code by using a simple array copy and clearing the purge queue
      instead of popping each purge queue element.
      
      rseg_t::last_commit_and_offset: exchange trx_no and offset bits to avoid
      bitwise operations during pushing to/popping from purge queue.
      
      Thanks to Marko Mäkelä for the historical overview of TrxUndoRsegs development.
      
      Reviewed by: Marko Mäkelä
      722df777
  7. 27 Mar, 2024 7 commits
    • Merge 10.5 into 10.6 · ccb7a1e9
      Marko Mäkelä authored
      ccb7a1e9
    • MDEV-33772 Bad SEPARATOR value in GROUP_CONCAT on character set conversion · 0fc123c5
      Alexander Barkov authored
      Item_func_group_concat::print() did not take into account
      that Item_func_group_concat::separator can be of a different character set
      than the "String *str" (which the printing is being done to).
      Therefore, printing did not work correctly for:
      - non-ASCII separators when GROUP_CONCAT is done on 8bit data
        or multi-byte data with mbminlen==1.
      - all separators (even including simple ones like comma)
        when GROUP_CONCAT is done on ucs2/utf16/utf32 data (mbminlen>1).
      
      Because of this problem, VIEW definitions did not print correctly to
      their FRM files. This later led to a wrong SELECT and SHOW CREATE output.
      
      Fix:
      
      - Adding new String methods:
      
        bool append_for_single_quote_using_mb_wc(const char *str, size_t length,
                                                 CHARSET_INFO *cs);
      
        bool append_for_single_quote_opt_convert(const char *str,
                                                 size_t length,
                                                 CHARSET_INFO *cs)
      
        which perform both escaping and character set conversion at the same time.
      
      - Adding a new String method escaped_wc_for_single_quote(),
        to reuse the code between the old and the new methods.
      
      - Fixing Item_func_group_concat::print() to use the new
        method append_for_single_quote_opt_convert().
      0fc123c5
    • MDEV-33460 select '123' 'x'; unexpected result · 58df2097
      Dave Gosselin authored
      Queries that select concatenated constant strings now have
      colname and value that match.  For example,
        SELECT '123' 'x';
      will return a result where the column name and value both
      are '123x'.
      
      Review: Daniel Black
      58df2097
    • MDEV-33301 memlock with systemd still not working · 76a27155
      Daniel Black authored
      .. even with MDEV-9095 fix
      
      CapabilityBounding sets require filesystem setcap attributes
      for the executable to gain privileges during execution.
      
      A side effect of this, however, is that getauxval(AT_SECURE) gets
      set, and the OPENSSL_CONF environment variable, which OpenSSL reads
      internally via secure_getenv, will get ignored (openssl gh issue
      21770).
      
      According to capabilities(7), ambient capabilities don't cause
      ld.so to enter the secure-execution mode.
      
      Include SELinux and AppArmor capabilities for ipc_lock.
      76a27155
    • Revert "MDEV-33636: RPM caps is on mariadbd exe" · ee2ed1a0
      Daniel Black authored
      This was the original implementation that reverted with a bunch of
      commits.
      
      This reverts commit a13e521b.
      
      Revert "cmake: append to the array correctly"
      This reverts commit 51e3f1da.
      
      Revert "build failure with cmake < 3.10"
      This reverts commit 49cf702e.
      
      Revert "MDEV-33301 memlock with systemd still not working"
      This reverts commit 8a1904d7.
      ee2ed1a0
    • MDEV-33039 Galera test failure on mysql-wsrep-features#165 · c5ac9836
      Jan Lindström authored
      We should not set debug sync point when holding a mutex
      to avoid mutex ordering failure.
      Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
      c5ac9836
    • MDEV-33136: Properly BF-abort user transactions with explicit locks · 7bf3c312
      Denis Protivensky authored
      User transactions may acquire explicit MDL locks from InnoDB level
      when persistent statistics is re-read for a table.
      If such a transaction would be subject to BF-abort, it was improperly
      detected as a system transaction and wouldn't get aborted.
      
      The fix: Check if a transaction holding explicit MDL locks is a user
      transaction in the MDL conflict handling code.
      Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
      7bf3c312
  8. 26 Mar, 2024 2 commits
    • MDEV-33506 Show original IP in the "aborted" message. · 318000cf
      Vladislav Vaintroub authored
      Add a "real ip:<ip_or_localhost>" part to the aborted message,
      but only for proxy-protocol connections, so that it does not cause
      confusion for normal users.
      318000cf
    • MDEV-33278 : Assertion failure in thd_get_thread_id at lock_wait_wsrep · b762541d
      Jan Lindström authored
      The problem is that not all conflicting transactions have a THD object.
      Therefore, it must be checked that the victim has a THD
      before its identification is added to the victim list, as the victim's
      thread identification is later requested using the thd_get_thread_id
      function, which requires a valid pointer to a THD object
      in trx->mysql_thd.
      
      Victim might not have trx->mysql_thd in two cases:
      
      (1) An incomplete transaction that was recovered from undo logs
      on server startup (and not yet rolled back).
      
      (2) Transaction that is in XA PREPARE state and whose client
      connection was disconnected.
      
      Neither of these can complete before lock_wait_wsrep()
      releases lock_sys.latch.
      
      (1) trx_t::commit_in_memory() is clearing both
      trx_t::state and trx_t::is_recovered before it invokes
      lock_release(trx_t*) (which would be blocked by the exclusive
      lock_sys.latch that we are holding here). Hence, it is not
      possible to write a debug assertion to document this scenario.
      
      (2) If the transaction is in XA PREPARE state, it would eventually be rolled
      back and the lock conflict would be resolved when an XA COMMIT
      or XA ROLLBACK statement is executed in some other connection.
      Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>
      b762541d
  9. 25 Mar, 2024 3 commits
  10. 22 Mar, 2024 4 commits
    • MDEV-32364 fixup: crash in ut_dontdump() · 70b90772
      Marko Mäkelä authored
      70b90772
    • MDEV-33591 MONITOR_INC_VALUE_CUMULATIVE is executed regardless of "if" condition · f0590db5
      Marko Mäkelä authored
      MONITOR_INC_VALUE_CUMULATIVE is a multiline macro, so the second statement
      will be executed always, regardless of "if" condition.
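
The pitfall is easy to reproduce in isolation. In the sketch below, the counters and the macros `INC_UNSAFE` and `INC_SAFE` are hypothetical stand-ins for the monitor macro; wrapping the body in `do { ... } while (0)` is one common remedy and is not necessarily the change made in this commit.

```cpp
#include <cassert>

// Hypothetical counters standing in for monitor values.
static int value = 0;
static int cumulative = 0;

// Problematic shape: two statements. After `if (cond)`, only the first
// statement is guarded; the second one always runs.
#define INC_UNSAFE(n) value += (n); cumulative += (n)

// Common remedy: wrap the body so that it behaves as a single statement.
#define INC_SAFE(n) do { value += (n); cumulative += (n); } while (0)

inline void demo_unsafe(bool cond) {
  value = cumulative = 0;
  if (cond) INC_UNSAFE(1);  // cumulative is incremented even if cond is false
}

inline void demo_safe(bool cond) {
  value = cumulative = 0;
  if (cond) INC_SAFE(1);    // both increments are guarded by cond
}
```

Calling `demo_unsafe(false)` leaves `value` at 0 but bumps `cumulative` to 1, which is exactly the kind of skew described in the report.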
      
      These problems first started with
      commit b1ab211d (MDEV-15053).
      
      Thanks to Yury Chaikou from ServiceNow for the report.
      f0590db5
    • MDEV-33454 release row locks for non-modified rows at XA PREPARE · 17e59ed3
      Marko Mäkelä authored
      From the correctness point of view, it should be safe to release
      all locks on index records that were not modified by the transaction.
      Doing so should make the locks after XA PREPARE fully compatible
      with what would happen if the server were restarted: InnoDB table
      IX locks and exclusive record locks would be resurrected based on
      undo log records.
      
      Concurrently running transactions that are waiting for a lock may invoke
      lock_rec_convert_impl_to_expl() to create an explicit record lock object
      on behalf of the lock-owning transaction so that they can attach
      their waiting lock request on the explicit record lock object. Explicit
      locks would be released by trx_t::release_locks() during commit or
      rollback.
      
      Any clustered index record whose DB_TRX_ID belongs to a transaction that
      is in active or XA PREPARE state will be implicitly locked by that
      transaction. On XA PREPARE, we can release explicit exclusive locks on
      records whose DB_TRX_ID does not match the current transaction identifier.
      
      lock_rec_unlock_unmodified(): Release record locks that are not implicitly
      held by the current transaction.
      
      lock_release_on_prepare_try(), lock_release_on_prepare():
      Invoke lock_rec_unlock_unmodified().
      
      row_trx_id_offset(): Declare non-static.
      
      lock_rec_unlock(): Replaces lock_rec_unlock_supremum().
      
      Reviewed by: Vladislav Lesin
      17e59ed3
    • MDEV-33613 InnoDB may still hang when temporarily running out of buffer pool · fa8a46eb
      Marko Mäkelä authored
      By design, InnoDB has always hung when permanently running out of
      buffer pool, for example when several threads are waiting to allocate
      a block, and all of the buffer pool is buffer-fixed by the active threads.
      
      The hang that we are fixing here occurs when the buffer pool is only
      temporarily running out and the situation could be rescued by writing out
      some dirty pages or evicting some clean pages.
      
      buf_LRU_get_free_block(): Simplify how we wait for
      the buf_flush_page_cleaner thread. This fixes occasional hangs
      of the test encryption.innochecksum that were introduced by
      commit a55b951e (MDEV-26827).
      To play it safe, we use a timed wait when waiting for the
      buf_flush_page_cleaner() thread to perform its job. Should that
      thread get stuck, we will invoke buf_pool.LRU_warn() in order to
      display a message that pages could not be freed, and keep trying
      to wake up the buf_flush_page_cleaner() thread.
      
      The INFORMATION_SCHEMA.INNODB_METRICS counters
      buffer_LRU_single_flush_failure_count and
      buffer_LRU_get_free_waits will be removed.
      The latter is represented by buffer_pool_wait_free.
      
      Also removed will be the message
      "InnoDB: Difficult to find free blocks in the buffer pool"
      because in d34479dc we
      introduced a more precise message
      "InnoDB: Could not free any blocks in the buffer pool"
      in the buf_flush_page_cleaner thread.
      
      buf_pool_t::LRU_warn(): Issue the warning message that we could
      not free any blocks in the buffer pool. This may also be invoked
      by buf_LRU_get_free_block() if buf_flush_page_cleaner() appears
      to be stuck.
      
      buf_pool_t::n_flush_dec(): Remove.
      
      buf_pool_t::n_flush_dec_holding_mutex(): Rename to n_flush_dec().
      
      buf_flush_LRU_list_batch(): Increment the eviction counter for blocks
      of temporary, discarded or dropped tablespaces.
      
      buf_flush_LRU(): Make static, and remove the constant parameter
      evict=false. The only caller will be the buf_flush_page_cleaner()
      thread.
      
      IORequest::is_LRU(): Remove. The only case of evicting pages on
      write completion will be when we are writing out pages of the
      temporary tablespace. Those pages are not in buf_pool.flush_list,
      only in buf_pool.LRU.
      
      buf_page_t::flush(): Remove the parameter evict.
      
      buf_page_t::write_complete(): Change the parameter "bool temporary"
      to "bool persistent" and add a parameter for an already read state().
      
      Reviewed by: Debarun Banerjee
      fa8a46eb
  11. 21 Mar, 2024 1 commit
    • MDEV-33551: Semi-sync Wait Point AFTER_COMMIT Slow on Workloads with Heavy Concurrency · 75c7c6dc
      Brandon Nesterenko authored
      When using semi-sync replication with
      rpl_semi_sync_master_wait_point=AFTER_COMMIT, the performance of the
      primary can degrade significantly compared to AFTER_SYNC
      for workloads with many concurrent users executing
      transactions. This is because all connections on the primary share
      the same cond_wait variable/mutex pair, so any time an ACK is
      received from a replica, all waiting connections are awoken to check
      whether the ACK was for them, which is done in mutual exclusion.
      
      This patch changes this such that the waiting THD will use its own
      local condition variable, and the ACK receiver thread only signals
      connections which have been ACKed for wakeup. That is, the
      THD::LOCK_wakeup_ready condition variable is re-used for this
      purpose, and the Active_tranx queue nodes are extended to hold the
      waiting thread, so it can be signalled once ACKed.
      
      Additionally:
      
       1)  Removed part of MDEV-11853 additions, which allowed suspended
      connection threads awaiting their semi-sync ACKs to live until their
      ACKs had been received. This part, however, wasn't needed.  That is,
      all that was needed was for the Ack_thread to survive.  So now the
      connection threads are killed during phase 1. Consequently,
      THD::is_awaiting_semisync_ack and all its related code were removed.
      
       2) COND_binlog_send is repurposed to signal on the condition when
      Active_tranx is emptied during clear_active_tranx_nodes.
      
       3) At master shutdown (when waiting for slaves), instead of the
      main loop individually waiting for each ACK, await_slave_reply()
      (renamed await_all_slave_replies()) just waits once for the
      repurposed COND_binlog_send to signal it is empty.
      
       4) Test rpl_semi_sync_shutdown_await_ack is updated as follows:
         4.1) Added test case (adapted from Kristian Nielsen) to ensure
      that if a thread awaiting its ACK is killed while SHUTDOWN WAIT FOR
      ALL SLAVES is issued, the primary will still wait for the ACK from
      the killed thread.
         4.2) As connections which by-passed phase 1 of thread killing no
      longer are delayed for kill until phase 2, we can no longer query
      yes/no tx after receiving an ACK/timeout. The check for these
      variables is removed.
         4.3) Comments that mention the connection being alive are
      updated to refer to the Ack_thread instead.
      
      Reviewed By:
      ============
      Kristian Nielsen <knielsen@knielsen-hq.org>
      75c7c6dc
  12. 20 Mar, 2024 1 commit
    • MDEV-26642/MDEV-26643/MDEV-32898 Implement innodb_snapshot_isolation · b8a67198
      Marko Mäkelä authored
      https://jepsen.io/analyses/mysql-8.0.34 highlights that the
      transaction isolation levels in the InnoDB storage engine do not
      correspond to any widely accepted definitions, such as
      "Generalized Isolation Level Definitions"
      https://pmg.csail.mit.edu/papers/icde00.pdf
      (PL-1 = READ UNCOMMITTED, PL-2 = READ COMMITTED, PL-2.99 = REPEATABLE READ,
      PL-3 = SERIALIZABLE).
      Only READ UNCOMMITTED in InnoDB seems to match the above definition.
      
      The issue is that InnoDB does not detect write/write conflicts
      (Section 4.4.3, Definition 6) in the above.
      
      It appears that as soon as we implement write/write conflict detection
      (SET SESSION innodb_snapshot_isolation=ON), the default isolation level
      (SET TRANSACTION ISOLATION LEVEL REPEATABLE READ) will become
      Snapshot Isolation (similar to Postgres), as defined in Section 4.2 of
      "A Critique of ANSI SQL Isolation Levels", MSR-TR-95-51, June 1995
      https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-95-51.pdf
      
      Locking reads inside InnoDB used to read the latest committed version,
      ignoring what should actually be visible to the transaction.
      The added test innodb.lock_isolation illustrates this. The statement
      	UPDATE t SET a=3 WHERE b=2;
      is executed in a transaction that was started before a read view or
      a snapshot of the current transaction was created, and committed before
      the current transaction attempts to execute
      	UPDATE t SET b=3;
      If SET innodb_snapshot_isolation=ON is in effect when the second
      transaction was started, the second transaction will be aborted with
      the error ER_CHECKREAD. By default (innodb_snapshot_isolation=OFF),
      the second transaction would execute inconsistently, displaying an
      incorrect SELECT COUNT(*) FROM t in its read view.
      
      If innodb_snapshot_isolation=ON and an attempt is made to acquire a
      lock on a record that does not exist in the current read view, the error
      DB_RECORD_CHANGED (HA_ERR_RECORD_CHANGED, ER_CHECKREAD) will
      be raised. This error will be treated in the same way as a deadlock:
      the transaction will be rolled back.
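
The rule can be sketched with a deliberately simplified model. The types `ReadView` and `LockResult` and the function `lock_for_update()` are hypothetical; a real read view also tracks the set of transactions that were active when it was created, which this sketch reduces to a single threshold.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, heavily simplified read view: a transaction id is visible
// only if it is below the threshold captured when the view was created.
struct ReadView {
  uint64_t low_limit_id;  // ids >= this were not committed at view creation
  bool sees(uint64_t trx_id) const { return trx_id < low_limit_id; }
};

enum class LockResult { Granted, RecordChanged };

// If the latest version of the record was written by a transaction that the
// read view cannot see, refuse the lock instead of silently locking newer
// data. The caller is expected to roll back, as it would for a deadlock.
inline LockResult lock_for_update(const ReadView &view, uint64_t rec_trx_id,
                                  bool snapshot_isolation) {
  if (snapshot_isolation && !view.sees(rec_trx_id))
    return LockResult::RecordChanged;
  return LockResult::Granted;
}
```

With the setting off, the same call is granted, which corresponds to the older behaviour of locking reads operating on the latest committed version.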
      
      lock_clust_rec_read_check_and_lock(): If the current transaction has
      a read view where the record is not visible and
      innodb_snapshot_isolation=ON, fail before trying to acquire the lock.
      
      row_sel_build_committed_vers_for_mysql(): If innodb_snapshot_isolation=ON,
      disable the "semi-consistent read" logic that I had implemented
      under the direction of Heikki Tuuri in order to address
      https://bugs.mysql.com/bug.php?id=3300 that was motivated by a customer
      wanting UPDATE to skip locked rows that do not match the WHERE condition.
      It looks like my changes were included in the MySQL 5.1.5
      commit ad126d90; at that time, employees
      of Innobase Oy (a recent acquisition of Oracle) had lost write access to
      the repository.
      
      The only reason why we set innodb_snapshot_isolation=OFF by default is
      backward compatibility with applications, such as the one that motivated
      the implementation of "semi-consistent read" back in 2005. In a later
      major release, we can default to innodb_snapshot_isolation=ON.
      
      Thanks to Peter Alvaro, Kyle Kingsbury and Alexey Gotsman for their work
      on https://github.com/jepsen-io/ and to Kyle and Alexey for explanations
      and some testing of this fix.
      
      Thanks to Vladislav Lesin for the initial test for MDEV-26643,
      as well as reviewing these changes.
      b8a67198
  13. 19 Mar, 2024 4 commits
    • Brandon Nesterenko's avatar
      MDEV-33716: rpl.rpl_semi_sync_slave_enabled_consistent Fails with Error Condition Reached · ca07f629
      Brandon Nesterenko authored
      Though the test itself doesn't create any transactions
      directly, the added test suppressions are replicated,
      and when the SQL thread is stopped mid-execution,
      it is set into an error state because these are
      non-transactional events being aborted.
      
      This patch fixes the test by ensuring that the test
      suppressions are fully replicated before continuing.
      ca07f629
    • Thirunarayanan Balathandayuthapani's avatar
      MDEV-33542 Inplace algorithm occupies more disk space compared to copy algorithm · c3a6248b
      Thirunarayanan Balathandayuthapani authored
      Problem:
      =======
      - With large files, InnoDB eagerly adds a new extent even though
      the segment still has many unused pages. The reason is that with
      larger files, the threshold (1/8 of reserved pages) for adding a
      new extent is reached frequently.
      
      Solution:
      =========
      - Try to use the unused pages in the segment before adding
      a new extent to the file segment.
      
      need_for_new_extent(): For larger files, use
      4 * FSP_EXTENT_SIZE as the threshold for allocating a new extent.
      
      fseg_alloc_free_page_low(): Rewrote the function to allocate
      the page in the following order.
      1) Try to get the page from an existing extent of the segment.
      2) Check whether the segment needs a new extent
      (need_for_new_extent()); if so, allocate the new extent
      and find the page in it.
      3) Take an individual unused page from the
      segment or the tablespace.
      4) Allocate a new extent and take the first page from it.
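
      The order above can be sketched as a small decision function. This is
      an illustration under stated assumptions, not the InnoDB code:
      kExtentPages, SegmentState, PageSource and both rule functions are
      hypothetical, and capping the 1/8 threshold at 4 extents is only one
      possible reading of "use 4 * FSP_EXTENT_SIZE as threshold".

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical extent size in pages; stands in for FSP_EXTENT_SIZE.
constexpr std::uint32_t kExtentPages = 64;

// Old-style rule as described in the message: ask for a new extent once
// fewer than 1/8 of the reserved pages are unused.
bool old_rule_wants_extent(std::uint32_t reserved, std::uint32_t used) {
  assert(used <= reserved);
  return reserved - used < reserved / 8;
}

// Assumed new rule: the same test, but the threshold never exceeds
// 4 extents, so a large segment must use up more of its free pages first.
bool capped_rule_wants_extent(std::uint32_t reserved, std::uint32_t used) {
  assert(used <= reserved);
  const std::uint32_t threshold = std::min(reserved / 8, 4 * kExtentPages);
  return reserved - used < threshold;
}

enum class PageSource {
  kExistingExtent,        // step 1
  kNewExtentByThreshold,  // step 2
  kSinglePage,            // step 3
  kNewExtentFallback      // step 4
};

// Hypothetical summary of what the allocator knows about the segment.
struct SegmentState {
  std::uint32_t reserved;
  std::uint32_t used;
  bool free_page_in_extents;   // an existing extent still has a free page
  bool single_page_available;  // an individual unused page can be taken
};

// Walk the four steps in the order listed in the commit message.
PageSource choose_page_source(const SegmentState& s) {
  if (s.free_page_in_extents) return PageSource::kExistingExtent;
  if (capped_rule_wants_extent(s.reserved, s.used))
    return PageSource::kNewExtentByThreshold;
  if (s.single_page_available) return PageSource::kSinglePage;
  return PageSource::kNewExtentFallback;
}
```

      For example, with 64000 reserved and 63000 used pages the old rule
      would already request a new extent (1000 unused is below 8000), while
      the capped rule would not (1000 unused is above 256).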
      
      Removed FSEG_FILLFACTOR and FSEG_FRAG_LIMIT.
      c3a6248b
    • Vladislav Vaintroub's avatar
      MDEV-33723 Mroonga ignored WITHOUT_DYNAMIC_PLUGINS · 7d36919f
      Vladislav Vaintroub authored
      
      Make WITHOUT_DYNAMIC_PLUGINS ignore Mroonga also in its own DIY version
      of MYSQL_ADD_PLUGIN.
      7d36919f
    • Vladislav Vaintroub's avatar
      MDEV-23224 Windows threadpool - use better threadpool_max_threads default. · 5b4e69c0
      Vladislav Vaintroub authored
      Use max_connections in the calculation, to prevent a possible deadlock
      if max_connections is high.
      5b4e69c0
  14. 18 Mar, 2024 4 commits
    • Vladislav Vaintroub's avatar
      Post-fix 567c0973 · 01d994b3
      Vladislav Vaintroub authored
      Do *not* check if socket is closed by another thread. This is
      race-condition prone, unnecessary, and harmful. VIO state was introduced
      to debug the errors, not to change the behavior.
      
      Rather than checking if socket is closed, add a DBUG_ASSERT that it is
      *not* closed, because this is an actual logic error, and can potentially
      lead to all sorts of funny behavior like writing error packets to InnoDB
      files.
      
      Unlike closesocket(), shutdown(2) is not actually race-condition prone:
      it interrupts poll() and read(), it has worked for longer than a decade,
      and it does not need any state check in the code.
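
      A minimal POSIX sketch of why shutdown(2) is the safer primitive. It
      is single-threaded and uses a socketpair, so it does not reproduce
      the server's VIO code or the Windows closesocket() path;
      ShutdownProbe and probe_shutdown() are hypothetical names. After
      shutdown(), poll() and read() report end of stream, yet the
      descriptor stays open, so its number cannot be reused by an unrelated
      file while another thread still holds it.

```cpp
#include <cassert>
#include <fcntl.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

// What a reader observes on a socket after shutdown(SHUT_RDWR).
struct ShutdownProbe {
  bool poll_ready;     // poll() reports the descriptor as ready
  long read_result;    // read() return value (0 means end of stream)
  bool fd_still_open;  // the descriptor number is still valid
};

ShutdownProbe probe_shutdown() {
  int fds[2];
  int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, fds);
  assert(rc == 0);

  // Disable further sends and receives without releasing the descriptor.
  rc = shutdown(fds[0], SHUT_RDWR);
  assert(rc == 0);
  (void) rc;  // silence unused-variable warnings when NDEBUG is defined

  ShutdownProbe p{};
  pollfd pfd{fds[0], POLLIN, 0};
  p.poll_ready = poll(&pfd, 1, 0) == 1;  // returns at once, no blocking

  char buf[16];
  p.read_result = static_cast<long>(read(fds[0], buf, sizeof buf));

  // The descriptor is released only by close(), so it is still valid here.
  p.fd_still_open = fcntl(fds[0], F_GETFD) != -1;

  close(fds[0]);
  close(fds[1]);
  return p;
}
```

      Closing the descriptor instead would free its number immediately,
      which is the kind of race the message warns about.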
      01d994b3
    • Daniel Black's avatar
      MDEV-33636: RPM caps is on mariadbd exe · a13e521b
      Daniel Black authored
      Post-fix for 51e3f1da: the
      capabilities should be set on the mariadbd executable itself
      rather than on a symlink.
      a13e521b
    • Marko Mäkelä's avatar
      Merge 10.5 into 10.6 · 50715bd2
      Marko Mäkelä authored
      50715bd2
    • Marko Mäkelä's avatar
      Work around missing MSAN instrumentation · 4592af2e
      Marko Mäkelä authored
      Let us skip the recently added test main.mysql-interactive if
      an instrumented ncurses library is not available.
      
      In InnoDB, let us work around an uninstrumented libnuma, by
      declaring that the objects returned by numa_get_mems_allowed()
      are initialized.
      4592af2e