  1. 16 Mar, 2023 5 commits
    • MDEV-26055: Improve adaptive flushing · 9593cccf
      Marko Mäkelä authored
      Adaptive flushing is enabled by setting innodb_max_dirty_pages_pct_lwm>0
      (not default) and innodb_adaptive_flushing=ON (default).
      There is also the parameter innodb_adaptive_flushing_lwm
      (default: 10 per cent of the log capacity). It should enable some
      adaptive flushing even when innodb_max_dirty_pages_pct_lwm=0.
      That is not being changed here.
      
      This idea was first presented by Inaam Rana several years ago,
      and I discussed it with Jean-François Gagné at FOSDEM 2023.
      
      buf_flush_page_cleaner(): When we are not near the log capacity limit
      (neither buf_flush_async_lsn nor buf_flush_sync_lsn is set),
      also try to move clean blocks from the buf_pool.LRU list to buf_pool.free
      or initiate writes (but not the eviction) of dirty blocks, until
      the remaining I/O capacity has been consumed.
      
      buf_flush_LRU_list_batch(): Add the parameter bool evict, to specify
      whether dirty least recently used pages (from buf_pool.LRU) should
      be evicted immediately after they have been written out. Callers outside
      buf_flush_page_cleaner() will pass evict=true, to retain the existing
      behaviour.
      
      buf_do_LRU_batch(): Add the parameter bool evict.
      Return counts of evicted and flushed pages.
      
      buf_flush_LRU(): Add the parameter bool evict.
      Assume that the caller holds buf_pool.mutex and
      will invoke buf_dblwr.flush_buffered_writes() afterwards.
      
      buf_flush_list_holding_mutex(): A low-level variant of buf_flush_list()
      whose caller must hold buf_pool.mutex and invoke
      buf_dblwr.flush_buffered_writes() afterwards.
      
      buf_flush_wait_batch_end_acquiring_mutex(): Remove. It is enough to have
      buf_flush_wait_batch_end().
      
      page_cleaner_flush_pages_recommendation(): Avoid some floating-point
      arithmetic.
      
      buf_flush_page(), buf_flush_check_neighbor(), buf_flush_check_neighbors(),
      buf_flush_try_neighbors(): Rename the parameter "bool lru" to "bool evict".
      
      buf_free_from_unzip_LRU_list_batch(): Remove the parameter.
      Only actual page writes will contribute towards the limit.
      
      buf_LRU_free_page(): Evict freed pages of temporary tables.
      
      buf_pool.done_free: Broadcast whenever a block is freed
      (and buf_pool.try_LRU_scan is set).
      
      buf_pool_t::io_buf_t::reserve(): Retry indefinitely.
      During the test encryption.innochecksum we easily run out of
      these buffers for PAGE_COMPRESSED or ENCRYPTED pages.
      
      Tested by Matthias Leich and Axel Schwenke
    • MDEV-30357 Performance regression in locking reads from secondary indexes · 4105017a
      Marko Mäkelä authored
      lock_sec_rec_some_has_impl(): Remove a harmful condition that caused the
      performance regression and should not have been added in
      commit b6e41e38 in the first place.
      Locking transactions that have not modified any persistent tables
      can carry the transaction identifier 0.
      
      trx_t::max_inactive_id: A cache for trx_sys_t::find_same_or_older().
      The value is not reset on transaction commit so that previous results
      can be reused for subsequent transactions. The smallest active
      transaction ID can only increase over time, not decrease.
      
      trx_sys_t::find_same_or_older(): Remember the maximum previous id for which
      rw_trx_hash.iterate() returned false, to avoid redundant iterations.
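
      The caching idea can be illustrated with a minimal sketch. It is not the
      InnoDB code: ActiveSet, Caller and scan_same_or_older() are hypothetical
      stand-ins for trx_sys_t, trx_t and rw_trx_hash.iterate(). It relies only
      on the property stated above, that the smallest active transaction ID
      never decreases.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

using trx_id_t = std::uint64_t;

// Hypothetical stand-in for the set of active read-write transaction IDs.
struct ActiveSet {
  std::set<trx_id_t> ids;
  unsigned scans = 0;  // counts how often the expensive lookup ran

  // Expensive check: is any active transaction ID <= id?
  bool scan_same_or_older(trx_id_t id) {
    ++scans;
    return !ids.empty() && *ids.begin() <= id;
  }
};

// Hypothetical stand-in for a transaction object that owns the cache.
struct Caller {
  // Largest ID for which the scan returned false so far.
  trx_id_t max_inactive_id = 0;

  bool find_same_or_older(ActiveSet &sys, trx_id_t id) {
    if (id <= max_inactive_id)
      return false;  // already known: nothing this old is still active
    const bool found = sys.scan_same_or_older(id);
    if (!found)
      max_inactive_id = id;  // valid forever, as the minimum active ID only grows
    return found;
  }
};
```

      In this sketch, a second query for an ID at or below a previously
      rejected one returns without touching the shared structure at all.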
      
      lock_sec_rec_read_check_and_lock(): Add an early return in case we are
      already holding a covering table lock.
      
      lock_rec_convert_impl_to_expl(): Add a template parameter to avoid
      a redundant run-time check on whether the index is secondary.
      
      lock_rec_convert_impl_to_expl_for_trx(): Move some code from
      lock_rec_convert_impl_to_expl(), to reduce code duplication due
      to the added template parameter.
      
      Reviewed by: Vladislav Lesin
      Tested by: Matthias Leich
    • MDEV-29835 InnoDB hang on B-tree split or merge · f2096478
      Marko Mäkelä authored
      This is a follow-up to
      commit de4030e4 (MDEV-30400),
      which fixed some hangs related to B-tree split or merge.
      
      btr_root_block_get(): Use and update the root page guess. This is just
      a minor performance optimization, not affecting correctness.
      
      btr_validate_level(): Remove the parameter "lockout", and always
      acquire an exclusive dict_index_t::lock in CHECK TABLE without QUICK.
      This is needed in order to avoid latching order violation in
      btr_page_get_father_node_ptr_for_validate().
      
      btr_cur_need_opposite_intention(): Return true in case
      btr_cur_compress_recommendation() would hold later during the
      mini-transaction, or if a page underflow or overflow is possible.
      If we return true, our caller will escalate to acquiring an exclusive
      dict_index_t::lock, to prevent a latching order violation and deadlock
      during btr_compress() or btr_page_split_and_insert().
      
      btr_cur_t::search_leaf(), btr_cur_t::open_leaf():
      Also invoke btr_cur_need_opposite_intention() on the leaf page.
      
      btr_cur_t::open_leaf(): When escalating to exclusive index locking,
      acquire exclusive latches on all pages as well.
      
      innobase_instant_try(): Return an error code if the root page cannot
      be retrieved.
      
      In addition to the normal stress testing with Random Query Generator (RQG)
      this has been tested with
      ./mtr --mysqld=--loose-innodb-limit-optimistic-insert-debug=2
      but with the injection in btr_cur_optimistic_insert() for non-leaf pages
      adjusted so that it would use the value 3. (Otherwise, infinite page
      splits could occur in some mtr tests.)
      
      Tested by: Matthias Leich
    • Merge 10.5 into 10.6 · 85cbfaef
      Marko Mäkelä authored
    • MDEV-30860 Race condition between buffer pool flush and log file deletion in mariadb-backup --prepare · 1495f057
      Marko Mäkelä authored
      
      srv_start(): If we are going to close the log file in
      mariadb-backup --prepare, call buf_flush_sync() before
      calling recv_sys.debug_free() to ensure that the log file
      will not be accessed.
      
      This fixes a rather rare failure in the test
      mariabackup.innodb_force_recovery where buf_flush_page_cleaner()
      would invoke log_checkpoint_low() because !recv_recovery_is_on()
      would hold due to the fact that recv_sys.debug_free() had
      already been called. Then, the log write for the checkpoint
      would fail because srv_start() had invoked log_sys.log.close_file().
  2. 10 Mar, 2023 2 commits
    • MDEV-30775 Performance regression in fil_space_t::try_to_close() introduced in MDEV-23855 · 7d6b3d40
      Vlad Lesin authored
      fil_node_open_file_low() tries to close files from the top of
      fil_system.space_list if the number of opened files is exceeded.
      
      It invokes fil_space_t::try_to_close(), which iterates the list searching
      for the first opened space. Then it just closes the space, leaving it in
      the same position in fil_system.space_list.
      
      When many files are being opened, as during the execution of
      'SHOW TABLE STATUS ...', and the limit on the number of open files
      is reached, fil_space_t::try_to_close() iterates over more and more
      closed spaces before reaching any opened space for each
      fil_node_open_file_low() call. This causes a performance regression
      if the number of spaces is big enough.
      
      The fix is to keep opened spaces at the top of fil_system.space_list,
      and to move closed files to the end of the list.
      
      For this purpose, the pointer fil_space_t::space_list_last_opened is
      introduced. It points to the last inserted opened space in
      fil_space_t::space_list. When a space is opened, it is inserted at the
      position just after the one that the pointer points to in
      fil_space_t::space_list, to preserve the logic introduced in MDEV-23855.
      Any closed space is added to the end of fil_space_t::space_list.
      
      As opened spaces are located at the top of fil_space_t::space_list,
      fil_space_t::try_to_close() finds opened space faster.
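
      The list discipline described above can be modelled with a short,
      self-contained sketch. This is not the InnoDB implementation: Space,
      SpaceList and the visited counter are hypothetical names, and std::list
      stands in for the real intrusive list. It only shows why keeping opened
      entries at the front makes the search short.

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <string>

struct Space {
  std::string name;
  bool is_open;
};

// Hypothetical model: opened spaces at the front, closed ones at the back.
class SpaceList {
  std::list<Space> spaces_;
  // Points at the most recently inserted opened space, or end() if none.
  std::list<Space>::iterator last_opened_ = spaces_.end();

 public:
  void add_closed(const std::string &name) {
    spaces_.push_back(Space{name, false});
  }

  // Insert just after the last opened space, or at the front if there is none.
  void add_opened(const std::string &name) {
    auto pos = last_opened_ == spaces_.end() ? spaces_.begin()
                                             : std::next(last_opened_);
    last_opened_ = spaces_.insert(pos, Space{name, true});
  }

  // Close the first opened space and move it to the end of the list.
  // `visited` reports how many entries had to be inspected.
  bool try_to_close(unsigned &visited) {
    visited = 0;
    for (auto it = spaces_.begin(); it != spaces_.end(); ++it) {
      ++visited;
      if (!it->is_open)
        continue;
      it->is_open = false;
      if (it == last_opened_)
        last_opened_ = it == spaces_.begin() ? spaces_.end() : std::prev(it);
      spaces_.splice(spaces_.end(), spaces_, it);  // iterators stay valid
      return true;
    }
    return false;
  }
};
```

      With two closed and two opened spaces, each successful call in this
      model inspects a single entry. A full scan happens only when nothing is
      left to close.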
      
      There can be the case when opened and closed spaces are mixed in
      fil_space_t::space_list if fil_system.freeze_space_list was set during
      fil_node_open_file_low() execution. But this should not cause any error,
      as fil_space_t::try_to_close() still iterates spaces in the list.
      
      There is no need for a test case for the fix, as it does not change any
      functionality but only fixes a performance regression.
    • Merge 10.5 into 10.6 · f169dfb4
      Marko Mäkelä authored
  3. 09 Mar, 2023 1 commit
    • MDEV-30819 InnoDB fails to start up after downgrading from MariaDB 11.0 · 08267ba0
      Marko Mäkelä authored
      While downgrades are not supported, and misguided attempts at them could
      cause serious corruption, especially after
      commit b07920b6,
      it might be useful if InnoDB would start up even after an upgrade to
      MariaDB Server 11.0 or later had removed the change buffer.
      
      innodb_change_buffering_update(): Disallow anything other than
      innodb_change_buffering=none when the change buffer is corrupted.
      
      ibuf_init_at_db_start(): Mention a possible downgrade in the corruption
      error message. If innodb_change_buffering=none, ignore the error but do
      not initialize ibuf.index.
      
      ibuf_free_excess_pages(), ibuf_contract(), ibuf_merge_space(),
      ibuf_update_max_tablespace_id(), ibuf_delete_for_discarded_space(),
      ibuf_print(): Check for !ibuf.index.
      
      ibuf_check_bitmap_on_import(): Remove some unnecessary code.
      This function is only accessing change buffer bitmap pages in a
      data file that is not attached to the rest of the database.
      It is not accessing the change buffer tree itself, hence it does
      not need any additional mutex protection.
      
      This has been tested both by starting up MariaDB Server 10.8 on
      a 11.0 data directory, and by running ./mtr --big-test while
      ibuf_init_at_db_start() was tweaked to always fail.
  4. 08 Mar, 2023 1 commit
  5. 07 Mar, 2023 1 commit
  6. 06 Mar, 2023 3 commits
  7. 02 Mar, 2023 1 commit
  8. 28 Feb, 2023 3 commits
    • Sergei Golubchik's avatar
    • Merge 10.5 into 10.6 · 085d0ac2
      Marko Mäkelä authored
    • MDEV-30753 Possible corruption due to trx_purge_free_segment() · c14a3943
      Marko Mäkelä authored
      Starting with commit 0de3be8c (MDEV-30671),
      the field TRX_UNDO_NEEDS_PURGE lost its previous meaning.
      The following scenario is possible:
      
      (1) InnoDB is killed at a point of time corresponding to the durable
      execution of some fseg_free_step_not_header() but not
      trx_purge_remove_log_hdr().
      (2) After restart, the affected pages are allocated for something else.
      (3) Purge will attempt to access the newly reallocated pages when looking
      for some old undo log records.
      
      trx_purge_free_segment(): Invoke trx_purge_remove_log_hdr() as the first
      thing, to be safe. If the server is killed, some pages will never be
      freed. That is the lesser evil. Also, before each mtr.start(), invoke
      log_free_check() to prevent ib_logfile0 overrun.
  9. 27 Feb, 2023 2 commits
    • Added detection of memory overwrite with multi_malloc · 57c526ff
      Monty authored
      This patch also fixes some bugs detected by valgrind after this
      patch:
      
      - Not enough copy_func elements were allocated by Create_tmp_table(), which
        caused a memory overwrite in Create_tmp_table::add_fields().
        I added an ASSERT() to be able to detect this also without valgrind.
        The bug was that TMP_TABLE_PARAM::copy_fields was not correctly set
        when calling create_tmp_table().
      - Aria::empty_bits is not allocated if there are no varchar/char/blob
        fields in the table.  Fixed code to take this into account.
        This cannot cause any issues as this is just a memory access
        into other Aria memory and the content of the memory would not be used.
      - Aria::last_key_buff was not allocated big enough. This may have caused
        issues with rtrees and ma_extra(HA_EXTRA_REMEMBER_POS) as they
        would use the same memory area.
      - Aria and MyISAM didn't take extended key parts into account, which
        caused problems when copying rec_per_key from engine to sql level.
      - Mark asan builds with 'asan' in version string to detect these in
        not_valgrind_build.inc.
        This is needed to not have main.sp-no-valgrind fail with asan.
    • Merge 10.5 into 10.6 · 3e2ad0e9
      Marko Mäkelä authored
  10. 24 Feb, 2023 1 commit
    • MDEV-30671 InnoDB undo log truncation fails to wait for purge of history · 0de3be8c
      Marko Mäkelä authored
      It is not safe to invoke trx_purge_free_segment() or execute
      innodb_undo_log_truncate=ON before all undo log records in
      the rollback segment have been processed.
      
      A prominent failure that would occur due to premature freeing of
      undo log pages is that trx_undo_get_undo_rec() would crash when
      trying to copy an undo log record to fetch the previous version
      of a record.
      
      If trx_undo_get_undo_rec() was not invoked in the unlucky time frame,
      then the symptom would be that some committed transaction history is
      never removed. This would be detected by CHECK TABLE...EXTENDED that
      was implemented in commit ab019010.
      Such a garbage collection leak should be possible even when using
      innodb_undo_log_truncate=OFF, just involving trx_purge_free_segment().
      
      trx_rseg_t::needs_purge: Change the type from Boolean to a transaction
      identifier, noting the most recent non-purged transaction, or 0 if
      everything has been purged. On transaction start, we initialize this
      to 1 more than the transaction start ID. On recovery, the field may be
      adjusted to the transaction end ID (TRX_UNDO_TRX_NO) if it is larger.
      
      The field TRX_UNDO_NEEDS_PURGE becomes write-only, apart from some debug
      assertions that would validate the value. The field reflects the old
      inaccurate Boolean field trx_rseg_t::needs_purge.
      
      trx_undo_mem_create_at_db_start(), trx_undo_lists_init(),
      trx_rseg_mem_restore(): Remove the parameter max_trx_id.
      Instead, store the maximum in trx_rseg_t::needs_purge,
      where trx_rseg_array_init() will find it.
      
      trx_purge_free_segment(): Continuously hold a lock on
      trx_rseg_t to prevent any concurrent allocation of undo log.
      
      trx_purge_truncate_rseg_history(): Only invoke trx_purge_free_segment()
      if the rollback segment is empty and there are no pending transactions
      associated with it.
      
      trx_purge_truncate_history(): Only proceed with innodb_undo_log_truncate=ON
      if trx_rseg_t::needs_purge indicates that all history has been purged.
      
      Tested by: Matthias Leich
  11. 22 Feb, 2023 1 commit
  12. 21 Feb, 2023 1 commit
  13. 20 Feb, 2023 1 commit
    • MDEV-27701 Race on trx->lock.wait_lock between lock_rec_move() and lock_sys_t::cancel() · a474e327
      Vlad Lesin authored
      The initial issue was an assertion failure in lock_sys_t::cancel();
      the assertion checked that the lock to cancel equals trx->lock.wait_lock.
      
      If we analyze the lock_sys_t::cancel() code from the perspective of
      races on trx->lock.wait_lock, we find no error there, except for the
      cases where the pointer needs to be reloaded after acquiring the
      corresponding latches.
      
      So the fix is just to remove the assertion and reload
      trx->lock.wait_lock after acquiring the necessary latches.
      
      Reviewed by: Marko Mäkelä <marko.makela@mariadb.com>
  14. 16 Feb, 2023 7 commits
    • Merge 10.5 into 10.6 · 67a6ad0a
      Marko Mäkelä authored
    • Marko Mäkelä's avatar
      d3f35aa4
    • Fix clang -Winconsistent-missing-override · 0c79ae94
      Marko Mäkelä authored
    • MDEV-30638 Deadlock between INSERT and InnoDB non-persistent statistics update · 201cfc33
      Marko Mäkelä authored
      This is a partial revert of
      commit 8b6a308e (MDEV-29883)
      and a follow-up to the
      merge commit 394fc71f (MDEV-24569).
      
      The latching order related to any operation that accesses the allocation
      metadata of an InnoDB index tree is as follows:
      
      1. Acquire dict_index_t::lock in non-shared mode.
      2. Acquire the index root page latch in non-shared mode.
      3. Possibly acquire further index page latches. Unless an exclusive
      dict_index_t::lock is held, this must follow the root-to-leaf,
      left-to-right order.
      4. Acquire a *non-shared* fil_space_t::latch.
      5. Acquire latches on the allocation metadata pages.
      6. Possibly allocate and write some pages, or free some pages.
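
      A latching order like the one listed above is commonly enforced in debug
      builds by assigning each latch class a rank and refusing any acquisition
      that goes backwards. The following is only a sketch of that general
      technique, not of InnoDB's own checks. LatchRank and LatchOrderChecker
      are hypothetical names, and the ranks simply mirror steps 1 to 5.

```cpp
#include <cassert>
#include <vector>

// Ranks mirror the numbered steps above; the names are hypothetical.
enum class LatchRank {
  IndexLock = 1,       // dict_index_t::lock
  RootPage = 2,        // index root page latch
  IndexPage = 3,       // further index page latches
  SpaceLatch = 4,      // fil_space_t::latch
  AllocationPage = 5   // allocation metadata page latches
};

class LatchOrderChecker {
  std::vector<LatchRank> held_;

 public:
  // Returns false if acquiring `rank` now would go backwards in the order.
  // Equal ranks are allowed, since step 3 may latch several index pages.
  bool acquire(LatchRank rank) {
    if (!held_.empty() && rank < held_.back())
      return false;
    held_.push_back(rank);
    return true;
  }

  // Models the end of a mini-transaction: everything is released at once.
  void release_all() { held_.clear(); }
};
```

      In this model, requesting the root page latch while the tablespace latch
      is already held is rejected, which is the kind of inversion that the
      latching order is meant to rule out.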
      
      btr_get_size_and_reserved(), dict_stats_update_transient_for_index(),
      dict_stats_analyze_index(): Acquire an exclusive fil_space_t::latch
      in order to avoid a deadlock in fseg_n_reserved_pages() in case of
      concurrent access to multiple indexes sharing the same "inode page".
      
      fseg_page_is_allocated(): Acquire an exclusive fil_space_t::latch
      in order to avoid deadlocks. All callers are holding latches
      on a buffer pool page, or an index, or both.
      Before commit edbde4a1 (MDEV-24167)
      a third mode was available that would not conflict with the shared
      fil_space_t::latch acquired by ha_innobase::info_low(),
      i_s_sys_tablespaces_fill_table(),
      or i_s_tablespaces_encryption_fill_table().
      Because those calls should be rather rare, it makes sense to use
      the simple rw_lock with only shared and exclusive modes.
      
      fil_crypt_get_page_throttle(): Avoid invoking fseg_page_is_allocated()
      on an allocation bitmap page (which can never be freed), to avoid
      acquiring a shared latch on top of an exclusive one.
      
      mtr_t::s_lock_space(), MTR_MEMO_SPACE_S_LOCK: Remove.
    • MDEV-30134 Assertion failed in buf_page_t::unfix() in buf_pool_t::watch_unset() · 54c0ac72
      Marko Mäkelä authored
      buf_pool_t::watch_set(): Always buffer-fix a block if one was found,
      no matter if it is a watch sentinel or a buffer page. The type of
      the block descriptor will be rechecked in buf_page_t::watch_unset().
      Do not expect the caller to acquire the page hash latch. Starting with
      commit bd5a6403 it is safe to release
      buf_pool.mutex before acquiring a buf_pool.page_hash latch.
      
      buf_page_get_low(): Adjust to the changed buf_pool_t::watch_set().
      
      This simplifies the logic and fixes a bug that was reproduced when
      using debug builds and the setting innodb_change_buffering_debug=1.
    • MDEV-30397: MariaDB crash due to DB_FAIL reported for a corrupted page · 9c157994
      Marko Mäkelä authored
      buf_read_page_low(): Map the buf_page_t::read_complete() return
      value DB_FAIL to DB_PAGE_CORRUPTED. The purpose of the DB_FAIL
      return value is to avoid error log noise when read-ahead brings
      in an unused page that is typically filled with NUL bytes.
      
      If a synchronous read is bringing in a corrupted page where the
      page frame does not contain the expected tablespace identifier and
      page number, that must be treated as an attempt to read a corrupted
      page. The correct error code for this is DB_PAGE_CORRUPTED.
      The error code DB_FAIL is not handled by row_mysql_handle_errors().
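
      The mapping can be pictured with a small sketch. It is not the actual
      buf_read_page_low() code. DbErr and map_read_result() are hypothetical,
      and the assumption that asynchronous reads keep the quiet result is
      inferred from the stated purpose of DB_FAIL.

```cpp
#include <cassert>

// Hypothetical subset of error codes; the real enumeration is much larger.
enum class DbErr { Success, Fail, PageCorrupted };

// Translate the result of completing a page read for the caller.
// `synchronous` is true when a caller is waiting for this specific page.
DbErr map_read_result(DbErr read_complete_result, bool synchronous) {
  if (synchronous && read_complete_result == DbErr::Fail)
    return DbErr::PageCorrupted;  // the page was requested, so report corruption
  return read_complete_result;    // assumed: read-ahead keeps the quiet result
}
```

      A caller that only understands PageCorrupted then never sees the
      internal Fail value on the synchronous path.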
      
      This was missed in commit 0b47c126
      (MDEV-13542).
    • Merge 10.5 into 10.6 · cc27e5fd
      Marko Mäkelä authored
  15. 15 Feb, 2023 5 commits
    • MDEV-30657 InnoDB: Not applying UNDO_APPEND due to corruption · 5300c0fb
      Marko Mäkelä authored
      This almost completely reverts
      commit acd23da4 and
      retains a safe optimization:
      
      recv_sys_t::parse(): Remove any old redo log records for the
      truncated tablespace, to free up memory earlier.
      If recovery consists of multiple batches, then recv_sys_t::apply()
      must invoke recv_sys_t::trim() again to avoid wrongly
      applying old log records to an already truncated undo tablespace.
    • MDEV-30324: Wrong result upon SELECT DISTINCT ... WITH TIES · 4afa3b64
      Vicențiu Ciorbaru authored
      WITH TIES would not take effect if SELECT DISTINCT was used in a
      context where an INDEX is used to resolve the ORDER BY clause.
      
      WITH TIES relies on `JOIN::order` to contain the non-constant
      fields, in order to test the equality of the ORDER BY fields required for WITH TIES.
      
      The cause of the problem was a premature removal of the `JOIN::order`
      member during a DISTINCT optimization. This led to WITH TIES code assuming
      ORDER BY only contained "constant" elements.
      
      Disable this optimization when WITH TIES is in effect.
      
      (side-note: the order by removal does not impact any current tests, thus
      it will be removed in a future version)
      
      Reviewed by: monty@mariadb.org
    • Whitespace fix · d2b773d9
      Vicențiu Ciorbaru authored
    • Vicențiu Ciorbaru's avatar
    • MDEV-30333 Wrong result with not_null_range_scan and LEFT JOIN with empty table · 192427e3
      Monty authored
      There was a bug in JOIN::make_notnull_conds_for_range_scans() when
      clearing TABLE->tmp_set, which was used to mark fields that could not be
      null.
      
      This function is only used if 'not_null_range_scan=on' is set.
      
      The effect was that tmp_set contained a 'random value' and this caused
      the optimizer to think that some fields could not be null.
      FLUSH TABLES clears tmp_set and because of this things worked temporarily.
      
      Fixed by clearing tmp_set properly.
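
      The class of bug can be shown with a minimal sketch of a shared scratch
      bitmap. This is not the server code: Table, kMaxFields and
      mark_not_null_fields() are hypothetical, and std::bitset stands in for
      the real bitmap type.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kMaxFields = 64;

struct Table {
  // Scratch bitmap that several code paths reuse for their own purposes.
  std::bitset<kMaxFields> tmp_set;
};

// Marks the given field numbers as "cannot be NULL" in the scratch bitmap.
void mark_not_null_fields(Table &t, const std::vector<std::size_t> &fields) {
  // Without a complete reset, bits left by an earlier user would be read
  // as if this call had set them.
  t.tmp_set.reset();
  for (std::size_t f : fields)
    t.tmp_set.set(f);
}
```

      If the reset is skipped or covers only part of the bitmap, a stale bit
      makes an unrelated field look non-nullable, which matches the symptom
      described above.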
  16. 14 Feb, 2023 5 commits