1. 25 Oct, 2022 1 commit
  2. 24 Oct, 2022 1 commit
  3. 21 Oct, 2022 3 commits
    • MDEV-29622 Wrong assertions in lock_cancel_waiting_and_release() for deadlock resolving caller · 9c04d66d
      Vlad Lesin authored
      Suppose we have two transactions, trx 1 and trx 2.
      
      trx 2 performs deadlock resolution from lock_wait(): it sets
      victim->lock.was_chosen_as_deadlock_victim=true for trx 1, but has not
      yet invoked lock_cancel_waiting_and_release().
      
      trx 1 checks the flag in lock_trx_handle_wait(), and starts rollback
      from row_mysql_handle_errors(). It can change trx->lock.wait_thr and
      trx->state as it holds trx_t::mutex, but trx 2 has not yet requested it,
      as lock_cancel_waiting_and_release() has not yet been called.
      
      After that, trx 1 tries to release locks in trx_t::rollback_low(),
      invoking trx_t::rollback_finish(). lock_release() blocks while trying to
      acquire lock_sys.rd_lock(SRW_LOCK_CALL) in lock_release_try(), because
      lock_sys is held by trx 2: deadlock resolution runs under
      lock_sys.wr_lock(SRW_LOCK_CALL); see Deadlock::report() for details.

      trx 2 executes lock_cancel_waiting_and_release() for the deadlock victim,
      i.e. for trx 1. lock_cancel_waiting_and_release() contains some
      trx->lock.wait_thr and trx->state assertions, which will fail, because
      trx 1 has changed them during rollback execution.
      
      So, according to the above scenario, it is legal to have
      trx->lock.wait_thr==0 and trx->state!=TRX_STATE_ACTIVE in
      lock_cancel_waiting_and_release() if it was invoked from
      Deadlock::report(), and the fix is simply to change the assertion
      conditions.
      
      There is also lock_wait() cleanup around trx->error_state.
      
      If trx->error_state can be changed by a thread other than the owning
      one, it must be protected with lock_sys.wait_mutex, as lock_wait() uses
      trx->lock.cond along with that mutex.

      Also, if trx->error_state was changed before lock_sys.wait_mutex was
      acquired, the code that follows could reset it, which is wrong. We also
      need to check trx->error_state before entering the waiting loop;
      otherwise trx->error_state may have been set before lock_sys.wait_mutex
      was acquired, and the thread would still wait on trx->lock.cond.
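
      The wait pattern described above can be sketched in standard C++. This is
      only an illustration, not the InnoDB code: WaitErr, WaiterSketch and its
      members are hypothetical stand-ins for dberr_t, trx_t, lock_sys.wait_mutex
      and trx->lock.cond. The point is that the error state is written and read
      under the same mutex, and that it is checked before the first sleep.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-ins for the relevant dberr_t values.
enum class WaitErr { SUCCESS, DEADLOCK };

// Hypothetical, heavily simplified model of a waiting transaction.
struct WaiterSketch {
  std::mutex wait_mutex;                   // plays the role of lock_sys.wait_mutex
  std::condition_variable cond;            // plays the role of trx->lock.cond
  WaitErr error_state = WaitErr::SUCCESS;  // protected by wait_mutex
  bool granted = false;                    // protected by wait_mutex

  // Called by another thread, e.g. the one resolving a deadlock.
  void mark_victim() {
    std::lock_guard<std::mutex> g(wait_mutex);
    error_state = WaitErr::DEADLOCK;
    cond.notify_one();
  }

  // Called by the thread that releases the conflicting lock.
  void grant() {
    std::lock_guard<std::mutex> g(wait_mutex);
    granted = true;
    cond.notify_one();
  }

  // The predicate is evaluated under wait_mutex before the first sleep,
  // so an error_state set before we got here is neither lost nor reset.
  WaitErr wait() {
    std::unique_lock<std::mutex> g(wait_mutex);
    cond.wait(g, [this] {
      return granted || error_state != WaitErr::SUCCESS;
    });
    assert(granted || error_state == WaitErr::DEADLOCK);
    return error_state;
  }
};
```

      In this sketch, calling mark_victim() before wait() makes wait() return
      immediately with the error instead of sleeping on the condition variable.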
    • MDEV-29635 race on trx->lock.wait_lock in deadlock resolution · acebe357
      Vlad Lesin authored
      Returning DB_SUCCESS unconditionally if !trx->lock.wait_lock in
      lock_trx_handle_wait() is wrong. Even if
      trx->lock.was_chosen_as_deadlock_victim was not set before the first check
      in lock_trx_handle_wait(), it can be set after the check, and
      trx->lock.wait_lock can be reset by another thread from
      lock_reset_lock_and_trx_wait() if the transaction was chosen as deadlock
      victim. In this case lock_trx_handle_wait() will return DB_SUCCESS even
      though the transaction was marked as deadlock victim, and execution will
      continue instead of rolling back.
      
      The fix is to check trx->lock.was_chosen_as_deadlock_victim once more if
      trx->lock.wait_lock is reset, as trx->lock.wait_lock can be reset only
      after trx->lock.was_chosen_as_deadlock_victim was set if the transaction
      was chosen as deadlock victim.
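
      A minimal sketch of the re-check, assuming a much simpler structure than
      the real trx_t. TrxLockSketch, HandleResult, choose_as_victim() and
      handle_wait() are hypothetical names; the between_checks hook exists only
      to simulate another thread running between the two checks, and LOCK_WAIT
      merely marks the branch where the real function does more work.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-ins for DB_SUCCESS, DB_LOCK_WAIT and DB_DEADLOCK.
enum class HandleResult { SUCCESS, LOCK_WAIT, DEADLOCK };

// Hypothetical, simplified subset of trx->lock.
struct TrxLockSketch {
  std::atomic<bool> was_chosen_as_deadlock_victim{false};
  std::atomic<const void*> wait_lock{nullptr};
};

// Resolver side: the victim flag is always set before wait_lock is reset.
inline void choose_as_victim(TrxLockSketch& l) {
  l.was_chosen_as_deadlock_victim.store(true);
  l.wait_lock.store(nullptr);
}

// Waiter side. between_checks() is a test-only hook that lets a caller
// simulate a concurrent resolver running after the first flag check.
template <typename Hook>
HandleResult handle_wait(TrxLockSketch& l, Hook between_checks) {
  if (l.was_chosen_as_deadlock_victim.load())
    return HandleResult::DEADLOCK;
  between_checks();
  if (l.wait_lock.load() == nullptr) {
    // wait_lock may have been reset because we were chosen as the victim
    // after the first check, so look at the flag once more.
    return l.was_chosen_as_deadlock_victim.load() ? HandleResult::DEADLOCK
                                                  : HandleResult::SUCCESS;
  }
  return HandleResult::LOCK_WAIT;  // still waiting; not modelled further
}
```

      Because the resolver sets the flag before resetting wait_lock, a waiter
      that observes a null wait_lock is guaranteed to see the flag on the
      second check.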
    • MDEV-24402: InnoDB CHECK TABLE ... EXTENDED · ab019010
      Marko Mäkelä authored
      Until now, the attribute EXTENDED of CHECK TABLE was ignored by InnoDB,
      and InnoDB only counted the records in each index according
      to the current read view. Unless the attribute QUICK was specified, the
      function btr_validate_index() would be invoked to validate the B-tree
      structure (the sibling and child links between index pages).
      
      The EXTENDED check will not only count all index records according to the
      current read view, but also ensure that any delete-marked records in the
      clustered index are waiting for the purge of history, and that all
      secondary index records point to a version of the clustered index record
      that is waiting for the purge of history. In other words, no index may
      contain orphan records. Normal MVCC reads and the non-EXTENDED version
      of CHECK TABLE would ignore these orphans.
      
      Unpurged records merely result in warnings (at most one per index),
      not errors, and no indexes will be flagged as corrupted due to such
      garbage. It will remain possible to SELECT data from such indexes or
      tables (which will skip such records) or to rebuild the table to
      reclaim some space.
      
      We introduce purge_sys.end_view that will be (almost) a copy of
      purge_sys.view at the end of a batch of purging committed transaction
      history. It is not an exact copy, because if the size of a purge batch
      is limited by innodb_purge_batch_size, some records that
      purge_sys.view would allow to be purged will be left over for
      subsequent batches.
      
      The purge_sys.view is relevant in the purge of committed transaction
      history, to determine if records are safe to remove. The new
      purge_sys.end_view is relevant in MVCC operations and in
      CHECK TABLE ... EXTENDED. It tells which undo log records are
      safe to access (have not been discarded at the end of a purge batch).
      
      purge_sys.clone_oldest_view<true>(): In trx_lists_init_at_db_start(),
      clone the oldest read view similar to purge_sys_t::clone_end_view()
      so that CHECK TABLE ... EXTENDED will not report bogus failures between
      InnoDB restart and the completed purge of committed transaction history.
      
      purge_sys_t::is_purgeable(): Replaces purge_sys_t::changes_visible()
      in the case that purge_sys.latch will not be held by the caller.
      Among other things, this guards access to BLOBs. It is not safe to
      dereference any BLOBs of a delete-marked purgeable record, because
      they may have already been freed.
      
      purge_sys_t::view_guard::view(): Return a reference to purge_sys.view
      that will be protected by purge_sys.latch, held by purge_sys_t::view_guard.
      
      purge_sys_t::end_view_guard::view(): Return a reference to
      purge_sys.end_view while it is protected by purge_sys.end_latch.
      Whenever a thread needs to retrieve an older version of a clustered
      index record, it will hold a page latch on the clustered index page
      and potentially also on a secondary index page that points to the
      clustered index page. If these pages contain purgeable records that
      would be accessed by a currently running purge batch, the progress of
      the purge batch would be blocked by the page latches. Hence, it is
      safe to make a copy of purge_sys.end_view while holding an index page
      latch, and consult the copy of the view to determine whether a record
      should already have been purged.
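
      The guard-and-copy idea can be illustrated as follows. This is a sketch
      under loud assumptions: EndViewSketch reduces a read view to a single
      limit, a std::mutex stands in for purge_sys.end_latch, and PurgeSketch,
      finish_batch() and copy_end_view() are hypothetical names that do not
      exist in InnoDB.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical read view reduced to one limit: changes of transactions
// with an id below low_limit_id are visible to this view.
struct EndViewSketch {
  std::uint64_t low_limit_id = 0;
  bool changes_visible(std::uint64_t trx_id) const {
    return trx_id < low_limit_id;
  }
};

class PurgeSketch {
  std::mutex end_latch_;    // stands in for purge_sys.end_latch
  EndViewSketch end_view_;  // stands in for purge_sys.end_view

 public:
  // RAII guard: the reference returned by view() is only meant to be
  // used while the guard (and thus the latch) is alive.
  class EndViewGuard {
    std::unique_lock<std::mutex> lock_;
    const EndViewSketch& view_;

   public:
    explicit EndViewGuard(PurgeSketch& p)
        : lock_(p.end_latch_), view_(p.end_view_) {}
    const EndViewSketch& view() const { return view_; }
  };

  // Called at the end of a (hypothetical) purge batch.
  void finish_batch(std::uint64_t purged_up_to) {
    std::lock_guard<std::mutex> g(end_latch_);
    assert(purged_up_to >= end_view_.low_limit_id);
    end_view_.low_limit_id = purged_up_to;
  }

  // Take a private copy under the latch; the copy can be consulted later
  // without holding the latch.
  EndViewSketch copy_end_view() { return EndViewGuard(*this).view(); }
};
```

      A caller would take the copy while holding its page latch and then use
      only the copy for visibility checks, so later purge batches do not
      change its answers.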
      
      btr_validate_index(): Remove a redundant check.
      
      row_check_index_match(): Check if a secondary index record and a
      version of a clustered index record match each other.
      
      row_check_index(): Replaces row_scan_index_for_mysql().
      Count the records in each index directly, duplicating the relevant
      logic from row_search_mvcc(). Initialize check_table_extended_view
      for CHECK ... EXTENDED while holding an index leaf page latch.
      If we encounter an orphan record, the copy of purge_sys.end_view that
      we make is safe for visibility checks, and trx_undo_get_undo_rec() will
      check for the safety to access each undo log record. Should that check
      fail, we should return DB_MISSING_HISTORY to report a corrupted index.
      The EXTENDED check tries to match each secondary index record with
      every available clustered index record version, by duplicating the logic
      of row_vers_build_for_consistent_read() and invoking
      trx_undo_prev_version_build() directly.
      
      Before invoking row_check_index_match() on delete-marked clustered index
      record versions, we will consult purge_sys.is_purgeable() in order to
      avoid accessing freed BLOBs.
      
      We will always check that the DB_TRX_ID or PAGE_MAX_TRX_ID does not
      exceed the global maximum. Orphan secondary index records will be
      flagged only if everything up to PAGE_MAX_TRX_ID has been purged.
      We warn also about clustered index records whose nonzero DB_TRX_ID
      should have been reset in purge or rollback.
      
      trx_set_rw_mode(): Move an assertion from ReadView::set_creator_trx_id().
      
      trx_undo_prev_version_build(): Remove two debug-only parameters,
      and return an error code instead of a Boolean.
      
      trx_undo_get_undo_rec(): Return a pointer to the undo log record,
      or nullptr if one cannot be retrieved. Instead of consulting the
      purge_sys.view, consult the purge_sys.end_view to determine which
      records can be accessed.
      
      trx_undo_get_rec_if_purgeable(): A variant of trx_undo_get_undo_rec()
      that will consult purge_sys.view instead of purge_sys.end_view.
      
      TRX_UNDO_CHECK_PURGEABILITY: A new parameter to
      trx_undo_prev_version_build(), passed by row_vers_old_has_index_entry()
      so that purge_sys.view instead of purge_sys.end_view will be consulted
      to determine whether a secondary index record may be safely purged.
      
      row_upd_changes_disowned_external(): Remove. This should be more
      expensive than briefly latching purge_sys in trx_undo_prev_version_build()
      (which may make use of transactional memory).
      
      row_sel_reset_old_vers_heap(): New function, split from
      row_sel_build_prev_vers_for_mysql().
      
      row_sel_build_prev_vers_for_mysql(): Reorder some parameters
      to simplify the call to row_sel_reset_old_vers_heap().
      
      row_search_for_mysql(): Replaced with direct calls to row_search_mvcc().
      
      sel_node_get_nth_plan(): Define inline in row0sel.h
      
      open_step(): Define at the call site, in simplified form.
      
      sel_node_reset_cursor(): Merged with the only caller open_step().
      ---
      ReadViewBase::check_trx_id_sanity(): Remove.
      Let us handle "future" DB_TRX_ID in a more meaningful way:
      
      row_sel_clust_sees(): Return DB_SUCCESS if the record is visible,
      DB_SUCCESS_LOCKED_REC if it is invisible, and DB_CORRUPTION if
      the DB_TRX_ID is in the future.
      
      row_undo_mod_must_purge(), row_undo_mod_clust(): Silently ignore
      corrupted DB_TRX_ID. We are in ROLLBACK, and we should have noticed
      that corruption when we were about to modify the record in the first
      place (leading us to refuse the operation).
      
      row_vers_build_for_consistent_read(): Return DB_CORRUPTION if
      DB_TRX_ID is in the future.
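
      The three-way result described for row_sel_clust_sees() can be sketched
      like this. It is an assumption-laden illustration: SeesResult stands in
      for DB_SUCCESS, DB_SUCCESS_LOCKED_REC and DB_CORRUPTION, ReadLimitSketch
      reduces the read view to one limit, and "in the future" is taken here to
      mean "greater than the global maximum transaction id".

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for DB_SUCCESS (visible),
// DB_SUCCESS_LOCKED_REC (invisible) and DB_CORRUPTION.
enum class SeesResult { VISIBLE, INVISIBLE, CORRUPTION };

// Hypothetical read view reduced to one limit: transactions with an id
// below low_limit_id are visible.
struct ReadLimitSketch {
  std::uint64_t low_limit_id;
};

inline SeesResult clust_sees(std::uint64_t db_trx_id,
                             const ReadLimitSketch& view,
                             std::uint64_t max_trx_id) {
  // Assumption: an id beyond the global maximum cannot be valid.
  if (db_trx_id > max_trx_id) return SeesResult::CORRUPTION;
  return db_trx_id < view.low_limit_id ? SeesResult::VISIBLE
                                       : SeesResult::INVISIBLE;
}
```

      Returning a distinct value for the corrupted case lets the caller report
      an error instead of treating the record as merely invisible.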
      
      Tested by: Matthias Leich
      Reviewed by: Vladislav Lesin
  4. 20 Oct, 2022 1 commit
  5. 18 Oct, 2022 1 commit
  6. 16 Oct, 2022 1 commit
  7. 15 Oct, 2022 2 commits
  8. 14 Oct, 2022 6 commits
  9. 13 Oct, 2022 6 commits
  10. 12 Oct, 2022 5 commits
    • MDEV-29753 An error is wrongly reported during INSERT with vcol index · 128356b4
      Nikita Malyavin authored
      See also commits aa8a31da and 64678c for a Bug #22990029 fix.
      
      In this scenario INSERT chose to check if delete unmarking is available for
      a just deleted record. To build an update vector, it needed to calculate
      the vcols as well. Since this INSERT was not IGNORE-flagged, recalculation
      failed.
      
      Solution: temporarily set abort_on_warning=true while calculating the
      column for a delete-unmarked insert.
    • MDEV-29299 SELECT from table with vcol index reports warning · 3cd2c1e8
      Nikita Malyavin authored
      Currently InnoDB does not store a trx_id for each record in a secondary
      index. The idea is the following: store only a per-page max_trx_id, and
      delete-mark the records when they are deleted or updated.
      
      When a read starts, it remembers the lowest id among the currently active
      transactions. InnoDB refers to it as trx->read_view->m_up_limit_id.
      See also ReadView::open.
      
      When a page is fetched, its max_trx_id is compared to m_up_limit_id.
      If the value is lower, and the secondary index record is not delete-marked,
      then the page is safe to read as is. Otherwise, a clustered index access
      could be needed. See the page_get_max_trx_id call in row_search_mvcc, and the
      corresponding switch (row_search_idx_cond_check(...)) below.
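
      The decision described above boils down to a small predicate. The sketch
      below is not the row_search_mvcc code; needs_clustered_lookup() is a
      hypothetical helper whose parameters mirror the page's max_trx_id, the
      read view's m_up_limit_id and the record's delete-mark flag.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: returns true if the secondary index record cannot
// be trusted as is and the clustered index may have to be consulted.
inline bool needs_clustered_lookup(std::uint64_t page_max_trx_id,
                                   std::uint64_t up_limit_id,
                                   bool delete_marked) {
  // Every change on the page was made by a transaction older than the
  // oldest one that was active when the read view was opened.
  const bool page_is_old_enough = page_max_trx_id < up_limit_id;
  return !(page_is_old_enough && !delete_marked);
}
```

      Only the combination "old enough page, record not delete-marked" avoids
      the clustered index access; every other combination may require it.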
      
      Virtual columns are required to be updated if the record was
      delete-marked. The motivation is documented in
      Row_sel_get_clust_rec_for_mysql::operator() near the
      row_sel_sec_rec_is_for_clust_rec call.
      
      The above explains why virtual column computation can normally happen
      during SELECT and, more generally, during a vcol index access.
      
      Sometimes InnoDB updates the stats tables. This starts a new
      transaction, and it can happen that the transaction has not finished by
      the time the SELECT executes, forcing recomputation of the virtual
      columns. If the result is something that normally produces a warning,
      such as division by zero, the warning could be output in a racy manner.
      
      The solution is to suppress the warnings when a column is computed
      for the described purpose.
      An ignore_warnings argument is added to innobase_get_computed_value.
      Currently, it is only true for the call from
      row_sel_sec_rec_is_for_clust_rec.
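
      The effect of such a flag can be illustrated with a toy computed column.
      Nothing below is the real innobase_get_computed_value() signature:
      SessionSketch and compute_vcol_sketch() are hypothetical, and the
      "virtual column" is simply a / b.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical session object that collects warnings.
struct SessionSketch {
  std::vector<std::string> warnings;
};

// Hypothetical computed column a / b. Returns false for a NULL result
// (division by zero); a warning is recorded only if not suppressed.
inline bool compute_vcol_sketch(long a, long b, long* out,
                                SessionSketch& session,
                                bool ignore_warnings) {
  if (b == 0) {
    if (!ignore_warnings) session.warnings.push_back("Division by 0");
    return false;
  }
  *out = a / b;
  return true;
}
```

      A caller that recomputes the column only for an internal comparison
      would pass ignore_warnings=true, so the user never sees a warning that
      depends on timing.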
    • Merge 10.5 into 10.6 · a992c615
      Marko Mäkelä authored
    • Fixes after 10.4 --> 10.5 merge · 5fffdbc8
      Jan Lindström authored
      * MDEV-29142 : Ignore inconsistency warning as we kill cluster
      * galera_parallel_apply_3nodes : Disabled because it is unstable
      * MDEV-26597 : Add missing code
      * galera_sr.galera_sr_ws_size2 : Remove incorrect assertion
    • Merge 10.4 into 10.5 · 977c385d
      Marko Mäkelä authored
  11. 11 Oct, 2022 9 commits
  12. 10 Oct, 2022 4 commits