  1. 22 Jun, 2022 1 commit
    • MDEV-22388 Corrupted undo log record leads to server crash · 6f4d0659
      Marko Mäkelä authored
      trx_undo_rec_copy(): Return nullptr if the undo record is corrupted.
      
      trx_undo_rec_get_undo_no(): Define inline with the declaration.
      
      trx_purge_dummy_rec: Replaced with a -1 pointer.
      
      row_undo_rec_get(), UndorecApplier::apply_undo_rec(): Check
      if trx_undo_rec_copy() returned nullptr.
      
      trx_purge_get_next_rec(): Return nullptr upon encountering any
      corruption, to signal the end of purge.
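      The contract can be illustrated with a standalone sketch (hypothetical names and a made-up two-byte length prefix, not InnoDB's actual undo record format): validate the stored length against the page bounds and return nullptr instead of crashing on garbage.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical sketch of the nullptr-on-corruption contract. The struct,
// function name, and the little-endian 2-byte length prefix are all
// illustrative; only the error-signalling idea comes from the commit.
struct undo_page { const unsigned char *data; size_t size; };

// Returns a heap copy of the record stored at 'offset', or nullptr if the
// stored length would run past the end of the page (corruption).
unsigned char *undo_rec_copy(const undo_page &page, size_t offset)
{
    if (offset + 2 > page.size)
        return nullptr;
    const size_t len = page.data[offset] |
                       (size_t{page.data[offset + 1]} << 8);
    if (len == 0 || offset + 2 + len > page.size)
        return nullptr;                 // corrupted: signal, do not crash
    unsigned char *copy = new unsigned char[len];
    std::memcpy(copy, page.data + offset + 2, len);
    return copy;
}
```

      Callers (as in row_undo_rec_get() above) then treat nullptr as "stop and report", rather than dereferencing garbage.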
  2. 06 Jun, 2022 1 commit
    • MDEV-13542: Crashing on corrupted page is unhelpful · 0b47c126
      Marko Mäkelä authored
      The approach to handling corruption that was chosen by Oracle in
      commit 177d8b0c
      is not really useful. Not only did it fail to prevent InnoDB
      from crashing, but it made things worse by blocking attempts to
      rescue data from or rebuild a partially readable table.
      
      We will try to prevent crashes in a different way: by propagating
      errors up the call stack. We will never mark the clustered index
      persistently corrupted, so that data recovery may be attempted by
      reading from the table, or by rebuilding the table.
      
      This should also fix MDEV-13680 (crash on btr_page_alloc() failure);
      it was extensively tested with innodb_file_per_table=0 and a
      non-autoextend system tablespace.
      
      We should now avoid crashes in many cases, such as when a page
      cannot be read or allocated, or an inconsistency is detected when
      attempting to update multiple pages. We will not crash on double-free,
      such as on the recovery of DDL in the system tablespace in case
      something was corrupted.
      
      Crashes on corrupted data are still possible. The fault injection mechanism
      that is introduced in the subsequent commit may help catch more of them.
      
      buf_page_import_corrupt_failure: Remove the fault injection, and instead
      corrupt some pages using Perl code in the tests.
      
      btr_cur_pessimistic_insert(): Always reserve extents (except for the
      change buffer), in order to prevent a subsequent allocation failure.
      
      btr_pcur_open_at_rnd_pos(): Merged to the only caller ibuf_merge_pages().
      
      btr_assert_not_corrupted(), btr_corruption_report(): Remove.
      Similar checks are already part of btr_block_get().
      
      FSEG_MAGIC_N_BYTES: Replaces FSEG_MAGIC_N_VALUE.
      
      dict_hdr_get(), trx_rsegf_get_new(), trx_undo_page_get(),
      trx_undo_page_get_s_latched(): Replaced with error-checking calls.
      
      trx_rseg_t::get(mtr_t*): Replaces trx_rsegf_get().
      
      trx_rseg_header_create(): Let the caller update the TRX_SYS page if needed.
      
      trx_sys_create_sys_pages(): Merged with trx_sysf_create().
      
      dict_check_tablespaces_and_store_max_id(): Do not access
      DICT_HDR_MAX_SPACE_ID, because it was already recovered in dict_boot().
      Merge dict_check_sys_tables() with this function.
      
      dir_pathname(): Replaces os_file_make_new_pathname().
      
      row_undo_ins_remove_sec(): Do not modify the undo page by adding
      a terminating NUL byte to the record.
      
      btr_decryption_failed(): Report decryption failures.
      
      dict_set_corrupted_by_space(), dict_set_encrypted_by_space(),
      dict_set_corrupted_index_cache_only(): Remove.
      
      dict_set_corrupted(): Remove the constant parameter dict_locked=false.
      Never flag the clustered index corrupted in SYS_INDEXES, because
      that would deny further access to the table. It might be possible to
      repair the table by executing ALTER TABLE or OPTIMIZE TABLE, in case
      no B-tree leaf page is corrupted.
      
      dict_table_skip_corrupt_index(), dict_table_next_uncorrupted_index(),
      row_purge_skip_uncommitted_virtual_index(): Remove, and refactor
      the callers to read dict_index_t::type only once.
      
      dict_table_is_corrupted(): Remove.
      
      dict_index_t::is_btree(): Determine if the index is a valid B-tree.
      
      BUF_GET_NO_LATCH, BUF_EVICT_IF_IN_POOL: Remove.
      
      UNIV_BTR_DEBUG: Remove. Any inconsistency will no longer trigger
      assertion failures; instead, an error code will be returned.
      
      buf_corrupt_page_release(): Replaced with a direct call to
      buf_pool.corrupted_evict().
      
      fil_invalid_page_access_msg(): Never crash on an invalid read;
      let the caller of buf_page_get_gen() decide.
      
      btr_pcur_t::restore_position(): Propagate failure status to the caller
      by returning CORRUPTED.
      
      opt_search_plan_for_table(): Simplify the code.
      
      row_purge_del_mark(), row_purge_upd_exist_or_extern_func(),
      row_undo_ins_remove_sec_rec(), row_undo_mod_upd_del_sec(),
      row_undo_mod_del_mark_sec(): Avoid mem_heap_create()/mem_heap_free()
      when no secondary indexes exist.
      
      row_undo_mod_upd_exist_sec(): Simplify the code.
      
      row_upd_clust_step(), dict_load_table_one(): Return DB_TABLE_CORRUPT
      if the clustered index (and therefore the table) is corrupted, similar
      to what we do in row_insert_for_mysql().
      
      fut_get_ptr(): Replace with buf_page_get_gen() calls.
      
      buf_page_get_gen(): Return nullptr and *err=DB_CORRUPTION
      if the page is marked as freed. For other modes than
      BUF_GET_POSSIBLY_FREED or BUF_PEEK_IF_IN_POOL this will
      trigger a debug assertion failure. For BUF_GET_POSSIBLY_FREED,
      we will return nullptr for freed pages, so that the callers
      can be simplified. The purge of transaction history will be
      a new user of BUF_GET_POSSIBLY_FREED, to avoid crashes on
      corrupted data.
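      The calling convention can be sketched in isolation (simplified stand-in types; DB_CORRUPTION is the only name taken from the text above): return the page pointer on success, or nullptr plus an out-parameter error code that the caller must check.

```cpp
#include <cassert>

// Illustrative stand-ins for the error-reporting style described above.
// The real buf_page_get_gen() takes many more parameters and a richer
// dberr_t; this only demonstrates the nullptr + *err contract.
enum dberr_t { DB_SUCCESS, DB_CORRUPTION };

struct page { bool freed; int payload; };

// Returns the page, or nullptr with *err = DB_CORRUPTION if it is marked
// as freed, instead of triggering an assertion failure.
page *page_get(page *candidate, dberr_t *err)
{
    *err = DB_SUCCESS;
    if (candidate && candidate->freed)
    {
        *err = DB_CORRUPTION;   // caller decides how to recover
        return nullptr;
    }
    return candidate;
}
```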
      
      buf_page_get_low(): Never crash on a corrupted page, but simply
      return nullptr.
      
      fseg_page_is_allocated(): Replaces fseg_page_is_free().
      
      fts_drop_common_tables(): Return an error if the transaction
      was rolled back.
      
      fil_space_t::set_corrupted(): Report a tablespace as corrupted if
      it was not reported already.
      
      fil_space_t::io(): Invoke fil_space_t::set_corrupted() to report
      out-of-bounds page access or other errors.
      
      Clean up mtr_t::page_lock().
      
      buf_page_get_low(): Validate the page identifier (to check for
      recently read corrupted pages) after acquiring the page latch.
      
      buf_page_t::read_complete(): Flag uninitialized (all-zero) pages
      with DB_FAIL. Return DB_PAGE_CORRUPTED on page number mismatch.
      
      mtr_t::defer_drop_ahi(): Renamed from mtr_defer_drop_ahi().
      
      recv_sys_t::free_corrupted_page(): Only invoke set_corrupt_fs()
      if any log records exist for the page. We do not mind if read-ahead
      produces corrupted (or all-zero) pages that were not actually needed
      during recovery.
      
      recv_recover_page(): Return whether the operation succeeded.
      
      recv_sys_t::recover_low(): Simplify the logic. Check for recovery error.
      
      Thanks to Matthias Leich for testing this extensively and to the
      authors of https://rr-project.org for making it easy to diagnose
      and fix any failures that were found during the testing.
  3. 18 Nov, 2021 1 commit
    • MDEV-27058: Reduce the size of buf_block_t and buf_page_t · aaef2e1d
      Marko Mäkelä authored
      buf_page_t::frame: Moved from buf_block_t::frame.
      All 'thin' buf_page_t describing compressed-only ROW_FORMAT=COMPRESSED
      pages will have frame=nullptr, while all 'fat' buf_block_t
      will have a non-null frame pointing to aligned innodb_page_size bytes.
      This eliminates the need for separate states for
      BUF_BLOCK_FILE_PAGE and BUF_BLOCK_ZIP_PAGE.
      
      buf_page_t::lock: Moved from buf_block_t::lock. That is, all block
      descriptors will have a page latch. The IO_PIN state that was used
      for discarding or creating the uncompressed page frame of a
      ROW_FORMAT=COMPRESSED block is replaced by a combination of read-fix
      and page X-latch.
      
      page_zip_des_t::fix: Replaces state_, buf_fix_count_, io_fix_, status
      of buf_page_t with a single std::atomic<uint32_t>. All modifications
      will use store(), fetch_add(), fetch_sub(). This space was previously
      wasted to alignment on 64-bit systems. We will use the following encoding
      that combines a state (partly read-fix or write-fix) and a buffer-fix
      count:
      
      buf_page_t::NOT_USED=0 (previously BUF_BLOCK_NOT_USED)
      buf_page_t::MEMORY=1 (previously BUF_BLOCK_MEMORY)
      buf_page_t::REMOVE_HASH=2 (previously BUF_BLOCK_REMOVE_HASH)
      buf_page_t::FREED=3 + fix: pages marked as freed in the file
      buf_page_t::UNFIXED=1U<<29 + fix: normal pages
      buf_page_t::IBUF_EXIST=2U<<29 + fix: normal pages; may need ibuf merge
      buf_page_t::REINIT=3U<<29 + fix: reinitialized pages (skip doublewrite)
      buf_page_t::READ_FIX=4U<<29 + fix: read-fixed pages (also X-latched)
      buf_page_t::WRITE_FIX=5U<<29 + fix: write-fixed pages (also U-latched)
      buf_page_t::WRITE_FIX_IBUF=6U<<29 + fix: write-fixed; may have ibuf
      buf_page_t::WRITE_FIX_REINIT=7U<<29 + fix: write-fixed (no doublewrite)
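      The encoding above can be sketched as one std::atomic<uint32_t> whose top three bits select a state and whose low bits count buffer-fixes. The constant values follow this commit message; the struct and accessor names below are illustrative, not the actual InnoDB declarations.

```cpp
#include <atomic>
#include <cstdint>

// Sketch of the documented state + buffer-fix encoding. Only the
// UNFIXED..WRITE_FIX_REINIT family (non-zero top bits) is modelled here.
struct page_state_sketch
{
    static constexpr uint32_t UNFIXED    = 1U << 29;
    static constexpr uint32_t IBUF_EXIST = 2U << 29;
    static constexpr uint32_t REINIT     = 3U << 29;
    static constexpr uint32_t READ_FIX   = 4U << 29;
    static constexpr uint32_t WRITE_FIX  = 5U << 29;
    static constexpr uint32_t STATE_MASK = 7U << 29;

    std::atomic<uint32_t> fix{UNFIXED};

    void buffer_fix()   { fix.fetch_add(1); }   // bump fix count only
    void buffer_unfix() { fix.fetch_sub(1); }
    uint32_t state() const { return fix.load() & STATE_MASK; }
    uint32_t fix_count() const { return fix.load() & ~STATE_MASK; }
    bool is_read_fixed()  const { return state() == READ_FIX; }
    // WRITE_FIX, WRITE_FIX_IBUF, WRITE_FIX_REINIT are the top 3 values.
    bool is_write_fixed() const { return state() >= WRITE_FIX; }
};
```

      Because state and count share one word, a single fetch_add() both buffer-fixes and preserves the state bits, which is what lets the separate io_fix_ and buf_fix_count_ fields be removed.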
      
      buf_page_t::write_complete(): Change WRITE_FIX or WRITE_FIX_REINIT to
      UNFIXED, and WRITE_FIX_IBUF to IBUF_EXIST, before releasing the U-latch.
      
      buf_page_t::read_complete(): Renamed from buf_page_read_complete().
      Change READ_FIX to UNFIXED or IBUF_EXIST, before releasing the X-latch.
      
      buf_page_t::can_relocate(): If the page latch is being held or waited for,
      or the block is buffer-fixed or io-fixed, return false. (The condition
      on the page latch is new.)
      
      Outside buf_page_get_gen(), buf_page_get_low() and buf_page_free(), we
      will acquire the page latch before fix(), and unfix() before unlocking.
      
      buf_page_t::flush(): Replaces buf_flush_page(). Optimize the
      handling of FREED pages.
      
      buf_pool_t::release_freed_page(): Assume that buf_pool.mutex is held
      by the caller.
      
      buf_page_t::is_read_fixed(), buf_page_t::is_write_fixed(): New predicates.
      
      buf_page_get_low(): Ignore guesses that are read-fixed because they
      may not yet be registered in buf_pool.page_hash and buf_pool.LRU.
      
      buf_page_optimistic_get(): Acquire latch before buffer-fixing.
      
      buf_page_make_young(): Leave read-fixed blocks alone, because they
      might not be registered in buf_pool.LRU yet.
      
      recv_sys_t::recover_deferred(), recv_sys_t::recover_low():
      Possibly fix MDEV-26326, by holding a page X-latch instead of
      only buffer-fixing the page.
  4. 29 Jul, 2021 1 commit
    • MDEV-23484 Rollback unnecessarily acquires dict_sys.latch · 86a14289
      Marko Mäkelä authored
      row_undo(): Remove the unnecessary acquisition and release of
      dict_sys.latch. This was supposed to prevent the table from being
      dropped while the undo log record is being rolled back. But, thanks to
      trx_resurrect_table_locks() that was introduced in
      mysql/mysql-server@935ba09d52c1908bde273ad1940b5ab919d9763d
      and commit c291ddfd
      as well as commit 1bd681c8
      (MDEV-25506 part 3), tables will be protected against dropping
      due to table locks.
      
      This reverts commit 0049d5b5
      (which had reverted a previous attempt of fixing this) and addresses
      an earlier race condition with the following:
      
      prepare_inplace_alter_table_dict(): If recovered transactions hold
      locks on the table while we are executing online ADD INDEX, acquire
      a table lock so that the rollback of those recovered transactions will
      not interfere with the ADD INDEX.
  5. 23 Jun, 2021 1 commit
    • MDEV-25062: Reduce trx_rseg_t::mutex contention · 6e12ebd4
      Marko Mäkelä authored
      redo_rseg_mutex, noredo_rseg_mutex: Remove the PERFORMANCE_SCHEMA keys.
      The rollback segment mutex will be uninstrumented.
      
      trx_sys_t: Remove pointer indirection for rseg_array, temp_rseg.
      Align each element to the cache line.
      
      trx_sys_t::rseg_id(): Replaces trx_rseg_t::id.
      
      trx_rseg_t::ref: Replaces needs_purge, trx_ref_count, skip_allocation
      in a single std::atomic<uint32_t>.
      
      trx_rseg_t::latch: Replaces trx_rseg_t::mutex.
      
      trx_rseg_t::history_size: Replaces trx_sys_t::rseg_history_len.
      
      trx_sys_t::history_size_approx(): Replaces trx_sys.rseg_history_len
      in those places where the exact count does not matter. We must not
      acquire any trx_rseg_t::latch while holding index page latches, because
      normally the trx_rseg_t::latch is acquired before any page latches.
      
      trx_sys_t::history_exists(): Replaces trx_sys.rseg_history_len!=0
      with an approximation.
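      A minimal sketch of the approximate counting, assuming per-rollback-segment atomic counters (the names and the 128-element array are stand-ins): each element is read atomically, but the sum is deliberately not a consistent snapshot.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative stand-ins for per-segment history counters that can be
// summed without holding any trx_rseg_t::latch.
struct rseg_sketch { std::atomic<uint32_t> history_size{0}; };

struct trx_sys_sketch
{
    static constexpr unsigned N_RSEGS = 128;    // assumed segment count
    rseg_sketch rsegs[N_RSEGS];

    // Approximate: each load is atomic, but segments may change between
    // loads. Good enough where the exact count does not matter.
    uint64_t history_size_approx() const
    {
        uint64_t n = 0;
        for (const rseg_sketch &r : rsegs)
            n += r.history_size.load(std::memory_order_relaxed);
        return n;
    }

    // Cheaper still: stop at the first non-empty segment.
    bool history_exists() const
    {
        for (const rseg_sketch &r : rsegs)
            if (r.history_size.load(std::memory_order_relaxed))
                return true;
        return false;
    }
};
```

      Avoiding the latch here is what makes it safe to consult the history length while holding index page latches.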
      
      We remove some unnecessary trx_rseg_t::latch acquisition around
      trx_undo_set_state_at_prepare() and trx_undo_set_state_at_finish().
      Those operations will only access fields that remain constant
      after trx_rseg_t::init().
  6. 04 May, 2021 1 commit
    • MDEV-18518 Multi-table CREATE and DROP transactions for InnoDB · 52aac131
      Marko Mäkelä authored
      InnoDB used to support at most one CREATE TABLE or DROP TABLE
      per transaction. This caused complications for DDL operations on
      partitioned tables (where each partition is treated as a separate
      table by InnoDB) and FULLTEXT INDEX (where each index is maintained
      in a number of internal InnoDB tables).
      
      dict_drop_index_tree(): Extend the MDEV-24589 logic and treat
      the purge or rollback of SYS_INDEXES records of clustered indexes
      specially: by dropping the tablespace if it exists. This is the only
      form of recovery that we will need.
      
      trx_undo_ddl_type: Document the DDL undo log record types better.
      
      trx_t::dict_operation: Change the type to bool.
      
      trx_t::ddl: Remove.
      
      trx_t::table_id, trx_undo_t::table_id: Remove.
      
      dict_build_table_def_step(): Remove trx_t::table_id logging.
      
      dict_table_close_and_drop(), row_merge_drop_table(): Remove.
      
      row_merge_lock_table(): Merged to the only callers, which can
      call lock_table_for_trx() directly.
      
      fts_aux_table_t, fts_aux_id, fts_space_set_t: Remove.
      
      fts_drop_orphaned_tables(): Remove.
      
      row_merge_rename_index_to_drop(): Remove. Thanks to MDEV-24589,
      we can simply delete the to-be-dropped indexes from SYS_INDEXES,
      while still being able to roll back the operation.
      
      ha_innobase_inplace_ctx: Make a few data members const.
      Preallocate trx.
      
      prepare_inplace_alter_table_dict(): Simplify the logic. Let the
      normal rollback take care of some cleanup.
      
      row_undo_ins_remove_clust_rec(): Simplify the parsing of SYS_COLUMNS.
      
      trx_rollback_active(): Remove the special DROP TABLE logic.
      
      trx_undo_mem_create_at_db_start(), trx_undo_reuse_cached():
      Always write TRX_UNDO_TABLE_ID as 0.
  7. 15 Apr, 2021 1 commit
  8. 13 Apr, 2021 1 commit
    • MDEV-24620 ASAN heap-buffer-overflow in btr_pcur_restore_position() · b8c8692f
      Marko Mäkelä authored
      Between btr_pcur_store_position() and btr_pcur_restore_position()
      it is possible that purge empties a table and enlarges
      index->n_core_fields and index->n_core_null_bytes.
      Therefore, we must cache index->n_core_fields in
      btr_pcur_t::old_n_core_fields so that btr_pcur_t::old_rec can be
      parsed correctly.
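      The caching idea can be sketched with stand-in types (old_n_core_fields is the member name from this commit; everything else is illustrative): the cursor snapshots the value when it stores its position, so the saved record is still parsed consistently even if purge enlarges the live index's value in between.

```cpp
#include <cstdint>

// Simplified stand-in for the index metadata that purge may enlarge.
struct index_sketch { uint32_t n_core_fields; };

struct pcur_sketch
{
    uint32_t old_n_core_fields = 0;

    void store_position(const index_sketch &index)
    {
        // Cache the value that matches the copied record, rather than
        // re-reading the live index, which may change concurrently.
        old_n_core_fields = index.n_core_fields;
    }

    // Parse the stored record using the cached value, regardless of the
    // index's current n_core_fields.
    uint32_t fields_for_parsing() const { return old_n_core_fields; }
};
```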
      
      Unfortunately, this is a huge change, because we will replace
      "bool leaf" parameters with "ulint n_core"
      (passing index->n_core_fields, or 0 for non-leaf pages).
      For special cases where we know that index->is_instant() cannot hold,
      we may also pass index->n_fields.
  9. 25 Jan, 2021 1 commit
    • MDEV-515 Reduce InnoDB undo logging for insert into empty table · 3cef4f8f
      Marko Mäkelä authored
      We implement an idea that was suggested by Michael 'Monty' Widenius
      in October 2017: When InnoDB is inserting into an empty table or partition,
      we can write a single undo log record TRX_UNDO_EMPTY, which will cause
      ROLLBACK to clear the table.
      
      For this to work, the insert into an empty table or partition must be
      covered by an exclusive table lock that will be held until the transaction
      has been committed or rolled back, or the INSERT operation has been
      rolled back (and the table is empty again), in lock_table_x_unlock().
      
      Clustered index records that are covered by the TRX_UNDO_EMPTY record
      will carry DB_TRX_ID=0 and DB_ROLL_PTR=1<<55, and thus they cannot
      be distinguished from what MDEV-12288 leaves behind after purging the
      history of row-logged operations.
      
      Concurrent non-locking reads must be adjusted: If the read view was
      created before the INSERT into an empty table, then we must continue
      to imagine that the table is empty, and not try to read any records.
      If the read view was created after the INSERT was committed, then
      all records must be visible normally. To implement this, we introduce
      the field dict_table_t::bulk_trx_id.
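      The visibility rule can be sketched as follows; the one-field read view is a deliberate simplification of ReadView, and only the role of dict_table_t::bulk_trx_id is taken from the text above.

```cpp
#include <cstdint>

using trx_id_t = uint64_t;

// Simplified read view: changes by transactions with id < up_limit are
// visible. The real ReadView is richer; this only illustrates the rule.
struct read_view_sketch
{
    trx_id_t up_limit;
    bool changes_visible(trx_id_t id) const { return id < up_limit; }
};

// If the bulk-inserting transaction (bulk_trx_id) is not visible to the
// view, the whole table must be treated as empty: no records are read.
// bulk_trx_id == 0 means no bulk insert ever happened.
bool table_seen_as_empty(const read_view_sketch &view, trx_id_t bulk_trx_id)
{
    return bulk_trx_id != 0 && !view.changes_visible(bulk_trx_id);
}
```

      This single comparison is why one field per table suffices, instead of per-record DB_TRX_ID checks for the bulk-inserted rows.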
      
      This special handling only applies to the very first INSERT statement
      of a transaction for the empty table or partition. If a subsequent
      statement in the transaction is modifying the initially empty table again,
      we must enable row-level undo logging, so that we will be able to
      roll back to the start of the statement in case of an error (such as
      duplicate key).
      
      INSERT IGNORE will continue to use row-level logging and locking, because
      implementing it would require the ability to roll back the latest row.
      Since the undo log that we write only allows us to roll back the entire
      statement, we cannot support INSERT IGNORE. We will introduce a
      handler::extra() parameter HA_EXTRA_IGNORE_INSERT to indicate to storage
      engines that INSERT IGNORE is being executed.
      
      In many test cases, we add an extra record to the table, so that during
      the 'interesting' part of the test, row-level locking and logging will
      be used.
      
      Replicas will continue to use row-level logging and locking until
      MDEV-24622 has been addressed. Likewise, this optimization will be
      disabled in Galera cluster until MDEV-24623 enables it.
      
      dict_table_t::bulk_trx_id: The latest active or committed transaction
      that initiated an insert into an empty table or partition.
      Protected by exclusive table lock and a clustered index leaf page latch.
      
      ins_node_t::bulk_insert: Whether bulk insert was initiated.
      
      trx_t::mod_tables: Use C++11 style accessors (emplace instead of insert).
      Unlike earlier, this collection will also cover temporary tables.
      
      trx_mod_table_time_t: Add start_bulk_insert(), end_bulk_insert(),
      is_bulk_insert(), was_bulk_insert().
      
      trx_undo_report_row_operation(): Before accessing any undo log pages,
      invoke trx->mod_tables.emplace() in order to determine whether undo
      logging was disabled, or whether this is the first INSERT and we are
      supposed to write a TRX_UNDO_EMPTY record.
      
      row_ins_clust_index_entry_low(): If we are inserting into an empty
      clustered index leaf page, set the ins_node_t::bulk_insert flag for
      the subsequent trx_undo_report_row_operation() call.
      
      lock_rec_insert_check_and_lock(), lock_prdt_insert_check_and_lock():
      Remove the redundant parameter 'flags' that can be checked in the caller.
      
      btr_cur_ins_lock_and_undo(): Simplify the logic. Correctly write
      DB_TRX_ID,DB_ROLL_PTR after invoking trx_undo_report_row_operation().
      
      trx_mark_sql_stat_end(), ha_innobase::extra(HA_EXTRA_IGNORE_INSERT),
      ha_innobase::external_lock(): Invoke trx_t::end_bulk_insert() so that
      the next statement will not be covered by table-level undo logging.
      
      ReadView::changes_visible(trx_id_t) const: New accessor for the case
      where the trx_id_t is not read from a potentially corrupted index page
      but directly from the memory. In this case, we can skip a sanity check.
      
      row_sel(), row_sel_try_search_shortcut(), row_search_mvcc():
      row_sel_try_search_shortcut_for_mysql(),
      row_merge_read_clustered_index(): Check dict_table_t::bulk_trx_id.
      
      row_sel_clust_sees(): Replaces lock_clust_rec_cons_read_sees().
      
      lock_sec_rec_cons_read_sees(): Replaced with lower-level code.
      
      btr_root_page_init(): Refactored from btr_create().
      
      dict_index_t::clear(), dict_table_t::clear(): Empty an index or table,
      for the ROLLBACK of an INSERT operation.
      
      ROW_T_EMPTY, ROW_OP_EMPTY: Note a concurrent ROLLBACK of an INSERT
      into an empty table.
      
      This is joint work with Thirunarayanan Balathandayuthapani,
      who created a working prototype.
      Thanks to Matthias Leich for extensive testing.
  10. 20 Oct, 2020 1 commit
    • Revert MDEV-23484 Rollback unnecessarily acquires dict_operation_lock · 0049d5b5
      Marko Mäkelä authored
      In row_undo_ins_remove_clust_rec() and similar places,
      an assertion !node->trx->dict_operation_lock_mode could fail,
      because an online ALTER is not allowed to run at the same time
      while DDL operations on the table are being rolled back.
      
      This race condition would be fixed by always acquiring an InnoDB
      table lock in ha_innobase::prepare_inplace_alter_table() or
      prepare_inplace_alter_table_dict(), or by ensuring that recovered
      transactions are protected by MDL that would block concurrent DDL
      until the rollback has been completed.
      
      This reverts commit 15093639
      and commit 22c4a751.
  11. 15 Oct, 2020 1 commit
    • MDEV-23399: Performance regression with write workloads · 7cffb5f6
      Marko Mäkelä authored
      The buffer pool refactoring in MDEV-15053 and MDEV-22871 shifted
      the performance bottleneck to the page flushing.
      
      The configuration parameters will be changed as follows:
      
      innodb_lru_flush_size=32 (new: how many pages to flush on LRU eviction)
      innodb_lru_scan_depth=1536 (old: 1024)
      innodb_max_dirty_pages_pct=90 (old: 75)
      innodb_max_dirty_pages_pct_lwm=75 (old: 0)
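      For reference, the new defaults could be pinned explicitly in a my.cnf fragment (values as listed above; override only after measuring your own workload):

```ini
[mysqld]
# Defaults after this change
innodb_lru_flush_size=32
innodb_lru_scan_depth=1536
innodb_max_dirty_pages_pct=90
innodb_max_dirty_pages_pct_lwm=75
```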
      
      Note: The parameter innodb_lru_scan_depth will only affect LRU
      eviction of buffer pool pages when a new page is being allocated. The
      page cleaner thread will no longer evict any pages. It used to
      guarantee that some pages would remain free in the buffer pool. Now, we
      perform that eviction 'on demand' in buf_LRU_get_free_block().
      The parameter innodb_lru_scan_depth (srv_LRU_scan_depth) is used as follows:
       * When the buffer pool is being shrunk in buf_pool_t::withdraw_blocks()
       * As a buf_pool.free limit in buf_LRU_list_batch() for terminating
         the flushing that is initiated e.g., by buf_LRU_get_free_block()
      The parameter also used to serve as an initial limit for unzip_LRU
      eviction (evicting uncompressed page frames while retaining
      ROW_FORMAT=COMPRESSED pages), but now we will use a hard-coded limit
      of 100 or unlimited for invoking buf_LRU_scan_and_free_block().
      
      The status variables will be changed as follows:
      
      innodb_buffer_pool_pages_flushed: This also includes the count of
      innodb_buffer_pool_pages_LRU_flushed and should work reliably,
      updated one by one in buf_flush_page() to give more real-time
      statistics. The function buf_flush_stats(), which we are removing,
      was not called in every code path. For both counters, we will use
      regular variables that are incremented in a critical section of
      buf_pool.mutex. Note that show_innodb_vars() directly links to the
      variables, and reads of the counters will *not* be protected by
      buf_pool.mutex, so you cannot get a consistent snapshot of both variables.
      
      The following INFORMATION_SCHEMA.INNODB_METRICS counters will be
      removed, because the page cleaner no longer deals with writing or
      evicting least recently used pages, and because the single-page writes
      have been removed:
      * buffer_LRU_batch_flush_avg_time_slot
      * buffer_LRU_batch_flush_avg_time_thread
      * buffer_LRU_batch_flush_avg_time_est
      * buffer_LRU_batch_flush_avg_pass
      * buffer_LRU_single_flush_scanned
      * buffer_LRU_single_flush_num_scan
      * buffer_LRU_single_flush_scanned_per_call
      
      When moving to a single buffer pool instance in MDEV-15058, we missed
      some opportunity to simplify the buf_flush_page_cleaner thread. It was
      unnecessarily using a mutex and some complex data structures, even
      though we always have a single page cleaner thread.
      
      Furthermore, the buf_flush_page_cleaner thread had separate 'recovery'
      and 'shutdown' modes where it was waiting to be triggered by some
      other thread, adding unnecessary latency and potential for hangs in
      relatively rarely executed startup or shutdown code.
      
      The page cleaner was also running two kinds of batches in an
      interleaved fashion: "LRU flush" (writing out some least recently used
      pages and evicting them on write completion) and the normal batches
      that aim to increase the MIN(oldest_modification) in the buffer pool,
      to help the log checkpoint advance.
      
      The buf_pool.flush_list flushing was being blocked by
      buf_block_t::lock for no good reason. Furthermore, if the FIL_PAGE_LSN
      of a page is ahead of log_sys.get_flushed_lsn(), that is, what has
      been persistently written to the redo log, we would trigger a log
      flush and then resume the page flushing. This would unnecessarily
      limit the performance of the page cleaner thread and trigger the
      infamous messages "InnoDB: page_cleaner: 1000ms intended loop took 4450ms.
      The settings might not be optimal" that were suppressed in
      commit d1ab8903 unless log_warnings>2.
      
      Our revised algorithm will make log_sys.get_flushed_lsn() advance at
      the start of buf_flush_lists(), and then execute a 'best effort' to
      write out all pages. The flush batches will skip pages that were modified
      since the log was written, or are currently exclusively locked.
      The MDEV-13670 message "page_cleaner: 1000ms intended loop took"
      will be removed, because by design, the buf_flush_page_cleaner() should
      not be blocked during a batch for extended periods of time.
      
      We will remove the single-page flushing altogether. Related to this,
      the debug parameter innodb_doublewrite_batch_size will be removed,
      because all of the doublewrite buffer will be used for flushing
      batches. If a page needs to be evicted from the buffer pool and all
      100 least recently used pages in the buffer pool have unflushed
      changes, buf_LRU_get_free_block() will execute buf_flush_lists() to
      write out and evict innodb_lru_flush_size pages. At most one thread
      will execute buf_flush_lists() in buf_LRU_get_free_block(); other
      threads will wait for that LRU flushing batch to finish.
      
      To improve concurrency, we will replace the InnoDB ib_mutex_t and
      os_event_t native mutexes and condition variables in this area of code.
      Most notably, this means that the buffer pool mutex (buf_pool.mutex)
      is no longer instrumented via any InnoDB interfaces. It will continue
      to be instrumented via PERFORMANCE_SCHEMA.
      
      For now, both buf_pool.flush_list_mutex and buf_pool.mutex will be
      declared with MY_MUTEX_INIT_FAST (PTHREAD_MUTEX_ADAPTIVE_NP). The critical
      sections of buf_pool.flush_list_mutex should be shorter than those for
      buf_pool.mutex, because in the worst case, they cover a linear scan of
      buf_pool.flush_list, while the worst case of a critical section of
      buf_pool.mutex covers a linear scan of the potentially much longer
      buf_pool.LRU list.
      
      mysql_mutex_is_owner(), safe_mutex_is_owner(): New predicate, usable
      with SAFE_MUTEX. Some InnoDB debug assertions need this predicate
      instead of mysql_mutex_assert_owner() or mysql_mutex_assert_not_owner().
      
      buf_pool_t::n_flush_LRU, buf_pool_t::n_flush_list:
      Replaces buf_pool_t::init_flush[] and buf_pool_t::n_flush[].
      The number of active flush operations.
      
      buf_pool_t::mutex, buf_pool_t::flush_list_mutex: Use mysql_mutex_t
      instead of ib_mutex_t, to have native mutexes with PERFORMANCE_SCHEMA
      and SAFE_MUTEX instrumentation.
      
      buf_pool_t::done_flush_LRU: Condition variable for !n_flush_LRU.
      
      buf_pool_t::done_flush_list: Condition variable for !n_flush_list.
      
      buf_pool_t::do_flush_list: Condition variable to wake up the
      buf_flush_page_cleaner when a log checkpoint needs to be written
      or the server is being shut down. Replaces buf_flush_event.
      We will keep using timed waits (the page cleaner thread will wake
      _at least_ once per second), because the calculations for
      innodb_adaptive_flushing depend on fixed time intervals.
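      The timed-wait pattern can be sketched with standard C++ primitives (names are stand-ins; InnoDB uses mysql_mutex_t and its own condition-variable wrappers): the cleaner wakes on request_flush() or, at the latest, after one second, so adaptive-flushing bookkeeping always runs on a fixed cadence.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Illustrative sketch of the do_flush_list wait style described above.
struct cleaner_sketch
{
    std::mutex m;
    std::condition_variable do_flush;
    bool work_pending = false;

    // Returns true if woken by a request, false on the 1s timeout tick.
    bool wait_for_work()
    {
        std::unique_lock<std::mutex> lk(m);
        bool signalled = do_flush.wait_for(lk, std::chrono::seconds(1),
                                           [this] { return work_pending; });
        work_pending = false;
        return signalled;
    }

    void request_flush()
    {
        { std::lock_guard<std::mutex> lk(m); work_pending = true; }
        do_flush.notify_one();
    }
};
```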
      
      buf_dblwr: Allocate statically, and move all code to member functions.
      Use a native mutex and condition variable. Remove code to deal with
      single-page flushing.
      
      buf_dblwr_check_block(): Make the check debug-only. We were spending
      a significant amount of execution time in page_simple_validate_new().
      
      flush_counters_t::unzip_LRU_evicted: Remove.
      
      IORequest: Make more members const. FIXME: m_fil_node should be removed.
      
      buf_flush_sync_lsn: Protect by std::atomic, not page_cleaner.mutex
      (which we are removing).
      
      page_cleaner_slot_t, page_cleaner_t: Remove many redundant members.
      
      pc_request_flush_slot(): Replaces pc_request() and pc_flush_slot().
      
      recv_writer_thread: Remove. Recovery works just fine without it, if we
      simply invoke buf_flush_sync() at the end of each batch in
      recv_sys_t::apply().
      
      recv_recovery_from_checkpoint_finish(): Remove. We can simply call
      recv_sys.debug_free() directly.
      
      srv_started_redo: Replaces srv_start_state.
      
      SRV_SHUTDOWN_FLUSH_PHASE: Remove. logs_empty_and_mark_files_at_shutdown()
      can communicate with the normal page cleaner loop via the new function
      flush_buffer_pool().
      
      buf_flush_remove(): Assert that the calling thread is holding
      buf_pool.flush_list_mutex. This removes unnecessary mutex operations
      from buf_flush_remove_pages() and buf_flush_dirty_pages(),
      which replace buf_LRU_flush_or_remove_pages().
      
      buf_flush_lists(): Renamed from buf_flush_batch(), with simplified
      interface. Return the number of flushed pages. Clarified comments and
      renamed min_n to max_n. Identify LRU batch by lsn=0. Merge all the functions
      buf_flush_start(), buf_flush_batch(), buf_flush_end() directly to this
      function, which was their only caller, and remove 2 unnecessary
      buf_pool.mutex release/re-acquisition that we used to perform around
      the buf_flush_batch() call. At the start, if not all log has been
      durably written, wait for a background task to do it, or start a new
      task to do it. This allows the log write to run concurrently with our
      page flushing batch. Any pages that were skipped due to too recent
      FIL_PAGE_LSN or due to them being latched by a writer should be flushed
      during the next batch, unless there are further modifications to those
      pages. It is possible that a page that we must flush due to small
      oldest_modification also carries a recent FIL_PAGE_LSN or is being
      constantly modified. In the worst case, all writers would then end up
      waiting in log_free_check() to allow the flushing and the checkpoint
      to complete.
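The idea of letting the log write run concurrently with the flush batch can be sketched like this. The names (flushed_to_disk_lsn, write_log_up_to, flush_batch) are stand-ins for illustration, not InnoDB's interfaces:

```cpp
#include <atomic>
#include <cassert>
#include <future>

// Durable log position; advanced by the (pretend) log writer.
std::atomic<unsigned long long> flushed_to_disk_lsn{0};

void write_log_up_to(unsigned long long lsn) {
  // Pretend to durably write the log up to lsn.
  unsigned long long cur = flushed_to_disk_lsn.load();
  while (cur < lsn &&
         !flushed_to_disk_lsn.compare_exchange_weak(cur, lsn)) {}
}

// Returns the number of "pages" flushed. Pages whose FIL_PAGE_LSN is
// newer than the durable LSN would be skipped and picked up by the
// next batch.
int flush_batch(unsigned long long batch_end_lsn) {
  // Start the log write in a background task instead of blocking the
  // batch on it; the page writes below proceed concurrently.
  auto log_task = std::async(std::launch::async, write_log_up_to,
                             batch_end_lsn);
  int flushed = 0;
  // ... page flushing would proceed here, concurrently with log_task ...
  ++flushed;
  log_task.wait();
  return flushed;
}
```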
      
      buf_do_flush_list_batch(): Clarify comments, and rename min_n to max_n.
      Cache the last looked up tablespace. If neighbor flushing is not applicable,
      invoke buf_flush_page() directly, avoiding a page lookup in between.
      
      buf_flush_space(): Auxiliary function to look up a tablespace for
      page flushing.
      
      buf_flush_page(): Defer the computation of space->full_crc32(). Never
      call log_write_up_to(), but instead skip persistent pages whose latest
      modification (FIL_PAGE_LSN) is newer than the redo log. Also skip
      pages on which we cannot acquire a shared latch without waiting.
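The two skip conditions can be illustrated with a small sketch. The names page_sketch and try_flush are invented; std::shared_mutex stands in for the block latch:

```cpp
#include <cassert>
#include <shared_mutex>

struct page_sketch {
  std::shared_mutex latch;          // stands in for the block latch
  unsigned long long page_lsn = 0;  // FIL_PAGE_LSN of the page
};

// A page is flushed only if its newest modification is already durable
// (page_lsn <= durable_lsn) and a shared latch can be acquired without
// waiting; otherwise it is skipped and left for a later batch.
bool try_flush(page_sketch& p, unsigned long long durable_lsn) {
  if (p.page_lsn > durable_lsn)
    return false;  // skip: flushing would require log_write_up_to()
  if (!p.latch.try_lock_shared())
    return false;  // skip: a writer is holding the latch
  // ... write the page to the data file here ...
  p.latch.unlock_shared();
  return true;
}
```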
      
      buf_flush_try_neighbors(): Do not bother checking buf_fix_count
      because buf_flush_page() will no longer wait for the page latch.
      Take the tablespace as a parameter, and only execute this function
      when innodb_flush_neighbors>0. Avoid repeated calls of page_id_t::fold().
      
      buf_flush_relocate_on_flush_list(): Declare as cold, and push down
      a condition from the callers.
      
      buf_flush_check_neighbor(): Take id.fold() as a parameter.
      
      buf_flush_sync(): Ensure that the buf_pool.flush_list is empty,
      because the flushing batch will skip pages whose modifications have
      not yet been written to the log or were latched for modification.
      
      buf_free_from_unzip_LRU_list_batch(): Remove redundant local variables.
      
      buf_flush_LRU_list_batch(): Let the caller buf_do_LRU_batch() initialize
      the counters, and report n->evicted.
      Cache the last looked up tablespace. If neighbor flushing is not applicable,
      invoke buf_flush_page() directly, avoiding a page lookup in between.
      
      buf_do_LRU_batch(): Return the number of pages flushed.
      
      buf_LRU_free_page(): Only release and re-acquire buf_pool.mutex if
      adaptive hash index entries are pointing to the block.
      
      buf_LRU_get_free_block(): Do not wake up the page cleaner, because it
      will no longer perform any useful work for us, and we do not want it
      to compete for I/O while buf_flush_lists(innodb_lru_flush_size, 0)
      writes out and evicts at most innodb_lru_flush_size pages. (The
      function buf_do_LRU_batch() may complete after writing fewer pages if
      more than innodb_lru_scan_depth pages end up in buf_pool.free list.)
      Eliminate some mutex release-acquire cycles, and wait for the LRU
      flush batch to complete before rescanning.
      
      buf_LRU_check_size_of_non_data_objects(): Simplify the code.
      
      buf_page_write_complete(): Remove the parameter evict, and always
      evict pages that were part of an LRU flush.
      
      buf_page_create(): Take a pre-allocated page as a parameter.
      
      buf_pool_t::free_block(): Free a pre-allocated block.
      
      recv_sys_t::recover_low(), recv_sys_t::apply(): Preallocate the block
      while not holding recv_sys.mutex. During page allocation, we may
      initiate a page flush, which in turn may initiate a log flush, which
      would require acquiring log_sys.mutex, which should always be acquired
      before recv_sys.mutex in order to avoid deadlocks. Therefore, we must
      not be holding recv_sys.mutex while allocating a buffer pool block.
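The lock-ordering discipline described above can be sketched as follows; log_mtx and recv_mtx are illustrative stand-ins for log_sys.mutex and recv_sys.mutex:

```cpp
#include <cassert>
#include <mutex>

std::mutex log_mtx;   // stands in for log_sys.mutex
std::mutex recv_mtx;  // stands in for recv_sys.mutex

// Block allocation may trigger a page flush and thus a log write, so it
// may internally need log_mtx. By the ordering rule (log_mtx before
// recv_mtx), recv_mtx must NOT be held when this is called.
int allocate_block_may_flush() {
  std::lock_guard<std::mutex> lg(log_mtx);
  return 1;  // pretend this is the preallocated block
}

int recover_page() {
  // Preallocate the block before taking recv_mtx, honouring the ordering.
  int block = allocate_block_may_flush();
  std::lock_guard<std::mutex> rg(recv_mtx);
  // ... apply redo log records to the preallocated block ...
  return block;
}
```

Taking the mutexes in the opposite order in two threads would allow a classic ABBA deadlock; preallocating outside recv_mtx sidesteps it.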
      
      BtrBulk::logFreeCheck(): Skip a redundant condition.
      
      row_undo_step(): Do not invoke srv_inc_activity_count() for every row
      that is being rolled back. It should suffice to invoke the function in
      trx_flush_log_if_needed() during trx_t::commit_in_memory() when the
      rollback completes.
      
      sync_check_enable(): Remove. We will enable innodb_sync_debug from the
      very beginning.
      
      Reviewed by: Vladislav Vaintroub
      7cffb5f6
  12. 20 Aug, 2020 1 commit
    • Marko Mäkelä's avatar
      MDEV-23514 Race conditions between ROLLBACK and ALTER TABLE · 22c4a751
      Marko Mäkelä authored
      Since commit 15093639 (MDEV-23484)
      the rollback of InnoDB transactions is no longer protected by
      dict_operation_lock. Removing that protection revealed a race
      condition between transaction rollback and the rollback of an
      online table-rebuilding operation (OPTIMIZE TABLE, or any online
      ALTER TABLE that is rebuilding the table).
      
      row_undo_mod_clust(): Re-check dict_index_is_online_ddl() after
      acquiring index->lock, similar to how row_undo_ins_remove_clust_rec()
      is doing it. Because innobase_online_rebuild_log_free() is holding
      exclusive index->lock while invoking row_log_free(), this re-check
      will ensure that row_log_table_low() will not be invoked when
      index->online_log=NULL.
      
      A different race condition is possible between the rollback of a
      recovered transaction and the start of online secondary index creation.
      Because prepare_inplace_alter_table_dict() is not acquiring an InnoDB
      table lock in this case, and because recovered transactions are not
      covered by metadata locks (MDL), the dict_table_t::indexes could be
      modified by prepare_inplace_alter_table_dict() while the rollback of
      a recovered transaction is being executed. Normal transactions would
      be covered by MDL, and during prepare_inplace_alter_table_dict() we
      do hold MDL_EXCLUSIVE, that is, an online ALTER TABLE operation may
      not execute concurrently with other transactions that have accessed
      the table.
      
      row_undo(): To prevent a race condition with
      prepare_inplace_alter_table_dict(), acquire dict_operation_lock
      for all recovered transactions. Before MDEV-23484 we used to acquire
      it for all transactions, not only recovered ones.
      
      Note: row_merge_drop_indexes() would not invoke
      dict_index_remove_from_cache() while transactional locks
      exist on the table, or while any thread is holding an open table handle.
      OK, it does that for FULLTEXT INDEX, but ADD FULLTEXT INDEX is not
      supported as an online operation, and therefore
      prepare_inplace_alter_table_dict() would acquire a table S lock,
      which cannot succeed as long as recovered transactions on the table
      exist, because they would hold a conflicting IX lock on the table.
      22c4a751
  13. 18 Aug, 2020 1 commit
    • Marko Mäkelä's avatar
      MDEV-23484 Rollback unnecessarily acquires dict_operation_lock for every row · 15093639
      Marko Mäkelä authored
      InnoDB transaction rollback includes an unnecessary work-around for
      a data corruption bug that was fixed by me in MySQL 5.6.12
      mysql/mysql-server@935ba09d52c1908bde273ad1940b5ab919d9763d
      and ported to MariaDB 10.0.8 by
      commit c291ddfd
      in 2013 and 2014, respectively.
      
      By acquiring and releasing dict_operation_lock in shared mode,
      row_undo() hopes to prevent the table from being dropped while
      the undo log record is being rolled back. But, thanks to mentioned fix,
      debug assertions (that we are adding) show that the rollback is
      protected by transactional locks (table IX lock, in addition to
      implicit or explicit exclusive locks on the records that had been modified).
      
      Because row_drop_table_for_mysql() would invoke
      row_add_table_to_background_drop_list() if any locks exist on the table,
      the mere existence of locks (which is guaranteed during ROLLBACK) is
      enough to protect the table from disappearing. Hence, acquiring and
      releasing dict_operation_lock for every row that is being rolled back is
      unnecessary.
      
      row_undo(): Remove the unnecessary acquisition and release of
      dict_operation_lock.
      
      Note: row_add_table_to_background_drop_list() is mostly working around
      bugs outside InnoDB:
      MDEV-21175 (insufficient MDL protection of FOREIGN KEY operations)
      MDEV-21602 (incorrect error handling of CREATE TABLE...SELECT).
      15093639
  14. 05 Jun, 2020 1 commit
    • Marko Mäkelä's avatar
      MDEV-15053 Reduce buf_pool_t::mutex contention · b1ab211d
      Marko Mäkelä authored
      User-visible changes: The INFORMATION_SCHEMA views INNODB_BUFFER_PAGE
      and INNODB_BUFFER_PAGE_LRU will report a dummy value FLUSH_TYPE=0
      and will no longer report the PAGE_STATE value READY_FOR_USE.
      
      We will remove some fields from buf_page_t and move much code to
      member functions of buf_pool_t and buf_page_t, so that the access
      rules of data members can be enforced consistently.
      
      Evicting or adding pages in buf_pool.LRU will remain covered by
      buf_pool.mutex.
      
      Evicting or adding pages in buf_pool.page_hash will remain
      covered by both buf_pool.mutex and the buf_pool.page_hash X-latch.
      
      After this fix, buf_pool.page_hash lookups can entirely
      avoid acquiring buf_pool.mutex, only relying on
      buf_pool.hash_lock_get() S-latch.
      
      Similarly, buf_flush_check_neighbors() will rely solely on
      buf_pool.mutex, with no buf_pool.page_hash latch at all.
      
      The buf_pool.mutex is rather contended in I/O heavy benchmarks,
      especially when the workload does not fit in the buffer pool.
      
      The first attempt to alleviate the contention was the
      buf_pool_t::mutex split in
      commit 4ed7082e
      which introduced buf_block_t::mutex, which we are now removing.
      
      Later, multiple instances of buf_pool_t were introduced
      in commit c18084f7
      and recently removed by us in
      commit 1a6f708e (MDEV-15058).
      
      UNIV_BUF_DEBUG: Remove. This option to enable some buffer pool
      related debugging in otherwise non-debug builds has not been used
      for years. Instead, we have been using UNIV_DEBUG, which is enabled
      in CMAKE_BUILD_TYPE=Debug.
      
      buf_block_t::mutex, buf_pool_t::zip_mutex: Remove. We can mainly rely on
      std::atomic and the buf_pool.page_hash latches, and in some cases
      depend on buf_pool.mutex or buf_pool.flush_list_mutex just like before.
      We must always release buf_block_t::lock before invoking
      unfix() or io_unfix(), to prevent a glitch where a block that was
      added to the buf_pool.free list would appear X-latched. See
      commit c5883deb how this glitch
      was finally caught in a debug environment.
      
      We move some buf_pool_t::page_hash specific code from the
      ha and hash modules to buf_pool, for improved readability.
      
      buf_pool_t::close(): Assert that all blocks are clean, except
      on aborted startup or crash-like shutdown.
      
      buf_pool_t::validate(): No longer attempt to validate
      n_flush[] against the number of BUF_IO_WRITE fixed blocks,
      because buf_page_t::flush_type no longer exists.
      
      buf_pool_t::watch_set(): Replaces buf_pool_watch_set().
      Reduce mutex contention by separating the buf_pool.watch[]
      allocation and the insert into buf_pool.page_hash.
      
      buf_pool_t::page_hash_lock<bool exclusive>(): Acquire a
      buf_pool.page_hash latch.
      Replaces and extends buf_page_hash_lock_s_confirm()
      and buf_page_hash_lock_x_confirm().
      
      buf_pool_t::READ_AHEAD_PAGES: Renamed from BUF_READ_AHEAD_PAGES.
      
      buf_pool_t::curr_size, old_size, read_ahead_area, n_pend_reads:
      Use Atomic_counter.
      
      buf_pool_t::running_out(): Replaces buf_LRU_buf_pool_running_out().
      
      buf_pool_t::LRU_remove(): Remove a block from the LRU list
      and return its predecessor. Incorporates buf_LRU_adjust_hp(),
      which was removed.
      
      buf_page_get_gen(): Remove a redundant call of fsp_is_system_temporary(),
      for mode == BUF_GET_IF_IN_POOL_OR_WATCH, which is only used by
      BTR_DELETE_OP (purge), which is never invoked on temporary tables.
      
      buf_free_from_unzip_LRU_list_batch(): Avoid redundant assignments.
      
      buf_LRU_free_from_unzip_LRU_list(): Simplify the loop condition.
      
      buf_LRU_free_page(): Clarify the function comment.
      
      buf_flush_check_neighbor(), buf_flush_check_neighbors():
      Rewrite the construction of the page hash range. We will hold
      the buf_pool.mutex for up to buf_pool.read_ahead_area (at most 64)
      consecutive lookups of buf_pool.page_hash.
      
      buf_flush_page_and_try_neighbors(): Remove.
      Merge to its only callers, and remove redundant operations in
      buf_flush_LRU_list_batch().
      
      buf_read_ahead_random(), buf_read_ahead_linear(): Rewrite.
      Do not acquire buf_pool.mutex, and iterate directly with page_id_t.
      
      ut_2_power_up(): Remove. my_round_up_to_next_power() is inlined
      and avoids any loops.
      
      fil_page_get_prev(), fil_page_get_next(), fil_addr_is_null(): Remove.
      
      buf_flush_page(): Add a fil_space_t* parameter. Minimize the
      buf_pool.mutex hold time. buf_pool.n_flush[] is no longer updated
      atomically with the io_fix, and we will protect most buf_block_t
      fields with buf_block_t::lock. The function
      buf_flush_write_block_low() is removed and merged here.
      
      buf_page_init_for_read(): Use static linkage. Initialize the newly
      allocated block and acquire the exclusive buf_block_t::lock while not
      holding any mutex.
      
      IORequest::IORequest(): Remove the body. We only need to invoke
      set_punch_hole() in buf_flush_page() and nowhere else.
      
      buf_page_t::flush_type: Remove. Replaced by IORequest::flush_type.
      This field is only used during a fil_io() call.
      That function already takes IORequest as a parameter, so it is the
      natural place to carry this rarely changing field.
      
      buf_block_t::init(): Replaces buf_page_init().
      
      buf_page_t::init(): Replaces buf_page_init_low().
      
      buf_block_t::initialise(): Initialise many fields, but
      keep the buf_page_t::state(). Both buf_pool_t::validate() and
      buf_page_optimistic_get() require that buf_page_t::in_file()
      be protected atomically with buf_page_t::in_page_hash
      and buf_page_t::in_LRU_list.
      
      buf_page_optimistic_get(): Now that buf_block_t::mutex
      no longer exists, we must check buf_page_t::io_fix()
      after acquiring the buf_pool.page_hash lock, to detect
      whether buf_page_init_for_read() has been initiated.
      We will also check the io_fix() before acquiring hash_lock
      in order to avoid unnecessary computation.
      The field buf_block_t::modify_clock (protected by buf_block_t::lock)
      allows buf_page_optimistic_get() to validate the block.
      
      buf_page_t::real_size: Remove. It was only used while flushing
      pages of page_compressed tables.
      
      buf_page_encrypt(): Add an output parameter that allows us to eliminate
      buf_page_t::real_size. Replace a condition with debug assertion.
      
      buf_page_should_punch_hole(): Remove.
      
      buf_dblwr_t::add_to_batch(): Replaces buf_dblwr_add_to_batch().
      Add the parameter size (to replace buf_page_t::real_size).
      
      buf_dblwr_t::write_single_page(): Replaces buf_dblwr_write_single_page().
      Add the parameter size (to replace buf_page_t::real_size).
      
      fil_system_t::detach(): Replaces fil_space_detach().
      Ensure that fil_validate() will not be violated even if
      fil_system.mutex is released and reacquired.
      
      fil_node_t::complete_io(): Renamed from fil_node_complete_io().
      
      fil_node_t::close_to_free(): Replaces fil_node_close_to_free().
      Avoid invoking fil_node_t::close() because fil_system.n_open
      has already been decremented in fil_space_t::detach().
      
      BUF_BLOCK_READY_FOR_USE: Remove. Directly use BUF_BLOCK_MEMORY.
      
      BUF_BLOCK_ZIP_DIRTY: Remove. Directly use BUF_BLOCK_ZIP_PAGE,
      and distinguish dirty pages by buf_page_t::oldest_modification().
      
      BUF_BLOCK_POOL_WATCH: Remove. Use BUF_BLOCK_NOT_USED instead.
      This state was only being used for buf_page_t that are in
      buf_pool.watch.
      
      buf_pool_t::watch[]: Remove pointer indirection.
      
      buf_page_t::in_flush_list: Remove. It was set if and only if
      buf_page_t::oldest_modification() is nonzero.
      
      buf_page_decrypt_after_read(), buf_corrupt_page_release(),
      buf_page_check_corrupt(): Change the const fil_space_t* parameter
      to const fil_node_t& so that we can report the correct file name.
      
      buf_page_monitor(): Declare as an ATTRIBUTE_COLD global function.
      
      buf_page_io_complete(): Split to buf_page_read_complete() and
      buf_page_write_complete().
      
      buf_dblwr_t::in_use: Remove.
      
      buf_dblwr_t::buf_block_array: Add IORequest::flush_t.
      
      buf_dblwr_sync_datafiles(): Remove. It was a useless wrapper of
      os_aio_wait_until_no_pending_writes().
      
      buf_flush_write_complete(): Declare static, not global.
      Add the parameter IORequest::flush_t.
      
      buf_flush_freed_page(): Simplify the code.
      
      recv_sys_t::flush_lru: Renamed from flush_type and changed to bool.
      
      fil_read(), fil_write(): Replaced with direct use of fil_io().
      
      fil_buffering_disabled(): Remove. Check srv_file_flush_method directly.
      
      fil_mutex_enter_and_prepare_for_io(): Return the resolved
      fil_space_t* to avoid a duplicated lookup in the caller.
      
      fil_report_invalid_page_access(): Clean up the parameters.
      
      fil_io(): Return fil_io_t, which comprises fil_node_t and error code.
      Always invoke fil_space_t::acquire_for_io() and let either the
      sync=true caller or fil_aio_callback() invoke
      fil_space_t::release_for_io().
      
      fil_aio_callback(): Rewrite to replace buf_page_io_complete().
      
      fil_check_pending_operations(): Remove a parameter, and remove some
      redundant lookups.
      
      fil_node_close_to_free(): Wait for n_pending==0. Because we no longer
      do an extra lookup of the tablespace between fil_io() and the
      completion of the operation, we must give fil_node_t::complete_io() a
      chance to decrement the counter.
      
      fil_close_tablespace(): Remove unused parameter trx, and document
      that this is only invoked during the error handling of IMPORT TABLESPACE.
      
      row_import_discard_changes(): Merged with the only caller,
      row_import_cleanup(). Do not lock up the data dictionary while
      invoking fil_close_tablespace().
      
      logs_empty_and_mark_files_at_shutdown(): Do not invoke
      fil_close_all_files(), to avoid a !needs_flush assertion failure
      on fil_node_t::close().
      
      innodb_shutdown(): Invoke os_aio_free() before fil_close_all_files().
      
      fil_close_all_files(): Invoke fil_flush_file_spaces()
      to ensure proper durability.
      
      thread_pool::unbind(): Fix a crash that would occur on Windows
      after srv_thread_pool->disable_aio() and os_file_close().
      This fix was submitted by Vladislav Vaintroub.
      
      Thanks to Matthias Leich and Axel Schwenke for extensive testing,
      Vladislav Vaintroub for helpful comments, and Eugene Kosov for a review.
      b1ab211d
  15. 04 Jun, 2020 1 commit
    • Marko Mäkelä's avatar
      MDEV-22721 Remove bloat caused by InnoDB logger class · eba2d10a
      Marko Mäkelä authored
      Introduce a new ATTRIBUTE_NOINLINE to
      ib::logger member functions, and add UNIV_UNLIKELY hints to callers.
      
      Also, remove some crash reporting output. If needed, the
      information will be available using debugging tools.
      
      Furthermore, remove some fts_enable_diag_print output that included
      indexed words in raw form. The code seemed to assume that words are
      NUL-terminated byte strings. It is not clear whether a NUL terminator
      is always guaranteed to be present. Also, UCS2 or UTF-16 strings would
      typically contain many NUL bytes.
      eba2d10a
  16. 29 Apr, 2020 1 commit
  17. 01 Apr, 2020 1 commit
    • Marko Mäkelä's avatar
      MDEV-13626: Improve innodb.xa_recovery_debug · b1742a5c
      Marko Mäkelä authored
      Improve the test that was imported and adapted for MariaDB in
      commit fb217449.
      
      row_undo_step(): Move the DEBUG_SYNC point from trx_rollback_for_mysql().
      This DEBUG_SYNC point is executed after rolling back one row.
      
      trx_rollback_for_mysql(): Clarify the comments that describe the scenario,
      and remove the DEBUG_SYNC point.
      
      If the statement "if (trx->has_logged_persistent())" and its body are
      removed from trx_rollback_for_mysql(), then the test
      innodb.xa_recovery_debug will fail because the transaction would still
      exist in the XA PREPARE state. If we allow the XA COMMIT statement
      to succeed in the test, we would observe an incorrect state of the
      XA transaction where the table would contain row (1,NULL). Depending
      on whether the XA transaction was committed, the table should either
      be empty or contain the record (1,1). The intermediate state of
      (1,NULL) should never be observed after completed recovery.
      b1742a5c
  18. 11 Mar, 2020 1 commit
  19. 12 Dec, 2019 1 commit
    • Eugene Kosov's avatar
      MDEV-20950 Reduce size of record offsets · f0aa073f
      Eugene Kosov authored
      offset_t: a type that represents one record offset.
      It is an unsigned short int.
      
      a lot of functions: replace ulint with offset_t
      
      btr_pcur_restore_position_func(),
      page_validate(),
      row_ins_scan_sec_index_for_duplicate(),
      row_upd_clust_rec_by_insert_inherit_func(),
      row_vers_impl_x_locked_low(),
      trx_undo_prev_version_build():
        allocate record offsets on the stack instead of letting rec_get_offsets()
        allocate them from mem_heap_t, thus reducing memory allocations.
      
      RECORD_OFFSET, INDEX_OFFSET:
        storing pointers in an offset_t* array is now less convenient: one
        pointer occupies several offset_t elements. These constants are the
        start indexes into the array where the pointer values are stored.
      
      REC_OFFS_HEADER_SIZE: adjusted for the new reality
      
      REC_OFFS_NORMAL_SIZE:
        increase the size from 100 to 300, which means fewer heap allocations.
        sizeof(offset_t[REC_OFFS_NORMAL_SIZE]) is now 600 bytes, smaller than
        the previous 800 bytes.
      
      REC_OFFS_SEC_INDEX_SIZE: adjusted for the new reality
      
      rem0rec.h, rem0rec.ic, rem0rec.cc:
        the types of various arguments, return values and local variables were
        changed to fix numerous integer conversion issues.
      
      enum field_type_t:
        introduces the concept of offset types, replacing the old offset flags.
        As in the earlier version, the 2 upper bits store the offset type,
        and this enum represents those types.
      
      REC_OFFS_SQL_NULL, REC_OFFS_MASK: removed
      
      get_type(), set_type(), get_value(), combine():
        convenience functions for working with offsets and their types
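The 2-upper-bit packing can be sketched as below. The layout (two type bits, 14 value bits) follows the description above; the enumerator names and values are illustrative, not necessarily InnoDB's:

```cpp
#include <cassert>
#include <cstdint>

// A 16-bit offset whose two upper bits carry the offset type and whose
// lower 14 bits carry the offset value itself.
using offset_t = uint16_t;

enum field_type_t : uint16_t {
  OFFSET_DATA = 0u << 14,  // ordinary stored field (illustrative value)
  OFFSET_NULL = 1u << 14,  // SQL NULL field (illustrative value)
  OFFSET_EXT  = 2u << 14,  // externally stored field (illustrative value)
};

constexpr offset_t TYPE_MASK  = 0xC000;  // two upper bits
constexpr offset_t VALUE_MASK = 0x3FFF;  // lower 14 bits

constexpr field_type_t get_type(offset_t o) {
  return static_cast<field_type_t>(o & TYPE_MASK);
}
constexpr offset_t get_value(offset_t o) { return o & VALUE_MASK; }
constexpr offset_t combine(offset_t value, field_type_t type) {
  return static_cast<offset_t>((value & VALUE_MASK) | type);
}
```

With 14 value bits an offset can address up to 16383 bytes, comfortably covering a 16 KiB page.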
      
      rec_offs_base()[0]:
        still uses an old scheme with flags REC_OFFS_COMPACT and REC_OFFS_EXTERNAL
      
      rec_offs_base()[i]:
        these now have type offset_t. The two upper bits contain the type.
      f0aa073f
  20. 03 Dec, 2019 1 commit
    • Marko Mäkelä's avatar
      MDEV-21174: Replace mlog_write_ulint() with mtr_t::write() · 56f6dab1
      Marko Mäkelä authored
      mtr_t::write(): Replaces mlog_write_ulint(), mlog_write_ull().
      Optimize away writes if the page contents does not change,
      except when a dummy write has been explicitly requested.
      
      Because the member function template takes a block descriptor as a
      parameter, it is possible to introduce better consistency checks.
      Due to this, the code for handling file-based lists, undo logs
      and user transactions was refactored to pass around buf_block_t.
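The optimization can be sketched as a compare-before-write helper. This is an illustration of the idea only; mini_transaction_sketch and its members are invented names, not mtr_t's interface:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

struct mini_transaction_sketch {
  int log_records = 0;  // how many redo records this mtr generated

  // Write len bytes of val to field. If the bytes are already identical
  // and no dummy write was requested, skip both the page modification
  // and the redo log record. Returns true if the page was modified.
  bool write(uint8_t* field, const uint8_t* val, size_t len,
             bool force = false) {
    if (!force && std::memcmp(field, val, len) == 0)
      return false;    // contents unchanged: optimize the write away
    std::memcpy(field, val, len);
    ++log_records;     // a redo log record would be emitted here
    return true;
  }
};
```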
      56f6dab1
  21. 17 May, 2019 1 commit
  22. 11 May, 2019 1 commit
  23. 14 Mar, 2019 1 commit
  24. 22 Nov, 2018 1 commit
    • Marko Mäkelä's avatar
      MDEV-17794 Do not assign persistent ID for temporary tables · 4be0855c
      Marko Mäkelä authored
      InnoDB in MySQL 5.7 introduced two new parameters to the function
      dict_hdr_get_new_id(), to allow redo logging to be disabled when
      assigning identifiers to temporary tables or during the
      backup-unfriendly TRUNCATE TABLE that was replaced in MariaDB
      by MDEV-13564.
      
      Now that MariaDB 10.4.0 removed the crash recovery code for the
      backup-unfriendly TRUNCATE, we can revert dict_hdr_get_new_id()
      to be used only for persistent data structures.
      
      dict_table_assign_new_id(): Remove. This was a simple 2-line function
      that was called from few places.
      
      dict_table_open_on_id_low(): Declare in the only file where it
      is called.
      
      dict_sys_t::temp_id_hash: A separate lookup table for temporary tables.
      Table names will be in the common dict_sys_t::table_hash.
      
      dict_sys_t::get_temporary_table_id(): Assign a temporary table ID.
      
      dict_sys_t::get_table(): Look up a persistent table.
      
      dict_sys_t::get_temporary_table(): Look up a temporary table.
      
      dict_sys_t::temp_table_id: The sequence of temporary table identifiers.
      Starts from DICT_HDR_FIRST_ID, so that we can continue to simply compare
      dict_table_t::id to a few constants for the persistent hard-coded
      data dictionary tables.
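The temporary-ID sequence can be sketched as a simple atomic counter. The value chosen for DICT_HDR_FIRST_ID here is an assumption for the example; only the property that hard-coded dictionary tables sit below it is taken from the description:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Illustrative value; the real constant lives in the InnoDB sources.
constexpr uint64_t DICT_HDR_FIRST_ID = 10;

// Temporary table IDs need no redo logging and no persistence, so a
// process-local atomic counter suffices.
std::atomic<uint64_t> temp_table_id{DICT_HDR_FIRST_ID};

uint64_t get_temporary_table_id() { return temp_table_id.fetch_add(1); }

// Because the sequence starts at DICT_HDR_FIRST_ID, the existing simple
// comparisons for the hard-coded dictionary tables keep working.
bool is_hard_coded_dict_table(uint64_t id) {
  return id < DICT_HDR_FIRST_ID;
}
```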
      
      undo_node_t::state: Distinguish temporary and persistent tables.
      
      lock_check_dict_lock(), lock_get_table_id(): Assert that there cannot
      be locks on temporary tables.
      
      row_rec_to_index_entry_impl(): Assert that there cannot be metadata
      records on temporary tables.
      
      row_undo_ins_parse_undo_rec(): Distinguish temporary and persistent tables.
      Move some assertions from the only caller. Return whether the table was
      found.
      
      row_undo_ins(): Add some assertions.
      
      row_undo_mod_clust(), row_undo_mod(): Do not assign node->state.
      Let row_undo() do that.
      
      row_undo_mod_parse_undo_rec(): Distinguish temporary and persistent tables.
      Move some assertions from the only caller. Return whether the table was
      found.
      
      row_undo_try_truncate(): Renamed and simplified from trx_roll_try_truncate().
      
      row_undo_rec_get(): Replaces trx_roll_pop_top_rec_of_trx() and
      trx_roll_pop_top_rec(). Fetch an undo log record, and assign undo->state
      accordingly.
      
      trx_undo_truncate_end(): Acquire the rseg->mutex only for the minimum
      required duration, and release it between mini-transactions.
      4be0855c
  25. 19 Nov, 2018 1 commit
  26. 19 Oct, 2018 1 commit
    • Marko Mäkelä's avatar
      MDEV-15662 Instant DROP COLUMN or changing the order of columns · 0e5a4ac2
      Marko Mäkelä authored
      Allow ADD COLUMN anywhere in a table, not only adding as the
      last column.
      
      Allow instant DROP COLUMN and instant changing the order of columns.
      
      The added columns will always be added last in clustered index records.
      In new records, instantly dropped columns will be stored as NULL or
      empty when possible.
      
      Information about dropped and reordered columns will be written in
      a metadata BLOB (mblob), which is stored before the first 'user' field
      in the hidden metadata record at the start of the clustered index.
      The presence of mblob is indicated by setting the delete-mark flag in
      the metadata record.
      
      The metadata BLOB stores the number of clustered index fields,
      followed by an array of column information for each field.
      For dropped columns, we store the NOT NULL flag, the fixed length,
      and for variable-length columns, whether the maximum length exceeded
      255 bytes. For non-dropped columns, we store the column position.
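The logical content of the per-field entries can be sketched as below. The byte layout here is invented for illustration; only the stored attributes (field count, then per dropped column the NOT NULL flag, fixed length and the >255-byte flag, and per surviving column its position) follow the description above:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct field_meta {
  bool dropped = false;
  // for dropped columns:
  bool not_null = false;
  uint16_t fixed_len = 0;          // 0 = variable-length column
  bool long_true_varchar = false;  // maximum length exceeded 255 bytes
  // for surviving columns:
  uint16_t position = 0;
};

// Serialize the field count followed by one entry per clustered index
// field, mirroring the logical contents of the metadata BLOB.
std::vector<uint8_t> serialise(const std::vector<field_meta>& fields) {
  std::vector<uint8_t> blob;
  blob.push_back(static_cast<uint8_t>(fields.size()));
  for (const field_meta& f : fields) {
    if (f.dropped) {
      blob.push_back(1);  // marker: dropped column
      blob.push_back(f.not_null);
      blob.push_back(static_cast<uint8_t>(f.fixed_len));
      blob.push_back(f.long_true_varchar);
    } else {
      blob.push_back(0);  // marker: surviving column
      blob.push_back(static_cast<uint8_t>(f.position));
    }
  }
  return blob;
}
```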
      
      Unlike with MDEV-11369, when a table becomes empty, it cannot
      be converted back to the canonical format. The reason for this is
      that other threads may hold cached objects such as
      row_prebuilt_t::ins_node that could refer to dropped or reordered
      index fields.
      
      For instant DROP COLUMN and ROW_FORMAT=COMPACT or ROW_FORMAT=DYNAMIC,
      we must store the n_core_null_bytes in the root page, so that the
      chain of node pointer records can be followed in order to reach the
      leftmost leaf page where the metadata record is located.
      If the mblob is present, we will zero-initialize the strings
      "infimum" and "supremum" in the root page, and use the last byte of
      "supremum" for storing the number of null bytes (which are allocated
      but useless on node pointer pages). This is necessary for
      btr_cur_instant_init_metadata() to be able to navigate to the mblob.
      
      If the PRIMARY KEY contains any variable-length column and some
      nullable columns were instantly dropped, the dict_index_t::n_nullable
      in the data dictionary could be smaller than it actually is in the
      non-leaf pages. Because of this, the non-leaf pages could use more
      bytes for the null flags than the data dictionary expects, and we
      could be reading the lengths of the variable-length columns from the
      wrong offset, and thus reading the child page number from wrong place.
      This is the result of two design mistakes that involve unnecessary
      storage of data: First, it is nonsense to store any data fields for
      the leftmost node pointer records, because the comparisons would be
      resolved by the MIN_REC_FLAG alone. Second, there cannot be any null
      fields in the clustered index node pointer fields, but we nevertheless
      reserve space for all the null flags.
      
      Limitations (future work):
      
      MDEV-17459 Allow instant ALTER TABLE even if FULLTEXT INDEX exists
      MDEV-17468 Avoid table rebuild on operations on generated columns
      MDEV-17494 Refuse ALGORITHM=INSTANT when the row size is too large
      
      btr_page_reorganize_low(): Preserve any metadata in the root page.
      Call lock_move_reorganize_page() only after restoring the "infimum"
      and "supremum" records, to avoid a memcmp() assertion failure.
      
      dict_col_t::DROPPED: Magic value for dict_col_t::ind.
      
      dict_col_t::clear_instant(): Renamed from dict_col_t::remove_instant().
      Do not assert that the column was instantly added, because we
      sometimes call this unconditionally for all columns.
      Convert an instantly added column to a "core column". The old name
      remove_instant() could be mistaken to refer to "instant DROP COLUMN".
      
      dict_col_t::is_added(): Rename from dict_col_t::is_instant().
      
      dtype_t::metadata_blob_init(): Initialize the mblob data type.
      
      dtuple_t::is_metadata(), dtuple_t::is_alter_metadata(),
      upd_t::is_metadata(), upd_t::is_alter_metadata(): Check if info_bits
      refer to a metadata record.
      
      dict_table_t::instant: Metadata about dropped or reordered columns.
      
      dict_table_t::prepare_instant(): Prepare
      ha_innobase_inplace_ctx::instant_table for instant ALTER TABLE.
      innobase_instant_try() will pass this to dict_table_t::instant_column().
      On rollback, dict_table_t::rollback_instant() will be called.
      
      dict_table_t::instant_column(): Renamed from instant_add_column().
      Add the parameter col_map so that columns can be reordered.
      Copy and adjust v_cols[] as well.
      
      dict_table_t::find(): Find an old column based on a new column number.
      
      dict_table_t::serialise_columns(), dict_table_t::deserialise_columns():
      Convert the mblob.
      
      dict_index_t::instant_metadata(): Create the metadata record
      for instant ALTER TABLE. Invoke dict_table_t::serialise_columns().
      
      dict_index_t::reconstruct_fields(): Invoked by
      dict_table_t::deserialise_columns().
      
      dict_index_t::clear_instant_alter(): Move the fields for the
      dropped columns to the end, and sort the surviving index fields
      in ascending order of column position.
      
      ha_innobase::check_if_supported_inplace_alter(): Do not allow
      adding a FTS_DOC_ID column if a hidden FTS_DOC_ID column exists
      due to FULLTEXT INDEX. (This always required ALGORITHM=COPY.)
      
      instant_alter_column_possible(): Add a parameter for InnoDB table,
      to check for additional conditions, such as the maximum number of
      index fields.
      
      ha_innobase_inplace_ctx::first_alter_pos: The first column whose position
      is affected by instant ADD, DROP, or changing the order of columns.
      
      innobase_build_col_map(): Skip added virtual columns.
      
      prepare_inplace_add_virtual(): Correctly compute num_to_add_vcol.
      Remove some unnecessary code. Note that the call to
      innodb_base_col_setup() should be executed later.
      
      commit_try_norebuild(): If ctx->is_instant(), let the virtual
      columns be added or dropped by innobase_instant_try().
      
      innobase_instant_try(): Fill in a zero default value for the
      hidden column FTS_DOC_ID (to reduce the work needed in MDEV-17459).
      If any columns were dropped or reordered (or added not last),
      delete any SYS_COLUMNS records for the following columns, and
      insert SYS_COLUMNS records for all subsequent stored columns as well
      as for all virtual columns. If any virtual column is dropped, rewrite
      all virtual column metadata. Use a shortcut only for adding
      virtual columns. This is because innobase_drop_virtual_try()
      assumes that the dropped virtual columns still exist in ctx->old_table.
      
      innodb_update_cols(): Renamed from innodb_update_n_cols().
      
      innobase_add_one_virtual(), innobase_insert_sys_virtual(): Change
      the return type to bool, and invoke my_error() when detecting an error.
      
      innodb_insert_sys_columns(): Insert a record into SYS_COLUMNS.
      Refactored from innobase_add_one_virtual() and innobase_instant_add_col().
      
      innobase_instant_add_col(): Replace the parameter dfield with type.
      
      innobase_instant_drop_cols(): Drop matching columns from SYS_COLUMNS
      and all columns from SYS_VIRTUAL.
      
      innobase_add_virtual_try(), innobase_drop_virtual_try(): Let
      the caller invoke innodb_update_cols().
      
      innobase_rename_column_try(): Skip dropped columns.
      
      commit_cache_norebuild(): Update table->fts->doc_col.
      
      dict_mem_table_col_rename_low(): Skip dropped columns.
      
      trx_undo_rec_get_partial_row(): Skip dropped columns.
      
      trx_undo_update_rec_get_update(): Handle the metadata BLOB correctly.
      
      trx_undo_page_report_modify(): Avoid out-of-bounds access to record fields.
      Log metadata records consistently.
      Apparently, the first fields of a clustered index may be updated
      in an update_undo vector when the index is ID_IND of SYS_FOREIGN,
      as part of renaming the table during ALTER TABLE. Normally, updates of
      the PRIMARY KEY should be logged as delete-mark and an insert.
      
      row_undo_mod_parse_undo_rec(), row_purge_parse_undo_rec():
      Use trx_undo_metadata.
      
      row_undo_mod_clust_low(): On metadata rollback, roll back the root page too.
      
      row_undo_mod_clust(): Relax an assertion. The delete-mark flag was
      repurposed for ALTER TABLE metadata records.
      
      row_rec_to_index_entry_impl(): Add the template parameter mblob
      and the optional parameter info_bits for specifying the desired new
      info bits. For the metadata tuple, allow conversion between the original
      format (ADD COLUMN only) and the generic format (with hidden BLOB).
      Add the optional parameter "pad" to determine whether the tuple should
      be padded to the index fields (on ALTER TABLE it should), or whether
      it should remain at its original size (on rollback).
      
      row_build_index_entry_low(): Clean up the code, removing
      redundant variables and conditions. For instantly dropped columns,
      generate a dummy value that is NULL, the empty string, or a
      fixed length of NUL bytes, depending on the type of the dropped column.
      
      row_upd_clust_rec_by_insert_inherit_func(): On the update of PRIMARY KEY
      of a record that contained a dropped column whose value was stored
      externally, we will be inserting a dummy NULL or empty string value
      to the field of the dropped column. The externally stored column would
      eventually be dropped when purge removes the delete-marked record for
      the old PRIMARY KEY value.
      
      btr_index_rec_validate(): Recognize the metadata record.
      
      btr_discard_only_page_on_level(): Preserve the generic instant
      ALTER TABLE metadata.
      
      btr_set_instant(): Replaces page_set_instant(). This sets a clustered
      index root page to the appropriate format, or upgrades from
      the MDEV-11369 instant ADD COLUMN to generic ALTER TABLE format.
      
      btr_cur_instant_init_low(): Read and validate the metadata BLOB page
      before reconstructing the dictionary information based on it.
      
      btr_cur_instant_init_metadata(): Do not read any lengths from the
      metadata record header before reading the BLOB. At this point, we
      would not actually know how many nullable fields the metadata record
      contains.
      
      btr_cur_instant_root_init(): Initialize n_core_null_bytes in one
      of two possible ways.
      
      btr_cur_trim(): Handle the mblob record.
      
      row_metadata_to_tuple(): Convert a metadata record to a data tuple,
      based on the new info_bits of the metadata record.
      
      btr_cur_pessimistic_update(): Invoke row_metadata_to_tuple() if needed.
      Invoke dtuple_convert_big_rec() for metadata records if the record is
      too large, or if the mblob is not yet marked as externally stored.
      
      btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete():
      When the last user record is deleted, do not delete the
      generic instant ALTER TABLE metadata record. Only delete
      MDEV-11369 instant ADD COLUMN metadata records.
      
      btr_cur_optimistic_insert(): Avoid unnecessary computation of rec_size.
      
      btr_pcur_store_position(): Allow a logically empty page to contain
      a metadata record for generic ALTER TABLE.
      
      REC_INFO_DEFAULT_ROW_ADD: Renamed from REC_INFO_DEFAULT_ROW.
      This is for the old instant ADD COLUMN (MDEV-11369) only.
      
      REC_INFO_DEFAULT_ROW_ALTER: The more generic metadata record,
      with additional information for dropped or reordered columns.
      
      rec_info_bits_valid(): Remove. The only case when this would fail
      is when the record is the generic ALTER TABLE metadata record.
      
      rec_is_alter_metadata(): Check if a record is the metadata record
      for instant ALTER TABLE (other than ADD COLUMN). NOTE: This function
      must not be invoked on node pointer records, because the delete-mark
      flag in those records may be set (it is garbage), and then a debug
      assertion could fail because index->is_instant() does not necessarily
      hold.
      
      rec_is_add_metadata(): Check if a record is MDEV-11369 ADD COLUMN metadata
      record (not more generic instant ALTER TABLE).
      
      rec_get_converted_size_comp_prefix_low(): Assume that the metadata
      field will be stored externally. In dtuple_convert_big_rec() during
      the rec_get_converted_size() call, it would not be there yet.
      
      rec_get_converted_size_comp(): Replace status,fields,n_fields with tuple.
      
      rec_init_offsets_comp_ordinary(), rec_get_converted_size_comp_prefix_low(),
      rec_convert_dtuple_to_rec_comp(): Add template<bool mblob = false>.
      With mblob=true, process a record with a metadata BLOB.
      
      rec_copy_prefix_to_buf(): Assert that no fields beyond the key and
      system columns are being copied. Exclude the metadata BLOB field.
      
      rec_convert_dtuple_to_metadata_comp(): Convert an alter metadata tuple
      into a record.
      
      row_upd_index_replace_metadata(): Apply an update vector to an
      alter_metadata tuple.
      
      row_log_allocate(): Replace dict_index_t::is_instant()
      with a more appropriate condition that ignores dict_table_t::instant.
      Only a table on which the MDEV-11369 ADD COLUMN was performed
      can "lose its instantness" when it becomes empty. After
      instant DROP COLUMN or reordering columns, we cannot simply
      convert the table to the canonical format, because the data
      dictionary cache and all possibly existing references to it
      from other client connection threads would have to be adjusted.
      
      row_quiesce_write_index_fields(): Do not crash when the table contains
      an instantly dropped column.
      
      Thanks to Thirunarayanan Balathandayuthapani for discussing the design
      and implementing an initial prototype of this.
      Thanks to Matthias Leich for testing.
      0e5a4ac2
  27. 19 Sep, 2018 1 commit
  28. 11 Apr, 2018 1 commit
    • MDEV-15832 With innodb_fast_shutdown=3, skip the rollback of connected transactions · dd127799
      Marko Mäkelä authored
      row_undo_step(): If innodb_fast_shutdown=3 has been requested,
      abort the rollback of any non-DDL transactions. Starting with
      MDEV-12323, we aborted the rollback of recovered transactions. The
      transactions would be rolled back on subsequent server startup.
      
      trx_roll_report_progress(): Renamed from trx_roll_must_shutdown(),
      now that the shutdown check has been moved to the only caller.
      
      trx_commit_low(): Allow mtr=NULL for transactions that are aborted
      on rollback.
      
      trx_rollback_finish(): Clean up aborted transactions to avoid
      assertion failures and memory leaks on shutdown. This code was
      previously in trx_rollback_active().
      
      trx_rollback_to_savepoint_low(), trx_rollback_for_mysql_low():
      Remove some redundant assertions.
      dd127799
  29. 10 Apr, 2018 1 commit
  30. 06 Apr, 2018 1 commit
    • MDEV-14705: Do not rollback on InnoDB shutdown · 76ec37f5
      Marko Mäkelä authored
      row_undo_step(): If fast shutdown has been requested, abort the
      rollback of any non-DDL transactions. Starting with MDEV-12323,
      we aborted the rollback of recovered transactions. These
      transactions would be rolled back on subsequent server startup.
      
      trx_roll_report_progress(): Renamed from trx_roll_must_shutdown(),
      now that the shutdown check has been moved to the only caller.
      76ec37f5
  31. 30 Jan, 2018 1 commit
    • MDEV-11415 Remove excessive undo logging during ALTER TABLE…ALGORITHM=COPY · 0ba6aaf0
      Marko Mäkelä authored
      If a crash occurs during ALTER TABLE…ALGORITHM=COPY, InnoDB would spend
      a lot of time rolling back writes to the intermediate copy of the table.
      To reduce the amount of busy work done, a work-around was introduced in
      commit fd069e2b in MySQL 4.1.8 and 5.0.2,
      to commit the transaction after every 10,000 inserted rows.
      
      A proper fix would have been to disable the undo logging altogether and
      to simply drop the intermediate copy of the table on subsequent server
      startup. This is what happens in MariaDB 10.3 with MDEV-14717, MDEV-14585.
      In MariaDB 10.2, the intermediate copy of the table would be left behind
      with a name starting with the string #sql.
      
      This is a backport of a bug fix from MySQL 8.0.0 to MariaDB,
      contributed by jixianliang <271365745@qq.com>.
      
      Unlike recent MySQL, MariaDB supports ALTER IGNORE. For that operation
      InnoDB must for now keep the undo logging enabled, so that the latest
      row can be rolled back in case of an error.
      
      In Galera cluster, the LOAD DATA statement will retain the existing
      behaviour and commit the transaction after every 10,000 rows if
      the parameter wsrep_load_data_splitting=ON is set. The logic to do
      so (the wsrep_load_data_split() function and the call
      handler::extra(HA_EXTRA_FAKE_START_STMT)) are joint work
      by Ji Xianliang and Marko Mäkelä.
      
      The original fix:
      
      Author: Thirunarayanan Balathandayuthapani <thirunarayanan.balathandayuth@oracle.com>
      Date:   Wed Dec 2 16:09:15 2015 +0530
      
      Bug#17479594 AVOID INTERMEDIATE COMMIT WHILE DOING ALTER TABLE ALGORITHM=COPY
      
      Problem:
      
      During ALTER TABLE, we commit and restart the transaction for every
      10,000 rows, so that the rollback after recovery would not take so long.
      
      Fix:
      
      Suppress the undo logging during the copy ALTER operation. If an FTS
      index is present, then insert directly into the FTS auxiliary table
      rather than doing so at commit time.
      
      ha_innobase::num_write_row: Remove the variable.
      
      ha_innobase::write_row(): Remove the hack for committing every 10000 rows.
      
      row_lock_table_for_mysql(): Remove the extra 2 parameters.
      
      lock_get_src_table(), lock_is_table_exclusive(): Remove.
      Reviewed-by: Marko Mäkelä <marko.makela@oracle.com>
      Reviewed-by: Shaohua Wang <shaohua.wang@oracle.com>
      Reviewed-by: Jon Olav Hauglid <jon.hauglid@oracle.com>
      0ba6aaf0
  32. 13 Dec, 2017 2 commits
    • MDEV-12323 Rollback progress log messages during crash recovery are intermixed with unrelated log messages · b1977a39
      Marko Mäkelä authored
      
      trx_roll_must_shutdown(): During the rollback of recovered transactions,
      report progress and check if the rollback should be interrupted because
      of a pending shutdown.
      
      trx_roll_max_undo_no, trx_roll_progress_printed_pct: Remove, along with
      the messages that were interleaved with other messages.
      b1977a39
    • MDEV-12352 InnoDB shutdown should not be blocked by a large transaction rollback · b46fa627
      Marko Mäkelä authored
      row_undo_step(), trx_rollback_active(): Abort the rollback of a
      recovered ordinary transaction if fast shutdown has been initiated.
      
      trx_rollback_resurrected(): Convert an aborted-rollback transaction
      into a fake XA PREPARE transaction, so that fast shutdown can proceed.
      b46fa627
  33. 06 Oct, 2017 1 commit
    • MDEV-11369 Instant ADD COLUMN for InnoDB · a4948daf
      Marko Mäkelä authored
      For InnoDB tables, adding, dropping and reordering columns has
      required a rebuild of the table and all its indexes. Since MySQL 5.6
      (and MariaDB 10.0) this has been supported online (LOCK=NONE), allowing
      concurrent modification of the tables.
      
      This work revises the InnoDB ROW_FORMAT=REDUNDANT, ROW_FORMAT=COMPACT
      and ROW_FORMAT=DYNAMIC so that columns can be appended instantaneously,
      with only minor changes performed to the table structure. The counter
      innodb_instant_alter_column in INFORMATION_SCHEMA.GLOBAL_STATUS
      is incremented whenever a table rebuild operation is converted into
      an instant ADD COLUMN operation.
      
      ROW_FORMAT=COMPRESSED tables will not support instant ADD COLUMN.
      
      Some usability limitations will be addressed in subsequent work:
      
      MDEV-13134 Introduce ALTER TABLE attributes ALGORITHM=NOCOPY
      and ALGORITHM=INSTANT
      MDEV-14016 Allow instant ADD COLUMN, ADD INDEX, LOCK=NONE
      
      The format of the clustered index (PRIMARY KEY) is changed as follows:
      
      (1) The FIL_PAGE_TYPE of the root page will be FIL_PAGE_TYPE_INSTANT,
      and a new field PAGE_INSTANT will contain the original number of fields
      in the clustered index ('core' fields).
      If instant ADD COLUMN has not been used or the table becomes empty,
      or the very first instant ADD COLUMN operation is rolled back,
      the fields PAGE_INSTANT and FIL_PAGE_TYPE will be reset
      to 0 and FIL_PAGE_INDEX.
      
      (2) A special 'default row' record is inserted into the leftmost leaf,
      between the page infimum and the first user record. This record is
      distinguished by the REC_INFO_MIN_REC_FLAG, and it is otherwise in the
      same format as records that contain values for the instantly added
      columns. This 'default row' always has the same number of fields as
      the clustered index according to the table definition. The values of
      'core' fields are to be ignored. For other fields, the 'default row'
      will contain the default values as they were during the ALTER TABLE
      statement. (If the column default values are changed later, those
      values will only be stored in the .frm file. The 'default row' will
      contain the original evaluated values, which must be the same for
      every row.) The 'default row' must be completely hidden from
      higher-level access routines. Assertions have been added to ensure
      that no 'default row' is ever present in the adaptive hash index
      or in locked records. The 'default row' is never delete-marked.
      
      (3) In clustered index leaf page records, the number of fields must
      reside between the number of 'core' fields (dict_index_t::n_core_fields
      introduced in this work) and dict_index_t::n_fields. If the number
      of fields is less than dict_index_t::n_fields, the missing fields
      are replaced with the column value of the 'default row'.
      Note: The number of fields in the record may shrink if some of the
      last instantly added columns are updated to the value that is
      in the 'default row'. The function btr_cur_trim() implements this
      'compression' on update and rollback; dtuple::trim() implements it
      on insert.
      
      (4) In ROW_FORMAT=COMPACT and ROW_FORMAT=DYNAMIC records, the new
      status value REC_STATUS_COLUMNS_ADDED will indicate the presence of
      a new record header that will encode n_fields-n_core_fields-1 in
      1 or 2 bytes. (In ROW_FORMAT=REDUNDANT records, the record header
      always explicitly encodes the number of fields.)
      
      We introduce the undo log record type TRX_UNDO_INSERT_DEFAULT for
      covering the insert of the 'default row' record when instant ADD COLUMN
      is used for the first time. Subsequent instant ADD COLUMN can use
      TRX_UNDO_UPD_EXIST_REC.
      
      This is joint work with Vin Chen (陈福荣) from Tencent. The design
      that was discussed in April 2017 would not have allowed import or
      export of data files, because instead of the 'default row' it would
      have introduced a data dictionary table. The test
      rpl.rpl_alter_instant is exactly as contributed in pull request #408.
      The test innodb.instant_alter is based on a contributed test.
      
      The redo log record format changes for ROW_FORMAT=DYNAMIC and
      ROW_FORMAT=COMPACT are as contributed. (With this change present,
      crash recovery from MariaDB 10.3.1 will fail in spectacular ways!)
      Also the semantics of higher-level redo log records that modify the
      PAGE_INSTANT field is changed. The redo log format version identifier
      was already changed to LOG_HEADER_FORMAT_CURRENT=103 in MariaDB 10.3.1.
      
      Everything else has been rewritten by me. Thanks to Elena Stepanova,
      the code has been tested extensively.
      
      When rolling back an instant ADD COLUMN operation, we must empty the
      PAGE_FREE list after deleting or shortening the 'default row' record,
      by calling either btr_page_empty() or btr_page_reorganize(). We must
      know the size of each entry in the PAGE_FREE list. If rollback left a
      freed copy of the 'default row' in the PAGE_FREE list, we would be
      unable to determine its size (if it is in ROW_FORMAT=COMPACT or
      ROW_FORMAT=DYNAMIC) because it would contain more fields than the
      rolled-back definition of the clustered index.
      
      UNIV_SQL_DEFAULT: A new special constant that designates an instantly
      added column that is not present in the clustered index record.
      
      len_is_stored(): Check if a length is an actual length. There are
      two magic length values: UNIV_SQL_DEFAULT, UNIV_SQL_NULL.
      
      dict_col_t::def_val: The 'default row' value of the column.  If the
      column is not added instantly, def_val.len will be UNIV_SQL_DEFAULT.
      
      dict_col_t: Add the accessors is_virtual(), is_nullable(), is_instant(),
      instant_value().
      
      dict_col_t::remove_instant(): Remove the 'instant ADD' status of
      a column.
      
      dict_col_t::name(const dict_table_t& table): Replaces
      dict_table_get_col_name().
      
      dict_index_t::n_core_fields: The original number of fields.
      For secondary indexes and if instant ADD COLUMN has not been used,
      this will be equal to dict_index_t::n_fields.
      
      dict_index_t::n_core_null_bytes: Number of bytes needed to
      represent the null flags; usually equal to UT_BITS_IN_BYTES(n_nullable).
      
      dict_index_t::NO_CORE_NULL_BYTES: Magic value signalling that
      n_core_null_bytes was not initialized yet from the clustered index
      root page.
      
      dict_index_t: Add the accessors is_instant(), is_clust(),
      get_n_nullable(), instant_field_value().
      
      dict_index_t::instant_add_field(): Adjust clustered index metadata
      for instant ADD COLUMN.
      
      dict_index_t::remove_instant(): Remove the 'instant ADD' status
      of a clustered index when the table becomes empty, or the very first
      instant ADD COLUMN operation is rolled back.
      
      dict_table_t: Add the accessors is_instant(), is_temporary(),
      supports_instant().
      
      dict_table_t::instant_add_column(): Adjust metadata for
      instant ADD COLUMN.
      
      dict_table_t::rollback_instant(): Adjust metadata on the rollback
      of instant ADD COLUMN.
      
      prepare_inplace_alter_table_dict(): First create the ctx->new_table,
      and only then decide if the table really needs to be rebuilt.
      We must split the creation of table or index metadata from the
      creation of the dictionary table records and the creation of
      the data. In this way, we can transform a table-rebuilding operation
      into an instant ADD COLUMN operation. Dictionary objects will only
      be added to cache when table rebuilding or index creation is needed.
      The ctx->instant_table will never be added to cache.
      
      dict_table_t::add_to_cache(): Modified and renamed from
      dict_table_add_to_cache(). Do not modify the table metadata.
      Let the callers invoke dict_table_add_system_columns() and if needed,
      set can_be_evicted.
      
      dict_create_sys_tables_tuple(), dict_create_table_step(): Omit the
      system columns (which will now exist in the dict_table_t object
      already at this point).
      
      dict_create_table_step(): Expect the callers to invoke
      dict_table_add_system_columns().
      
      pars_create_table(): Before creating the table creation execution
      graph, invoke dict_table_add_system_columns().
      
      row_create_table_for_mysql(): Expect all callers to invoke
      dict_table_add_system_columns().
      
      create_index_dict(): Replaces row_merge_create_index_graph().
      
      innodb_update_n_cols(): Renamed from innobase_update_n_virtual().
      Call my_error() if an error occurs.
      
      btr_cur_instant_init(), btr_cur_instant_init_low(),
      btr_cur_instant_root_init():
      Load additional metadata from the clustered index and set
      dict_index_t::n_core_null_bytes. This is invoked
      when table metadata is first loaded into the data dictionary.
      
      dict_boot(): Initialize n_core_null_bytes for the four hard-coded
      dictionary tables.
      
      dict_create_index_step(): Initialize n_core_null_bytes. This is
      executed as part of CREATE TABLE.
      
      dict_index_build_internal_clust(): Initialize n_core_null_bytes to
      NO_CORE_NULL_BYTES if table->supports_instant().
      
      row_create_index_for_mysql(): Initialize n_core_null_bytes for
      CREATE TEMPORARY TABLE.
      
      commit_cache_norebuild(): Call the code to rename or enlarge columns
      in the cache only if instant ADD COLUMN is not being used.
      (Instant ADD COLUMN would copy all column metadata from
      instant_table to old_table, including the names and lengths.)
      
      PAGE_INSTANT: A new 13-bit field for storing dict_index_t::n_core_fields.
      This is repurposing the 16-bit field PAGE_DIRECTION, of which only the
      least significant 3 bits were used. The original byte containing
      PAGE_DIRECTION will be accessible via the new constant PAGE_DIRECTION_B.
      
      page_get_instant(), page_set_instant(): Accessors for the PAGE_INSTANT.
      
      page_ptr_get_direction(), page_get_direction(),
      page_ptr_set_direction(): Accessors for PAGE_DIRECTION.
      
      page_direction_reset(): Reset PAGE_DIRECTION, PAGE_N_DIRECTION.
      
      page_direction_increment(): Increment PAGE_N_DIRECTION
      and set PAGE_DIRECTION.
      
      rec_get_offsets(): Use the 'leaf' parameter for non-debug purposes,
      and assume that heap_no is always set.
      Initialize all dict_index_t::n_fields for ROW_FORMAT=REDUNDANT records,
      even if the record contains fewer fields.
      
      rec_offs_make_valid(): Add the parameter 'leaf'.
      
      rec_copy_prefix_to_dtuple(): Assert that the tuple is only built
      on the core fields. Instant ADD COLUMN only applies to the
      clustered index, and we should never build a search key that has
      more than the PRIMARY KEY and possibly DB_TRX_ID,DB_ROLL_PTR.
      All these columns are always present.
      
      dict_index_build_data_tuple(): Remove assertions that would be
      duplicated in rec_copy_prefix_to_dtuple().
      
      rec_init_offsets(): Support ROW_FORMAT=REDUNDANT records whose
      number of fields is between n_core_fields and n_fields.
      
      cmp_rec_rec_with_match(): Implement the comparison between two
      MIN_REC_FLAG records.
      
      trx_t::in_rollback: Make the field available in non-debug builds.
      
      trx_start_for_ddl_low(): Remove dangerous error-tolerance.
      A dictionary transaction must be flagged as such before it has generated
      any undo log records. This is because trx_undo_assign_undo() will mark
      the transaction as a dictionary transaction in the undo log header
      right before the very first undo log record is being written.
      
      btr_index_rec_validate(): Account for instant ADD COLUMN
      
      row_undo_ins_remove_clust_rec(): On the rollback of an insert into
      SYS_COLUMNS, revert instant ADD COLUMN in the cache by removing the
      last column from the table and the clustered index.
      
      row_search_on_row_ref(), row_undo_mod_parse_undo_rec(), row_undo_mod(),
      trx_undo_update_rec_get_update(): Handle the 'default row'
      as a special case.
      
      dtuple_t::trim(index): Omit a redundant suffix of an index tuple right
      before insert or update. After instant ADD COLUMN, if the last fields
      of a clustered index tuple match the 'default row', there is no
      need to store them. While trimming the entry, we must hold a page latch,
      so that the table cannot be emptied and the 'default row' be deleted.
      
      btr_cur_optimistic_update(), btr_cur_pessimistic_update(),
      row_upd_clust_rec_by_insert(), row_ins_clust_index_entry_low():
      Invoke dtuple_t::trim() if needed.
      
      row_ins_clust_index_entry(): Restore dtuple_t::n_fields after calling
      row_ins_clust_index_entry_low().
      
      rec_get_converted_size(), rec_get_converted_size_comp(): Allow the number
      of fields to be between n_core_fields and n_fields. Do not support
      infimum,supremum. They are never supposed to be stored in dtuple_t,
      because page creation nowadays uses a lower-level method for initializing
      them.
      
      rec_convert_dtuple_to_rec_comp(): Assign the status bits based on the
      number of fields.
      
      btr_cur_trim(): In an update, trim the index entry as needed. For the
      'default row', handle rollback specially. For user records, omit
      fields that match the 'default row'.
      
      btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete():
      Skip locking and adaptive hash index for the 'default row'.
      
      row_log_table_apply_convert_mrec(): Replace 'default row' values if needed.
      In the temporary file that is applied by row_log_table_apply(),
      we must identify whether the records contain the extra header for
      instantly added columns. For now, we will allocate an additional byte
      for this for ROW_T_INSERT and ROW_T_UPDATE records when the source table
      has been subject to instant ADD COLUMN. The ROW_T_DELETE records are
      fine, as they will be converted and will only contain 'core' columns
      (PRIMARY KEY and some system columns) that are converted from dtuple_t.
      
      rec_get_converted_size_temp(), rec_init_offsets_temp(),
      rec_convert_dtuple_to_temp(): Add the parameter 'status'.
      
      REC_INFO_DEFAULT_ROW = REC_INFO_MIN_REC_FLAG | REC_STATUS_COLUMNS_ADDED:
      An info_bits constant for distinguishing the 'default row' record.
      
      rec_comp_status_t: An enum of the status bit values.
      
      rec_leaf_format: An enum that replaces the bool parameter of
      rec_init_offsets_comp_ordinary().
      a4948daf
  34. 03 Oct, 2017 1 commit
    • Remove dict_disable_redo_if_temporary() · 770231f3
      Marko Mäkelä authored
      The function dict_disable_redo_if_temporary() was supposed to
      disable redo logging for temporary tables. It was invoked
      unnecessarily for two read-only operations:
      row_undo_search_clust_to_pcur() and
      dict_stats_update_transient_for_index().
      
      When a table is not temporary and not in the system tablespace,
      the tablespace should be flagged for MLOG_FILE_NAME logging.
      We do not need this overhead for temporary tables. Therefore,
      either mtr_t::set_log_mode() or mtr_t::set_named_space() should
      be invoked.
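      The rule above can be sketched as follows. mtr_t, MTR_LOG_NO_REDO,
      set_log_mode() and set_named_space() are real InnoDB names, but this
      simplified mtr_t is a stand-in written for illustration only (the real
      one lives in mtr0mtr.h).

```cpp
#include <cassert>

// Simplified stand-ins for illustration; not the real InnoDB types.
enum log_mode_t { MTR_LOG_ALL, MTR_LOG_NO_REDO };

struct mtr_t {
  log_mode_t mode = MTR_LOG_ALL;
  bool named_space = false;            // whether MLOG_FILE_NAME logging
                                       // was flagged for the tablespace
  void set_log_mode(log_mode_t m) { mode = m; }
  void set_named_space() { named_space = true; }
};

// The rule from the commit message: temporary tables skip redo logging
// entirely; persistent tables outside the system tablespace must flag
// the tablespace for MLOG_FILE_NAME logging.
inline void mtr_start_for(mtr_t& mtr, bool is_temporary, bool is_system)
{
  if (is_temporary)
    mtr.set_log_mode(MTR_LOG_NO_REDO);
  else if (!is_system)
    mtr.set_named_space();
}
```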
      
      dict_table_t::is_temporary(): Determine if a table is temporary.
      
      dict_table_is_temporary(): Redefined as a macro wrapper for
      dict_table_t::is_temporary().
      
      dict_disable_redo_if_temporary(): Remove.
      770231f3
  35. 20 Sep, 2017 1 commit
    • Marko Mäkelä's avatar
      Add the parameter bool leaf to rec_get_offsets() · 48192f96
      Marko Mäkelä authored
      This should affect debug builds only. Debug builds will check that
      the status bits of ROW_FORMAT!=REDUNDANT records match the is_leaf
      parameter.
      
The only observable change to non-debug builds should be the addition
of the is_leaf parameter to the function rec_copy_prefix_to_dtuple(),
      and the removal of some calls to update the adaptive hash index
      (it is only built for the leaf pages).
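      The debug check can be sketched like this. It is a simplified
      illustration (only the ordinary and node-pointer statuses are shown;
      infimum and supremum records are omitted), using hypothetical status
      values rather than the real rem0rec.h constants.

```cpp
#include <cassert>

// Hypothetical subset of the record status bits, for illustration.
enum rec_status { REC_STATUS_ORDINARY = 0, REC_STATUS_NODE_PTR = 1 };

// The debug-only cross-check: a leaf page must contain ordinary
// records, and a non-leaf page must contain node-pointer records.
inline bool status_matches_leaf(rec_status s, bool is_leaf)
{
  return is_leaf ? s == REC_STATUS_ORDINARY : s == REC_STATUS_NODE_PTR;
}
```

      In ROW_FORMAT=REDUNDANT there are no status bits to check, which is
      why the assertion only applies to the other row formats.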
      
      This change should have been made in MySQL 5.0.3, instead of
      introducing the status flags in the ROW_FORMAT=COMPACT record header.
      48192f96
  36. 02 Jun, 2017 1 commit
    • Marko Mäkelä's avatar
      Remove deprecated InnoDB file format parameters · 0c92794d
      Marko Mäkelä authored
      The following options will be removed:
      
      innodb_file_format
      innodb_file_format_check
      innodb_file_format_max
      innodb_large_prefix
      
      They have been deprecated in MySQL 5.7.7 (and MariaDB 10.2.2) in WL#7703.
      
      The file_format column in two INFORMATION_SCHEMA tables will be removed:
      
      innodb_sys_tablespaces
      innodb_sys_tables
      
      Code to update the file format tag at the end of page 0:5
      (TRX_SYS_PAGE in the InnoDB system tablespace) will be removed.
      When initializing a new database, the bytes will remain 0.
      
      All references to the Barracuda file format will be removed.
      Some references to the Antelope file format (meaning
      ROW_FORMAT=REDUNDANT or ROW_FORMAT=COMPACT) will remain.
      
      This basically ports WL#7704 from MySQL 8.0.0 to MariaDB 10.3.1:
      
      commit 4a69dc2a95995501ed92d59a1de74414a38540c6
      Author: Marko Mäkelä <marko.makela@oracle.com>
      Date:   Wed Mar 11 22:19:49 2015 +0200
      0c92794d
  37. 17 Mar, 2017 1 commit
    • Marko Mäkelä's avatar
      MDEV-12271 Port MySQL 8.0 Bug#23150562 REMOVE UNIV_MUST_NOT_INLINE AND UNIV_NONINL · 4e1116b2
      Marko Mäkelä authored
      Also, remove empty .ic files that were not removed by my MySQL commit.
      
      Problem:
InnoDB used to support a compilation mode that allowed choosing
whether the function definitions in .ic files are inlined or not.
      This stopped making sense when InnoDB moved to C++ in MySQL 5.6
      (and ha_innodb.cc started to #include .ic files), and more so in
      MySQL 5.7 when inline methods and functions were introduced
      in .h files.
      
      Solution:
      Remove all references to UNIV_NONINL and UNIV_MUST_NOT_INLINE from
      all files, assuming that the symbols are never defined.
      Remove the files fut0fut.cc and ut0byte.cc which only mattered when
      UNIV_NONINL was defined.
      4e1116b2
  38. 13 Mar, 2017 1 commit
    • Marko Mäkelä's avatar
      MDEV-12219 Discard temporary undo logs at transaction commit · 13e5c9de
      Marko Mäkelä authored
      Starting with MySQL 5.7, temporary tables in InnoDB are handled
      differently from persistent tables. Because temporary tables are
      private to a connection, concurrency control and multi-versioning
      (MVCC) are not applicable. For performance reasons, purge is
      disabled as well. Rollback is supported for temporary tables;
      that is why we have the temporary undo logs in the first place.
      
      Because MVCC and purge are disabled for temporary tables, we should
      discard all temporary undo logs already at transaction commit,
      just like we discard the persistent insert_undo logs. Before this
      change, update_undo logs were being preserved.
      
      trx_temp_undo_t: A wrapper for temporary undo logs, comprising
      a rollback segment and a single temporary undo log.
      
      trx_rsegs_t::m_noredo: Use trx_temp_undo_t.
(Instead of separate insert_undo and update_undo logs, there will be
a single undo log.)
      
      trx_is_noredo_rseg_updated(), trx_is_rseg_assigned(): Remove.
      
      trx_undo_add_page(): Remove the parameter undo_ptr.
      Acquire and release the rollback segment mutex inside the function.
      
      trx_undo_free_last_page(): Remove the parameter trx.
      
      trx_undo_truncate_end(): Remove the parameter trx, and add the
      parameter is_temp. Clean up the code a bit.
      
      trx_undo_assign_undo(): Split the parameter undo_ptr into rseg, undo.
      
      trx_undo_commit_cleanup(): Renamed from trx_undo_insert_cleanup().
      Replace the parameter undo_ptr with undo.
      This will discard the temporary undo or insert_undo log at
      commit/rollback.
      
      trx_purge_add_update_undo_to_history(), trx_undo_update_cleanup():
      Remove 3 parameters. Always operate on the persistent update_undo.
      
      trx_serialise(): Renamed from trx_serialisation_number_get().
      
      trx_write_serialisation_history(): Simplify the code flow.
      If there are no persistent changes, do not update MONITOR_TRX_COMMIT_UNDO.
      
      trx_commit_in_memory(): Simplify the logic, and add assertions.
      
      trx_undo_page_report_modify(): Keep a direct reference to the
      persistent update_undo log.
      
      trx_undo_report_row_operation(): Simplify some code.
      Always assign TRX_UNDO_INSERT for temporary undo logs.
      
      trx_prepare_low(): Keep only one parameter. Prepare all 3 undo logs.
      
      trx_roll_try_truncate(): Remove the parameter undo_ptr.
      Try to truncate all 3 undo logs of the transaction.
      
      trx_roll_pop_top_rec_of_trx_low(): Remove.
      
      trx_roll_pop_top_rec_of_trx(): Remove the redundant parameter
      trx->roll_limit. Clear roll_limit when exhausting the undo logs.
      Consider all 3 undo logs at once, prioritizing the persistent
      undo logs.
      
      row_undo(): Minor cleanup. Let trx_roll_pop_top_rec_of_trx()
      reset the trx->roll_limit.
      13e5c9de
  39. 28 Dec, 2016 1 commit
    • Marko Mäkelä's avatar
      MDEV-9282 Debian: the Lintian complains about "shlib-calls-exit" in ha_innodb.so · d50cf42b
      Marko Mäkelä authored
      Replace all exit() calls in InnoDB with abort() [possibly via ut_a()].
Calling exit() in a multi-threaded program is also problematic because
other threads could observe corrupted data structures while those
structures are being cleaned up by atexit() handlers or similar.
      
      In the long term, all these calls should be replaced with something
      that returns an error all the way up the call stack.
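      The long-term approach mentioned above can be sketched as follows.
      This is a hypothetical illustration: DB_SUCCESS and DB_CORRUPTION
      echo real InnoDB dberr_t names, but read_page() and apply_redo()
      are made-up stand-ins showing error propagation instead of exit().

```cpp
#include <cassert>

// Simplified error codes; illustrative stand-ins for InnoDB's dberr_t.
enum db_err { DB_SUCCESS = 0, DB_CORRUPTION = 1 };

static db_err read_page(bool corrupted)
{
  // Before: if (corrupted) exit(1);
  // Unsafe in a multi-threaded server, because atexit() handlers may
  // tear down data structures while other threads still use them.
  return corrupted ? DB_CORRUPTION : DB_SUCCESS;
}

static db_err apply_redo(bool corrupted)
{
  db_err err = read_page(corrupted);
  if (err != DB_SUCCESS)
    return err;  // propagate the error up; let the caller decide
  return DB_SUCCESS;
}
```

      The top-level caller can then report the error and shut down in a
      controlled fashion, rather than terminating from deep inside the
      storage engine.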
      d50cf42b