    MDEV-25292 Atomic CREATE OR REPLACE TABLE · 93c8252f
    Aleksey Midenkov authored
    Atomic CREATE OR REPLACE keeps the old table intact if the command
    fails or the server crashes. This is done by creating a table with
    a temporary name and filling it with data (for CREATE OR REPLACE ..
    SELECT), then renaming the original table to another temporary
    (backup) name and renaming the replacement table to the original
    name. The backup table is kept until the last point at which the
    command can fail; if it does fail, the replacement table is dropped
    and the backup is restored. When the command is complete and logged,
    the backup table is deleted.
    
    Atomic replace algorithm
    
      Two DDL chains are used for CREATE OR REPLACE:
      ddl_log_state_create (C) and ddl_log_state_rm (D).
    
      1. (C) Log CREATE_TABLE_ACTION of TMP table (drops TMP table);
      2. Create new table as TMP;
      3. Do everything with TMP (like insert data);
    
      finalize_atomic_replace():
      4. Link chains: (D) is executed only if (C) is closed;
      5. (D) Log DROP_ACTION of BACKUP;
      6. (C) Log RENAME_TABLE_ACTION from ORIG to BACKUP (replays BACKUP -> ORIG);
      7. Rename ORIG to BACKUP;
      8. (C) Log CREATE_TABLE_ACTION of ORIG (drops ORIG);
      9. Rename TMP to ORIG;
    
      finalize_ddl() in case of success:
      10. Close (C);
      11. Replay (D): BACKUP is dropped.
    
      finalize_ddl() in case of error:
      10. Close (D);
      11. Replay (C):
        1) ORIG is dropped (only after finalize_atomic_replace());
        2) BACKUP renamed to ORIG (only after finalize_atomic_replace());
        3) drop TMP.
    
      If a crash happens, (C) or (D) is replayed in reverse order. (C) is
      replayed if the crash happens before it is closed; otherwise (D) is
      replayed.
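
      The steps above can be sketched as a toy model. This is not MariaDB
      code: tables are entries in a std::map, a chain is a list of
      closures, and the names Tables, Chain, rename_table and
      create_or_replace are all assumptions made for illustration. Step 4
      (linking) is modelled by replaying exactly one of the two chains.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy model only: table name -> contents. None of these types exist in MariaDB.
using Tables = std::map<std::string, std::string>;
using Action = std::function<void(Tables&)>;

struct Chain {
  std::vector<Action> actions;
  void log(Action a) { actions.push_back(std::move(a)); }
  // Replay in reverse order: the newest logged action runs first.
  void replay(Tables& t) const {
    for (auto it = actions.rbegin(); it != actions.rend(); ++it) (*it)(t);
  }
};

static void rename_table(Tables& t, const std::string& from,
                         const std::string& to) {
  auto it = t.find(from);
  if (it == t.end()) return;  // nothing to rename
  std::string contents = it->second;
  t.erase(it);
  t[to] = contents;
}

// fail_after_step: 0 means success; 3 or 9 injects an error after that step.
Tables create_or_replace(Tables t, int fail_after_step) {
  Chain c, d;  // (C) create chain, (D) rm chain
  bool ok = false;
  do {
    c.log([](Tables& x) { x.erase("TMP"); });                     // step 1
    t["TMP"] = "new";                                             // step 2
    if (fail_after_step == 3) break;                              // step 3
    d.log([](Tables& x) { x.erase("BACKUP"); });                  // step 5
    c.log([](Tables& x) { rename_table(x, "BACKUP", "ORIG"); });  // step 6
    rename_table(t, "ORIG", "BACKUP");                            // step 7
    c.log([](Tables& x) { x.erase("ORIG"); });                    // step 8
    rename_table(t, "TMP", "ORIG");                               // step 9
    if (fail_after_step == 9) break;
    ok = true;
  } while (false);
  if (ok) d.replay(t);  // success: (C) is closed, (D) drops BACKUP
  else    c.replay(t);  // error: (D) is closed, (C) restores the old state
  return t;
}

void demo_atomic_replace() {
  const Tables start{{"ORIG", "old"}};
  const Tables replaced{{"ORIG", "new"}};
  assert(create_or_replace(start, 0) == replaced);  // new table installed
  assert(create_or_replace(start, 9) == start);     // old table recovered
}
```

      In this model a crash before (C) is closed behaves like the error
      path, because recovery replays (C) in the same reverse order.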
    
    Temporary table for CREATE OR REPLACE
    
      Before dropping "old" table, CREATE OR REPLACE creates "tmp" table.
      ddl_log_state_create holds the drop of the "tmp" table.  When
      everything is OK (data is inserted, "tmp" is ready) ddl_log_state_rm
      is written to replace "old" with "tmp". Until ddl_log_state_create
      is closed ddl_log_state_rm is not executed.
    
      After the binlogging is done ddl_log_state_create is closed. At that
      point ddl_log_state_rm is executed and "old" is replaced with
      "tmp". That is: the final rename is done by the DDL log.
    
      Given this important role of the DDL log in the CREATE OR REPLACE
      operation, replay of ddl_log_state_rm must fail at the first error
      and print the error message if possible. E.g. a foreign key error
      is discovered at this phase: InnoDB refuses to drop the "old" table
      and returns the corresponding foreign key error code.
    
    Additional notes
    
      - CREATE TABLE without REPLACE is not affected by this commit.
    
      - Engines having HTON_EXPENSIVE_RENAME flag set are not affected by
        this commit.
    
      - CREATE TABLE .. SELECT XID usage is fixed and now there is no need
        to log DROP TABLE via DDL_CREATE_TABLE_PHASE_LOG (see comments in
        do_postlock()). XID is now correctly updated so it disables
        DDL_LOG_DROP_TABLE_ACTION. Note that binary log is flushed at the
        final stage when the table is ready. So if we have XID in the
        binary log we don't need to drop the table.
    
      - Three variations of CREATE OR REPLACE handled:
    
        1. CREATE OR REPLACE TABLE t1 (..);
        2. CREATE OR REPLACE TABLE t1 LIKE t2;
        3. CREATE OR REPLACE TABLE t1 SELECT ..;
    
      - Test case uses 6 combinations for engines (aria, aria_notrans,
        myisam, ib, lock_tables, expensive_rename) and 2 combinations for
        binlog types (row, stmt). Combinations help to check differences
        between the results. Error failures are tested for the above three
        variations.
    
      - expensive_rename tests CREATE OR REPLACE without atomic
        replace. The effect should be the same as with the old behaviour
        before this commit.
    
      - Triggers mechanism is unaffected by this change. This is tested in
        create_replace.test.
    
      - LOCK TABLES is affected. Lock restoration must be done after "rm"
        chain is replayed.
    
      - Moved ddl_log_complete() from send_eof() to finalize_ddl(). This
        checkpoint was not executed before for normal CREATE TABLE but is
        executed now.
    
      - CREATE TABLE will now also roll back if writing to the binary
        log fails. See rpl_gtid_strict.test.
    
    Rename and drop via DDL log
    
      We replay ddl_log_state_rm to drop the old table and rename the
      temporary table. In that case we must throw the correct error
      message if ddl_log_revert() fails (e.g. on an FK error).

      If the table were deleted earlier, and not via the DDL log, and a
      crash then happened, the create chain would not be closed. The
      linked drop chain would not be executed and the new table would not
      be installed, but the old table would already be deleted.
    
    ddl_log.cc changes
    
      Now we can place action before DDL_LOG_DROP_INIT_ACTION and it will
      be replayed after DDL_LOG_DROP_TABLE_ACTION.
    
      The report_error parameter of ddl_log_revert() makes it fail at the
      first error and print the error message if possible.
      ddl_log_execute_action() can now print an error message.

      Since we can now handle errors from ddl_log_execute_action() (in
      case of non-recovery execution), unconditionally setting
      "error= TRUE" is wrong (it was wrong anyway because it was
      overwritten at the end of the function).
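
      The fail-at-first-error behaviour can be illustrated with a small
      stand-alone sketch. It does not reproduce ddl_log_revert(): the
      types Action and RevertResult and the function revert_chain are
      hypothetical, and the assumption that a revert without report_error
      carries on past a failing action is ours, not stated by the patch.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for one logged action; run() returns false on failure.
struct Action {
  std::string name;
  std::function<bool()> run;
};

struct RevertResult {
  bool error;                         // true if any executed action failed
  std::vector<std::string> executed;  // actions attempted, in replay order
  std::string message;                // set only when report_error is true
};

// Replays the chain newest-first. With report_error the replay stops at the
// first failure and records a message; without it (assumed behaviour) the
// replay continues and only the error flag is kept.
RevertResult revert_chain(const std::vector<Action>& chain, bool report_error) {
  RevertResult r{false, {}, ""};
  for (auto it = chain.rbegin(); it != chain.rend(); ++it) {
    r.executed.push_back(it->name);
    if (!it->run()) {
      r.error = true;
      if (report_error) {
        r.message = "failed: " + it->name;
        break;
      }
    }
  }
  return r;
}

void demo_revert() {
  std::vector<Action> chain{
      {"rename tmp", [] { return true; }},
      {"drop old", [] { return false; }},  // e.g. rejected by an FK check
  };
  RevertResult strict = revert_chain(chain, true);
  assert(strict.error && strict.executed.size() == 1);
  assert(strict.message == "failed: drop old");
}
```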
    
    On XID usage
    
      Like with all other atomic DDL operations XID is used to avoid
      inconsistency between master and slave in the case of a crash after
      binary log is written and before ddl_log_state_create is closed. On
      recovery XIDs are taken from binary log and corresponding DDL log
      events get disabled.  That is done by
      ddl_log_close_binlogged_events().
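
      As a rough illustration of that recovery step (not the actual
      ddl_log_close_binlogged_events() code), the matching can be thought
      of as below. PendingChain and close_binlogged are hypothetical
      names, and treating xid 0 as "no XID" is an assumption of this
      sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical view of a DDL log chain that is still open at recovery time.
struct PendingChain {
  std::uint64_t xid;  // 0 is assumed to mean "no XID recorded"
  bool active;        // an active chain would be replayed by recovery
};

// Disables every active chain whose XID was found in the binary log: the
// statement was already binlogged, so reverting it would make master and
// slave diverge. Returns the number of chains disabled.
std::size_t close_binlogged(std::vector<PendingChain>& chains,
                            const std::set<std::uint64_t>& binlog_xids) {
  std::size_t closed = 0;
  for (PendingChain& c : chains) {
    if (c.active && c.xid != 0 && binlog_xids.count(c.xid) != 0) {
      c.active = false;
      ++closed;
    }
  }
  return closed;
}

void demo_close_binlogged() {
  std::vector<PendingChain> chains{{101, true}, {102, true}};
  const std::set<std::uint64_t> binlog_xids{101};
  assert(close_binlogged(chains, binlog_xids) == 1);
  assert(!chains[0].active && chains[1].active);
}
```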
    
    On linking two chains together
    
      Chains are executed in ascending order of the entry_pos of their
      execute entries. But the entry_pos assignment order is undefined:
      the first chain may get a bigger number and the second chain a
      smaller one. In that case the execution order is reversed: the
      second chain is executed first.

      To avoid that we link one chain to another. While the base chain
      (ddl_log_state_create) is active the secondary chain
      (ddl_log_state_rm) is not executed. That is: only one of two
      linked chains can be executed.
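
      The ordering problem and the effect of linking can be shown with a
      small model. ExecuteEntry and recovery_order below are hypothetical
      and only mimic the rule described here; they are not the
      ddl_log_link_chains() interface.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model of an execute entry heading one chain.
struct ExecuteEntry {
  std::string name;
  unsigned entry_pos;        // slot in the log; assignment order is undefined
  bool active;               // false once the chain is closed
  const ExecuteEntry* base;  // linked base chain, or nullptr if not linked
};

// Names of the chains recovery would execute, in execution order: ascending
// entry_pos, skipping closed chains and chains whose base is still active.
std::vector<std::string> recovery_order(std::vector<const ExecuteEntry*> entries) {
  std::sort(entries.begin(), entries.end(),
            [](const ExecuteEntry* a, const ExecuteEntry* b) {
              return a->entry_pos < b->entry_pos;
            });
  std::vector<std::string> order;
  for (const ExecuteEntry* e : entries) {
    if (!e->active) continue;
    if (e->base != nullptr && e->base->active) continue;
    order.push_back(e->name);
  }
  return order;
}

void demo_linked_chains() {
  ExecuteEntry create{"create", 7, true, nullptr};
  ExecuteEntry rm{"rm", 3, true, &create};  // smaller entry_pos, but linked
  assert(recovery_order({&create, &rm}) ==
         std::vector<std::string>{"create"});
  create.active = false;                    // base chain closed
  assert(recovery_order({&create, &rm}) == std::vector<std::string>{"rm"});
}
```

      Without the link (base set to nullptr) the same entries would run
      "rm" before "create", which is the reversed order described above.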
    
      The interface ddl_log_link_chains() was done in "MDEV-22166
      ddl_log_write_execute_entry() extension".
    
    More on CREATE OR REPLACE .. SELECT
    
      We use create_and_open_tmp_table() like in ALTER TABLE to create
      temporary TABLE object (tmp_table is (NON_)TRANSACTIONAL_TMP_TABLE).
    
      After we created such TABLE object we use create_info->tmp_table()
      instead of table->s->tmp_table when we need to check for
      parser-requested tmp-table.
    
      External locking is required for the temporary table created by
      create_and_open_tmp_table(). E.g. it disables logging for Aria
      transactional tables; without it (when no mysql_lock_tables() is
      done) the table cannot work correctly.
    
      To take the external lock the patch requires the Aria table to work
      in non-transactional mode. That is usually done by
      ha_enable_transaction(false). But we cannot disable transactions
      completely because:

      1. binlog rollback removes pending row events
         (binlog_remove_pending_rows_event()). The row events are added
         during the CREATE .. SELECT data insertion phase;
      2. the replication slave highly depends on transactions and cannot
         work without them.
    
      So we put temporary Aria table into non-transactional mode with
      "thd->transaction->on hack". See comment for on_save variable.
    
      Note that Aria table has internal_table mode. But we cannot use it
      because:
    
      if (!internal_table)
      {
        mysql_mutex_lock(&THR_LOCK_myisam);
        old_info= test_if_reopen(name_buff);
      }
    
      For internal_table test_if_reopen() is not called and we get a new
      MARIA_SHARE for each file handler. In that case duplicate errors are
      missed because insert and lookup in CREATE .. SELECT is done via two
      different handlers (see create_lookup_handler()).
    
      For temporary table before dropping TABLE_SHARE by
      drop_temporary_table() we must do ha_reset(). ha_reset() releases
      storage share. Without that the share is kept and the second CREATE
      OR REPLACE .. SELECT fails with:
    
        HA_ERR_TABLE_EXIST (156): MyISAM table '#sql-create-b5377-4-t2' is
        in use (most likely by a MERGE table). Try FLUSH TABLES.
    
        HA_EXTRA_PREPARE_FOR_DROP also removes MYISAM_SHARE, but that is
        not needed as ha_reset() does the job.
    
      ha_reset() is usually done by
      mark_tmp_table_as_free_for_reuse(). But we don't need that mechanism
      for our temporary table.
    
    Atomic_info in HA_CREATE_INFO
    
      Many functions in CREATE TABLE pass the same parameters. These
      parameters are part of table creation info and should be in
      HA_CREATE_INFO (or whatever). Passing parameters via single
      structure is much easier for adding new data and
      refactoring.
    
    InnoDB changes (revised by Marko Mäkelä)
    
      row_rename_table_for_mysql(): Specify the treatment of FOREIGN KEY
      constraints in a 4-valued enum parameter. In cases where FOREIGN KEY
      constraints cannot exist (partitioned tables, or internal tables of
      FULLTEXT INDEX), we can use the mode RENAME_IGNORE_FK.
      The mode RENAME_REBUILD is for any DDL operation that rebuilds the
      table inside InnoDB, such as TRUNCATE and native ALTER TABLE
      (or OPTIMIZE TABLE). The mode RENAME_ALTER_COPY is used solely
      during non-native ALTER TABLE in ha_innobase::rename_table().
      Normal ha_innobase::rename_table() will use the mode RENAME_FK.
    
      CREATE OR REPLACE will rename the old table (if one exists) along
      with its FOREIGN KEY constraints into a temporary name. The replacement
      table will be initially created with another temporary name.
      Unlike in ALTER TABLE, all FOREIGN KEY constraints must be renamed
      and not inherited as part of these operations, using the mode RENAME_FK.
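
      For orientation, the four modes and the callers named above can be
      laid out as a lookup. The mode names come from this commit message;
      the enum type names, the Caller classification and mode_for() are
      hypothetical and do not mirror the actual InnoDB declarations.

```cpp
#include <cassert>

// Mode names as given in the text; the type name RenameMode is an assumption.
enum RenameMode {
  RENAME_IGNORE_FK,   // FOREIGN KEY constraints cannot exist
  RENAME_REBUILD,     // table is rebuilt inside InnoDB
  RENAME_ALTER_COPY,  // non-native ALTER TABLE
  RENAME_FK           // constraints are renamed along with the table
};

// Hypothetical classification of the callers described in the text.
enum class Caller {
  PartitionOrFulltextInternal,  // partitioned tables, FULLTEXT internal tables
  RebuildInsideInnoDB,          // TRUNCATE, native ALTER or OPTIMIZE TABLE
  NonNativeAlterCopy,           // copying ALTER via ha_innobase::rename_table()
  NormalRename,                 // normal ha_innobase::rename_table()
  CreateOrReplace               // renames of the old table for CREATE OR REPLACE
};

RenameMode mode_for(Caller who) {
  switch (who) {
    case Caller::PartitionOrFulltextInternal: return RENAME_IGNORE_FK;
    case Caller::RebuildInsideInnoDB:         return RENAME_REBUILD;
    case Caller::NonNativeAlterCopy:          return RENAME_ALTER_COPY;
    case Caller::NormalRename:                return RENAME_FK;
    case Caller::CreateOrReplace:             return RENAME_FK;
  }
  return RENAME_FK;  // unreachable for valid input; keeps compilers quiet
}

void demo_rename_mode() {
  // CREATE OR REPLACE renames constraints instead of inheriting them.
  assert(mode_for(Caller::CreateOrReplace) == RENAME_FK);
  assert(mode_for(Caller::NonNativeAlterCopy) == RENAME_ALTER_COPY);
}
```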
    
      dict_get_referenced_table(): Let the callers convert names when needed.
    
      create_table_info_t::create_foreign_keys(): CREATE OR REPLACE creates
      the replacement table with a temporary name, so for
      self-references foreign->referenced_table will be a table with
      a temporary name and charset conversion must be skipped for it.
    
    Reviewed by:
    
      Michael Widenius <monty@mariadb.org>