1. 23 Apr, 2015 1 commit
    • MDEV-8031: Parallel replication stops on "connection killed" error (probably incorrectly handled deadlock kill) · b616991a
      Kristian Nielsen authored
      
      There was a rare race, where a deadlock error might not be correctly
      handled, causing the slave to stop with something like this in the error
      log:
      
      150423 14:04:10 [ERROR] Slave SQL: Connection was killed, Gtid 0-1-2, Internal MariaDB error code: 1927
      150423 14:04:10 [Warning] Slave: Connection was killed Error_code: 1927
      150423 14:04:10 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213
      150423 14:04:10 [Warning] Slave: Connection was killed Error_code: 1927
      150423 14:04:10 [Warning] Slave: Connection was killed Error_code: 1927
      150423 14:04:10 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master-bin.000001 position 1234
      
      The problem was incorrect error handling. When a deadlock is detected, it
      causes a KILL CONNECTION on the offending thread. This error is then later
      converted to a deadlock error, and the transaction is retried.
      
      However, the deadlock error was not cleared at the start of the retry, nor
      was the lingering kill signal. So it was possible to get another deadlock
      kill early during retry. If this happened with particular thread
      scheduling/timing, it was possible that the new KILL CONNECTION error was
      masked by the earlier deadlock error, so that the second kill was not
      properly converted into a deadlock error and retry.
      
      This patch adds code that clears the old error and killed flag before
      starting the retry. It also adds code to handle a deadlock kill caught in a
      couple of places where it was not handled before.
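The clean-slate retry described for MDEV-8031 can be modelled in a few lines. This is a minimal Python sketch, not the server's C++ code: `WorkerState`, `run_with_retries` and the `attempt` callback are hypothetical names, every kill is assumed to be a deadlock kill, and the retry limit of 10 is only an illustrative default. The error codes 1213 and 1927 are the ones shown in the error log of the commit message.

```python
from dataclasses import dataclass
from typing import Callable, Optional

ER_LOCK_DEADLOCK = 1213       # "Deadlock found when trying to get lock"
ER_CONNECTION_KILLED = 1927   # "Connection was killed"


@dataclass
class WorkerState:
    """Hypothetical per-worker state: last error code and pending kill flag."""
    error: Optional[int] = None
    killed: bool = False


def run_with_retries(attempt: Callable[[WorkerState], None],
                     state: WorkerState,
                     max_retries: int = 10) -> bool:
    """Run `attempt` until it succeeds, fails permanently, or retries run out."""
    for _ in range(max_retries + 1):
        # Clear the previous attempt's error and kill flag so that stale state
        # cannot mask whatever happens during this attempt.
        state.error = None
        state.killed = False

        attempt(state)

        if state.killed:
            # In this model every kill is a deadlock kill, so it is converted
            # into a deadlock error, which is retryable.
            state.error = ER_LOCK_DEADLOCK
        if state.error is None:
            return True      # transaction committed
        if state.error != ER_LOCK_DEADLOCK:
            return False     # any other error is not retried
    return False             # retried in vain
```

The point of the sketch is only the ordering: the reset happens before each attempt, so a second deadlock kill during the retry is seen as a fresh kill rather than being hidden behind the earlier error.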
  2. 21 Apr, 2015 1 commit
  3. 20 Apr, 2015 1 commit
  4. 19 Apr, 2015 1 commit
  5. 14 Apr, 2015 2 commits
    • Merge MDEV-7975 into 10.0 · a8523559
      Kristian Nielsen authored
    • MDEV-7975: sporadic failure in test case rpl.rpl_gtid_startpos · 5d2b85a2
      Kristian Nielsen authored
      Add some suppressions that were missing. They cover the case where a
      STOP SLAVE is executed early during IO thread startup, while it is
      negotiating with the master. The master connection may then be killed in
      the middle of a mysql_real_query(); the resulting network error is not a
      test failure.
      
      This also caught one real code error, fixed with this commit: The I/O thread
      would fail to automatically reconnect if a network error happened while
      fetching the value of @@GLOBAL.gtid_domain_id.
  6. 13 Apr, 2015 3 commits
    • Merge MDEV-7936 into 10.0. · 17aff4b1
      Kristian Nielsen authored
      Conflicts:
      	sql/sql_base.cc
    • MDEV-7936: Assertion `!table || table->in_use == _current_thd()' failed on parallel replication in optimistic mode · 60d094ae
      Kristian Nielsen authored
      
      Make sure that in parallel replication, we execute wait_for_prior_commit()
      before setting table->in_use for a temporary table. Otherwise we can end up
      with two parallel replication worker threads competing with each other for
      use of a temporary table.
      
      Re-factor the use of find_temporary_table() to be able to handle errors
      in the caller (as wait_for_prior_commit() can return error in case of
      deadlock kill).
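The ordering that MDEV-7936 asks for (wait for prior commits first, claim the temporary table second, and let the caller see a failed wait) can be illustrated with a small Python model. It is a sketch under stated assumptions, not the server code: `TempTable`, `DeadlockKill`, `claim_temp_table` and the `wait_for_prior_commit` callback are hypothetical stand-ins for the C++ objects named in the commit message.

```python
from typing import Callable, Optional


class DeadlockKill(Exception):
    """Hypothetical signal that the waiting worker was chosen as deadlock victim."""


class TempTable:
    def __init__(self, name: str) -> None:
        self.name = name
        self.in_use: Optional[str] = None   # worker that currently owns the table


def claim_temp_table(table: TempTable, worker: str,
                     wait_for_prior_commit: Callable[[], None]) -> bool:
    """Return True once `worker` owns `table`, False if the wait was interrupted."""
    try:
        # Serialise first: every earlier transaction must have committed
        # before this worker may mark the table as its own.
        wait_for_prior_commit()
    except DeadlockKill:
        # The error is reported to the caller, which is expected to roll
        # back and retry; the table is left untouched.
        return False
    table.in_use = worker
    return True
```

Because the ownership change happens only after the wait returns, two workers in this model can never both believe they own the same table at the same time.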
    • MDEV-7668: Intermediate master groups CREATE TEMPORARY with INSERT, causing parallel replication failure · c47fe0e9
      Kristian Nielsen authored
      
      [This commit cherry-picked to be able to merge MDEV-7936, of which it
      is a pre-requisite, into both 10.0 and 10.1.]
      
      Parallel replication depends on locking (table locks, row locks, etc.) to
      prevent two conflicting transactions from running and committing in parallel.
      But temporary tables are designed to be visible only to one thread, and have
      no such locking.
      
      In the concrete issue, an intermediate master could commit a CREATE TEMPORARY
      TABLE in the same group commit as an INSERT into that table. Thus, a
      lower-level slave could attempt to run them in parallel and get an error.
      
      More generally, we need protection from parallel replication trying to run
      transactions in parallel that access a common temporary table.
      
      This patch simply causes use of a temporary table from parallel replication
      to wait for all previous transactions to commit, serialising the replication
      at that point.
      
      (More fine-grained locking could possibly be added later. However,
      using temporary tables in statement-based replication is in any case
      normally undesirable; for example a restart of the server will lose
      temporary tables and can break replication).
      
      Note that row-based replication is not affected, as it does not use any
      temporary tables on the slave side.
      
      This patch also cleans up the locking around protecting the list of
      temporary tables in Relay_log_info. This used to take the
      rli->data_lock at the end of every statement, which is very bad for
      concurrency. With this patch, the lock is not taken unless temporary
      tables (with statement-based binlogging) are in use on the slave.
  7. 09 Apr, 2015 2 commits
  8. 08 Apr, 2015 4 commits
    • Merge MDEV-7910 into 10.0 · 670d4dd8
      Kristian Nielsen authored
    • MDEV-7910: innodb.binlog_consistent fails sporadically in buildbot · b3c7c8cd
      Kristian Nielsen authored
      The test case was missing --source include/wait_for_binlog_checkpoint.inc.
      So it could occasionally fail if the checkpoint managed to occur just at the
      right point in time between fetching the two binlog positions to compare.
    • Kristian Nielsen's avatar
      accdabd6
    • MDEV-7888, MDEV-7929: Parallel replication hangs sometimes on ANALYZE TABLE or DDL · 3b961347
      Kristian Nielsen authored
      The hangs occur when the group_commit_orderer object is freed before the last
      mark_start_commit() call on it - this loses the wakeup to other waiting worker
      threads, causing them to hang until killed manually.
      
      The object was freed because wakeup_subsequent_commits() was called too early
      in two places: for MDEV-7888, during ANALYZE TABLE, and for MDEV-7929, during
      record_gtid() after processing a DDL event. The group_commit_orderer object
      can be freed when its last transaction has called wait_for_prior_commit().
      
      Fix by implementing a suspend/resume mechanism for wakeup_subsequent_commits()
      that can be used in places where a transaction is committed without this being
      the commit of the actual replication event group.
      
      Also add a protection mechanism (that asserts in debug builds) which can
      prevent the too-early free and hang if other similar bugs should remain in
      other parts of the code.
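The suspend/resume idea from MDEV-7888 and MDEV-7929 can be sketched as a guard around the wakeup call. This is a simplified Python model, not the actual implementation: the class `CommitOrderer`, the `wakeups_suspended` context manager and the `waiters`/`woken` lists are hypothetical; only the name `wakeup_subsequent_commits` comes from the commit message.

```python
from contextlib import contextmanager
from typing import Iterator, List


class CommitOrderer:
    """Toy model of an object that wakes up transactions waiting to commit."""

    def __init__(self) -> None:
        self._wakeup_blocked = False
        self.waiters: List[str] = []   # transactions waiting for this commit
        self.woken: List[str] = []     # transactions that have been released

    @contextmanager
    def wakeups_suspended(self) -> Iterator[None]:
        """Suspend wakeups for an intermediate commit, then restore the old state."""
        previous = self._wakeup_blocked
        self._wakeup_blocked = True
        try:
            yield
        finally:
            self._wakeup_blocked = previous

    def wakeup_subsequent_commits(self) -> bool:
        """Release the waiters unless wakeups are currently suspended."""
        if self._wakeup_blocked:
            return False               # not the real commit of the event group
        self.woken.extend(self.waiters)
        self.waiters.clear()
        return True
```

In this model an intermediate commit (for example inside ANALYZE TABLE) would run inside `with orderer.wakeups_suspended():`, so the waiters are released only by the commit of the actual event group.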
  9. 06 Apr, 2015 1 commit
  10. 31 Mar, 2015 2 commits
  11. 30 Mar, 2015 3 commits
    • Merge MDEV-7847 and MDEV-7882 into 10.0. · c41e4d3b
      Kristian Nielsen authored
      Conflicts:
      	mysql-test/suite/rpl/r/rpl_parallel.result
      	mysql-test/suite/rpl/t/rpl_parallel.test
    • MDEV-7847: "Slave worker thread retried transaction 10 time(s) in vain, giving up", followed by replication hanging · 880f2273
      Kristian Nielsen authored
      
      This patch fixes a bug in the error handling in parallel replication, when one
      worker thread gets a failure and other worker threads processing later
      transactions have to roll back and abort.
      
      The problem was with the lifetime of group_commit_orderer objects (GCOs).
      A GCO is freed when we register that its last event group has committed. This
      relies on register_wait_for_prior_commit() and wait_for_prior_commit() to
      ensure that the fact that T2 has committed implies that any earlier T1 has
      also committed, and can thus no longer execute mark_start_commit().
      
      However, in the error case, the code was skipping the
      register_wait_for_prior_commit() and wait_for_prior_commit() calls. Thus
      commit ordering was not guaranteed, and a GCO could be freed too early. Then a
      later mark_start_commit() would reference a deallocated GCO, which could lead
      to a lost wakeup (causing slave threads to hang) or other corruption.
      
      This patch makes the error case respect commit order as well. That way, the
      error case also gets the GCO lifetime correct, and the hang no longer occurs.
    • MDEV-7882: Excessive transaction retry in parallel replication · a4082918
      Kristian Nielsen authored
      When a transaction in parallel replication needs to retry (e.g. because of
      deadlock kill), first wait for all prior transactions to commit before doing
      the retry. This way, we avoid the retry once again conflicting with a prior
      transaction, requiring yet another retry.
      
      Without this patch, we saw "in the wild" that transactions had to be retried
      more than 10 times to succeed, which exceeds the default
      --slave_transaction_retries value and is in any case undesirable.
      
      (We already do this in 10.1 in "optimistic" parallel replication mode; this
      patch just makes the code use the same logic for "conservative" mode, the
      only mode in 10.0.)
  12. 25 Mar, 2015 1 commit
  13. 18 Mar, 2015 3 commits
  14. 17 Mar, 2015 2 commits
  15. 16 Mar, 2015 1 commit
  16. 13 Mar, 2015 2 commits
    • MDEV-7249: Performance problem in parallel replication with multi-level slaves · 184f718f
      Kristian Nielsen authored
      Parallel replication (in 10.0 / "conservative" mode) relies on binlog group
      commits to group transactions that can be safely run in parallel on the
      slave. The --binlog-commit-wait-count and --binlog-commit-wait-usec options
      exist to increase the number of commits per group. But in case of conflicts
      between transactions, this can cause unnecessary delay and reduced throughput,
      especially on a slave where commit order is fixed.
      
      This patch adds a heuristic to reduce this problem. When transaction T1 goes
      to commit, it will first wait for N transactions to queue up for a group
      commit. However, if we detect that another transaction T2 is waiting for a row
      lock held by T1, then we will skip the wait and let T1 commit immediately,
      releasing its locks and letting T2 continue.
      
      On a slave, this avoids the unfortunate situation where T1 is waiting for T2
      to join the group commit, but T2 is waiting for T1 to release locks, causing
      no work to be done for the duration of the --binlog-commit-wait-usec timeout.
      
      (The heuristic seems reasonable on the master as well, so it is enabled for
      all transactions, not just replication transactions).
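The decision made by the MDEV-7249 heuristic can be written down as a small predicate. This is a hedged Python sketch of the logic as the commit message describes it, not the server's code: the function name and all parameter names are hypothetical, and treating a non-positive wait count as "waiting disabled" is an assumption of this model.

```python
def should_commit_now(queued: int, wait_count: int,
                      waited_usec: int, wait_usec: int,
                      lock_waiter_present: bool) -> bool:
    """Decide whether T1 should stop waiting for more transactions to join."""
    if wait_count <= 0:
        return True    # assumption of this sketch: waiting is disabled
    if queued >= wait_count:
        return True    # enough transactions have queued up for the group
    if waited_usec >= wait_usec:
        return True    # the configured timeout has expired
    if lock_waiter_present:
        # The heuristic: some transaction T2 is blocked on a row lock held by
        # T1, so T2 cannot join the group until T1 commits. Waiting is futile.
        return True
    return False       # keep waiting for more transactions
```

For example, with 1 of 4 transactions queued and no time elapsed, the predicate keeps waiting unless a lock waiter is detected, which is exactly the stall on a slave that the commit message describes.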
    • MDEV-7387 [PATCH] Alter table xxx CHARACTER SET utf8, CONVERT TO CHARACTER SET latin1 should fail · bc902a2b
      Alexander Barkov authored
      A contribution from Daniel Black, with minor additional enhancements.
  17. 12 Mar, 2015 1 commit
  18. 11 Mar, 2015 1 commit
    • MDEV-5289: master server starts slave parallel threads · ed04c40b
      Kristian Nielsen authored
      Delay spawning parallel replication worker threads until a slave SQL
      thread is running, and de-spawn them when the last SQL thread stops.
      
      This is especially useful to avoid needless threads on a master in a
      setup where the same my.cnf is used on masters and slaves.
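The lazy spawn and de-spawn behaviour of MDEV-5289 amounts to reference counting the running SQL threads. The Python sketch below models only that bookkeeping; `WorkerPool` and its methods are hypothetical names, and the "workers" are plain labels rather than real threads.

```python
from typing import List


class WorkerPool:
    """Toy model: workers exist only while at least one SQL thread is running."""

    def __init__(self, configured_threads: int) -> None:
        self.configured_threads = configured_threads
        self.running_sql_threads = 0
        self.workers: List[str] = []

    def sql_thread_started(self) -> None:
        self.running_sql_threads += 1
        if self.running_sql_threads == 1:
            # First SQL thread: spawn the configured number of workers now,
            # not at server startup.
            self.workers = [f"worker-{i}" for i in range(self.configured_threads)]

    def sql_thread_stopped(self) -> None:
        if self.running_sql_threads == 0:
            raise RuntimeError("no SQL thread is running")
        self.running_sql_threads -= 1
        if self.running_sql_threads == 0:
            # Last SQL thread stopped: de-spawn all workers.
            self.workers = []
```

A server that never starts a slave SQL thread, such as a pure master sharing its configuration with the slaves, never leaves the initial state with zero workers.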
  19. 09 Mar, 2015 4 commits
    • MDEV-7685: MariaDB - server crashes when inserting more rows than available space on disk · a7fd11b3
      Jan Lindström authored
      
      Add error handling for the disk-full situation and intentionally
      bring the server down with a stack trace, because in all cases
      InnoDB cannot continue anyway.
    • MDEV-7107 Sporadic test failure in multi_source.multisource · ec16d1b6
      Elena Stepanova authored
      Extend show_slave_status.inc to run SHOW ALL SLAVES STATUS and
      SHOW SLAVE 'name' STATUS on demand, and make the test use
      the include file instead of direct SHOW statements.
    • MDEV-7668: Intermediate master groups CREATE TEMPORARY with INSERT, causing parallel replication failure · 96784eb1
      Kristian Nielsen authored
      
      Parallel replication depends on locking (table locks, row locks, etc.) to
      prevent two conflicting transactions from running and committing in parallel.
      But temporary tables are designed to be visible only to one thread, and have
      no such locking.
      
      In the concrete issue, an intermediate master could commit a CREATE TEMPORARY
      TABLE in the same group commit as an INSERT into that table. Thus, a
      lower-level slave could attempt to run them in parallel and get an error.
      
      More generally, we need protection from parallel replication trying to run
      transactions in parallel that access a common temporary table.
      
      This patch simply causes use of a temporary table from parallel replication
      to wait for all previous transactions to commit, serialising the replication
      at that point.
      
      (More fine-grained locking could possibly be added later. However,
      using temporary tables in statement-based replication is in any case
      normally undesirable; for example a restart of the server will lose
      temporary tables and can break replication).
      
      Note that row-based replication is not affected, as it does not use any
      temporary tables on the slave side.
      
      This patch also cleans up the locking around protecting the list of
      temporary tables in Relay_log_info. This used to take the
      rli->data_lock at the end of every statement, which is very bad for
      concurrency. With this patch, the lock is not taken unless temporary
      tables (with statement-based binlogging) are in use on the slave.
    • MDEV-7627: Some symbols in table name can cause to Error Code: 1050 when created FK · 040027c8
      Jan Lindström authored
      
      Analysis: The table name is in the filename charset, but foreign key
      identifiers are not. This led to an incorrect foreign key
      identifier number being used.
      
      Fix: Convert the foreign key identifier to the filename charset before
      comparing it to the table name when the largest foreign key identifier
      number is resolved.
  20. 08 Mar, 2015 1 commit
    • MDEV-7187 perfschema.aggregate fails sporadically in buildbot · 6fc0a8af
      Elena Stepanova authored
      During slow execution, e.g. under valgrind, there was a chance
      that an Aria checkpoint would happen while P_S tables were being
      queried; it could cause different data in the joined P_S tables, and
      thus combinations of results that the test did not expect.
      
      Fixed by disabling Aria checkpoints for the test.
  21. 06 Mar, 2015 3 commits