- 23 Apr, 2015 1 commit
-
-
Kristian Nielsen authored
MDEV-8031: Parallel replication stops on "connection killed" error (probably incorrectly handled deadlock kill) There was a rare race, where a deadlock error might not be correctly handled, causing the slave to stop with something like this in the error log: 150423 14:04:10 [ERROR] Slave SQL: Connection was killed, Gtid 0-1-2, Internal MariaDB error code: 1927 150423 14:04:10 [Warning] Slave: Connection was killed Error_code: 1927 150423 14:04:10 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213 150423 14:04:10 [Warning] Slave: Connection was killed Error_code: 1927 150423 14:04:10 [Warning] Slave: Connection was killed Error_code: 1927 150423 14:04:10 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master-bin.000001 position 1234 The problem was incorrect error handling. When a deadlock is detected, it causes a KILL CONNECTION on the offending thread. This error is then later converted to a deadlock error, and the transaction is retried. However, the deadlock error was not cleared at the start of the retry, nor was the lingering kill signal. So it was possible to get another deadlock kill early during retry. If this happened with particular thread scheduling/timing, it was possible that the new KILL CONNECTION error was masked by the earlier deadlock error, so that the second kill was not properly converted into a deadlock error and retry. This patch adds code that clears the old error and killed flag before starting the retry. It also adds code to handle a deadlock kill caught in a couple of places where it was not handled before.
-
- 21 Apr, 2015 1 commit
-
-
Kristian Nielsen authored
Fix a silly typo that caused the test to occasionally fail.
-
- 20 Apr, 2015 1 commit
-
-
Kristian Nielsen authored
This was a regression from the patch for MDEV-7668. A test was incorrect, so the slave would not properly handle re-using temporary tables, which lead to replication failure in this case.
-
- 19 Apr, 2015 1 commit
-
-
Elena Stepanova authored
-
- 14 Apr, 2015 2 commits
-
-
Kristian Nielsen authored
-
Kristian Nielsen authored
Add some suppressions that were missing. They are for if a STOP SLAVE is executed early during IO thread startup, when it is negotiating with the master. The master connection may be killed in the middle of a mysql_real_query(), which is not a test failure if it is a network error. This also caught one real code error, fixed with this commit: The I/O thread would fail to automatically reconnect if a network error happened while fetching the value of @@GLOBAL.gtid_domain_id.
-
- 13 Apr, 2015 3 commits
-
-
Kristian Nielsen authored
Conflicts: sql/sql_base.cc
-
Kristian Nielsen authored
MDEV-7936: Assertion `!table || table->in_use == _current_thd()' failed on parallel replication in optimistic mode Make sure that in parallel replication, we execute wait_for_prior_commit() before setting table->in_use for a temporary table. Otherwise we can end up with two parallel replication worker threads competing with each other for use of a temporary table. Re-factor the use of find_temporary_table() to be able to handle errors in the caller (as wait_for_prior_commit() can return error in case of deadlock kill).
-
Kristian Nielsen authored
MDEV-7668: Intermediate master groups CREATE TEMPORARY with INSERT, causing parallel replication failure [This commit cherry-picked to be able to merge MDEV-7936, of which it is a pre-requisite, into both 10.0 and 10.1.] Parallel replication depends on locking (table locks, row locks, etc.) to prevent two conflicting transactions from running and committing in parallel. But temporary tables are designed to be visible only to one thread, and have no such locking. In the concrete issue, an intermediate master could commit a CREATE TEMPORARY TABLE in the same group commit as in INSERT into that table. Thus, a lower-level master could attempt to run them in parallel and get an error. More generally, we need protection from parallel replication trying to run transactions in parallel that access a common temporary table. This patch simply causes use of a temporary table from parallel replication to wait for all previous transactions to commit, serialising the replication at that point. (A more fine-grained locking could be added later, possibly. However, using temporary tables in statement-based replication is in any case normally undesirable; for example a restart of the server will lose temporary tables and can break replication). Note that row-based replication is not affected, as it does not do any temporary tables on the slave-side. This patch also cleans up the locking around protecting the list of temporary tables in Relay_log_info. This used to take the rli->data_lock at the end of every statement, which is very bad for concurrency. With this patch, the lock is not taken unless temporary tables (with statement-based binlogging) are in use on the slave.
-
- 09 Apr, 2015 2 commits
-
-
Kristian Nielsen authored
-
Kristian Nielsen authored
Fix a race in the test case. When we do start_slave.inc immediately followed by stop_slave.inc, it is possible to kill the IO thread while it is still running inside get_master_version_and_clock(), and this gives warnings in the error log that cause the test to fail.
-
- 08 Apr, 2015 4 commits
-
-
Kristian Nielsen authored
-
Kristian Nielsen authored
The test case was missing --source include/wait_for_binlog_checkpoint.inc. So it could occasionally fail if the checkpoint managed to occur just at the right point in time between fetching the two binlog positions to compare.
-
Kristian Nielsen authored
-
Kristian Nielsen authored
The hangs occur when the group_commit_orderer object is freed before the last mark_start_commit() call on it - this loses the wakeup to other waiting worker threads, causing them to hang until killed manually. The object was freed because wakeup_subsequent_commits() was called two early in two places. For MDEV-7888, during ANALYZE TABLE, and for MDEV-7929 during record_gtid() after processing a DDL event. The group_commit_orderer object can be freed when its last transaction has called wait_for_prior_commit(). Fix by implementing a suspend/resume mechanism for wakeup_subsequent_commits() that can be used in places where a transaction is committed without this being the commit of the actual replication event group. Also add a protection mechanism (that asserts in debug builds) which can prevent the too-early free and hang if other similar bugs should remain in other parts of the code.
-
- 06 Apr, 2015 1 commit
-
-
Jan Lindström authored
Problem was that in XA prepared state we should still be able to release a savepoint, but assertions were too strict.
-
- 31 Mar, 2015 2 commits
-
-
Jan Lindström authored
Analysis: MySQL table definition contains also virtual columns. Similarly, index fielnr references MySQL table fields. However, InnoDB table definition does not contain virtual columns. Therefore, when matching MySQL key fieldnr we need to use actual column name to find out referenced InnoDB dictionary column name. Fix: Add new function to match MySQL index key columns to InnoDB dictionary.
-
Jan Lindström authored
Replace static array of thread sync levels with std::vector.
-
- 30 Mar, 2015 3 commits
-
-
Kristian Nielsen authored
Conflicts: mysql-test/suite/rpl/r/rpl_parallel.result mysql-test/suite/rpl/t/rpl_parallel.test
-
Kristian Nielsen authored
MDEV-7847: "Slave worker thread retried transaction 10 time(s) in vain, giving up", followed by replication hanging This patch fixes a bug in the error handling in parallel replication, when one worker thread gets a failure and other worker threads processing later transactions have to rollback and abort. The problem was with the lifetime of group_commit_orderer objects (GCOs). A GCO is freed when we register that its last event group has committed. This relies on register_wait_for_prior_commit() and wait_for_prior_commit() to ensure that the fact that T2 has committed implies that any earlier T1 has also committed, and can thus no longer execute mark_start_commit(). However, in the error case, the code was skipping the register_wait_for_prior_commit() and wait_for_prior_commit() calls. Thus commit ordering was not guaranteed, and a GCO could be freed too early. Then a later mark_start_commit() would reference deallocated GCO, which could lead to lost wakeup (causing slave threads to hang) or other corruption. This patch makes also the error case respect commit order. This way, also the error case gets the GCO lifetime correct, and the hang no longer occurs.
-
Kristian Nielsen authored
When a transaction in parallel replication needs to retry (eg. because of deadlock kill), first wait for all prior transactions to commit before doing the retry. This way, we avoid the retry once again conflicting with a prior transaction, requiring yet another retry. Without this patch, we saw "in the wild" that transactions had to be retried more than 10 times to succeed, which exceeds the default --slave_transaction_retries value and is in any case undesirable. (We already do this in 10.1 in "optimistic" parallel replication mode; this patch just makes the code use the same logic for "conservative" mode (only mode in 10.0)).
-
- 25 Mar, 2015 1 commit
-
-
Sergei Petrunia authored
Fix BigEndian build for Cassandra SE
-
- 18 Mar, 2015 3 commits
-
-
Jan Lindström authored
-
Jan Lindström authored
There is a bug in Visual Studio 2010 Visual Studio has a feature "Checked Iterators". In a debug build, every iterator operation is checked at runtime for errors, e g, out of range. Disable this "Checked Iterators" for Windows and Debug if defined.
-
Jan Lindström authored
-
- 17 Mar, 2015 2 commits
-
-
Jan Lindström authored
Problem was that static array was used for storing thread mutex sync levels. Fixed by using std::vector instead. Does not contain test case to avoid too big memory/disk space usage on buildbot VMs.
-
Kristian Nielsen authored
-
- 16 Mar, 2015 1 commit
-
-
Kristian Nielsen authored
-
- 13 Mar, 2015 2 commits
-
-
Kristian Nielsen authored
Parallel replication (in 10.0 / "conservative" mode) relies on binlog group commits to group transactions that can be safely run in parallel on the slave. The --binlog-commit-wait-count and --binlog-commit-wait-usec options exist to increase the number of commits per group. But in case of conflicts between transactions, this can cause unnecessary delay and reduced througput, especially on a slave where commit order is fixed. This patch adds a heuristics to reduce this problem. When transaction T1 goes to commit, it will first wait for N transactions to queue up for a group commit. However, if we detect that another transaction T2 is waiting for a row lock held by T1, then we will skip the wait and let T1 commit immediately, releasing locks and let T2 continue. On a slave, this avoids the unfortunate situation where T1 is waiting for T2 to join the group commit, but T2 is waiting for T1 to release locks, causing no work to be done for the duration of the --binlog-commit-wait-usec timeout. (The heuristic seems reasonable on the master as well, so it is enabled for all transactions, not just replication transactions).
-
Alexander Barkov authored
A contribution from Daniel Black, with minor additional enhancements.
-
- 12 Mar, 2015 1 commit
-
-
Jan Lindström authored
type storage engine. Authored by: Kentoku Shiba
-
- 11 Mar, 2015 1 commit
-
-
Kristian Nielsen authored
Delay spawning parallel replication worker threads until a slave SQL thread is running, and de-spawn them when the last SQL thread stops. This is especially useful to avoid needless threads on a master in a setup where same my.cnf is used on masters and slaves.
-
- 09 Mar, 2015 4 commits
-
-
Jan Lindström authored
available space on disk Add error handling when disk full situation happens and intentionally bring server down with stacktrace because on all cases InnoDB can't continue anyway.
-
Elena Stepanova authored
Extend show_slave_status.inc to run SHOW ALL SLAVES STATUS and SHOW SLAVE 'name' STATUS on demand, and make the test use the include file instead of direct SHOW statements
-
Kristian Nielsen authored
MDEV-7668: Intermediate master groups CREATE TEMPORARY with INSERT, causing parallel replication failure Parallel replication depends on locking (table locks, row locks, etc.) to prevent two conflicting transactions from running and committing in parallel. But temporary tables are designed to be visible only to one thread, and have no such locking. In the concrete issue, an intermediate master could commit a CREATE TEMPORARY TABLE in the same group commit as in INSERT into that table. Thus, a lower-level master could attempt to run them in parallel and get an error. More generally, we need protection from parallel replication trying to run transactions in parallel that access a common temporary table. This patch simply causes use of a temporary table from parallel replication to wait for all previous transactions to commit, serialising the replication at that point. (A more fine-grained locking could be added later, possibly. However, using temporary tables in statement-based replication is in any case normally undesirable; for example a restart of the server will lose temporary tables and can break replication). Note that row-based replication is not affected, as it does not do any temporary tables on the slave-side. This patch also cleans up the locking around protecting the list of temporary tables in Relay_log_info. This used to take the rli->data_lock at the end of every statement, which is very bad for concurrency. With this patch, the lock is not taken unless temporary tables (with statement-based binlogging) are in use on the slave.
-
Jan Lindström authored
when created FK Analysis: Table name is on filename charset but foreign key identifiers are not. This lead incorrect foreign key identifier number to be used. Fix: Convert foreign key identifier to filename charset before comparing it to table name when largest foreign key identifier number is resolved.
-
- 08 Mar, 2015 1 commit
-
-
Elena Stepanova authored
During slow execution, e.g. under valgrind, there was a chance that Aria checkpoint would happen while P_S tables were being queried; it could cause different data in joined P_S, and thus combinations of results that the test did not expect. Fixed by disabling Aria checkpoints for the test.
-
- 06 Mar, 2015 3 commits
-
-
Sergei Golubchik authored
-
Sergei Golubchik authored
disable a broken test, pending a proper fix
-
Sergei Golubchik authored
-