    Kristian Nielsen authored
    MDEV-8031: Parallel replication stops on "connection killed" error (probably incorrectly handled deadlock kill)
    
    There was a rare race, where a deadlock error might not be correctly
    handled, causing the slave to stop with something like this in the error
    log:
    
    150423 14:04:10 [ERROR] Slave SQL: Connection was killed, Gtid 0-1-2, Internal MariaDB error code: 1927
    150423 14:04:10 [Warning] Slave: Connection was killed Error_code: 1927
    150423 14:04:10 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213
    150423 14:04:10 [Warning] Slave: Connection was killed Error_code: 1927
    150423 14:04:10 [Warning] Slave: Connection was killed Error_code: 1927
    150423 14:04:10 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master-bin.000001' position 1234
    
    The problem was incorrect error handling. When a deadlock is detected, a
    KILL CONNECTION is issued against the offending thread. The resulting kill
    error is later converted to a deadlock error, and the transaction is retried.
    
    However, neither the deadlock error nor the lingering kill signal was
    cleared at the start of the retry. So it was possible to get another
    deadlock kill early during the retry. With particular thread
    scheduling/timing, the new KILL CONNECTION error could then be masked by
    the earlier deadlock error, so that the second kill was not properly
    converted into a deadlock error and retried.
    
    This patch adds code that clears the old error and killed flag before
    starting the retry. It also adds code to handle a deadlock kill caught in a
    couple of places where it was not handled before.
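
The retry logic described above can be sketched roughly as follows. This is a simplified, self-contained illustration and not the code in rpl_parallel.cc: `WorkerState`, `deadlock_kill`, `convert_kill_to_deadlock`, `reset_for_retry` and `run_with_retry` are hypothetical names, and only the two error codes (1213 and 1927) are taken from the log excerpt.

```cpp
#include <cassert>
#include <functional>

// Error codes as they appear in the log excerpt; the enum names are hypothetical.
enum ErrorCode {
  ERR_NONE = 0,
  ERR_LOCK_DEADLOCK = 1213,
  ERR_CONNECTION_KILLED = 1927
};

// Hypothetical per-worker state; a stand-in, not MariaDB's actual thread object.
struct WorkerState {
  ErrorCode error = ERR_NONE;   // error left behind by the current attempt
  bool killed = false;          // pending kill flag
  bool deadlock_kill = false;   // true if the kill came from deadlock detection
};

// Simulates deadlock detection issuing a kill against this worker.
void deadlock_kill(WorkerState &w) {
  w.killed = true;
  w.deadlock_kill = true;
  w.error = ERR_CONNECTION_KILLED;
}

// Converts a kill caused by deadlock detection into a retryable deadlock error.
// Returns true if a conversion took place.
bool convert_kill_to_deadlock(WorkerState &w) {
  if (w.error == ERR_CONNECTION_KILLED && w.deadlock_kill) {
    w.error = ERR_LOCK_DEADLOCK;
    return true;
  }
  return false;
}

// Clears the old error and the killed flag so that stale state from the
// previous attempt cannot mask a new kill during the next attempt.
void reset_for_retry(WorkerState &w) {
  w = WorkerState{};
}

// Runs `attempt` until it succeeds, hits a non-retryable error, or runs out
// of retries. Returns the number of attempts on success, -1 otherwise.
int run_with_retry(WorkerState &w,
                   const std::function<void(WorkerState &)> &attempt,
                   int max_retries) {
  for (int tries = 1; tries <= max_retries + 1; ++tries) {
    reset_for_retry(w);  // every attempt starts from a clean state
    assert(w.error == ERR_NONE && !w.killed);
    attempt(w);
    if (w.error == ERR_NONE)
      return tries;
    if (!convert_kill_to_deadlock(w) && w.error != ERR_LOCK_DEADLOCK)
      return -1;  // a genuine kill or other error: do not retry
  }
  return -1;  // retries exhausted
}
```

In this sketch the reset happens at the top of every attempt, so a second deadlock kill arriving early in a retry is always seen against clean state and converted like the first one. A kill that did not originate from deadlock detection is left as a kill error and stops the loop.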
    b616991a
rpl_parallel.cc 70.5 KB