    MDEV-32096 Parallel replication lags because innobase_kill_query() may fail to... · e039720b
    Marko Mäkelä authored
    MDEV-32096 Parallel replication lags because innobase_kill_query() may fail to interrupt a lock wait
    
    lock_sys_t::cancel(trx_t*): Remove, and merge to its only caller
    innobase_kill_query().
    
    innobase_kill_query(): Acquire lock_sys.wait_mutex before reading
    trx->lock.wait_lock, as we did before
    commit e71e6133 (MDEV-24671).
    This way, we should not miss a lock wait that the killee
    transaction started only recently.
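
The ordering argument above can be illustrated with a small stand-alone model. This is only a sketch, not the InnoDB code: ModelTrx, ModelLock, model_begin_wait(), model_kill_query() and the global wait_mutex are hypothetical stand-ins for trx_t, lock_t, the start of a lock wait, innobase_kill_query() and lock_sys.wait_mutex. The assumption is that the waiter publishes its wait while holding the mutex and checks the kill flag first.

```cpp
#include <atomic>
#include <mutex>

// Hypothetical, simplified stand-ins; these are not the real InnoDB types.
struct ModelLock {};

struct ModelTrx {
  ModelLock *wait_lock = nullptr;   // models trx->lock.wait_lock
  std::atomic<bool> killed{false};  // models the "query was killed" flag
  bool wait_cancelled = false;      // set when the killer cancels the wait
};

std::mutex wait_mutex;              // models lock_sys.wait_mutex

// Waiter side: publish the wait only while holding wait_mutex.
// Returns false if the kill was noticed and no wait was started.
bool model_begin_wait(ModelTrx &trx, ModelLock &lock) {
  std::lock_guard<std::mutex> guard(wait_mutex);
  if (trx.killed.load())
    return false;
  trx.wait_lock = &lock;
  return true;
}

// Killer side: set the flag, then read wait_lock only under wait_mutex.
// Returns true if a pending wait was found and cancelled.
bool model_kill_query(ModelTrx &trx) {
  trx.killed.store(true);
  std::lock_guard<std::mutex> guard(wait_mutex);
  if (!trx.wait_lock)
    return false;                   // no wait yet; the waiter will see the flag
  trx.wait_lock = nullptr;
  trx.wait_cancelled = true;
  return true;
}
```

In this model, whichever side takes the mutex first, the kill is not lost: either the killer sees the published wait and cancels it, or the waiter sees the flag and never starts waiting.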
    
    lock_rec_lock(): Add a DEBUG_SYNC "lock_rec" for the test case.
    
    lock_wait(): Invoke trx_is_interrupted() before entering the wait,
    in case innobase_kill_query() was invoked some time earlier and
    some longer-running operation did not check for interrupts.
    As suggested by Vladislav Lesin, do not overwrite
    trx->error_state==DB_INTERRUPTED with DB_SUCCESS.
    This would avoid a call to trx_is_interrupted() when the test is
    modified to use the DEBUG_SYNC point lock_wait_start instead of lock_rec.
    Avoid some redundant loads of trx->lock.wait_lock; cache the value
    in the local variable wait_lock.
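
A minimal sketch of the entry check described above, under stated assumptions: ModelErr, ModelTrxState and model_lock_wait_entry() are hypothetical names, the killed member stands in for what trx_is_interrupted() would report, and the real lock_wait() does considerably more than this.

```cpp
// Hypothetical, simplified model; not the real InnoDB definitions.
enum ModelErr { MODEL_SUCCESS, MODEL_INTERRUPTED, MODEL_WOULD_WAIT };

struct ModelTrxState {
  ModelErr error_state = MODEL_SUCCESS;  // models trx->error_state
  bool killed = false;                   // models what trx_is_interrupted() reports
  int interrupt_polls = 0;               // counts how often the flag was polled
};

// Decide what to do on entry to a lock wait.
ModelErr model_lock_wait_entry(ModelTrxState &trx) {
  // An interrupt that was already recorded is kept as is:
  // it is not reset to success, and the flag is not polled again.
  if (trx.error_state == MODEL_INTERRUPTED)
    return MODEL_INTERRUPTED;

  // Otherwise poll once before waiting, in case the kill arrived while
  // a longer-running operation was not checking for interrupts.
  ++trx.interrupt_polls;
  if (trx.killed) {
    trx.error_state = MODEL_INTERRUPTED;
    return MODEL_INTERRUPTED;
  }

  trx.error_state = MODEL_SUCCESS;
  return MODEL_WOULD_WAIT;               // the caller would now enter the wait
}
```

The poll counter exists only to make the second point observable: once the interrupted state is recorded, a later entry returns immediately without polling again.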
    
    Deadlock::check_and_resolve(): Take wait_lock as a parameter and
    return wait_lock (or -1 or nullptr). We only need to reload
    trx->lock.wait_lock if lock_sys.wait_mutex had been released
    and reacquired.
    
    trx_t::error_state: Correctly document the data member.
    
    trx_lock_t::was_chosen_as_deadlock_victim: Clarify that other threads
    may set the field (or flags in it) while holding lock_sys.wait_mutex.
    
    Thanks to Johannes Baumgarten for reporting the problem and testing
    the fix, as well as to Kristian Nielsen for suggesting the fix.
    
    Reviewed by: Vladislav Lesin
    Tested by: Matthias Leich
trx0trx.h 40.8 KB