    Bug#11877216 InnoDB too eager to commit suicide on a busy server · 9a961662
    Marko Mäkelä authored
    sync_array_print_long_waits(): Return the longest waiting thread ID
    and the longest waited-for lock. Only if those remain unchanged
    between calls in srv_error_monitor_thread(), increment
    fatal_cnt. Otherwise, reset fatal_cnt.
    
    Background: There is a built-in watchdog in InnoDB whose purpose is to
    kill the server when some thread is stuck waiting for a mutex or
    rw-lock. Before this fix, the logic was flawed.
    
    The function sync_array_print_long_waits() returns TRUE if it finds a
    lock wait that exceeds 10 minutes (srv_fatal_semaphore_wait_threshold).
    The function srv_error_monitor_thread() will kill the server if this
    happens 10 times in a row (fatal_cnt reaches 10), checked every 30
    seconds. This is wrong, because this situation does not mean that the
    server is hung. If the server is very busy for a little over 15
    minutes, it will be killed.
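
The "little over 15 minutes" figure follows from the numbers above: 600 seconds until a wait first counts as fatal, plus ten consecutive 30-second checks. The following is a minimal sketch of that arithmetic, not the InnoDB code. The function `old_watchdog_kill_time()` and the macro names are hypothetical, and it assumes a single uninterrupted wait starting at t = 0, with the monitor's checks aligned to multiples of 30 seconds.

```c
#include <assert.h>

/* Figures quoted in the commit message; the macro names are illustrative. */
#define FATAL_WAIT_THRESHOLD_S 600 /* srv_fatal_semaphore_wait_threshold */
#define CHECK_INTERVAL_S       30  /* assumed monitor check period */
#define FATAL_CNT_LIMIT        10  /* fatal_cnt value that kills the server */

/*
 * Hypothetical model of the pre-fix watchdog. One thread starts waiting
 * at t = 0 and is granted its lock at t = wait_len_s. Returns the time
 * (in seconds) at which the server would be killed, or -1 if the wait
 * ends before fatal_cnt reaches the limit.
 */
static int old_watchdog_kill_time(int wait_len_s)
{
    int fatal_cnt = 0;
    int t;

    assert(wait_len_s >= 0);

    /* The monitor only sees the wait while the thread is still waiting. */
    for (t = CHECK_INTERVAL_S; t < wait_len_s; t += CHECK_INTERVAL_S) {
        /* At time t the thread has been waiting for t seconds. */
        if (t > FATAL_WAIT_THRESHOLD_S) {
            if (++fatal_cnt >= FATAL_CNT_LIMIT) {
                return t;
            }
        }
    }
    return -1;
}
```

Under these assumptions, a wait has to last slightly longer than 900 seconds before the tenth consecutive check observes it, which matches the 15-minute estimate.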
    
    Consider this example. Thread T1 is waiting for mutex M. Some time
    later, threads T2..Tn start waiting for the same mutex M. If T1 keeps
    waiting for 600 seconds, fatal_cnt will be incremented to 1. So far,
    so good. Now, if M is granted to T1, the server was obviously not
    stuck. But T2..Tn keep waiting, and their wait times will also
    exceed 600 seconds. If, 5 minutes later, some Tn has still been
    waiting for more than 10 minutes for the mutex M, the server can be
    killed, even though it is not stuck.
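
The example can be replayed in a small simulation that contrasts the old counting rule with the one this change introduces. This is only a sketch under stated assumptions: `struct wait`, `longest_wait()`, `watchdog_kill_time()` and `fill_convoy()` are hypothetical names, threads and locks are plain integers, checks happen every 30 seconds, and the longest waiter is reported on every check even when it is below the threshold. The real functions are sync_array_print_long_waits() and srv_error_monitor_thread(), whose details may differ.

```c
#include <assert.h>
#include <stddef.h>

#define THRESHOLD_S 600 /* fatal wait threshold from the commit message */
#define INTERVAL_S  30  /* assumed monitor check period */
#define LIMIT       10  /* fatal_cnt value that kills the server */

/* One lock wait: thread_id waits for lock_id during [start_s, end_s). */
struct wait { int thread_id; int lock_id; int start_s; int end_s; };

/* Reports the longest current waiter; returns 1 if its wait is fatal. */
static int longest_wait(const struct wait *w, size_t n, int now,
                        int *thread_id, int *lock_id)
{
    int longest = -1;
    size_t i;

    for (i = 0; i < n; i++) {
        int active = w[i].start_s <= now && now < w[i].end_s;
        if (active && now - w[i].start_s > longest) {
            longest = now - w[i].start_s;
            *thread_id = w[i].thread_id;
            *lock_id = w[i].lock_id;
        }
    }
    return longest > THRESHOLD_S;
}

/*
 * Returns the kill time in seconds, or -1 if the server survives until
 * horizon_s. With require_same_waiter == 0 this models the old rule;
 * with 1, fatal_cnt grows only while the same thread waits for the
 * same lock as on the previous check.
 */
static int watchdog_kill_time(const struct wait *w, size_t n,
                              int horizon_s, int require_same_waiter)
{
    int fatal_cnt = 0, old_thread = -1, old_lock = -1, now;

    assert(w != NULL);
    for (now = INTERVAL_S; now <= horizon_s; now += INTERVAL_S) {
        int thread = -1, lock = -1;
        int fatal = longest_wait(w, n, now, &thread, &lock);
        int same = thread == old_thread && lock == old_lock;

        if (fatal && (!require_same_waiter || same)) {
            if (++fatal_cnt >= LIMIT) {
                return now;
            }
        } else {
            fatal_cnt = 0;
        }
        old_thread = thread;
        old_lock = lock;
    }
    return -1;
}

/* Busy but healthy: T(i+1) queues for lock 1 at 30*i s and waits 640 s. */
static void fill_convoy(struct wait *w, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        w[i].thread_id = (int) i + 1;
        w[i].lock_id = 1;
        w[i].start_s = 30 * (int) i;
        w[i].end_s = 30 * (int) i + 640;
    }
}
```

With ten threads queued this way, every check from 630 s onward finds a different thread that has waited more than 600 seconds. In this model the old rule counts ten such checks in a row even though the mutex keeps being granted, while the new rule resets the counter each time the longest waiter changes. A single thread that never gets its lock is still detected under both rules.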
    
    rb:622 approved by Jimmy Yang
srv0srv.c 82.1 KB