    MDEV-7326: Server deadlock in connection with parallel replication · f27817c1
    Kristian Nielsen authored
    The bug occurs when a transaction retries after all transactions in a
    group commit batch from the master have called mark_start_commit(). In this
    case, the retrying transaction can call unmark_start_commit() after the
    following batch has already started running and de-allocated the GCO (group
    commit orderer). After the retry, the transaction then repeats
    mark_start_commit() on a de-allocated GCO, and the wakeup of later GCOs can
    be lost.
    
    This was seen "in the wild" by a user, even though it is not known exactly
    what circumstances can lead to retry of one transaction after all transactions
    in a group have reached the commit phase.
    
    The lifetime handling of GCOs was somewhat clunky anyway. With this patch, a GCO
    lives until rpl_parallel_entry::last_committed_sub_id has reached the last
    transaction in the GCO. This guarantees that the GCO will still be alive when
    a transaction does mark_start_commit(). Also, we now loop over the list of
    active GCOs for wakeup, to ensure we do not lose a wakeup even in the
    problematic case.
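
    The lifetime rule and the wakeup loop described above can be sketched
    roughly as follows. This is a minimal, single-threaded illustration, not
    the server's actual code: the Gco and Entry types, their fields, and
    finish_transaction() are hypothetical simplifications, and all locking and
    condition-variable signalling is omitted.

```cpp
#include <cassert>
#include <cstdint>
#include <list>

// Hypothetical, simplified stand-in for a group commit orderer (GCO).
struct Gco {
  uint64_t wait_count;   // group may run once this many transactions started commit
  uint64_t last_sub_id;  // sub_id of the last transaction in this group
  bool may_run;          // set when the group has been woken up
};

// Hypothetical, simplified stand-in for rpl_parallel_entry.
struct Entry {
  uint64_t count_committing = 0;       // transactions that have started commit
  uint64_t last_committed_sub_id = 0;  // highest sub_id that has fully committed
  std::list<Gco> active_gcos;          // oldest group first

  // Loop over all active GCOs rather than only the next one, so a wakeup
  // cannot be lost when a retry temporarily lowered the counter.
  void wake_ready_gcos() {
    for (Gco &g : active_gcos)
      if (count_committing >= g.wait_count)
        g.may_run = true;
  }

  void mark_start_commit() {
    ++count_committing;
    wake_ready_gcos();
  }

  // Called when a transaction has to roll back and retry.
  void unmark_start_commit() {
    assert(count_committing > 0);
    --count_committing;
  }

  // A GCO is released only once last_committed_sub_id has reached its last
  // transaction, so it is still alive whenever one of its transactions
  // calls mark_start_commit(), including after a retry.
  void finish_transaction(uint64_t sub_id) {
    if (sub_id > last_committed_sub_id)
      last_committed_sub_id = sub_id;
    while (!active_gcos.empty() &&
           active_gcos.front().last_sub_id <= last_committed_sub_id)
      active_gcos.pop_front();
  }
};
```

    In this sketch, a transaction that calls unmark_start_commit() and then
    mark_start_commit() again after the next group has started still finds its
    own GCO in active_gcos, because the GCO is only removed once the last
    transaction of the group has committed.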
rpl_rli.h 25.7 KB