    [DLM] can miss clearing resend flag · b790c3b7
    David Teigland authored
    A long, complicated sequence of events, beginning with the RESEND flag not
    being cleared on an lkb, can result in an unlock never completing.
    
    - lkb on waiters list for remote lookup
    - the remote node is both the dir node and the master node, so
      it optimizes the lookup into a request and sends a request
      reply back
    - the request reply is saved on the requestqueue to be processed
      after recovery
    - recovery runs dlm_recover_waiters_pre() which sets RESEND flag
      so the lookup will be resent after recovery
    - end of recovery: process_requestqueue takes the saved request reply,
      which takes the lkb off the waiters list _without_ clearing
      the RESEND flag
    - end of recovery: dlm_recover_waiters_post() doesn't do anything
      with the now completed lookup lkb (would usually clear RESEND)
    - later, the node unmounts, unlocks this lkb that still has RESEND
      flag set
    - the lkb is on the waiters list again, now for unlock, when recovery
      occurs; dlm_recover_waiters_pre() sees the lkb waiting for unlock with
      RESEND set, but doesn't do anything since the master still exists
    - end of recovery: dlm_recover_waiters_post() takes this lkb off
      the waiters list because it has the RESEND flag set, then reports
      an error because unlocks are never supposed to be handled in
      recover_waiters_post().
    - later, the unlock reply is received, doesn't find the lkb on
      the waiters list because recover_waiters_post() has wrongly
      removed it.
    - the unlock operation has been lost, and we're left with a
      stray granted lock
    - unmount spins waiting for the unlock to complete
    
    The visible evidence of this problem is a node where gfs umount is
    spinning, the dlm waiters list is empty, and the dlm locks list
    shows a granted lock.
    
    The fix is simply to clear the RESEND flag when taking an lkb off the
    waiters list.
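
The sequence and the fix can be sketched as a small user-space model. This is only an illustration under stated assumptions, not the code in lock.c: `struct lkb`, `FLAG_RESEND`, the `wait_type` enum, and every function name below are hypothetical stand-ins, and the `clear_resend` parameter exists only to compare the behaviour before and after the change.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; not the kernel's struct dlm_lkb or flag values. */
#define FLAG_RESEND 0x1u

enum wait_type { WAIT_NONE, WAIT_LOOKUP, WAIT_UNLOCK };

struct lkb {
    unsigned int flags;
    enum wait_type wait;    /* WAIT_NONE models "not on the waiters list" */
    bool granted;
};

static void add_to_waiters(struct lkb *l, enum wait_type t)
{
    l->wait = t;
}

/* clear_resend == true models the fix: drop RESEND on every removal. */
static void remove_from_waiters(struct lkb *l, bool clear_resend)
{
    l->wait = WAIT_NONE;
    if (clear_resend)
        l->flags &= ~FLAG_RESEND;
}

/* Start of recovery: mark a pending lookup for resend. A pending unlock
 * whose master still exists is left alone. */
static void recover_waiters_pre(struct lkb *l)
{
    if (l->wait == WAIT_LOOKUP)
        l->flags |= FLAG_RESEND;
}

/* End of recovery: any waiting lkb with RESEND set is taken off the list.
 * With a stale flag this wrongly removes an lkb waiting for an unlock. */
static void recover_waiters_post(struct lkb *l)
{
    if (l->wait == WAIT_NONE || !(l->flags & FLAG_RESEND))
        return;
    l->flags &= ~FLAG_RESEND;
    l->wait = WAIT_NONE;
}

/* Saved request reply: completes the lookup and grants the lock. */
static void receive_request_reply(struct lkb *l, bool clear_resend)
{
    if (l->wait != WAIT_LOOKUP)
        return;
    remove_from_waiters(l, clear_resend);
    l->granted = true;
}

/* Unlock reply: only takes effect if the lkb is still waiting for it. */
static bool receive_unlock_reply(struct lkb *l, bool clear_resend)
{
    if (l->wait != WAIT_UNLOCK)
        return false;       /* reply finds nothing on the waiters list */
    remove_from_waiters(l, clear_resend);
    l->granted = false;
    return true;
}

/* Replays the sequence above; returns true if the unlock completes. */
bool unlock_completes(bool clear_resend)
{
    struct lkb l = { 0, WAIT_NONE, false };

    add_to_waiters(&l, WAIT_LOOKUP);           /* remote lookup in flight */
    recover_waiters_pre(&l);                   /* first recovery: RESEND set */
    receive_request_reply(&l, clear_resend);   /* saved reply processed */
    recover_waiters_post(&l);                  /* lkb no longer waiting */

    add_to_waiters(&l, WAIT_UNLOCK);           /* unmount unlocks the lkb */
    recover_waiters_pre(&l);                   /* second recovery: no-op */
    recover_waiters_post(&l);                  /* stale RESEND removes lkb */

    return receive_unlock_reply(&l, clear_resend);
}
```

In this model, `unlock_completes(false)` replays the failure: the unlock reply finds nothing on the waiters list and the lock stays granted. `unlock_completes(true)` clears the flag at removal time, so the second recovery leaves the lkb in place and the unlock reply completes normally.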
    Signed-off-by: David Teigland <teigland@redhat.com>
    Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
lock.c 90.8 KB