• Tariq Saeed's avatar
    ocfs2: fix BUG_ON() in ocfs2_ci_checkpointed() · 1918581a
    Tariq Saeed authored
    commit 3d46a44a upstream.
    
    PID: 614    TASK: ffff882a739da580  CPU: 3   COMMAND: "ocfs2dc"
      #0 [ffff882ecc3759b0] machine_kexec at ffffffff8103b35d
      #1 [ffff882ecc375a20] crash_kexec at ffffffff810b95b5
      #2 [ffff882ecc375af0] oops_end at ffffffff815091d8
      #3 [ffff882ecc375b20] die at ffffffff8101868b
      #4 [ffff882ecc375b50] do_trap at ffffffff81508bb0
      #5 [ffff882ecc375ba0] do_invalid_op at ffffffff810165e5
      #6 [ffff882ecc375c40] invalid_op at ffffffff815116fb
         [exception RIP: ocfs2_ci_checkpointed+208]
         RIP: ffffffffa0a7e940  RSP: ffff882ecc375cf0  RFLAGS: 00010002
         RAX: 0000000000000001  RBX: 000000000000654b  RCX: ffff8812dc83f1f8
         RDX: 00000000000017d9  RSI: ffff8812dc83f1f8  RDI: ffffffffa0b2c318
         RBP: ffff882ecc375d20   R8: ffff882ef6ecfa60   R9: ffff88301f272200
         R10: 0000000000000000  R11: 0000000000000000  R12: ffffffffffffffff
         R13: ffff8812dc83f4f0  R14: 0000000000000000  R15: ffff8812dc83f1f8
         ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
      #7 [ffff882ecc375d28] ocfs2_check_meta_downconvert at ffffffffa0a7edbd [ocfs2]
      #8 [ffff882ecc375d38] ocfs2_unblock_lock at ffffffffa0a84af8 [ocfs2]
      #9 [ffff882ecc375dc8] ocfs2_process_blocked_lock at ffffffffa0a85285 [ocfs2]
    assert is tripped because the tran is not checkpointed and the lock level is PR.
    
    Some time ago, chmod command had been executed. As result, the following call
    chain left the inode cluster lock in PR state, latter on causing the assert.
    system_call_fastpath
      -> my_chmod
       -> sys_chmod
        -> sys_fchmodat
         -> notify_change
          -> ocfs2_setattr
           -> posix_acl_chmod
            -> ocfs2_iop_set_acl
             -> ocfs2_set_acl
              -> ocfs2_acl_set_mode
    Here is how.
    1119 int ocfs2_setattr(struct dentry *dentry, struct iattr *attr)
    1120 {
    1247         ocfs2_inode_unlock(inode, 1); <<< WRONG thing to do.
    ..
    1258         if (!status && attr->ia_valid & ATTR_MODE) {
    1259                 status =  posix_acl_chmod(inode, inode->i_mode);
    
    519 posix_acl_chmod(struct inode *inode, umode_t mode)
    520 {
    ..
    539         ret = inode->i_op->set_acl(inode, acl, ACL_TYPE_ACCESS);
    
    287 int ocfs2_iop_set_acl(struct inode *inode, struct posix_acl *acl, ...
    288 {
    289         return ocfs2_set_acl(NULL, inode, NULL, type, acl, NULL, NULL);
    
    224 int ocfs2_set_acl(handle_t *handle,
    225                          struct inode *inode, ...
    231 {
    ..
    252                                 ret = ocfs2_acl_set_mode(inode, di_bh,
    253                                                          handle, mode);
    
    168 static int ocfs2_acl_set_mode(struct inode *inode, struct buffer_head ...
    170 {
    183         if (handle == NULL) {
                        >>> BUG: inode lock not held in ex at this point <<<
    184                 handle = ocfs2_start_trans(OCFS2_SB(inode->i_sb),
    185                                            OCFS2_INODE_UPDATE_CREDITS);
    
    ocfs2_setattr.#1247 we unlock and at #1259 call posix_acl_chmod. When we reach
    ocfs2_acl_set_mode.#181 and do trans, the inode cluster lock is not held in EX
    mode (it should be). How this could have happended?
    
    We are the lock master, were holding lock EX and have released it in
    ocfs2_setattr.#1247.  Note that there are no holders of this lock at
    this point.  Another node needs the lock in PR, and we downconvert from
    EX to PR.  So the inode lock is PR when do the trans in
    ocfs2_acl_set_mode.#184.  The trans stays in core (not flushed to disc).
    Now another node want the lock in EX, downconvert thread gets kicked
    (the one that tripped assert abovt), finds an unflushed trans but the
    lock is not EX (it is PR).  If the lock was at EX, it would have flushed
    the trans ocfs2_ci_checkpointed -> ocfs2_start_checkpoint before
    downconverting (to NULL) for the request.
    
    ocfs2_setattr must not drop inode lock ex in this code path.  If it
    does, takes it again before the trans, say in ocfs2_set_acl, another
    cluster node can get in between, execute another setattr, overwriting
    the one in progress on this node, resulting in a mode acl size combo
    that is a mix of the two.
    
    Orabug: 20189959
    Signed-off-by: default avatarTariq Saeed <tariq.x.saeed@oracle.com>
    Reviewed-by: default avatarMark Fasheh <mfasheh@suse.de>
    Cc: Joel Becker <jlbec@evilplan.org>
    Cc: Joseph Qi <joseph.qi@huawei.com>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
    1918581a
file.c 66.6 KB