    ocfs2: fix crash caused by stale lvb with fsdlm plugin · 4b93c1da
    Eric Ren authored
    commit e7ee2c08 upstream.
    
    The crash happens rather often when we reset some cluster nodes while
    nodes contend fiercely to do truncate and append.
    
    The crash backtrace is below:
    
       dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover_grant 1 locks on 971 resources
       dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover 9 generation 5 done: 4 ms
       ocfs2: Begin replay journal (node 318952601, slot 2) on device (253,18)
       ocfs2: End replay journal (node 318952601, slot 2) on device (253,18)
       ocfs2: Beginning quota recovery on device (253,18) for slot 2
       ocfs2: Finishing quota recovery on device (253,18) for slot 2
       (truncate,30154,1):ocfs2_truncate_file:470 ERROR: bug expression: le64_to_cpu(fe->i_size) != i_size_read(inode)
       (truncate,30154,1):ocfs2_truncate_file:470 ERROR: Inode 290321, inode i_size = 732 != di i_size = 937, i_flags = 0x1
       ------------[ cut here ]------------
       kernel BUG at /usr/src/linux/fs/ocfs2/file.c:470!
       invalid opcode: 0000 [#1] SMP
       Modules linked in: ocfs2_stack_user(OEN) ocfs2(OEN) ocfs2_nodemanager ocfs2_stackglue(OEN) quota_tree dlm(OEN) configfs fuse sd_mod    iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi af_packet iscsi_ibft iscsi_boot_sysfs softdog xfs libcrc32c ppdev parport_pc pcspkr parport      joydev virtio_balloon virtio_net i2c_piix4 acpi_cpufreq button processor ext4 crc16 jbd2 mbcache ata_generic cirrus virtio_blk ata_piix               drm_kms_helper ahci syscopyarea libahci sysfillrect sysimgblt fb_sys_fops ttm floppy libata drm virtio_pci virtio_ring uhci_hcd virtio ehci_hcd       usbcore serio_raw usb_common sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
       Supported: No, Unsupported modules are loaded
       CPU: 1 PID: 30154 Comm: truncate Tainted: G           OE   N  4.4.21-69-default #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20151112_172657-sheep25 04/01/2014
       task: ffff88004ff6d240 ti: ffff880074e68000 task.ti: ffff880074e68000
       RIP: 0010:[<ffffffffa05c8c30>]  [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
       RSP: 0018:ffff880074e6bd50  EFLAGS: 00010282
       RAX: 0000000000000074 RBX: 000000000000029e RCX: 0000000000000000
       RDX: 0000000000000001 RSI: 0000000000000246 RDI: 0000000000000246
       RBP: ffff880074e6bda8 R08: 000000003675dc7a R09: ffffffff82013414
       R10: 0000000000034c50 R11: 0000000000000000 R12: ffff88003aab3448
       R13: 00000000000002dc R14: 0000000000046e11 R15: 0000000000000020
       FS:  00007f839f965700(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
       CR2: 00007f839f97e000 CR3: 0000000036723000 CR4: 00000000000006e0
       Call Trace:
         ocfs2_setattr+0x698/0xa90 [ocfs2]
         notify_change+0x1ae/0x380
         do_truncate+0x5e/0x90
         do_sys_ftruncate.constprop.11+0x108/0x160
         entry_SYSCALL_64_fastpath+0x12/0x6d
       Code: 24 28 ba d6 01 00 00 48 c7 c6 30 43 62 a0 8b 41 2c 89 44 24 08 48 8b 41 20 48 c7 c1 78 a3 62 a0 48 89 04 24 31 c0 e8 a0 97 f9 ff <0f> 0b 3d 00 fe ff ff 0f 84 ab fd ff ff 83 f8 fc 0f 84 a2 fd ff
       RIP  [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
    
    This happens because ocfs2_inode_lock() hands us a stale LVB in which
    the i_size is not equal to the on-disk i_size.  We mistakenly trust the
    LVB because the underlying fsdlm dlm_lock() doesn't set
    DLM_SBF_VALNOTVALID in lkb_sbflags for us.  But why?
    
    The current code downconverts the lock without the DLM_LKF_VALBLK flag
    to tell o2cb not to update the RSB's LVB if it's a PR->NULL conversion,
    even if the lock resource type needs the LVB.  This is not the right way
    for fsdlm.
    
    The fsdlm plugin behaves differently with respect to DLM_LKF_VALBLK: it
    relies on the flag to decide whether we care about the LVB in the LKB.
    If DLM_LKF_VALBLK is not set, fsdlm skips this lkb when recovering the
    RSB's LVB and therefore fails to set DLM_SBF_VALNOTVALID appropriately
    when a node failure happens.
    
    The following diagram briefly illustrates how this crash happens:
    
    RSB1 is inode metadata lock resource with LOCK_TYPE_USES_LVB;
    
    The 1st round:
    
                 Node1                                    Node2
    RSB1: PR
                                                      RSB1(master): NULL->EX
    ocfs2_downconvert_lock(PR->NULL, set_lvb==0)
      ocfs2_dlm_lock(no DLM_LKF_VALBLK)
    
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    
    dlm_lock(no DLM_LKF_VALBLK)
      convert_lock(overwrite lkb->lkb_exflags
                   with no DLM_LKF_VALBLK)
    
    RSB1: NULL                                        RSB1: EX
                                                      reset Node2
    dlm_recover_rsbs()
      recover_lvb()
    
    /* The LVB is not trustworthy if the node holding EX fails and
     * no lock >= PR is left. We should set RSB_VALNOTVALID for RSB1.
     */

     if (!(lkb->lkb_exflags & DLM_LKF_VALBLK)) /* This means we miss the chance
               return;                          * to invalidate the LVB here.
                                                */
    
    The 2nd round:
    
             Node 1                                Node2
    RSB1(become master from recovery)
    
    ocfs2_setattr()
      ocfs2_inode_lock(NULL->EX)
        /* dlm_lock() returns the stale LVB without setting DLM_SBF_VALNOTVALID */
        ocfs2_meta_lvb_is_trustable() returns 1 /* so we don't refresh inode from disk */
      ocfs2_truncate_file()
          mlog_bug_on_msg(disk isize != i_size_read(inode))  /* crash! */
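
The two rounds above can be approximated by a small stand-alone model. This is only a sketch under stated assumptions, not the fs/dlm implementation: every type, constant and function name below (model_lkb, model_rsb, MODEL_LKF_VALBLK, model_recover_lvb, model_self_check) is hypothetical and merely mirrors the decision described in the diagram, namely that a surviving lock which did not request the LVB is ignored during recovery, so the "LVB not valid" mark is never set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for this sketch; not the real DLM definitions. */
enum model_mode { MODE_NL = 0, MODE_PR = 1, MODE_EX = 2 };
#define MODEL_LKF_VALBLK 0x1u        /* models DLM_LKF_VALBLK */

struct model_lkb {
    enum model_mode grmode;          /* granted mode of a surviving lock */
    unsigned int exflags;            /* flags of its last lock/convert request */
};

struct model_rsb {
    bool valnotvalid;                /* models RSB_VALNOTVALID */
};

/*
 * Simplified recovery step run on the surviving node. Locks that did not
 * ask for the LVB are skipped. If no lock asked for it, we return early
 * and the LVB is never marked invalid. Otherwise the LVB is marked invalid
 * when no surviving lock of mode >= PR could vouch for it.
 */
void model_recover_lvb(struct model_rsb *r, const struct model_lkb *locks,
                       size_t n)
{
    bool lock_lvb_exists = false;
    bool big_lock_exists = false;

    assert(r != NULL);
    for (size_t i = 0; i < n; i++) {
        if (!(locks[i].exflags & MODEL_LKF_VALBLK))
            continue;                /* this lock "does not care" about the LVB */
        lock_lvb_exists = true;
        if (locks[i].grmode >= MODE_PR)
            big_lock_exists = true;
    }
    if (!lock_lvb_exists)
        return;                      /* the missed chance from the diagram */
    if (!big_lock_exists)
        r->valnotvalid = true;
}

/* Node1 holds only a NULL lock after Node2 (the EX holder) is reset. */
void model_self_check(void)
{
    struct model_lkb nl_without_valblk = { MODE_NL, 0 };
    struct model_lkb nl_with_valblk = { MODE_NL, MODEL_LKF_VALBLK };
    struct model_rsb buggy = { false };
    struct model_rsb fixed = { false };

    model_recover_lvb(&buggy, &nl_without_valblk, 1);
    model_recover_lvb(&fixed, &nl_with_valblk, 1);
    assert(!buggy.valnotvalid);      /* stale LVB stays "valid" */
    assert(fixed.valnotvalid);       /* LVB is correctly invalidated */
}
```

In this model, the only difference between the crashing and the healthy case is whether the surviving NULL lock still carries the value-block flag from its last conversion.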
    
    The fix is quite straightforward.  We keep setting the DLM_LKF_VALBLK
    flag for dlm_lock() if the lock resource type needs the LVB and the
    fsdlm plugin is used.
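
The flag selection described by the fix can be sketched as follows. This is a hedged illustration and not the patch itself: the function model_downconvert_flags(), its parameters and the MODEL_* constants are hypothetical names chosen for this example. It assumes the pre-fix behaviour was to request the value block only when the caller asked for an LVB update (set_lvb), as in the PR->NULL case of the diagram.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values for this sketch; not the real DLM constants. */
#define MODEL_LKF_CONVERT 0x1u       /* models DLM_LKF_CONVERT */
#define MODEL_LKF_VALBLK  0x2u       /* models DLM_LKF_VALBLK */

/*
 * Choose the flags for a downconvert request.
 *   type_uses_lvb: the lock resource type needs an LVB
 *   set_lvb:       the caller wants the RSB's LVB updated
 *   fsdlm_plugin:  the fsdlm (userspace stack) plugin is in use, not o2cb
 */
unsigned int model_downconvert_flags(bool type_uses_lvb, bool set_lvb,
                                     bool fsdlm_plugin)
{
    unsigned int flags = MODEL_LKF_CONVERT;

    /* With fsdlm, always keep the value-block flag for LVB-using types. */
    if (fsdlm_plugin && type_uses_lvb)
        set_lvb = true;
    if (set_lvb)
        flags |= MODEL_LKF_VALBLK;
    return flags;
}

/* The PR->NULL downconvert from the diagram: set_lvb == 0. */
void model_self_check(void)
{
    /* fsdlm: the flag is kept, so recovery can invalidate the LVB later. */
    assert(model_downconvert_flags(true, false, true) & MODEL_LKF_VALBLK);
    /* o2cb: behaviour is unchanged, no value-block flag. */
    assert(!(model_downconvert_flags(true, false, false) & MODEL_LKF_VALBLK));
}
```

Keeping the o2cb path untouched in the sketch reflects the commit text, which limits the change to the case where the fsdlm plugin is used.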
    
    Link: http://lkml.kernel.org/r/1481275846-6604-1-git-send-email-zren@suse.com
    Signed-off-by: Eric Ren <zren@suse.com>
    Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
    Cc: Mark Fasheh <mfasheh@versity.com>
    Cc: Joel Becker <jlbec@evilplan.org>
    Cc: Junxiao Bi <junxiao.bi@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Willy Tarreau <w@1wt.eu>
stackglue.c 17 KB