1. 09 Dec, 2022 26 commits
    • Baokun Li's avatar
      ext4: fix corruption when online resizing a 1K bigalloc fs · 0aeaa255
      Baokun Li authored
      When a backup superblock is updated in update_backups(), the primary
      superblock's offset in the group (that is, sbi->s_sbh->b_blocknr) is used
      as the backup superblock's offset in its group. However, when the block
      size is 1K and bigalloc is enabled, the two offsets are not equal. This
      causes the backup group descriptors to be overwritten by the superblock
      in update_backups(). Moreover, if meta_bg is enabled, the file system will
      be corrupted because this feature uses backup group descriptors.
      
      To solve this issue, we use a more accurate ext4_group_first_block_no() as
      the offset of the backup superblock in its group.
      
      Fixes: d77147ff ("ext4: add support for online resizing with bigalloc")
      Signed-off-by: default avatarBaokun Li <libaokun1@huawei.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: stable@kernel.org
      Link: https://lore.kernel.org/r/20221117040341.1380702-4-libaokun1@huawei.comSigned-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      0aeaa255
    • Baokun Li's avatar
      ext4: fix corrupt backup group descriptors after online resize · 8f49ec60
      Baokun Li authored
      In commit 9a8c5b0d ("ext4: update the backup superblock's at the end
      of the online resize"), it is assumed that update_backups() only updates
      backup superblocks, so each b_data is treated as a backupsuper block to
      update its s_block_group_nr and s_checksum. However, update_backups()
      also updates the backup group descriptors, which causes the backup group
      descriptors to be corrupted.
      
      The above commit fixes the problem of invalid checksum of the backup
      superblock. The root cause of this problem is that the checksum of
      ext4_update_super() is not set correctly. This problem has been fixed
      in the previous patch ("ext4: fix bad checksum after online resize").
      
      However, we do need to set block_group_nr for the backup superblock in
      update_backups(). When a block is in a group that contains a backup
      superblock, and the block is the first block in the group, the block is
      definitely a superblock. We add a helper function that includes setting
      s_block_group_nr and updating checksum, and then call it only when the
      above conditions are met to prevent the backup group descriptors from
      being incorrectly modified.
      
      Fixes: 9a8c5b0d ("ext4: update the backup superblock's at the end of the online resize")
      Signed-off-by: default avatarBaokun Li <libaokun1@huawei.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: stable@kernel.org
      Link: https://lore.kernel.org/r/20221117040341.1380702-3-libaokun1@huawei.comSigned-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      8f49ec60
    • Baokun Li's avatar
      ext4: fix bad checksum after online resize · a408f33e
      Baokun Li authored
      When online resizing is performed twice consecutively, the error message
      "Superblock checksum does not match superblock" is displayed for the
      second time. Here's the reproducer:
      
      	mkfs.ext4 -F /dev/sdb 100M
      	mount /dev/sdb /tmp/test
      	resize2fs /dev/sdb 5G
      	resize2fs /dev/sdb 6G
      
      To solve this issue, we moved the update of the checksum after the
      es->s_overhead_clusters is updated.
      
      Fixes: 026d0d27 ("ext4: reduce computation of overhead during resize")
      Fixes: de394a86 ("ext4: update s_overhead_clusters in the superblock during an on-line resize")
      Signed-off-by: default avatarBaokun Li <libaokun1@huawei.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: stable@kernel.org
      Link: https://lore.kernel.org/r/20221117040341.1380702-2-libaokun1@huawei.comSigned-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      a408f33e
    • Darrick J. Wong's avatar
      ext4: don't fail GETFSUUID when the caller provides a long buffer · a7e9d977
      Darrick J. Wong authored
      If userspace provides a longer UUID buffer than is required, we
      shouldn't fail the call with EINVAL -- rather, we can fill the caller's
      buffer with the bytes we /can/ fill, and update the length field to
      reflect what we copied.  This doesn't break the UAPI since we're
      enabling a case that currently fails, and so far Ted hasn't released a
      version of e2fsprogs that uses the new ext4 ioctl.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarCatherine Hoang <catherine.hoang@oracle.com>
      Link: https://lore.kernel.org/r/166811139478.327006.13879198441587445544.stgit@magnoliaSigned-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      a7e9d977
    • Darrick J. Wong's avatar
      ext4: dont return EINVAL from GETFSUUID when reporting UUID length · b76abb51
      Darrick J. Wong authored
      If userspace calls this ioctl with fsu_length (the length of the
      fsuuid.fsu_uuid array) set to zero, ext4 copies the desired uuid length
      out to userspace.  The kernel call returned a result from a valid input,
      so the return value here should be zero, not EINVAL.
      
      While we're at it, fix the copy_to_user call to make it clear that we're
      only copying out fsu_len.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarCatherine Hoang <catherine.hoang@oracle.com>
      Link: https://lore.kernel.org/r/166811138914.327006.9241306894437166566.stgit@magnoliaSigned-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      b76abb51
    • Luís Henriques's avatar
      ext4: fix error code return to user-space in ext4_get_branch() · 26d75a16
      Luís Henriques authored
      If a block is out of range in ext4_get_branch(), -ENOMEM will be returned
      to user-space.  Obviously, this error code isn't really useful.  This
      patch fixes it by making sure the right error code (-EFSCORRUPTED) is
      propagated to user-space.  EUCLEAN is more informative than ENOMEM.
      Signed-off-by: default avatarLuís Henriques <lhenriques@suse.de>
      Link: https://lore.kernel.org/r/20221109181445.17843-1-lhenriques@suse.deSigned-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      26d75a16
    • JunChao Sun's avatar
      ext4: replace kmem_cache_create with KMEM_CACHE · 060f7739
      JunChao Sun authored
      Replace kmem_cache_create with KMEM_CACHE macro that
      guaranteed struct alignment
      Signed-off-by: default avatarJunChao Sun <sunjunchao2870@gmail.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20221109153822.80250-1-sunjunchao2870@gmail.comSigned-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      060f7739
    • Baokun Li's avatar
      ext4: correct inconsistent error msg in nojournal mode · 89481b5f
      Baokun Li authored
      When we used the journal_async_commit mounting option in nojournal mode,
      the kernel told me that "can't mount with journal_checksum", was very
      confusing. I find that when we mount with journal_async_commit, both the
      JOURNAL_ASYNC_COMMIT and EXPLICIT_JOURNAL_CHECKSUM flags are set. However,
      in the error branch, CHECKSUM is checked before ASYNC_COMMIT. As a result,
      the above inconsistency occurs, and the ASYNC_COMMIT branch becomes dead
      code that cannot be executed. Therefore, we exchange the positions of the
      two judgments to make the error msg more accurate.
      Signed-off-by: default avatarBaokun Li <libaokun1@huawei.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20221109074343.4184862-1-libaokun1@huawei.comSigned-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      89481b5f
    • Lukas Czerner's avatar
      ext4: print file system UUID on mount, remount and unmount · bb0fbc78
      Lukas Czerner authored
      The device names are not necessarily consistent across reboots which can
      make it more difficult to identify the right file system when tracking
      down issues using system logs.
      
      Print file system UUID string on every mount, remount and unmount to
      make this task easier.
      
      This is similar to the functionality recently propsed for XFS.
      Signed-off-by: default avatarLukas Czerner <lczerner@redhat.com>
      Cc: Lukas Herbolt <lukas@herbolt.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Link: https://lore.kernel.org/r/20221108145042.85770-1-lczerner@redhat.comSigned-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      bb0fbc78
    • Ye Bin's avatar
      ext4: init quota for 'old.inode' in 'ext4_rename' · fae381a3
      Ye Bin authored
      Syzbot found the following issue:
      ext4_parse_param: s_want_extra_isize=128
      ext4_inode_info_init: s_want_extra_isize=32
      ext4_rename: old.inode=ffff88823869a2c8 old.dir=ffff888238699828 new.inode=ffff88823869d7e8 new.dir=ffff888238699828
      __ext4_mark_inode_dirty: inode=ffff888238699828 ea_isize=32 want_ea_size=128
      __ext4_mark_inode_dirty: inode=ffff88823869a2c8 ea_isize=32 want_ea_size=128
      ext4_xattr_block_set: inode=ffff88823869a2c8
      ------------[ cut here ]------------
      WARNING: CPU: 13 PID: 2234 at fs/ext4/xattr.c:2070 ext4_xattr_block_set.cold+0x22/0x980
      Modules linked in:
      RIP: 0010:ext4_xattr_block_set.cold+0x22/0x980
      RSP: 0018:ffff888227d3f3b0 EFLAGS: 00010202
      RAX: 0000000000000001 RBX: ffff88823007a000 RCX: 0000000000000000
      RDX: 0000000000000a03 RSI: 0000000000000040 RDI: ffff888230078178
      RBP: 0000000000000000 R08: 000000000000002c R09: ffffed1075c7df8e
      R10: ffff8883ae3efc6b R11: ffffed1075c7df8d R12: 0000000000000000
      R13: ffff88823869a2c8 R14: ffff8881012e0460 R15: dffffc0000000000
      FS:  00007f350ac1f740(0000) GS:ffff8883ae200000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f350a6ed6a0 CR3: 0000000237456000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       ? ext4_xattr_set_entry+0x3b7/0x2320
       ? ext4_xattr_block_set+0x0/0x2020
       ? ext4_xattr_set_entry+0x0/0x2320
       ? ext4_xattr_check_entries+0x77/0x310
       ? ext4_xattr_ibody_set+0x23b/0x340
       ext4_xattr_move_to_block+0x594/0x720
       ext4_expand_extra_isize_ea+0x59a/0x10f0
       __ext4_expand_extra_isize+0x278/0x3f0
       __ext4_mark_inode_dirty.cold+0x347/0x410
       ext4_rename+0xed3/0x174f
       vfs_rename+0x13a7/0x2510
       do_renameat2+0x55d/0x920
       __x64_sys_rename+0x7d/0xb0
       do_syscall_64+0x3b/0xa0
       entry_SYSCALL_64_after_hwframe+0x72/0xdc
      
      As 'ext4_rename' will modify 'old.inode' ctime and mark inode dirty,
      which may trigger expand 'extra_isize' and allocate block. If inode
      didn't init quota will lead to warning.  To solve above issue, init
      'old.inode' firstly in 'ext4_rename'.
      
      Reported-by: syzbot+98346927678ac3059c77@syzkaller.appspotmail.com
      Signed-off-by: default avatarYe Bin <yebin10@huawei.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20221107015335.2524319-1-yebin@huaweicloud.comSigned-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      fae381a3
    • Eric Biggers's avatar
      ext4: simplify fast-commit CRC calculation · 8805dbcb
      Eric Biggers authored
      Instead of checksumming each field as it is added to the block, just
      checksum each block before it is written.  This is simpler, and also
      much more efficient.
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Link: https://lore.kernel.org/r/20221106224841.279231-8-ebiggers@kernel.orgSigned-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      8805dbcb
    • Eric Biggers's avatar
      ext4: fix off-by-one errors in fast-commit block filling · 48a6a66d
      Eric Biggers authored
      Due to several different off-by-one errors, or perhaps due to a late
      change in design that wasn't fully reflected in the code that was
      actually merged, there are several very strange constraints on how
      fast-commit blocks are filled with tlv entries:
      
      - tlvs must start at least 10 bytes before the end of the block, even
        though the minimum tlv length is 8.  Otherwise, the replay code will
        ignore them.  (BUG: ext4_fc_reserve_space() could violate this
        requirement if called with a len of blocksize - 9 or blocksize - 8.
        Fortunately, this doesn't seem to happen currently.)
      
      - tlvs must end at least 1 byte before the end of the block.  Otherwise
        the replay code will consider them to be invalid.  This quirk
        contributed to a bug (fixed by an earlier commit) where uninitialized
        memory was being leaked to disk in the last byte of blocks.
      
      Also, strangely these constraints don't apply to the replay code in
      e2fsprogs, which will accept any tlvs in the blocks (with no bounds
      checks at all, but that is a separate issue...).
      
      Given that this all seems to be a bug, let's fix it by just filling
      blocks with tlv entries in the natural way.
      
      Note that old kernels will be unable to replay fast-commit journals
      created by kernels that have this commit.
      
      Fixes: aa75f4d3 ("ext4: main fast-commit commit path")
      Cc: <stable@vger.kernel.org> # v5.10+
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Link: https://lore.kernel.org/r/20221106224841.279231-7-ebiggers@kernel.orgSigned-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      48a6a66d
    • Eric Biggers's avatar
      ext4: fix unaligned memory access in ext4_fc_reserve_space() · 8415ce07
      Eric Biggers authored
      As is done elsewhere in the file, build the struct ext4_fc_tl on the
      stack and memcpy() it into the buffer, rather than directly writing it
      to a potentially-unaligned location in the buffer.
      
      Fixes: aa75f4d3 ("ext4: main fast-commit commit path")
      Cc: <stable@vger.kernel.org> # v5.10+
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Link: https://lore.kernel.org/r/20221106224841.279231-6-ebiggers@kernel.orgSigned-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      8415ce07
    • Eric Biggers's avatar
      ext4: add missing validation of fast-commit record lengths · 64b4a25c
      Eric Biggers authored
      Validate the inode and filename lengths in fast-commit journal records
      so that a malicious fast-commit journal cannot cause a crash by having
      invalid values for these.  Also validate EXT4_FC_TAG_DEL_RANGE.
      
      Fixes: aa75f4d3 ("ext4: main fast-commit commit path")
      Cc: <stable@vger.kernel.org> # v5.10+
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Link: https://lore.kernel.org/r/20221106224841.279231-5-ebiggers@kernel.orgSigned-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      64b4a25c
    • Eric Biggers's avatar
      ext4: fix leaking uninitialized memory in fast-commit journal · 594bc43b
      Eric Biggers authored
      When space at the end of fast-commit journal blocks is unused, make sure
      to zero it out so that uninitialized memory is not leaked to disk.
      
      Fixes: aa75f4d3 ("ext4: main fast-commit commit path")
      Cc: <stable@vger.kernel.org> # v5.10+
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Link: https://lore.kernel.org/r/20221106224841.279231-4-ebiggers@kernel.orgSigned-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      594bc43b
    • Eric Biggers's avatar
      ext4: don't set up encryption key during jbd2 transaction · 4c0d5778
      Eric Biggers authored
      Commit a80f7fcf ("ext4: fixup ext4_fc_track_* functions' signature")
      extended the scope of the transaction in ext4_unlink() too far, making
      it include the call to ext4_find_entry().  However, ext4_find_entry()
      can deadlock when called from within a transaction because it may need
      to set up the directory's encryption key.
      
      Fix this by restoring the transaction to its original scope.
      
      Reported-by: syzbot+1a748d0007eeac3ab079@syzkaller.appspotmail.com
      Fixes: a80f7fcf ("ext4: fixup ext4_fc_track_* functions' signature")
      Cc: <stable@vger.kernel.org> # v5.10+
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Link: https://lore.kernel.org/r/20221106224841.279231-3-ebiggers@kernel.orgSigned-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      4c0d5778
    • Eric Biggers's avatar
      ext4: disable fast-commit of encrypted dir operations · 0fbcb525
      Eric Biggers authored
      fast-commit of create, link, and unlink operations in encrypted
      directories is completely broken because the unencrypted filenames are
      being written to the fast-commit journal instead of the encrypted
      filenames.  These operations can't be replayed, as encryption keys
      aren't present at journal replay time.  It is also an information leak.
      
      Until if/when we can get this working properly, make encrypted directory
      operations ineligible for fast-commit.
      
      Note that fast-commit operations on encrypted regular files continue to
      be allowed, as they seem to work.
      
      Fixes: aa75f4d3 ("ext4: main fast-commit commit path")
      Cc: <stable@vger.kernel.org> # v5.10+
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Link: https://lore.kernel.org/r/20221106224841.279231-2-ebiggers@kernel.orgSigned-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      0fbcb525
    • Baokun Li's avatar
      ext4: fix use-after-free in ext4_orphan_cleanup · a71248b1
      Baokun Li authored
      I caught a issue as follows:
      ==================================================================
       BUG: KASAN: use-after-free in __list_add_valid+0x28/0x1a0
       Read of size 8 at addr ffff88814b13f378 by task mount/710
      
       CPU: 1 PID: 710 Comm: mount Not tainted 6.1.0-rc3-next #370
       Call Trace:
        <TASK>
        dump_stack_lvl+0x73/0x9f
        print_report+0x25d/0x759
        kasan_report+0xc0/0x120
        __asan_load8+0x99/0x140
        __list_add_valid+0x28/0x1a0
        ext4_orphan_cleanup+0x564/0x9d0 [ext4]
        __ext4_fill_super+0x48e2/0x5300 [ext4]
        ext4_fill_super+0x19f/0x3a0 [ext4]
        get_tree_bdev+0x27b/0x450
        ext4_get_tree+0x19/0x30 [ext4]
        vfs_get_tree+0x49/0x150
        path_mount+0xaae/0x1350
        do_mount+0xe2/0x110
        __x64_sys_mount+0xf0/0x190
        do_syscall_64+0x35/0x80
        entry_SYSCALL_64_after_hwframe+0x63/0xcd
        </TASK>
       [...]
      ==================================================================
      
      Above issue may happen as follows:
      -------------------------------------
      ext4_fill_super
        ext4_orphan_cleanup
         --- loop1: assume last_orphan is 12 ---
          list_add(&EXT4_I(inode)->i_orphan, &EXT4_SB(sb)->s_orphan)
          ext4_truncate --> return 0
            ext4_inode_attach_jinode --> return -ENOMEM
          iput(inode) --> free inode<12>
         --- loop2: last_orphan is still 12 ---
          list_add(&EXT4_I(inode)->i_orphan, &EXT4_SB(sb)->s_orphan);
          // use inode<12> and trigger UAF
      
      To solve this issue, we need to propagate the return value of
      ext4_inode_attach_jinode() appropriately.
      Signed-off-by: default avatarBaokun Li <libaokun1@huawei.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20221102080633.1630225-1-libaokun1@huawei.comSigned-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      a71248b1
    • Eric Biggers's avatar
      ext4: don't allow journal inode to have encrypt flag · 105c78e1
      Eric Biggers authored
      Mounting a filesystem whose journal inode has the encrypt flag causes a
      NULL dereference in fscrypt_limit_io_blocks() when the 'inlinecrypt'
      mount option is used.
      
      The problem is that when jbd2_journal_init_inode() calls bmap(), it
      eventually finds its way into ext4_iomap_begin(), which calls
      fscrypt_limit_io_blocks().  fscrypt_limit_io_blocks() requires that if
      the inode is encrypted, then its encryption key must already be set up.
      That's not the case here, since the journal inode is never "opened" like
      a normal file would be.  Hence the crash.
      
      A reproducer is:
      
          mkfs.ext4 -F /dev/vdb
          debugfs -w /dev/vdb -R "set_inode_field <8> flags 0x80808"
          mount /dev/vdb /mnt -o inlinecrypt
      
      To fix this, make ext4 consider journal inodes with the encrypt flag to
      be invalid.  (Note, maybe other flags should be rejected on the journal
      inode too.  For now, this is just the minimal fix for the above issue.)
      
      I've marked this as fixing the commit that introduced the call to
      fscrypt_limit_io_blocks(), since that's what made an actual crash start
      being possible.  But this fix could be applied to any version of ext4
      that supports the encrypt feature.
      
      Reported-by: syzbot+ba9dac45bc76c490b7c3@syzkaller.appspotmail.com
      Fixes: 38ea50da ("ext4: support direct I/O with fscrypt using blk-crypto")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Link: https://lore.kernel.org/r/20221102053312.189962-1-ebiggers@kernel.orgSigned-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      105c78e1
    • Gaosheng Cui's avatar
      ext4: fix undefined behavior in bit shift for ext4_check_flag_values · 3bf678a0
      Gaosheng Cui authored
      Shifting signed 32-bit value by 31 bits is undefined, so changing
      significant bit to unsigned. The UBSAN warning calltrace like below:
      
      UBSAN: shift-out-of-bounds in fs/ext4/ext4.h:591:2
      left shift of 1 by 31 places cannot be represented in type 'int'
      Call Trace:
       <TASK>
       dump_stack_lvl+0x7d/0xa5
       dump_stack+0x15/0x1b
       ubsan_epilogue+0xe/0x4e
       __ubsan_handle_shift_out_of_bounds+0x1e7/0x20c
       ext4_init_fs+0x5a/0x277
       do_one_initcall+0x76/0x430
       kernel_init_freeable+0x3b3/0x422
       kernel_init+0x24/0x1e0
       ret_from_fork+0x1f/0x30
       </TASK>
      
      Fixes: 9a4c8019 ("ext4: ensure Inode flags consistency are checked at build time")
      Signed-off-by: default avatarGaosheng Cui <cuigaosheng1@huawei.com>
      Link: https://lore.kernel.org/r/20221031055833.3966222-1-cuigaosheng1@huawei.comSigned-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      3bf678a0
    • Baokun Li's avatar
      ext4: fix bug_on in __es_tree_search caused by bad boot loader inode · 991ed014
      Baokun Li authored
      We got a issue as fllows:
      ==================================================================
       kernel BUG at fs/ext4/extents_status.c:203!
       invalid opcode: 0000 [#1] PREEMPT SMP
       CPU: 1 PID: 945 Comm: cat Not tainted 6.0.0-next-20221007-dirty #349
       RIP: 0010:ext4_es_end.isra.0+0x34/0x42
       RSP: 0018:ffffc9000143b768 EFLAGS: 00010203
       RAX: 0000000000000000 RBX: ffff8881769cd0b8 RCX: 0000000000000000
       RDX: 0000000000000000 RSI: ffffffff8fc27cf7 RDI: 00000000ffffffff
       RBP: ffff8881769cd0bc R08: 0000000000000000 R09: ffffc9000143b5f8
       R10: 0000000000000001 R11: 0000000000000001 R12: ffff8881769cd0a0
       R13: ffff8881768e5668 R14: 00000000768e52f0 R15: 0000000000000000
       FS: 00007f359f7f05c0(0000)GS:ffff88842fd00000(0000)knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f359f5a2000 CR3: 000000017130c000 CR4: 00000000000006e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        <TASK>
        __es_tree_search.isra.0+0x6d/0xf5
        ext4_es_cache_extent+0xfa/0x230
        ext4_cache_extents+0xd2/0x110
        ext4_find_extent+0x5d5/0x8c0
        ext4_ext_map_blocks+0x9c/0x1d30
        ext4_map_blocks+0x431/0xa50
        ext4_mpage_readpages+0x48e/0xe40
        ext4_readahead+0x47/0x50
        read_pages+0x82/0x530
        page_cache_ra_unbounded+0x199/0x2a0
        do_page_cache_ra+0x47/0x70
        page_cache_ra_order+0x242/0x400
        ondemand_readahead+0x1e8/0x4b0
        page_cache_sync_ra+0xf4/0x110
        filemap_get_pages+0x131/0xb20
        filemap_read+0xda/0x4b0
        generic_file_read_iter+0x13a/0x250
        ext4_file_read_iter+0x59/0x1d0
        vfs_read+0x28f/0x460
        ksys_read+0x73/0x160
        __x64_sys_read+0x1e/0x30
        do_syscall_64+0x35/0x80
        entry_SYSCALL_64_after_hwframe+0x63/0xcd
        </TASK>
      ==================================================================
      
      In the above issue, ioctl invokes the swap_inode_boot_loader function to
      swap inode<5> and inode<12>. However, inode<5> contain incorrect imode and
      disordered extents, and i_nlink is set to 1. The extents check for inode in
      the ext4_iget function can be bypassed bacause 5 is EXT4_BOOT_LOADER_INO.
      While links_count is set to 1, the extents are not initialized in
      swap_inode_boot_loader. After the ioctl command is executed successfully,
      the extents are swapped to inode<12>, in this case, run the `cat` command
      to view inode<12>. And Bug_ON is triggered due to the incorrect extents.
      
      When the boot loader inode is not initialized, its imode can be one of the
      following:
      1) the imode is a bad type, which is marked as bad_inode in ext4_iget and
         set to S_IFREG.
      2) the imode is good type but not S_IFREG.
      3) the imode is S_IFREG.
      
      The BUG_ON may be triggered by bypassing the check in cases 1 and 2.
      Therefore, when the boot loader inode is bad_inode or its imode is not
      S_IFREG, initialize the inode to avoid triggering the BUG.
      Signed-off-by: default avatarBaokun Li <libaokun1@huawei.com>
      Reviewed-by: default avatarJason Yan <yanaijie@huawei.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20221026042310.3839669-5-libaokun1@huawei.comSigned-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      991ed014
    • Baokun Li's avatar
      ext4: add EXT4_IGET_BAD flag to prevent unexpected bad inode · 63b1e9bc
      Baokun Li authored
      There are many places that will get unhappy (and crash) when ext4_iget()
      returns a bad inode. However, if iget the boot loader inode, allows a bad
      inode to be returned, because the inode may not be initialized. This
      mechanism can be used to bypass some checks and cause panic. To solve this
      problem, we add a special iget flag EXT4_IGET_BAD. Only with this flag
      we'd be returning bad inode from ext4_iget(), otherwise we always return
      the error code if the inode is bad inode.(suggested by Jan Kara)
      Signed-off-by: default avatarBaokun Li <libaokun1@huawei.com>
      Reviewed-by: default avatarJason Yan <yanaijie@huawei.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20221026042310.3839669-4-libaokun1@huawei.comSigned-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      63b1e9bc
    • Baokun Li's avatar
      ext4: add helper to check quota inums · 07342ec2
      Baokun Li authored
      Before quota is enabled, a check on the preset quota inums in
      ext4_super_block is added to prevent wrong quota inodes from being loaded.
      In addition, when the quota fails to be enabled, the quota type and quota
      inum are printed to facilitate fault locating.
      Signed-off-by: default avatarBaokun Li <libaokun1@huawei.com>
      Reviewed-by: default avatarJason Yan <yanaijie@huawei.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20221026042310.3839669-3-libaokun1@huawei.comSigned-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      07342ec2
    • Baokun Li's avatar
      ext4: fix bug_on in __es_tree_search caused by bad quota inode · d3238774
      Baokun Li authored
      We got a issue as fllows:
      ==================================================================
       kernel BUG at fs/ext4/extents_status.c:202!
       invalid opcode: 0000 [#1] PREEMPT SMP
       CPU: 1 PID: 810 Comm: mount Not tainted 6.1.0-rc1-next-g9631525255e3 #352
       RIP: 0010:__es_tree_search.isra.0+0xb8/0xe0
       RSP: 0018:ffffc90001227900 EFLAGS: 00010202
       RAX: 0000000000000000 RBX: 0000000077512a0f RCX: 0000000000000000
       RDX: 0000000000000002 RSI: 0000000000002a10 RDI: ffff8881004cd0c8
       RBP: ffff888177512ac8 R08: 47ffffffffffffff R09: 0000000000000001
       R10: 0000000000000001 R11: 00000000000679af R12: 0000000000002a10
       R13: ffff888177512d88 R14: 0000000077512a10 R15: 0000000000000000
       FS: 00007f4bd76dbc40(0000)GS:ffff88842fd00000(0000)knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00005653bf993cf8 CR3: 000000017bfdf000 CR4: 00000000000006e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        <TASK>
        ext4_es_cache_extent+0xe2/0x210
        ext4_cache_extents+0xd2/0x110
        ext4_find_extent+0x5d5/0x8c0
        ext4_ext_map_blocks+0x9c/0x1d30
        ext4_map_blocks+0x431/0xa50
        ext4_getblk+0x82/0x340
        ext4_bread+0x14/0x110
        ext4_quota_read+0xf0/0x180
        v2_read_header+0x24/0x90
        v2_check_quota_file+0x2f/0xa0
        dquot_load_quota_sb+0x26c/0x760
        dquot_load_quota_inode+0xa5/0x190
        ext4_enable_quotas+0x14c/0x300
        __ext4_fill_super+0x31cc/0x32c0
        ext4_fill_super+0x115/0x2d0
        get_tree_bdev+0x1d2/0x360
        ext4_get_tree+0x19/0x30
        vfs_get_tree+0x26/0xe0
        path_mount+0x81d/0xfc0
        do_mount+0x8d/0xc0
        __x64_sys_mount+0xc0/0x160
        do_syscall_64+0x35/0x80
        entry_SYSCALL_64_after_hwframe+0x63/0xcd
        </TASK>
      ==================================================================
      
      Above issue may happen as follows:
      -------------------------------------
      ext4_fill_super
       ext4_orphan_cleanup
        ext4_enable_quotas
         ext4_quota_enable
          ext4_iget --> get error inode <5>
           ext4_ext_check_inode --> Wrong imode makes it escape inspection
           make_bad_inode(inode) --> EXT4_BOOT_LOADER_INO set imode
          dquot_load_quota_inode
           vfs_setup_quota_inode --> check pass
           dquot_load_quota_sb
            v2_check_quota_file
             v2_read_header
              ext4_quota_read
               ext4_bread
                ext4_getblk
                 ext4_map_blocks
                  ext4_ext_map_blocks
                   ext4_find_extent
                    ext4_cache_extents
                     ext4_es_cache_extent
                      __es_tree_search.isra.0
                       ext4_es_end --> Wrong extents trigger BUG_ON
      
      In the above issue, s_usr_quota_inum is set to 5, but inode<5> contains
      incorrect imode and disordered extents. Because 5 is EXT4_BOOT_LOADER_INO,
      the ext4_ext_check_inode check in the ext4_iget function can be bypassed,
      finally, the extents that are not checked trigger the BUG_ON in the
      __es_tree_search function. To solve this issue, check whether the inode is
      bad_inode in vfs_setup_quota_inode().
      Signed-off-by: default avatarBaokun Li <libaokun1@huawei.com>
      Reviewed-by: default avatarChaitanya Kulkarni <kch@nvidia.com>
      Reviewed-by: default avatarJason Yan <yanaijie@huawei.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20221026042310.3839669-2-libaokun1@huawei.comSigned-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      d3238774
    • Luís Henriques's avatar
      ext4: remove trailing newline from ext4_msg() message · 78742d4d
      Luís Henriques authored
      The ext4_msg() function adds a new line to the message.  Remove extra '\n'
      from call to ext4_msg() in ext4_orphan_cleanup().
      Signed-off-by: default avatarLuís Henriques <lhenriques@suse.de>
      Link: https://lore.kernel.org/r/20221011155758.15287-1-lhenriques@suse.deSigned-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      78742d4d
    • Bixuan Cui's avatar
      jbd2: use the correct print format · d87a7b4c
      Bixuan Cui authored
      The print format error was found when using ftrace event:
          <...>-1406 [000] .... 23599442.895823: jbd2_end_commit: dev 252,8 transaction -1866216965 sync 0 head -1866217368
          <...>-1406 [000] .... 23599442.896299: jbd2_start_commit: dev 252,8 transaction -1866216964 sync 0
      
      Use the correct print format for transaction, head and tid.
      
      Fixes: 879c5e6b ('jbd2: convert instrumentation from markers to tracepoints')
      Signed-off-by: default avatarBixuan Cui <cuibixuan@linux.alibaba.com>
      Reviewed-by: default avatarJason Yan <yanaijie@huawei.com>
      Link: https://lore.kernel.org/r/1665488024-95172-1-git-send-email-cuibixuan@linux.alibaba.comSigned-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      d87a7b4c
  2. 01 Dec, 2022 5 commits
  3. 29 Nov, 2022 2 commits
  4. 28 Nov, 2022 1 commit
    • Zhang Yi's avatar
      ext4: silence the warning when evicting inode with dioread_nolock · bc12ac98
      Zhang Yi authored
      When evicting an inode with default dioread_nolock, it could be raced by
      the unwritten extents converting kworker after writeback some new
      allocated dirty blocks. It convert unwritten extents to written, the
      extents could be merged to upper level and free extent blocks, so it
      could mark the inode dirty again even this inode has been marked
      I_FREEING. But the inode->i_io_list check and warning in
      ext4_evict_inode() missing this corner case. Fortunately,
      ext4_evict_inode() will wait all extents converting finished before this
      check, so it will not lead to inode use-after-free problem, every thing
      is OK besides this warning. The WARN_ON_ONCE was originally designed
      for finding inode use-after-free issues in advance, but if we add
      current dioread_nolock case in, it will become not quite useful, so fix
      this warning by just remove this check.
      
       ======
       WARNING: CPU: 7 PID: 1092 at fs/ext4/inode.c:227
       ext4_evict_inode+0x875/0xc60
       ...
       RIP: 0010:ext4_evict_inode+0x875/0xc60
       ...
       Call Trace:
        <TASK>
        evict+0x11c/0x2b0
        iput+0x236/0x3a0
        do_unlinkat+0x1b4/0x490
        __x64_sys_unlinkat+0x4c/0xb0
        do_syscall_64+0x3b/0x90
        entry_SYSCALL_64_after_hwframe+0x46/0xb0
       RIP: 0033:0x7fa933c1115b
       ======
      
      rm                          kworker
                                  ext4_end_io_end()
      vfs_unlink()
       ext4_unlink()
                                   ext4_convert_unwritten_io_end_vec()
                                    ext4_convert_unwritten_extents()
                                     ext4_map_blocks()
                                      ext4_ext_map_blocks()
                                       ext4_ext_try_to_merge_up()
                                        __mark_inode_dirty()
                                         check !I_FREEING
                                         locked_inode_to_wb_and_lock_list()
       iput()
        iput_final()
         evict()
          ext4_evict_inode()
           truncate_inode_pages_final() //wait release io_end
                                          inode_io_list_move_locked()
                                   ext4_release_io_end()
           trigger WARN_ON_ONCE()
      
      Cc: stable@kernel.org
      Fixes: ceff86fd ("ext4: Avoid freeing inodes on dirty list")
      Signed-off-by: default avatarZhang Yi <yi.zhang@huawei.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20220629112647.4141034-1-yi.zhang@huawei.comSigned-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      bc12ac98
  5. 22 Nov, 2022 1 commit
  6. 07 Nov, 2022 1 commit
    • Baokun Li's avatar
      ext4: fix use-after-free in ext4_ext_shift_extents · f6b1a1cf
      Baokun Li authored
      If the starting position of our insert range happens to be in the hole
      between the two ext4_extent_idx, because the lblk of the ext4_extent in
      the previous ext4_extent_idx is always less than the start, which leads
      to the "extent" variable access across the boundary, the following UAF is
      triggered:
      ==================================================================
      BUG: KASAN: use-after-free in ext4_ext_shift_extents+0x257/0x790
      Read of size 4 at addr ffff88819807a008 by task fallocate/8010
      CPU: 3 PID: 8010 Comm: fallocate Tainted: G            E     5.10.0+ #492
      Call Trace:
       dump_stack+0x7d/0xa3
       print_address_description.constprop.0+0x1e/0x220
       kasan_report.cold+0x67/0x7f
       ext4_ext_shift_extents+0x257/0x790
       ext4_insert_range+0x5b6/0x700
       ext4_fallocate+0x39e/0x3d0
       vfs_fallocate+0x26f/0x470
       ksys_fallocate+0x3a/0x70
       __x64_sys_fallocate+0x4f/0x60
       do_syscall_64+0x33/0x40
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      ==================================================================
      
      For right shifts, we can divide them into the following situations:
      
      1. When the first ee_block of ext4_extent_idx is greater than or equal to
         start, make right shifts directly from the first ee_block.
          1) If it is greater than start, we need to continue searching in the
             previous ext4_extent_idx.
          2) If it is equal to start, we can exit the loop (iterator=NULL).
      
      2. When the first ee_block of ext4_extent_idx is less than start, then
         traverse from the last extent to find the first extent whose ee_block
         is less than start.
          1) If extent is still the last extent after traversal, it means that
             the last ee_block of ext4_extent_idx is less than start, that is,
             start is located in the hole between idx and (idx+1), so we can
             exit the loop directly (break) without right shifts.
          2) Otherwise, make right shifts at the corresponding position of the
             found extent, and then exit the loop (iterator=NULL).
      
      Fixes: 331573fe ("ext4: Add support FALLOC_FL_INSERT_RANGE for fallocate")
      Cc: stable@vger.kernel.org # v4.2+
      Signed-off-by: default avatarZhihao Cheng <chengzhihao1@huawei.com>
      Signed-off-by: default avatarBaokun Li <libaokun1@huawei.com>
      Link: https://lore.kernel.org/r/20220922120434.1294789-1-libaokun1@huawei.comSigned-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      f6b1a1cf
  7. 06 Nov, 2022 4 commits
    • Linus Torvalds's avatar
      Linux 6.1-rc4 · f0c4d9fc
      Linus Torvalds authored
      f0c4d9fc
    • Linus Torvalds's avatar
      Merge tag 'cxl-fixes-for-6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl · 16c7a368
      Linus Torvalds authored
      Pull cxl fixes from Dan Williams:
       "Several fixes for CXL region creation crashes, leaks and failures.
      
        This is mainly fallout from the original implementation of dynamic CXL
        region creation (instantiate new physical memory pools) that arrived
        in v6.0-rc1.
      
        Given the theme of "failures in the presence of pass-through decoders"
        this also includes new regression test infrastructure for that case.
      
        Summary:
      
         - Fix region creation crash with pass-through decoders
      
         - Fix region creation crash when no decoder allocation fails
      
         - Fix region creation crash when scanning regions to enforce the
           increasing physical address order constraint that CXL mandates
      
         - Fix a memory leak for cxl_pmem_region objects, track 1:N instead of
           1:1 memory-device-to-region associations.
      
         - Fix a memory leak for cxl_region objects when regions with active
           targets are deleted
      
         - Fix assignment of NUMA nodes to CXL regions by CFMWS (CXL Window)
           emulated proximity domains.
      
         - Fix region creation failure for switch attached devices downstream
           of a single-port host-bridge
      
         - Fix false positive memory leak of cxl_region objects by recycling
           recently used region ids rather than freeing them
      
         - Add regression test infrastructure for a pass-through decoder
           configuration
      
         - Fix some mailbox payload handling corner cases"
      
      * tag 'cxl-fixes-for-6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
        cxl/region: Recycle region ids
        cxl/region: Fix 'distance' calculation with passthrough ports
        tools/testing/cxl: Add a single-port host-bridge regression config
        tools/testing/cxl: Fix some error exits
        cxl/pmem: Fix cxl_pmem_region and cxl_memdev leak
        cxl/region: Fix cxl_region leak, cleanup targets at region delete
        cxl/region: Fix region HPA ordering validation
        cxl/pmem: Use size_add() against integer overflow
        cxl/region: Fix decoder allocation crash
        ACPI: NUMA: Add CXL CFMWS 'nodes' to the possible nodes set
        cxl/pmem: Fix failure to account for 8 byte header for writes to the device LSA.
        cxl/region: Fix null pointer dereference due to pass through decoder commit
        cxl/mbox: Add a check on input payload size
      16c7a368
    • Linus Torvalds's avatar
      Merge tag 'hwmon-for-v6.1-rc4' of... · aa529949
      Linus Torvalds authored
      Merge tag 'hwmon-for-v6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
      
      Pull hwmon fixes from Guenter Roeck:
       "Fix two regressions:
      
         - Commit 54cc3dbf ("hwmon: (pmbus) Add regulator supply into
           macro") resulted in regulator undercount when disabling regulators.
           Revert it.
      
         - The thermal subsystem rework caused the scmi driver to no longer
           register with the thermal subsystem because index values no longer
           match. To fix the problem, the scmi driver now directly registers
           with the thermal subsystem, no longer through the hwmon core"
      
      * tag 'hwmon-for-v6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
        Revert "hwmon: (pmbus) Add regulator supply into macro"
        hwmon: (scmi) Register explicitly with Thermal Framework
      aa529949
    • Linus Torvalds's avatar
      Merge tag 'perf_urgent_for_v6.1_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 727ea09e
      Linus Torvalds authored
      Pull perf fixes from Borislav Petkov:
      
       - Add Cooper Lake's stepping to the PEBS guest/host events isolation
         fixed microcode revisions checking quirk
      
       - Update Icelake and Sapphire Rapids events constraints
      
       - Use the standard energy unit for Sapphire Rapids in RAPL
      
       - Fix the hw_breakpoint test to fail more graciously on !SMP configs
      
      * tag 'perf_urgent_for_v6.1_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf/x86/intel: Add Cooper Lake stepping to isolation_ucodes[]
        perf/x86/intel: Fix pebs event constraints for SPR
        perf/x86/intel: Fix pebs event constraints for ICL
        perf/x86/rapl: Use standard Energy Unit for SPR Dram RAPL domain
        perf/hw_breakpoint: test: Skip the test if dependencies unmet
      727ea09e