1. 05 Mar, 2024 13 commits
    • Filipe Manana's avatar
      btrfs: fix off-by-one chunk length calculation at contains_pending_extent() · ae6bd7f9
      Filipe Manana authored
      At contains_pending_extent() the value of the end offset of a chunk we
      found in the device's allocation state io tree is inclusive, so when
      we calculate the length we pass to the in_range() macro, we must sum
      1 to the expression "physical_end - physical_offset".
      
      In practice the wrong calculation should be harmless as chunks sizes
      are never 1 byte and we should never have 1 byte ranges of unallocated
      space. Nevertheless fix the wrong calculation.
      Reported-by: default avatarAlex Lyakas <alex.lyakas@zadara.com>
      Link: https://lore.kernel.org/linux-btrfs/CAOcd+r30e-f4R-5x-S7sV22RJPe7+pgwherA6xqN2_qe7o4XTg@mail.gmail.com/
      Fixes: 1c11b63e ("btrfs: replace pending/pinned chunks lists with io tree")
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ae6bd7f9
    • Qu Wenruo's avatar
      btrfs: qgroup: allow quick inherit if snapshot is created and added to the same parent · b20fe56c
      Qu Wenruo authored
      Currently "btrfs subvolume snapshot -i <qgroupid>" would always mark the
      qgroup inconsistent.
      
      This can be annoying if the fs has a lot of snapshots, and needs qgroup
      to get the accounting for the amount of bytes it can free for each
      snapshot.
      
      Although we have the new simple quote as a solution, there is also a
      case where we can skip the full scan, if all the following conditions
      are met:
      
      - The source subvolume belongs to a higher level parent qgroup
      - The parent qgroup already owns all its bytes exclusively
      - The new snapshot is also added to the same parent qgroup
      
      In that case, we only need to add nodesize to the parent qgroup and
      avoid a full rescan.
      
      This patch would add the extra quick accounting update for such inherit.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b20fe56c
    • Qu Wenruo's avatar
      btrfs: qgroup: validate btrfs_qgroup_inherit parameter · 86211eea
      Qu Wenruo authored
      [BUG]
      Currently btrfs can create subvolume with an invalid qgroup inherit
      without triggering any error:
      
        # mkfs.btrfs -O quota -f $dev
        # mount $dev $mnt
        # btrfs subvolume create -i 2/0 $mnt/subv1
        # btrfs qgroup show -prce --sync $mnt
        Qgroupid    Referenced    Exclusive   Path
        --------    ----------    ---------   ----
        0/5           16.00KiB     16.00KiB   <toplevel>
        0/256         16.00KiB     16.00KiB   subv1
      
      [CAUSE]
      We only do a very basic size check for btrfs_qgroup_inherit structure,
      but never really verify if the values are correct.
      
      Thus in btrfs_qgroup_inherit() function, we have to skip non-existing
      qgroups, and never return any error.
      
      [FIX]
      Fix the behavior and introduce extra checks:
      
      - Introduce early check for btrfs_qgroup_inherit structure
        Not only the size, but also all the qgroup ids would be verified.
      
        And the timing is very early, so we can return error early.
        This early check is very important for snapshot creation, as snapshot
        is delayed to transaction commit.
      
      - Drop support for btrfs_qgroup_inherit::num_ref_copies and
        num_excl_copies
        Those two members are used to specify to copy refr/excl numbers from
        other qgroups.
        This would definitely mark qgroup inconsistent, and btrfs-progs has
        dropped the support for them for a long time.
        It's time to drop the support for kernel.
      
      - Verify the supported btrfs_qgroup_inherit::flags
        Just in case we want to add extra flags for btrfs_qgroup_inherit.
      
      Now above subvolume creation would fail with -ENOENT other than silently
      ignore the non-existing qgroup.
      
      CC: stable@vger.kernel.org # 6.7+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      86211eea
    • Anand Jain's avatar
      btrfs: include device major and minor numbers in the device scan notice · 0782303a
      Anand Jain authored
      To better debug issues surrounding device scans, include the device's
      major and minor numbers in the device scan notice for btrfs.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarSweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0782303a
    • Lijuan Li's avatar
      btrfs: mark btrfs_put_caching_control() static · 7ec28f83
      Lijuan Li authored
      btrfs_put_caching_control() is only used in block-group.c, so mark it
      static.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarLijuan Li <lilijuan@iscas.ac.cn>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7ec28f83
    • Chengming Zhou's avatar
      btrfs: remove SLAB_MEM_SPREAD flag use · ef5a05c5
      Chengming Zhou authored
      The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was
      removed as of v6.8-rc1, so it became a dead flag since the commit
      16a1d968 ("mm/slab: remove mm/slab.c and slab_def.h"). And the
      series[1] went on to mark it obsolete to avoid confusion for users.
      Here we can just remove all its users, which has no functional change.
      
      [1] https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChengming Zhou <zhouchengming@bytedance.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ef5a05c5
    • Qu Wenruo's avatar
      btrfs: qgroup: always free reserved space for extent records · d139ded8
      Qu Wenruo authored
      [BUG]
      If qgroup is marked inconsistent (e.g. caused by operations needing full
      subtree rescan, like creating a snapshot and assign to a higher level
      qgroup), btrfs would immediately start leaking its data reserved space.
      
      The following script can easily reproduce it:
      
        mkfs.btrfs -O quota -f $dev
        mount $dev $mnt
        btrfs subvolume create $mnt/subv1
        btrfs qgroup create 1/0 $mnt
      
        # This snapshot creation would mark qgroup inconsistent,
        # as the ownership involves different higher level qgroup, thus
        # we have to rescan both source and snapshot, which can be very
        # time consuming, thus here btrfs just choose to mark qgroup
        # inconsistent, and let users to determine when to do the rescan.
        btrfs subv snapshot -i 1/0 $mnt/subv1 $mnt/snap1
      
        # Now this write would lead to qgroup rsv leak.
        xfs_io -f -c "pwrite 0 64k" $mnt/file1
      
        # And at unmount time, btrfs would report 64K DATA rsv space leaked.
        umount $mnt
      
      And we would have the following dmesg output for the unmount:
      
        BTRFS info (device dm-1): last unmount of filesystem 14a3d84e-f47b-4f72-b053-a8a36eef74d3
        BTRFS warning (device dm-1): qgroup 0/5 has unreleased space, type 0 rsv 65536
      
      [CAUSE]
      Since commit e15e9f43 ("btrfs: introduce
      BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING to skip qgroup accounting"),
      we introduce a mode for btrfs qgroup to skip the timing consuming
      backref walk, if the qgroup is already inconsistent.
      
      But this skip also covered the data reserved freeing, thus the qgroup
      reserved space for each newly created data extent would not be freed,
      thus cause the leakage.
      
      [FIX]
      Make the data extent reserved space freeing mandatory.
      
      The qgroup reserved space handling is way cheaper compared to the
      backref walking part, and we always have the super sensitive leak
      detector, thus it's definitely worth to always free the qgroup
      reserved data space.
      Reported-by: default avatarFabian Vogt <fvogt@suse.com>
      Fixes: e15e9f43 ("btrfs: introduce BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING to skip qgroup accounting")
      CC: stable@vger.kernel.org # 6.1+
      Link: https://bugzilla.suse.com/show_bug.cgi?id=1216196Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d139ded8
    • Qu Wenruo's avatar
      btrfs: tree-checker: dump the page status if hit something wrong · dd6a5719
      Qu Wenruo authored
      [BUG]
      There is a bug report about very suspicious tree-checker got triggered:
      
        BTRFS critical (device dm-0): corrupted node, root=256
      block=8550954455682405139 owner mismatch, have 11858205567642294356
      expect [256, 18446744073709551360]
        BTRFS critical (device dm-0): corrupted node, root=256
      block=8550954455682405139 owner mismatch, have 11858205567642294356
      expect [256, 18446744073709551360]
        BTRFS critical (device dm-0): corrupted node, root=256
      block=8550954455682405139 owner mismatch, have 11858205567642294356
      expect [256, 18446744073709551360]
        SELinux: inode_doinit_use_xattr:  getxattr returned 117 for dev=dm-0
      ino=5737268
      
      [ANALYZE]
      The root cause is still unclear, but there are some clues already:
      
      - Unaligned eb bytenr
        The block bytenr is 8550954455682405139, which is not even aligned to
        2.
        This bytenr is fetched from extent buffer header, not from eb->start.
      
        This means, at the initial time of read, eb header bytenr is still
        correct (the very basis check to continue read), but later something
        wrong happened, got at least the first page corrupted.
        Thus we got such obviously incorrect value.
      
      - Invalid extent buffer header owner
        The read itself is triggered for subvolume 256, but the eb header
        owner is 11858205567642294356, which is not really possible.
        The problem here is, subvolume id is limited to (1 << 48 - 1),
        and this one definitely goes beyond that limit.
      
        So this value is another garbage.
      
      We already got two garbage from an extent buffer, which passed the
      initial bytenr and csum checks, but later the contents become garbage at
      some point.
      
      This looks like a page lifespan problem (e.g. we didn't properly hold the
      page).
      
      [ENHANCEMENT]
      The current tree-checker only outputs things from the extent buffer,
      nothing with the page status.
      
      So this patch would enhance the tree-checker output by also dumping the
      first page, which would look like this:
      
        page:00000000aa9f3ce8 refcount:4 mapcount:0 mapping:00000000169aa6b6 index:0x1d0c pfn:0x1022e5
        memcg:ffff888103456000
        aops:btree_aops [btrfs] ino:1
        flags: 0x2ffff0000008000(private|node=0|zone=2|lastcpupid=0xffff)
        page_type: 0xffffffff()
        raw: 02ffff0000008000 0000000000000000 dead000000000122 ffff88811e06e220
        raw: 0000000000001d0c ffff888102fdb1d8 00000004ffffffff ffff888103456000
        page dumped because: eb page dump
        BTRFS critical (device dm-3): corrupt leaf: root=5 block=30457856 slot=6 ino=257 file_offset=0, invalid disk_bytenr for file extent, have 10617606235235216665, should be aligned to 4096
        BTRFS error (device dm-3): read time tree block corruption detected on logical 30457856 mirror 1
      
      From the dump we can see some extra info, something can help us to do
      extra cross-checks:
      
      - Page refcount
        if it's too low, it definitely means something bad.
      
      - Page aops
        Any mapped eb page should have btree_aops with inode number 1.
      
      - Page index
        Since a mapped eb page should has its bytenr matching the page
        position, (index << PAGE_SHIFT) should match the bytenr of the
        bytenr from the critical line.
      
      - Page Private flags
        A mapped eb page should have Private flag set to indicate it's managed
        by btrfs.
      
      Link: https://lore.kernel.org/linux-btrfs/CAHk-=whNdMaN9ntZ47XRKP6DBes2E5w7fi-0U3H2+PS18p+Pzw@mail.gmail.com/Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dd6a5719
    • Qu Wenruo's avatar
      btrfs: compression: remove dead comments in btrfs_compress_heuristic() · 25da852d
      Qu Wenruo authored
      Since commit a440d48c ("Btrfs: heuristic: implement sampling
      logic"), btrfs_compress_heuristic() is no longer a simple "return true",
      but more complex to determine if we should compress.
      
      Thus the comment is dead and can be confusing, just remove it.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      25da852d
    • Qu Wenruo's avatar
      btrfs: subpage: make writer lock utilize bitmap · b086c5bd
      Qu Wenruo authored
      For the writer counter, it's pretty much the same as the reader counter,
      and they are exclusive.  So move them to the new locked bitmap.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b086c5bd
    • Qu Wenruo's avatar
      btrfs: subpage: make reader lock utilize bitmap · 8e7e9c67
      Qu Wenruo authored
      Currently btrfs_subpage utilizes its atomic member @reader to manage the
      reader counter.  However it is only utilized to prevent the page to be
      released/unlocked when we still have reads underway.
      
      In that use case, we don't really allow multiple readers on the same
      subpage sector.  So here we can introduce a new locked bitmap to
      represent exactly which subpage range is locked for read.
      
      In theory we can remove btrfs_subpage::reader as it's just the set bits
      of the new locked bitmap.  But unfortunately bitmap doesn't provide such
      handy API yet, so we still keep the reader counter.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8e7e9c67
    • Qu Wenruo's avatar
      btrfs: unexport btrfs_subpage_start_writer() and btrfs_subpage_end_and_test_writer() · 621b9ff1
      Qu Wenruo authored
      Both functions were introduced in commit 1e1de387 ("btrfs: make
      process_one_page() to handle subpage locking"), but they have never
      been utilized out of subpage code.  So just unexport them.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      621b9ff1
    • David Sterba's avatar
      btrfs: pass a valid extent map cache pointer to __get_extent_map() · 970ea374
      David Sterba authored
      We can pass a valid em cache pointer down to __get_extent_map() and
      drop the validity check. This avoids the special case, the call stacks
      are simple:
      
      btrfs_read_folio
        btrfs_do_readpage
          __get_extent_map
      
      extent_readahead
        contiguous_readpages
          btrfs_do_readpage
            __get_extent_map
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      970ea374
  2. 04 Mar, 2024 27 commits