1. 15 Dec, 2023 40 commits
    • btrfs: re-introduce struct btrfs_io_geometry · fd747f2d
      Johannes Thumshirn authored
      Re-introduce struct btrfs_io_geometry, holding the necessary bits and
      pieces needed in btrfs_map_block() to decide the I/O geometry of a specific
      block mapping.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: factor out helper for single device IO check · 02d05b64
      Johannes Thumshirn authored
      The check in btrfs_map_block() deciding if a particular I/O is targeting a
      single device is getting more and more convoluted.
      
      Factor out the check conditions into a helper function, with no functional
      change otherwise.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: migrate btrfs_repair_io_failure() to folio interfaces · 96c36eaa
      Qu Wenruo authored
      [BUG]
      Test case btrfs/124 fails if larger metadata folios are enabled; the
      dying message looks like this:
      
       BTRFS error (device dm-2): bad tree block start, mirror 2 want 31686656 have 0
       BTRFS info (device dm-2): read error corrected: ino 0 off 31686656 (dev /dev/mapper/test-scratch2 sector 20928)
       BUG: kernel NULL pointer dereference, address: 0000000000000020
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       CPU: 6 PID: 350881 Comm: btrfs Tainted: G           OE      6.7.0-rc3-custom+ #128
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
       RIP: 0010:btrfs_read_extent_buffer+0x106/0x180 [btrfs]
       PKRU: 55555554
       Call Trace:
        <TASK>
        read_tree_block+0x33/0xb0 [btrfs]
        read_block_for_search+0x23e/0x340 [btrfs]
        btrfs_search_slot+0x2f9/0xe60 [btrfs]
        btrfs_lookup_csum+0x75/0x160 [btrfs]
        btrfs_lookup_bio_sums+0x21a/0x560 [btrfs]
        btrfs_submit_chunk+0x152/0x680 [btrfs]
        btrfs_submit_bio+0x1c/0x50 [btrfs]
        submit_one_bio+0x40/0x80 [btrfs]
        submit_extent_page+0x158/0x390 [btrfs]
        btrfs_do_readpage+0x330/0x740 [btrfs]
        extent_readahead+0x38d/0x6c0 [btrfs]
        read_pages+0x94/0x2c0
        page_cache_ra_unbounded+0x12d/0x190
        relocate_file_extent_cluster+0x7c1/0x9d0 [btrfs]
        relocate_block_group+0x2d3/0x560 [btrfs]
        btrfs_relocate_block_group+0x2c7/0x4b0 [btrfs]
        btrfs_relocate_chunk+0x4c/0x1a0 [btrfs]
        btrfs_balance+0x925/0x13c0 [btrfs]
        btrfs_ioctl+0x19f1/0x25d0 [btrfs]
        __x64_sys_ioctl+0x90/0xd0
        do_syscall_64+0x3f/0xf0
        entry_SYSCALL_64_after_hwframe+0x6e/0x76
      
      [CAUSE]
      The dying line is the btrfs_repair_io_failure() call inside
      btrfs_repair_eb_io_failure().

      The function still relies on the extent buffer using page sized
      folios.
      When the extent buffer is using a larger folio, we go into the 2nd slot of
      folios[], and trigger the NULL pointer dereference.
      
      [FIX]
      Migrate btrfs_repair_io_failure() to folio interfaces.
      
      So that when we hit a larger folio, we just submit the whole folio in
      one go.

      This also affects the data repair path through btrfs_end_repair_bio().
      Thankfully data is still fully page based, so we can just add an
      ASSERT(), and use page_folio() to convert the page to a folio.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: migrate eb_bitmap_offset() to folio interfaces · f4521b01
      Qu Wenruo authored
      [BUG]
      Test case btrfs/002 would fail if larger folios are enabled for
      metadata:
      
       assertion failed: folio, in fs/btrfs/extent_io.c:4358
       ------------[ cut here ]------------
       kernel BUG at fs/btrfs/extent_io.c:4358!
       invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
       CPU: 1 PID: 30916 Comm: fsstress Tainted: G           OE      6.7.0-rc3-custom+ #128
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
       RIP: 0010:assert_eb_folio_uptodate+0x98/0xe0 [btrfs]
       Call Trace:
        <TASK>
        extent_buffer_test_bit+0x3c/0x70 [btrfs]
        free_space_test_bit+0xcd/0x140 [btrfs]
        modify_free_space_bitmap+0x27a/0x430 [btrfs]
        add_to_free_space_tree+0x8d/0x160 [btrfs]
        __btrfs_free_extent.isra.0+0xef1/0x13c0 [btrfs]
        __btrfs_run_delayed_refs+0x786/0x13c0 [btrfs]
        btrfs_run_delayed_refs+0x33/0x120 [btrfs]
        btrfs_commit_transaction+0xa2/0x1350 [btrfs]
        iterate_supers+0x77/0xe0
        ksys_sync+0x60/0xa0
        __do_sys_sync+0xa/0x20
        do_syscall_64+0x3f/0xf0
        entry_SYSCALL_64_after_hwframe+0x6e/0x76
        </TASK>
      
      [CAUSE]
      The function extent_buffer_test_bit() is not folio compatible.
      
      It still assumes the old fixed page size. When an extent buffer with
      a large folio is passed in, only eb->folios[0] is populated.

      If the target bit range then falls in the 2nd page of the folio, we
      check eb->folios[1] and trigger the ASSERT().
      
      [FIX]
      Just migrate eb_bitmap_offset() to folio interfaces, using the
      folio_size() to replace PAGE_SIZE.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: migrate various end io functions to folios · a700ca5e
      Qu Wenruo authored
      If we still use the old page based iterator functions, like
      bio_for_each_segment_all(), we can hit middle pages of a folio (compound
      page).

      In that case if we set any page flag on those middle pages, we can
      easily trigger VM_BUG_ON(), as compound page flags should
      follow their flag policies (normally only set on leading or tail pages).

      To avoid such problems in the future full folio migration, here we do:
      
      - Change from bio_for_each_segment_all() to bio_for_each_folio_all()
        This completely removes the ability to access the middle page.
      
      - Add extra ASSERT()s for data read/write paths
        To ensure we only get single paged folio for data now.
      
      - Rename those end io functions to follow a certain schema
        * end_bbio_compressed_read()
        * end_bbio_compressed_write()
      
          These two endio functions don't set any page flags, as they use pages
          not mapped to any address space.
          They can be very good candidates for higher order folio testing.
      
          And they are shared between compression and encoded IO.
      
        * end_bbio_data_read()
        * end_bbio_data_write()
        * end_bbio_meta_read()
        * end_bbio_meta_write()
      
        The old function names are not unified:
          - end_bio_extent_writepage()
          - end_bio_extent_readpage()
          - extent_buffer_write_end_io()
          - extent_buffer_read_end_io()
      
        They share no schema on where the "end_*io" string should be, and it can
        be confusing to use just "extent_buffer" and "extent" to distinguish
        data and metadata paths.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: migrate subpage code to folio interfaces · 55151ea9
      Qu Wenruo authored
      Although subpage itself conflicts with higher order folios, since subpage
      (sectorsize < PAGE_SIZE and nodesize < PAGE_SIZE) means we will never
      need higher order folios, there is a hidden pitfall:

      - btrfs_page_*() helpers

      Those helpers are an abstraction to handle both subpage and non-subpage
      cases, which means we're going to pass page pointers to those helpers.

      And since those helpers are shared between data and metadata paths, it's
      unavoidable to let them handle folios (including higher order
      folios).

      Meanwhile for the true subpage case, we should only have single page
      backed folios anyway, thus add a new ASSERT() for btrfs_subpage_assert()
      to ensure that.
      
      Also since those helpers are shared between both data and metadata, add
      some extra ASSERT()s for data path to make sure we only get single page
      backed folio for now.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios · 8d993618
      Qu Wenruo authored
      These two functions are still using the old page based code, which is
      not going to handle larger folios at all.
      
      The migration itself is going to involve the following changes:
      
      - PAGE_SIZE -> folio_size()
      - PAGE_SHIFT -> folio_shift()
      - get_eb_page_index() -> get_eb_folio_index()
      - get_eb_offset_in_page() -> get_eb_offset_in_folio()
      
      Since we're going to support larger folios, and although the straight
      conversion above is good enough, this patch also adds extra comments in the
      involved functions to explain why the same single line of code can now
      cover 3 cases:
      
      - folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
        The common, non-subpage case with per-page folio.
      
      - folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
        The incoming larger folio, non-subpage case.
      
      - folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
        The existing subpage case, we won't need larger folios anyway.
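
      A minimal user-space sketch of why one line of arithmetic can cover
      all three cases. The struct and the sketch_* names are hypothetical
      stand-ins, not the kernel's definitions; it only assumes that the folio
      size is a power of two and that the extent buffer records its logical
      start.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields of an extent buffer used here. */
struct sketch_eb {
	uint64_t start;            /* logical start of the extent buffer */
	unsigned int folio_shift;  /* log2 of the backing folio size */
};

/* Which folios[] slot holds byte 'offset' of the extent buffer. */
static size_t sketch_folio_index(const struct sketch_eb *eb, size_t offset)
{
	return offset >> eb->folio_shift;
}

/* Where that byte sits inside its folio. */
static size_t sketch_offset_in_folio(const struct sketch_eb *eb, size_t offset)
{
	uint64_t mask = ((uint64_t)1 << eb->folio_shift) - 1;

	return (size_t)((eb->start + offset) & mask);
}
```

      For byte 5000 of a 16K node starting at 65536, a 4K folio yields slot 1
      at offset 904, while a 16K folio yields slot 0 at offset 5000. For a 16K
      node starting at 16384 inside a 64K page (the subpage case), the slot is
      0 and the offset is 21384.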
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't double put our subpage reference in alloc_extent_buffer · 4a565c80
      Josef Bacik authored
      This fixes a case in "btrfs: refactor alloc_extent_buffer() to
      allocate-then-attach method".
      
      We have been seeing panics in the CI for the subpage stuff recently, it
      happens on btrfs/187 but could potentially happen anywhere.
      
      In the subpage case, if we race with somebody else inserting the same
      extent buffer, the error case will end up calling
      detach_extent_buffer_page() on the page twice.
      
      This is done first in the bit
      
      for (int i = 0; i < attached; i++)
      	detach_extent_buffer_page(eb, eb->pages[i]);
      
      and then again in btrfs_release_extent_buffer().
      
      This works fine for !subpage because we're the only person who ever has
      ourselves on the private, and so when we do the initial
      detach_extent_buffer_page() we know we've completely removed it.
      
      However for subpage we could be using this page private elsewhere, so
      this results in a double put on the subpage, which can result in an
      early freeing.
      
      The fix here is to clear eb->pages[i] for everything we detach.  Then
      anything still attached to the eb is freed in
      btrfs_release_extent_buffer().
      
      Because of this change we must update
      btrfs_release_extent_buffer_pages() to not use num_extent_folios(),
      because it assumes eb->folios[0] is set properly.  Since this is only
      interested in freeing any pages we have on the extent buffer we can
      simply use INLINE_EXTENT_BUFFER_PAGES.
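
      The idea can be illustrated with a small user-space model. Everything
      below (the structs, the sketch_* names, the slot count) is hypothetical
      and only mimics the shape of the fix: the error path clears each slot it
      detaches, and the release path walks every slot and skips the cleared
      ones.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_INLINE_PAGES 4   /* stand-in for INLINE_EXTENT_BUFFER_PAGES */

struct sketch_page {
	int private_refs;   /* references held on the shared page private */
};

struct sketch_ebuf {
	struct sketch_page *pages[SKETCH_INLINE_PAGES];
};

/* Drop this extent buffer's reference on the page private. */
static void sketch_detach(struct sketch_page *page)
{
	assert(page->private_refs > 0);   /* the count must never go negative */
	page->private_refs--;
}

/* Error path: detach what we attached and forget the pointers. */
static void sketch_error_cleanup(struct sketch_ebuf *eb, int attached)
{
	for (int i = 0; i < attached; i++) {
		sketch_detach(eb->pages[i]);
		eb->pages[i] = NULL;   /* the fix: release must not see it again */
	}
}

/* Release path: walk every slot, skipping the ones already cleared. */
static void sketch_release(struct sketch_ebuf *eb)
{
	for (int i = 0; i < SKETCH_INLINE_PAGES; i++) {
		if (!eb->pages[i])
			continue;
		sketch_detach(eb->pages[i]);
		eb->pages[i] = NULL;
	}
}
```

      With a page shared by two users (two references), running the error
      cleanup followed by the release drops exactly one reference in this
      model, leaving the other user's reference intact.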
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: cleanup metadata page pointer usage · 13df3775
      Qu Wenruo authored
      Although we have migrated extent_buffer::pages[] to folios[], we're
      still mostly using the folio_page() helper to grab the page.

      This patch does the following cleanups for metadata:
      
      - Introduce num_extent_folios() helper
        This is to replace most num_extent_pages() callers.
      
      - Use num_extent_folios() to iterate future large folios
        This allows us to use things like
        bio_add_folio()/bio_add_folio_nofail(), and only set the needed flags
        for the folio (aka the leading/tailing page), which reduces the loop
        iteration to 1 for large folios.
      
      - Change metadata related functions to use folio pointers
        Including their function name, involving:
        * attach_extent_buffer_page()
        * detach_extent_buffer_page()
        * page_range_has_eb()
        * btrfs_release_extent_buffer_pages()
        * btree_clear_page_dirty()
        * btrfs_page_inc_eb_refs()
        * btrfs_page_dec_eb_refs()
      
      - Change btrfs_is_subpage() to accept an address_space pointer
        This is to allow both page->mapping and folio->mapping to be utilized,
        as data is still using the old per-page code, and may stay so for a
        while.

      - Special corner case placeholder for future order mismatches between
        extent buffer and inode filemap
        For now it's just a block of comments and a dead ASSERT(), no real
        handling yet.

      The subpage code still uses pages, because subpage and large
      folios are conflicting conditions, thus we don't need to bother subpage
      with higher order folios at all. Just folio_page(folio, 0) is
      enough.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ minor styling tweaks ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: migrate extent_buffer::pages[] to folio · 082d5bb9
      Qu Wenruo authored
      For now extent_buffer::pages[] are still only accepting single page
      pointer, thus we can migrate to folios pretty easily.
      
      As for single page, page and folio are 1:1 mapped, including their page
      flags.
      
      This patch would just do the conversion from struct page to struct
      folio, providing the first step to higher order folio in the future.
      
      This conversion is pretty simple:
      
      - extent_buffer::pages[] -> extent_buffer::folios[]
      
      - page_address(eb->pages[i]) -> folio_address(eb->folios[i])
      
      - eb->pages[i] -> folio_page(eb->folios[i], 0)
      
      There would be more specific cleanups preparing for the incoming higher
      order folio support.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: refactor alloc_extent_buffer() to allocate-then-attach method · 09e6cef1
      Qu Wenruo authored
      Currently alloc_extent_buffer() utilizes find_or_create_page() to
      allocate one page at a time for an extent buffer.
      
      This method has the following disadvantages:
      
      - find_or_create_page() is the legacy way of allocating new pages
        With the new folio infrastructure, find_or_create_page() is just
        redirected to filemap_get_folio().
      
      - Lacks the way to support higher order (order >= 1) folios
        As we can not yet let filemap give us a higher order folio.
      
      This patch changes the workflow in the following way:

                      Old                | New
      -----------------------------------+-------------------------------------
                                         | ret = btrfs_alloc_page_array();
      for (i = 0; i < num_pages; i++) {  | for (i = 0; i < num_pages; i++) {
          p = find_or_create_page();     |     ret = filemap_add_folio();
          /* Attach page private */      |     /* Reuse page cache if needed */
          /* Reuse eb if needed */       |
                                         |     /* Attach page private and
                                         |        reuse eb if needed */
      }                                  | }
      
      By this we split the page allocation and private attaching into two
      parts, allowing future updates to each part more easily, and migrate to
      folio interfaces (especially for possible higher order folios).
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: sysfs: validate scrub_speed_max value · 2b0122aa
      David Disseldorp authored
      The value set as scrub_speed_max accepts size with suffixes
      (k/m/g/t/p/e) but we should still validate it for trailing characters,
      similar to what we do with chunk_size_store.
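
      A rough user-space sketch of this kind of validation follows. It is not
      the kernel code: sketch_memparse() is a hypothetical stand-in for a
      suffix-aware size parser, and sketch_parse_limit() is an assumed helper
      name. The point is only the final check that nothing but whitespace
      follows the parsed value.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical size parser: a number with an optional k/m/g/t/p/e suffix.
 * '*end' is set to the first character that was not consumed.
 */
static uint64_t sketch_memparse(const char *s, char **end)
{
	uint64_t val = strtoull(s, end, 0);
	unsigned int shift = 0;

	switch (tolower((unsigned char)**end)) {
	case 'e': shift = 60; break;
	case 'p': shift = 50; break;
	case 't': shift = 40; break;
	case 'g': shift = 30; break;
	case 'm': shift = 20; break;
	case 'k': shift = 10; break;
	}
	if (shift) {
		val <<= shift;
		(*end)++;   /* consume the suffix character */
	}
	return val;
}

/* Return 0 and store the limit, or -EINVAL if anything trails the value. */
static int sketch_parse_limit(const char *buf, uint64_t *limit)
{
	char *end;
	uint64_t val = sketch_memparse(buf, &end);

	while (isspace((unsigned char)*end))   /* tolerate a trailing newline */
		end++;
	if (*end != '\0')
		return -EINVAL;
	*limit = val;
	return 0;
}
```

      In this sketch "100m\n" is accepted as 104857600, while "100mb" is
      rejected because of the trailing "b".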
      
      CC: stable@vger.kernel.org # 5.15+
      Signed-off-by: David Disseldorp <ddiss@suse.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: switch btrfs_root::delayed_nodes_tree to xarray from radix-tree · 6140ba8a
      David Sterba authored
      The radix-tree has been superseded by the xarray
      (https://lwn.net/Articles/745073). This patch converts
      btrfs_root::delayed_nodes_tree; the APIs are used in a simple way.

      The first idea is to do xa_insert() but this would require a GFP_ATOMIC
      allocation, which we want to avoid if possible. The preload mechanism of
      radix-tree can be emulated within the xarray API.
      
      - xa_reserve() with GFP_NOFS outside of the lock, the reserved entry
        is inserted atomically at most once
      
      - xa_store() under a lock, in case something races in we can detect that
        and xa_load() returns a valid pointer
      
      All uses of xa_load() must check for a valid pointer in case they manage
      to get between the xa_reserve() and xa_store(), this is handled in
      btrfs_get_delayed_node().
      
      Otherwise the functionality is equivalent, xarray implements the
      radix-tree and there should be no performance difference.
      
      The patch continues the efforts started in 253bf575 ("btrfs: turn
      delayed_nodes_tree into an XArray") and fixes the problems with locking
      and GFP flags that led to 088aea3b ("Revert "btrfs: turn
      delayed_nodes_tree into an XArray"").
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix typos found by codespell · eefaf0a1
      David Sterba authored
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix mismatching parameter names for btrfs_get_extent() · 4618d0a6
      Qu Wenruo authored
      The declaration of btrfs_get_extent() uses "u64 end" as the last
      parameter, but the implementation uses "u64 len", and all call sites
      follow the implementation.

      This can be very confusing during development, as most developers,
      including me, would just use the snippet returned by LSP (clangd in my
      case), which only checks the declaration.

      Unfortunately this mismatch was introduced at the very beginning of
      btrfs.
      
      Fix it to prevent further confusion.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use the flags of an extent map to identify the compression type · f86f7a75
      Filipe Manana authored
      Currently, in struct extent_map, we use an unsigned int (32 bits) to
      identify the compression type of an extent and an unsigned long (64 bits
      on a 64 bits platform, 32 bits otherwise) for flags. We are only using
      6 different flags, so an unsigned long is excessive and we can use flags
      to identify the compression type instead of using a dedicated 32 bits
      field.
      
      We can easily have tens or hundreds of thousands (or more) of extent maps
      on busy and large filesystems, especially with compression enabled or many
      or large files with tons of small extents. So it's convenient to have the
      extent_map structure as small as possible in order to use less memory.
      
      So remove the compression type field from struct extent_map, use flags
      to identify the compression type and shorten the flags field from an
      unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and
      reduces the size of the structure from 136 bytes down to 128 bytes, using
      now only two cache lines, and increases the number of extent maps we can
      have per 4K page from 30 to 32. By using a u32 for the flags instead of
      an unsigned long, we no longer use test_bit(), set_bit() and clear_bit(),
      but that level of atomicity is not needed as most flags are never cleared
      once set (before adding an extent map to the tree), and the ones that can
      be cleared or set after an extent map is added to the tree, are always
      performed while holding the write lock on the extent map tree, while the
      reader holds a lock on the tree or tests for a flag that never changes
      once the extent map is in the tree (such as compression flags).
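
      As an illustration of folding the compression type into a 32-bit flags
      word, here is a small sketch. The flag names, bit positions and helper
      names are assumptions made for this example and are not taken from the
      patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits; names and positions are illustrative only. */
enum {
	SKETCH_FLAG_PINNED        = 1u << 0,
	SKETCH_FLAG_COMPRESS_ZLIB = 1u << 1,
	SKETCH_FLAG_COMPRESS_LZO  = 1u << 2,
	SKETCH_FLAG_COMPRESS_ZSTD = 1u << 3,
	SKETCH_FLAG_PREALLOC      = 1u << 4,
};

#define SKETCH_COMPRESS_MASK \
	(SKETCH_FLAG_COMPRESS_ZLIB | SKETCH_FLAG_COMPRESS_LZO | \
	 SKETCH_FLAG_COMPRESS_ZSTD)

enum sketch_compress { SKETCH_NONE, SKETCH_ZLIB, SKETCH_LZO, SKETCH_ZSTD };

struct sketch_extent_map {
	uint32_t flags;   /* plain u32, modified with |= and &= under a lock */
};

/* Any compression bit set means the extent is compressed. */
static bool sketch_is_compressed(const struct sketch_extent_map *em)
{
	return (em->flags & SKETCH_COMPRESS_MASK) != 0;
}

/* Recover the compression type from the flag bits. */
static enum sketch_compress sketch_compress_type(const struct sketch_extent_map *em)
{
	if (em->flags & SKETCH_FLAG_COMPRESS_ZLIB)
		return SKETCH_ZLIB;
	if (em->flags & SKETCH_FLAG_COMPRESS_LZO)
		return SKETCH_LZO;
	if (em->flags & SKETCH_FLAG_COMPRESS_ZSTD)
		return SKETCH_ZSTD;
	return SKETCH_NONE;
}
```

      The per-page figures quoted above follow from integer division:
      4096 / 136 is 30 and 4096 / 128 is 32.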
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: refactor mergable_maps() for more readability · 27f0d9c9
      Filipe Manana authored
      At mergable_maps() instead of having a single if statement with many
      ORed and ANDed conditions, refactor it with multiple if statements that
      check a single condition and return immediately once a requirement fails.
      This makes it easier to read.
      
      Also change the return type from int to bool, make the arguments const
      and rename the function from mergable_maps() to mergeable_maps().
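
      The early-return style described here might look roughly like the
      sketch below. The struct fields and the three conditions are simplified
      assumptions for illustration; they are not the exact checks the kernel
      function performs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, trimmed-down extent map. */
struct sketch_em {
	uint64_t start;        /* file offset */
	uint64_t len;          /* length in the file */
	uint64_t block_start;  /* on-disk start */
	uint64_t block_len;    /* on-disk length */
	uint32_t flags;
};

static uint64_t sketch_em_end(const struct sketch_em *em)
{
	return em->start + em->len;
}

/* One condition per if statement, returning as soon as a requirement fails. */
static bool sketch_mergeable(const struct sketch_em *prev,
			     const struct sketch_em *next)
{
	/* Must be adjacent in the file. */
	if (sketch_em_end(prev) != next->start)
		return false;

	/* Must carry identical flags. */
	if (prev->flags != next->flags)
		return false;

	/* Must be contiguous on disk. */
	if (prev->block_start + prev->block_len != next->block_start)
		return false;

	return true;
}
```

      Each requirement can then be read, commented and changed on its own,
      which is harder with a single compound condition.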
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make extent_map_end() argument const · b144cc04
      Filipe Manana authored
      The extent map pointer argument for extent_map_end() can be const as we
      are not modifying anything in the extent map. So make it const, as it will
      allow further changes to callers that have a const extent map pointer.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: avoid useless rbtree iterations when attempting to merge extent map · 1a9fb16c
      Filipe Manana authored
      When trying to merge an extent map that was just inserted or unpinned, we
      will try to merge it with any adjacent extent map that is suitable.
      
      However we will only check if our extent map is mergeable after searching
      for the previous and next extent maps in the rbtree, meaning that we are
      doing unnecessary calls to rb_prev() and rb_next() in case our extent map
      is not mergeable (it's compressed, in the list of modified extents, being
      logged or pinned), wasting CPU time chasing rbtree pointers and pulling
      in unnecessary cache lines.
      
      So change the logic to check first if an extent map is mergeable before
      searching for the next and previous extent maps in the rbtree.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: log messages at unpin_extent_range() during unexpected cases · 00deaf04
      Filipe Manana authored
      At unpin_extent_range() we trigger a WARN_ON() when we don't find an
      extent map or we find one with a start offset not matching the start
      offset of the target range. This however isn't very useful for debugging
      because:
      
      1) We don't know which condition was triggered, as they are both in the
         same WARN_ON() call;
      
      2) We don't know which inode was affected, from which root, for which
         range, what's the start offset of the extent map, and so on.
      
      So trigger a separate warning for each case and log a message for each
      case providing information about the inode, its root, the target range,
      the generation and the start offset of the extent map we found.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove redundant value assignment at btrfs_add_extent_mapping() · d224d2ef
      Filipe Manana authored
      At btrfs_add_extent_mapping(), in case add_extent_mapping() returned
      -EEXIST, it's pointless to assign 0 to 'ret' since we will assign a value
      to it shortly after, without 'ret' being used before that. So remove that
      pointless assignment.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: unexport add_extent_mapping() · db9d9446
      Filipe Manana authored
      There's no need to export add_extent_mapping(), as it's only used inside
      extent_map.c and in the self tests. For the tests we can use instead
      btrfs_add_extent_mapping(), which will accomplish exactly the same as we
      don't expect collisions in any of them. So unexport it and make the tests
      use btrfs_add_extent_mapping() instead of add_extent_mapping().
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: tests: print all values as decimal in messages for extent map tests · c9201b4f
      Filipe Manana authored
      Some error messages of the extent map tests print decimal values of start
      offsets and lengths, while others oddly print in hexadecimal, which
      is far less human friendly, especially taking into consideration that all
      the values are small and multiples of 4K, so it's a lot easier to read
      them as decimal values. Change the format specifiers to print as decimal
      instead.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: tests: do not ignore NULL extent maps for extent maps tests · eca3aaec
      Filipe Manana authored
      Several of the extent map tests call btrfs_add_extent_mapping() which is
      supposed to succeed and return an extent map through the pointer to
      pointer argument. However the tests are deliberately ignoring a NULL
      extent map, which is not expected to happen. So change the tests to error
      out if a NULL extent map is found.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: tests: fix error messages for test case 4 of extent map tests · b30aa1c1
      Filipe Manana authored
      In test case 4 for extent maps, if we error out we are supposed to print
      an interval, but instead of printing a non-inclusive end offset, we are
      printing the length of the interval, which makes it confusing. So fix
      that to print the exclusive end offset instead.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: assert extent map is not in a list when setting it up · 32d53f6f
      Filipe Manana authored
      When setting up a new extent map, at setup_extent_mapping(), we're doing
      a list move operation to add the extent map to the tree's list of modified
      extents. This is confusing because at this point the extent map can not
      be in any list, because it's a new extent map. So replace the list move
      with a list add and add an assertion that checks that the extent map is
      not currently in any list.
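
      A small stand-alone model of the change, using a hand-rolled circular
      doubly linked list rather than the kernel's list helpers. All sketch_*
      names are hypothetical. A list move is a delete followed by an add, so
      for a node that is known to be unlinked a plain add is sufficient, and
      the assertion documents that expectation.

```c
#include <assert.h>
#include <stdbool.h>

struct sketch_list {
	struct sketch_list *prev, *next;
};

/* An initialized node points at itself, meaning "not on any list". */
static void sketch_list_init(struct sketch_list *n)
{
	n->prev = n->next = n;
}

static bool sketch_list_empty(const struct sketch_list *n)
{
	return n->next == n;
}

/* Insert 'n' right after 'head'. */
static void sketch_list_add(struct sketch_list *n, struct sketch_list *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

struct sketch_map {
	struct sketch_list list;
};

/* A new map must not be linked anywhere yet, so add instead of move. */
static void sketch_setup_map(struct sketch_map *em, struct sketch_list *modified)
{
	assert(sketch_list_empty(&em->list));
	sketch_list_add(&em->list, modified);
}
```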
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allocate btrfs_inode::file_extent_tree only without NO_HOLES · 637e6e0f
      David Sterba authored
      The file_extent_tree was added in 41a2ee75 ("btrfs: introduce
      per-inode file extent tree") so we have an explicit mapping of the file
      extents to know where it is safe to update i_size. When the feature
      NO_HOLES is enabled, and it's been a mkfs default since 5.15, the tree
      is not necessary.
      
      To save some space in the inode, allocate the tree only when necessary.
      This reduces size by 16 bytes from 1096 to 1080 on a x86_64 release
      config.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: cache that we don't have security.capability set · ed9b50a1
      Josef Bacik authored
      When profiling a workload I noticed we were constantly calling getxattr.
      These were mostly coming from __remove_privs, which will look up whether
      security.capability exists in order to remove it.  However instrumenting
      getxattr showed we get called nearly constantly on an idle machine on a
      lot of accesses.
      
      These are wasteful and not free.  Other security LSMs have a way to
      cache their results, but capability doesn't have this, so it's asking us
      all the time for the xattr.
      
      Fix this by setting a flag in our inode that it doesn't have a
      security.capability xattr.  We set this on new inodes and after a failed
      lookup of security.capability.  If we set this xattr at all we'll clear
      the flag.
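
      The caching rule in the paragraph above can be modelled in a few
      lines. This is a toy with made-up names (sketch_inode, no_cap_xattr and
      so on), not the btrfs implementation; the lookups counter merely stands
      in for the expensive xattr search.

```c
#include <stdbool.h>
#include <string.h>

struct sketch_inode {
	bool no_cap_xattr;     /* hypothetical "known to lack the xattr" flag */
	bool has_cap_xattr;    /* stand-in for what is stored on disk */
	unsigned int lookups;  /* how many expensive searches were done */
};

/* New inodes start with the flag set: they cannot have the xattr yet. */
static void sketch_new_inode(struct sketch_inode *inode)
{
	memset(inode, 0, sizeof(*inode));
	inode->no_cap_xattr = true;
}

/* Returns true if the xattr exists. */
static bool sketch_get_cap(struct sketch_inode *inode)
{
	if (inode->no_cap_xattr)
		return false;                 /* cached negative answer */
	inode->lookups++;                     /* the expensive search */
	if (!inode->has_cap_xattr)
		inode->no_cap_xattr = true;   /* remember the failed lookup */
	return inode->has_cap_xattr;
}

/* Setting the xattr invalidates the cached negative answer. */
static void sketch_set_cap(struct sketch_inode *inode)
{
	inode->has_cap_xattr = true;
	inode->no_cap_xattr = false;
}
```

      In this model an existing inode without the xattr pays for one search
      and then answers from the flag, while a freshly created inode never
      searches at all until the xattr is set.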
      
      I haven't found a test in fsperf that this makes a visible difference
      on, but I assume fs_mark related tests would show it clearly.  This is a
      perf report output of the smallfiles100k run where it shows 20% of our
      time spent in __remove_privs because we're looking up the non-existent
      xattr.
      
      --21.86%--btrfs_write_check.constprop.0
        --21.62%--__file_remove_privs
          --21.55%--security_inode_need_killpriv
            --21.54%--cap_inode_need_killpriv
              --21.53%--__vfs_getxattr
                --20.89%--btrfs_getxattr
      
      Obviously this is just CPU time in a mostly IO bound test, so the actual
      effect of removing this callchain is minimal.  However in just normal
      testing of an idle system tracing showed around 100 getxattr calls per
      minute, and with this patch there are 0.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ed9b50a1
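The caching scheme in this commit message (flag set on new inodes and after a failed lookup, cleared when the xattr is set) can be sketched as a small user-space model. All names here (`DEMO_NO_CAP_XATTR`, `demo_cap_inode`, `demo_has_capability` and so on) are hypothetical, and the "slow lookup" merely counts calls in place of a real xattr search.

```c
#include <stdbool.h>

#define DEMO_NO_CAP_XATTR (1u << 0) /* hypothetical "known absent" flag */

struct demo_cap_inode {
    unsigned int flags;
    bool xattr_present;         /* stands in for the stored xattr */
    unsigned long slow_lookups; /* counts expensive lookups */
};

/* New inodes cannot have the xattr yet, so start with the flag set. */
void demo_cap_inode_new(struct demo_cap_inode *inode)
{
    inode->flags = DEMO_NO_CAP_XATTR;
    inode->xattr_present = false;
    inode->slow_lookups = 0;
}

/* Stand-in for the expensive xattr search. */
static bool demo_slow_lookup(struct demo_cap_inode *inode)
{
    inode->slow_lookups++;
    return inode->xattr_present;
}

bool demo_has_capability(struct demo_cap_inode *inode)
{
    /* Fast path: we already know the xattr is absent. */
    if (inode->flags & DEMO_NO_CAP_XATTR)
        return false;

    if (demo_slow_lookup(inode))
        return true;

    /* Failed lookup: remember the negative result. */
    inode->flags |= DEMO_NO_CAP_XATTR;
    return false;
}

/* Setting the xattr invalidates the negative cache. */
void demo_set_capability(struct demo_cap_inode *inode)
{
    inode->xattr_present = true;
    inode->flags &= ~DEMO_NO_CAP_XATTR;
}
```

Only the negative result is cached in this sketch, so a positive answer still goes through the slow lookup every time.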
    • Josef Bacik's avatar
      btrfs: remove code for inode_cache and recovery mount options · a1912f71
      Josef Bacik authored
      We deprecated these a while ago in 5.11; go ahead and remove the code
      for them.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Acked-by: Christian Brauner <brauner@kernel.org>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a1912f71
    • Josef Bacik's avatar
      btrfs: set clear_cache if we use usebackuproot · 9fb3b1a7
      Josef Bacik authored
      We're currently setting this when we try to load the roots and we see
      that usebackuproot is set.  Instead set this at mount option parsing
      time.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Acked-by: Christian Brauner <brauner@kernel.org>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9fb3b1a7
    • Josef Bacik's avatar
      btrfs: move one shot mount option clearing to super.c · 83e3a40a
      Josef Bacik authored
      There's no reason this has to happen in open_ctree, and in fact in the
      old mount API we had to call this from remount.  Move this to super.c,
      unexport it, and call it from both mount and reconfigure.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Acked-by: Christian Brauner <brauner@kernel.org>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      83e3a40a
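As a rough illustration of the one-shot clearing described above, the sketch below keeps the helper file-local (the "unexport" part) and calls it from both a mount path and a reconfigure path. The names and the choice of which bits count as one-shot are assumptions made for the example, not taken from the commit.

```c
/* Hypothetical option bits; which ones are one-shot is an assumption here. */
#define DEMO_OPT_USEBACKUPROOT (1u << 0)
#define DEMO_OPT_CLEAR_CACHE   (1u << 1)
#define DEMO_OPT_COMPRESS      (1u << 2) /* a persistent option */

#define DEMO_ONE_SHOT_MASK (DEMO_OPT_USEBACKUPROOT | DEMO_OPT_CLEAR_CACHE)

struct demo_fs {
    unsigned int mount_opt;
};

/* File-local helper: one-shot options must not outlive the operation. */
static void demo_clear_oneshot_options(struct demo_fs *fs)
{
    fs->mount_opt &= ~DEMO_ONE_SHOT_MASK;
}

void demo_mount(struct demo_fs *fs, unsigned int opts)
{
    fs->mount_opt = opts;
    /* ... work that consumes the one-shot options would happen here ... */
    demo_clear_oneshot_options(fs);
}

void demo_reconfigure(struct demo_fs *fs, unsigned int opts)
{
    fs->mount_opt = opts;
    /* ... remount work would happen here ... */
    demo_clear_oneshot_options(fs);
}
```

After either call, only the persistent bits remain in `mount_opt`.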
    • Josef Bacik's avatar
      btrfs: remove old mount API code · 6941823c
      Josef Bacik authored
      Now that we've switched to the new mount API, remove the old stuff.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Acked-by: Christian Brauner <brauner@kernel.org>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6941823c
    • Josef Bacik's avatar
      btrfs: move the device specific mount options to super.c · 41d46b29
      Josef Bacik authored
      We add these mount options based on the fs_devices settings, which can
      be set once we've opened the fs_devices.  Move these into their own
      helper and call it from get_tree_super.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Acked-by: Christian Brauner <brauner@kernel.org>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      41d46b29
    • Josef Bacik's avatar
      btrfs: switch to the new mount API · ad21f15b
      Josef Bacik authored
      Now that we have all of the parts in place to use the new mount API,
      switch our fs_type to use the new callbacks.
      
      There are a few things that have to be done at the same time because of
      the order of operations changes that come along with the new mount API.
      These must be done in the same patch otherwise things will go wrong.
      
      1. Export and use btrfs_check_options in open_ctree().  This is because
         the options are done ahead of time, and we need to check them once we
         have the feature flags loaded.
      
      2. Update the free space cache settings.  Since we're coming in with the
         options already set we need to make sure we don't undo what the user
         has asked for.
      
      3. Set our sb_flags at init_fs_context time.  The fs_context stuff is
         trying to manage the sb_flags itself, so move that into
         init_fs_context and out of the fill super part.
      
      Additionally I've marked the unused functions with __maybe_unused and
      will remove them in a future patch.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Acked-by: Christian Brauner <brauner@kernel.org>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ad21f15b
    • Josef Bacik's avatar
      btrfs: handle the ro->rw transition for mounting different subvolumes · f044b318
      Josef Bacik authored
      This is a special case that we've carried around since 0723a047 ("btrfs:
      allow mounting btrfs subvolumes with different ro/rw options") where
      we'll under the covers flip the file system to RW if you're mixing and
      matching ro/rw options with different subvol mounts.  The first mount is
      what the super gets set up as, so we'd handle this by remounting the
      super as rw under the covers to facilitate this behavior.
      
      With the new mount API we can't really allow this, because user space
      has the ability to specify the super block settings, and the mount
      settings.  So if the user explicitly sets the super block as read only,
      and then tries to mount a rw mount with that super block, we'll reject
      this.  However the old API was less descriptive and thus we allowed this
      kind of behavior.
      
      This patch preserves this behavior for the old API calls.  This is
      inspired by Christian's work [1], and includes his comment in
      btrfs_get_tree_super() explaining the history and how it all works in
      the old and new APIs.
      
      Link: https://lore.kernel.org/all/20230626-fs-btrfs-mount-api-v1-2-045e9735a00b@kernel.org/
      Reviewed-by: Christian Brauner <brauner@kernel.org>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f044b318
    • Josef Bacik's avatar
      btrfs: add get_tree callback for new mount API · 3bb17a25
      Josef Bacik authored
      This is the actual mounting callback for the new mount API.  Implement
      this using our current fill super as a guideline, making the appropriate
      adjustments for the new mount API.
      
      Our old mount operation had two fs_types: one to handle the actual
      opening, and one that called the first to do the opening and then did
      the subvol lookup for returning the actual root dentry.  This is
      mirrored here, but simply with different behaviors for ->get_tree.
      We use the existence of ->s_fs_info to tell which part we're in.  The
      initial call allocates the fs_info, then calls fc_mount() with a
      duplicated fc to do the actual open_ctree part.  Then we take that
      vfsmount and use it to look up our subvolume that we're mounting and
      return that as our s_root.  This idea was taken from Christian's attempt
      to convert us to the new mount API [1].
      
      In btrfs_get_tree_super() the mount device is scanned and opened in one
      go under uuid_mutex.  We expect that all related devices have already
      been scanned, either by mount or from the outside.  A device forget can
      be called on some of the devices because the whole context is not
      protected.  That is an unlikely event, but it is a minor behaviour
      change.
      
      References: https://lore.kernel.org/all/20230626-fs-btrfs-mount-api-v1-2-045e9735a00b@kernel.org/
      Reviewed-by: Christian Brauner <brauner@kernel.org>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add note about device scanning ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      3bb17a25
    • Josef Bacik's avatar
      btrfs: add reconfigure callback for fs_context · eddb1a43
      Josef Bacik authored
      This is what is used to remount the file system with the new mount API.
      Because the mount options are parsed separately and one at a time, I've
      added a helper to emit the mount options after the fact, once the mount
      is configured.  This matches the dmesg output for what happens with the
      old mount API.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Acked-by: Christian Brauner <brauner@kernel.org>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      eddb1a43
    • Josef Bacik's avatar
      btrfs: add fs context handling functions · 0f85e244
      Josef Bacik authored
      We are going to use the fs context to hold the mount options, so
      allocate the btrfs_fs_context when we're asked to init the fs context,
      and free it in the free callback.
      Reviewed-by: Christian Brauner <brauner@kernel.org>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0f85e244
    • Josef Bacik's avatar
      btrfs: add parse_param callback for the new mount API · 17b36120
      Josef Bacik authored
      The parse_param callback handles one parameter at a time.  Take our
      existing mount option parsing loop, adjust it to handle one parameter
      at a time, and tie it into the fs_context_operations.
      
      Create a btrfs_fs_context object that will store the various mount
      properties; we'll house this in fc->fs_private.  The separation is
      necessary because remounting will use ->reconfigure, and we'll get a new
      copy of the parsed parameters, so we can no longer directly mess with
      the fs_info in this stage.
      
      In the future we'll add this to the btrfs_fs_info and update the users
      to use the new context object instead.
      
      There's a change in how the option device= is processed. Previously all
      mount options were parsed in one go under uuid_mutex and the devices
      opened. This prevented a concurrent scan from happening during mount.
      Now we could see a device scan happen (e.g. by udev) but this should not
      affect the end result: mount will either see the populated fs_devices or
      will scan the device by itself.
      
      Alternatively we could save all the device paths first and then process
      them in one go as before but this does not seem to be necessary.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Acked-by: Christian Brauner <brauner@kernel.org>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add note about device scanning ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      17b36120
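The idea of parsing one parameter per call into a separate context, and only later copying the result into the live filesystem state, can be modelled in plain C. This is a user-space sketch with hypothetical names (`demo_fs_context`, `demo_fs_info`, `demo_parse_param`, `demo_apply_context`); it does not use the kernel's fs_context types, and the two options shown are just examples.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_MOUNT_SSD (1u << 0)

/* Parsed options live here until the mount or remount is committed. */
struct demo_fs_context {
    unsigned int mount_opt;
    unsigned long commit_interval;
};

/* Live filesystem state; parsing never touches it directly. */
struct demo_fs_info {
    unsigned int mount_opt;
    unsigned long commit_interval;
};

/* Handles exactly one key/value pair. Returns 0 on success, -1 on error. */
int demo_parse_param(struct demo_fs_context *ctx, const char *key,
                     const char *value)
{
    if (strcmp(key, "ssd") == 0) {
        ctx->mount_opt |= DEMO_MOUNT_SSD;
        return 0;
    }
    if (strcmp(key, "commit") == 0) {
        char *end;

        /* Require a plain non-negative decimal number. */
        if (!value || !isdigit((unsigned char)value[0]))
            return -1;
        ctx->commit_interval = strtoul(value, &end, 10);
        return *end == '\0' ? 0 : -1;
    }
    return -1; /* unknown parameter */
}

/* Commit step: copy the fully parsed context into the live state. */
void demo_apply_context(struct demo_fs_info *info,
                        const struct demo_fs_context *ctx)
{
    info->mount_opt = ctx->mount_opt;
    info->commit_interval = ctx->commit_interval;
}
```

Because each remount starts from a fresh context, a parse error partway through leaves the live state untouched in this model.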
    • Josef Bacik's avatar
      btrfs: add fs_parameter definitions · 15ddcdd3
      Josef Bacik authored
      In order to convert to the new mount API we have to change how we do the
      mount option parsing.  For now we're going to duplicate these helpers to
      make it easier to follow, and then remove the old code once everything
      is in place.  This patch contains the re-definition of all of our mount
      options into the new fs_parameter_spec format.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Acked-by: Christian Brauner <brauner@kernel.org>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      15ddcdd3
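A declarative option table of the kind this commit describes can be approximated in user space as below. This is an analogy only: `demo_param_spec` and `demo_find_param` are hypothetical and are not the kernel's fs_parameter_spec or its helper macros. The point is that each option is described once by name, value type and identifier, and a generic lookup replaces hand-written string matching.

```c
#include <stddef.h>
#include <string.h>

/* Kind of value an option takes. */
enum demo_param_type { DEMO_PARAM_FLAG, DEMO_PARAM_U32, DEMO_PARAM_STRING };

/* Identifier handed to the per-option handler. */
enum demo_opt { DEMO_OPT_SSD, DEMO_OPT_COMMIT, DEMO_OPT_SUBVOL };

struct demo_param_spec {
    const char *name;
    enum demo_param_type type;
    int opt;
};

/* One row per mount option; a NULL name terminates the table. */
static const struct demo_param_spec demo_params[] = {
    { "ssd",    DEMO_PARAM_FLAG,   DEMO_OPT_SSD },
    { "commit", DEMO_PARAM_U32,    DEMO_OPT_COMMIT },
    { "subvol", DEMO_PARAM_STRING, DEMO_OPT_SUBVOL },
    { NULL,     DEMO_PARAM_FLAG,   -1 },
};

/* Returns the matching row, or NULL if the option is unknown. */
const struct demo_param_spec *demo_find_param(const char *name)
{
    const struct demo_param_spec *p;

    for (p = demo_params; p->name; p++) {
        if (strcmp(p->name, name) == 0)
            return p;
    }
    return NULL;
}
```

A caller would look up the row first and then switch on `opt`, using `type` to decide how to interpret the value.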