1. 11 Jul, 2024 40 commits
    • Filipe Manana's avatar
      btrfs: avoid removal and re-insertion of split ordered extent · cb3cd624
      Filipe Manana authored
      At btrfs_split_ordered_extent(), we are removing and re-inserting the
      ordered extent that we are trimming, but we don't need to since the
      trimming doesn't change its position in the red black tree because we
      don't have overlapping ordered extents (that would imply double allocation
      of extents) and we know the split length is smaller than the ordered
      extent's num_bytes field (we checked that early in the function).
      
      So drop the remove and re-insert code for the slit ordered extent.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cb3cd624
    • Filipe Manana's avatar
      btrfs: add comment about locking to btrfs_split_ordered_extent() · c18ca3c9
      Filipe Manana authored
      There are subtle details about why the root's ordered_extent_lock is held,
      so add a comment mentioning them.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c18ca3c9
    • Filipe Manana's avatar
      btrfs: reduce critical section at btrfs_wait_ordered_extents() · ac1f580c
      Filipe Manana authored
      At btrfs_wait_ordered_extents(), there's no point in updating the counters
      after locking the root's ordered extent lock, as the counters are local.
      So change this to update the counters before taking the lock.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ac1f580c
    • Filipe Manana's avatar
      btrfs: reduce critical section at btrfs_wait_ordered_roots() · 03103ecf
      Filipe Manana authored
      At btrfs_wait_ordered_roots(), there's no point in decrementing the
      counter after locking fs_info->ordered_root_lock as the counter is local.
      So change this to decrement the counter before taking the lock.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      03103ecf
    • David Sterba's avatar
      btrfs: constify pointer parameters where applicable · 2917f741
      David Sterba authored
      We can add const to many parameters, this is for clarity and minor
      addition to safety. There are some minor effects, in the assembly
      code and .ko measured on release config. This patch does not cover all
      possible conversions.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2917f741
    • Qu Wenruo's avatar
      btrfs: do not directly include rwlock_types.h · c27b1dbb
      Qu Wenruo authored
      There is already an error inside that header:
      
       #if !defined(__LINUX_SPINLOCK_TYPES_H)
       # error "Do not include directly, include spinlock_types.h"
       #endif
      
      Thankfully it never get triggered as some other headers have already
      included spinlock_types.h.
      
      However clangd would still do a proper warning on that if only
      extent_map.h is opened.
      Fix it by using spinlock_types.h instead.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c27b1dbb
    • Qu Wenruo's avatar
      btrfs: cleanup recursive include of the same header · 3b8dbf34
      Qu Wenruo authored
      We have several headers that are including themselves, triggering clangd
      warnings.
      Such includes are caused by commit 602035d7 ("btrfs: add forward
      declarations and headers, part 2").
      
      Just remove such unnecessary include.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3b8dbf34
    • Junchao Sun's avatar
      btrfs: qgroup: delete a TODO about using kmem cache to allocate structures · a56b7952
      Junchao Sun authored
      Generic slab works fine allocating btrfs_qgroup_extent_record
      structures. It's not necessary to create a dedicated kmem cache that
      would be created but unused if quotas were not enabled. Let's delete the
      TODO line.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarJunchao Sun <sunjunchao2870@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a56b7952
    • Qu Wenruo's avatar
      btrfs: make extent_write_locked_range() handle subpage writeback correctly · a185373e
      Qu Wenruo authored
      When extent_write_locked_range() generated an inline extent, it would
      set and finish the writeback for the whole page.
      
      Although currently it's safe since subpage disables inline creation,
      for the sake of consistency, let it go with subpage helpers to set and
      clear the writeback flags.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a185373e
    • Qu Wenruo's avatar
      btrfs: do not clear page dirty inside extent_write_locked_range() · 97713b1a
      Qu Wenruo authored
      [BUG]
      For subpage + zoned case, the following workload can lead to rsv data
      leak at unmount time:
      
        # mkfs.btrfs -f -s 4k $dev
        # mount $dev $mnt
        # fsstress -w -n 8 -d $mnt -s 1709539240
        0/0: fiemap - no filename
        0/1: copyrange read - no filename
        0/2: write - no filename
        0/3: rename - no source filename
        0/4: creat f0 x:0 0 0
        0/4: creat add id=0,parent=-1
        0/5: writev f0[259 1 0 0 0 0] [778052,113,965] 0
        0/6: ioctl(FIEMAP) f0[259 1 0 0 224 887097] [1294220,2291618343991484791,0x10000] -1
        0/7: dwrite - xfsctl(XFS_IOC_DIOINFO) f0[259 1 0 0 224 887097] return 25, fallback to stat()
        0/7: dwrite f0[259 1 0 0 224 887097] [696320,102400] 0
        # umount $mnt
      
      The dmesg includes the following rsv leak detection warning (all call
      trace skipped):
      
        ------------[ cut here ]------------
        WARNING: CPU: 2 PID: 4528 at fs/btrfs/inode.c:8653 btrfs_destroy_inode+0x1e0/0x200 [btrfs]
        ---[ end trace 0000000000000000 ]---
        ------------[ cut here ]------------
        WARNING: CPU: 2 PID: 4528 at fs/btrfs/inode.c:8654 btrfs_destroy_inode+0x1a8/0x200 [btrfs]
        ---[ end trace 0000000000000000 ]---
        ------------[ cut here ]------------
        WARNING: CPU: 2 PID: 4528 at fs/btrfs/inode.c:8660 btrfs_destroy_inode+0x1a0/0x200 [btrfs]
        ---[ end trace 0000000000000000 ]---
        BTRFS info (device sda): last unmount of filesystem 1b4abba9-de34-4f07-9e7f-157cf12a18d6
        ------------[ cut here ]------------
        WARNING: CPU: 3 PID: 4528 at fs/btrfs/block-group.c:4434 btrfs_free_block_groups+0x338/0x500 [btrfs]
        ---[ end trace 0000000000000000 ]---
        BTRFS info (device sda): space_info DATA has 268218368 free, is not full
        BTRFS info (device sda): space_info total=268435456, used=204800, pinned=0, reserved=0, may_use=12288, readonly=0 zone_unusable=0
        BTRFS info (device sda): global_block_rsv: size 0 reserved 0
        BTRFS info (device sda): trans_block_rsv: size 0 reserved 0
        BTRFS info (device sda): chunk_block_rsv: size 0 reserved 0
        BTRFS info (device sda): delayed_block_rsv: size 0 reserved 0
        BTRFS info (device sda): delayed_refs_rsv: size 0 reserved 0
        ------------[ cut here ]------------
        WARNING: CPU: 3 PID: 4528 at fs/btrfs/block-group.c:4434 btrfs_free_block_groups+0x338/0x500 [btrfs]
        ---[ end trace 0000000000000000 ]---
        BTRFS info (device sda): space_info METADATA has 267796480 free, is not full
        BTRFS info (device sda): space_info total=268435456, used=131072, pinned=0, reserved=0, may_use=262144, readonly=0 zone_unusable=245760
        BTRFS info (device sda): global_block_rsv: size 0 reserved 0
        BTRFS info (device sda): trans_block_rsv: size 0 reserved 0
        BTRFS info (device sda): chunk_block_rsv: size 0 reserved 0
        BTRFS info (device sda): delayed_block_rsv: size 0 reserved 0
        BTRFS info (device sda): delayed_refs_rsv: size 0 reserved 0
      
      Above $dev is a tcmu-runner emulated zoned HDD, which has a max zone
      append size of 64K, and the system has 64K page size.
      
      [CAUSE]
      I have added several trace_printk() to show the events (header skipped):
      
        > btrfs_dirty_pages: r/i=5/259 dirty start=774144 len=114688
        > btrfs_dirty_pages: r/i=5/259 dirty part of page=720896 off_in_page=53248 len_in_page=12288
        > btrfs_dirty_pages: r/i=5/259 dirty part of page=786432 off_in_page=0 len_in_page=65536
        > btrfs_dirty_pages: r/i=5/259 dirty part of page=851968 off_in_page=0 len_in_page=36864
      
      The above lines show our buffered write has dirtied 3 pages of inode
      259 of root 5:
      
        704K             768K              832K              896K
        I           |////I/////////////////I///////////|     I
                    756K                               868K
      
        |///| is the dirtied range using subpage bitmaps. and 'I' is the page
        boundary.
      
        Meanwhile all three pages (704K, 768K, 832K) have their PageDirty
        flag set.
      
        > btrfs_direct_write: r/i=5/259 start dio filepos=696320 len=102400
      
      Then direct IO write starts, since the range [680K, 780K) covers the
      beginning part of the above dirty range, we need to writeback the
      two pages at 704K and 768K.
      
        > cow_file_range: r/i=5/259 add ordered extent filepos=774144 len=65536
        > extent_write_locked_range: r/i=5/259 locked page=720896 start=774144 len=65536
      
      Now the above 2 lines show that we're writing back for dirty range
      [756K, 756K + 64K).
      We only writeback 64K because the zoned device has max zone append size
      as 64K.
      
        > extent_write_locked_range: r/i=5/259 clear dirty for page=786432
      
      !!! The above line shows the root cause. !!!
      
      We're calling clear_page_dirty_for_io() inside extent_write_locked_range(),
      for the page 768K.
      This is because extent_write_locked_range() can go beyond the current
      locked page, here we hit the page at 768K and clear its page dirt.
      
      In fact this would lead to the desync between subpage dirty and page
      dirty flags.  We have the page dirty flag cleared, but the subpage range
      [820K, 832K) is still dirty.
      
      After the writeback of range [756K, 820K), the dirty flags look like
      this, as page 768K no longer has dirty flag set.
      
        704K             768K              832K              896K
        I                I      |          I/////////////|   I
                                820K                     868K
      
      This means we will no longer writeback range [820K, 832K), thus the
      reserved data/metadata space would never be properly released.
      
        > extent_write_cache_pages: r/i=5/259 skip non-dirty folio=786432
      
      Now even though we try to start writeback for page 768K, since the
      page is not dirty, we completely skip it at extent_write_cache_pages()
      time.
      
        > btrfs_direct_write: r/i=5/259 dio done filepos=696320 len=0
      
      Now the direct IO finished.
      
        > cow_file_range: r/i=5/259 add ordered extent filepos=851968 len=36864
        > extent_write_locked_range: r/i=5/259 locked page=851968 start=851968 len=36864
      
      Now we writeback the remaining dirty range, which is [832K, 868K).
      Causing the range [820K, 832K) never to be submitted, thus leaking the
      reserved space.
      
      This bug only affects subpage and zoned case.  For non-subpage and zoned
      case, we have exactly one sector for each page, thus no such partial dirty
      cases.
      
      For subpage and non-zoned case, we never go into run_delalloc_cow(), and
      normally all the dirty subpage ranges would be properly submitted inside
      __extent_writepage_io().
      
      [FIX]
      Just do not clear the page dirty at all inside extent_write_locked_range().
      As __extent_writepage_io() would do a more accurate, subpage compatible
      clear for page and subpage dirty flags anyway.
      
      Now the correct trace would look like this:
      
        > btrfs_dirty_pages: r/i=5/259 dirty start=774144 len=114688
        > btrfs_dirty_pages: r/i=5/259 dirty part of page=720896 off_in_page=53248 len_in_page=12288
        > btrfs_dirty_pages: r/i=5/259 dirty part of page=786432 off_in_page=0 len_in_page=65536
        > btrfs_dirty_pages: r/i=5/259 dirty part of page=851968 off_in_page=0 len_in_page=36864
      
      The page dirty part is still the same 3 pages.
      
        > btrfs_direct_write: r/i=5/259 start dio filepos=696320 len=102400
        > cow_file_range: r/i=5/259 add ordered extent filepos=774144 len=65536
        > extent_write_locked_range: r/i=5/259 locked page=720896 start=774144 len=65536
      
      And the writeback for the first 64K is still correct.
      
        > cow_file_range: r/i=5/259 add ordered extent filepos=839680 len=49152
        > extent_write_locked_range: r/i=5/259 locked page=786432 start=839680 len=49152
      
      Now with the fix, we can properly writeback the range [820K, 832K), and
      properly release the reserved data/metadata space.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      97713b1a
    • Qu Wenruo's avatar
      btrfs: lock subpage ranges in one go for writepage_delalloc() · d034cdb4
      Qu Wenruo authored
      If we have a subpage range like this for a 16K page with 4K sectorsize:
      
          0     4K     8K     12K     16K
          |/////|      |//////|       |
      
          |/////| = dirty range
      
      Currently writepage_delalloc() would go through the following steps:
      
      - lock range [0, 4K)
      - run delalloc range for [0, 4K)
      - lock range [8K, 12K)
      - run delalloc range for [8K 12K)
      
      So far it's fine for regular subpage writeback, as
      btrfs_run_delalloc_range() can only go into one of run_delalloc_nocow(),
      cow_file_range() and run_delalloc_compressed().
      
      But there is a special case for zoned subpage, where we will go
      through run_delalloc_cow(), which would create the ordered extent for the
      range and immediately submit the range.
      This would unlock the whole page range, causing all kinds of different
      ASSERT()s related to locked page.
      
      Address the page unlocking problem of run_delalloc_cow(), by changing
      the workflow to the following one:
      
      - lock range [0, 4K)
      - lock range [8K, 12K)
      - run delalloc range for [0, 4K)
      - run delalloc range for [8K, 12K)
      
      So that run_delalloc_cow() can only unlock the full page until the
      last lock user released.
      
      To do that:
      
      - Utilize subpage locked bitmap
        So for every delalloc range we found, call
        btrfs_folio_set_writer_lock() to populate the subpage locked bitmap,
        and later btrfs_folio_end_all_writers() if the page is fully unlocked.
      
        So we know there is a delalloc range that needs to be run later.
      
      - Save the @delalloc_end as @last_delalloc_end inside writepage_delalloc()
        Since subpage locked bitmap is only for ranges inside the page,
        meanwhile we can have delalloc range ends beyond our page boundary,
        we have to save the @last_delalloc_end just in case it's beyond our
        page boundary.
      
      Although there is one extra point to notice:
      
      - We need to handle errors in previous iteration
        Since we can have multiple locked delalloc ranges we have to call
        run_delalloc_ranges() multiple times.
        If we hit an error half way, we still need to unlock the remaining
        ranges.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d034cdb4
    • Qu Wenruo's avatar
      btrfs: subpage: introduce helpers to handle subpage delalloc locking · bca707e5
      Qu Wenruo authored
      Three new helpers are introduced for the incoming subpage delalloc locking
      change.
      
      - btrfs_folio_set_writer_lock()
        This is to mark specified range with subpage specific writer lock.
        After calling this, the subpage range can be proper unlocked by
        btrfs_folio_end_writer_lock()
      
      - btrfs_subpage_find_writer_locked()
        This is to find the writer locked subpage range in a page.
        With the help of btrfs_folio_set_writer_lock(), it can allow us to
        record and find previously locked subpage range without extra memory
        allocation.
      
      - btrfs_folio_end_all_writers()
        This is for the locked_page of __extent_writepage(), as there may be
        multiple subpage delalloc ranges locked.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bca707e5
    • Qu Wenruo's avatar
      btrfs: make __extent_writepage_io() to write specified range only · 21b5bef2
      Qu Wenruo authored
      Function __extent_writepage_io() is designed to find all dirty ranges of
      a page, and add the dirty ranges to the bio_ctrl for submission.
      It requires all the dirtied ranges to be covered by an ordered extent.
      
      It gets called in two locations, but one call site is not subpage aware:
      
      - __extent_writepage()
        It gets called when writepage_delalloc() returned 0, which means
        writepage_delalloc() has handled delalloc for all subpage sectors
        inside the page.
      
        So this call site is OK.
      
      - extent_write_locked_range()
        This call site is utilized by zoned support, and in this case, we may
        only run delalloc range for a subset of the page, like this: (64K page
        size)
      
        0     16K     32K     48K     64K
        |/////|       |///////|       |
      
        In the above case, if extent_write_locked_range() is only triggered for
        range [0, 16K), __extent_writepage_io() would still try to submit
        the dirty range of [32K, 48K), then it would not find any ordered
        extent for it and triggers various ASSERT()s.
      
      Fix this problem by:
      
      - Introducing @start and @len parameters to specify the range
      
        For the first call site, we just pass the whole page, and the behavior
        is not touched, since run_delalloc_range() for the page should have
        created all ordered extents for the page.
      
        For the second call site, we avoid touching anything beyond the
        range, thus avoiding the dirty range which is not yet covered by any
        delalloc range.
      
      - Making btrfs_folio_assert_not_dirty() subpage aware
        The only caller is inside __extent_writepage_io(), and since that
        caller now accepts a subpage range, we should also check the subpage
        range other than the whole page.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      21b5bef2
    • Jeff Johnson's avatar
      btrfs: add MODULE_DESCRIPTION() · 95359f63
      Jeff Johnson authored
      Fix the 'make W=1' warning:
      WARNING: modpost: missing MODULE_DESCRIPTION() in fs/btrfs/btrfs.o
      Signed-off-by: default avatarJeff Johnson <quic_jjohnson@quicinc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      95359f63
    • Anand Jain's avatar
      btrfs: rename err to ret in btrfs_drop_snapshot() · ca8ba2cc
      Anand Jain authored
      Drop the variable 'err', reuse the variable 'ret' by reinitializing it to
      zero where necessary.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ca8ba2cc
    • Anand Jain's avatar
      btrfs: rename err to ret in btrfs_recover_relocation() · ced1b1bd
      Anand Jain authored
      Fix coding style: rename the return variable to 'ret' in the function
      btrfs_recover_relocation instead of 'err'.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ced1b1bd
    • Anand Jain's avatar
      btrfs: rename ret to ret2 in btrfs_recover_relocation() · bd0d9a61
      Anand Jain authored
      A preparatory patch to rename 'err' to 'ret', but ret is already used as an
      intermediary return value, so first rename 'ret' to 'ret2'.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bd0d9a61
    • Anand Jain's avatar
      btrfs: rename ret to err in btrfs_recover_relocation() · ba69f42a
      Anand Jain authored
      In the function btrfs_recover_relocation(), currently the variable 'err'
      carries the return value and 'ret' holds the intermediary return value.
      However, in some lines, we don't need this two-step approach; we can
      directly use 'err'. So, optimize them, which requires reinitializing 'err'
      to zero at two locations.
      
      This is a preparatory patch to fix the code style by renaming 'err'
      to 'ret'.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ba69f42a
    • Anand Jain's avatar
      btrfs: rename err to ret in btrfs_cleanup_fs_roots() · 53d6c0da
      Anand Jain authored
      Since err represents the function return value, rename it as ret,
      and rename the original ret, which serves as a helper return value,
      to found. Also, optimize the code to continue call btrfs_put_root()
      for the rest of the root if even after btrfs_orphan_cleanup() returns
      error.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      53d6c0da
    • Qu Wenruo's avatar
      btrfs: cleanup duplicated parameters related to btrfs_create_dio_extent() · 04ef7631
      Qu Wenruo authored
      The following 3 parameters can be cleaned up using btrfs_file_extent
      structure:
      
      - len
        btrfs_file_extent::num_bytes
      
      - orig_block_len
        btrfs_file_extent::disk_num_bytes
      
      - ram_bytes
        btrfs_file_extent::ram_bytes
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      04ef7631
    • Qu Wenruo's avatar
      btrfs: cleanup duplicated parameters related to create_io_em() · 9fec848b
      Qu Wenruo authored
      Most parameters of create_io_em() can be replaced by the members with
      the same name inside btrfs_file_extent.
      
      Do a direct parameters cleanup here.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9fec848b
    • Qu Wenruo's avatar
      btrfs: cleanup duplicated parameters related to btrfs_alloc_ordered_extent · e9ea31fb
      Qu Wenruo authored
      All parameters after @filepos of btrfs_alloc_ordered_extent() can be
      replaced with btrfs_file_extent structure.
      
      This patch does the cleanup, meanwhile some points to note:
      
      - Move btrfs_file_extent structure to ordered-data.h
        The structure is needed by both btrfs_alloc_ordered_extent() and
        can_nocow_extent(), but since btrfs_inode.h includes
        ordered-data.h, so we need to move the structure to ordered-data.h.
      
      - Move the special handling of NOCOW/PREALLOC into
        btrfs_alloc_ordered_extent()
        This is to allow btrfs_split_ordered_extent() to properly split them
        for DIO.
        For now just move the handling into btrfs_alloc_ordered_extent() to
        simplify the callers.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e9ea31fb
    • Qu Wenruo's avatar
      btrfs: cleanup duplicated parameters related to can_nocow_file_extent_args · cdc627e6
      Qu Wenruo authored
      The following functions and structures can be simplified using the
      btrfs_file_extent structure:
      
      - can_nocow_extent()
        No need to return ram_bytes/orig_block_len through the parameter list,
        the @file_extent parameter contains all the needed info.
      
      - can_nocow_file_extent_args
        The following members are no longer needed:
      
        * disk_bytenr
          This one is confusing as it's not really the
          btrfs_file_extent_item::disk_bytenr, but where the IO would be,
          thus it's file_extent::disk_bytenr + file_extent::offset now.
      
        * num_bytes
          Now file_extent::num_bytes.
      
        * extent_offset
          Now file_extent::offset.
      
        * disk_num_bytes
          Now file_extent::disk_num_bytes.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cdc627e6
    • Qu Wenruo's avatar
      btrfs: remove extent_map::block_start member · c77a8c61
      Qu Wenruo authored
      The member extent_map::block_start can be calculated from
      extent_map::disk_bytenr + extent_map::offset for regular extents.
      And otherwise just extent_map::disk_bytenr.
      
      And this is already validated by the validate_extent_map().  Now we can
      remove the member.
      
      However there is a special case in btrfs_create_dio_extent() where we
      for NOCOW/PREALLOC ordered extents cannot directly use the resulting
      btrfs_file_extent, as btrfs_split_ordered_extent() cannot handle them
      yet.
      
      So for that call site, we pass file_extent->disk_bytenr +
      file_extent->num_bytes as disk_bytenr for the ordered extent, and 0 for
      offset.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c77a8c61
    • Qu Wenruo's avatar
      btrfs: remove extent_map::block_len member · e28b851e
      Qu Wenruo authored
      The extent_map::block_len is either extent_map::len (non-compressed
      extent) or extent_map::disk_num_bytes (compressed extent).
      
      Since we already have sanity checks to do the cross-checks between the
      new and old members, we can drop the old extent_map::block_len now.
      
      For most call sites, they can manually select extent_map::len or
      extent_map::disk_num_bytes, since most if not all of them have checked
      if the extent is compressed.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e28b851e
    • Qu Wenruo's avatar
      btrfs: remove extent_map::orig_start member · 4aa7b5d1
      Qu Wenruo authored
      Since we have extent_map::offset, the old extent_map::orig_start is just
      extent_map::start - extent_map::offset for non-hole/inline extents.
      
      And since the new extent_map::offset is already verified by
      validate_extent_map() while the old orig_start is not, let's just remove
      the old member from all call sites.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4aa7b5d1
    • Qu Wenruo's avatar
      btrfs: introduce extra sanity checks for extent maps · 3f255ece
      Qu Wenruo authored
      Since extent_map structure has the all the needed members to represent a
      file extent directly, we can apply all the file extent sanity checks to
      an extent map.
      
      The new sanity checks will cross check both the old members
      (block_start/block_len/orig_start) and the new members
      (disk_bytenr/disk_num_bytes/offset).
      
      There is a special case for offset/orig_start/start cross check, we only
      do such sanity check for compressed extent, as only compressed
      read/encoded write really utilize orig_start.
      This can be proved by the cleanup patch of orig_start.
      
      The checks happens at the following times:
      
      - add_extent_mapping()
        This is for newly added extent map
      
      - replace_extent_mapping()
        This is for btrfs_drop_extent_map_range() and split_extent_map()
      
      - try_merge_map()
      
      For a lot of call sites we have to properly populate all the members to
      pass the sanity check, meanwhile the following code needs extra
      modification:
      
      - setup_file_extents() from inode-tests
        The file extents layout of setup_file_extents() is already too invalid
        that tree-checker would reject most of them in real world.
      
        However there is just a special unaligned regular extent which has
        mismatched disk_num_bytes (4096) and ram_bytes (4096 - 1).
        So instead of dropping the whole test case, here we just unify
        disk_num_bytes and ram_bytes to 4096 - 1.
      
      - test_case_7() from extent-map-tests
        An extent is inserted with 16K length, but on-disk extent size is
        only 4K.
        This means it must be a compressed extent, so set the compressed flag
        for it.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3f255ece
    • Qu Wenruo's avatar
      btrfs: introduce new members for extent_map · 3d2ac992
      Qu Wenruo authored
      Introduce two new members for extent_map:
      
      - disk_bytenr
      - offset
      
      Both are matching the members with the same name inside
      btrfs_file_extent_items.
      
      For now this patch only touches those members when:
      
      - Reading btrfs_file_extent_items from disk
      - Inserting new holes
      - Merging two extent maps
        With the new disk_bytenr and disk_num_bytes, doing merging would be a
        little more complex, as we have 3 different cases:
      
        * Both extent maps are referring to the same data extents
          |<----- data extent A ----->|
             |<- em 1 ->|<- em 2 ->|
      
        * Both extent maps are referring to different data extents
          |<-- data extent A -->|<-- data extent B -->|
                     |<- em 1 ->|<- em 2 ->|
      
        * One of the extent maps is referring to a merged and larger data
          extent that covers both extent maps
      
          This is not really valid case other than some selftests.
          So this test case would be removed.
      
        A new helper merge_ondisk_extents() is introduced to handle the above
        valid cases.
      
      To properly assign values for those new members, a new btrfs_file_extent
      parameter is introduced to all the involved call sites.
      
      - For NOCOW writes the btrfs_file_extent would be exposed from
        can_nocow_file_extent().
      
      - For other writes, the members can be easily calculated
        As most of them have 0 offset and utilizing the whole on-disk data
        extent.
        The exception is encoded write, but thankfully that interface provided
        offset directly and all other needed info.
      
      For now, both the old members (block_start/block_len/orig_start) are
      co-existing with the new members (disk_bytenr/offset), meanwhile all the
      critical code is still using the old members only.
      
      The cleanup will happen later after all the old and new members are
      properly validated.
      
      There would be some re-ordering for the assignment of the extent_map
      members, now we follow the new ordering:
      
      - start and len
        Or file_pos and num_bytes for other structures.
      
      - disk_bytenr and disk_num_bytes
      - offset and ram_bytes
      - compression
      
      So expect some seemingly unrelated line movement.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3d2ac992
    • Qu Wenruo's avatar
      btrfs: export the expected file extent through can_nocow_extent() · 87a6962f
      Qu Wenruo authored
      Currently function can_nocow_extent() only returns members needed for
      extent_map.
      
      However since we will soon change the extent_map structure to be more
      like btrfs_file_extent_item, we want to expose the expected file extent
      caused by the NOCOW write for future usage.
      
      This introduces a new structure, btrfs_file_extent, to be a more
      memory access friendly representation of btrfs_file_extent_item.
      And use that structure to expose the expected file extent caused by the
      NOCOW write.
      
      For now there is no user of the new structure yet.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      87a6962f
    • Qu Wenruo's avatar
      btrfs: rename extent_map::orig_block_len to disk_num_bytes · e8fe524d
      Qu Wenruo authored
      This would make it very obvious that the member just matches
      btrfs_file_extent_item::disk_num_bytes.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e8fe524d
    • Filipe Manana's avatar
      btrfs: move fiemap code into its own file · 8996f61a
      Filipe Manana authored
      Currently the core of the fiemap code lives in extent_io.c, which does
      not make any sense because it's not related to extent IO at all (and it
      was not as well before the big rewrite of fiemap I did some time ago).
      The entry point for fiemap, btrfs_fiemap(), lives in inode.c since it's
      an inode operation.
      
      Since there's a significant amount of fiemap code, move all of it into a
      dedicated file, including its entry point inode.c:btrfs_fiemap().
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8996f61a
    • Filipe Manana's avatar
      btrfs: send: get rid of the label and gotos at ensure_commit_roots_uptodate() · f9763e4d
      Filipe Manana authored
      Now that there is a helper to commit the current transaction and we are
      using it, there's no need for the label and goto statements at
      ensure_commit_roots_uptodate(). So replace them with direct return
      statements that call btrfs_commit_current_transaction().
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f9763e4d
    • Filipe Manana's avatar
      btrfs: add and use helper to commit the current transaction · ded980eb
      Filipe Manana authored
      We have several places that attach to the current transaction with
      btrfs_attach_transaction_barrier() and then commit the transaction if
      there is one. Add a helper and use it to deduplicate this pattern.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ded980eb
    • Filipe Manana's avatar
      btrfs: scrub: avoid create/commit empty transaction at finish_extent_writes_for_zoned() · 1f8aee29
      Filipe Manana authored
      At finish_extent_writes_for_zoned() we use btrfs_join_transaction() to
      catch any running transaction and then commit it. This will however create
      a new and empty transaction in case there's no running transaction anymore
      (got committed by the transaction kthread or other task for example) or
      there's a running transaction finishing its commit and with a state >=
      TRANS_STATE_UNBLOCKED. In the former case we don't need to do anything
      while in the second case we just need to wait for the transaction to
      complete its commit.
      
      So improve this by using btrfs_attach_transaction_barrier() instead, which
      does not create a new transaction if there's none running, and if there's
      a current transaction that is committing, it will wait for it to fully
      commit and not create a new transaction. This helps avoiding creating and
      committing empty transactions, saving IO, time and unnecessary rotation of
      the backup roots in the super block.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1f8aee29
    • Filipe Manana's avatar
      btrfs: send: avoid create/commit empty transaction at ensure_commit_roots_uptodate() · 0557feab
      Filipe Manana authored
      At ensure_commit_roots_uptodate() we use btrfs_join_transaction() to
      catch any running transaction and then commit it. This will however create
      a new and empty transaction in case there's no running transaction anymore
      (got committed by the transaction kthread or other task for example) or
      there's a running transaction finishing its commit and with a state >=
      TRANS_STATE_UNBLOCKED. In the former case we don't need to do anything
      while in the second case we just need to wait for the transaction to
      complete its commit.
      
      So improve this by using btrfs_attach_transaction_barrier() instead, which
      does not create a new transaction if there's none running, and if there's
      a current transaction that is committing, it will wait for it to fully
      commit and not create a new transaction. This helps avoiding creating and
      committing empty transactions, saving IO, time and unnecessary rotation of
      the backup roots in the super block.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0557feab
    • Filipe Manana's avatar
      btrfs: send: make ensure_commit_roots_uptodate() simpler and more efficient · 9e79c497
      Filipe Manana authored
      Before starting a send operation we have to make sure that every root has
      its commit root matching the regular root, to that send doesn't find stale
      inodes in the commit root (inodes that were deleted in the regular root)
      and fails the inode lookups with -ESTALE.
      
      Currently we keep looking for roots used by the send operation and as soon
      as we find one we commit the current transaction (or a new one since
      btrfs_join_transaction() creates one if there isn't any running or the
      running one is in a state >= TRANS_STATE_UNBLOCKED). It's pointless to
      keep looking until we don't find any, because after the first transaction
      commit all the other roots are updated too, as they were already tagged in
      the fs_info->fs_roots_radix radix tree when they were modified in order to
      have a commit root different from the regular root.
      
      Currently we are also always passing the main send root into
      btrfs_join_transaction(), which despite not having any functional issue,
      it is not optimal because in case the root wasn't modified we end up
      adding it to fs_info->fs_roots_radix and then update its root item in the
      root tree when committing the transaction, causing unnecessary work.
      
      So simplify and make this more efficient by removing the looping and by
      passing the first root we found that is modified as the argument to
      btrfs_join_transaction().
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9e79c497
    • Filipe Manana's avatar
      btrfs: avoid create and commit empty transaction when committing super · cab0d862
      Filipe Manana authored
      At btrfs_commit_super(), called in a few contexts such as when unmounting
      a filesystem, we use btrfs_join_transaction() to catch any running
      transaction and then commit it. This will however create a new and empty
      transaction in case there's no running transaction or there's a running
      transaction with a state >= TRANS_STATE_UNBLOCKED.
      
      As we just want to be sure that any existing transaction is fully
      committed, we can use btrfs_attach_transaction_barrier() instead of
      btrfs_join_transaction(), therefore avoiding the creation and commit of
      empty transactions, which only waste IO and causes rotation of the
      precious backup roots.
      
      Example where we create and commit a pointless empty transaction:
      
        $ mkfs.btrfs -f /dev/sdj
        $ btrfs inspect-internal dump-super /dev/sdj | grep -e '^generation'
        generation            6
      
        $ mount /dev/sdj /mnt/sdj
        $ touch /mnt/sdj/foo
      
        # Commit the currently open transaction. Just 'sync' or wait ~30
        # seconds for the transaction kthread to commit it.
        $ sync
      
        $ btrfs inspect-internal dump-super /dev/sdj | grep -e '^generation'
        generation            7
      
        $ umount /mnt/sdj
      
        $ btrfs inspect-internal dump-super /dev/sdj | grep -e '^generation'
        generation            8
      
      The transaction with id 8 was pointless, an empty transaction that did
      not achieve anything.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cab0d862
    • Filipe Manana's avatar
      btrfs: qgroup: avoid start/commit empty transaction when flushing reservations · de18fba8
      Filipe Manana authored
      When flushing reservations we are using btrfs_join_transaction() to get a
      handle for the current transaction and then commit it to try to release
      space. However btrfs_join_transaction() has some undesirable consequences:
      
      1) If there's no running transaction, it will create one, and we will
         commit it right after. This is unnecessary because it will not release
         any space, and it will result in unnecessary IO and rotation of backup
         roots in the superblock;
      
      2) If there's a current transaction and that transaction is committing
         (its state is >= TRANS_STATE_COMMIT_DOING), it will wait for that
         transaction to almost finish its commit (for its state to be >=
         TRANS_STATE_UNBLOCKED) and then start and return a new transaction.
      
         We will then commit that new transaction, which is pointless because
         all we wanted was to wait for the current (previous) transaction to
         fully finish its commit (state == TRANS_STATE_COMPLETED), and by
         starting and committing a new transaction we are wasting IO too and
         causing unnecessary rotation of backup roots in the superblock.
      
      So improve this by using btrfs_attach_transaction_barrier() instead, which
      does not create a new transaction if there's none running, and if there's
      a current transaction that is committing, it will wait for it to fully
      commit and not create a new transaction.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      de18fba8
    • David Sterba's avatar
      btrfs: simplify range parameters of btrfs_wait_ordered_roots() · 42317ab4
      David Sterba authored
      The range is specified only in two ways, we can simplify the case for
      the whole filesystem range as a NULL block group parameter.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      42317ab4
    • Qu Wenruo's avatar
      btrfs: automatically remove the subvolume qgroup · 839d6ea4
      Qu Wenruo authored
      Currently if we fully clean a subvolume (not only delete its directory,
      but fully clean all it's related data and root item), the associated
      qgroup would not be removed.
      
      We have "btrfs qgroup clear-stale" to handle such 0 level qgroups.
      
      Change the behavior to automatically removie the qgroup of a fully
      cleaned subvolume when possible:
      
      - Full qgroup but still consistent
        We can and should remove the qgroup.
        The qgroup numbers should be 0, without any rsv.
      
      - Full qgroup but inconsistent
        Can happen with drop_subtree_threshold feature (skip accounting
        and mark qgroup inconsistent).
      
        We can and should remove the qgroup.
        Higher level qgroup numbers will be incorrect, but since qgroup
        is already inconsistent, it should not be a problem.
      
      - Squota mode
        This is the special case, we can only drop the qgroup if its numbers
        are all 0.
      
        This would be handled by can_delete_qgroup(), so we only need to check
        the return value and ignore the -EBUSY error.
      
      Link: https://bugzilla.suse.com/show_bug.cgi?id=1222847Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      839d6ea4