  1. 19 Jun, 2023 6 commits
  2. 17 May, 2023 1 commit
    • btrfs: scrub: try harder to mark RAID56 block groups read-only · 7561551e
      Qu Wenruo authored
      Currently we allow a block group not to be marked read-only for scrub.
      
      But for RAID56 block groups, if we require the block group to be
      read-only, then we can use the cached content from the scrub stripes to
      reduce unnecessary RAID56 reads.
      
      So this patch will:
      
      - Make btrfs_inc_block_group_ro() try harder
        During my tests, for cases like btrfs/061 and btrfs/064, we can hit
        ENOSPC from btrfs_inc_block_group_ro() calls during scrub.
      
        The reason is that if we only have a single data chunk and are trying
        to scrub it, we won't have any space left for newer data writes.

        But this check should be done by the caller, especially since for
        scrub we only temporarily mark the chunk read-only.
        And newer data writes will always try to allocate a new data chunk
        when needed.
      
      - Return error for scrub if we failed to mark a RAID56 chunk read-only
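
      A minimal sketch of the scrub-side consequence described above (the
      surrounding loop, the argument values and the error handling are
      simplified assumptions, not the verbatim kernel code):

        ret = btrfs_inc_block_group_ro(cache, false);
        if (ret && (cache->flags & BTRFS_BLOCK_GROUP_RAID56_MASK)) {
                /*
                 * RAID56 scrub relies on cached stripe content, which is only
                 * safe when the block group is read-only, so give up with the
                 * error instead of scrubbing the chunk read-write.
                 */
                btrfs_put_block_group(cache);
                break;
        }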
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. 17 Apr, 2023 4 commits
    • btrfs: scrub: remove the old scrub recheck code · e9255d6c
      Qu Wenruo authored
      The old scrub code has a different entry point to verify the content,
      and since we have removed the writeback path, we can now start removing
      the re-check part, including:
      
      - scrub_recover structure
      - scrub_sector::recover member
      - function scrub_setup_recheck_block()
      - function scrub_recheck_block()
      - function scrub_recheck_block_checksum()
      - function scrub_repair_block_group_good_copy()
      - function scrub_repair_sector_from_good_copy()
      - function scrub_is_page_on_raid56()
      
      - function full_stripe_lock()
      - function search_full_stripe_lock()
      - function get_full_stripe_logical()
      - function insert_full_stripe_lock()
      - function lock_full_stripe()
      - function unlock_full_stripe()
      - btrfs_block_group::full_stripe_locks_root member
      - btrfs_full_stripe_locks_tree structure
        This infrastructure was there to ensure RAID56 scrub handles recovery
        and P/Q stripe scrubbing correctly.

        It is no longer needed: before the P/Q scrub we now wait for all the
        involved data stripes to be scrubbed first, and the RAID56 code has an
        internal lock to ensure there is no race within the same full stripe.
      
      - function scrub_print_warning()
      - function scrub_get_recover()
      - function scrub_put_recover()
      - function scrub_handle_errored_block()
      - function scrub_setup_recheck_block()
      - function scrub_bio_wait_endio()
      - function scrub_submit_raid56_bio_wait()
      - function scrub_recheck_block_on_raid56()
      - function scrub_recheck_block()
      - function scrub_recheck_block_checksum()
      - function scrub_repair_block_from_good_copy()
      - function scrub_repair_sector_from_good_copy()
      
      And two more functions exported temporarily for later cleanup:
      
      - alloc_scrub_sector()
      - alloc_scrub_block()
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove bytes_used argument from btrfs_make_block_group() · 5758d1bd
      Filipe Manana authored
      The only caller of btrfs_make_block_group() always passes 0 as the value
      for the bytes_used argument, so remove it.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: reduce div64 calls by limiting the number of stripes of a chunk to u32 · 6ded22c1
      Qu Wenruo authored
      There are quite a few div64 calls inside btrfs_map_block() and its
      variants.

      Such calls are for @stripe_nr, which is the number of stripes before our
      logical bytenr inside a chunk.
      
      However, we can eliminate such div64 calls by simply reducing the width
      of @stripe_nr from 64 to 32 bits.

      This can be done because our chunk size limit is already 10G and the
      stripe length is fixed at 64K, so a u32 is more than enough to hold the
      number of stripes.

      With this width reduction we can get rid of the slower div64 calls, and
      of the extra warnings on certain 32-bit arches.
      
      This patch does the following:
      
      - Add a new tree-checker chunk validation on chunk length
        Make sure no chunk can reach 256G, which can also act as a bitflip
        checker.
      
      - Reduce the width from u64 to u32 for @stripe_nr variables
      
      - Replace unnecessary div64 calls with regular modulo and division
        32-bit division and modulo are much faster than their 64-bit
        counterparts, and we are finally free of div64, at least in the
        involved functions.
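
      A sketch of the idea (not the verbatim kernel code; the variables around
      the calculation are illustrative):

        /* offset of @logical inside the chunk */
        u64 offset = logical - chunk_start;

        /* the stripe number fits in 32 bits: chunks stay far below 256G, stripes are 64K */
        u32 stripe_nr = offset >> BTRFS_STRIPE_LEN_SHIFT;   /* was a div64 by stripe_len */
        u32 stripe_index = stripe_nr % map->num_stripes;    /* native 32-bit modulo */

        stripe_nr /= map->num_stripes;                       /* native 32-bit division */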
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: replace map_lookup->stripe_len by BTRFS_STRIPE_LEN · a97699d1
      Qu Wenruo authored
      Currently btrfs doesn't support stripe lengths other than 64KiB, and
      this is already enforced by the tree-checker.

      There is really no point in recording that fixed value in map_lookup for
      now; it can all be replaced with BTRFS_STRIPE_LEN.
      
      Furthermore we can use the fixed stripe length for the following
      optimizations:
      
      - Use BTRFS_STRIPE_LEN_SHIFT to replace some 64bit division
        Now we only need to do a right shift.
      
        The value of BTRFS_STRIPE_LEN itself is already far too large to be a
        shift count, so if we accidentally use BTRFS_STRIPE_LEN for a bit
        shift, a compiler warning is triggered.

        Thus this bit shift optimization is safe.
      
      - Use BTRFS_STRIPE_LEN_MASK to calculate the offset inside a stripe
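
      A minimal sketch of the shift/mask pattern (the constants are shown
      expanded for illustration; in the kernel they are derived from SZ_64K):

        #define BTRFS_STRIPE_LEN        (64U * 1024U)
        #define BTRFS_STRIPE_LEN_SHIFT  (16)                     /* log2(64K) */
        #define BTRFS_STRIPE_LEN_MASK   (BTRFS_STRIPE_LEN - 1)

        stripe_nr     = offset >> BTRFS_STRIPE_LEN_SHIFT;  /* replaces a 64-bit division */
        stripe_offset = offset & BTRFS_STRIPE_LEN_MASK;    /* offset inside the stripe   */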
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 15 Mar, 2023 2 commits
  5. 08 Mar, 2023 1 commit
    • btrfs: fix block group item corruption after inserting new block group · 675dfe12
      Filipe Manana authored
      We can often end up inserting a block group item, for a new block group,
      with a wrong value for the used bytes field.
      
      This happens if for the new allocated block group, in the same transaction
      that created the block group, we have tasks allocating extents from it as
      well as tasks removing extents from it.
      
      For example:
      
      1) Task A creates a metadata block group X;
      
      2) Two extents are allocated from block group X, so its "used" field is
         updated to 32K, and its "commit_used" field remains as 0;
      
      3) Transaction commit starts, by some task B, and it enters
         btrfs_start_dirty_block_groups(). There it tries to update the block
         group item for block group X, which currently has its "used" field with
         a value of 32K. But that fails since the block group item was not yet
         inserted, and so on failure update_block_group_item() sets the
         "commit_used" field of the block group back to 0;
      
      4) The block group item is inserted by task A, when for example
         btrfs_create_pending_block_groups() is called when releasing its
         transaction handle. This results in insert_block_group_item() inserting
         the block group item in the extent tree (or block group tree), with a
         "used" field having a value of 32K, but without updating the
         "commit_used" field in the block group, which remains with value of 0;
      
      5) The two extents are freed from block X, so its "used" field changes
         from 32K to 0;
      
      6) The transaction commit by task B continues, it enters
         btrfs_write_dirty_block_groups() which calls update_block_group_item()
         for block group X, and there it decides to skip the block group item
         update, because "used" has a value of 0 and "commit_used" has a value
         of 0 too.
      
         As a result, we end up with a block group item having a 32K "used"
         field but no extents allocated from it.
      
      When this issue happens, a btrfs check reports an error like this:
      
         [1/7] checking root items
         [2/7] checking extents
         block group [1104150528 1073741824] used 39796736 but extent items used 0
         ERROR: errors found in extent allocation tree or chunk allocation
         (...)
      
      Fix this by making insert_block_group_item() update the block group's
      "commit_used" field.
      
      Fixes: 7248e0ce ("btrfs: skip update of block group item if used bytes are the same")
      CC: stable@vger.kernel.org # 6.2+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 06 Mar, 2023 2 commits
    • btrfs: fix percent calculation for bg reclaim message · 95cd356c
      Johannes Thumshirn authored
      We have a report that the info message for block group reclaim is
      crossing the 100% used mark.

      This is happening because we were truncating the divisor of the division
      (the block_group->length) to a 32-bit value.

      Fix this by using div64_u64() so the divisor is not truncated.

      In the worst case this can lead to a division by zero, which should be
      possible to trigger with a 4-disk RAID0 where each device is large
      enough:
      
        $ mkfs.btrfs  -f /dev/test/scratch[1234] -m raid1 -d raid0
        btrfs-progs v6.1
        [...]
        Filesystem size:    40.00GiB
        Block group profiles:
          Data:             RAID0             4.00GiB <<<
          Metadata:         RAID1           256.00MiB
          System:           RAID1             8.00MiB
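
      The core of the problem as a sketch (the percentage computation shown is
      illustrative, not the exact reclaim message code):

        /*
         * div_u64() takes a u32 divisor, so a block group length that is a
         * multiple of 4GiB truncates to 0 (division by zero at worst), and
         * other truncations can inflate the percentage past 100%.
         */
        used_pct = div_u64(bg->used * 100, bg->length);     /* broken: divisor truncated */
        used_pct = div64_u64(bg->used * 100, bg->length);   /* fixed: full 64-bit divisor */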
      Reported-by: Forza <forza@tnonline.net>
      Link: https://lore.kernel.org/linux-btrfs/e99483.c11a58d.1863591ca52@tnonline.net/
      Fixes: 5f93e776 ("btrfs: zoned: print unusable percentage when reclaiming block groups")
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add Qu's note ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix potential dead lock in size class loading logic · 12148367
      Boris Burkov authored
      As reported by Filipe, there's a potential deadlock caused by using
      btrfs_search_forward() on a commit root. The locking there is
      unconditional, even if ->skip_locking and ->search_commit_root are set;
      btrfs_search_forward() is not meant to be used on commit roots, so it
      always takes locks.
      
      So if another task is COWing a child node of the same root node and
      then needs to wait for block group caching to complete when trying to
      allocate a metadata extent, it deadlocks.
      
      For example:
      
      [539604.239315] sysrq: Show Blocked State
      [539604.240133] task:kworker/u16:6   state:D stack:0     pid:2119594 ppid:2      flags:0x00004000
      [539604.241613] Workqueue: btrfs-cache btrfs_work_helper [btrfs]
      [539604.242673] Call Trace:
      [539604.243129]  <TASK>
      [539604.243925]  __schedule+0x41d/0xee0
      [539604.244797]  ? rcu_read_lock_sched_held+0x12/0x70
      [539604.245399]  ? rwsem_down_read_slowpath+0x185/0x490
      [539604.246111]  schedule+0x5d/0xf0
      [539604.246593]  rwsem_down_read_slowpath+0x2da/0x490
      [539604.247290]  ? rcu_barrier_tasks_trace+0x10/0x20
      [539604.248090]  __down_read_common+0x3d/0x150
      [539604.248702]  down_read_nested+0xc3/0x140
      [539604.249280]  __btrfs_tree_read_lock+0x24/0x100 [btrfs]
      [539604.250097]  btrfs_read_lock_root_node+0x48/0x60 [btrfs]
      [539604.250915]  btrfs_search_forward+0x59/0x460 [btrfs]
      [539604.251781]  ? btrfs_global_root+0x50/0x70 [btrfs]
      [539604.252476]  caching_thread+0x1be/0x920 [btrfs]
      [539604.253167]  btrfs_work_helper+0xf6/0x400 [btrfs]
      [539604.253848]  process_one_work+0x24f/0x5a0
      [539604.254476]  worker_thread+0x52/0x3b0
      [539604.255166]  ? __pfx_worker_thread+0x10/0x10
      [539604.256047]  kthread+0xf0/0x120
      [539604.256591]  ? __pfx_kthread+0x10/0x10
      [539604.257212]  ret_from_fork+0x29/0x50
      [539604.257822]  </TASK>
      [539604.258233] task:btrfs-transacti state:D stack:0     pid:2236474 ppid:2      flags:0x00004000
      [539604.259802] Call Trace:
      [539604.260243]  <TASK>
      [539604.260615]  __schedule+0x41d/0xee0
      [539604.261205]  ? rcu_read_lock_sched_held+0x12/0x70
      [539604.262000]  ? rwsem_down_read_slowpath+0x185/0x490
      [539604.262822]  schedule+0x5d/0xf0
      [539604.263374]  rwsem_down_read_slowpath+0x2da/0x490
      [539604.266228]  ? lock_acquire+0x160/0x310
      [539604.266917]  ? rcu_read_lock_sched_held+0x12/0x70
      [539604.267996]  ? lock_contended+0x19e/0x500
      [539604.268720]  __down_read_common+0x3d/0x150
      [539604.269400]  down_read_nested+0xc3/0x140
      [539604.270057]  __btrfs_tree_read_lock+0x24/0x100 [btrfs]
      [539604.271129]  btrfs_read_lock_root_node+0x48/0x60 [btrfs]
      [539604.272372]  btrfs_search_slot+0x143/0xf70 [btrfs]
      [539604.273295]  update_block_group_item+0x9e/0x190 [btrfs]
      [539604.274282]  btrfs_start_dirty_block_groups+0x1c4/0x4f0 [btrfs]
      [539604.275381]  ? __mutex_unlock_slowpath+0x45/0x280
      [539604.276390]  btrfs_commit_transaction+0xee/0xed0 [btrfs]
      [539604.277391]  ? lock_acquire+0x1a4/0x310
      [539604.278080]  ? start_transaction+0xcb/0x6c0 [btrfs]
      [539604.279099]  transaction_kthread+0x142/0x1c0 [btrfs]
      [539604.279996]  ? __pfx_transaction_kthread+0x10/0x10 [btrfs]
      [539604.280673]  kthread+0xf0/0x120
      [539604.281050]  ? __pfx_kthread+0x10/0x10
      [539604.281496]  ret_from_fork+0x29/0x50
      [539604.281966]  </TASK>
      [539604.282255] task:fsstress        state:D stack:0     pid:2236483 ppid:1      flags:0x00004006
      [539604.283897] Call Trace:
      [539604.284700]  <TASK>
      [539604.285088]  __schedule+0x41d/0xee0
      [539604.285660]  schedule+0x5d/0xf0
      [539604.286175]  btrfs_wait_block_group_cache_progress+0xf2/0x170 [btrfs]
      [539604.287342]  ? __pfx_autoremove_wake_function+0x10/0x10
      [539604.288450]  find_free_extent+0xd93/0x1750 [btrfs]
      [539604.289256]  ? _raw_spin_unlock+0x29/0x50
      [539604.289911]  ? btrfs_get_alloc_profile+0x127/0x2a0 [btrfs]
      [539604.290843]  btrfs_reserve_extent+0x147/0x290 [btrfs]
      [539604.291943]  btrfs_alloc_tree_block+0xcb/0x3e0 [btrfs]
      [539604.292903]  __btrfs_cow_block+0x138/0x580 [btrfs]
      [539604.293773]  btrfs_cow_block+0x10e/0x240 [btrfs]
      [539604.294595]  btrfs_search_slot+0x7f3/0xf70 [btrfs]
      [539604.295585]  btrfs_update_device+0x71/0x1b0 [btrfs]
      [539604.296459]  btrfs_chunk_alloc_add_chunk_item+0xe0/0x340 [btrfs]
      [539604.297489]  btrfs_chunk_alloc+0x1bf/0x490 [btrfs]
      [539604.298335]  find_free_extent+0x6fa/0x1750 [btrfs]
      [539604.299174]  ? _raw_spin_unlock+0x29/0x50
      [539604.299950]  ? btrfs_get_alloc_profile+0x127/0x2a0 [btrfs]
      [539604.300918]  btrfs_reserve_extent+0x147/0x290 [btrfs]
      [539604.301797]  btrfs_alloc_tree_block+0xcb/0x3e0 [btrfs]
      [539604.303017]  ? lock_release+0x224/0x4a0
      [539604.303855]  __btrfs_cow_block+0x138/0x580 [btrfs]
      [539604.304789]  btrfs_cow_block+0x10e/0x240 [btrfs]
      [539604.305611]  btrfs_search_slot+0x7f3/0xf70 [btrfs]
      [539604.306682]  ? btrfs_global_root+0x50/0x70 [btrfs]
      [539604.308198]  lookup_inline_extent_backref+0x17b/0x7a0 [btrfs]
      [539604.309254]  lookup_extent_backref+0x43/0xd0 [btrfs]
      [539604.310122]  __btrfs_free_extent+0xf8/0x810 [btrfs]
      [539604.310874]  ? lock_release+0x224/0x4a0
      [539604.311724]  ? btrfs_merge_delayed_refs+0x17b/0x1d0 [btrfs]
      [539604.313023]  __btrfs_run_delayed_refs+0x2ba/0x1260 [btrfs]
      [539604.314271]  btrfs_run_delayed_refs+0x8f/0x1c0 [btrfs]
      [539604.315445]  ? rcu_read_lock_sched_held+0x12/0x70
      [539604.316706]  btrfs_commit_transaction+0xa2/0xed0 [btrfs]
      [539604.317855]  ? do_raw_spin_unlock+0x4b/0xa0
      [539604.318544]  ? _raw_spin_unlock+0x29/0x50
      [539604.319240]  create_subvol+0x53d/0x6e0 [btrfs]
      [539604.320283]  btrfs_mksubvol+0x4f5/0x590 [btrfs]
      [539604.321220]  __btrfs_ioctl_snap_create+0x11b/0x180 [btrfs]
      [539604.322307]  btrfs_ioctl_snap_create_v2+0xc6/0x150 [btrfs]
      [539604.323295]  btrfs_ioctl+0x9f7/0x33e0 [btrfs]
      [539604.324331]  ? rcu_read_lock_sched_held+0x12/0x70
      [539604.325137]  ? lock_release+0x224/0x4a0
      [539604.325808]  ? __x64_sys_ioctl+0x87/0xc0
      [539604.326467]  __x64_sys_ioctl+0x87/0xc0
      [539604.327109]  do_syscall_64+0x38/0x90
      [539604.327875]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
      [539604.328792] RIP: 0033:0x7f05a7babaeb
      
      This needs to use regular btrfs_search_slot() with some skip and stop
      logic.
      
      Since we only consider five samples (five search slots), don't bother
      with the complexity of looking for commit_root_sem contention. If
      necessary, it can be added to the load function in between samples.
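
      A sketch of the lock-free commit root search pattern the fix moves to
      (the key setup and surrounding loop are illustrative; the point is the
      search_commit_root/skip_locking combination with btrfs_search_slot()):

        struct btrfs_path *path = btrfs_alloc_path();

        /* read from the commit root without taking tree locks */
        path->skip_locking = 1;
        path->search_commit_root = 1;

        key.objectid = search_offset;
        key.type = BTRFS_EXTENT_ITEM_KEY;
        key.offset = 0;

        ret = btrfs_search_slot(NULL, extent_root, &key, path, 0, 0);
        /* ... then walk a bounded number of items and stop at the sample ... */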
      Reported-by: Filipe Manana <fdmanana@kernel.org>
      Link: https://lore.kernel.org/linux-btrfs/CAL3q7H7eKMD44Z1+=Kb-1RFMMeZpAm2fwyO59yeBwCcSOU80Pg@mail.gmail.com/
      Fixes: c7eec3d9 ("btrfs: load block group size class when caching")
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
  7. 15 Feb, 2023 1 commit
  8. 13 Feb, 2023 4 commits
    • btrfs: don't use size classes for zoned file systems · cb0922f2
      Boris Burkov authored
      When a file system has ZNS devices, which are constrained by a maximum
      number of active block groups, not being able to use all the block
      groups for every allocation is not ideal and could cause us to loop a
      lot with mixed size allocations.

      In general, since zoned doesn't write into gaps behind where block
      groups are currently writing, it is not susceptible to the sort of
      fragmentation that size classes are designed to solve, so we can skip
      size classes for zoned file systems, even though there would probably be
      no harm for SMR devices.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: load block group size class when caching · c7eec3d9
      Boris Burkov authored
      Since the size class is an artifact of an arbitrary anti fragmentation
      strategy, it doesn't really make sense to persist it. Furthermore, most
      of the size class logic assumes fresh block groups. That is of course
      not a reasonable assumption -- we will be upgrading kernels with
      existing filesystems whose block groups are not classified.
      
      To work around those issues, implement logic to compute the size class
      of the block groups as we cache them in. To perfectly assess the state
      of a block group, we would have to read the entire extent tree (since
      the free space cache mashes together contiguous extent items) which
      would be prohibitively expensive for larger file systems with more
      extents.
      
      We can do it relatively cheaply by implementing a simple heuristic of
      sampling a handful of extents and picking the smallest one we see. In
      the happy case where the block group was classified, we will only see
      extents of the correct size. In the unhappy case, we will hopefully find
      one of the smaller extents, but there is no perfect answer anyway.
      Autorelocation will eventually churn up the block group if there is
      significant freeing anyway.
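
      A sketch of that sampling heuristic (helper and field names here are
      illustrative, not necessarily the ones used in the kernel):

        u64 min_size = block_group->length;
        u64 size;
        int i, ret;

        /* sample a handful of extent items spread across the block group */
        for (i = 0; i < 5; i++) {
                ret = sample_block_group_extent_item(caching_ctl, block_group,
                                                     i, 5, &size);
                if (ret < 0)
                        return ret;
                if (size)
                        min_size = min(min_size, size);
        }
        /* classify by the smallest extent we saw */
        block_group->size_class = btrfs_calc_block_group_size_class(min_size);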
      
      There was no regression in mount performance at end state of the fsperf
      test suite, and the delay until the block group is marked cached is
      minimized by the constant number of extent samples.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce size class to block group allocator · 52bb7a21
      Boris Burkov authored
      The aim of this patch is to reduce the fragmentation of block groups
      under certain unhappy workloads. It is particularly effective when the
      size of extents correlates with their lifetime, which is something we
      have observed causing fragmentation in the fleet at Meta.
      
      This patch categorizes extents into size classes:
      
      - x < 128KiB: "small"
      - 128KiB < x < 8MiB: "medium"
      - x > 8MiB: "large"
      
      and as much as possible reduces allocations of extents into block groups
      that don't match the size class. This takes advantage of any (possible)
      correlation between size and lifetime and also leaves behind predictable
      re-usable gaps when extents are freed; small writes don't gum up bigger
      holes.
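
      Roughly how such a classification looks (the enum and constant names are
      assumptions based on the thresholds above, not a quote of the kernel
      code):

        enum btrfs_block_group_size_class btrfs_calc_block_group_size_class(u64 size)
        {
                if (size <= SZ_128K)
                        return BTRFS_BG_SZ_SMALL;
                if (size <= SZ_8M)
                        return BTRFS_BG_SZ_MEDIUM;
                return BTRFS_BG_SZ_LARGE;
        }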
      
      Size classes are implemented in the following way:
      
      - Mark each new block group with a size class of the first allocation
        that goes into it.
      
      - Add two new passes to ffe (find_free_extent): "unset size class" and
        "wrong size class". First, try only matching block groups, then try
        unset ones, then allow allocation of new ones, and finally allow
        mismatched block groups.
      
      - Filtering is done just by skipping inappropriate ones, there is no
        special size class indexing.
      
      Other solutions I considered were:
      
      - A best fit allocator with an rb-tree. This worked well, as small
        writes didn't leak big holes from large freed extents, but led to
        regressions in ffe and write performance due to lock contention on
        the rb-tree with every allocation possibly updating it in parallel.
        Perhaps something clever could be done to do the updates in the
        background while being "right enough".
      
      - A fixed size "working set". This prevents freeing an extent
        drastically changing where writes currently land, and seems like a
        good option too. Doesn't take advantage of size in any way.
      
      - The same size class idea, but implemented with xarray marks. This
        turned out to be slower than looping the linked list and skipping
        wrong block groups, and is also less flexible since we must have only
        3 size classes (max #marks). With the current approach we can have as
        many as we like.
      
      Performance testing was done via: https://github.com/josefbacik/fsperf
      Of particular relevance are the new fragmentation specific tests.
      
      A brief summary of the testing results:
      
      - Neutral results on existing tests. There are some minor regressions
        and improvements here and there, but nothing that truly stands out as
        notable.
      - Improvement on new tests where size class and extent lifetime are
        correlated. Fragmentation in these cases is completely eliminated
        and write performance is generally a little better. There is also
        significant improvement where extent sizes are just a bit larger than
        the size class boundaries.
      - Regression on one new test where the allocations are intentionally
        sized a hair under the borders of the size classes. Results are
        neutral on the test that intentionally attacks this new scheme by
        mixing extent size and lifetime.
      
      The full dump of the performance results can be found here:
      https://bur.io/fsperf/size-class-2022-11-15.txt
      (there are ANSI escape codes, so best to curl and view in terminal)
      
      Here is a snippet from the full results for a new test which mixes
      buffered writes appending to a long lived set of files and large short
      lived fallocates:
      
      bufferedappendvsfallocate results
               metric             baseline       current        stdev            diff
      ======================================================================================
      avg_commit_ms                    31.13         29.20          2.67     -6.22%
      bg_count                            14         15.60             0     11.43%
      commits                          11.10         12.20          0.32      9.91%
      elapsed                          27.30         26.40          2.98     -3.30%
      end_state_mount_ns         11122551.90   10635118.90     851143.04     -4.38%
      end_state_umount_ns           1.36e+09      1.35e+09   12248056.65     -1.07%
      find_free_extent_calls       116244.30     114354.30        964.56     -1.63%
      find_free_extent_ns_max      599507.20    1047168.20     103337.08     74.67%
      find_free_extent_ns_mean       3607.19       3672.11        101.20      1.80%
      find_free_extent_ns_min            500           512          6.67      2.40%
      find_free_extent_ns_p50           2848          2876         37.65      0.98%
      find_free_extent_ns_p95           4916          5000         75.45      1.71%
      find_free_extent_ns_p99       20734.49      20920.48       1670.93      0.90%
      frag_pct_max                     61.67             0          8.05   -100.00%
      frag_pct_mean                    43.59             0          6.10   -100.00%
      frag_pct_min                     25.91             0         16.60   -100.00%
      frag_pct_p50                     42.53             0          7.25   -100.00%
      frag_pct_p95                     61.67             0          8.05   -100.00%
      frag_pct_p99                     61.67             0          8.05   -100.00%
      fragmented_bg_count               6.10             0          1.45   -100.00%
      max_commit_ms                    49.80            46          5.37     -7.63%
      sys_cpu                           2.59          2.62          0.29      1.39%
      write_bw_bytes                1.62e+08      1.68e+08   17975843.50      3.23%
      write_clat_ns_mean            57426.39      54475.95       2292.72     -5.14%
      write_clat_ns_p50             46950.40      42905.60       2101.35     -8.62%
      write_clat_ns_p99            148070.40     143769.60       2115.17     -2.90%
      write_io_kbytes                4194304       4194304             0      0.00%
      write_iops                     2476.15       2556.10        274.29      3.23%
      write_lat_ns_max            2101667.60    2251129.50     370556.59      7.11%
      write_lat_ns_mean             59374.91      55682.00       2523.09     -6.22%
      write_lat_ns_min              17353.10         16250       1646.08     -6.36%
      
      There are some mixed improvements/regressions in most metrics along with
      an elimination of fragmentation in this workload.
      
      On the balance, the drastic 1->0 improvement in the happy cases seems
      worth the mix of regressions and improvements we do observe.
      
      Some considerations for future work:
      
      - Experimenting with more size classes
      - More hinting/search ordering work to approximate a best-fit allocator
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix uninitialized variable warning in btrfs_update_block_group · efbf35a1
      Josef Bacik authored
      reclaim isn't set in the alloc case, however we only care about
      reclaim in the !alloc case.  This isn't an actual problem, however
      -Wmaybe-uninitialized will complain, so initialize reclaim to quiet the
      compiler.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  9. 05 Dec, 2022 10 commits
    • btrfs: convert btrfs_block_group::needs_free_space to runtime flag · 0d7764ff
      David Sterba authored
      We already have flags in the block group to track various status bits;
      convert needs_free_space as well and reduce the size of
      btrfs_block_group.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: simplify percent calculation helpers, rename div_factor · 428c8e03
      David Sterba authored
      The div_factor* helpers calculate a fraction or percentage fraction.
      The name is a bit confusing: we use them only for percentage
      calculations, and there are two helpers.

      There's a generic helper, mult_frac, for fractions that tries to be
      accurate, but we multiply and divide by small numbers, so we can use the
      div_u64 helper instead.

      Rename the div_factor* helpers to use a 1..100 percentage range, and
      also drop the case checking for percentage == 100, as it's never hit.
      
      The conversions:
      
      * div_factor calculates tenths and the numbers need to be adjusted
      * div_factor_fine is a direct replacement
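
      The renamed helper boils down to something like this (a sketch; the name
      mult_perc and its exact placement are assumptions here):

        /* percentage helper replacing div_factor*, using a 1..100 range */
        static inline u64 mult_perc(u64 num, u32 percent)
        {
                return div_u64(num * percent, 100);
        }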
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: update function comments · 43dd529a
      David Sterba authored
      Update, reformat or reword function comments. This also removes the kdoc
      marker so we don't get reports when the function name is missing.
      
      Changes made:
      
      - remove kdoc markers
      - reformat the brief description to be a proper sentence
      - reword to imperative voice
      - align parameter list
      - fix typos
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move extent-tree helpers into their own header file · a0231804
      Josef Bacik authored
      Move all the extent tree related prototypes to extent-tree.h out of
      ctree.h, and then go include it everywhere needed so everything
      compiles.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move accessor helpers into accessors.h · 07e81dc9
      Josef Bacik authored
      This is a large patch, but because they're all macros it's impossible to
      split up.  Simply copy all of the item accessors in ctree.h and paste
      them in accessors.h, and then update any files to include the header so
      everything compiles.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ reformat comments, style fixups ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move fs wide helpers out of ctree.h · c7f13d42
      Josef Bacik authored
      We have several fs wide related helpers in ctree.h.  The bulk of these
      are the incompat flag test helpers, but there are things such as
      btrfs_fs_closing() and the read only helpers that also aren't directly
      related to the ctree code.  Move these into a fs.h header, which will
      serve as the location for file system wide related helpers.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: skip update of block group item if used bytes are the same · 7248e0ce
      Qu Wenruo authored
      [BACKGROUND]
      
      When committing a transaction, we will update block group items for all
      dirty block groups.
      
      But in fact, dirty block groups don't always need to update their block
      group items.
      It's pretty common to have a metadata block group which experienced
      several COW operations but still has the same amount of used bytes.

      In that case, we may unnecessarily COW a tree block without changing
      anything in it.
      
      [ENHANCEMENT]
      
      This patch introduces the btrfs_block_group::commit_used member to
      remember the last committed used bytes, and uses that new member to skip
      unnecessary block group item updates.

      This is more common for large filesystems, where a metadata block group
      can be as large as 1GiB, containing at most 64K metadata items.

      In that case, if COW added and then deleted one metadata item near the
      end of the block group, it's completely possible we don't need to touch
      the block group item at all.
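
      A minimal sketch of the skip check this adds to
      update_block_group_item() (simplified; error handling and the actual
      item update are omitted):

        spin_lock(&cache->lock);
        used = cache->used;
        spin_unlock(&cache->lock);

        /* nothing changed since the last committed value, skip the update */
        if (used == cache->commit_used)
                return 0;

        /* ... update and write the block group item ... */
        cache->commit_used = used;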
      
      [BENCHMARK]
      
      The change itself has quite a high chance (20~80%) to skip block group
      item updates in a lot of workloads.

      As a result, it shortens the time spent in
      btrfs_write_dirty_block_groups(), and overall reduces the execution time
      of the critical section of btrfs_commit_transaction().

      Here is a fio command which does random writes with a 4K block size,
      causing very heavy metadata updates:
      
      fio --filename=$mnt/file --size=512M --rw=randwrite --direct=1 --bs=4k \
          --ioengine=libaio --iodepth=64 --runtime=300 --numjobs=4 \
          --name=random_write --fallocate=none --time_based --fsync_on_close=1
      
      The file size (512M) and number of threads (4) mean 2GiB of file data
      in total, but during the full 300s run time, my dedicated SATA SSD is
      able to write around 20~25GiB, which is over 10 times the file size.

      Thus after we fill the initial 2G, we should not cause many block group
      item updates.

      Please note that the fio numbers by themselves don't change much, but if
      we look deeper, there is some reduced execution time, especially in the
      critical section of btrfs_commit_transaction().
      
      I added extra trace_printk() to measure the following per-transaction
      execution time:
      
      - Critical section of btrfs_commit_transaction()
        By re-using the existing update_commit_stats() function, which
        has already calculated the interval correctly.
      
      - The while() loop for btrfs_write_dirty_block_groups()
        Although this includes the execution time of btrfs_run_delayed_refs(),
        it should still be representative overall.
      
      Both results cover transid 7~30, i.e. the same number of committed
      transactions.
      
      The result looks like this:
      
                            |      Before       |     After      |  Diff
      ----------------------+-------------------+----------------+--------
      Transaction interval  | 229247198.5       | 215016933.6    | -6.2%
      Block group interval  | 23133.33333       | 18970.83333    | -18.0%
      
      The change in block group item updates is more obvious, as skipped
      block group item updates also mean fewer delayed refs.

      And the overall execution time of that block group update loop is pretty
      small, so we can assume the extent tree is already mostly cached.  If we
      could skip an uncached tree block, the change would be more obvious.

      Unfortunately the overall reduction of the commit transaction critical
      section is much smaller, as the block group item update loop is not
      really the major part, at least not for the above fio script.

      But we still have an observable reduction in the critical section.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: re-check reclaim condition in reclaim worker · 81531225
      Boris Burkov authored
      I have observed the following case play out and lead to unnecessary
      relocations:
      
      1. write a file across multiple block groups
      2. delete the file
      3. several block groups fall below the reclaim threshold
      4. reclaim the first, moving extents into the others
      5. reclaim the others which are now actually very full, leading to poor
         reclaim behavior with lots of writing, allocating new block groups,
         etc.
      
      I believe the risk of missing some reasonable reclaims is worth it
      when traded off against the savings of avoiding overfull reclaims.
      
      Going forward, it could be interesting to make the check more advanced
      (zoned aware, fragmentation aware, etc...) so that it can be a really
      strong signal both at extent delete and reclaim time.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: skip reclaim if block_group is empty · cc4804bf
      Boris Burkov authored
      As we delete extents from a block group, at some deletion we cross below
      the reclaim threshold. It is possible we are still in the middle of
      deleting more extents and might soon hit 0. If the block group is empty
      by the time the reclaim worker runs, we will still relocate it.
      
      This works just fine, as relocating an empty block group ultimately
      results in properly deleting it. However, we have more direct ways of
      removing empty block groups in the cleaner thread. Those are either
      async discard or the unused_bgs list. In fact, when we decide whether to
      relocate a block group during extent deletion, we do check for emptiness
      and prefer the discard/unused_bgs mechanisms when possible.
      
      Not using relocation for this case reduces some modest overhead from
      empty bg relocation:
      
      - extra transactions
      - extra metadata use/churn for creating relocation metadata
      - trying to read the extent tree to look for extents (and in this case
        finding none)
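
      A minimal sketch of the check this adds in the reclaim worker
      (simplified; the exact handling of an empty block group is only
      indicative):

        /* inside the reclaim worker loop, before starting relocation */
        spin_lock(&bg->lock);
        if (bg->used == 0) {
                /* let the cleaner thread / unused_bgs path delete it instead */
                spin_unlock(&bg->lock);
                btrfs_mark_bg_unused(bg);
                goto next;
        }
        spin_unlock(&bg->lock);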
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move btrfs_should_fragment_free_space into block-group.c · 06d61cb1
      Josef Bacik authored
      This function uses functions that are not defined in block-group.h;
      move it into block-group.c to keep the header clean.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  10. 11 Oct, 2022 1 commit
    • btrfs: delete stale comments after merge conflict resolution · 295a53cc
      David Sterba authored
      There are two stale comments in btrfs_cache_block_group that I left
      behind when resolving the conflict between commits ced8ecf0 "btrfs: fix
      space cache corruption and potential double allocations" and 527c490f
      "btrfs: delete btrfs_wait_space_cache_v1_finished".

      The former reworked the caching logic to wait until the caching ends in
      btrfs_cache_block_group, while the latter only open coded the waiting.
      Both removed btrfs_wait_space_cache_v1_finished; the correct code is the
      one with the waiting and the error return, so the conflict resolution
      was OK.
      Signed-off-by: David Sterba <dsterba@suse.com>
  11. 29 Sep, 2022 1 commit
  12. 26 Sep, 2022 7 commits