1. 21 Aug, 2023 40 commits
    • btrfs: zoned: re-enable metadata over-commit for zoned mode · 5b135b38
      Naohiro Aota authored
      We can now re-enable metadata over-commit. As we moved the activation
      from reservation time to write time, we no longer need to ensure that
      all the reserved bytes are properly activated.
      
      Without metadata over-commit, zoned mode suffers from lower performance
      because it needs to flush the delalloc items more often and allocate more
      block groups. Re-enabling metadata over-commit solves the issue.
      
      Fixes: 79417d04 ("btrfs: zoned: disable metadata overcommit for zoned")
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: don't activate non-DATA BG on allocation · 5a7d107e
      Naohiro Aota authored
      Now that a non-DATA block group is activated at write time, don't
      activate it at allocation time.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: no longer count fresh BG region as zone unusable · 6a8ebc77
      Naohiro Aota authored
      Now that we switched to write time activation, we no longer need to (and
      must not) count the fresh region as zone unusable. This commit is similar
      to a revert of commit fa2068d7 ("btrfs: zoned: count fresh BG
      region as zone unusable").
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: activate metadata block group on write time · 13bb483d
      Naohiro Aota authored
      In the current implementation, block groups are activated at reservation
      time to ensure that all reserved bytes can be written to an active metadata
      block group. However, this approach has proven to be less efficient, as it
      activates block groups more frequently than necessary, putting pressure on
      the active zone resource and leading to potential issues such as early
      ENOSPC or hung_task.
      
      Another drawback of the current method is that it hampers metadata
      over-commit, and necessitates additional flush operations and block group
      allocations, resulting in decreased overall performance.
      
      To address these issues, this commit introduces write-time activation of
      metadata and system block groups. This involves reserving at least one
      active block group specifically for metadata and system block groups.
      
      Since metadata write-out is always allocated sequentially, when we need to
      write to a non-active block group, we can wait for the ongoing IOs to
      complete, activate a new block group, and then proceed with writing to the
      new block group.
      
      Fixes: b0931513 ("btrfs: zoned: activate metadata block group on flush_space")
      CC: stable@vger.kernel.org # 6.1+
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: reserve zones for an active metadata/system block group · a7e1ac7b
      Naohiro Aota authored
      Ensure that metadata and system block groups can be activated at write
      time by leaving a certain number of active zones available when trying to
      activate a data block group.
      
      Zones for two metadata block groups (normal and tree-log) and one system
      block group are reserved, according to the profile type: two zones per
      block group on the DUP profile and one zone per block group otherwise.
      
      The reservation must be freed once a non-data block group is allocated. If
      not, we over-reserve the active zones and data block group activation will
      suffer. For the dynamic reservation count, we need to manage the
      reservation count per device.
      
      The reservation count variable is protected by
      fs_info->zone_active_bgs_lock.
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
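The reservation arithmetic described in the "reserve zones for an active metadata/system block group" entry can be sketched in plain C. This is a minimal user-space illustration, not the kernel code: `enum toy_profile`, `toy_zones_per_block_group()` and `toy_reserved_active_zones()` are hypothetical names, and the sketch leaves out the per-device bookkeeping and the locking that the commit message mentions.

```c
/* Hypothetical profile type for this sketch; not the kernel's definition. */
enum toy_profile { TOY_PROFILE_SINGLE, TOY_PROFILE_DUP };

/* Zones needed by one block group: two on DUP, one otherwise. */
static unsigned int toy_zones_per_block_group(enum toy_profile profile)
{
	return profile == TOY_PROFILE_DUP ? 2 : 1;
}

/*
 * Active zones to keep in reserve: two metadata block groups
 * (normal and tree-log) plus one system block group.
 */
static unsigned int toy_reserved_active_zones(enum toy_profile metadata,
					      enum toy_profile system)
{
	return 2 * toy_zones_per_block_group(metadata) +
	       toy_zones_per_block_group(system);
}
```

Under these assumptions, a non-DUP profile for both metadata and system reserves three zones, and DUP for both reserves six.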
    • btrfs: zoned: update meta write pointer on zone finish · c1c3c2bc
      Naohiro Aota authored
      On finishing a zone, the meta_write_pointer should be set to the end of the
      zone to reflect the actual write pointer position.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: defer advancing meta write pointer · 0356ad41
      Naohiro Aota authored
      We currently advance the meta_write_pointer in
      btrfs_check_meta_write_pointer(). That makes it necessary to revert it
      when locking the buffer fails. Instead, we can advance it just before
      sending the buffer.
      
      This is also necessary for the following commit, which needs to release
      the zoned_meta_io_lock to allow IOs to come in and wait for them to fill
      the currently active block group. If we advance the meta_write_pointer
      before locking the extent buffer, the following extent buffer can pass
      the meta_write_pointer check, resulting in an unaligned write failure.
      
      Advancing the pointer is still thread-safe as the extent buffer is locked.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
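The ordering described in the "defer advancing meta write pointer" entry can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not the btrfs implementation: the `toy_` types and functions are hypothetical, and the `lock_ok` flag merely stands in for whether locking the extent buffer succeeds.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for a block group and an extent buffer. */
struct toy_block_group { uint64_t meta_write_pointer; };
struct toy_extent_buffer { uint64_t start; uint64_t len; bool lock_ok; };

/* The check only compares; it no longer advances the pointer. */
static bool toy_check_write_pointer(const struct toy_block_group *bg,
				    const struct toy_extent_buffer *eb)
{
	return eb->start == bg->meta_write_pointer;
}

/* Returns true if the buffer would be sent for write-out. */
static bool toy_submit(struct toy_block_group *bg,
		       const struct toy_extent_buffer *eb)
{
	if (!toy_check_write_pointer(bg, eb))
		return false;	/* not aligned to the write pointer */
	if (!eb->lock_ok)
		return false;	/* pointer untouched, nothing to revert */
	/* Advance just before sending, with the buffer locked. */
	bg->meta_write_pointer += eb->len;
	return true;
}
```

Because the pointer moves only after the lock is held, a failed lock leaves the write pointer where it was and no revert path is needed.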
    • btrfs: zoned: return int from btrfs_check_meta_write_pointer · 2ad8c051
      Naohiro Aota authored
      Now that we have writeback_control passed to
      btrfs_check_meta_write_pointer(), we can move the wbc condition in
      submit_eb_page() to btrfs_check_meta_write_pointer() and return int.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: introduce block group context to btrfs_eb_write_context · 7db94301
      Naohiro Aota authored
      For metadata write-out in zoned mode, we call
      btrfs_check_meta_write_pointer() to check if an extent buffer to be written
      is aligned to the write pointer.
      
      We look up a block group containing the extent buffer for every extent
      buffer, which takes unnecessary effort as the extent buffers being written
      are mostly contiguous.
      
      Introduce "zoned_bg" to cache the block group being worked on.  Also, while
      at it, rename "cache" to "block_group".
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
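The caching idea in the "introduce block group context" entry can be sketched as follows. This is a hedged user-space analogue: only the field name `zoned_bg` comes from the commit message, while the `toy_` types, the linear lookup, and the `lookups` counter are assumptions added purely to make the effect observable.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_bg { uint64_t start; uint64_t length; };

/* Write context carrying the cached block group (hypothetical layout). */
struct toy_write_ctx {
	const struct toy_bg *zoned_bg;	/* block group being worked on */
	unsigned long lookups;		/* how many full lookups were done */
};

static int toy_bg_contains(const struct toy_bg *bg, uint64_t bytenr)
{
	return bytenr >= bg->start && bytenr - bg->start < bg->length;
}

/* Full lookup: linear scan standing in for the real search. */
static const struct toy_bg *toy_lookup(const struct toy_bg *bgs, size_t n,
				       uint64_t bytenr)
{
	for (size_t i = 0; i < n; i++)
		if (toy_bg_contains(&bgs[i], bytenr))
			return &bgs[i];
	return NULL;
}

/* Reuse the cached block group while it still covers bytenr. */
static const struct toy_bg *toy_ctx_block_group(struct toy_write_ctx *ctx,
						const struct toy_bg *bgs,
						size_t n, uint64_t bytenr)
{
	if (ctx->zoned_bg && toy_bg_contains(ctx->zoned_bg, bytenr))
		return ctx->zoned_bg;
	ctx->lookups++;
	ctx->zoned_bg = toy_lookup(bgs, n, bytenr);
	return ctx->zoned_bg;
}
```

For a run of contiguous extent buffers inside one block group, the sketch performs a single full lookup and serves the rest from the cached pointer.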
    • btrfs: introduce struct to consolidate extent buffer write context · 861093ef
      Naohiro Aota authored
      Introduce btrfs_eb_write_context to consolidate writeback_control and the
      extent buffer context.  This will help with adding a block group context
      as well.
      
      While at it, move the eb context setting before
      btrfs_check_meta_write_pointer(). We can set it here because we anyway need
      to skip pages in the same eb if that eb is rejected by
      btrfs_check_meta_write_pointer().
      Suggested-by: Christoph Hellwig <hch@infradead.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: avoid start and commit empty transaction when flushing qgroups · 9c93c238
      Filipe Manana authored
      When flushing qgroups, we try to join a running transaction, with
      btrfs_join_transaction(), and then commit the transaction. However using
      btrfs_join_transaction() will result in creating a new transaction in case
      there isn't any running or if there's an existing one already committing.
      This is pointless as we only need to attach to an existing one that is
      not committing and, in case there's an existing one committing, wait for
      its commit to complete. Creating and committing an empty transaction is
      wasteful: it causes pointless IO and an unnecessary rotation of the
      backup roots.
      
      So use btrfs_attach_transaction_barrier() instead, to avoid creating and
      committing empty transactions.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: avoid start and commit empty transaction when starting qgroup rescan · 6705b48a
      Filipe Manana authored
      When starting a qgroup rescan, we try to join a running transaction, with
      btrfs_join_transaction(), and then commit the transaction. However using
      btrfs_join_transaction() will result in creating a new transaction in case
      there isn't any running or if there's an existing one already committing.
      This is pointless as we only need to attach to an existing one that is
      not committing and, in case there's an existing one committing, wait for
      its commit to complete. Creating and committing an empty transaction is
      wasteful: it causes pointless IO and an unnecessary rotation of the
      backup roots.
      
      So use btrfs_attach_transaction_barrier() instead, to avoid creating and
      committing empty transactions.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: avoid starting and committing empty transaction when flushing space · 2ee70ed1
      Filipe Manana authored
      When flushing space and we are in the COMMIT_TRANS state, we join a
      transaction with btrfs_join_transaction() and then commit the returned
      transaction. However btrfs_join_transaction() starts a new transaction if
      there is none currently open, which is pointless since committing a new,
      empty transaction doesn't achieve anything: it only wastes time and IO,
      and creates an unnecessary rotation of the backup roots.
      
      So use btrfs_attach_transaction_barrier() to avoid starting a new
      transaction. This also waits for any ongoing transaction that is
      committing (state >= TRANS_STATE_COMMIT_DOING) to fully complete, and
      therefore waits for all the extents that were pinned during the
      transaction's lifetime to be unpinned.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: avoid starting new transaction when flushing delayed items and refs · 2391245a
      Filipe Manana authored
      When flushing space we join a transaction to flush delayed items and
      delayed references, in order to try to release space. However,
      btrfs_join_transaction() not only joins an existing transaction but also
      starts a new transaction if there is none open. If there is no
      transaction open, we have neither delayed items nor delayed
      references, so creating a new transaction is a waste of time and IO, and
      creates an unnecessary rotation of the backup roots without gaining any
      benefits (including releasing space).
      
      So use btrfs_join_transaction_nostart() when attempting to flush delayed
      items and references.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: merge find_free_dev_extent() and find_free_dev_extent_start() · ed8947bc
      Filipe Manana authored
      There is no point in having find_free_dev_extent() because it's just a
      simple wrapper around find_free_dev_extent_start() which always passes a
      value of 0 for the search_start argument. Since there are no other callers
      of find_free_dev_extent_start(), remove find_free_dev_extent() and rename
      find_free_dev_extent_start() to find_free_dev_extent(), removing its
      search_start argument because it's always 0.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make find_free_dev_extent() static · 883647f4
      Filipe Manana authored
      The function find_free_dev_extent() is only used within volumes.c, so make
      it static and remove its prototype from volumes.h.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make btrfs_cleanup_fs_roots() static · 504b1596
      Filipe Manana authored
      btrfs_cleanup_fs_roots() is not used outside disk-io.c, so make it static,
      remove its prototype from disk-io.h and move its definition above
      where it's used in disk-io.c.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fail priority metadata ticket with real fs error · 7e3bfd14
      Filipe Manana authored
      At priority_reclaim_metadata_space(), if we were not able to satisfy the
      ticket after going through the various flushing states and we notice
      the fs went into an error state, likely due to a transaction abort during
      the flushing, set the ticket's error to the error that caused the
      transaction abort instead of an unconditional -EROFS.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: return real error when orphan cleanup fails due to a transaction abort · a7f8de50
      Filipe Manana authored
      During mount we will call btrfs_orphan_cleanup() to remove any inodes that
      were previously deleted (have a link count of 0) but for which we were not
      able before to remove their items from the subvolume tree. The removal of
      the items will happen by triggering eviction, when we do the final iput()
      on them at btrfs_orphan_cleanup(), which will end in the loop at
      btrfs_evict_inode() that truncates inode items.
      
      In a dire situation we may have a transaction abort due to -ENOSPC when
      attempting to truncate the inode items, and in that case the orphan item
      (key type BTRFS_ORPHAN_ITEM_KEY) will remain in the subvolume tree and
      when we hit the next iteration of the while loop at btrfs_orphan_cleanup()
      we will find the same orphan item as before, and then we will return
      -EINVAL from btrfs_orphan_cleanup() through the following if statement:
      
          if (found_key.offset == last_objectid) {
             btrfs_err(fs_info,
                       "Error removing orphan entry, stopping orphan cleanup");
             ret = -EINVAL;
             goto out;
          }
      
      This makes the mount operation fail with -EINVAL, when it should have been
      -ENOSPC. This is confusing because -EINVAL might lead users into thinking
      they provided invalid mount options, for example.
      
      An example where this happens:
      
         $ mount test.img /mnt
         mount: /mnt: wrong fs type, bad option, bad superblock on /dev/loop0, missing codepage or helper program, or other error.
      
         $ dmesg
         [ 2542.356934] BTRFS: device fsid 977fff75-1181-4d2b-a739-384fa710d16e devid 1 transid 47409973 /dev/loop0 scanned by mount (4459)
         [ 2542.357451] BTRFS info (device loop0): using crc32c (crc32c-intel) checksum algorithm
         [ 2542.357461] BTRFS info (device loop0): disk space caching is enabled
         [ 2542.742287] BTRFS info (device loop0): auto enabling async discard
         [ 2542.764554] BTRFS info (device loop0): checking UUID tree
         [ 2551.743065] ------------[ cut here ]------------
         [ 2551.743068] BTRFS: Transaction aborted (error -28)
         [ 2551.743149] WARNING: CPU: 7 PID: 215 at fs/btrfs/block-group.c:3494 btrfs_write_dirty_block_groups+0x397/0x3d0 [btrfs]
         [ 2551.743311] Modules linked in: btrfs blake2b_generic (...)
         [ 2551.743353] CPU: 7 PID: 215 Comm: kworker/u24:5 Not tainted 6.4.0-rc6-btrfs-next-134+ #1
         [ 2551.743356] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
         [ 2551.743357] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
         [ 2551.743405] RIP: 0010:btrfs_write_dirty_block_groups+0x397/0x3d0 [btrfs]
         [ 2551.743449] Code: 8b 43 0c (...)
         [ 2551.743451] RSP: 0018:ffff982c005a7c40 EFLAGS: 00010286
         [ 2551.743452] RAX: 0000000000000000 RBX: ffff88fc6e44b400 RCX: 0000000000000000
         [ 2551.743453] RDX: 0000000000000002 RSI: ffffffff8dff0878 RDI: 00000000ffffffff
         [ 2551.743454] RBP: ffff88fc51817208 R08: 0000000000000000 R09: ffff982c005a7ae0
         [ 2551.743455] R10: 0000000000000001 R11: 0000000000000001 R12: ffff88fc43d2e570
         [ 2551.743456] R13: ffff88fc43d2e400 R14: ffff88fc8fb08ee0 R15: ffff88fc6e44b530
         [ 2551.743457] FS:  0000000000000000(0000) GS:ffff89035fbc0000(0000) knlGS:0000000000000000
         [ 2551.743458] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         [ 2551.743459] CR2: 00007fa8cdf2f6f4 CR3: 0000000124850003 CR4: 0000000000370ee0
         [ 2551.743462] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
         [ 2551.743463] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
         [ 2551.743464] Call Trace:
         [ 2551.743472]  <TASK>
         [ 2551.743474]  ? __warn+0x80/0x130
         [ 2551.743478]  ? btrfs_write_dirty_block_groups+0x397/0x3d0 [btrfs]
         [ 2551.743520]  ? report_bug+0x1f4/0x200
         [ 2551.743523]  ? handle_bug+0x42/0x70
         [ 2551.743526]  ? exc_invalid_op+0x14/0x70
         [ 2551.743528]  ? asm_exc_invalid_op+0x16/0x20
         [ 2551.743532]  ? btrfs_write_dirty_block_groups+0x397/0x3d0 [btrfs]
         [ 2551.743574]  ? _raw_spin_unlock+0x15/0x30
         [ 2551.743576]  ? btrfs_run_delayed_refs+0x1bd/0x200 [btrfs]
         [ 2551.743609]  commit_cowonly_roots+0x1e9/0x260 [btrfs]
         [ 2551.743652]  btrfs_commit_transaction+0x42e/0xfa0 [btrfs]
         [ 2551.743693]  ? __pfx_autoremove_wake_function+0x10/0x10
         [ 2551.743697]  flush_space+0xf1/0x5d0 [btrfs]
         [ 2551.743743]  ? _raw_spin_unlock+0x15/0x30
         [ 2551.743745]  ? finish_task_switch+0x91/0x2a0
         [ 2551.743748]  ? _raw_spin_unlock+0x15/0x30
         [ 2551.743750]  ? btrfs_get_alloc_profile+0xc9/0x1f0 [btrfs]
         [ 2551.743793]  btrfs_async_reclaim_metadata_space+0xe1/0x230 [btrfs]
         [ 2551.743837]  process_one_work+0x1d9/0x3e0
         [ 2551.743844]  worker_thread+0x4a/0x3b0
         [ 2551.743847]  ? __pfx_worker_thread+0x10/0x10
         [ 2551.743849]  kthread+0xee/0x120
         [ 2551.743852]  ? __pfx_kthread+0x10/0x10
         [ 2551.743854]  ret_from_fork+0x29/0x50
         [ 2551.743860]  </TASK>
         [ 2551.743861] ---[ end trace 0000000000000000 ]---
         [ 2551.743863] BTRFS info (device loop0: state A): dumping space info:
         [ 2551.743866] BTRFS info (device loop0: state A): space_info DATA has 126976 free, is full
         [ 2551.743868] BTRFS info (device loop0: state A): space_info total=13458472960, used=13458137088, pinned=143360, reserved=0, may_use=0, readonly=65536 zone_unusable=0
         [ 2551.743870] BTRFS info (device loop0: state A): space_info METADATA has -51625984 free, is full
         [ 2551.743872] BTRFS info (device loop0: state A): space_info total=771751936, used=770146304, pinned=1605632, reserved=0, may_use=51625984, readonly=0 zone_unusable=0
         [ 2551.743874] BTRFS info (device loop0: state A): space_info SYSTEM has 14663680 free, is not full
         [ 2551.743875] BTRFS info (device loop0: state A): space_info total=14680064, used=16384, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0
         [ 2551.743877] BTRFS info (device loop0: state A): global_block_rsv: size 53231616 reserved 51544064
         [ 2551.743878] BTRFS info (device loop0: state A): trans_block_rsv: size 0 reserved 0
         [ 2551.743879] BTRFS info (device loop0: state A): chunk_block_rsv: size 0 reserved 0
         [ 2551.743880] BTRFS info (device loop0: state A): delayed_block_rsv: size 0 reserved 0
         [ 2551.743881] BTRFS info (device loop0: state A): delayed_refs_rsv: size 786432 reserved 0
         [ 2551.743886] BTRFS: error (device loop0: state A) in btrfs_write_dirty_block_groups:3494: errno=-28 No space left
         [ 2551.743911] BTRFS info (device loop0: state EA): forced readonly
         [ 2551.743951] BTRFS warning (device loop0: state EA): could not allocate space for delete; will truncate on mount
         [ 2551.743962] BTRFS error (device loop0: state EA): Error removing orphan entry, stopping orphan cleanup
         [ 2551.743973] BTRFS warning (device loop0: state EA): Skipping commit of aborted transaction.
         [ 2551.743989] BTRFS error (device loop0: state EA): could not do orphan cleanup -22
      
      So make btrfs_orphan_cleanup() return the value of BTRFS_FS_ERROR(),
      if it's set, and -EINVAL otherwise.
      
      For that same example, after this change, the mount operation fails with
      -ENOSPC:
      
         $ mount test.img /mnt
         mount: /mnt: mount(2) system call failed: No space left on device.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
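The return-value rule from the "return real error when orphan cleanup fails" entry is small enough to show as a sketch. This is an illustration under stated assumptions rather than the kernel change itself: `struct toy_fs_info` and `toy_orphan_cleanup_error()` are hypothetical, and the `fs_error` field models the stored error that the commit message reads through BTRFS_FS_ERROR().

```c
#include <errno.h>

/* Hypothetical stand-in: 0 means no error, otherwise a negative errno. */
struct toy_fs_info { int fs_error; };

/*
 * Error to report when the same orphan item is found twice:
 * the stored fs error if there is one, -EINVAL otherwise.
 */
static int toy_orphan_cleanup_error(const struct toy_fs_info *fs_info)
{
	return fs_info->fs_error ? fs_info->fs_error : -EINVAL;
}
```

In the scenario from the log above, the stored error would be -ENOSPC (error -28), so the mount would report "No space left on device" instead of an invalid-argument failure.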
    • btrfs: store the error that turned the fs into error state · ae3364e5
      Filipe Manana authored
      Currently when we turn the fs into an error state, typically after a
      transaction abort, we don't store the error anywhere, we just set a bit
      (BTRFS_FS_STATE_ERROR) at struct btrfs_fs_info::fs_state to signal the
      error state.
      
      There are cases where it would be useful to have access to the specific
      error in order to provide a more meaningful error to users/applications.
      This change adds a member to struct btrfs_fs_info to store the error and
      removes the BTRFS_FS_STATE_ERROR bit. When there's no error, the new
      member (fs_error) has a value of 0, otherwise its value is a negative
      errno value.
      
      Followup changes will make use of this new member.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't steal space from global rsv after a transaction abort · 1b6948ac
      Filipe Manana authored
      When doing a priority metadata space reclaim, while we are going through
      the flush states and running their respective operations, it's possible
      that a transaction abort happened, for example when running delayed refs
      we hit -ENOSPC or in the critical section of transaction commit we failed
      with -ENOSPC or some other error. In these cases a transaction was aborted
      and the fs turned into error state. If that happened, then it makes no
      sense to steal from the global block reserve and return success to the
      caller if the stealing was successful - the caller will later get an
      error when attempting to modify the fs. Instead make the ticket fail if
      we have the fs in error state and don't attempt to steal from the global
      rsv: stealing is pointless in that case, and failing the ticket also
      simplifies debugging some -ENOSPC problems.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
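The decision order described in the "don't steal space from global rsv after a transaction abort" entry might look roughly like the sketch below. It is a toy model with hypothetical names (`toy_fs_info`, `toy_ticket`, `toy_finish_priority_ticket()`); the condition used for stealing is deliberately simplified and is an assumption, not the kernel's actual check. Failing the ticket with the stored error follows the "fail priority metadata ticket with real fs error" entry in this list.

```c
#include <errno.h>
#include <stdint.h>

struct toy_fs_info {
	int fs_error;			/* 0, or the negative errno of the abort */
	uint64_t global_rsv_reserved;	/* bytes held by the global reserve */
};

struct toy_ticket {
	uint64_t bytes;	/* bytes still needed; 0 means satisfied */
	int error;	/* 0 on success, negative errno on failure */
};

/* Called after all flush states have run without satisfying the ticket. */
static void toy_finish_priority_ticket(struct toy_fs_info *fs,
				       struct toy_ticket *ticket)
{
	if (ticket->bytes == 0)
		return;				/* nothing left to do */
	if (fs->fs_error) {
		ticket->error = fs->fs_error;	/* fail, do not steal */
		return;
	}
	if (fs->global_rsv_reserved >= ticket->bytes) {
		/* Simplified steal: take the bytes from the global reserve. */
		fs->global_rsv_reserved -= ticket->bytes;
		ticket->bytes = 0;
		return;
	}
	ticket->error = -ENOSPC;
}
```

The point of the ordering is that the error-state check comes before any attempt to steal, so the global reserve is left untouched once the filesystem is in an error state.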
    • btrfs: print available space across all block groups when dumping space info · 1ff9fee3
      Filipe Manana authored
      When dumping a space info also sum the available space for all block
      groups and then print it. This is often useful for debugging -ENOSPC
      related problems.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: print available space for a block group when dumping a space info · e50b122b
      Filipe Manana authored
      When dumping a space info, we iterate over all its block groups and then
      print their size and the amounts of bytes used, reserved, pinned, etc.
      When debugging -ENOSPC problems it's also useful to know how much space
      is available (free), so calculate that and print it as well.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
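As a rough illustration of the two "print available space" entries, the sketch below computes a per-block-group free amount and sums it across block groups. The exact set of counters subtracted is an assumption made for this example (the commit messages only mention used, reserved, pinned "etc.", plus super and delalloc bytes), and all `toy_` names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-block-group counters, all in bytes. */
struct toy_block_group {
	uint64_t length;
	uint64_t used;
	uint64_t pinned;
	uint64_t reserved;
	uint64_t delalloc_bytes;
	uint64_t bytes_super;
	uint64_t zone_unusable;
};

/* Assumed formula: whatever is not accounted to any counter is free. */
static uint64_t toy_bg_available(const struct toy_block_group *bg)
{
	return bg->length - bg->used - bg->pinned - bg->reserved -
	       bg->delalloc_bytes - bg->bytes_super - bg->zone_unusable;
}

/* Sum of the available space across all block groups of a space info. */
static uint64_t toy_total_available(const struct toy_block_group *bgs,
				    size_t n)
{
	uint64_t total = 0;

	for (size_t i = 0; i < n; i++)
		total += toy_bg_available(&bgs[i]);
	return total;
}
```

Printing both values side by side makes it easier to see whether an -ENOSPC comes from one full block group or from the space info as a whole.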
    • btrfs: print block group super and delalloc bytes when dumping space info · b92e8f54
      Filipe Manana authored
      When dumping a space info's block groups, also print the number of bytes
      used for super blocks and delalloc. This is often useful for debugging
      -ENOSPC problems.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: print target number of bytes when dumping free space · 4d2024e9
      Filipe Manana authored
      When dumping free space, with btrfs_dump_free_space(), we pass a bytes
      argument in order to count how many free space entries in the block group
      have a size greater than or equal to that number of bytes. We then print
      how many suitable entries we found, but we don't print the target number
      of bytes, we just say "bytes". Change the message to actually print the
      number of bytes, which makes debugging -ENOSPC issues a bit easier.
      
      Also slightly change the odd grammar and terminology: the sentence is
      ending with 'is', which doesn't make sense, and the term 'blocks' is
      confusing as we are referring to free space entries within the block
      group's free space cache.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
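The counting described in the "print target number of bytes when dumping free space" entry can be sketched like this. The functions and the message wording are hypothetical and only show the idea of printing the threshold together with the count; the real message text is not reproduced here.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Count free space entries whose size is at least `bytes`. */
static unsigned int toy_count_suitable_entries(const uint64_t *entry_bytes,
					       size_t n, uint64_t bytes)
{
	unsigned int count = 0;

	for (size_t i = 0; i < n; i++)
		if (entry_bytes[i] >= bytes)
			count++;
	return count;
}

/* Illustrative message that names the target size instead of just "bytes". */
static int toy_format_summary(char *buf, size_t size, unsigned int count,
			      uint64_t bytes)
{
	return snprintf(buf, size,
			"%u free space entries at or bigger than %" PRIu64 " bytes",
			count, bytes);
}
```

With entries of 4096, 8192 and 65536 bytes and a target of 8192, the sketch counts two suitable entries and the summary names the 8192-byte target explicitly.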
    • btrfs: update comment for btrfs_join_transaction_nostart() · 19288951
      Filipe Manana authored
      Update the comment for btrfs_join_transaction_nostart() to be more clear
      about how it works and how it's different from btrfs_attach_transaction().
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't start transaction when joining with TRANS_JOIN_NOSTART · 4490e803
      Filipe Manana authored
      When joining a transaction with TRANS_JOIN_NOSTART, if we don't find a
      running transaction we end up creating one. This goes against the purpose
      of TRANS_JOIN_NOSTART which is to join a running transaction if its state
      is at or below the state TRANS_STATE_COMMIT_START, otherwise return an
      -ENOENT error and don't start a new transaction. So fix this to not create
      a new transaction if there's no running transaction at or below that
      state.
      
      CC: stable@vger.kernel.org # 4.14+
      Fixes: a6d155d2 ("Btrfs: fix deadlock between fiemap and transaction commits")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
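The difference between a plain join and a TRANS_JOIN_NOSTART join, as described in the commit messages in this list, can be captured in a toy decision function. Everything here is a simplified model with hypothetical names: real transaction handling involves waiting, locking and more states than the four modelled.

```c
#include <errno.h>

/* Simplified transaction states for this sketch. */
enum toy_trans_state {
	TOY_NO_TRANS,		/* no transaction exists */
	TOY_RUNNING,
	TOY_COMMIT_START,
	TOY_COMMIT_DOING,	/* already committing */
};

enum toy_join_mode { TOY_JOIN, TOY_JOIN_NOSTART };

/*
 * Returns 0 if the caller joins the existing transaction,
 * 1 if a new transaction gets started, or -ENOENT if nothing
 * can be joined and starting one is not allowed.
 */
static int toy_join(enum toy_join_mode mode, enum toy_trans_state state)
{
	if (state == TOY_RUNNING || state == TOY_COMMIT_START)
		return 0;		/* at or below COMMIT_START: join it */
	if (mode == TOY_JOIN_NOSTART)
		return -ENOENT;		/* the fix: never start a new one */
	return 1;			/* plain join starts a new transaction */
}
```

In this model the bug fixed by the commit corresponds to the NOSTART mode falling through to the last line when no transaction exists.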
    • btrfs: refactor main loop in memmove_extent_buffer() · 096d2301
      Qu Wenruo authored
      [BACKGROUND]
      Currently memmove_extent_buffer() does a loop where it stops at any page
      boundary inside [dst_offset, dst_offset + len) or [src_offset,
      src_offset + len).
      
      This mostly allows us to use copy_pages(), but if we're going to use
      folios we will need to handle multi-page (the old behavior) or single
      folio (the new optimization).
      
      The current code would be a burden for future changes.
      
      [ENHANCEMENT]
      Instead of sticking with copy_pages(), here we utilize the new
      __write_extent_buffer() helper to handle the writes.
      
      Unlike the refactoring in memcpy_extent_buffer(), we cannot just rely
      on write_extent_buffer() and only handle page boundaries inside the src
      range.
      
      The function write_extent_buffer() itself is still doing forward
      writing, thus it cannot handle the following case: (already in the
      extent buffer memory operation tests, cross page overlapping run 2)
      
      	Src	Page boundary
      	|///////|
      	    |///|////|
      	    Dst
      
      In the above case, if we just follow page boundary in the src range, we
      have no need to do any split, just one __write_extent_buffer() with
      use_memmove = true.
      
      But __write_extent_buffer() would split the dst range into two,
      so it first copies the beginning part of the src range into the first half
      of the dst range.
      After this operation, the beginning of the dst range is already updated,
      causing corruption.
      
      So we have to follow the old behavior of handling both page boundaries.
      
      And since we're the last caller of copy_pages(), we can remove it
      completely.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
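The corruption scenario from the "refactor main loop in memmove_extent_buffer()" entry can be reproduced in user space. The sketch below assumes a toy page size of 8 bytes and a hypothetical helper, `toy_forward_split_move()`, that copies forward and splits only at destination page boundaries, which is the behavior the commit message warns about.

```c
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 8	/* tiny page size, for illustration only */

/*
 * Forward copy inside one buffer, split at destination page boundaries.
 * Each chunk uses memmove(), yet the overall result can still be wrong
 * when an earlier chunk overwrites source bytes a later chunk needs.
 */
static void toy_forward_split_move(char *buf, size_t dst, size_t src,
				   size_t len)
{
	while (len > 0) {
		size_t room = TOY_PAGE_SIZE - (dst % TOY_PAGE_SIZE);
		size_t cur = len < room ? len : room;

		memmove(buf + dst, buf + src, cur);
		dst += cur;
		src += cur;
		len -= cur;
	}
}
```

Moving 8 bytes from offset 0 to offset 4 in "ABCDEFGHIJKL" should give "ABCDABCDEFGH", as a single memmove() does. The split version first writes "ABCD" over offsets 4 to 7 and then copies those already-overwritten bytes again, ending with "ABCDABCDABCD".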
    • btrfs: refactor main loop in memcpy_extent_buffer() · 13840f3f
      Qu Wenruo authored
      [BACKGROUND]
      Currently memcpy_extent_buffer() does a loop where it would stop at
      any page boundary inside [dst_offset, dst_offset + len) or [src_offset,
      src_offset + len).
      
      This is mostly allowing us to do copy_pages(), but if we're going to use
      folios we will need to handle multi-page (the old behavior) or single
      folio (the new optimization).
      
      The current code would be a burden for future changes.
      
      [ENHANCEMENT]
      The name memcpy_extent_buffer() hides a pitfall: unlike regular
      memcpy(), this function can handle overlapping ranges.
      
      So here we extract write_extent_buffer() into a new internal helper,
      __write_extent_buffer(), and add a new parameter @use_memmove, to
      indicate whether we should use memmove() or regular memcpy().
      
      Now we can use __write_extent_buffer() to handle writing into the dst
      range, with proper overlap detection.
      
      This slightly changes how often memmove() is called.
      As the split only happens at the source range page boundaries, the
      memcpy()/memmove() range is slightly larger than in the old code,
      which slightly increases the chance that we call memmove() rather than
      memcpy().
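
As a rough userspace illustration of the approach described in this commit message (a sketch under assumptions, not the kernel code): the loop below splits only at source page boundaries, checks each chunk for overlap, and hands the chunk to a write helper that splits at destination page boundaries and picks memmove() or memcpy(). All `demo_*` names and the 8-byte page size are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 8	/* tiny page size, for illustration only */
#define DEMO_NR_PAGES  4

/* Hypothetical stand-in for an extent buffer backed by separate pages. */
struct demo_eb {
	unsigned char pages[DEMO_NR_PAGES][DEMO_PAGE_SIZE];
};

static size_t demo_min(size_t a, size_t b)
{
	return a < b ? a : b;
}

/* True if [a, a + len) and [b, b + len) share at least one byte. */
static bool demo_areas_overlap(size_t a, size_t b, size_t len)
{
	return a < b + len && b < a + len;
}

/* Write len bytes from src to buffer offset dst, split at dst page ends. */
static void demo_write(struct demo_eb *eb, const unsigned char *src,
		       size_t dst, size_t len, bool use_memmove)
{
	while (len > 0) {
		size_t off = dst % DEMO_PAGE_SIZE;
		size_t cur = demo_min(len, DEMO_PAGE_SIZE - off);
		unsigned char *to = &eb->pages[dst / DEMO_PAGE_SIZE][off];

		if (use_memmove)
			memmove(to, src, cur);
		else
			memcpy(to, src, cur);
		src += cur;
		dst += cur;
		len -= cur;
	}
}

/* Copy within the buffer, splitting only at source page boundaries. */
static void demo_memcpy(struct demo_eb *eb, size_t dst, size_t src, size_t len)
{
	size_t done = 0;

	assert(dst + len <= DEMO_PAGE_SIZE * DEMO_NR_PAGES);
	assert(src + len <= DEMO_PAGE_SIZE * DEMO_NR_PAGES);

	while (done < len) {
		size_t cur_src = src + done;
		size_t off = cur_src % DEMO_PAGE_SIZE;
		size_t cur = demo_min(len - done, DEMO_PAGE_SIZE - off);
		bool use_memmove = demo_areas_overlap(cur_src, dst + done, cur);

		demo_write(eb, &eb->pages[cur_src / DEMO_PAGE_SIZE][off],
			   dst + done, cur, use_memmove);
		done += cur;
	}
}
```

Because the overlap check runs on the whole source-page chunk rather than on a piece clipped to both ranges, a chunk is somewhat more likely to be flagged as overlapping, which is the small behavior change the message describes. This sketch copies forward only, so it is safe for overlapping ranges only when dst is below src.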
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      13840f3f
    • Qu Wenruo's avatar
      btrfs: copy all pages at once at the end of btrfs_clone_extent_buffer() · 682a0bc5
      Qu Wenruo authored
      btrfs_clone_extent_buffer() calls copy_page() at each iteration but we
      can copy all pages at the end in one go if there were no errors.
      This would make later conversion to folios easier.
      Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      682a0bc5
    • Qu Wenruo's avatar
      btrfs: refactor main loop in copy_extent_buffer_full() · 54948681
      Qu Wenruo authored
      [BACKGROUND]
      copy_extent_buffer_full() currently handles regular and subpage cases
      differently: for regular cases it copies page by page, while for
      subpage cases it just copies the content.
      
      This is fine for the page based extent buffer code, but for the incoming
      folio conversion, it can be a burden to add a new branch just to handle
      all the different combinations (subpage vs regular, one single folio vs
      multi pages).
      
      [ENHANCEMENT]
      Instead of handling the different combinations, use one single code
      path for all cases, utilizing write_extent_buffer() to do the
      copying.
      Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      54948681
    • Qu Wenruo's avatar
      btrfs: use write_extent_buffer() to implement write_extent_buffer_*id() · 730c374e
      Qu Wenruo authored
      The helpers write_extent_buffer_chunk_tree_uuid() and
      write_extent_buffer_fsid() can be implemented with
      write_extent_buffer().
      
      These two helpers are not that frequently used, they only get called
      during initialization of a new tree block.  There is not much need for
      those slightly optimized versions.  And since they can be easily
      converted to one write_extent_buffer() call, define them as inline
      helpers.
      
      This would make the later page/folio switch much easier, as all changes
      only need to happen in write_extent_buffer().
      Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      730c374e
    • Qu Wenruo's avatar
      btrfs: refactor extent buffer bitmaps operations · cb22964f
      Qu Wenruo authored
      [BACKGROUND]
      Currently we handle extent bitmaps manually in
      extent_buffer_bitmap_set() and extent_buffer_bitmap_clear().
      
      Even with various helpers like eb_bitmap_offset(), the code is still a
      little messy to read.  It seems to be a copy of bitmap_set(), but with
      all the cross-page handling embedded into it.
      
      [ENHANCEMENT]
      This patch would enhance the readability by introducing two helpers:
      
      - memset_extent_buffer()
        To handle the byte aligned range, thus all the cross-page handling is
        done there.
      
      - extent_buffer_get_byte()
        This is for the first and the last byte operations, which only need
        to grab one byte, thus no need for any cross-page handling.
      
      So we can split both extent_buffer_bitmap_set() and
      extent_buffer_bitmap_clear() into 3 parts:
      
      - Handle the first byte
        If the range fits inside the first byte, we can exit early.
      
      - Handle the byte aligned part
        This is the part which can have cross-page operations, and it would
        be handled by memset_extent_buffer().
      
      - Handle the last byte
      
      This refactoring not only makes the code a little easier to read, but
      also makes the later folio/page switch much easier, as the switch only
      needs to be done inside memset_extent_buffer() and
      extent_buffer_get_byte().
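
The three-part split can be sketched in plain userspace C. This is only an illustration on a flat byte array, not the kernel code: `byte_mask()` and `sketch_bitmap_set()` are hypothetical names, and plain memset() stands in for memset_extent_buffer(), which in the kernel would also take care of page boundaries.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_BITS_PER_BYTE 8

/* Mask covering bits [first, last] (inclusive) within a single byte. */
static unsigned char byte_mask(unsigned int first, unsigned int last)
{
	assert(first <= last && last < SKETCH_BITS_PER_BYTE);
	return (unsigned char)((0xffu << first) &
			       (0xffu >> (SKETCH_BITS_PER_BYTE - 1 - last)));
}

/* Set nbits bits, starting at bit number start, in a flat byte array. */
static void sketch_bitmap_set(unsigned char *map, size_t start, size_t nbits)
{
	size_t first_byte, last_byte;
	unsigned int first_bit, last_bit;

	if (nbits == 0)
		return;

	first_byte = start / SKETCH_BITS_PER_BYTE;
	last_byte = (start + nbits - 1) / SKETCH_BITS_PER_BYTE;
	first_bit = (unsigned int)(start % SKETCH_BITS_PER_BYTE);
	last_bit = (unsigned int)((start + nbits - 1) % SKETCH_BITS_PER_BYTE);

	/* Part 1: the first byte; exit early if the range fits inside it. */
	if (first_byte == last_byte) {
		map[first_byte] |= byte_mask(first_bit, last_bit);
		return;
	}
	map[first_byte] |= byte_mask(first_bit, SKETCH_BITS_PER_BYTE - 1);

	/* Part 2: the byte aligned middle, filled in one call. */
	if (last_byte > first_byte + 1)
		memset(map + first_byte + 1, 0xff, last_byte - first_byte - 1);

	/* Part 3: the last byte. */
	map[last_byte] |= byte_mask(0, last_bit);
}
```

Only part 2 touches more than one byte, which is why the cross-page handling can be confined to a single helper. A clear operation would follow the same shape with the masks inverted and a zero fill.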
      Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cb22964f
    • Qu Wenruo's avatar
      btrfs: tests: add self tests for extent buffer memory operations · 5864f1da
      Qu Wenruo authored
      The new self tests populate a memory range with random bytes, then
      copy it to the extent buffer, so that we can verify that the extent
      buffer memory operations and memmove()/memcpy() produce the same
      contents.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5864f1da
    • Qu Wenruo's avatar
      btrfs: tests: enhance extent buffer bitmap tests · 257deed2
      Qu Wenruo authored
      Enhance extent bitmap tests for the following aspects:
      
      - Remove unnecessary @len from __test_eb_bitmaps()
        We can fetch the length from extent buffer
      
      - Explicitly distinguish bit and byte length
        Now every start/len inside bitmap tests would have either "byte_" or
        "bit_" prefix to make it more explicit.
      
      - Better error reporting
      
        If there are mismatched bits, the error report dumps the following
        contents:
      
        * start bytenr
        * bit number
        * the full byte from bitmap
        * the full byte from the extent
      
        This saves developers time, as obvious problems can be found
        immediately.
      
      - Extract bitmap set/clear and check operation into two helpers
        This is to save some code lines, as we will have more tests to do.
      
      - Add new tests
      
        The following tests are added, mostly for the incoming extent bitmap
        accessor refactoring:
      
        * Set bits inside the same byte
        * Clear bits inside the same byte
        * Cross byte boundary set
        * Cross byte boundary clear
        * Cross multi-byte boundary set
        * Cross multi-byte boundary clear
      
        Those new tests have already saved my backend for the incoming extent
        buffer bitmap refactoring.
      Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      257deed2
    • Josef Bacik's avatar
      btrfs: move comments to btrfs_loop_type definition · b9d97cff
      Josef Bacik authored
      Some of these loop types aren't described, and the descriptions should
      sit with the definitions to make it easier to tell what each of them
      does.
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b9d97cff
    • Anand Jain's avatar
      btrfs: print name and pid when device scanning processes race · 7f9879eb
      Anand Jain authored
      There is a race between systemd and mount, as both of them try to register
      the device in the kernel. When systemd loses the race, it prints the
      following message:
      
        BTRFS error: device /dev/sdb7 belongs to fsid 1b3bacbf-14db-49c9-a3ef-547998aacc4e, and the fs is already mounted.
      
      The 'btrfs dev scan' registers one device at a time, so there is no way
      for the mount thread to wait in the kernel for all the devices to have
      registered as it won't know if all the devices are discovered.
      
      For now, improve the error log by printing the command name and process
      ID along with the error message.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7f9879eb
    • Christoph Hellwig's avatar
      mm: remove folio_account_redirty · ed2da924
      Christoph Hellwig authored
      Fold folio_account_redirty into folio_redirty_for_writepage now
      that all other users except for the also unused account_page_redirty
      wrapper are gone.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ed2da924
    • Christoph Hellwig's avatar
      btrfs: fix zoned handling in submit_uncompressed_range · 256b0cf9
      Christoph Hellwig authored
      For zoned file systems we need to use run_delalloc_zoned to submit
      writeback, as we need to write out partial allocations when running into
      zone active limits.
      
      submit_uncompressed_range currently always calls cow_file_range to
      allocate blocks and thus misses the active zone limits handling.  Fix
      this by passing the pages_dirty argument to run_delalloc_zoned and always
      using it from submit_uncompressed_range as it does the right thing for
      zoned and non-zoned file systems.
      
      To account for the fact that run_delalloc_zoned is now also used for
      non-zoned file systems, rename it to run_delalloc_cow and add a comment
      describing it.
      
      Fixes: 42c01100 ("btrfs: zoned: introduce dedicated data write path for zoned filesystems")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      256b0cf9
    • Christoph Hellwig's avatar
      btrfs: don't redirty locked_page in run_delalloc_zoned · 778b8785
      Christoph Hellwig authored
      extent_write_locked_range currently expects that either all or no
      pages are dirty when it is called.  But run_delalloc_zoned is called
      directly in the writepages path, and has the dirty bit cleared only
      for locked_page, on which extent_write_cache_pages currently
      operates.  It currently works around this by redirtying locked_page,
      but that is a bit inefficient and cumbersome.  Pass a locked_page
      argument to run_delalloc_zoned so that clearing the dirty bit can
      be skipped on just that page.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      778b8785