1. 19 Jun, 2023 40 commits
    • btrfs: fix race when deleting quota root from the dirty cow roots list · b31cb5a6
      Filipe Manana authored
      When disabling quotas we are deleting the quota root from the list
      fs_info->dirty_cowonly_roots without taking the lock that protects it,
      which is struct btrfs_fs_info::trans_lock. This unsynchronized list
      manipulation may cause chaos if there's another concurrent manipulation
      of this list, such as when adding a root to it with
      ctree.c:add_root_to_dirty_list().
      
      This can result in all sorts of weird failures caused by a race, such as
      the following crash:
      
        [337571.278245] general protection fault, probably for non-canonical address 0xdead000000000108: 0000 [#1] PREEMPT SMP PTI
        [337571.278933] CPU: 1 PID: 115447 Comm: btrfs Tainted: G        W          6.4.0-rc6-btrfs-next-134+ #1
        [337571.279153] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
        [337571.279572] RIP: 0010:commit_cowonly_roots+0x11f/0x250 [btrfs]
        [337571.279928] Code: 85 38 06 00 (...)
        [337571.280363] RSP: 0018:ffff9f63446efba0 EFLAGS: 00010206
        [337571.280582] RAX: ffff942d98ec2638 RBX: ffff9430b82b4c30 RCX: 0000000449e1c000
        [337571.280798] RDX: dead000000000100 RSI: ffff9430021e4900 RDI: 0000000000036070
        [337571.281015] RBP: ffff942d98ec2000 R08: ffff942d98ec2000 R09: 000000000000015b
        [337571.281254] R10: 0000000000000009 R11: 0000000000000001 R12: ffff942fe8fbf600
        [337571.281476] R13: ffff942dabe23040 R14: ffff942dabe20800 R15: ffff942d92cf3b48
        [337571.281723] FS:  00007f478adb7340(0000) GS:ffff94349fa40000(0000) knlGS:0000000000000000
        [337571.281950] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [337571.282184] CR2: 00007f478ab9a3d5 CR3: 000000001e02c001 CR4: 0000000000370ee0
        [337571.282416] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [337571.282647] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [337571.282874] Call Trace:
        [337571.283101]  <TASK>
        [337571.283327]  ? __die_body+0x1b/0x60
        [337571.283570]  ? die_addr+0x39/0x60
        [337571.283796]  ? exc_general_protection+0x22e/0x430
        [337571.284022]  ? asm_exc_general_protection+0x22/0x30
        [337571.284251]  ? commit_cowonly_roots+0x11f/0x250 [btrfs]
        [337571.284531]  btrfs_commit_transaction+0x42e/0xf90 [btrfs]
        [337571.284803]  ? _raw_spin_unlock+0x15/0x30
        [337571.285031]  ? release_extent_buffer+0x103/0x130 [btrfs]
        [337571.285305]  reset_balance_state+0x152/0x1b0 [btrfs]
        [337571.285578]  btrfs_balance+0xa50/0x11e0 [btrfs]
        [337571.285864]  ? __kmem_cache_alloc_node+0x14a/0x410
        [337571.286086]  btrfs_ioctl+0x249a/0x3320 [btrfs]
        [337571.286358]  ? mod_objcg_state+0xd2/0x360
        [337571.286577]  ? refill_obj_stock+0xb0/0x160
        [337571.286798]  ? seq_release+0x25/0x30
        [337571.287016]  ? __rseq_handle_notify_resume+0x3ba/0x4b0
        [337571.287235]  ? percpu_counter_add_batch+0x2e/0xa0
        [337571.287455]  ? __x64_sys_ioctl+0x88/0xc0
        [337571.287675]  __x64_sys_ioctl+0x88/0xc0
        [337571.287901]  do_syscall_64+0x38/0x90
        [337571.288126]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
        [337571.288352] RIP: 0033:0x7f478aaffe9b
      
      So fix this by locking struct btrfs_fs_info::trans_lock before deleting
      the quota root from that list.
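
      A userspace sketch of the locking rule this fix enforces, assuming a
      pthread mutex in place of the kernel lock. The names struct fs_model,
      dirty_list_add(), dirty_list_del() and dirty_list_empty() are
      hypothetical and exist only for this illustration; this is not the
      kernel patch. The point is only that deletion takes the same lock as
      insertion:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Minimal circular doubly linked list node, modelled on list_head. */
struct list_node {
	struct list_node *prev, *next;
};

/* Hypothetical stand-in for the relevant part of btrfs_fs_info. */
struct fs_model {
	pthread_mutex_t trans_lock;   /* protects dirty_roots */
	struct list_node dirty_roots; /* list head */
};

static void fs_model_init(struct fs_model *fs)
{
	pthread_mutex_init(&fs->trans_lock, NULL);
	fs->dirty_roots.prev = fs->dirty_roots.next = &fs->dirty_roots;
}

/* Insert a node at the front of the list, under the lock. */
static void dirty_list_add(struct fs_model *fs, struct list_node *n)
{
	pthread_mutex_lock(&fs->trans_lock);
	n->next = fs->dirty_roots.next;
	n->prev = &fs->dirty_roots;
	fs->dirty_roots.next->prev = n;
	fs->dirty_roots.next = n;
	pthread_mutex_unlock(&fs->trans_lock);
}

/* Remove a node from the list, under the same lock as the insertion. */
static void dirty_list_del(struct fs_model *fs, struct list_node *n)
{
	pthread_mutex_lock(&fs->trans_lock);
	assert(n->next != n);  /* the node must currently be linked */
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = n; /* leave the node self-linked */
	pthread_mutex_unlock(&fs->trans_lock);
}

static int dirty_list_empty(struct fs_model *fs)
{
	int empty;

	pthread_mutex_lock(&fs->trans_lock);
	empty = (fs->dirty_roots.next == &fs->dirty_roots);
	pthread_mutex_unlock(&fs->trans_lock);
	return empty;
}
```

      Without the lock in dirty_list_del(), a concurrent dirty_list_add()
      could observe half-updated prev/next pointers.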
      
      Fixes: bed92eae ("Btrfs: qgroup implementation and prototypes")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: tracepoints: also show actual number of the outstanding extents · 64425500
      Naohiro Aota authored
      The btrfs_inode_mod_outstanding_extents trace event only shows the change
      applied to the number of outstanding extents. It would be helpful to also
      see the resulting number of outstanding extents.
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: update i_version in update_dev_time · c9e561c4
      Jeff Layton authored
      When updating the ctime, we also want to update i_version.
      
      This is just something I noticed by inspection. There is probably no way
      to test this today unless you can somehow get to this inode via nfsd.
      Still, I think it's the right thing to do for consistency's sake.
      
      David Sterba's comment: I don't see anything wrong with setting the
      iversion bit, however I also don't see where this would be useful.
      Agreed with the consistency, otherwise the time is updated when device
      super block is wiped or a device initialized, both are big events so
      missing that due to lack of iversion update seems unlikely. I'll add it
      to the queue, thanks.
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
      [ add comments ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make btrfs_compressed_bioset static · e794203e
      Ben Dooks authored
      The 'btrfs_compressed_bioset' struct isn't exported outside of the
      fs/btrfs/compression.c file, so make it static to fix the following
      sparse warning:
      
      fs/btrfs/compression.c:40:16: warning: symbol 'btrfs_compressed_bioset' was not declared. Should it be static?
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add handling for RAID1C23/DUP to btrfs_reduce_alloc_profile · 160fe8f6
      Matt Corallo authored
      Callers of `btrfs_reduce_alloc_profile` expect it to return exactly
      one allocation profile flag, and failing to do so may ultimately
      result in a WARN_ON and remount-ro when allocating new blocks, like
      the below transaction abort on 6.1.
      
      `btrfs_reduce_alloc_profile` has two ways of determining the profile,
      first it checks if a conversion balance is currently running and
      uses the profile we're converting to. If no balance is currently
      running, it returns the max-redundancy profile which at least one
      block in the selected block group has.
      
      This works by simply checking each known allocation profile bit in
      redundancy order. However, `btrfs_reduce_alloc_profile` has not been
      updated as new flags have been added - first with the `DUP` profile
      and later with the RAID1C34 profiles.
      
      Because of the way it checks, if we have blocks with different
      profiles and at least one is known, that profile will be selected.
      However, if none are known we may return a flag set with multiple
      allocation profiles set.
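
      The selection logic described above can be sketched as a priority scan
      that always returns at most one bit. This is an illustration under
      assumptions, not the kernel function: the PROFILE_* names, their bit
      values and the priority order are invented for the example. What
      matters is that every profile bit appears in the table, so no
      multi-bit set can fall through unreduced:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative profile bits; names and values are assumptions of this sketch. */
enum {
	PROFILE_RAID0   = 1u << 0,
	PROFILE_RAID1   = 1u << 1,
	PROFILE_DUP     = 1u << 2,
	PROFILE_RAID10  = 1u << 3,
	PROFILE_RAID5   = 1u << 4,
	PROFILE_RAID6   = 1u << 5,
	PROFILE_RAID1C3 = 1u << 6,
	PROFILE_RAID1C4 = 1u << 7,
};

/* Assumed priority order, highest first. Every known bit must be listed. */
static const uint32_t profile_priority[] = {
	PROFILE_RAID1C4, PROFILE_RAID6, PROFILE_RAID1C3, PROFILE_RAID5,
	PROFILE_RAID10, PROFILE_RAID1, PROFILE_DUP, PROFILE_RAID0,
};

/* Reduce a set of profile bits to the single highest-priority bit, or 0. */
static uint32_t reduce_profile(uint32_t flags)
{
	uint32_t result = 0;
	size_t i;

	for (i = 0; i < sizeof(profile_priority) / sizeof(profile_priority[0]); i++) {
		if (flags & profile_priority[i]) {
			result = profile_priority[i];
			break;
		}
	}
	/* Postcondition: zero or exactly one bit is set. */
	assert((result & (result - 1)) == 0);
	return result;
}
```

      If DUP, RAID1C3 and RAID1C4 were missing from the table, a set such as
      PROFILE_DUP | PROFILE_RAID1C3 would match nothing, which is the
      situation the commit message describes.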
      
      This is currently only possible when a balance from one of the three
      unhandled profiles to another of the unhandled profiles is canceled
      after allocating at least one block using the new profile.
      
      In that case, a transaction abort like the below will occur and the
      filesystem will need to be mounted with -o skip_balance to get it
      mounted rw again (but the balance cannot be resumed without a
      similar abort).
      
        [770.648] ------------[ cut here ]------------
        [770.648] BTRFS: Transaction aborted (error -22)
        [770.648] WARNING: CPU: 43 PID: 1159593 at fs/btrfs/extent-tree.c:4122 find_free_extent+0x1d94/0x1e00 [btrfs]
        [770.648] CPU: 43 PID: 1159593 Comm: btrfs Tainted: G        W 6.1.0-0.deb11.7-powerpc64le #1  Debian 6.1.20-2~bpo11+1a~test
        [770.648] Hardware name: T2P9D01 REV 1.00 POWER9 0x4e1202 opal:skiboot-bc106a0 PowerNV
        [770.648] NIP:  c00800000f6784fc LR: c00800000f6784f8 CTR: c000000000d746c0
        [770.648] REGS: c000200089afe9a0 TRAP: 0700   Tainted: G        W (6.1.0-0.deb11.7-powerpc64le Debian 6.1.20-2~bpo11+1a~test)
        [770.648] MSR:  9000000002029033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE>  CR: 28848282  XER: 20040000
        [770.648] CFAR: c000000000135110 IRQMASK: 0
      	    GPR00: c00800000f6784f8 c000200089afec40 c00800000f7ea800 0000000000000026
      	    GPR04: 00000001004820c2 c000200089afea00 c000200089afe9f8 0000000000000027
      	    GPR08: c000200ffbfe7f98 c000000002127f90 ffffffffffffffd8 0000000026d6a6e8
      	    GPR12: 0000000028848282 c000200fff7f3800 5deadbeef0000122 c00000002269d000
      	    GPR16: c0002008c7797c40 c000200089afef17 0000000000000000 0000000000000000
      	    GPR20: 0000000000000000 0000000000000001 c000200008bc5a98 0000000000000001
      	    GPR24: 0000000000000000 c0000003c73088d0 c000200089afef17 c000000016d3a800
      	    GPR28: c0000003c7308800 c00000002269d000 ffffffffffffffea 0000000000000001
        [770.648] NIP [c00800000f6784fc] find_free_extent+0x1d94/0x1e00 [btrfs]
        [770.648] LR [c00800000f6784f8] find_free_extent+0x1d90/0x1e00 [btrfs]
        [770.648] Call Trace:
        [770.648] [c000200089afec40] [c00800000f6784f8] find_free_extent+0x1d90/0x1e00 [btrfs] (unreliable)
        [770.648] [c000200089afed30] [c00800000f681398] btrfs_reserve_extent+0x1a0/0x2f0 [btrfs]
        [770.648] [c000200089afeea0] [c00800000f681bf0] btrfs_alloc_tree_block+0x108/0x670 [btrfs]
        [770.648] [c000200089afeff0] [c00800000f66bd68] __btrfs_cow_block+0x170/0x850 [btrfs]
        [770.648] [c000200089aff100] [c00800000f66c58c] btrfs_cow_block+0x144/0x288 [btrfs]
        [770.648] [c000200089aff1b0] [c00800000f67113c] btrfs_search_slot+0x6b4/0xcb0 [btrfs]
        [770.648] [c000200089aff2a0] [c00800000f679f60] lookup_inline_extent_backref+0x128/0x7c0 [btrfs]
        [770.648] [c000200089aff3b0] [c00800000f67b338] lookup_extent_backref+0x70/0x190 [btrfs]
        [770.648] [c000200089aff470] [c00800000f67b54c] __btrfs_free_extent+0xf4/0x1490 [btrfs]
        [770.648] [c000200089aff5a0] [c00800000f67d770] __btrfs_run_delayed_refs+0x328/0x1530 [btrfs]
        [770.648] [c000200089aff740] [c00800000f67ea2c] btrfs_run_delayed_refs+0xb4/0x3e0 [btrfs]
        [770.648] [c000200089aff800] [c00800000f699aa4] btrfs_commit_transaction+0x8c/0x12b0 [btrfs]
        [770.648] [c000200089aff8f0] [c00800000f6dc628] reset_balance_state+0x1c0/0x290 [btrfs]
        [770.648] [c000200089aff9a0] [c00800000f6e2f7c] btrfs_balance+0x1164/0x1500 [btrfs]
        [770.648] [c000200089affb40] [c00800000f6f8e4c] btrfs_ioctl+0x2b54/0x3100 [btrfs]
        [770.648] [c000200089affc80] [c00000000053be14] sys_ioctl+0x794/0x1310
        [770.648] [c000200089affd70] [c00000000002af98] system_call_exception+0x138/0x250
        [770.648] [c000200089affe10] [c00000000000c654] system_call_common+0xf4/0x258
        [770.648] --- interrupt: c00 at 0x7fff94126800
        [770.648] NIP:  00007fff94126800 LR: 0000000107e0b594 CTR: 0000000000000000
        [770.648] REGS: c000200089affe80 TRAP: 0c00   Tainted: G        W (6.1.0-0.deb11.7-powerpc64le Debian 6.1.20-2~bpo11+1a~test)
        [770.648] MSR:  900000000000d033 <SF,HV,EE,PR,ME,IR,DR,RI,LE>  CR: 24002848  XER: 00000000
        [770.648] IRQMASK: 0
      	    GPR00: 0000000000000036 00007fffc9439da0 00007fff94217100 0000000000000003
      	    GPR04: 00000000c4009420 00007fffc9439ee8 0000000000000000 0000000000000000
      	    GPR08: 00000000803c7416 0000000000000000 0000000000000000 0000000000000000
      	    GPR12: 0000000000000000 00007fff9467d120 0000000107e64c9c 0000000107e64d0a
      	    GPR16: 0000000107e64d06 0000000107e64cf1 0000000107e64cc4 0000000107e64c73
      	    GPR20: 0000000107e64c31 0000000107e64bf1 0000000107e64be7 0000000000000000
      	    GPR24: 0000000000000000 00007fffc9439ee0 0000000000000003 0000000000000001
      	    GPR28: 00007fffc943f713 0000000000000000 00007fffc9439ee8 0000000000000000
        [770.648] NIP [00007fff94126800] 0x7fff94126800
        [770.648] LR [0000000107e0b594] 0x107e0b594
        [770.648] --- interrupt: c00
        [770.648] Instruction dump:
        [770.648] 3b00ffe4 e8898828 481175f5 60000000 4bfff4fc 3be00000 4bfff570 3d220000
        [770.648] 7fc4f378 e8698830 4811cd95 e8410018 <0fe00000> f9c10060 f9e10068 fa010070
        [770.648] ---[ end trace 0000000000000000 ]---
        [770.648] BTRFS: error (device dm-2: state A) in find_free_extent_update_loop:4122: errno=-22 unknown
        [770.648] BTRFS info (device dm-2: state EA): forced readonly
        [770.648] BTRFS: error (device dm-2: state EA) in __btrfs_free_extent:3070: errno=-22 unknown
        [770.648] BTRFS error (device dm-2: state EA): failed to run delayed ref for logical 17838685708288 num_bytes 24576 type 184 action 2 ref_mod 1: -22
        [770.648] BTRFS: error (device dm-2: state EA) in btrfs_run_delayed_refs:2144: errno=-22 unknown
        [770.648] BTRFS: error (device dm-2: state EA) in reset_balance_state:3599: errno=-22 unknown
      
      Fixes: 47e6f742 ("btrfs: add support for 3-copy replication (raid1c3)")
      Fixes: 8d6fac00 ("btrfs: add support for 4-copy replication (raid1c4)")
      CC: stable@vger.kernel.org # 5.10+
      Signed-off-by: Matt Corallo <blnxfsl@bluematt.me>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: remove btrfs_fs_info::scrub_wr_completion_workers · 81db6ae8
      Qu Wenruo authored
      Since the scrub rework introduced by commit 2af2aaf9 ("btrfs:
      scrub: introduce structure for new BTRFS_STRIPE_LEN based interface")
      and later commits, scrub only needs one single workqueue,
      fs_info::scrub_worker.
      
      The scrub_wr_completion_workers workqueue was originally used to handle
      the delayed work after write bios finished. But the new scrub code does
      submit-and-wait for write bios, so all the work is done inside
      scrub_worker.
      
      The last user of fs_info::scrub_wr_completion_workers was removed in
      commit 16f93993 ("btrfs: scrub: remove the old writeback
      infrastructure"), so we can safely remove the workqueue.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: remove scrub_ctx::csum_list member · c2bbc0ba
      Qu Wenruo authored
      Since the rework of scrub introduced by commit 2af2aaf9 ("btrfs:
      scrub: introduce structure for new BTRFS_STRIPE_LEN based interface")
      and later commits, scrub no longer keeps its data checksum inside sctx.
      
      Instead we have scrub_stripe::csums for the checksum of the stripe.
      Thus we can remove the unused scrub_ctx::csum_list member.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: do not BUG_ON after failure to migrate space during truncation · 6822b3f7
      Filipe Manana authored
      During truncation we reserve 2 metadata units when starting a transaction
      (reserved space goes to fs_info->trans_block_rsv) and then attempt to
      migrate 1 unit (min_size bytes) from fs_info->trans_block_rsv into the
      local block reserve. If we ever fail we trigger a BUG_ON(), which should
      never happen, because we reserved 2 units. However, if we happen to fail
      for some reason, we don't need to be so dire and hit a BUG_ON(). We can
      just error out the truncation and, since this is highly unexpected,
      surround the error check with WARN_ON() to get an informative stack
      trace and tag the branch as 'unlikely'. Also make the 'min_size' variable
      const, since it's not supposed to ever change and any accidental change
      could possibly make the space migration not so unlikely to fail.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: do not BUG_ON on failure to get dir index for new snapshot · df9f2782
      Filipe Manana authored
      During the transaction commit path, at create_pending_snapshot(), there
      is no need to BUG_ON() in case we fail to get a dir index for the snapshot
      in the parent directory. This should fail very rarely because the parent
      inode should be loaded in memory already, with the respective delayed
      inode created and the parent inode's index_cnt field already initialized.
      
      However if it fails, it may be -ENOMEM like the comment at
      create_pending_snapshot() says or any error returned by
      btrfs_search_slot() through btrfs_set_inode_index_count(), which can be
      pretty much anything such as -EIO or -EUCLEAN for example. So the comment
      is not correct when it says it can only be -ENOMEM.
      
      However doing a BUG_ON() here is overkill, since we can instead abort
      the transaction and return the error. Note that any error returned by
      create_pending_snapshot() will eventually result in a transaction
      abort at cleanup_transaction(), called from btrfs_commit_transaction(),
      but we can explicitly abort the transaction at this point instead so that
      we get a stack trace to tell us that the call to btrfs_set_inode_index()
      failed.
      
      So just abort the transaction and return in case btrfs_set_inode_index()
      returned an error at create_pending_snapshot().
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: send: do not BUG_ON() on unexpected symlink data extent · 6f3eb72a
      Filipe Manana authored
      There's really no need to BUG_ON() if we find a symlink with an extent
      that is not inline or is compressed. We can just make send fail with
      an error (-EUCLEAN) and log an informative error message, so just do
      that instead of BUG_ON().
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: do not BUG_ON() when dropping inode items from log root · fc4026e2
      Filipe Manana authored
      When dropping inode items from a log tree at drop_inode_items(), we have this
      BUG_ON() on the result of btrfs_search_slot() because we don't expect an
      exact match since having a key with an offset of (u64)-1 is unexpected.
      That is generally true, but for dir index keys for example, we can get a
      key with such an offset value, even though it's very unlikely and it would
      take ages to increase the sequence counter for a dir index up to (u64)-1.
      We can deal with an exact match, we just have to delete the key at that
      slot, so there is really no need to BUG_ON(), error out or trigger any
      warning. So remove the BUG_ON().
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: replace BUG_ON() at split_item() with proper error handling · 7569141e
      Filipe Manana authored
      There's no need to BUG_ON() at split_item() if the leaf does not have
      enough free space for the new item, we can just return -ENOSPC since
      the caller can deal with errors from split_item(). Also, as this is a
      very unlikely condition to happen, because the caller has invoked
      setup_leaf_for_split() before calling split_item(), surround the
      condition with a WARN_ON() which makes it easier to notice this
      unexpected condition and tags the if branch with 'unlikely' as well.
      
      I've actually once hit this BUG_ON() with some incorrect code changes
      I had, which was very inconvenient as it required rebooting the VM.
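
      The BUG_ON() to WARN_ON() conversion can be illustrated in userspace.
      This is a sketch under assumptions, not the patched function: the
      WARN_ON() and unlikely() macros below are simplified stand-ins for the
      kernel ones (unlikely() relies on the GCC/Clang __builtin_expect
      builtin), and struct leaf_model and split_item_model() are
      hypothetical names:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Branch hint; __builtin_expect is a GCC/Clang builtin. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Userspace stand-in for WARN_ON(): report the condition, then return it. */
static int warn_on(int cond, const char *expr, const char *file, int line)
{
	if (cond)
		fprintf(stderr, "WARNING: %s at %s:%d\n", expr, file, line);
	return cond;
}
#define WARN_ON(cond) unlikely(warn_on(!!(cond), #cond, __FILE__, __LINE__))

/* Hypothetical leaf with a fixed budget of free bytes. */
struct leaf_model {
	unsigned int free_space;
};

/* Reserve room for a new item, or fail with -ENOSPC instead of crashing. */
static int split_item_model(struct leaf_model *leaf, unsigned int item_size)
{
	assert(leaf != NULL);

	/* Unexpected, but recoverable: warn and let the caller handle it. */
	if (WARN_ON(leaf->free_space < item_size))
		return -ENOSPC;

	leaf->free_space -= item_size;
	return 0;
}
```

      The caller sees an ordinary error code and the warning still records
      where the unexpected condition was hit.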
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: do not BUG_ON() on tree mod log failures at btrfs_del_ptr() · 751a2761
      Filipe Manana authored
      At btrfs_del_ptr(), instead of doing a BUG_ON() in case we fail to record
      tree mod log operations, do a transaction abort and return the error to
      the callers. There's really no need for the BUG_ON() as we can release all
      resources in the context of all callers, and we have to abort because other
      future tree searches that use the tree mod log (btrfs_search_old_slot())
      may get inconsistent results if other operations modify the tree after
      that failure and before the tree mod log based search.
      
      This implies making btrfs_del_ptr() return an int instead of void, and
      making all callers check for returned errors.
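
      A minimal sketch of that void-to-int change, assuming a toy
      transaction model. All names here (struct trans_model,
      abort_transaction(), log_remove_key(), del_ptr_model(),
      del_leaf_model()) are hypothetical and only mirror the shape of the
      change: abort at the point of failure, then propagate the error
      through every caller:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical transaction: remembers the first error that aborted it. */
struct trans_model {
	int aborted; /* 0 while healthy, negative errno once aborted */
};

static void abort_transaction(struct trans_model *trans, int err)
{
	if (!trans->aborted)
		trans->aborted = err;
}

/* Stand-in for recording a tree mod log operation; can be made to fail. */
static int log_remove_key(int inject_failure)
{
	return inject_failure ? -ENOMEM : 0;
}

/* Previously void with a BUG_ON(); now reports failure to the caller. */
static int del_ptr_model(struct trans_model *trans, int *nritems,
			 int inject_failure)
{
	int ret;

	assert(*nritems > 0);
	ret = log_remove_key(inject_failure);
	if (ret < 0) {
		abort_transaction(trans, ret);
		return ret;
	}
	(*nritems)--;
	return 0;
}

/* A caller now has to check the return value and pass the error up. */
static int del_leaf_model(struct trans_model *trans, int *nritems,
			  int inject_failure)
{
	int ret = del_ptr_model(trans, nritems, inject_failure);

	if (ret < 0)
		return ret;
	/* ... further work would only happen on success ... */
	return 0;
}
```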
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: do not BUG_ON() on tree mod log failures at insert_ptr() · 50b5d1fc
      Filipe Manana authored
      At insert_ptr(), instead of doing a BUG_ON() in case we fail to record
      tree mod log operations, do a transaction abort and return the error to
      the callers. There's really no need for the BUG_ON() as we can release all
      resources in the context of all callers, and we have to abort because other
      future tree searches that use the tree mod log (btrfs_search_old_slot())
      may get inconsistent results if other operations modify the tree after
      that failure and before the tree mod log based search.
      
      This implies making insert_ptr() return an int instead of void, and making
      all callers check for returned errors.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: do not BUG_ON() on tree mod log failure at insert_new_root() · f61aa7ba
      Filipe Manana authored
      At insert_new_root(), instead of doing a BUG_ON() in case we fail to
      record the tree mod log operation, just return the error to the callers
      after releasing the allocated tree block. At this point we haven't made
      any changes to the b+tree, so we haven't left it in an inconsistent state
      and therefore have no need to abort the transaction. All we need to do is
      to unlock and free the extent buffer we just allocated with the purpose
      of making it the new root.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: do not BUG_ON() on tree mod log failures at push_nodes_for_insert() · 11d6ae03
      Filipe Manana authored
      At push_nodes_for_insert(), instead of doing a BUG_ON() in case we fail to
      record tree mod log operations, do a transaction abort and return the
      error to the caller. There's really no need for the BUG_ON() as we can
      release all resources in this context, and we have to abort because other
      future tree searches that use the tree mod log (btrfs_search_old_slot())
      may get inconsistent results if other operations modify the tree after
      that failure and before the tree mod log based search.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: abort transaction at update_ref_for_cow() when ref count is zero · eced687e
      Filipe Manana authored
      At update_ref_for_cow() we are calling btrfs_handle_fs_error() if we find
      that the extent buffer has an unexpected ref count of zero, however we can
      simply use btrfs_abort_transaction(), which achieves the same purposes: to
      turn the fs to error state, abort the current transaction and turn the fs
      to RO mode as well. Besides that, btrfs_abort_transaction() also prints a
      stack trace which makes it more useful.
      
      Also, as this is a very unexpected situation, indicating a serious
      corruption/inconsistency, tag the if branch as 'unlikely', set the error
      code to -EUCLEAN instead of -EROFS, and log an explicit message.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: abort transaction at balance_level() when left child is missing · 725026ed
      Filipe Manana authored
      At balance_level() we are calling btrfs_handle_fs_error() when the middle
      child only has 1 item and the left child is missing, however we can simply
      use btrfs_abort_transaction(), which achieves the same purposes: to turn
      the fs to error state, abort the current transaction and turn the fs to
      RO mode. Besides that, btrfs_abort_transaction() also prints a stack trace
      which makes it more useful.
      
      Also, as this is a highly unexpected case and it's about a b+tree
      inconsistency, change the error code from -EROFS to -EUCLEAN, tag the if
      branch as 'unlikely' and log an explicit error message.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: avoid unnecessarily setting the fs to RO and error state at balance_level() · 87b8e9d0
      Filipe Manana authored
      At balance_level(), when trying to promote a child node to a root node, if
      we fail to read the child we call btrfs_handle_fs_error(), which turns the
      fs to RO mode and sets it to error state as well, causing any ongoing
      transaction to abort. This however is not necessary because at that point
      we have not made any change yet at balance_level(), so any error reading
      the child node does not leave us in any inconsistent state. Therefore we
      can just return the error to the caller.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename enospc label to out at balance_level() · daefe4d4
      Filipe Manana authored
      At balance_level() we have this 'enospc' label where we jump to in case
      we get an error at several places. However that error is certainly not
      -ENOSPC in all cases, it can be -EIO or -ENOMEM when reading a child
      extent buffer for example, or -ENOMEM when trying to record tree mod log
      operations. So to make this less confusing, rename the label to 'out'.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: do not BUG_ON() on tree mod log failure at balance_level() · 39020d8a
      Filipe Manana authored
      At balance_level(), instead of doing a BUG_ON() in case we fail to record
      tree mod log operations, do a transaction abort and return the error to
      the callers. There's really no need for the BUG_ON() as we can release
      all resources in this context, and we have to abort because other future
      tree searches that use the tree mod log (btrfs_search_old_slot()) may get
      inconsistent results if other operations modify the tree after that
      failure and before the tree mod log based search.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: do not BUG_ON() on tree mod log failure at __btrfs_cow_block() · 40b0a749
      Filipe Manana authored
      At __btrfs_cow_block(), instead of doing a BUG_ON() in case we fail to
      record a tree mod log root insertion operation, do a transaction abort
      instead. There's really no need for the BUG_ON(), we can properly
      release all resources in this context and turn the filesystem to RO mode
      and in an error state instead.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: avoid tree mod log ENOMEM failures when we don't need to log · 8793ed87
      Filipe Manana authored
      When logging tree mod log operations we start by checking, in a lockless
      manner, if we need to log - if we don't, we just return and do nothing,
      otherwise we will allocate one or more tree mod log operations and then
      check again if we need to log. This second check will take the tree mod
      log lock in write mode if we need to log, otherwise it will do nothing
      and we just free the allocated memory and return success.
      
      We can improve on this by not returning an error in case the memory
      allocations fail, unless the second check tells us that we actually need
      to log. That is, if we fail to allocate memory and the second check tells
      us that we don't need to log, we can just return success and avoid
      returning -ENOMEM to the caller. Currently tree mod log failures are
      dealt with by either a BUG_ON() or a transaction abort, as tree mod log
      operations are logged in code paths that modify a b+tree.
      
      So just avoid failing with -ENOMEM if we fail to allocate a tree mod log
      operation unless we actually need to log the operations, that is, if
      tree_mod_dont_log() returns false.
      
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
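      
      The "check, allocate, check again" flow described in this commit can be
      sketched in user-space C. This is only an illustration, not the kernel
      code: struct mod_log, dont_log() and log_op_after_alloc() are hypothetical
      names, and the real second check runs under the tree mod log lock.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the tree mod log state; not the kernel struct. */
struct mod_log {
    bool have_users;   /* true when some reader needs the log */
    int  logged;       /* number of operations recorded so far */
};

/* Returns true when logging can be skipped (mirrors the "dont log" naming). */
static bool dont_log(const struct mod_log *log)
{
    return !log->have_users;
}

/*
 * Second half of the logging path: 'op' is the result of an earlier
 * allocation attempt and may be NULL if that allocation failed.
 */
static int log_op_after_alloc(struct mod_log *log, void *op)
{
    /* Second check (taken under the lock in the real code). */
    if (dont_log(log)) {
        free(op);          /* free(NULL) is a no-op */
        return 0;          /* a failed allocation does not matter here */
    }

    /* We really need to log, so only now is a failed allocation an error. */
    if (!op)
        return -ENOMEM;

    log->logged++;
    free(op);              /* the sketch does not keep the entry around */
    return 0;
}
```

      The point of the ordering is that the NULL check comes after the second
      "do we need to log" check, so an allocation failure is only reported
      when the operation actually had to be logged.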
    • Filipe Manana's avatar
      btrfs: fix extent buffer leak after tree mod log failure at split_node() · ede600e4
      Filipe Manana authored
      At split_node(), if we fail to log the tree mod log copy operation, we
      return without unlocking the split extent buffer we just allocated and
      without decrementing the reference we own on it. Fix this by unlocking
      it and decrementing the ref count before returning.
      
      Fixes: 5de865ee ("Btrfs: fix tree mod logging")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
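      
      The shape of this fix, releasing what we own on the error path, can be
      sketched as follows. The types and helpers here (struct eb_sketch,
      log_copy(), split_sketch()) are hypothetical and only model a reference
      count and a lock flag; they are not the btrfs extent buffer API.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical extent buffer: just a reference count and a lock flag. */
struct eb_sketch {
    int  refs;
    bool locked;
};

/* Hypothetical stand-in for the tree mod log copy; fails when asked to. */
static int log_copy(bool fail)
{
    return fail ? -ENOMEM : 0;
}

/*
 * Models the relevant part of a node split: 'split' was just allocated,
 * so we hold one reference on it and it is locked.
 */
static int split_sketch(struct eb_sketch *split, bool log_fails)
{
    int ret;

    split->refs = 1;
    split->locked = true;

    ret = log_copy(log_fails);
    if (ret) {
        /* Error path: undo both things we own before returning. */
        split->locked = false;
        split->refs--;
        return ret;
    }

    /* Success: the caller keeps using the locked, referenced buffer. */
    return 0;
}
```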
    • Filipe Manana's avatar
      btrfs: add missing error handling when logging operation while COWing extent buffer · d09c5152
      Filipe Manana authored
      When COWing an extent buffer that is not the root node, we need to log in
      the tree mod log that we replaced a pointer in the parent node, otherwise
      a tree mod log user doing a search on the b+tree can return incorrect
      results (that miss something). We are doing the call to
      btrfs_tree_mod_log_insert_key() but we totally ignore its return value.
      
      So fix this by adding the missing error handling, resulting in a
      transaction abort and freeing the COWed extent buffer.
      
      Fixes: f230475e ("Btrfs: put all block modifications into the tree mod log")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Christoph Hellwig's avatar
      btrfs: set FMODE_CAN_ODIRECT instead of a dummy direct_IO method · f02c75e6
      Christoph Hellwig authored
      Since commit a2ad63da ("VFS: add FMODE_CAN_ODIRECT file flag") file
      systems can just set the FMODE_CAN_ODIRECT flag at open time instead of
      wiring up a dummy direct_IO method to indicate support for direct I/O.
      Do that for btrfs so that noop_direct_IO can eventually be removed.
      
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Filipe Manana's avatar
      btrfs: update documentation for a block group's bg_list member · aadb164b
      Filipe Manana authored
      Currently we are only documenting two uses of the bg_list member of a
      block group, but there are two more:
      
      1) To track deleted block groups for discard purposes, introduced in
         commit e33e17ee ("btrfs: add missing discards when unpinning
         extents with -o discard");
      
      2) To track block groups for automatic reclaim, introduced more recently
         by commit 18bb8bbf ("btrfs: zoned: automatically reclaim zones").
      
      So document those two other use cases.
      
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Naohiro Aota's avatar
      btrfs: reinsert BGs failed to reclaim · 7e271809
      Naohiro Aota authored
      The reclaim process can temporarily fail. For example, if space is
      getting tight, it fails to make the block group read-only. If there are no
      further writes to that block group, it will never get back onto the
      reclaim list and is never reclaimed. Under certain workloads, this can
      leave many such block groups unreclaimed.
      
      So, put it back on the list and give it another chance to be reclaimed.
      
      Fixes: 18bb8bbf ("btrfs: zoned: automatically reclaim zones")
      CC: stable@vger.kernel.org # 5.15+
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
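      
      A minimal sketch of the idea, assuming a simple array-backed queue in
      place of the kernel's linked list. struct bg, struct reclaim_list,
      list_push() and reclaim_pass() are hypothetical names used only for
      illustration; a block group that cannot be made read-only right now is
      queued again instead of being dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_BGS 8

/* Hypothetical block group: only the fields the sketch needs. */
struct bg {
    int  id;
    bool can_set_ro;   /* false models "space is tight right now" */
    bool reclaimed;
};

/* Hypothetical fixed-size queue standing in for the reclaim list. */
struct reclaim_list {
    struct bg *items[MAX_BGS];
    size_t     count;
};

static void list_push(struct reclaim_list *l, struct bg *bg)
{
    assert(l->count < MAX_BGS);
    l->items[l->count++] = bg;
}

/* One reclaim pass; returns how many block groups were reclaimed. */
static size_t reclaim_pass(struct reclaim_list *l)
{
    struct reclaim_list retry = { .count = 0 };
    size_t done = 0;

    for (size_t i = 0; i < l->count; i++) {
        struct bg *bg = l->items[i];

        if (!bg->can_set_ro) {
            /* Temporary failure: queue it again instead of dropping it. */
            list_push(&retry, bg);
            continue;
        }
        bg->reclaimed = true;
        done++;
    }

    /* Failed block groups stay on the list for the next pass. */
    *l = retry;
    return done;
}
```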
    • Naohiro Aota's avatar
      btrfs: bail out reclaim process if filesystem is read-only · 93463ff7
      Naohiro Aota authored
      When a filesystem is read-only, we cannot reclaim a block group because
      reclaim has to rewrite its data. Just bail out in that case.
      
      Note that it is fine to drop the block groups from the list in this
      case. As we already did sb_start_write(), a read-only filesystem here
      means we hit a fatal error and were forced read-only, so there is no
      chance to reclaim them again.
      
      Fixes: 18bb8bbf ("btrfs: zoned: automatically reclaim zones")
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Naohiro Aota's avatar
      btrfs: move out now unused BG from the reclaim list · a9f18971
      Naohiro Aota authored
      An unused block group is easy to remove to free up space and should be
      reclaimed fast. Such a block group can often already be a target of the
      reclaim process. As we check list_empty(&bg->bg_list), we keep it in the
      reclaim list, so that block group is not reclaimed until the file system
      is filled, e.g. up to 75%.
      
      Instead, we can move the unused block group to the unused list and
      delete it fast.
      
      Fixes: 18bb8bbf ("btrfs: zoned: automatically reclaim zones")
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Naohiro Aota's avatar
      btrfs: delete unused BGs while reclaiming BGs · 3ed01616
      Naohiro Aota authored
      The reclaim process only starts after the filesystem volumes are
      allocated to a certain level (75% by default). Thus, the list of block
      groups targeted for reclaim can grow very large by the time the reclaim
      process kicks in. On a test run, there were over 1000 BGs in the
      reclaim list.
      
      As the reclaim involves rewriting the data, it takes a really long time
      to reclaim the BGs. While the reclaim is running, btrfs_delete_unused_bgs()
      won't proceed because the reclaim side is holding
      fs_info->reclaim_bgs_lock. As a result, we will have a large number of
      unused BGs kept in the unused list. On my test run, I got 1057 unused BGs.
      
      Since deleting a block group is relatively easy and fast work, we can call
      btrfs_delete_unused_bgs() while reclaiming BGs, to avoid building up
      unused BGs.
      
      Fixes: 18bb8bbf ("btrfs: zoned: automatically reclaim zones")
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
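      
      The effect of interleaving the cheap deletion with the slow reclaim loop
      can be illustrated with a toy model. Everything here is an assumption
      made for the sketch: struct fs_sketch, delete_unused() and reclaim_all()
      are hypothetical, counters replace the real lists, and locking is left
      out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counters standing in for the unused list. */
struct fs_sketch {
    size_t unused;        /* block groups waiting on the unused list */
    size_t deleted;       /* block groups deleted so far */
    size_t peak_unused;   /* largest backlog seen on the unused list */
};

/* Cheap operation: drop everything currently on the unused list. */
static void delete_unused(struct fs_sketch *fs)
{
    fs->deleted += fs->unused;
    fs->unused = 0;
}

/*
 * Reclaim 'n' block groups. Each (slow) reclaim step is assumed to leave
 * 'new_unused_per_step' block groups unused, for example the one that
 * was just emptied by relocation.
 */
static void reclaim_all(struct fs_sketch *fs, size_t n,
                        size_t new_unused_per_step)
{
    for (size_t i = 0; i < n; i++) {
        fs->unused += new_unused_per_step;
        if (fs->unused > fs->peak_unused)
            fs->peak_unused = fs->unused;

        /* Interleaved cleanup keeps the unused list short. */
        delete_unused(fs);
    }
}
```

      Without the delete_unused() call inside the loop, the backlog in this
      model would grow to n * new_unused_per_step before anything is deleted.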
    • Christoph Hellwig's avatar
      btrfs: use btrfs_finish_ordered_extent to complete buffered writes · 0d394cca
      Christoph Hellwig authored
      Use the btrfs_finish_ordered_extent helper to complete buffered writes
      using the bbio->ordered pointer instead of requiring an rbtree lookup
      in the otherwise equivalent btrfs_mark_ordered_io_finished called from
      btrfs_writepage_endio_finish_ordered.
      
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Christoph Hellwig's avatar
      btrfs: use btrfs_finish_ordered_extent to complete direct writes · b41b6f69
      Christoph Hellwig authored
      Use the btrfs_finish_ordered_extent helper to complete direct writes
      using the bbio->ordered pointer instead of requiring an rbtree lookup
      in the otherwise equivalent btrfs_mark_ordered_io_finished called from
      btrfs_writepage_endio_finish_ordered.
      
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Christoph Hellwig's avatar
      btrfs: use btrfs_finish_ordered_extent to complete compressed writes · 7dd43954
      Christoph Hellwig authored
      Use the btrfs_finish_ordered_extent helper to complete compressed writes
      using the bbio->ordered pointer instead of requiring an rbtree lookup
      in the otherwise equivalent btrfs_mark_ordered_io_finished called from
      btrfs_writepage_endio_finish_ordered.
      
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Christoph Hellwig's avatar
      btrfs: open code end_extent_writepage in end_bio_extent_writepage · 4ba8223d
      Christoph Hellwig authored
      This prepares for switching to more efficient ordered_extent processing
      and already removes the back-and-forth conversion from len to end and
      back to len.
      
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Christoph Hellwig's avatar
      btrfs: add a btrfs_finish_ordered_extent helper · 122e9ede
      Christoph Hellwig authored
      Add a helper to complete an ordered_extent without first doing a lookup.
      The tracepoint cannot use the ordered_extent class as we also want to
      print the range.
      
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Christoph Hellwig's avatar
      btrfs: factor out a btrfs_queue_ordered_fn helper · 2d6f107e
      Christoph Hellwig authored
      Factor out a helper to queue up an ordered_extent completion in a work
      queue.  This new helper will later be used to complete an ordered_extent
      without first doing a lookup.
      
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Christoph Hellwig's avatar
      btrfs: factor out a can_finish_ordered_extent helper · 53df2586
      Christoph Hellwig authored
      Factor out a helper from btrfs_mark_ordered_io_finished that does the
      actual per-ordered_extent work to check if we want to schedule an I/O
      completion.  This new helper will later be used to complete an
      ordered_extent without first doing a lookup.
      
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Christoph Hellwig's avatar
      btrfs: use bbio->ordered in btrfs_csum_one_bio · c59360f6
      Christoph Hellwig authored
      Use the ordered_extent pointer in the btrfs_bio instead of looking it
      up manually.
      
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Christoph Hellwig's avatar
      btrfs: add an ordered_extent pointer to struct btrfs_bio · ec63b84d
      Christoph Hellwig authored
      Add a pointer to the ordered_extent to the existing union in struct
      btrfs_bio, so all code dealing with data write bios can just use a
      pointer dereference to retrieve the ordered_extent instead of doing
      multiple rbtree lookups per I/O.
      
      The reference to this ordered_extent is dropped at end I/O time,
      which implies that an extra one must be acquired when the bio is split.
      This also requires moving the btrfs_extract_ordered_extent call into
      btrfs_split_bio so that the invariant of always having a valid
      ordered_extent reference for the btrfs_bio is kept.
      
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
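      
      The reference counting rule described here (each bio holds a reference
      that is dropped at end I/O, so a split must take an extra one) can be
      sketched like this. struct ordered, struct bio_sketch and the three
      helpers are hypothetical simplifications, not the btrfs_bio API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ordered extent: only a reference count. */
struct ordered {
    int refs;
};

/* Hypothetical bio that carries a direct pointer to its ordered extent. */
struct bio_sketch {
    struct ordered *ordered;
};

/* Attach an ordered extent to a bio; the bio now owns one reference. */
static void bio_attach(struct bio_sketch *bio, struct ordered *o)
{
    o->refs++;
    bio->ordered = o;
}

/* Splitting creates a second bio, which needs its own reference. */
static void bio_split(const struct bio_sketch *orig, struct bio_sketch *split)
{
    bio_attach(split, orig->ordered);
}

/* End I/O: a plain pointer dereference, no lookup, then drop the ref. */
static void bio_end_io(struct bio_sketch *bio)
{
    assert(bio->ordered != NULL && bio->ordered->refs > 0);
    bio->ordered->refs--;
    bio->ordered = NULL;
}
```

      With this rule every live bio always points at an ordered extent that it
      holds a reference on, which is the invariant the commit message refers
      to.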