1. 01 Oct, 2024 7 commits
    • Filipe Manana's avatar
      btrfs: wait for fixup workers before stopping cleaner kthread during umount · 41fd1e94
      Filipe Manana authored
      During unmount, at close_ctree(), we have the following steps in this order:
      
      1) Park the cleaner kthread - this doesn't destroy the kthread, it basically
         halts its execution (wake ups against it work but do nothing);
      
      2) We stop the cleaner kthread - this results in freeing the respective
         struct task_struct;
      
      3) We call btrfs_stop_all_workers() which waits for any jobs running in all
         the work queues and then free the work queues.
      
      Syzbot reported a case where a fixup worker resulted in a crash when doing
      a delayed iput on its inode while attempting to wake up the cleaner at
      btrfs_add_delayed_iput(), because the task_struct of the cleaner kthread
      was already freed. This can happen during unmount because we don't wait
      for any fixup workers still running before we call kthread_stop() against
      the cleaner kthread, which stops and free all its resources.
      
      Fix this by waiting for any fixup workers at close_ctree() before we call
      kthread_stop() against the cleaner and run pending delayed iputs.
      
      The stack traces reported by syzbot were the following:
      
        BUG: KASAN: slab-use-after-free in __lock_acquire+0x77/0x2050 kernel/locking/lockdep.c:5065
        Read of size 8 at addr ffff8880272a8a18 by task kworker/u8:3/52
      
        CPU: 1 UID: 0 PID: 52 Comm: kworker/u8:3 Not tainted 6.12.0-rc1-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
        Workqueue: btrfs-fixup btrfs_work_helper
        Call Trace:
         <TASK>
         __dump_stack lib/dump_stack.c:94 [inline]
         dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
         print_address_description mm/kasan/report.c:377 [inline]
         print_report+0x169/0x550 mm/kasan/report.c:488
         kasan_report+0x143/0x180 mm/kasan/report.c:601
         __lock_acquire+0x77/0x2050 kernel/locking/lockdep.c:5065
         lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5825
         __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
         _raw_spin_lock_irqsave+0xd5/0x120 kernel/locking/spinlock.c:162
         class_raw_spinlock_irqsave_constructor include/linux/spinlock.h:551 [inline]
         try_to_wake_up+0xb0/0x1480 kernel/sched/core.c:4154
         btrfs_writepage_fixup_worker+0xc16/0xdf0 fs/btrfs/inode.c:2842
         btrfs_work_helper+0x390/0xc50 fs/btrfs/async-thread.c:314
         process_one_work kernel/workqueue.c:3229 [inline]
         process_scheduled_works+0xa63/0x1850 kernel/workqueue.c:3310
         worker_thread+0x870/0xd30 kernel/workqueue.c:3391
         kthread+0x2f0/0x390 kernel/kthread.c:389
         ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
         ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
         </TASK>
      
        Allocated by task 2:
         kasan_save_stack mm/kasan/common.c:47 [inline]
         kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
         unpoison_slab_object mm/kasan/common.c:319 [inline]
         __kasan_slab_alloc+0x66/0x80 mm/kasan/common.c:345
         kasan_slab_alloc include/linux/kasan.h:247 [inline]
         slab_post_alloc_hook mm/slub.c:4086 [inline]
         slab_alloc_node mm/slub.c:4135 [inline]
         kmem_cache_alloc_node_noprof+0x16b/0x320 mm/slub.c:4187
         alloc_task_struct_node kernel/fork.c:180 [inline]
         dup_task_struct+0x57/0x8c0 kernel/fork.c:1107
         copy_process+0x5d1/0x3d50 kernel/fork.c:2206
         kernel_clone+0x223/0x880 kernel/fork.c:2787
         kernel_thread+0x1bc/0x240 kernel/fork.c:2849
         create_kthread kernel/kthread.c:412 [inline]
         kthreadd+0x60d/0x810 kernel/kthread.c:765
         ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
         ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
      
        Freed by task 61:
         kasan_save_stack mm/kasan/common.c:47 [inline]
         kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
         kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:579
         poison_slab_object mm/kasan/common.c:247 [inline]
         __kasan_slab_free+0x59/0x70 mm/kasan/common.c:264
         kasan_slab_free include/linux/kasan.h:230 [inline]
         slab_free_hook mm/slub.c:2343 [inline]
         slab_free mm/slub.c:4580 [inline]
         kmem_cache_free+0x1a2/0x420 mm/slub.c:4682
         put_task_struct include/linux/sched/task.h:144 [inline]
         delayed_put_task_struct+0x125/0x300 kernel/exit.c:228
         rcu_do_batch kernel/rcu/tree.c:2567 [inline]
         rcu_core+0xaaa/0x17a0 kernel/rcu/tree.c:2823
         handle_softirqs+0x2c5/0x980 kernel/softirq.c:554
         __do_softirq kernel/softirq.c:588 [inline]
         invoke_softirq kernel/softirq.c:428 [inline]
         __irq_exit_rcu+0xf4/0x1c0 kernel/softirq.c:637
         irq_exit_rcu+0x9/0x30 kernel/softirq.c:649
         instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1037 [inline]
         sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1037
         asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:702
      
        Last potentially related work creation:
         kasan_save_stack+0x3f/0x60 mm/kasan/common.c:47
         __kasan_record_aux_stack+0xac/0xc0 mm/kasan/generic.c:541
         __call_rcu_common kernel/rcu/tree.c:3086 [inline]
         call_rcu+0x167/0xa70 kernel/rcu/tree.c:3190
         context_switch kernel/sched/core.c:5318 [inline]
         __schedule+0x184b/0x4ae0 kernel/sched/core.c:6675
         schedule_idle+0x56/0x90 kernel/sched/core.c:6793
         do_idle+0x56a/0x5d0 kernel/sched/idle.c:354
         cpu_startup_entry+0x42/0x60 kernel/sched/idle.c:424
         start_secondary+0x102/0x110 arch/x86/kernel/smpboot.c:314
         common_startup_64+0x13e/0x147
      
        The buggy address belongs to the object at ffff8880272a8000
         which belongs to the cache task_struct of size 7424
        The buggy address is located 2584 bytes inside of
         freed 7424-byte region [ffff8880272a8000, ffff8880272a9d00)
      
        The buggy address belongs to the physical page:
        page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x272a8
        head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
        flags: 0xfff00000000040(head|node=0|zone=1|lastcpupid=0x7ff)
        page_type: f5(slab)
        raw: 00fff00000000040 ffff88801bafa500 dead000000000122 0000000000000000
        raw: 0000000000000000 0000000080040004 00000001f5000000 0000000000000000
        head: 00fff00000000040 ffff88801bafa500 dead000000000122 0000000000000000
        head: 0000000000000000 0000000080040004 00000001f5000000 0000000000000000
        head: 00fff00000000003 ffffea00009caa01 ffffffffffffffff 0000000000000000
        head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
        page dumped because: kasan: bad access detected
        page_owner tracks the page as allocated
        page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 2, tgid 2 (kthreadd), ts 71247381401, free_ts 71214998153
         set_page_owner include/linux/page_owner.h:32 [inline]
         post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1537
         prep_new_page mm/page_alloc.c:1545 [inline]
         get_page_from_freelist+0x3039/0x3180 mm/page_alloc.c:3457
         __alloc_pages_noprof+0x256/0x6c0 mm/page_alloc.c:4733
         alloc_pages_mpol_noprof+0x3e8/0x680 mm/mempolicy.c:2265
         alloc_slab_page+0x6a/0x120 mm/slub.c:2413
         allocate_slab+0x5a/0x2f0 mm/slub.c:2579
         new_slab mm/slub.c:2632 [inline]
         ___slab_alloc+0xcd1/0x14b0 mm/slub.c:3819
         __slab_alloc+0x58/0xa0 mm/slub.c:3909
         __slab_alloc_node mm/slub.c:3962 [inline]
         slab_alloc_node mm/slub.c:4123 [inline]
         kmem_cache_alloc_node_noprof+0x1fe/0x320 mm/slub.c:4187
         alloc_task_struct_node kernel/fork.c:180 [inline]
         dup_task_struct+0x57/0x8c0 kernel/fork.c:1107
         copy_process+0x5d1/0x3d50 kernel/fork.c:2206
         kernel_clone+0x223/0x880 kernel/fork.c:2787
         kernel_thread+0x1bc/0x240 kernel/fork.c:2849
         create_kthread kernel/kthread.c:412 [inline]
         kthreadd+0x60d/0x810 kernel/kthread.c:765
         ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
         ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
        page last free pid 5230 tgid 5230 stack trace:
         reset_page_owner include/linux/page_owner.h:25 [inline]
         free_pages_prepare mm/page_alloc.c:1108 [inline]
         free_unref_page+0xcd0/0xf00 mm/page_alloc.c:2638
         discard_slab mm/slub.c:2678 [inline]
         __put_partials+0xeb/0x130 mm/slub.c:3146
         put_cpu_partial+0x17c/0x250 mm/slub.c:3221
         __slab_free+0x2ea/0x3d0 mm/slub.c:4450
         qlink_free mm/kasan/quarantine.c:163 [inline]
         qlist_free_all+0x9a/0x140 mm/kasan/quarantine.c:179
         kasan_quarantine_reduce+0x14f/0x170 mm/kasan/quarantine.c:286
         __kasan_slab_alloc+0x23/0x80 mm/kasan/common.c:329
         kasan_slab_alloc include/linux/kasan.h:247 [inline]
         slab_post_alloc_hook mm/slub.c:4086 [inline]
         slab_alloc_node mm/slub.c:4135 [inline]
         kmem_cache_alloc_noprof+0x135/0x2a0 mm/slub.c:4142
         getname_flags+0xb7/0x540 fs/namei.c:139
         do_sys_openat2+0xd2/0x1d0 fs/open.c:1409
         do_sys_open fs/open.c:1430 [inline]
         __do_sys_openat fs/open.c:1446 [inline]
         __se_sys_openat fs/open.c:1441 [inline]
         __x64_sys_openat+0x247/0x2a0 fs/open.c:1441
         do_syscall_x64 arch/x86/entry/common.c:52 [inline]
         do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
         entry_SYSCALL_64_after_hwframe+0x77/0x7f
      
        Memory state around the buggy address:
         ffff8880272a8900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
         ffff8880272a8980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        >ffff8880272a8a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                    ^
         ffff8880272a8a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
         ffff8880272a8b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ==================================================================
      
      Reported-by: syzbot+8aaf2df2ef0164ffe1fb@syzkaller.appspotmail.com
      Link: https://lore.kernel.org/linux-btrfs/66fb36b1.050a0220.aab67.003b.GAE@google.com/
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      41fd1e94
    • Qu Wenruo's avatar
      btrfs: fix a NULL pointer dereference when failed to start a new trasacntion · c3b47f49
      Qu Wenruo authored
      [BUG]
      Syzbot reported a NULL pointer dereference with the following crash:
      
        FAULT_INJECTION: forcing a failure.
         start_transaction+0x830/0x1670 fs/btrfs/transaction.c:676
         prepare_to_relocate+0x31f/0x4c0 fs/btrfs/relocation.c:3642
         relocate_block_group+0x169/0xd20 fs/btrfs/relocation.c:3678
        ...
        BTRFS info (device loop0): balance: ended with status: -12
        Oops: general protection fault, probably for non-canonical address 0xdffffc00000000cc: 0000 [#1] PREEMPT SMP KASAN NOPTI
        KASAN: null-ptr-deref in range [0x0000000000000660-0x0000000000000667]
        RIP: 0010:btrfs_update_reloc_root+0x362/0xa80 fs/btrfs/relocation.c:926
        Call Trace:
         <TASK>
         commit_fs_roots+0x2ee/0x720 fs/btrfs/transaction.c:1496
         btrfs_commit_transaction+0xfaf/0x3740 fs/btrfs/transaction.c:2430
         del_balance_item fs/btrfs/volumes.c:3678 [inline]
         reset_balance_state+0x25e/0x3c0 fs/btrfs/volumes.c:3742
         btrfs_balance+0xead/0x10c0 fs/btrfs/volumes.c:4574
         btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3673
         vfs_ioctl fs/ioctl.c:51 [inline]
         __do_sys_ioctl fs/ioctl.c:907 [inline]
         __se_sys_ioctl+0xf9/0x170 fs/ioctl.c:893
         do_syscall_x64 arch/x86/entry/common.c:52 [inline]
         do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
         entry_SYSCALL_64_after_hwframe+0x77/0x7f
      
      [CAUSE]
      The allocation failure happens at the start_transaction() inside
      prepare_to_relocate(), and during the error handling we call
      unset_reloc_control(), which makes fs_info->balance_ctl to be NULL.
      
      Then we continue the error path cleanup in btrfs_balance() by calling
      reset_balance_state() which will call del_balance_item() to fully delete
      the balance item in the root tree.
      
      However during the small window between set_reloc_contrl() and
      unset_reloc_control(), we can have a subvolume tree update and created a
      reloc_root for that subvolume.
      
      Then we go into the final btrfs_commit_transaction() of
      del_balance_item(), and into btrfs_update_reloc_root() inside
      commit_fs_roots().
      
      That function checks if fs_info->reloc_ctl is in the merge_reloc_tree
      stage, but since fs_info->reloc_ctl is NULL, it results a NULL pointer
      dereference.
      
      [FIX]
      Just add extra check on fs_info->reloc_ctl inside
      btrfs_update_reloc_root(), before checking
      fs_info->reloc_ctl->merge_reloc_tree.
      
      That DEAD_RELOC_TREE handling is to prevent further modification to the
      reloc tree during merge stage, but since there is no reloc_ctl at all,
      we do not need to bother that.
      
      Reported-by: syzbot+283673dbc38527ef9f3d@syzkaller.appspotmail.com
      Link: https://lore.kernel.org/linux-btrfs/66f6bfa7.050a0220.38ace9.0019.GAE@google.com/
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c3b47f49
    • Filipe Manana's avatar
      btrfs: send: fix invalid clone operation for file that got its size decreased · fa630df6
      Filipe Manana authored
      During an incremental send we may end up sending an invalid clone
      operation, for the last extent of a file which ends at an unaligned offset
      that matches the final i_size of the file in the send snapshot, in case
      the file had its initial size (the size in the parent snapshot) decreased
      in the send snapshot. In this case the destination will fail to apply the
      clone operation because its end offset is not sector size aligned and it
      ends before the current size of the file.
      
      Sending the truncate operation always happens when we finish processing an
      inode, after we process all its extents (and xattrs, names, etc). So fix
      this by ensuring the file has a valid size before we send a clone
      operation for an unaligned extent that ends at the final i_size of the
      file. The size we truncate to matches the start offset of the clone range
      but it could be any value between that start offset and the final size of
      the file since the clone operation will expand the i_size if the current
      size is smaller than the end offset. The start offset of the range was
      chosen because it's always sector size aligned and avoids a truncation
      into the middle of a page, which results in dirtying the page due to
      filling part of it with zeroes and then making the clone operation at the
      receiver trigger IO.
      
      The following test reproduces the issue:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdi
        MNT=/mnt/sdi
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        # Create a file with a size of 256K + 5 bytes, having two extents, one
        # with a size of 128K and another one with a size of 128K + 5 bytes.
        last_ext_size=$((128 * 1024 + 5))
        xfs_io -f -d -c "pwrite -S 0xab -b 128K 0 128K" \
               -c "pwrite -S 0xcd -b $last_ext_size 128K $last_ext_size" \
               $MNT/foo
      
        # Another file which we will later clone foo into, but initially with
        # a larger size than foo.
        xfs_io -f -c "pwrite -S 0xef 0 1M" $MNT/bar
      
        btrfs subvolume snapshot -r $MNT/ $MNT/snap1
      
        # Now resize bar and clone foo into it.
        xfs_io -c "truncate 0" \
               -c "reflink $MNT/foo" $MNT/bar
      
        btrfs subvolume snapshot -r $MNT/ $MNT/snap2
      
        rm -f /tmp/send-full /tmp/send-inc
        btrfs send -f /tmp/send-full $MNT/snap1
        btrfs send -p $MNT/snap1 -f /tmp/send-inc $MNT/snap2
      
        umount $MNT
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        btrfs receive -f /tmp/send-full $MNT
        btrfs receive -f /tmp/send-inc $MNT
      
        umount $MNT
      
      Running it before this patch:
      
        $ ./test.sh
        (...)
        At subvol snap1
        At snapshot snap2
        ERROR: failed to clone extents to bar: Invalid argument
      
      A test case for fstests will be sent soon.
      Reported-by: default avatarBen Millwood <thebenmachine@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CAJhrHS2z+WViO2h=ojYvBPDLsATwLbg+7JaNCyYomv0fUxEpQQ@mail.gmail.com/
      Fixes: 46a6e10a ("btrfs: send: allow cloning non-aligned extent if it ends at i_size")
      CC: stable@vger.kernel.org # 6.11
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fa630df6
    • Filipe Manana's avatar
      btrfs: tracepoints: end assignment with semicolon at btrfs_qgroup_extent event class · 50c6f6e6
      Filipe Manana authored
      While running checkpatch.pl against a patch that modifies the
      btrfs_qgroup_extent event class, it complained about using a comma instead
      of a semicolon:
      
        $ ./scripts/checkpatch.pl qgroups/0003-btrfs-qgroups-remove-bytenr-field-from-struct-btrfs_.patch
        WARNING: Possible comma where semicolon could be used
        #215: FILE: include/trace/events/btrfs.h:1720:
        +		__entry->bytenr		= bytenr,
      		__entry->num_bytes	= rec->num_bytes;
      
        total: 0 errors, 1 warnings, 184 lines checked
      
      So replace the comma with a semicolon to silence checkpatch and possibly
      other tools. It also makes the code consistent with the rest.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      50c6f6e6
    • Josef Bacik's avatar
      btrfs: drop the backref cache during relocation if we commit · db7e68b5
      Josef Bacik authored
      Since the inception of relocation we have maintained the backref cache
      across transaction commits, updating the backref cache with the new
      bytenr whenever we COWed blocks that were in the cache, and then
      updating their bytenr once we detected a transaction id change.
      
      This works as long as we're only ever modifying blocks, not changing the
      structure of the tree.
      
      However relocation does in fact change the structure of the tree.  For
      example, if we are relocating a data extent, we will look up all the
      leaves that point to this data extent.  We will then call
      do_relocation() on each of these leaves, which will COW down to the leaf
      and then update the file extent location.
      
      But, a key feature of do_relocation() is the pending list.  This is all
      the pending nodes that we modified when we updated the file extent item.
      We will then process all of these blocks via finish_pending_nodes, which
      calls do_relocation() on all of the nodes that led up to that leaf.
      
      The purpose of this is to make sure we don't break sharing unless we
      absolutely have to.  Consider the case that we have 3 snapshots that all
      point to this leaf through the same nodes, the initial COW would have
      created a whole new path.  If we did this for all 3 snapshots we would
      end up with 3x the number of nodes we had originally.  To avoid this we
      will cycle through each of the snapshots that point to each of these
      nodes and update their pointers to point at the new nodes.
      
      Once we update the pointer to the new node we will drop the node we
      removed the link for and all of its children via btrfs_drop_subtree().
      This is essentially just btrfs_drop_snapshot(), but for an arbitrary
      point in the snapshot.
      
      The problem with this is that we will never reflect this in the backref
      cache.  If we do this btrfs_drop_snapshot() for a node that is in the
      backref tree, we will leave the node in the backref tree.  This becomes
      a problem when we change the transid, as now the backref cache has
      entire subtrees that no longer exist, but exist as if they still are
      pointed to by the same roots.
      
      In the best case scenario you end up with "adding refs to an existing
      tree ref" errors from insert_inline_extent_backref(), where we attempt
      to link in nodes on roots that are no longer valid.
      
      Worst case you will double free some random block and re-use it when
      there's still references to the block.
      
      This is extremely subtle, and the consequences are quite bad.  There
      isn't a way to make sure our backref cache is consistent between
      transid's.
      
      In order to fix this we need to simply evict the entire backref cache
      anytime we cross transid's.  This reduces performance in that we have to
      rebuild this backref cache every time we change transid's, but fixes the
      bug.
      
      This has existed since relocation was added, and is a pretty critical
      bug.  There's a lot more cleanup that can be done now that this
      functionality is going away, but this patch is as small as possible in
      order to fix the problem and make it easy for us to backport it to all
      the kernels it needs to be backported to.
      
      Followup series will dismantle more of this code and simplify relocation
      drastically to remove this functionality.
      
      We have a reproducer that reproduced the corruption within a few minutes
      of running.  With this patch it survives several iterations/hours of
      running the reproducer.
      
      Fixes: 3fd0a558 ("Btrfs: Metadata ENOSPC handling for balance")
      CC: stable@vger.kernel.org
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      db7e68b5
    • Johannes Thumshirn's avatar
      btrfs: also add stripe entries for NOCOW writes · 97f97822
      Johannes Thumshirn authored
      NOCOW writes do not generate stripe_extent entries in the RAID stripe
      tree, as the RAID stripe-tree feature initially was designed with a
      zoned filesystem in mind and on a zoned filesystem, we do not allow NOCOW
      writes. But the RAID stripe-tree feature is independent from the zoned
      feature, so we must also do NOCOW writes for RAID stripe-tree filesystems.
      Reviewed-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      97f97822
    • Filipe Manana's avatar
      btrfs: send: fix buffer overflow detection when copying path to cache entry · 96c6ca71
      Filipe Manana authored
      Starting with commit c0247d28 ("btrfs: send: annotate struct
      name_cache_entry with __counted_by()") we annotated the variable length
      array "name" from the name_cache_entry structure with __counted_by() to
      improve overflow detection. However that alone was not correct, because
      the length of that array does not match the "name_len" field - it matches
      that plus 1 to include the NUL string terminator, so that makes a
      fortified kernel think there's an overflow and report a splat like this:
      
        strcpy: detected buffer overflow: 20 byte write of buffer size 19
        WARNING: CPU: 3 PID: 3310 at __fortify_report+0x45/0x50
        CPU: 3 UID: 0 PID: 3310 Comm: btrfs Not tainted 6.11.0-prnet #1
        Hardware name: CompuLab Ltd.  sbc-ihsw/Intense-PC2 (IPC2), BIOS IPC2_3.330.7 X64 03/15/2018
        RIP: 0010:__fortify_report+0x45/0x50
        Code: 48 8b 34 (...)
        RSP: 0018:ffff97ebc0d6f650 EFLAGS: 00010246
        RAX: 7749924ef60fa600 RBX: ffff8bf5446a521a RCX: 0000000000000027
        RDX: 00000000ffffdfff RSI: ffff97ebc0d6f548 RDI: ffff8bf84e7a1cc8
        RBP: ffff8bf548574080 R08: ffffffffa8c40e10 R09: 0000000000005ffd
        R10: 0000000000000004 R11: ffffffffa8c70e10 R12: ffff8bf551eef400
        R13: 0000000000000000 R14: 0000000000000013 R15: 00000000000003a8
        FS:  00007fae144de8c0(0000) GS:ffff8bf84e780000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fae14691690 CR3: 00000001027a2003 CR4: 00000000001706f0
        Call Trace:
         <TASK>
         ? __warn+0x12a/0x1d0
         ? __fortify_report+0x45/0x50
         ? report_bug+0x154/0x1c0
         ? handle_bug+0x42/0x70
         ? exc_invalid_op+0x1a/0x50
         ? asm_exc_invalid_op+0x1a/0x20
         ? __fortify_report+0x45/0x50
         __fortify_panic+0x9/0x10
        __get_cur_name_and_parent+0x3bc/0x3c0
         get_cur_path+0x207/0x3b0
         send_extent_data+0x709/0x10d0
         ? find_parent_nodes+0x22df/0x25d0
         ? mas_nomem+0x13/0x90
         ? mtree_insert_range+0xa5/0x110
         ? btrfs_lru_cache_store+0x5f/0x1e0
         ? iterate_extent_inodes+0x52d/0x5a0
         process_extent+0xa96/0x11a0
         ? __pfx_lookup_backref_cache+0x10/0x10
         ? __pfx_store_backref_cache+0x10/0x10
         ? __pfx_iterate_backrefs+0x10/0x10
         ? __pfx_check_extent_item+0x10/0x10
         changed_cb+0x6fa/0x930
         ? tree_advance+0x362/0x390
         ? memcmp_extent_buffer+0xd7/0x160
         send_subvol+0xf0a/0x1520
         btrfs_ioctl_send+0x106b/0x11d0
         ? __pfx___clone_root_cmp_sort+0x10/0x10
         _btrfs_ioctl_send+0x1ac/0x240
         btrfs_ioctl+0x75b/0x850
         __se_sys_ioctl+0xca/0x150
         do_syscall_64+0x85/0x160
         ? __count_memcg_events+0x69/0x100
         ? handle_mm_fault+0x1327/0x15c0
         ? __se_sys_rt_sigprocmask+0xf1/0x180
         ? syscall_exit_to_user_mode+0x75/0xa0
         ? do_syscall_64+0x91/0x160
         ? do_user_addr_fault+0x21d/0x630
        entry_SYSCALL_64_after_hwframe+0x76/0x7e
        RIP: 0033:0x7fae145eeb4f
        Code: 00 48 89 (...)
        RSP: 002b:00007ffdf1cb09b0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fae145eeb4f
        RDX: 00007ffdf1cb0ad0 RSI: 0000000040489426 RDI: 0000000000000004
        RBP: 00000000000078fe R08: 00007fae144006c0 R09: 00007ffdf1cb0927
        R10: 0000000000000008 R11: 0000000000000246 R12: 00007ffdf1cb1ce8
        R13: 0000000000000003 R14: 000055c499fab2e0 R15: 0000000000000004
         </TASK>
      
      Fix this by not storing the NUL string terminator since we don't actually
      need it for name cache entries, this way "name_len" corresponds to the
      actual size of the "name" array. This requires marking the "name" array
      field with __nonstring and using memcpy() instead of strcpy() as
      recommended by the guidelines at:
      
         https://github.com/KSPP/linux/issues/90Reported-by: default avatarDavid Arendt <admin@prnet.org>
      Link: https://lore.kernel.org/linux-btrfs/cee4591a-3088-49ba-99b8-d86b4242b8bd@prnet.org/
      Fixes: c0247d28 ("btrfs: send: annotate struct name_cache_entry with __counted_by()")
      CC: stable@vger.kernel.org # 6.11
      Tested-by: default avatarDavid Arendt <admin@prnet.org>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      96c6ca71
  2. 17 Sep, 2024 3 commits
    • Filipe Manana's avatar
      btrfs: fix use-after-free on rbtree that tracks inodes for auto defrag · 7f1b63f9
      Filipe Manana authored
      When cleaning up defrag inodes at btrfs_cleanup_defrag_inodes(), called
      during remount and unmount, we are freeing every node from the rbtree
      that tracks inodes for auto defrag using
      rbtree_postorder_for_each_entry_safe(), which doesn't modify the tree
      itself. So once we unlock the lock that protects the rbtree, we have a
      tree pointing to a root that was freed (and a root pointing to freed
      nodes, and their children pointing to other freed nodes, and so on).
      This makes further access to the tree result in a use-after-free with
      unpredictable results.
      
      Fix this by initializing the rbtree to an empty root after the call to
      rbtree_postorder_for_each_entry_safe() and before unlocking.
      
      Fixes: 27694091 ("btrfs: clear defragmented inodes using postorder in btrfs_cleanup_defrag_inodes()")
      Reported-by: syzbot+ad7966ca1f5dd8b001b3@syzkaller.appspotmail.com
      Link: https://lore.kernel.org/linux-btrfs/000000000000f9aad406223eabff@google.com/Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7f1b63f9
    • Qu Wenruo's avatar
      btrfs: tree-checker: fix the wrong output of data backref objectid · b0b595e6
      Qu Wenruo authored
      [BUG]
      There are some reports about invalid data backref objectids, the report
      looks like this:
      
        BTRFS critical (device sda): corrupt leaf: block=333654787489792 slot=110 extent bytenr=333413935558656 len=65536 invalid data ref objectid value 2543
      
      The data ref objectid is the inode number inside the subvolume.
      
      But in above case, the value is completely sane, not really showing the
      problem.
      
      [CAUSE]
      The root cause of the problem is the deprecated feature, inode cache.
      
      This feature results a special inode number, -12ULL, and it's no longer
      recognized by tree-checker, triggering the error.
      
      The direct problem here is the output of data ref objectid. The value
      shown is in fact the dref_root (subvolume id), not the dref_objectid
      (inode number).
      
      [FIX]
      Fix the output to use dref_objectid instead.
      Reported-by: default avatarNeil Parton <njparton@gmail.com>
      Reported-by: default avatarArchange <archange@archlinux.org>
      Link: https://lore.kernel.org/linux-btrfs/CAAYHqBbrrgmh6UmW3ANbysJX9qG9Pbg3ZwnKsV=5mOpv_qix_Q@mail.gmail.com/
      Link: https://lore.kernel.org/linux-btrfs/9541deea-9056-406e-be16-a996b549614d@archlinux.org/
      Fixes: f333a3c7 ("btrfs: tree-checker: validate dref root and objectid")
      CC: stable@vger.kernel.org # 6.11
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b0b595e6
    • Filipe Manana's avatar
      btrfs: fix race setting file private on concurrent lseek using same fd · 7ee85f55
      Filipe Manana authored
      When doing concurrent lseek(2) system calls against the same file
      descriptor, using multiple threads belonging to the same process, we have
      a short time window where a race happens and can result in a memory leak.
      
      The race happens like this:
      
      1) A program opens a file descriptor for a file and then spawns two
         threads (with the pthreads library for example), lets call them
         task A and task B;
      
      2) Task A calls lseek with SEEK_DATA or SEEK_HOLE and ends up at
         file.c:find_desired_extent() while holding a read lock on the inode;
      
      3) At the start of find_desired_extent(), it extracts the file's
         private_data pointer into a local variable named 'private', which has
         a value of NULL;
      
      4) Task B also calls lseek with SEEK_DATA or SEEK_HOLE, locks the inode
         in shared mode and enters file.c:find_desired_extent(), where it also
         extracts file->private_data into its local variable 'private', which
         has a NULL value;
      
      5) Because it saw a NULL file private, task A allocates a private
         structure and assigns to the file structure;
      
      6) Task B also saw a NULL file private so it also allocates its own file
         private and then assigns it to the same file structure, since both
         tasks are using the same file descriptor.
      
         At this point we leak the private structure allocated by task A.
      
      Besides the memory leak, there's also the detail that both tasks end up
      using the same cached state record in the private structure (struct
      btrfs_file_private::llseek_cached_state), which can result in a
      use-after-free problem since one task can free it while the other is
      still using it (only one task took a reference count on it). Also, sharing
      the cached state is not a good idea since it could result in incorrect
      results in the future - right now it should not be a problem because it
      end ups being used only in extent-io-tree.c:count_range_bits() where we do
      range validation before using the cached state.
      
      Fix this by protecting the private assignment and check of a file while
      holding the inode's spinlock and keep track of the task that allocated
      the private, so that it's used only by that task in order to prevent
      user-after-free issues with the cached state record as well as potentially
      using it incorrectly in the future.
      
      Fixes: 3c32c721 ("btrfs: use cached state when looking for delalloc ranges with lseek")
      CC: stable@vger.kernel.org # 6.6+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7ee85f55
  3. 10 Sep, 2024 30 commits
    • Qu Wenruo's avatar
      btrfs: only unlock the to-be-submitted ranges inside a folio · bd610c09
      Qu Wenruo authored
      [SUBPAGE COMPRESSION LIMITS]
      Currently inside writepage_delalloc(), if a delalloc range is going to
      be submitted asynchronously (inline or compression, the page
      dirty/writeback/unlock are all handled in at different time, not at the
      submission time), then we return 1 and extent_writepage() will skip the
      submission.
      
      This is fine if every sector matches page size, but if a sector is
      smaller than page size (aka, subpage case), then it can be very
      problematic, for example for the following 64K page:
      
           0     16K     32K    48K     64K
           |/|   |///////|      |/|
             |                    |
             4K                   52K
      
      Where |/| is the dirty range we need to submit.
      
      In the above case, we need the following different handling for the 3
      ranges:
      
      - [0, 4K) needs to be submitted for regular write
        A single sector cannot be compressed.
      
      - [16K, 32K) needs to be submitted for compressed write
      
      - [48K, 52K) needs to be submitted for regular write.
      
      Above, if we try to submit [16K, 32K) for compressed write, we will
      return 1 and immediately, and without submitting the remaining
      [48K, 52K) range.
      
      Furthermore, since extent_writepage() will exit without unlocking any
      sectors, the submitted range [0, 4K) will not have sector unlocked.
      
      That's the reason why for now subpage is only allowed for full page
      range.
      
      [ENHANCEMENT]
      - Introduce a submission bitmap at btrfs_bio_ctrl::submit_bitmap
        This records which sectors will be submitted by extent_writepage_io().
        This allows us to track which sectors needs to be submitted thus later
        to be properly unlocked.
      
        For asynchronously submitted range (inline/compression), the
        corresponding bits will be cleared from that bitmap.
      
      - Only return 1 if no sector needs to be submitted in
        writepage_delalloc()
      
      - Only submit sectors marked by submission bitmap inside
        extent_writepage_io()
        So we won't touch the asynchronously submitted part.
      
      - Introduce btrfs_folio_end_writer_lock_bitmap() helper
        This will only unlock the involved sectors specified by @bitmap
        parameter, to avoid touching the range asynchronously submitted.
      
      Please note that, since subpage compression is still limited to page
      aligned range, this change is only a preparation for future sector
      perfect compression support for subpage.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bd610c09
    • Qu Wenruo's avatar
      btrfs: merge btrfs_folio_unlock_writer() into btrfs_folio_end_writer_lock() · 49a99073
      Qu Wenruo authored
      The function btrfs_folio_unlock_writer() is already calling
      btrfs_folio_end_writer_lock() to do the heavy lifting work, the only
      missing 0 writer check.
      
      Thus there is no need to keep two different functions, move the 0 writer
      check into btrfs_folio_end_writer_lock(), and remove
      btrfs_folio_unlock_writer().
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      49a99073
    • Leo Martins's avatar
      btrfs: BTRFS_PATH_AUTO_FREE in orphan.c · 68f32b9c
      Leo Martins authored
      All cleanup paths lead to btrfs_path_free so path can be defined with
      the automatic freeing callback in the following functions:
      
      - btrfs_insert_orphan_item()
      - btrfs_del_orphan_item()
      Signed-off-by: default avatarLeo Martins <loemra.dev@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      68f32b9c
    • Leo Martins's avatar
      btrfs: use btrfs_path auto free in zoned.c · 45763a0c
      Leo Martins authored
      All cleanup paths lead to btrfs_path_free so path can be defined with
      the automatic freeing callback in the following functions:
      
      - calculate_emulated_zone_size()
      - calculate_alloc_pointer()
      Signed-off-by: default avatarLeo Martins <loemra.dev@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      45763a0c
    • Leo Martins's avatar
      btrfs: DEFINE_FREE for struct btrfs_path · 4c74a32a
      Leo Martins authored
      Add a DEFINE_FREE for struct btrfs_path. This defines a function that
      can be called using the __free attribute. Define a macro
      BTRFS_PATH_AUTO_FREE to make the declaration of an auto freeing path
      very clear.
      
      The intended use is to define the auto free of path in cases where the
      path is allocated somewhere at the beginning and freed either on all
      error paths or at the end of the function.
      
        int func() {
      	  BTRFS_PATH_AUTO_FREE(path);
      
      	  if (...)
      		  return -ERROR;
      
      	  path = alloc_path();
      
      	  ...
      
      	  if (...)
      		  return -ERROR;
      
      	  ...
      	  return 0;
        }
      Signed-off-by: default avatarLeo Martins <loemra.dev@gmail.com>
      [ update changelog ]
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4c74a32a
    • Qu Wenruo's avatar
      btrfs: remove btrfs_folio_end_all_writers() · ab6eac7c
      Qu Wenruo authored
      The function btrfs_folio_end_all_writers() is only utilized in
      extent_writepage() as a way to unlock all subpage range (for both
      successful submission and error handling).
      
      Meanwhile we have a similar function, btrfs_folio_end_writer_lock().
      
      The difference is, btrfs_folio_end_writer_lock() expects a range that is
      a subset of the already locked range.
      
      This limit on btrfs_folio_end_writer_lock() is a little overkilled,
      preventing it from being utilized for error paths.
      
      So here we enhance btrfs_folio_end_writer_lock() to accept a superset of
      the locked range, and only end the locked subset.
      This means we can replace btrfs_folio_end_all_writers() with
      btrfs_folio_end_writer_lock() instead.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ab6eac7c
    • David Sterba's avatar
      btrfs: constify more pointer parameters · ca283ea9
      David Sterba authored
      Continue adding const to parameters.  This is for clarity and minor
      addition to safety. There are some minor effects, in the assembly code
      and .ko measured on release config.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ca283ea9
    • David Sterba's avatar
      btrfs: rework BTRFS_I as macro to preserve parameter const · 070969f1
      David Sterba authored
      Currently BTRFS_I is a static inline function that takes a const inode
      and returns btrfs inode, dropping the 'const' qualifier. This can break
      assumptions of compiler though it seems there's no real case.
      
      To make the parameter and return type consistent regardint const we can
      use the container_of_const() that preserves it. However this would not
      check the parameter type. To fix that use the same _Generic construct
      but implement only the two expected types.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      070969f1
    • Filipe Manana's avatar
      btrfs: add and use helper to verify the calling task has locked the inode · 1b6e068a
      Filipe Manana authored
      We have a few places that check if we have the inode locked by doing:
      
          ASSERT(inode_is_locked(vfs_inode));
      
      This actually proved to be useful several times as if assertions are
      enabled (and by default they are in many distros) it immediately triggers
      a crash which is impossible for users to miss.
      
      However that doesn't check if the lock is held by the calling task, so
      the check passes if some other task locked the inode.
      
      Using one of the lockdep functions to check the lock is held, like
      lockdep_assert_held() for example, does check that the calling task
      holds the lock, and if that's not the case it produces a warning and
      stack trace in dmesg. However, despite the misleading "assert" in the
      name of the lockdep helpers, it does not trigger a crash/BUG_ON(), just
      a warning and splat in dmesg, which is easy to get unnoticed by users
      who may have lockdep enabled.
      
      So add a helper that does the ASSERT() and calls lockdep_assert_held()
      immediately after and use it every where we check the inode is locked.
      Like this if the lock is held by some other task we get the warning
      in dmesg which is caught by fstests, very helpful during development,
      and may also be occassionaly noticed by users with lockdep enabled.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1b6e068a
    • Luca Stefani's avatar
      btrfs: always update fstrim_range on failure in FITRIM ioctl · 33685972
      Luca Stefani authored
      Even in case of failure we could've discarded some data and userspace
      should be made aware of it, so copy fstrim_range to userspace
      regardless.
      
      Also make sure to update the trimmed bytes amount even if
      btrfs_trim_free_extents fails.
      
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarLuca Stefani <luca.stefani.ge1@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      33685972
    • Li Zetao's avatar
      btrfs: convert copy_inline_to_page() to use folio · faad57ae
      Li Zetao authored
      The old page API is being gradually replaced and converted to use folio
      to improve code readability and avoid repeated conversion between page
      and folio. Moreover find_or_create_page() is compatible API, and it can
      replaced with __filemap_get_folio(). Some interfaces have been converted
      to use folio before, so the conversion operation from page can be
      eliminated here.
      Signed-off-by: default avatarLi Zetao <lizetao1@huawei.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      faad57ae
    • Li Zetao's avatar
      btrfs: convert btrfs_decompress() to take a folio · aeb6d881
      Li Zetao authored
      The old page API is being gradually replaced and converted to use folio
      to improve code readability and avoid repeated conversion between page
      and folio. Based on the previous patch, the compression path can be
      directly used in folio without converting to page.
      Signed-off-by: default avatarLi Zetao <lizetao1@huawei.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      aeb6d881
    • Li Zetao's avatar
      btrfs: convert zstd_decompress() to take a folio · b70f3a45
      Li Zetao authored
      The old page API is being gradually replaced and converted to use folio
      to improve code readability and avoid repeated conversion between page
      and folio. And memcpy_to_page() can be replaced with memcpy_to_folio().
      But there is no memzero_folio(), but it can be replaced equivalently by
      folio_zero_range().
      Signed-off-by: default avatarLi Zetao <lizetao1@huawei.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b70f3a45
    • Li Zetao's avatar
      btrfs: convert lzo_decompress() to take a folio · 9f9a4e43
      Li Zetao authored
      The old page API is being gradually replaced and converted to use folio
      to improve code readability and avoid repeated conversion between page
      and folio. And memcpy_to_page() can be replaced with memcpy_to_folio().
      But there is no memzero_folio(), but it can be replaced equivalently by
      folio_zero_range().
      Signed-off-by: default avatarLi Zetao <lizetao1@huawei.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9f9a4e43
    • Li Zetao's avatar
      btrfs: convert zlib_decompress() to take a folio · 54c78d49
      Li Zetao authored
      The old page API is being gradually replaced and converted to use folio
      to improve code readability and avoid repeated conversion between page
      and folio. And memcpy_to_page() can be replaced with memcpy_to_folio().
      But there is no memzero_folio(), but it can be replaced equivalently by
      folio_zero_range().
      Signed-off-by: default avatarLi Zetao <lizetao1@huawei.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      54c78d49
    • Li Zetao's avatar
      btrfs: convert try_release_extent_mapping() to take a folio · 046c0d65
      Li Zetao authored
      The old page API is being gradually replaced and converted to use folio
      to improve code readability and avoid repeated conversion between page
      and folio. And page_to_inode() can be replaced with folio_to_inode() now.
      Signed-off-by: default avatarLi Zetao <lizetao1@huawei.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      046c0d65
    • Li Zetao's avatar
      btrfs: convert try_release_extent_state() to take a folio · dd0a8df4
      Li Zetao authored
      The old page API is being gradually replaced and converted to use folio
      to improve code readability and avoid repeated conversion between page
      and folio. Moreover, use folio_pos() instead of page_offset(),
      which is more consistent with folio usage.
      Signed-off-by: default avatarLi Zetao <lizetao1@huawei.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dd0a8df4
    • Li Zetao's avatar
      btrfs: convert submit_eb_page() to take a folio · 08dd8507
      Li Zetao authored
      The old page API is being gradually replaced and converted to use folio
      to improve code readability and avoid repeated conversion between page
      and folio.
      Signed-off-by: default avatarLi Zetao <lizetao1@huawei.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      08dd8507
    • Li Zetao's avatar
      btrfs: convert submit_eb_subpage() to take a folio · 13587325
      Li Zetao authored
      The old page API is being gradually replaced and converted to use folio
      to improve code readability and avoid repeated conversion between page
      and folio. Moreover, use folio_pos() instead of page_offset(),
      which is more consistent with folio usage.
      Signed-off-by: default avatarLi Zetao <lizetao1@huawei.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      13587325
    • Li Zetao's avatar
      btrfs: convert read_key_bytes() to take a folio · 88493779
      Li Zetao authored
      The old page API is being gradually replaced and converted to use folio
      to improve code readability and avoid repeated conversion between page
      and folio. Moreover, use kmap_local_folio() instead of kmap_local_page(),
      which is more consistent with folio usage.
      Signed-off-by: default avatarLi Zetao <lizetao1@huawei.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      88493779
    • Li Zetao's avatar
      btrfs: convert try_release_extent_buffer() to take a folio · b8ae2bfa
      Li Zetao authored
      The old page API is being gradually replaced and converted to use folio
      to improve code readability and avoid repeated conversion between page
      and folio.
      Signed-off-by: default avatarLi Zetao <lizetao1@huawei.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b8ae2bfa
    • Li Zetao's avatar
      btrfs: convert try_release_subpage_extent_buffer() to take a folio · 0145aa38
      Li Zetao authored
      The old page API is being gradually replaced and converted to use folio
      to improve code readability and avoid repeated conversion between page
      and folio. And use folio_pos instead of page_offset, which is more
      consistent with folio usage. At the same time, folio_test_private() can
      handle folio directly without converting from page to folio first.
      Signed-off-by: default avatarLi Zetao <lizetao1@huawei.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0145aa38
    • Li Zetao's avatar
      btrfs: convert get_next_extent_buffer() to take a folio · d4aeb5f7
      Li Zetao authored
      The old page API is being gradually replaced and converted to use folio
      to improve code readability and avoid repeated conversion between page
      and folio. Use folio_pos instead of page_offset, which is more
      consistent with folio usage.
      Signed-off-by: default avatarLi Zetao <lizetao1@huawei.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d4aeb5f7
    • Li Zetao's avatar
      btrfs: convert clear_page_extent_mapped() to take a folio · 266a9361
      Li Zetao authored
      The old page API is being gradually replaced and converted to use folio
      to improve code readability and avoid repeated conversion between page
      and folio. Now clear_page_extent_mapped() can deal with a folio
      directly, so change its name to clear_folio_extent_mapped().
      Signed-off-by: default avatarLi Zetao <lizetao1@huawei.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      266a9361
    • Qu Wenruo's avatar
      btrfs: make compression path to be subpage compatible · fd1e75d0
      Qu Wenruo authored
      Currently btrfs compression path is not really subpage compatible, every
      thing is still done in page unit.
      
      That's fine for regular sector size and subpage routine. As even for
      subpage routine compression is only enabled if the whole range is page
      aligned, so reading the page cache in page unit is totally fine.
      
      However in preparation for the future subpage perfect compression
      support, we need to change the compression routine to properly handle a
      subpage range.
      
      This patch would prepare both zlib and zstd to only read the subpage
      range for compression.
      Lzo is already doing subpage aware read, as lzo's on-disk format is
      already sectorsize dependent.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fd1e75d0
    • Qu Wenruo's avatar
      btrfs: merge btrfs_orig_bbio_end_io() into btrfs_bio_end_io() · 9ca0e58c
      Qu Wenruo authored
      There are only two differences between the two functions:
      
      - btrfs_orig_bbio_end_io() does extra error propagation
        This is mostly to allow tolerance for write errors.
      
      - btrfs_orig_bbio_end_io() does extra pending_ios check
        This check can handle both the original bio, or the cloned one.
        (All accounting happens in the original one).
      
      This makes btrfs_orig_bbio_end_io() a much safer call.
      In fact we already had a double freeing error due to usage of
      btrfs_bio_end_io() in the error path of btrfs_submit_chunk().
      
      So just move the whole content of btrfs_orig_bbio_end_io() into
      btrfs_bio_end_io().
      
      For normal paths this brings no change, because they are already calling
      btrfs_orig_bbio_end_io() in the first place.
      
      For error paths (not only inside bio.c but also external callers), this
      change will introduce extra checks, especially for external callers, as
      they will error out without submitting the btrfs bio.
      
      But considering it's already in the error path, such slower but much
      safer checks are still an overall win.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9ca0e58c
    • Josef Bacik's avatar
      btrfs: do not hold the extent lock for entire read · ac325fc2
      Josef Bacik authored
      Historically we've held the extent lock throughout the entire read.
      There's been a few reasons for this, but it's mostly just caused us
      problems.  For example, this prevents us from allowing page faults
      during direct io reads, because we could deadlock.  This has forced us
      to only allow 4k reads at a time for io_uring NOWAIT requests because we
      have no idea if we'll be forced to page fault and thus have to do a
      whole lot of work.
      
      On the buffered side we are protected by the page lock, as long as we're
      reading things like buffered writes, punch hole, and even direct IO to a
      certain degree will get hung up on the page lock while the page is in
      flight.
      
      On the direct side we have the dio extent lock, which acts much like the
      way the extent lock worked previously to this patch, however just for
      direct reads.  This protects direct reads from concurrent direct writes,
      while we're protected from buffered writes via the inode lock.
      
      Now that we're protected in all cases, narrow the extent lock to the
      part where we're getting the extent map to submit the reads, no longer
      holding the extent lock for the entire read operation.  Push the extent
      lock down into do_readpage() so that we're only grabbing it when looking
      up the extent map.  This portion was contributed by Goldwyn.
      Co-developed-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ac325fc2
    • Josef Bacik's avatar
      btrfs: take the dio extent lock during O_DIRECT operations · 07d399cb
      Josef Bacik authored
      Currently we hold the extent lock for the entire duration of a read.
      This isn't really necessary in the buffered case, we're protected by the
      page lock, however it's necessary for O_DIRECT.
      
      For O_DIRECT reads, if we only locked the extent for the part where we
      get the extent, we could potentially race with an O_DIRECT write in the
      same region.  This isn't really a problem, unless the read is delayed so
      much that the write does the COW, unpins the old extent, and some other
      application re-allocates the extent before the read is actually able to
      be submitted.  At that point at best we'd have a checksum mismatch, but
      at worse we could read data that doesn't belong to us.
      
      To address this potential race we need to make sure we don't have
      overlapping, concurrent direct io reads and writes.
      
      To accomplish this use the new EXTENT_DIO_LOCKED bit in the direct IO
      case in the same spot as the current extent lock.  The writes will take
      this while they're creating the ordered extent, which is also used to
      make sure concurrent buffered reads or concurrent direct reads are not
      allowed to occur, and drop it after the ordered extent is taken.  For
      reads it will act as the current read behavior for the EXTENT_LOCKED
      bit, we set it when we're starting the read, we clear it in the end_io
      to allow other direct writes to continue.
      
      This still has the drawback of disallowing concurrent overlapping direct
      reads from occurring, but that exists with the current extent locking.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      07d399cb
    • Josef Bacik's avatar
      btrfs: introduce EXTENT_DIO_LOCKED · 7e2a5950
      Josef Bacik authored
      In order to support dropping the extent lock during a read we need a way
      to make sure that direct reads and direct writes for overlapping ranges
      are protected from each other.  To accomplish this introduce another
      lock bit specifically for direct io.  Subsequent patches will utilize
      this to protect direct IO operations.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7e2a5950
    • David Sterba's avatar
      btrfs: always pass readahead state to defrag · df2825e9
      David Sterba authored
      Defrag ioctl passes readahead from the file, but autodefrag does not
      have a file so the readahead state is allocated when needed.
      
      The autodefrag loop in cleaner thread iterates over inodes so we can
      simply provide an on-stack readahead state and will not need to allocate
      it in btrfs_defrag_file(). The size is 32 bytes which is acceptable.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      df2825e9