1. 24 Nov, 2023 3 commits
    • btrfs: send: ensure send_fd is writable · 0ac1d13a
      Jann Horn authored
      kernel_write() requires the caller to ensure that the file is writable.
      Let's do that directly after looking up the ->send_fd.
      
      We don't need a separate bailout path because the "out" path already
      does fput() if ->send_filp is non-NULL.
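
      As a rough userspace analogue of that check (a sketch only: the kernel
      tests the file's FMODE_WRITE mode bit, whereas this example inspects the
      open(2) access mode through fcntl(); the helper names are hypothetical):

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Userspace stand-in for "is this file writable?". Returns 0 if fd was
 * opened for writing, -EBADF otherwise (including for an invalid fd).
 */
static int check_fd_writable(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags < 0)
		return -EBADF;	/* not an open descriptor */

	int acc = flags & O_ACCMODE;

	if (acc != O_WRONLY && acc != O_RDWR)
		return -EBADF;	/* opened read-only */
	return 0;
}

/* Small self-check: /dev/null opened read-only must be rejected. */
static void check_fd_writable_selftest(void)
{
	int fd = open("/dev/null", O_RDONLY);

	assert(fd >= 0);
	assert(check_fd_writable(fd) == -EBADF);
	close(fd);
}
```

      Doing the check right after the descriptor lookup means later write
      paths can rely on the file being writable.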
      
      This has no security impact for two reasons:
      
       - the ioctl requires CAP_SYS_ADMIN
       - __kernel_write() bails out on read-only files - but only since 5.8,
         see commit a01ac27b ("fs: check FMODE_WRITE in __kernel_write")
      
      Reported-and-tested-by: syzbot+12e098239d20385264d3@syzkaller.appspotmail.com
      Closes: https://syzkaller.appspot.com/bug?extid=12e098239d20385264d3
      Fixes: 31db9f7c ("Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: Jann Horn <jannh@google.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: free the allocated memory if btrfs_alloc_page_array() fails · 94dbf7c0
      Qu Wenruo authored
      [BUG]
      If btrfs_alloc_page_array() fails to allocate all pages and fills only
      part of the slots, the partially allocated pages are leaked in
      btrfs_submit_compressed_read().
      
      [CAUSE]
      As explicitly stated, if btrfs_alloc_page_array() returns -ENOMEM, the
      caller is responsible for freeing the partially allocated pages.
      
      For the existing call sites, most of them are fine:
      
      - btrfs_raid_bio::stripe_pages
        Handled by free_raid_bio().
      
      - extent_buffer::pages[]
        Handled by btrfs_release_extent_buffer_pages().
      
      - scrub_stripe::pages[]
        Handled by release_scrub_stripe().
      
      But there is one exception in btrfs_submit_compressed_read(): if
      btrfs_alloc_page_array() failed, we didn't clean up the array and freed
      the array pointer directly.
      
      Initially the error handling was still there in commit dd137dd1
      ("btrfs: factor out allocating an array of pages"), but later, in commit
      544fe4a9 ("btrfs: embed a btrfs_bio into struct compressed_bio"),
      the error handling was removed, leading to the possible memory leak.
      
      [FIX]
      This patch adds back the error handling first. Then, to prevent such a
      situation from happening again, it also makes btrfs_alloc_page_array()
      free the allocated pages as an extra safety net, so that we don't need
      to add the error handling to btrfs_submit_compressed_read().
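
      The "safety net" idea can be sketched in userspace C as follows. This is
      only an illustration under stated assumptions: malloc()/free() stand in
      for page allocation, and alloc_one_page(), alloc_budget, PAGE_SZ and
      alloc_page_array() are hypothetical names, not the kernel's code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

#define PAGE_SZ 4096	/* hypothetical page size for this sketch */

/* Test hook: number of allocations still allowed, -1 means unlimited. */
static int alloc_budget = -1;

static void *alloc_one_page(void)
{
	if (alloc_budget == 0)
		return NULL;	/* simulate running out of memory */
	if (alloc_budget > 0)
		alloc_budget--;
	return malloc(PAGE_SZ);
}

/*
 * Fill all 'nr' slots of 'pages' (expected to be NULL on entry).
 * On failure, free whatever this call allocated and reset those slots,
 * so the caller never sees a partially filled array.
 */
static int alloc_page_array(size_t nr, void **pages)
{
	assert(pages != NULL);

	for (size_t i = 0; i < nr; i++) {
		pages[i] = alloc_one_page();
		if (!pages[i]) {
			/* Undo the partial allocation before reporting failure. */
			while (i > 0) {
				i--;
				free(pages[i]);
				pages[i] = NULL;
			}
			return -ENOMEM;
		}
	}
	return 0;
}
```

      With cleanup done inside the allocator, a caller that forgets its own
      error handling can no longer leak the partially allocated pages.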
      
      Fixes: 544fe4a9 ("btrfs: embed a btrfs_bio into struct compressed_bio")
      CC: stable@vger.kernel.org # 6.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix 64bit compat send ioctl arguments not initializing version member · 5de0434b
      David Sterba authored
      When the send protocol versioning was added in 5.16 by commit e77fbf99
      ("btrfs: send: prepare for v2 protocol"), the 32/64bit compat code
      (added by 2351f431 ("btrfs: fix send ioctl on 32bit with 64bit kernel"))
      was not updated and is missing the version struct member. The compat
      code is probably rarely used, as nobody has reported any bugs.
      
      Found by tool https://github.com/jirislaby/clang-struct .
      
      Fixes: e77fbf99 ("btrfs: send: prepare for v2 protocol")
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 23 Nov, 2023 4 commits
    • btrfs: make error messages more clear when getting a chunk map · 7d410d5e
      Filipe Manana authored
      When getting a chunk map, at btrfs_get_chunk_map(), we do some sanity
      checks to verify we found a chunk map and that the map found covers the
      logical address the caller passed in. However the messages aren't very
      clear in the sense that they don't mention the issue is with a chunk map,
      and one of them prints the 'length' argument as if it were the end offset
      of the requested range (while in the string format we use %llu-%llu
      which suggests a range, and the second %llu-%llu is actually a range for
      the chunk map). So improve these two details in the error messages.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix off-by-one when checking chunk map includes logical address · 5fba5a57
      Filipe Manana authored
      At btrfs_get_chunk_map() we get the extent map for the chunk that contains
      the given logical address stored in the 'logical' argument. Then we do
      sanity checks to verify the extent map contains the logical address. One
      of these checks verifies if the extent map covers a range with an end
      offset behind the target logical address - however this check has an
      off-by-one error since it will consider an extent map whose start offset
      plus its length matches the target logical address as inclusive, while
      the fact is that the last byte it covers is behind the target logical
      address (by 1).
      
      So fix this condition by using '<=' rather than '<' when comparing the
      extent map's "start + length" against the target logical address.
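
      The boundary case is easy to see in a small standalone sketch. The
      struct and function names below are hypothetical stand-ins for the
      extent map and the sanity check; only the comparison mirrors the
      condition described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a chunk's extent map: covers [start, start + len). */
struct chunk_range {
	uint64_t start;
	uint64_t len;
};

/*
 * Returns true if the range does NOT contain 'logical'. The last byte
 * covered is start + len - 1, so start + len == logical is already a miss,
 * which is why the comparison must be '<=' and not '<'.
 */
static bool range_misses_logical(const struct chunk_range *r, uint64_t logical)
{
	assert(r != NULL);
	return r->start > logical || r->start + r->len <= logical;
}
```

      For a range starting at 4096 with length 4096, address 8191 is the last
      one covered and 8192 is the first one outside the range.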
      
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: ref-verify: fix memory leaks in btrfs_ref_tree_mod() · f91192cd
      Bragatheswaran Manickavel authored
      In btrfs_ref_tree_mod(), when !parent, 're' is allocated through
      kmalloc(). In the following code, if an error occurs, execution is
      redirected to 'out' or 'out_unlock' and the function exits. However, on
      some of these paths 're' is not deallocated, which may lead to memory
      leaks.
      
      For example: if lookup_block_entry() for 'be' returns NULL, the 'out'
      label is taken. In that flow 'ref' and 'ra' are freed but not 're',
      which can potentially lead to a memory leak.
      
      CC: stable@vger.kernel.org # 5.10+
      Reported-and-tested-by: syzbot+d66de4cbf532749df35f@syzkaller.appspotmail.com
      Closes: https://syzkaller.appspot.com/bug?extid=d66de4cbf532749df35f
      Signed-off-by: Bragatheswaran Manickavel <bragathemanick0908@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add dmesg output for first mount and last unmount of a filesystem · 2db31320
      Qu Wenruo authored
      There is a feature request to add dmesg output when unmounting a btrfs
      filesystem. There are several alternative methods to do the same thing,
      but each has its own problems:
      
      - Use eBPF to watch btrfs_put_super()/open_ctree()
        Not end user friendly, users have to dig into the source code.
      
      - Watch for directory /sys/fs/<uuid>/
        This is much simpler, but still requires some simple device -> uuid
        lookups.  And a script needs to use inotify to watch /sys/fs/.
      
      Compared to all these, directly outputting the information into dmesg
      would be the simplest, with both device and UUID included.
      
      And since we're here, also add the output when mounting a filesystem for
      the first time for parity. A more fine grained monitoring of subvolume
      mounts should be done by another layer, like audit.
      
      Now mounting a btrfs with all default mkfs options would look like this:
      
        [81.906566] BTRFS info (device dm-8): first mount of filesystem 633b5c16-afe3-4b79-b195-138fe145e4f2
        [81.907494] BTRFS info (device dm-8): using crc32c (crc32c-intel) checksum algorithm
        [81.908258] BTRFS info (device dm-8): using free space tree
        [81.912644] BTRFS info (device dm-8): auto enabling async discard
        [81.913277] BTRFS info (device dm-8): checking UUID tree
        [91.668256] BTRFS info (device dm-8): last unmount of filesystem 633b5c16-afe3-4b79-b195-138fe145e4f2
      
      CC: stable@vger.kernel.org # 5.4+
      Link: https://github.com/kdave/btrfs-progs/issues/689
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. 15 Nov, 2023 2 commits
    • btrfs: do not abort transaction if there is already an existing qgroup · 8049ba5d
      Qu Wenruo authored
      [BUG]
      Syzbot reported a regression that after commit 6ed05643 ("btrfs:
      create qgroup earlier in snapshot creation") we can trigger transaction
      abort during snapshot creation:
      
        BTRFS: Transaction aborted (error -17)
        WARNING: CPU: 0 PID: 5057 at fs/btrfs/transaction.c:1778 create_pending_snapshot+0x25f4/0x2b70 fs/btrfs/transaction.c:1778
        Modules linked in:
        CPU: 0 PID: 5057 Comm: syz-executor225 Not tainted 6.6.0-syzkaller-15365-g30523014 #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/09/2023
        RIP: 0010:create_pending_snapshot+0x25f4/0x2b70 fs/btrfs/transaction.c:1778
        Call Trace:
         <TASK>
         create_pending_snapshots+0x195/0x1d0 fs/btrfs/transaction.c:1967
         btrfs_commit_transaction+0xf1c/0x3730 fs/btrfs/transaction.c:2440
         create_snapshot+0x4a5/0x7e0 fs/btrfs/ioctl.c:845
         btrfs_mksubvol+0x5d0/0x750 fs/btrfs/ioctl.c:995
         btrfs_mksnapshot+0xb5/0xf0 fs/btrfs/ioctl.c:1041
         __btrfs_ioctl_snap_create+0x344/0x460 fs/btrfs/ioctl.c:1294
         btrfs_ioctl_snap_create+0x13c/0x190 fs/btrfs/ioctl.c:1321
         btrfs_ioctl+0xbbf/0xd40
         vfs_ioctl fs/ioctl.c:51 [inline]
         __do_sys_ioctl fs/ioctl.c:871 [inline]
         __se_sys_ioctl+0xf8/0x170 fs/ioctl.c:857
         do_syscall_x64 arch/x86/entry/common.c:51 [inline]
         do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82
         entry_SYSCALL_64_after_hwframe+0x63/0x6b
        RIP: 0033:0x7f2f791127b9
         </TASK>
      
      [CAUSE]
      The error number is -EEXIST, which can happen for qgroup if there is
      already an existing qgroup and then we're trying to create a snapshot
      for it.
      
      [FIX]
      In that case we can continue creating the snapshot. Although it may
      lead to qgroup inconsistency, that is not critical enough to abort the
      current transaction.
      
      So in this case, we can just ignore the non-critical errors, mostly -EEXIST
      (there is already a qgroup).
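
      The decision can be sketched as a tiny predicate. This is an
      illustration, not the patch itself: the function name is hypothetical,
      and treating -EEXIST as the only ignorable error is an assumption of
      this sketch (the text above says "mostly").

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Decide whether a qgroup creation result should abort the transaction.
 * 'ret' follows the kernel convention: 0 on success, negative errno on error.
 * Assumption: only -EEXIST (qgroup already present) is treated as harmless.
 */
static bool qgroup_error_is_fatal(int ret)
{
	assert(ret <= 0);
	return ret < 0 && ret != -EEXIST;
}
```

      The "error -17" in the warning above is -EEXIST on Linux, which this
      predicate classifies as non-fatal.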
      
      Reported-by: syzbot+4d81015bc10889fd12ea@syzkaller.appspotmail.com
      Fixes: 6ed05643 ("btrfs: create qgroup earlier in snapshot creation")
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: tree-checker: add type and sequence check for inline backrefs · 1645c283
      Qu Wenruo authored
      [BUG]
      There is a bug report that ntfs2btrfs had a bug that can lead to a
      transaction abort, with the filesystem flipping to read-only.
      
      [CAUSE]
      For inline backref items, the kernel has a strict requirement for their
      order; they must follow these rules:
      
      - All btrfs_extent_inline_ref::type should be in an ascending order
      
      - Within the same type, the items should follow a descending order by
        their sequence number
      
        For EXTENT_DATA_REF type, the sequence number is the result of
        hash_extent_data_ref().
        For other types, their sequence numbers are
        btrfs_extent_inline_ref::offset.
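
      The two ordering rules can be expressed as a simple validation loop. The
      sketch below is an illustration only: struct inline_ref and
      inline_refs_ordered() are hypothetical names, the sequence number is
      assumed to be precomputed by the caller, and accepting equal sequence
      numbers within one type is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flattened view of one inline backref. */
struct inline_ref {
	uint8_t type;	/* inline ref type */
	uint64_t seq;	/* hash or offset, depending on the type */
};

/*
 * Returns true if types never decrease and, within a run of the same
 * type, sequence numbers never increase.
 */
static bool inline_refs_ordered(const struct inline_ref *refs, size_t n)
{
	assert(refs != NULL || n == 0);

	for (size_t i = 1; i < n; i++) {
		if (refs[i].type < refs[i - 1].type)
			return false;	/* type went backwards */
		if (refs[i].type == refs[i - 1].type &&
		    refs[i].seq > refs[i - 1].seq)
			return false;	/* sequence went up within a type */
	}
	return true;
}
```

      When the type changes, the sequence comparison starts over, so a higher
      sequence number is fine as long as it belongs to a higher type.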
      
      Thus if there is any code not following the above rules, the resulting
      inline backrefs can prevent the kernel from locating the needed inline
      backref and lead to transaction abort.
      
      [FIX]
      ntfs2btrfs has already fixed the problem, and btrfs-progs has added the
      ability to detect such problems.
      
      For the kernel, let's be more noisy and more specific about the order, so
      that the next time the kernel hits such a problem we reject it in the
      first place, without leading to transaction abort.
      
      Link: https://github.com/kdave/btrfs-progs/pull/622
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 09 Nov, 2023 3 commits
    • btrfs: make OWNER_REF_KEY type value smallest among inline refs · d3933152
      Boris Burkov authored
      BTRFS_EXTENT_OWNER_REF_KEY is the type of simple quotas extent owner
      refs. This special inline ref goes in front of all other inline refs.
      
      In general, inline refs have a required sorted order such that type never
      decreases (among other requirements). This was recently reified into a
      tree-checker and fsck rule, which broke simple quotas. To be fair,
      though, in a sense, the new owner ref item had also violated that not
      yet fully enforced requirement.
      
      This fix brings the owner ref item into compliance with the requirement
      that inline ref type never decrease.
      
      btrfs/301 exercises this behavior and should pass again with this fix.
      
      Fixes: d9a620f7 ("btrfs: new inline ref storing owning subvol of data extents")
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix qgroup record leaks when using simple quotas · 609d9937
      Filipe Manana authored
      When using simple quotas we are not supposed to allocate qgroup records
      when adding delayed references. However we allocate them if either mode
      of quotas is enabled (the new simple one or the old one), but then we
      never free them because running the accounting, which frees the records,
      is only run when using the old quotas (at btrfs_qgroup_account_extents()),
      resulting in a memory leak of the records allocated when adding delayed
      references.
      
      Fix this by allocating the records only if the old quotas mode is enabled.
      Also fix btrfs_qgroup_trace_extent_nolock() to return 1 if the old quotas
      mode is not enabled - meaning the caller has to free the record.
      
      Fixes: 182940f4 ("btrfs: qgroup: add new quota mode for simple quotas")
      Reported-by: syzbot+d3ddc6dcc6386dea398b@syzkaller.appspotmail.com
      Link: https://lore.kernel.org/linux-btrfs/00000000000004769106097f9a34@google.com/
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix race between accounting qgroup extents and removing a qgroup · 6c8e69e4
      Filipe Manana authored
      When doing qgroup accounting for an extent, we take the spinlock
      fs_info->qgroup_lock and then add qgroups to the local list (iterator)
      named "qgroups". These qgroups are found in the fs_info->qgroup_tree
      rbtree. After we're done, we unlock fs_info->qgroup_lock and then call
      qgroup_iterator_nested_clean(), which will iterate over all the qgroups
      added to the local list "qgroups" and then delete them from the list.
      Deleting a qgroup from the list can however result in a use-after-free
      if a qgroup remove operation happens after we unlock fs_info->qgroup_lock
      and before or while we are at qgroup_iterator_nested_clean().
      
      Fix this by calling qgroup_iterator_nested_clean() while still holding
      the lock fs_info->qgroup_lock - we don't need it under the 'out' label
      since before taking the lock the "qgroups" list is always empty. This
      guarantees safety because btrfs_remove_qgroup() takes that lock before
      removing a qgroup from the rbtree fs_info->qgroup_tree.
      
      This was reported by syzbot with the following stack traces:
      
         BUG: KASAN: slab-use-after-free in __list_del_entry_valid_or_report+0x2f/0x130 lib/list_debug.c:49
         Read of size 8 at addr ffff888027e420b0 by task kworker/u4:3/48
      
         CPU: 1 PID: 48 Comm: kworker/u4:3 Not tainted 6.6.0-syzkaller-10396-g4652b8e4 #0
         Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/09/2023
         Workqueue: btrfs-qgroup-rescan btrfs_work_helper
         Call Trace:
          <TASK>
          __dump_stack lib/dump_stack.c:88 [inline]
          dump_stack_lvl+0x1e7/0x2d0 lib/dump_stack.c:106
          print_address_description mm/kasan/report.c:364 [inline]
          print_report+0x163/0x540 mm/kasan/report.c:475
          kasan_report+0x175/0x1b0 mm/kasan/report.c:588
          __list_del_entry_valid_or_report+0x2f/0x130 lib/list_debug.c:49
          __list_del_entry_valid include/linux/list.h:124 [inline]
          __list_del_entry include/linux/list.h:215 [inline]
          list_del_init include/linux/list.h:287 [inline]
          qgroup_iterator_nested_clean fs/btrfs/qgroup.c:2623 [inline]
          btrfs_qgroup_account_extent+0x18b/0x1150 fs/btrfs/qgroup.c:2883
          qgroup_rescan_leaf fs/btrfs/qgroup.c:3543 [inline]
          btrfs_qgroup_rescan_worker+0x1078/0x1c60 fs/btrfs/qgroup.c:3604
          btrfs_work_helper+0x37c/0xbd0 fs/btrfs/async-thread.c:315
          process_one_work kernel/workqueue.c:2630 [inline]
          process_scheduled_works+0x90f/0x1400 kernel/workqueue.c:2703
          worker_thread+0xa5f/0xff0 kernel/workqueue.c:2784
          kthread+0x2d3/0x370 kernel/kthread.c:388
          ret_from_fork+0x48/0x80 arch/x86/kernel/process.c:147
          ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242
          </TASK>
      
         Allocated by task 6355:
          kasan_save_stack mm/kasan/common.c:45 [inline]
          kasan_set_track+0x4f/0x70 mm/kasan/common.c:52
          ____kasan_kmalloc mm/kasan/common.c:374 [inline]
          __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:383
          kmalloc include/linux/slab.h:600 [inline]
          kzalloc include/linux/slab.h:721 [inline]
          btrfs_quota_enable+0xee9/0x2060 fs/btrfs/qgroup.c:1209
          btrfs_ioctl_quota_ctl+0x143/0x190 fs/btrfs/ioctl.c:3705
          vfs_ioctl fs/ioctl.c:51 [inline]
          __do_sys_ioctl fs/ioctl.c:871 [inline]
          __se_sys_ioctl+0xf8/0x170 fs/ioctl.c:857
          do_syscall_x64 arch/x86/entry/common.c:51 [inline]
          do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82
          entry_SYSCALL_64_after_hwframe+0x63/0x6b
      
         Freed by task 6355:
          kasan_save_stack mm/kasan/common.c:45 [inline]
          kasan_set_track+0x4f/0x70 mm/kasan/common.c:52
          kasan_save_free_info+0x28/0x40 mm/kasan/generic.c:522
          ____kasan_slab_free+0xd6/0x120 mm/kasan/common.c:236
          kasan_slab_free include/linux/kasan.h:164 [inline]
          slab_free_hook mm/slub.c:1800 [inline]
          slab_free_freelist_hook mm/slub.c:1826 [inline]
          slab_free mm/slub.c:3809 [inline]
          __kmem_cache_free+0x263/0x3a0 mm/slub.c:3822
          btrfs_remove_qgroup+0x764/0x8c0 fs/btrfs/qgroup.c:1787
          btrfs_ioctl_qgroup_create+0x185/0x1e0 fs/btrfs/ioctl.c:3811
          vfs_ioctl fs/ioctl.c:51 [inline]
          __do_sys_ioctl fs/ioctl.c:871 [inline]
          __se_sys_ioctl+0xf8/0x170 fs/ioctl.c:857
          do_syscall_x64 arch/x86/entry/common.c:51 [inline]
          do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82
          entry_SYSCALL_64_after_hwframe+0x63/0x6b
      
         Last potentially related work creation:
          kasan_save_stack+0x3f/0x60 mm/kasan/common.c:45
          __kasan_record_aux_stack+0xad/0xc0 mm/kasan/generic.c:492
          __call_rcu_common kernel/rcu/tree.c:2667 [inline]
          call_rcu+0x167/0xa70 kernel/rcu/tree.c:2781
          kthread_worker_fn+0x4ba/0xa90 kernel/kthread.c:823
          kthread+0x2d3/0x370 kernel/kthread.c:388
          ret_from_fork+0x48/0x80 arch/x86/kernel/process.c:147
          ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242
      
         Second to last potentially related work creation:
          kasan_save_stack+0x3f/0x60 mm/kasan/common.c:45
          __kasan_record_aux_stack+0xad/0xc0 mm/kasan/generic.c:492
          __call_rcu_common kernel/rcu/tree.c:2667 [inline]
          call_rcu+0x167/0xa70 kernel/rcu/tree.c:2781
          kthread_worker_fn+0x4ba/0xa90 kernel/kthread.c:823
          kthread+0x2d3/0x370 kernel/kthread.c:388
          ret_from_fork+0x48/0x80 arch/x86/kernel/process.c:147
          ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242
      
         The buggy address belongs to the object at ffff888027e42000
          which belongs to the cache kmalloc-512 of size 512
         The buggy address is located 176 bytes inside of
          freed 512-byte region [ffff888027e42000, ffff888027e42200)
      
         The buggy address belongs to the physical page:
         page:ffffea00009f9000 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x27e40
         head:ffffea00009f9000 order:2 entire_mapcount:0 nr_pages_mapped:0 pincount:0
         flags: 0xfff00000000840(slab|head|node=0|zone=1|lastcpupid=0x7ff)
         page_type: 0xffffffff()
         raw: 00fff00000000840 ffff888012c41c80 ffffea0000a5ba00 dead000000000002
         raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
         page dumped because: kasan: bad access detected
         page_owner tracks the page as allocated
         page last allocated via order 2, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 4514, tgid 4514 (udevadm), ts 24598439480, free_ts 23755696267
          set_page_owner include/linux/page_owner.h:31 [inline]
          post_alloc_hook+0x1e6/0x210 mm/page_alloc.c:1536
          prep_new_page mm/page_alloc.c:1543 [inline]
          get_page_from_freelist+0x31db/0x3360 mm/page_alloc.c:3170
          __alloc_pages+0x255/0x670 mm/page_alloc.c:4426
          alloc_slab_page+0x6a/0x160 mm/slub.c:1870
          allocate_slab mm/slub.c:2017 [inline]
          new_slab+0x84/0x2f0 mm/slub.c:2070
          ___slab_alloc+0xc85/0x1310 mm/slub.c:3223
          __slab_alloc mm/slub.c:3322 [inline]
          __slab_alloc_node mm/slub.c:3375 [inline]
          slab_alloc_node mm/slub.c:3468 [inline]
          __kmem_cache_alloc_node+0x19d/0x270 mm/slub.c:3517
          kmalloc_trace+0x2a/0xe0 mm/slab_common.c:1098
          kmalloc include/linux/slab.h:600 [inline]
          kzalloc include/linux/slab.h:721 [inline]
          kernfs_fop_open+0x3e7/0xcc0 fs/kernfs/file.c:670
          do_dentry_open+0x8fd/0x1590 fs/open.c:948
          do_open fs/namei.c:3622 [inline]
          path_openat+0x2845/0x3280 fs/namei.c:3779
          do_filp_open+0x234/0x490 fs/namei.c:3809
          do_sys_openat2+0x13e/0x1d0 fs/open.c:1440
          do_sys_open fs/open.c:1455 [inline]
          __do_sys_openat fs/open.c:1471 [inline]
          __se_sys_openat fs/open.c:1466 [inline]
          __x64_sys_openat+0x247/0x290 fs/open.c:1466
          do_syscall_x64 arch/x86/entry/common.c:51 [inline]
          do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82
          entry_SYSCALL_64_after_hwframe+0x63/0x6b
         page last free stack trace:
          reset_page_owner include/linux/page_owner.h:24 [inline]
          free_pages_prepare mm/page_alloc.c:1136 [inline]
          free_unref_page_prepare+0x8c3/0x9f0 mm/page_alloc.c:2312
          free_unref_page+0x37/0x3f0 mm/page_alloc.c:2405
          discard_slab mm/slub.c:2116 [inline]
          __unfreeze_partials+0x1dc/0x220 mm/slub.c:2655
          put_cpu_partial+0x17b/0x250 mm/slub.c:2731
          __slab_free+0x2b6/0x390 mm/slub.c:3679
          qlink_free mm/kasan/quarantine.c:166 [inline]
          qlist_free_all+0x75/0xe0 mm/kasan/quarantine.c:185
          kasan_quarantine_reduce+0x14b/0x160 mm/kasan/quarantine.c:292
          __kasan_slab_alloc+0x23/0x70 mm/kasan/common.c:305
          kasan_slab_alloc include/linux/kasan.h:188 [inline]
          slab_post_alloc_hook+0x67/0x3d0 mm/slab.h:762
          slab_alloc_node mm/slub.c:3478 [inline]
          slab_alloc mm/slub.c:3486 [inline]
          __kmem_cache_alloc_lru mm/slub.c:3493 [inline]
          kmem_cache_alloc+0x104/0x2c0 mm/slub.c:3502
          getname_flags+0xbc/0x4f0 fs/namei.c:140
          do_sys_openat2+0xd2/0x1d0 fs/open.c:1434
          do_sys_open fs/open.c:1455 [inline]
          __do_sys_openat fs/open.c:1471 [inline]
          __se_sys_openat fs/open.c:1466 [inline]
          __x64_sys_openat+0x247/0x290 fs/open.c:1466
          do_syscall_x64 arch/x86/entry/common.c:51 [inline]
          do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82
          entry_SYSCALL_64_after_hwframe+0x63/0x6b
      
         Memory state around the buggy address:
          ffff888027e41f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
          ffff888027e42000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
         >ffff888027e42080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                              ^
          ffff888027e42100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
          ffff888027e42180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      
      Reported-by: syzbot+e0b615318f8fcfc01ceb@syzkaller.appspotmail.com
      Fixes: dce28769 ("btrfs: qgroup: use qgroup_iterator_nested to in qgroup_update_refcnt()")
      CC: stable@vger.kernel.org # 6.6
      Link: https://lore.kernel.org/linux-btrfs/00000000000091a5b2060936bf6d@google.com/
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 03 Nov, 2023 7 commits
  6. 12 Oct, 2023 21 commits
    • btrfs: open code timespec64 in struct btrfs_inode · c6e8f898
      David Sterba authored
      The type of timespec64::tv_nsec is 'long', while we have only
      u32 for on-disk and in-memory. This wastes a few bytes in btrfs_inode.
      Add separate members for sec and nsec with the corresponding type width.
      This creates a 4 byte hole in btrfs_inode which can be utilized in the
      future.
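
      The space effect can be illustrated with two toy structs. This is a
      sketch under stated assumptions: ts64_like models a timespec64 whose
      members are both 64 bits wide, and the struct and member names
      (inode_before, inode_after, stamp, other) are hypothetical, not the
      real btrfs_inode layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for timespec64 on a 64-bit kernel: both members are 64 bits. */
struct ts64_like {
	int64_t tv_sec;
	int64_t tv_nsec;
};

/* Before: the timestamp embedded as a full timespec64-like struct. */
struct inode_before {
	struct ts64_like stamp;
	uint32_t other;		/* hypothetical neighbouring 32-bit field */
};

/* After: open-coded members with the width the data actually needs. */
struct inode_after {
	uint64_t stamp_sec;
	uint32_t stamp_nsec;
	uint32_t other;		/* can sit right after the 32-bit nsec */
};

/* Bytes saved by open-coding the timestamp in this toy layout. */
static size_t bytes_saved(void)
{
	assert(sizeof(struct inode_after) <= sizeof(struct inode_before));
	return sizeof(struct inode_before) - sizeof(struct inode_after);
}
```

      Without a neighbouring 32-bit member, the space after the 32-bit nsec
      field would be padding, which corresponds to the hole mentioned above.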
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove redundant log root tree index assignment during log sync · cc687c2e
      Filipe Manana authored
      During log syncing, when we start updating the log root tree we compute
      an index value, stored in variable 'index2', once we lock the log root
      tree's mutex. This value depends on the log root's log_transid. And
      shortly after we compute again the same value for 'index2' - the value
      is exactly the same since we haven't released the mutex and therefore
      the log_transid of the log root is the same as before.
      
      This second 'index2' computation became pointless after commit
      a93e0168 ("btrfs: remove no longer needed use of log_writers for the
      log root tree"). So remove it.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove redundant initialization of variable dirty in btrfs_update_time() · a666ce9b
      Colin Ian King authored
      The variable dirty is initialized with a value that is never read, it
      is being re-assigned later on. Remove the redundant initialization.
      Cleans up clang scan build warning:
      
        fs/btrfs/inode.c:5965:7: warning: Value stored to 'dirty' during its
        initialization is never read [deadcode.DeadStores]
      Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: sysfs: show temp_fsid feature · f3623740
      Anand Jain authored
      This adds sysfs objects to indicate temp_fsid feature support and
      its status.
      
        /sys/fs/btrfs/features/temp_fsid
        /sys/fs/btrfs/<UUID>/temp_fsid
      
      For example:
      
         Consider two cloned and mounted devices.
      
            $ blkid /dev/sdc[1-2]
            /dev/sdc1: UUID="509ad44b-ad2a-4a8a-bc8d-fe69db7220d5" ..
            /dev/sdc2: UUID="509ad44b-ad2a-4a8a-bc8d-fe69db7220d5" ..
      
         One gets actual fsid, and the other gets the temp_fsid when
         mounted.
      
            $ btrfs filesystem show -m
            Label: none  uuid: 509ad44b-ad2a-4a8a-bc8d-fe69db7220d5
      	      Total devices 1 FS bytes used 54.14MiB
      	      devid    1 size 300.00MiB used 144.00MiB path /dev/sdc1
      
            Label: none  uuid: 33bad74e-c91b-43a5-aef8-b3cab97ae63a
      	      Total devices 1 FS bytes used 54.14MiB
      	      devid    1 size 300.00MiB used 144.00MiB path /dev/sdc2
      
         Their sysfs entries are as below.
      
            $ cat /sys/fs/btrfs/features/temp_fsid
            0
      
            $ cat /sys/fs/btrfs/509ad44b-ad2a-4a8a-bc8d-fe69db7220d5/temp_fsid
            0
      
            $ cat /sys/fs/btrfs/33bad74e-c91b-43a5-aef8-b3cab97ae63a/temp_fsid
            1
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: disable the device add feature for temp-fsid · ac6ea6a9
      Anand Jain authored
      The device addition operation will transform the cloned temp-fsid mounted
      device into a multi-device filesystem. Therefore, it is marked as
      unsupported.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: disable the seed feature for temp-fsid · c47b02c1
      Anand Jain authored
      A seed device is an integral component of the sprout device, which
      functions as a multi-device filesystem. Therefore, the temp-fsid feature
      is not supported.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c47b02c1
    • Anand Jain's avatar
      btrfs: update comment for temp-fsid, fsid, and metadata_uuid · 000331bb
      Anand Jain authored
      Update the comment to explain the relationship between temp_fsid, fsid,
      and metadata_uuid.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      000331bb
    • Filipe Manana's avatar
      btrfs: remove pointless empty log context list check when syncing log · 3cf63ddf
      Filipe Manana authored
      When syncing the log, if we get an error when updating the log root, we
      first check if the log root tree context is in a log context list, and if
      so we delete the log root tree context from the list. This check is
      pointless, however, because at this point the context is always in a
      list: we have just added it to a context list. The check became pointless
      after commit a93e0168 ("btrfs: remove no longer needed use of
      log_writers for the log root tree"). So remove this now pointless empty
      list check.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3cf63ddf
    • Filipe Manana's avatar
      btrfs: update comment for struct btrfs_inode::lock · 68539bd0
      Filipe Manana authored
      Update the comment for the lock named "lock" in struct btrfs_inode because
      it does not mention that the fields "delalloc_bytes", "defrag_bytes",
      "csum_bytes", "outstanding_extents" and "disk_i_size" are also protected
      by that lock.
      
      Also add a comment on top of each field protected by this lock to mention
      that the lock protects them.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      68539bd0
    • Filipe Manana's avatar
      btrfs: remove pointless barrier from btrfs_sync_file() · 5ca1949b
      Filipe Manana authored
      The memory barrier (smp_mb()) at btrfs_sync_file() is completely redundant
      now that fs_info->last_trans_committed is read using READ_ONCE(), with the
      helper btrfs_get_last_trans_committed(), and written using WRITE_ONCE()
      with the helper btrfs_set_last_trans_committed().
      
      This barrier was introduced in 2011, by commit a4abeea4 ("Btrfs: kill
      trans_mutex"), but even back then it was not correct since the writer side
      (in btrfs_commit_transaction()), did not issue a pairing memory barrier
      after it updated fs_info->last_trans_committed.
      
      So remove this barrier.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5ca1949b
    • Filipe Manana's avatar
      btrfs: add and use helpers for reading and writing last_trans_committed · 0124855f
      Filipe Manana authored
      Currently the last_trans_committed field of struct btrfs_fs_info is
      modified and read without any locking or other protection. For example
      early in the fsync path, skip_inode_logging() is called which reads
      fs_info->last_trans_committed, but at the same time we can have a
      transaction commit completing and updating that field.
      
      In the case of an fsync this is harmless and any data race should be
      rare and at most cause an unnecessary logging of an inode.
      
      To avoid data race warnings from tools like KCSAN and other issues such
      as load and store tearing (amongst others, see [1]), create helpers to
      access the last_trans_committed field of struct btrfs_fs_info using
      READ_ONCE() and WRITE_ONCE(), and use these helpers everywhere.
      
      [1] https://lwn.net/Articles/793253/
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0124855f
    • Filipe Manana's avatar
      btrfs: add and use helpers for reading and writing fs_info->generation · 4a4f8fe2
      Filipe Manana authored
      Currently the generation field of struct btrfs_fs_info is always modified
      while holding fs_info->trans_lock locked. Most readers will access this
      field without taking that lock but while holding a transaction handle,
      which is safe to do due to the transaction life cycle.
      
      However there are other readers that are neither holding the lock nor
      holding a transaction handle open:
      
      1) When reading an inode from disk, at btrfs_read_locked_inode();
      
      2) When reading the generation to expose it to sysfs, at
         btrfs_generation_show();
      
      3) Early in the fsync path, at skip_inode_logging();
      
      4) When creating a hole at btrfs_cont_expand(), during write paths,
         truncate and reflinking;
      
      5) In the fs_info ioctl (btrfs_ioctl_fs_info());
      
      6) While mounting the filesystem, in the open_ctree() path. In these
         cases it's safe to directly read fs_info->generation as no one
         can concurrently start a transaction and update fs_info->generation.
      
      In case of the fsync path, races here should be harmless, and in the worst
      case they may cause a fsync to log an inode when it's not really needed,
      so nothing bad from a functional perspective. In the other cases it's not
      so clear if functional problems may arise, though in case 1 rare things
      like a load/store tearing [1] may cause the BTRFS_INODE_NEEDS_FULL_SYNC
      flag not being set on an inode and therefore result in incorrect logging
      later on in case a fsync call is made.
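
To make the tearing concern concrete, here is a small userspace model of a torn read. It is an illustration only, assuming a hypothetical situation in which a 64-bit store is carried out as two 32-bit stores (low half first) and a reader runs in between; the function names are invented and nothing here is kernel code. READ_ONCE() and WRITE_ONCE() are meant to keep the compiler from introducing this kind of split access.

```python
MASK32 = 0xFFFFFFFF


def split64(value):
    """Split a 64-bit value into its (low, high) 32-bit halves."""
    return value & MASK32, (value >> 32) & MASK32


def join64(low, high):
    """Rebuild a 64-bit value from its 32-bit halves."""
    return (high << 32) | low


def torn_read(old, new):
    """Value seen by a reader that runs after the low half of `new` has
    been stored but before the high half has (assumed store order)."""
    new_low, _ = split64(new)
    _, old_high = split64(old)
    return join64(new_low, old_high)


old_generation = 0x00000000FFFFFFFF
new_generation = old_generation + 1  # 0x0000000100000000
seen = torn_read(old_generation, new_generation)
print(hex(seen))  # prints 0x0: neither the old nor the new generation
```

A reader comparing such a value against an inode's generation could reach the wrong conclusion, which is the sort of rare misbehaviour described for case 1.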
      
      To avoid data race warnings from tools like KCSAN and other issues such
      as load and store tearing (amongst others, see [1]), create helpers to
      access the generation field of struct btrfs_fs_info using READ_ONCE() and
      WRITE_ONCE(), and use these helpers where needed.
      
      [1] https://lwn.net/Articles/793253/
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4a4f8fe2
    • Filipe Manana's avatar
      btrfs: add and use helpers for reading and writing log_transid · 6008859b
      Filipe Manana authored
      Currently the log_transid field of a root is always modified while holding
      the root's log_mutex locked. Most readers of a root's log_transid are also
      holding the root's log_mutex locked, however there is one exception which
      is btrfs_set_inode_last_trans() where we don't take the lock to avoid
      blocking several operations if log syncing is happening in parallel.
      
      Any races here should be harmless, and in the worst case they may cause a
      fsync to log an inode when it's not really needed, so nothing bad from a
      functional perspective.
      
      To avoid data race warnings from tools like KCSAN and other issues such
      as load and store tearing (amongst others, see [1]), create helpers to
      access the log_transid field of a root using READ_ONCE() and WRITE_ONCE(),
      and use these helpers where needed.
      
      [1] https://lwn.net/Articles/793253/
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6008859b
    • Filipe Manana's avatar
      btrfs: add and use helpers for reading and writing last_log_commit · f9850787
      Filipe Manana authored
      Currently, the last_log_commit of a root can be accessed concurrently
      without any lock protection. Readers can be calling btrfs_inode_in_log()
      early in a fsync call, which reads a root's last_log_commit, while a
      writer can change the last_log_commit while a log tree is being synced,
      at btrfs_sync_log(). Any races here should be harmless, and in the worst
      case they may cause a fsync to log an inode when it's not really needed,
      so nothing bad from a functional perspective.
      
      To avoid data race warnings from tools like KCSAN and other issues such
      as load and store tearing (amongst others, see [1]), create helpers to
      access the last_log_commit field of a root using READ_ONCE() and
      WRITE_ONCE(), and use these helpers everywhere.
      
      [1] https://lwn.net/Articles/793253/
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f9850787
    • Anand Jain's avatar
      btrfs: support cloned-device mount capability · a5b8a5f9
      Anand Jain authored
      Guilherme's previous work [1] aimed at mounting cloned devices by
      setting a superblock flag, SINGLE_DEV, during mkfs.
       [1] https://lore.kernel.org/linux-btrfs/20230831001544.3379273-1-gpiccoli@igalia.com/
      
      Building upon this work, here is an in-memory-only approach. At mount
      time we determine whether the same fsid is already mounted; if so, we
      generate a random temp fsid to be used for the mount. It is kept in
      memory only and is not written to disk. We distinguish devices by devt.
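
The mount-time decision described above can be modelled roughly as follows. This is a userspace sketch under stated assumptions, not the kernel implementation: `MountRegistry` and its `mount()` method are hypothetical names, a devt is represented as a plain (major, minor) tuple, and the random fsid comes from Python's `uuid.uuid4()`.

```python
import uuid


class MountRegistry:
    """Hypothetical in-memory table of mounted filesystems, keyed by fsid."""

    def __init__(self):
        self._mounted = {}  # in-memory fsid -> devt of the mounted device

    def mount(self, disk_fsid, devt):
        """Return the fsid to use in memory for this mount."""
        owner = self._mounted.get(disk_fsid)
        if owner is not None and owner != devt:
            # Same on-disk fsid, different device: use a random temp fsid
            # that lives only in memory and is never written to disk.
            fsid = uuid.uuid4()
        else:
            fsid = disk_fsid
        self._mounted[fsid] = devt
        return fsid
```

Mounting three clones that share one on-disk fsid through this model gives the first mount the on-disk fsid and each later mount its own random fsid, which mirrors the three distinct uuids in the `btrfs fi show -m` output of the example.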
      
      Example:
        $ fallocate -l 300m ./disk1.img
        $ mkfs.btrfs -f ./disk1.img
        $ cp ./disk1.img ./disk2.img
        $ cp ./disk1.img ./disk3.img
        $ mount -o loop ./disk1.img /btrfs
        $ mount -o loop ./disk2.img /btrfs1
        $ mount -o loop ./disk3.img /btrfs2
      
        $ btrfs fi show -m
        Label: none  uuid: 4a212b48-1bec-46a5-938a-783c8c1f0b02
      	Total devices 1 FS bytes used 144.00KiB
      	devid    1 size 300.00MiB used 88.00MiB path /dev/loop0
      
        Label: none  uuid: adabf2fe-5515-4ad0-95b4-7b1609218c16
      	Total devices 1 FS bytes used 144.00KiB
      	devid    1 size 300.00MiB used 88.00MiB path /dev/loop1
      
        Label: none  uuid: 1d77d0df-7d92-439e-adbd-20b9b86fdedb
      	Total devices 1 FS bytes used 144.00KiB
      	devid    1 size 300.00MiB used 88.00MiB path /dev/loop2
      Co-developed-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a5b8a5f9
    • Anand Jain's avatar
      btrfs: add helper function find_fsid_by_disk · 69d427f3
      Anand Jain authored
      In preparation for adding support to mount multiple single-disk
      btrfs filesystems with the same FSID, wrap find_fsid() into
      find_fsid_by_disk().
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      69d427f3
    • Filipe Manana's avatar
      btrfs: stop reserving excessive space for block group item insertions · 9ef17228
      Filipe Manana authored
      Space for block group item insertions, necessary after allocating a new
      block group, is reserved in the delayed refs block reserve. Currently we
      do this by incrementing the transaction handle's delayed_ref_updates
      counter and then calling btrfs_update_delayed_refs_rsv(), which will
      increase the size of the delayed refs block reserve by an amount that
      corresponds to the same amount we use for delayed refs, given by
      btrfs_calc_delayed_ref_bytes().
      
      That is an excessive amount because it corresponds to the amount of space
      needed to insert one item in a btree (btrfs_calc_insert_metadata_size())
      times 2 when the free space tree feature is enabled. All we need is an
      amount as given by btrfs_calc_insert_metadata_size(), since we only need to
      insert a block group item in the extent tree (or block group tree if this
      feature is enabled). By using btrfs_calc_insert_metadata_size() we will
      need to reserve 2 times less space when using the free space tree, putting
      less pressure on space reservation.
      
      So use helpers to reserve and release space for block group item
      insertions that use btrfs_calc_insert_metadata_size() for calculation of
      the space.
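
The saving can be checked with a little arithmetic. The sketch below is a userspace model, assuming a 16KiB nodesize, a maximum btree height of 8, and the formula nodesize * height * 2 * items for btrfs_calc_insert_metadata_size(); those constants and the Python function names are assumptions for illustration, while the factor of 2 for the free space tree is taken from the text above.

```python
BTRFS_MAX_LEVEL = 8   # assumed maximum btree height
NODESIZE = 16 * 1024  # assumed nodesize (16KiB)


def calc_insert_metadata_size(nodesize, num_items):
    """Assumed model of btrfs_calc_insert_metadata_size()."""
    return nodesize * BTRFS_MAX_LEVEL * 2 * num_items


def calc_delayed_ref_bytes(nodesize, num_items, free_space_tree):
    """Insert size, doubled when the free space tree is enabled."""
    size = calc_insert_metadata_size(nodesize, num_items)
    return size * 2 if free_space_tree else size


old = calc_delayed_ref_bytes(NODESIZE, 1, free_space_tree=True)
new = calc_insert_metadata_size(NODESIZE, 1)
print(old, new, old // new)  # prints 524288 262144 2
```

Under these assumptions the reservation for one block group item insertion drops from 512KiB to 256KiB when the free space tree is in use, and stays the same without it.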
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9ef17228
    • Filipe Manana's avatar
      btrfs: stop reserving excessive space for block group item updates · f66e0209
      Filipe Manana authored
      Space for block group item updates, necessary after allocating or
      deallocating an extent from a block group, is reserved in the delayed
      refs block reserve. Currently we do this by incrementing the transaction
      handle's delayed_ref_updates counter and then calling
      btrfs_update_delayed_refs_rsv(), which will increase the size of the
      delayed refs block reserve by an amount that corresponds to the same
      amount we use for delayed refs, given by btrfs_calc_delayed_ref_bytes().
      
      That is an excessive amount because it corresponds to the amount of space
      needed to insert one item in a btree (btrfs_calc_insert_metadata_size())
      times 2 when the free space tree feature is enabled. All we need is an
      amount as given by btrfs_calc_metadata_size(), since we only need to
      update an existing block group item in the extent tree (or block group
      tree if this feature is enabled). By using btrfs_calc_metadata_size() we
      will need to reserve 4 times less space when using the free space tree
      and 2 times less space when not using it, putting less pressure on space
      reservation.
      
      So use helpers to reserve and release space for block group item updates
      that use btrfs_calc_metadata_size() for calculation of the space.
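
The 4x and 2x figures follow from the relative sizes of the helpers. The model below is a sketch under assumptions: a 16KiB nodesize, a maximum btree height of 8, btrfs_calc_metadata_size() taken as nodesize * height * items, and btrfs_calc_insert_metadata_size() as twice that. The Python names are illustrative stand-ins, not kernel code.

```python
BTRFS_MAX_LEVEL = 8   # assumed maximum btree height
NODESIZE = 16 * 1024  # assumed nodesize (16KiB)


def calc_metadata_size(nodesize, num_items):
    """Assumed model of btrfs_calc_metadata_size(): update an existing item."""
    return nodesize * BTRFS_MAX_LEVEL * num_items


def calc_delayed_ref_bytes(nodesize, num_items, free_space_tree):
    """Assumed model: insert size (2x the update size), doubled again
    when the free space tree is enabled."""
    insert_size = 2 * calc_metadata_size(nodesize, num_items)
    return insert_size * 2 if free_space_tree else insert_size


new = calc_metadata_size(NODESIZE, 1)
ratio_with_fst = calc_delayed_ref_bytes(NODESIZE, 1, True) // new
ratio_without_fst = calc_delayed_ref_bytes(NODESIZE, 1, False) // new
print(ratio_with_fst, ratio_without_fst)  # prints 4 2
```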
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f66e0209
    • David Sterba's avatar
      btrfs: reorder btrfs_inode to fill gaps · 398fb913
      David Sterba authored
      The previous commit created a hole in struct btrfs_inode; we can move
      outstanding_extents there. This reduces size by 8 bytes from 1120 to
      1112 on a release config.
      Signed-off-by: David Sterba <dsterba@suse.com>
      398fb913
    • David Sterba's avatar
      btrfs: open code btrfs_ordered_inode_tree in btrfs_inode · 54c65371
      David Sterba authored
      The structure btrfs_ordered_inode_tree is used only in one place, in
      btrfs_inode. The structure itself has a 4 byte hole which is wasted
      space.
      
      Move the btrfs_ordered_inode_tree members to btrfs_inode with a common
      prefix 'ordered_tree_' where the hole can be utilized and shrink inode
      size.
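
The effect of open coding a small structure can be reproduced in userspace with ctypes. The layouts below are invented stand-ins, not the real btrfs_inode or btrfs_ordered_inode_tree; only the 'ordered_tree_' prefix comes from the text above. The sketch assumes an ABI where 64-bit integers are 8-byte aligned.

```python
import ctypes


class OrderedTree(ctypes.Structure):
    # Stand-in helper struct: 8-byte member + 4-byte member, which leaves
    # 4 bytes of tail padding on ABIs with 8-byte alignment.
    _fields_ = [("root", ctypes.c_uint64), ("lock", ctypes.c_uint32)]


class InodeNested(ctypes.Structure):
    # Embedding the helper struct keeps its padding; `flags` cannot use it.
    _fields_ = [("tree", OrderedTree), ("flags", ctypes.c_uint32)]


class InodeFlat(ctypes.Structure):
    # Members open coded with a common prefix; `flags` now fills the hole.
    _fields_ = [
        ("ordered_tree_root", ctypes.c_uint64),
        ("ordered_tree_lock", ctypes.c_uint32),
        ("flags", ctypes.c_uint32),
    ]


sizes = (
    ctypes.sizeof(OrderedTree),
    ctypes.sizeof(InodeNested),
    ctypes.sizeof(InodeFlat),
)
print(sizes)
```

On a typical 64-bit Linux ABI this is expected to report 16, 24 and 16 bytes: the nested layout pays for the helper's padding, the flat one does not.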
      Signed-off-by: David Sterba <dsterba@suse.com>
      54c65371
    • Josef Bacik's avatar
      btrfs: adjust overcommit logic when very close to full · cb6cbab7
      Josef Bacik authored
      A user reported some unpleasant behavior with very small file systems.
      The reproducer is this:
      
        $ mkfs.btrfs -f -m single -b 8g /dev/vdb
        $ mount /dev/vdb /mnt/test
        $ dd if=/dev/zero of=/mnt/test/testfile bs=512M count=20
      
      This will result in usage that looks like this:
      
        Overall:
            Device size:                   8.00GiB
            Device allocated:              8.00GiB
            Device unallocated:            1.00MiB
            Device missing:                  0.00B
            Device slack:                  2.00GiB
            Used:                          5.47GiB
            Free (estimated):              2.52GiB      (min: 2.52GiB)
            Free (statfs, df):               0.00B
            Data ratio:                       1.00
            Metadata ratio:                   1.00
            Global reserve:                5.50MiB      (used: 0.00B)
            Multiple profiles:                  no
      
        Data,single: Size:7.99GiB, Used:5.46GiB (68.41%)
           /dev/vdb        7.99GiB
      
        Metadata,single: Size:8.00MiB, Used:5.77MiB (72.07%)
           /dev/vdb        8.00MiB
      
        System,single: Size:4.00MiB, Used:16.00KiB (0.39%)
           /dev/vdb        4.00MiB
      
        Unallocated:
           /dev/vdb        1.00MiB
      
      As you can see we've gotten ourselves quite full with metadata, with all
      of the disk being allocated for data.
      
      On smaller file systems there's not a lot of time before we get full, so
      our overcommit behavior bites us here.  Generally speaking data
      reservations result in chunk allocations as we assume reservation ==
      actual use for data.  This means at any point we could end up with a
      chunk allocation for data, and if we're very close to full we could do
      this before we have a chance to figure out that we need another metadata
      chunk.
      
      Address this by adjusting the overcommit logic.  Simply put we need to
      take away 1 chunk from the available chunk space in case of a data
      reservation.  This will allow us to stop overcommitting before we
      potentially lose this space to a data allocation.  With this fix in
      place we properly allocate a metadata chunk before we're completely
      full, allowing for enough slack space in metadata.
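
The adjustment can be pictured with a simplified model of the overcommit check. This is a sketch under assumptions, not the kernel code: the function names, the 1GiB data chunk size, and the plain "used + requested < total + available" comparison are all illustrative choices. Only the idea of holding back one data chunk from the unallocated space comes from the description above.

```python
MiB = 1024 * 1024
GiB = 1024 * MiB


def available_for_overcommit(free_chunk_space, data_chunk_size):
    """Unallocated space that metadata may overcommit against, after
    holding back one data chunk (assumed size passed in by the caller)."""
    if free_chunk_space <= data_chunk_size:
        return 0
    return free_chunk_space - data_chunk_size


def can_overcommit(used, total, requested, free_chunk_space, data_chunk_size):
    """True if the metadata reservation may rely on unallocated space."""
    avail = available_for_overcommit(free_chunk_space, data_chunk_size)
    return used + requested < total + avail


# Plenty of unallocated space: overcommitting is still allowed.
roomy = can_overcommit(6 * MiB, 8 * MiB, 4 * MiB, 3 * GiB, 1 * GiB)
# Only one data chunk's worth left: stop overcommitting.
tight = can_overcommit(6 * MiB, 8 * MiB, 4 * MiB, 1 * GiB, 1 * GiB)
print(roomy, tight)  # prints True False
```

In the tight case the reservation can no longer be satisfied by overcommitting, which is the point at which a new metadata chunk would have to be allocated while unallocated space still exists.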
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cb6cbab7