1. 31 Jan, 2022 5 commits
    • Filipe Manana's avatar
      btrfs: fix use-after-free after failure to create a snapshot · 28b21c55
      Filipe Manana authored
      At ioctl.c:create_snapshot(), we allocate a pending snapshot structure and
      then attach it to the transaction's list of pending snapshots. After that
      we call btrfs_commit_transaction(), and if that returns an error we jump
      to 'fail' label, where we kfree() the pending snapshot structure. This can
      result in a later use-after-free of the pending snapshot:
      
      1) We allocated the pending snapshot and added it to the transaction's
         list of pending snapshots;
      
      2) We call btrfs_commit_transaction(), and it fails either at the first
         call to btrfs_run_delayed_refs() or btrfs_start_dirty_block_groups().
         In both cases, we don't abort the transaction and we release our
         transaction handle. We jump to the 'fail' label and free the pending
         snapshot structure. We return with the pending snapshot still in the
         transaction's list;
      
      3) Another task commits the transaction. This time there's no error at
         all, and then during the transaction commit it accesses a pointer
         to the pending snapshot structure that the snapshot creation task
         has already freed, resulting in a user-after-free.
      
      This issue could actually be detected by smatch, which produced the
      following warning:
      
        fs/btrfs/ioctl.c:843 create_snapshot() warn: '&pending_snapshot->list' not removed from list
      
      So fix this by not having the snapshot creation ioctl directly add the
      pending snapshot to the transaction's list. Instead add the pending
      snapshot to the transaction handle, and then at btrfs_commit_transaction()
      we add the snapshot to the list only when we can guarantee that any error
      returned after that point will result in a transaction abort, in which
      case the ioctl code can safely free the pending snapshot and no one can
      access it anymore.
      
      CC: stable@vger.kernel.org # 5.10+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      28b21c55
    • Su Yue's avatar
      btrfs: tree-checker: check item_size for dev_item · ea1d1ca4
      Su Yue authored
      Check item size before accessing the device item to avoid out of bound
      access, similar to inode_item check.
      Signed-off-by: default avatarSu Yue <l@damenly.su>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ea1d1ca4
    • Su Yue's avatar
      btrfs: tree-checker: check item_size for inode_item · 0c982944
      Su Yue authored
      while mounting the crafted image, out-of-bounds access happens:
      
        [350.429619] UBSAN: array-index-out-of-bounds in fs/btrfs/struct-funcs.c:161:1
        [350.429636] index 1048096 is out of range for type 'page *[16]'
        [350.429650] CPU: 0 PID: 9 Comm: kworker/u8:1 Not tainted 5.16.0-rc4 #1
        [350.429652] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
        [350.429653] Workqueue: btrfs-endio-meta btrfs_work_helper [btrfs]
        [350.429772] Call Trace:
        [350.429774]  <TASK>
        [350.429776]  dump_stack_lvl+0x47/0x5c
        [350.429780]  ubsan_epilogue+0x5/0x50
        [350.429786]  __ubsan_handle_out_of_bounds+0x66/0x70
        [350.429791]  btrfs_get_16+0xfd/0x120 [btrfs]
        [350.429832]  check_leaf+0x754/0x1a40 [btrfs]
        [350.429874]  ? filemap_read+0x34a/0x390
        [350.429878]  ? load_balance+0x175/0xfc0
        [350.429881]  validate_extent_buffer+0x244/0x310 [btrfs]
        [350.429911]  btrfs_validate_metadata_buffer+0xf8/0x100 [btrfs]
        [350.429935]  end_bio_extent_readpage+0x3af/0x850 [btrfs]
        [350.429969]  ? newidle_balance+0x259/0x480
        [350.429972]  end_workqueue_fn+0x29/0x40 [btrfs]
        [350.429995]  btrfs_work_helper+0x71/0x330 [btrfs]
        [350.430030]  ? __schedule+0x2fb/0xa40
        [350.430033]  process_one_work+0x1f6/0x400
        [350.430035]  ? process_one_work+0x400/0x400
        [350.430036]  worker_thread+0x2d/0x3d0
        [350.430037]  ? process_one_work+0x400/0x400
        [350.430038]  kthread+0x165/0x190
        [350.430041]  ? set_kthread_struct+0x40/0x40
        [350.430043]  ret_from_fork+0x1f/0x30
        [350.430047]  </TASK>
        [350.430077] BTRFS warning (device loop0): bad eb member start: ptr 0xffe20f4e start 20975616 member offset 4293005178 size 2
      
      check_leaf() is checking the leaf:
      
        corrupt leaf: root=4 block=29396992 slot=1, bad key order, prev (16140901064495857664 1 0) current (1 204 12582912)
        leaf 29396992 items 6 free space 3565 generation 6 owner DEV_TREE
        leaf 29396992 flags 0x1(WRITTEN) backref revision 1
        fs uuid a62e00e8-e94e-4200-8217-12444de93c2e
        chunk uuid cecbd0f7-9ca0-441e-ae9f-f782f9732bd8
      	  item 0 key (16140901064495857664 INODE_ITEM 0) itemoff 3955 itemsize 40
      		  generation 0 transid 0 size 0 nbytes 17592186044416
      		  block group 0 mode 52667 links 33 uid 0 gid 2104132511 rdev 94223634821136
      		  sequence 100305 flags 0x2409000a(none)
      		  atime 0.0 (1970-01-01 08:00:00)
      		  ctime 2973280098083405823.4294967295 (-269783007-01-01 21:37:03)
      		  mtime 18446744071572723616.4026825121 (1902-04-16 12:40:00)
      		  otime 9249929404488876031.4294967295 (622322949-04-16 04:25:58)
      	  item 1 key (1 DEV_EXTENT 12582912) itemoff 3907 itemsize 48
      		  dev extent chunk_tree 3
      		  chunk_objectid 256 chunk_offset 12582912 length 8388608
      		  chunk_tree_uuid cecbd0f7-9ca0-441e-ae9f-f782f9732bd8
      
      The corrupted leaf of device tree has an inode item. The leaf passed
      checksum and others checks in validate_extent_buffer until check_leaf_item().
      Because of the key type BTRFS_INODE_ITEM, check_inode_item() is called even we
      are in the device tree. Since the
      item offset + sizeof(struct btrfs_inode_item) > eb->len, out-of-bounds access
      is triggered.
      
      The item end vs leaf boundary check has been done before
      check_leaf_item(), so fix it by checking item size in check_inode_item()
      before access of the inode item in extent buffer.
      
      Other check functions except check_dev_item() in check_leaf_item()
      have their item size checks.
      The commit for check_dev_item() is followed.
      
      No regression observed during running fstests.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215299
      CC: stable@vger.kernel.org # 5.10+
      CC: Wenqing Liu <wenqingliu0120@gmail.com>
      Signed-off-by: default avatarSu Yue <l@damenly.su>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0c982944
    • Shin'ichiro Kawasaki's avatar
      btrfs: fix deadlock between quota disable and qgroup rescan worker · e804861b
      Shin'ichiro Kawasaki authored
      Quota disable ioctl starts a transaction before waiting for the qgroup
      rescan worker completes. However, this wait can be infinite and results
      in deadlock because of circular dependency among the quota disable
      ioctl, the qgroup rescan worker and the other task with transaction such
      as block group relocation task.
      
      The deadlock happens with the steps following:
      
      1) Task A calls ioctl to disable quota. It starts a transaction and
         waits for qgroup rescan worker completes.
      2) Task B such as block group relocation task starts a transaction and
         joins to the transaction that task A started. Then task B commits to
         the transaction. In this commit, task B waits for a commit by task A.
      3) Task C as the qgroup rescan worker starts its job and starts a
         transaction. In this transaction start, task C waits for completion
         of the transaction that task A started and task B committed.
      
      This deadlock was found with fstests test case btrfs/115 and a zoned
      null_blk device. The test case enables and disables quota, and the
      block group reclaim was triggered during the quota disable by chance.
      The deadlock was also observed by running quota enable and disable in
      parallel with 'btrfs balance' command on regular null_blk devices.
      
      An example report of the deadlock:
      
        [372.469894] INFO: task kworker/u16:6:103 blocked for more than 122 seconds.
        [372.479944]       Not tainted 5.16.0-rc8 #7
        [372.485067] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [372.493898] task:kworker/u16:6   state:D stack:    0 pid:  103 ppid:     2 flags:0x00004000
        [372.503285] Workqueue: btrfs-qgroup-rescan btrfs_work_helper [btrfs]
        [372.510782] Call Trace:
        [372.514092]  <TASK>
        [372.521684]  __schedule+0xb56/0x4850
        [372.530104]  ? io_schedule_timeout+0x190/0x190
        [372.538842]  ? lockdep_hardirqs_on+0x7e/0x100
        [372.547092]  ? _raw_spin_unlock_irqrestore+0x3e/0x60
        [372.555591]  schedule+0xe0/0x270
        [372.561894]  btrfs_commit_transaction+0x18bb/0x2610 [btrfs]
        [372.570506]  ? btrfs_apply_pending_changes+0x50/0x50 [btrfs]
        [372.578875]  ? free_unref_page+0x3f2/0x650
        [372.585484]  ? finish_wait+0x270/0x270
        [372.591594]  ? release_extent_buffer+0x224/0x420 [btrfs]
        [372.599264]  btrfs_qgroup_rescan_worker+0xc13/0x10c0 [btrfs]
        [372.607157]  ? lock_release+0x3a9/0x6d0
        [372.613054]  ? btrfs_qgroup_account_extent+0xda0/0xda0 [btrfs]
        [372.620960]  ? do_raw_spin_lock+0x11e/0x250
        [372.627137]  ? rwlock_bug.part.0+0x90/0x90
        [372.633215]  ? lock_is_held_type+0xe4/0x140
        [372.639404]  btrfs_work_helper+0x1ae/0xa90 [btrfs]
        [372.646268]  process_one_work+0x7e9/0x1320
        [372.652321]  ? lock_release+0x6d0/0x6d0
        [372.658081]  ? pwq_dec_nr_in_flight+0x230/0x230
        [372.664513]  ? rwlock_bug.part.0+0x90/0x90
        [372.670529]  worker_thread+0x59e/0xf90
        [372.676172]  ? process_one_work+0x1320/0x1320
        [372.682440]  kthread+0x3b9/0x490
        [372.687550]  ? _raw_spin_unlock_irq+0x24/0x50
        [372.693811]  ? set_kthread_struct+0x100/0x100
        [372.700052]  ret_from_fork+0x22/0x30
        [372.705517]  </TASK>
        [372.709747] INFO: task btrfs-transacti:2347 blocked for more than 123 seconds.
        [372.729827]       Not tainted 5.16.0-rc8 #7
        [372.745907] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [372.767106] task:btrfs-transacti state:D stack:    0 pid: 2347 ppid:     2 flags:0x00004000
        [372.787776] Call Trace:
        [372.801652]  <TASK>
        [372.812961]  __schedule+0xb56/0x4850
        [372.830011]  ? io_schedule_timeout+0x190/0x190
        [372.852547]  ? lockdep_hardirqs_on+0x7e/0x100
        [372.871761]  ? _raw_spin_unlock_irqrestore+0x3e/0x60
        [372.886792]  schedule+0xe0/0x270
        [372.901685]  wait_current_trans+0x22c/0x310 [btrfs]
        [372.919743]  ? btrfs_put_transaction+0x3d0/0x3d0 [btrfs]
        [372.938923]  ? finish_wait+0x270/0x270
        [372.959085]  ? join_transaction+0xc75/0xe30 [btrfs]
        [372.977706]  start_transaction+0x938/0x10a0 [btrfs]
        [372.997168]  transaction_kthread+0x19d/0x3c0 [btrfs]
        [373.013021]  ? btrfs_cleanup_transaction.isra.0+0xfc0/0xfc0 [btrfs]
        [373.031678]  kthread+0x3b9/0x490
        [373.047420]  ? _raw_spin_unlock_irq+0x24/0x50
        [373.064645]  ? set_kthread_struct+0x100/0x100
        [373.078571]  ret_from_fork+0x22/0x30
        [373.091197]  </TASK>
        [373.105611] INFO: task btrfs:3145 blocked for more than 123 seconds.
        [373.114147]       Not tainted 5.16.0-rc8 #7
        [373.120401] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [373.130393] task:btrfs           state:D stack:    0 pid: 3145 ppid:  3141 flags:0x00004000
        [373.140998] Call Trace:
        [373.145501]  <TASK>
        [373.149654]  __schedule+0xb56/0x4850
        [373.155306]  ? io_schedule_timeout+0x190/0x190
        [373.161965]  ? lockdep_hardirqs_on+0x7e/0x100
        [373.168469]  ? _raw_spin_unlock_irqrestore+0x3e/0x60
        [373.175468]  schedule+0xe0/0x270
        [373.180814]  wait_for_commit+0x104/0x150 [btrfs]
        [373.187643]  ? test_and_set_bit+0x20/0x20 [btrfs]
        [373.194772]  ? kmem_cache_free+0x124/0x550
        [373.201191]  ? btrfs_put_transaction+0x69/0x3d0 [btrfs]
        [373.208738]  ? finish_wait+0x270/0x270
        [373.214704]  ? __btrfs_end_transaction+0x347/0x7b0 [btrfs]
        [373.222342]  btrfs_commit_transaction+0x44d/0x2610 [btrfs]
        [373.230233]  ? join_transaction+0x255/0xe30 [btrfs]
        [373.237334]  ? btrfs_record_root_in_trans+0x4d/0x170 [btrfs]
        [373.245251]  ? btrfs_apply_pending_changes+0x50/0x50 [btrfs]
        [373.253296]  relocate_block_group+0x105/0xc20 [btrfs]
        [373.260533]  ? mutex_lock_io_nested+0x1270/0x1270
        [373.267516]  ? btrfs_wait_nocow_writers+0x85/0x180 [btrfs]
        [373.275155]  ? merge_reloc_roots+0x710/0x710 [btrfs]
        [373.283602]  ? btrfs_wait_ordered_extents+0xd30/0xd30 [btrfs]
        [373.291934]  ? kmem_cache_free+0x124/0x550
        [373.298180]  btrfs_relocate_block_group+0x35c/0x930 [btrfs]
        [373.306047]  btrfs_relocate_chunk+0x85/0x210 [btrfs]
        [373.313229]  btrfs_balance+0x12f4/0x2d20 [btrfs]
        [373.320227]  ? lock_release+0x3a9/0x6d0
        [373.326206]  ? btrfs_relocate_chunk+0x210/0x210 [btrfs]
        [373.333591]  ? lock_is_held_type+0xe4/0x140
        [373.340031]  ? rcu_read_lock_sched_held+0x3f/0x70
        [373.346910]  btrfs_ioctl_balance+0x548/0x700 [btrfs]
        [373.354207]  btrfs_ioctl+0x7f2/0x71b0 [btrfs]
        [373.360774]  ? lockdep_hardirqs_on_prepare+0x410/0x410
        [373.367957]  ? lockdep_hardirqs_on_prepare+0x410/0x410
        [373.375327]  ? btrfs_ioctl_get_supported_features+0x20/0x20 [btrfs]
        [373.383841]  ? find_held_lock+0x2c/0x110
        [373.389993]  ? lock_release+0x3a9/0x6d0
        [373.395828]  ? mntput_no_expire+0xf7/0xad0
        [373.402083]  ? lock_is_held_type+0xe4/0x140
        [373.408249]  ? vfs_fileattr_set+0x9f0/0x9f0
        [373.414486]  ? selinux_file_ioctl+0x349/0x4e0
        [373.420938]  ? trace_raw_output_lock+0xb4/0xe0
        [373.427442]  ? selinux_inode_getsecctx+0x80/0x80
        [373.434224]  ? lockdep_hardirqs_on+0x7e/0x100
        [373.440660]  ? force_qs_rnp+0x2a0/0x6b0
        [373.446534]  ? lock_is_held_type+0x9b/0x140
        [373.452763]  ? __blkcg_punt_bio_submit+0x1b0/0x1b0
        [373.459732]  ? security_file_ioctl+0x50/0x90
        [373.466089]  __x64_sys_ioctl+0x127/0x190
        [373.472022]  do_syscall_64+0x3b/0x90
        [373.477513]  entry_SYSCALL_64_after_hwframe+0x44/0xae
        [373.484823] RIP: 0033:0x7f8f4af7e2bb
        [373.490493] RSP: 002b:00007ffcbf936178 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        [373.500197] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f8f4af7e2bb
        [373.509451] RDX: 00007ffcbf936220 RSI: 00000000c4009420 RDI: 0000000000000003
        [373.518659] RBP: 00007ffcbf93774a R08: 0000000000000013 R09: 00007f8f4b02d4e0
        [373.527872] R10: 00007f8f4ae87740 R11: 0000000000000246 R12: 0000000000000001
        [373.537222] R13: 00007ffcbf936220 R14: 0000000000000000 R15: 0000000000000002
        [373.546506]  </TASK>
        [373.550878] INFO: task btrfs:3146 blocked for more than 123 seconds.
        [373.559383]       Not tainted 5.16.0-rc8 #7
        [373.565748] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [373.575748] task:btrfs           state:D stack:    0 pid: 3146 ppid:  2168 flags:0x00000000
        [373.586314] Call Trace:
        [373.590846]  <TASK>
        [373.595121]  __schedule+0xb56/0x4850
        [373.600901]  ? __lock_acquire+0x23db/0x5030
        [373.607176]  ? io_schedule_timeout+0x190/0x190
        [373.613954]  schedule+0xe0/0x270
        [373.619157]  schedule_timeout+0x168/0x220
        [373.625170]  ? usleep_range_state+0x150/0x150
        [373.631653]  ? mark_held_locks+0x9e/0xe0
        [373.637767]  ? do_raw_spin_lock+0x11e/0x250
        [373.643993]  ? lockdep_hardirqs_on_prepare+0x17b/0x410
        [373.651267]  ? _raw_spin_unlock_irq+0x24/0x50
        [373.657677]  ? lockdep_hardirqs_on+0x7e/0x100
        [373.664103]  wait_for_completion+0x163/0x250
        [373.670437]  ? bit_wait_timeout+0x160/0x160
        [373.676585]  btrfs_quota_disable+0x176/0x9a0 [btrfs]
        [373.683979]  ? btrfs_quota_enable+0x12f0/0x12f0 [btrfs]
        [373.691340]  ? down_write+0xd0/0x130
        [373.696880]  ? down_write_killable+0x150/0x150
        [373.703352]  btrfs_ioctl+0x3945/0x71b0 [btrfs]
        [373.710061]  ? find_held_lock+0x2c/0x110
        [373.716192]  ? lock_release+0x3a9/0x6d0
        [373.722047]  ? __handle_mm_fault+0x23cd/0x3050
        [373.728486]  ? btrfs_ioctl_get_supported_features+0x20/0x20 [btrfs]
        [373.737032]  ? set_pte+0x6a/0x90
        [373.742271]  ? do_raw_spin_unlock+0x55/0x1f0
        [373.748506]  ? lock_is_held_type+0xe4/0x140
        [373.754792]  ? vfs_fileattr_set+0x9f0/0x9f0
        [373.761083]  ? selinux_file_ioctl+0x349/0x4e0
        [373.767521]  ? selinux_inode_getsecctx+0x80/0x80
        [373.774247]  ? __up_read+0x182/0x6e0
        [373.780026]  ? count_memcg_events.constprop.0+0x46/0x60
        [373.787281]  ? up_write+0x460/0x460
        [373.792932]  ? security_file_ioctl+0x50/0x90
        [373.799232]  __x64_sys_ioctl+0x127/0x190
        [373.805237]  do_syscall_64+0x3b/0x90
        [373.810947]  entry_SYSCALL_64_after_hwframe+0x44/0xae
        [373.818102] RIP: 0033:0x7f1383ea02bb
        [373.823847] RSP: 002b:00007fffeb4d71f8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
        [373.833641] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1383ea02bb
        [373.842961] RDX: 00007fffeb4d7210 RSI: 00000000c0109428 RDI: 0000000000000003
        [373.852179] RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078
        [373.861408] R10: 00007f1383daec78 R11: 0000000000000202 R12: 00007fffeb4d874a
        [373.870647] R13: 0000000000493099 R14: 0000000000000001 R15: 0000000000000000
        [373.879838]  </TASK>
        [373.884018]
                     Showing all locks held in the system:
        [373.894250] 3 locks held by kworker/4:1/58:
        [373.900356] 1 lock held by khungtaskd/63:
        [373.906333]  #0: ffffffff8945ff60 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x53/0x260
        [373.917307] 3 locks held by kworker/u16:6/103:
        [373.923938]  #0: ffff888127b4f138 ((wq_completion)btrfs-qgroup-rescan){+.+.}-{0:0}, at: process_one_work+0x712/0x1320
        [373.936555]  #1: ffff88810b817dd8 ((work_completion)(&work->normal_work)){+.+.}-{0:0}, at: process_one_work+0x73f/0x1320
        [373.951109]  #2: ffff888102dd4650 (sb_internal#2){.+.+}-{0:0}, at: btrfs_qgroup_rescan_worker+0x1f6/0x10c0 [btrfs]
        [373.964027] 2 locks held by less/1803:
        [373.969982]  #0: ffff88813ed56098 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x24/0x80
        [373.981295]  #1: ffffc90000b3b2e8 (&ldata->atomic_read_lock){+.+.}-{3:3}, at: n_tty_read+0x9e2/0x1060
        [373.992969] 1 lock held by btrfs-transacti/2347:
        [373.999893]  #0: ffff88813d4887a8 (&fs_info->transaction_kthread_mutex){+.+.}-{3:3}, at: transaction_kthread+0xe3/0x3c0 [btrfs]
        [374.015872] 3 locks held by btrfs/3145:
        [374.022298]  #0: ffff888102dd4460 (sb_writers#18){.+.+}-{0:0}, at: btrfs_ioctl_balance+0xc3/0x700 [btrfs]
        [374.034456]  #1: ffff88813d48a0a0 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_balance+0xfe5/0x2d20 [btrfs]
        [374.047646]  #2: ffff88813d488838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x354/0x930 [btrfs]
        [374.063295] 4 locks held by btrfs/3146:
        [374.069647]  #0: ffff888102dd4460 (sb_writers#18){.+.+}-{0:0}, at: btrfs_ioctl+0x38b1/0x71b0 [btrfs]
        [374.081601]  #1: ffff88813d488bb8 (&fs_info->subvol_sem){+.+.}-{3:3}, at: btrfs_ioctl+0x38fd/0x71b0 [btrfs]
        [374.094283]  #2: ffff888102dd4650 (sb_internal#2){.+.+}-{0:0}, at: btrfs_quota_disable+0xc8/0x9a0 [btrfs]
        [374.106885]  #3: ffff88813d489800 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_disable+0xd5/0x9a0 [btrfs]
      
        [374.126780] =============================================
      
      To avoid the deadlock, wait for the qgroup rescan worker to complete
      before starting the transaction for the quota disable ioctl. Clear
      BTRFS_FS_QUOTA_ENABLE flag before the wait and the transaction to
      request the worker to complete. On transaction start failure, set the
      BTRFS_FS_QUOTA_ENABLE flag again. These BTRFS_FS_QUOTA_ENABLE flag
      changes can be done safely since the function btrfs_quota_disable is not
      called concurrently because of fs_info->subvol_sem.
      
      Also check the BTRFS_FS_QUOTA_ENABLE flag in qgroup_rescan_init to avoid
      another qgroup rescan worker to start after the previous qgroup worker
      completed.
      
      CC: stable@vger.kernel.org # 5.4+
      Suggested-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarShin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e804861b
    • Qu Wenruo's avatar
      btrfs: don't start transaction for scrub if the fs is mounted read-only · 2d192fc4
      Qu Wenruo authored
      [BUG]
      The following super simple script would crash btrfs at unmount time, if
      CONFIG_BTRFS_ASSERT() is set.
      
       mkfs.btrfs -f $dev
       mount $dev $mnt
       xfs_io -f -c "pwrite 0 4k" $mnt/file
       umount $mnt
       mount -r ro $dev $mnt
       btrfs scrub start -Br $mnt
       umount $mnt
      
      This will trigger the following ASSERT() introduced by commit
      0a31daa4 ("btrfs: add assertion for empty list of transactions at
      late stage of umount").
      
      That patch is definitely not the cause, it just makes enough noise for
      developers.
      
      [CAUSE]
      We will start transaction for the following call chain during scrub:
      
        scrub_enumerate_chunks()
        |- btrfs_inc_block_group_ro()
           |- btrfs_join_transaction()
      
      However for RO mount, there is no running transaction at all, thus
      btrfs_join_transaction() will start a new transaction.
      
      Furthermore, since it's read-only mount, btrfs_sync_fs() will not call
      btrfs_commit_super() to commit the new but empty transaction.
      
      And leads to the ASSERT().
      
      The bug has been there for a long time. Only the new ASSERT() makes it
      noisy enough to be noticed.
      
      [FIX]
      For read-only scrub on read-only mount, there is no need to start a
      transaction nor to allocate new chunks in btrfs_inc_block_group_ro().
      
      Just do extra read-only mount check in btrfs_inc_block_group_ro(), and
      if it's read-only, skip all chunk allocation and go inc_block_group_ro()
      directly.
      
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2d192fc4
  2. 24 Jan, 2022 3 commits
    • Filipe Manana's avatar
      btrfs: update writeback index when starting defrag · 27cdfde1
      Filipe Manana authored
      When starting a defrag, we should update the writeback index of the
      inode's mapping in case it currently has a value beyond the start of the
      range we are defragging. This can help performance and often result in
      getting less extents after writeback - for e.g., if the current value
      of the writeback index sits somewhere in the middle of a range that
      gets dirty by the defrag, then after writeback we can get two smaller
      extents instead of a single, larger extent.
      
      We used to have this before the refactoring in 5.16, but it was removed
      without any reason to do so. Originally it was added in kernel 3.1, by
      commit 2a0f7f57 ("Btrfs: fix recursive auto-defrag"), in order to
      fix a loop with autodefrag resulting in dirtying and writing pages over
      and over, but some testing on current code did not show that happening,
      at least with the test described in that commit.
      
      So add back the behaviour, as at the very least it is a nice to have
      optimization.
      
      Fixes: 7b508037 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
      CC: stable@vger.kernel.org # 5.16
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      27cdfde1
    • Filipe Manana's avatar
      btrfs: add back missing dirty page rate limiting to defrag · 3c9d31c7
      Filipe Manana authored
      A defrag operation can dirty a lot of pages, specially if operating on
      the entire file or a large file range. Any task dirtying pages should
      periodically call balance_dirty_pages_ratelimited(), as stated in that
      function's comments, otherwise they can leave too many dirty pages in
      the system. This is what we did before the refactoring in 5.16, and
      it should have remained, just like in the buffered write path and
      relocation. So restore that behaviour.
      
      Fixes: 7b508037 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3c9d31c7
    • Filipe Manana's avatar
      btrfs: fix deadlock when reserving space during defrag · 0cb5950f
      Filipe Manana authored
      When defragging we can end up collecting a range for defrag that has
      already pages under delalloc (dirty), as long as the respective extent
      map for their range is not mapped to a hole, a prealloc extent or
      the extent map is from an old generation.
      
      Most of the time that is harmless from a functional perspective at
      least, however it can result in a deadlock:
      
      1) At defrag_collect_targets() we find an extent map that meets all
         requirements but there's delalloc for the range it covers, and we add
         its range to list of ranges to defrag;
      
      2) The defrag_collect_targets() function is called at defrag_one_range(),
         after it locked a range that overlaps the range of the extent map;
      
      3) At defrag_one_range(), while the range is still locked, we call
         defrag_one_locked_target() for the range associated to the extent
         map we collected at step 1);
      
      4) Then finally at defrag_one_locked_target() we do a call to
         btrfs_delalloc_reserve_space(), which will reserve data and metadata
         space. If the space reservations can not be satisfied right away, the
         flusher might be kicked in and start flushing delalloc and wait for
         the respective ordered extents to complete. If this happens we will
         deadlock, because both flushing delalloc and finishing an ordered
         extent, requires locking the range in the inode's io tree, which was
         already locked at defrag_collect_targets().
      
      So fix this by skipping extent maps for which there's already delalloc.
      
      Fixes: eb793cf8 ("btrfs: defrag: introduce helper to collect target file extents")
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0cb5950f
  3. 19 Jan, 2022 4 commits
    • Qu Wenruo's avatar
      btrfs: defrag: properly update range->start for autodefrag · c080b414
      Qu Wenruo authored
      [BUG]
      After commit 7b508037 ("btrfs: defrag: use defrag_one_cluster() to
      implement btrfs_defrag_file()") autodefrag no longer properly re-defrag
      the file from previously finished location.
      
      [CAUSE]
      The recent refactoring of defrag only focuses on defrag ioctl subpage
      support, doesn't take autodefrag into consideration.
      
      There are two problems involved which prevents autodefrag to restart its
      scan:
      
      - No range.start update
        Previously when one defrag target is found, range->start will be
        updated to indicate where next search should start from.
      
        But now btrfs_defrag_file() doesn't update it anymore, making all
        autodefrag to rescan from file offset 0.
      
        This would also make autodefrag to mark the same range dirty again and
        again, causing extra IO.
      
      - No proper quick exit for defrag_one_cluster()
        Currently if we reached or exceed @max_sectors limit, we just exit
        defrag_one_cluster(), and let next defrag_one_cluster() call to do a
        quick exit.
        This makes @cur increase, thus no way to properly know which range is
        defragged and which range is skipped.
      
      [FIX]
      The fix involves two modifications:
      
      - Update range->start to next cluster start
        This is a little different from the old behavior.
        Previously range->start is updated to the next defrag target.
      
        But in the end, the behavior should still be pretty much the same,
        as now we skip to next defrag target inside btrfs_defrag_file().
      
        Thus if auto-defrag determines to re-scan, then we still do the skip,
        just at a different timing.
      
      - Make defrag_one_cluster() to return >0 to indicate a quick exit
        So that btrfs_defrag_file() can also do a quick exit, without
        increasing @cur to the range end, and re-use @cur to update
        @range->start.
      
      - Add comment for btrfs_defrag_file() to mention the range->start update
        Currently only autodefrag utilize this behavior, as defrag ioctl won't
        set @max_to_defrag parameter, thus unless interrupted it will always
        try to defrag the whole range.
      Reported-by: default avatarFilipe Manana <fdmanana@suse.com>
      Fixes: 7b508037 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
      Link: https://lore.kernel.org/linux-btrfs/0a269612-e43f-da22-c5bc-b34b1b56ebe8@mailbox.org/
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c080b414
    • Qu Wenruo's avatar
      btrfs: defrag: fix wrong number of defragged sectors · 484167da
      Qu Wenruo authored
      [BUG]
      There are users using autodefrag mount option reporting obvious increase
      in IO:
      
      > If I compare the write average (in total, I don't have it per process)
      > when taking idle periods on the same machine:
      >     Linux 5.16:
      >         without autodefrag: ~ 10KiB/s
      >         with autodefrag: between 1 and 2MiB/s.
      >
      >     Linux 5.15:
      >         with autodefrag:~ 10KiB/s (around the same as without
      > autodefrag on 5.16)
      
      [CAUSE]
      When autodefrag mount option is enabled, btrfs_defrag_file() will be
      called with @max_sectors = BTRFS_DEFRAG_BATCH (1024) to limit how many
      sectors we can defrag in one try.
      
      And then use the number of sectors defragged to determine if we need to
      re-defrag.
      
      But commit b18c3ab2 ("btrfs: defrag: introduce helper to defrag one
      cluster") uses wrong unit to increase @sectors_defragged, which should
      be in unit of sector, not byte.
      
      This means, if we have defragged any sector, then @sectors_defragged
      will be >= sectorsize (normally 4096), which is larger than
      BTRFS_DEFRAG_BATCH.
      
      This makes the @max_sectors check in defrag_one_cluster() to underflow,
      rendering the whole @max_sectors check useless.
      
      Thus causing way more IO for autodefrag mount options, as now there is
      no limit on how many sectors can really be defragged.
      
      [FIX]
      Fix the problems by:
      
      - Use sector as unit when increasing @sectors_defragged
      
      - Include @sectors_defragged > @max_sectors case to break the loop
      
      - Add extra comment on the return value of btrfs_defrag_file()
      Reported-by: default avatarAnthony Ruhier <aruhier@mailbox.org>
      Fixes: b18c3ab2 ("btrfs: defrag: introduce helper to defrag one cluster")
      Link: https://lore.kernel.org/linux-btrfs/0a269612-e43f-da22-c5bc-b34b1b56ebe8@mailbox.org/
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      484167da
    • Filipe Manana's avatar
      btrfs: allow defrag to be interruptible · b767c2fc
      Filipe Manana authored
      During defrag, at btrfs_defrag_file(), we have this loop that iterates
      over a file range in steps no larger than 256K subranges. If the range
      is too long, there's no way to interrupt it. So make the loop check in
      each iteration if there's signal pending, and if there is, break and
      return -AGAIN to userspace.
      
      Before kernel 5.16, we used to allow defrag to be cancelled through a
      signal, but that was lost with commit 7b508037 ("btrfs: defrag:
      use defrag_one_cluster() to implement btrfs_defrag_file()").
      
      This change adds back the possibility to cancel a defrag with a signal
      and keeps the same semantics, returning -EAGAIN to user space (and not
      the usually more expected -EINTR).
      
      This is also motivated by a recent bug on 5.16 where defragging a 1 byte
      file resulted in iterating from file range 0 to (u64)-1, as hitting the
      bug triggered a too long loop, basically requiring one to reboot the
      machine, as it was not possible to cancel defrag.
      
      Fixes: 7b508037 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b767c2fc
    • Filipe Manana's avatar
      btrfs: fix too long loop when defragging a 1 byte file · 6b34cd8e
      Filipe Manana authored
      When attempting to defrag a file with a single byte, we can end up in a
      too long loop, which is nearly infinite because at btrfs_defrag_file()
      we end up with the variable last_byte assigned with a value of
      18446744073709551615 (which is (u64)-1). The problem comes from the fact
      we end up doing:
      
          last_byte = round_up(last_byte, fs_info->sectorsize) - 1;
      
      So if last_byte was assigned 0, which is i_size - 1, we underflow and
      end up with the value 18446744073709551615.
      
      This is trivial to reproduce and the following script triggers it:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        echo -n "X" > $MNT/foobar
      
        btrfs filesystem defragment $MNT/foobar
      
        umount $MNT
      
      So fix this by not decrementing last_byte by 1 before doing the sector
      size round up. Also, to make it easier to follow, make the round up right
      after computing last_byte.
      Reported-by: default avatarAnthony Ruhier <aruhier@mailbox.org>
      Fixes: 7b508037 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
      Link: https://lore.kernel.org/linux-btrfs/0a269612-e43f-da22-c5bc-b34b1b56ebe8@mailbox.org/
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6b34cd8e
  4. 07 Jan, 2022 28 commits
    • Qu Wenruo's avatar
      btrfs: output more debug messages for uncommitted transaction · 36c86a9e
      Qu Wenruo authored
      Print extra information about how many dirty bytes an uncommitted
      has at the end of mount.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      36c86a9e
    • Filipe Manana's avatar
      btrfs: respect the max size in the header when activating swap file · c2f82263
      Filipe Manana authored
      If we extended the size of a swapfile after its header was created (by the
      mkswap utility) and then try to activate it, we will map the entire file
      when activating the swap file, instead of limiting to the max size defined
      in the swap file's header.
      
      Currently test case generic/643 from fstests fails because we do not
      respect that size limit defined in the swap file's header.
      
      So fix this by not mapping file ranges beyond the max size defined in the
      swap header.
      
      This is the same type of bug that iomap used to have, and was fixed in
      commit 36ca7943 ("mm/swap: consider max pages in
      iomap_swapfile_add_extent").
      
      Fixes: ed46ff3d ("Btrfs: support swap files")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-and-tested-by: Josef Bacik <josef@toxicpanda.com
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c2f82263
    • Yang Li's avatar
      btrfs: fix argument list that the kdoc format and script verified · be8d1a2a
      Yang Li authored
      The warnings were found by running scripts/kernel-doc, which is
      caused by using 'make W=1'.
      
      fs/btrfs/extent_io.c:3210: warning: Function parameter or member
      'bio_ctrl' not described in 'btrfs_bio_add_page'
      fs/btrfs/extent_io.c:3210: warning: Excess function parameter 'bio'
      description in 'btrfs_bio_add_page'
      fs/btrfs/extent_io.c:3210: warning: Excess function parameter
      'prev_bio_flags' description in 'btrfs_bio_add_page'
      fs/btrfs/space-info.c:1602: warning: Excess function parameter 'root'
      description in 'btrfs_reserve_metadata_bytes'
      fs/btrfs/space-info.c:1602: warning: Function parameter or member
      'fs_info' not described in 'btrfs_reserve_metadata_bytes'
      
      Note: this is fixing only the warnings regarding parameter list, the
      first line is not strictly conforming to the kdoc format as the btrfs
      codebase does not stick to that and keeps the first line more free form
      (because it's only for internal use).
      Reported-by: default avatarAbaci Robot <abaci@linux.alibaba.com>
      Signed-off-by: default avatarYang Li <yang.lee@linux.alibaba.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ add note ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      be8d1a2a
    • Su Yue's avatar
      btrfs: remove unnecessary parameter type from compression_decompress_bio · 4a9e803e
      Su Yue authored
      btrfs_decompress_bio, the only caller of compression_decompress_bio gets
      type from @cb and passes it to compression_decompress_bio.
      However, compression_decompress_bio can get compression type directly
      from @cb.
      
      So remove the parameter and access it through @cb.  No functional
      change.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarSu Yue <l@damenly.su>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4a9e803e
    • Qu Wenruo's avatar
      btrfs: selftests: dump extent io tree if extent-io-tree test failed · 856e4794
      Qu Wenruo authored
      When code modifying extent-io-tree get modified and got that selftest
      failed, it can take some time to pin down the cause.
      
      To make it easier to expose the problem, dump the extent io tree if the
      selftest failed.
      
      This can save developers debug time, especially since the selftest we
      can not use the trace events, thus have to manually add debug trace
      points.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      856e4794
    • Qu Wenruo's avatar
      btrfs: scrub: cleanup the argument list of scrub_stripe() · 2ae8ae3d
      Qu Wenruo authored
      The argument list of btrfs_stripe() has similar problems of
      scrub_chunk():
      
      - Duplicated and ambiguous @base argument
        Can be fetched from btrfs_block_group::bg.
      
      - Ambiguous argument @length
        It's again device extent length
      
      - Ambiguous argument @num
        The instinctive guess would be mirror number, but in fact it's stripe
        index.
      
      Fix it by:
      
      - Remove @base parameter
      
      - Rename @length to @dev_extent_len
      
      - Rename @num to @stripe_index
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2ae8ae3d
    • Qu Wenruo's avatar
      btrfs: scrub: cleanup the argument list of scrub_chunk() · d04fbe19
      Qu Wenruo authored
      The argument list of scrub_chunk() has the following problems:
      
      - Duplicated @chunk_offset
        It is the same as btrfs_block_group::start.
      
      - Confusing @length
        The most instinctive guess is chunk length, and one may want to delete
        it, but the truth is, it's the device extent length.
      
      Fix this by:
      
      - Remove @chunk_offset
        Use btrfs_block_group::start instead.
      
      - Rename @length to @dev_extent_len
        Also rename the caller to remove the ambiguous naming.
      
      - Rename @cache to @bg
        The "_cache" suffix for btrfs_block_group has been removed for a while.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d04fbe19
    • Qu Wenruo's avatar
      btrfs: remove reada infrastructure · f26c9238
      Qu Wenruo authored
      Currently there is only one user for btrfs metadata readahead, and
      that's scrub.
      
      But even for the single user, it's not providing the correct
      functionality it needs, as scrub needs reada for commit root, which
      current readahead can't provide. (Although it's pretty easy to add such
      feature).
      
      Despite this, there are some extra problems related to metadata
      readahead:
      
      - Duplicated feature with btrfs_path::reada
      
      - Partly duplicated feature of btrfs_fs_info::buffer_radix
        Btrfs already caches its metadata in buffer_radix, while readahead
        tries to read the tree block no matter if it's already cached.
      
      - Poor layer separation
        Metadata readahead works kinda at device level.
        This is definitely not the correct layer it should be, since metadata
        is at btrfs logical address space, it should not bother device at all.
      
        This brings extra chance for bugs to sneak in, while brings
        unnecessary complexity.
      
      - Dead code
        In the very beginning of scrub.c we have #undef DEBUG, rendering all
        the debug related code useless and unable to test.
      
      Thus here I purpose to remove the metadata readahead mechanism
      completely.
      
      [BENCHMARK]
      There is a full benchmark for the scrub performance difference using the
      old btrfs_reada_add() and btrfs_path::reada.
      
      For the worst case (no dirty metadata, slow HDD), there could be a 5%
      performance drop for scrub.
      For other cases (even SATA SSD), there is no distinguishable performance
      difference.
      
      The number is reported scrub speed, in MiB/s.
      The resolution is limited by the reported duration, which only has a
      resolution of 1 second.
      
      	Old		New		Diff
      SSD	455.3		466.332		+2.42%
      HDD	103.927 	98.012		-5.69%
      
      Comprehensive test methodology is in the cover letter of the patch.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f26c9238
    • Qu Wenruo's avatar
      btrfs: scrub: use btrfs_path::reada for extent tree readahead · dcf62b20
      Qu Wenruo authored
      For scrub, we trigger two readaheads for two trees, extent tree to get
      where to scrub, and csum tree to get the data checksum.
      
      For csum tree we already trigger readahead in
      btrfs_lookup_csums_range(), by setting path->reada.
      But for extent tree we don't have any path based readahead.
      
      Add the readahead for extent tree as well, so we can later remove the
      btrfs_reada_add() based readahead.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dcf62b20
    • Qu Wenruo's avatar
      btrfs: scrub: remove the unnecessary path parameter for scrub_raid56_parity() · 2522dbe8
      Qu Wenruo authored
      In function scrub_stripe() we allocated two btrfs_path's, one @path for
      extent tree search and another @ppath for full stripe extent tree search
      for RAID56.
      
      This is totally umncessary, as the @ppath usage is completely inside
      scrub_raid56_parity(), thus we can move the path allocation into
      scrub_raid56_parity() completely.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2522dbe8
    • Nikolay Borisov's avatar
      btrfs: refactor unlock_up · c1227996
      Nikolay Borisov authored
      The purpose of this function is to unlock all nodes in a btrfs path
      which are above 'lowest_unlock' and whose slot used is different than 0.
      As such it used slightly awkward structure of 'if' as well as somewhat
      cryptic "no_skip" control variable which denotes whether we should
      check the current level of skipability or no.
      
      This patch does the following (cosmetic) refactorings:
      
      * Renames 'no_skip' to 'check_skip' and makes it a boolean. This
        variable controls whether we are below the lowest_unlock/skip_level
        levels.
      
      * Consolidates the 2 conditions which warrant checking whether the
        current level should be skipped under 1 common if (check_skip) branch,
        this increase indentation level but is not critical.
      
      * Consolidates the 'skip_level < i && i >= lowest_unlock' and
        'i >= lowest_unlock && i > skip_level' condition into a common branch
        since those are identical.
      
      * Eliminates the local extent_buffer variable as in this case it doesn't
        bring anything to function readability.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c1227996
    • Filipe Manana's avatar
      btrfs: skip transaction commit after failure to create subvolume · 1b58ae0e
      Filipe Manana authored
      At ioctl.c:create_subvol(), when we fail to create a subvolume we always
      commit the transaction. In most cases this is a no-op, since all the error
      paths, except for one, abort the transaction - the only exception is when
      we fail to insert the new root item into the root tree, in that case we
      don't abort the transaction because we didn't do anything that is
      irreversible - however we end up committing the transaction which although
      is not a functional problem, it adds unnecessary rotation of the backup
      roots in the superblock and unnecessary work.
      
      So change that to commit a transaction only when no error happened,
      otherwise just call btrfs_end_transaction() to release our reference on
      the transaction.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1b58ae0e
    • Naohiro Aota's avatar
      btrfs: zoned: fix chunk allocation condition for zoned allocator · 82187d2e
      Naohiro Aota authored
      The ZNS specification defines a limit on the number of "active"
      zones. That limit impose us to limit the number of block groups which
      can be used for an allocation at the same time. Not to exceed the
      limit, we reuse the existing active block groups as much as possible
      when we can't activate any other zones without sacrificing an already
      activated block group in commit a85f05e5 ("btrfs: zoned: avoid
      chunk allocation if active block group has enough space").
      
      However, the check is wrong in two ways. First, it checks the
      condition for every raid index (ffe_ctl->index). Even if it reaches
      the condition and "ffe_ctl->max_extent_size >=
      ffe_ctl->min_alloc_size" is met, there can be other block groups
      having enough space to hold ffe_ctl->num_bytes. (Actually, this won't
      happen in the current zoned code as it only supports SINGLE
      profile. But, it can happen once it enables other RAID types.)
      
      Second, it checks the active zone availability depending on the
      raid index. The raid index is just an index for
      space_info->block_groups, so it has nothing to do with chunk allocation.
      
      These mistakes are causing a faulty allocation in a certain
      situation. Consider we are running zoned btrfs on a device whose
      max_active_zone == 0 (no limit). And, suppose no block group have a
      room to fit ffe_ctl->num_bytes but some room to meet
      ffe_ctl->min_alloc_size (i.e. max_extent_size > num_bytes >=
      min_alloc_size).
      
      In this situation, the following occur:
      
      - With SINGLE raid_index, it reaches the chunk allocation checking
        code
      - The check returns true because we can activate a new zone (no limit)
      - But, before allocating the chunk, it iterates to the next raid index
        (RAID5)
      - Since there are no RAID5 block groups on zoned mode, it again
        reaches the check code
      - The check returns false because of btrfs_can_activate_zone()'s "if
        (raid_index != BTRFS_RAID_SINGLE)" part
      - That results in returning -ENOSPC without allocating a new chunk
      
      As a result, we end up hitting -ENOSPC too early.
      
      Move the check to the right place in the can_allocate_chunk() hook,
      and do the active zone check depending on the allocation flag, not on
      the raid index.
      
      CC: stable@vger.kernel.org # 5.16
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      82187d2e
    • Naohiro Aota's avatar
      btrfs: add extent allocator hook to decide to allocate chunk or not · 50475cd5
      Naohiro Aota authored
      Introduce a new hook for an extent allocator policy. With the new
      hook, a policy can decide to allocate a new block group or not. If
      not, it will return -ENOSPC, so btrfs_reserve_extent() will cut the
      allocation size in half and retry the allocation if min_alloc_size is
      large enough.
      
      The hook has a place holder and will be replaced with the real
      implementation in the next patch.
      
      CC: stable@vger.kernel.org # 5.16
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      50475cd5
    • Naohiro Aota's avatar
      btrfs: zoned: unset dedicated block group on allocation failure · 1ada69f6
      Naohiro Aota authored
      Allocating an extent from a block group can fail for various reasons.
      When an allocation from a dedicated block group (for tree-log or
      relocation data) fails, we need to unregister it as a dedicated one so
      that we can allocate a new block group for the dedicated one.
      
      However, we are returning early when the block group in case it is
      read-only, fully used, or not be able to activate the zone. As a result,
      we keep the non-usable block group as a dedicated one, leading to
      further allocation failure. With many block groups, the allocator will
      iterate hopeless loop to find a free extent, results in a hung task.
      
      Fix the issue by delaying the return and doing the proper cleanups.
      
      CC: stable@vger.kernel.org # 5.16
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1ada69f6
    • Johannes Thumshirn's avatar
      btrfs: zoned: drop redundant check for REQ_OP_ZONE_APPEND and btrfs_is_zoned · 73672710
      Johannes Thumshirn authored
      REQ_OP_ZONE_APPEND can only work on zoned devices, so it is redundant to
      check if the filesystem is zoned when REQ_OP_ZONE_APPEND is set as the
      bio's bio_op.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      73672710
    • Johannes Thumshirn's avatar
      btrfs: zoned: sink zone check into btrfs_repair_one_zone · 554aed7d
      Johannes Thumshirn authored
      Sink zone check into btrfs_repair_one_zone() so we don't need to do it
      in all callers.
      
      Also as btrfs_repair_one_zone() doesn't return a sensible error, make it
      a boolean function and return false in case it got called on a non-zoned
      filesystem and true on a zoned filesystem.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      554aed7d
    • Johannes Thumshirn's avatar
      btrfs: zoned: simplify btrfs_check_meta_write_pointer · 8fdf54fe
      Johannes Thumshirn authored
      btrfs_check_meta_write_pointer() will always be called with a NULL
      'cache_ret' argument.
      
      As there's no need to check if we have a valid block_group passed in
      remove these checks.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8fdf54fe
    • Johannes Thumshirn's avatar
      btrfs: zoned: encapsulate inode locking for zoned relocation · 869f4cdc
      Johannes Thumshirn authored
      Encapsulate the inode lock needed for serializing the data relocation
      writes on a zoned filesystem into a helper.
      
      This streamlines the code reading flow and hides special casing for
      zoned filesystems.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      869f4cdc
    • Anand Jain's avatar
      btrfs: sysfs: add devinfo/fsid to retrieve actual fsid from the device · a26d60de
      Anand Jain authored
      In the case of the seed device, the fsid can be different from the mounted
      sprout fsid.  The userland has to read the device superblock to know the
      fsid but, that idea fails if the device is missing. So add a sysfs
      interface devinfo/<devid>/fsid to show the fsid of the device.
      
      For example:
        $ cd /sys/fs/btrfs/b10b02a5-f9de-4276-b9e8-2bfd09a578a8
      
        $ cat devinfo/1/fsid
        c44d771f-639d-4df3-99ec-5bc7ad2af93b
        $ cat  devinfo/3/fsid
        b10b02a5-f9de-4276-b9e8-2bfd09a578a8
      
      Though it's related to seeding, the name of the sysfs file is plain fsid as it
      matches what blkid says.  A path to the device's fsid will aid scripting.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a26d60de
    • Josef Bacik's avatar
      btrfs: reserve extra space for the free space tree · c18e3235
      Josef Bacik authored
      Filipe reported a problem where sometimes he'd get an ENOSPC abort when
      running delayed refs with generic/619 and the free space tree enabled.
      This is partly because we do not reserve space for modifying the free
      space tree, nor do we have a block rsv associated with that tree.
      
      The delayed_refs_rsv tracks the amount of space required to run delayed
      refs.  This means 1 modification means 1 change to the extent root.
      With the free space tree this turns into 2 changes, because modifying 1
      extent means updating the extent tree and potentially updating the free
      space tree to either remove that entry or add the free space.  Thus if
      we have the FST enabled, simply double the reservation size for our
      modification.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c18e3235
    • Josef Bacik's avatar
      btrfs: include the free space tree in the global rsv minimum calculation · 9506f953
      Josef Bacik authored
      Filipe reported a problem where generic/619 was failing with an ENOSPC
      abort while running delayed refs, like the following
      
        BTRFS: Transaction aborted (error -28)
        WARNING: CPU: 3 PID: 522920 at fs/btrfs/free-space-tree.c:1049 add_to_free_space_tree+0xe5/0x110 [btrfs]
        CPU: 3 PID: 522920 Comm: kworker/u16:19 Tainted: G        W         5.16.0-rc2-btrfs-next-106 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
        Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
        RIP: 0010:add_to_free_space_tree+0xe5/0x110 [btrfs]
        RSP: 0000:ffffa65087fb7b20 EFLAGS: 00010282
        RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: ffffffff9131eeaa RDI: 00000000ffffffff
        RBP: ffff8d62e26481b8 R08: ffffffff9ad97ce0 R09: 0000000000000001
        R10: 0000000000000000 R11: 0000000000000001 R12: 00000000ffffffe4
        R13: ffff8d61c25fe688 R14: ffff8d61ebd88800 R15: ffff8d61ebd88a90
        FS:  0000000000000000(0000) GS:ffff8d64ed400000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fa46a8b1000 CR3: 0000000148d18003 CR4: 0000000000370ee0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         <TASK>
         __btrfs_free_extent+0x516/0x950 [btrfs]
         __btrfs_run_delayed_refs+0x2b1/0x1250 [btrfs]
         btrfs_run_delayed_refs+0x86/0x210 [btrfs]
         flush_space+0x403/0x630 [btrfs]
         ? call_rcu_tasks_generic+0x50/0x80
         ? lock_release+0x223/0x4a0
         ? btrfs_get_alloc_profile+0xb5/0x290 [btrfs]
         ? do_raw_spin_unlock+0x4b/0xa0
         btrfs_async_reclaim_metadata_space+0x139/0x320 [btrfs]
         process_one_work+0x24c/0x5b0
         worker_thread+0x55/0x3c0
         ? process_one_work+0x5b0/0x5b0
         kthread+0x17c/0x1a0
         ? set_kthread_struct+0x40/0x40
         ret_from_fork+0x22/0x30
      
      There's a couple of reasons for this, but in generic/619's case the
      largest reason is because it is a very small file system, ad we do not
      reserve enough space for the global reserve.
      
      With the free space tree we now have the free space tree that we need to
      modify when running delayed refs.  This means we need the global reserve
      to take this into account when it calculates the minimum size it needs
      to be.  This is especially important for very small file systems.
      
      Fix this by adjusting the minimum global block rsv size math to include
      the size of the free space tree when calculating the size.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9506f953
    • Qu Wenruo's avatar
      btrfs: scrub: merge SCRUB_PAGES_PER_RD_BIO and SCRUB_PAGES_PER_WR_BIO · c9d328c0
      Qu Wenruo authored
      These two values were introduced in commit ff023aac ("Btrfs: add code
      to scrub to copy read data to another disk") as an optimization.
      
      But the truth is, block layer scheduler can do whatever it wants to
      merge/split bios to improve performance.
      
      Doing such "optimization" is not really going to affect much, especially
      considering how good current block layer optimizations are doing.
      Remove such old and immature optimization from our code.
      
      Since we're here, also change BUG_ON()s using these two macros to use
      ASSERT()s.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c9d328c0
    • Qu Wenruo's avatar
      btrfs: update SCRUB_MAX_PAGES_PER_BLOCK · 0bb3acdc
      Qu Wenruo authored
      Use BTRFS_MAX_METADATA_BLOCKSIZE and SZ_4K (minimal sectorsize) to
      calculate this value.
      
      And remove one stale comment on the value, in fact with recent subpage
      support, BTRFS_MAX_METADATA_BLOCKSIZE * PAGE_SIZE is already beyond
      BTRFS_STRIPE_LEN, just we don't use the full page.
      
      Also since we're here, update the BUG_ON() related to
      SCRUB_MAX_PAGES_PER_BLOCK to ASSERT().
      
      As those ASSERT() are really only for developers to catch early obvious
      bugs, not to let end users suffer.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0bb3acdc
    • Josef Bacik's avatar
      btrfs: do not check -EAGAIN when truncating inodes in the log root · 8697b8f8
      Josef Bacik authored
      We only throttle the btrfs_truncate_inode_items if the root is
      SHAREABLE, which isn't set on the log root, which means this loop is
      unnecessary.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8697b8f8
    • Josef Bacik's avatar
      btrfs: make should_throttle loop local in btrfs_truncate_inode_items · e48dac7f
      Josef Bacik authored
      We reset this bool on every loop through the truncate loop, make this
      variable local to the loop.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e48dac7f
    • Josef Bacik's avatar
      btrfs: combine extra if statements in btrfs_truncate_inode_items · 0adbc619
      Josef Bacik authored
      We have
      
          if (del_item)
      	    // do something
          else
      	    // something else
          if (del_item)
      	    // do yet another thing
          else
      	    // something else entirely
      
      back to back in btrfs_truncate_inode_items, collapse these two sets of
      if statements into one.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0adbc619
    • Josef Bacik's avatar
      btrfs: convert BUG() for pending_del_nr into an ASSERT · 376b91d5
      Josef Bacik authored
      This is a logic correctness check, convert it into an ASSERT() instead
      of a BUG().
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      376b91d5