1. 02 May, 2019 2 commits
    • Nikolay Borisov's avatar
      btrfs: Preallocate chunks in cow_file_range_async · 97db1204
      Nikolay Borisov authored
      This commit changes the implementation of cow_file_range_async in order
      to get rid of the BUG_ON in the middle of the loop. Additionally it
      reworks the inner loop in the hopes of making it more understandable.
      
      The idea is to make async_cow be a top-level structured, shared amongst
      all chunks being sent for compression. This allows to perform one memory
      allocation at the beginning and gracefully fail the IO if there isn't
      enough memory. Now, each chunk is going to be described by an
      async_chunk struct. It's the responsibility of the final chunk
      to actually free the memory.
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      97db1204
    • Josef Bacik's avatar
      btrfs: reserve delalloc metadata differently · c8eaeac7
      Josef Bacik authored
      With the per-inode block reserves we started refilling the reserve based
      on the calculated size of the outstanding csum bytes and extents for the
      inode, including the amount we were adding with the new operation.
      
      However, generic/224 exposed a problem with this approach.  With 1000
      files all writing at the same time we ended up with a bunch of bytes
      being reserved but unusable.
      
      When you write to a file we reserve space for the csum leaves for those
      bytes, the number of extent items required to cover those bytes, and a
      single transaction item for updating the inode at ordered extent finish
      for that range of bytes.  This is held until the ordered extent finishes
      and we release all of the reserved space.
      
      If a second write comes in at this point we would add a single
      reservation for the new outstanding extent and however many reservations
      for the csum leaves.  At this point we find the delta of how much we
      have reserved and how much outstanding size this is and attempt to
      reserve this delta.  If the first write finishes it will not release any
      space, because the space it had reserved for the initial write is still
      needed for the second write.  However some space would have been used,
      as we have added csums, extent items, and dirtied the inode.  Our
      reserved space would be > 0 but less than the total needed reserved
      space.
      
      This is just for a single inode, now consider generic/224.  This has
      1000 inodes writing in parallel to a very small file system, 1GiB.  In
      my testing this usually means we get about a 120MiB metadata area to
      work with, more than enough to allow the writes to continue, but not
      enough if all of the inodes are stuck trying to reserve the slack space
      while continuing to hold their leftovers from their initial writes.
      
      Fix this by pre-reserved _only_ for the space we are currently trying to
      add.  Then once that is successful modify our inodes csum count and
      outstanding extents, and then add the newly reserved space to the inodes
      block_rsv.  This allows us to actually pass generic/224 without running
      out of metadata space.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c8eaeac7
  2. 29 Apr, 2019 38 commits
    • Josef Bacik's avatar
      btrfs: track DIO bytes in flight · 4297ff84
      Josef Bacik authored
      When diagnosing a slowdown of generic/224 I noticed we were not doing
      anything when calling into shrink_delalloc().  This is because all
      writes in 224 are O_DIRECT, not delalloc, and thus our delalloc_bytes
      counter is 0, which short circuits most of the work inside of
      shrink_delalloc().  However O_DIRECT writes still consume metadata
      resources and generate ordered extents, which we can still wait on.
      
      Fix this by tracking outstanding DIO write bytes, and use this as well
      as the delalloc bytes counter to decide if we need to lookup and wait on
      any ordered extents.  If we have more DIO writes than delalloc bytes
      we'll go ahead and wait on any ordered extents regardless of our flush
      state as flushing delalloc is likely to not gain us anything.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      [ use dio instead of odirect in identifiers ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4297ff84
    • Anand Jain's avatar
      btrfs: merge calls of btrfs_setxattr and btrfs_setxattr_trans in btrfs_set_prop · da9b6ec8
      Anand Jain authored
      Since now the trans argument is never NULL in btrfs_set_prop we don't
      have to check. So delete it and use btrfs_setxattr that makes use of
      that.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      da9b6ec8
    • Anand Jain's avatar
      btrfs: delete unused function btrfs_set_prop_trans · 717ebdc3
      Anand Jain authored
      The last consumer of btrfs_set_prop_trans() was taken away by the patch
      ("btrfs: start transaction in xattr_handler_set_prop") so now this
      function can be deleted.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      717ebdc3
    • Anand Jain's avatar
      btrfs: start transaction in xattr_handler_set_prop · b3f6a4be
      Anand Jain authored
      btrfs specific extended attributes on the inode are set using
      btrfs_xattr_handler_set_prop(), and the required transaction for this
      update is started by btrfs_setxattr(). For better visibility of the
      transaction start and end, do this in btrfs_xattr_handler_set_prop().
      For which this patch copied code of btrfs_setxattr() as it is in the
      original, which needs proper error handling.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b3f6a4be
    • Anand Jain's avatar
      btrfs: drop local copy of inode i_mode · 44e5194b
      Anand Jain authored
      There isn't real use of making struct inode::i_mode a local copy, it
      saves a dereference one time, not much. Just use it directly.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      44e5194b
    • Anand Jain's avatar
      btrfs: drop old_fsflags in btrfs_ioctl_setflags · 3c8d8b63
      Anand Jain authored
      btrfs_inode_flags_to_fsflags() is copied into @old_fsflags and used only
      once. Instead used it directly.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3c8d8b63
    • Anand Jain's avatar
      btrfs: modify local copy of btrfs_inode flags · d2b8fcfe
      Anand Jain authored
      Instead of updating the binode::flags directly, update a local copy, and
      then at the point of no error, store copy it to the binode::flags.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d2b8fcfe
    • Anand Jain's avatar
      btrfs: drop useless inode i_flags copy and restore · 11d3cd5c
      Anand Jain authored
      The patch ("btrfs: start transaction in btrfs_ioctl_setflags()") used
      btrfs_set_prop() instead of btrfs_set_prop_trans() by which now the
      inode::i_flags update functions such as
      btrfs_sync_inode_flags_to_i_flags() and btrfs_update_inode() is called
      in btrfs_ioctl_setflags() instead of
      btrfs_set_prop_trans()->btrfs_setxattr() as earlier. So the
      inode::i_flags remains unmodified until the thread has checked all the
      conditions. So drop the saved inode::i_flags in out_i_flags.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      11d3cd5c
    • Anand Jain's avatar
      btrfs: start transaction in btrfs_ioctl_setflags() · ff9fef55
      Anand Jain authored
      Inode attribute can be set through the FS_IOC_SETFLAGS ioctl.  This
      flags also includes compression attribute for which we would set/reset
      the compression extended attribute. While doing this there is a bit of
      duplicate code, the following things happens twice:
      
      - start/end_transaction
      - inode_inc_iversion()
      - current_time update to inode->i_ctime
      - and btrfs_update_inode()
      
      These are updated both at btrfs_ioctl_setflags() and btrfs_set_props()
      as well.  This patch merges these two duplicate codes at
      btrfs_ioctl_setflags().
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ff9fef55
    • Anand Jain's avatar
      btrfs: export btrfs_set_prop · cd31af15
      Anand Jain authored
      Make btrfs_set_prop() a non-static function, so that it can be called
      from btrfs_ioctl_setflags(). We need btrfs_set_prop() instead of
      btrfs_set_prop_trans() so that we can use the transaction which is
      already started in the current thread.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cd31af15
    • Anand Jain's avatar
      btrfs: refactor btrfs_set_props to validate externally · f22125e5
      Anand Jain authored
      In preparation to merge multiple transactions when setting the
      compression flags, split btrfs_set_props() validation part outside of
      it.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f22125e5
    • Qu Wenruo's avatar
      btrfs: ctree: Dump the leaf before BUG_ON in btrfs_set_item_key_safe · 7c15d410
      Qu Wenruo authored
      We have a long standing problem with reversed keys that's detected by
      btrfs_set_item_key_safe. This is hard to reproduce so we'd like to
      capture more information for later analysis.
      
      Let's dump the leaf content before triggering BUG_ON() so that we can
      have some clue on what's going wrong.  The output of tree locks should
      help us to debug such problem.
      
      Sample stacktrace:
      
       generic/522             [00:07:05]
       [26946.113381] run fstests generic/522 at 2019-04-16 00:07:05
       [27161.474720] kernel BUG at fs/btrfs/ctree.c:3192!
       [27161.475923] invalid opcode: 0000 [#1] PREEMPT SMP
       [27161.477167] CPU: 0 PID: 15676 Comm: fsx Tainted: G        W         5.1.0-rc5-default+ #562
       [27161.478932] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c89-prebuilt.qemu.org 04/01/2014
       [27161.481099] RIP: 0010:btrfs_set_item_key_safe+0x146/0x1c0 [btrfs]
       [27161.485369] RSP: 0018:ffffb087499e39b0 EFLAGS: 00010286
       [27161.486464] RAX: 00000000ffffffff RBX: ffff941534d80e70 RCX: 0000000000024000
       [27161.487929] RDX: 0000000000013039 RSI: ffffb087499e3aa5 RDI: ffffb087499e39c7
       [27161.489289] RBP: 000000000000000e R08: ffff9414e0f49008 R09: 0000000000001000
       [27161.490807] R10: 0000000000000000 R11: 0000000000000003 R12: ffff9414e0f48e70
       [27161.492305] R13: ffffb087499e3aa5 R14: 0000000000000000 R15: 0000000000071000
       [27161.493845] FS:  00007f8ea58d0b80(0000) GS:ffff94153d400000(0000) knlGS:0000000000000000
       [27161.495608] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [27161.496717] CR2: 00007f8ea57a9000 CR3: 0000000016a33000 CR4: 00000000000006f0
       [27161.498100] Call Trace:
       [27161.498771]  __btrfs_drop_extents+0x6ec/0xdf0 [btrfs]
       [27161.499872]  btrfs_log_changed_extents.isra.26+0x3a2/0x9e0 [btrfs]
       [27161.501114]  btrfs_log_inode+0x7ff/0xdc0 [btrfs]
       [27161.502114]  ? __mutex_unlock_slowpath+0x4b/0x2b0
       [27161.503172]  btrfs_log_inode_parent+0x237/0x9c0 [btrfs]
       [27161.504348]  btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
       [27161.505374]  btrfs_sync_file+0x1b7/0x480 [btrfs]
       [27161.506371]  __x64_sys_msync+0x180/0x210
       [27161.507208]  do_syscall_64+0x54/0x180
       [27161.507932]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
       [27161.508839] RIP: 0033:0x7f8ea5aa9c61
       [27161.512616] RSP: 002b:00007ffea2a06498 EFLAGS: 00000246 ORIG_RAX: 000000000000001a
       [27161.514161] RAX: ffffffffffffffda RBX: 000000000002a938 RCX: 00007f8ea5aa9c61
       [27161.515376] RDX: 0000000000000004 RSI: 000000000001c9b2 RDI: 00007f8ea578d000
       [27161.516572] RBP: 000000000001c07a R08: fffffffffffffff8 R09: 000000000002a000
       [27161.517883] R10: 00007f8ea57a99b2 R11: 0000000000000246 R12: 0000000000000938
       [27161.519080] R13: 00007f8ea578d000 R14: 000000000001c9b2 R15: 0000000000000000
       [27161.520281] Modules linked in: btrfs libcrc32c xor zstd_decompress zstd_compress xxhash raid6_pq loop [last unloaded: scsi_debug]
       [27161.522272] ---[ end trace d5afec7ccac6a252 ]---
       [27161.523111] RIP: 0010:btrfs_set_item_key_safe+0x146/0x1c0 [btrfs]
       [27161.527253] RSP: 0018:ffffb087499e39b0 EFLAGS: 00010286
       [27161.528192] RAX: 00000000ffffffff RBX: ffff941534d80e70 RCX: 0000000000024000
       [27161.529392] RDX: 0000000000013039 RSI: ffffb087499e3aa5 RDI: ffffb087499e39c7
       [27161.530607] RBP: 000000000000000e R08: ffff9414e0f49008 R09: 0000000000001000
       [27161.531802] R10: 0000000000000000 R11: 0000000000000003 R12: ffff9414e0f48e70
       [27161.533018] R13: ffffb087499e3aa5 R14: 0000000000000000 R15: 0000000000071000
       [27161.534405] FS:  00007f8ea58d0b80(0000) GS:ffff94153d400000(0000) knlGS:0000000000000000
       [27161.536048] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [27161.537210] CR2: 00007f8ea57a9000 CR3: 0000000016a33000 CR4: 00000000000006f0
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7c15d410
    • Qu Wenruo's avatar
      btrfs: tree-checker: Allow error injection for tree-checker · 02529d7a
      Qu Wenruo authored
      Allowing error injection for btrfs_check_leaf_full() and
      btrfs_check_node() is useful to test the failure path of btrfs write
      time tree check.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      02529d7a
    • Nikolay Borisov's avatar
      btrfs: Document btrfs_csum_one_bio · 51d470ae
      Nikolay Borisov authored
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      51d470ae
    • Filipe Manana's avatar
      Btrfs: improve performance on fsync of files with multiple hardlinks · b8aa330d
      Filipe Manana authored
      Commit 41bd6067 ("Btrfs: fix fsync of files with multiple hard links
      in new directories") introduced a path that makes fsync fallback to a full
      transaction commit in order to avoid losing hard links and new ancestors
      of the fsynced inode. That path is triggered only when the inode has more
      than one hard link and either has a new hard link created in the current
      transaction or the inode was evicted and reloaded in the current
      transaction.
      
      That path ends up getting triggered very often (hundreds of times) during
      the course of pgbench benchmarks, resulting in performance drops of about
      20%.
      
      This change restores the performance by not triggering the full transaction
      commit in those cases, and instead iterate the fs/subvolume tree in search
      of all possible new ancestors, for all hard links, to log them.
      Reported-by: default avatarZhao Yuhu <zyuhu@suse.com>
      Tested-by: default avatarJames Wang <jnwang@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b8aa330d
    • Filipe Manana's avatar
      Btrfs: fix race between send and deduplication that lead to failures and crashes · 62d54f3a
      Filipe Manana authored
      Send operates on read only trees and expects them to never change while it
      is using them. This is part of its initial design, and this expection is
      due to two different reasons:
      
      1) When it was introduced, no operations were allowed to modifiy read-only
         subvolumes/snapshots (including defrag for example).
      
      2) It keeps send from having an impact on other filesystem operations.
         Namely send does not need to keep locks on the trees nor needs to hold on
         to transaction handles and delay transaction commits. This ends up being
         a consequence of the former reason.
      
      However the deduplication feature was introduced later (on September 2013,
      while send was introduced in July 2012) and it allowed for deduplication
      with destination files that belong to read-only trees (subvolumes and
      snapshots).
      
      That means that having a send operation (either full or incremental) running
      in parallel with a deduplication that has the destination inode in one of
      the trees used by the send operation, can result in tree nodes and leaves
      getting freed and reused while send is using them. This problem is similar
      to the problem solved for the root nodes getting freed and reused when a
      snapshot is made against one tree that is currenly being used by a send
      operation, fixed in commits [1] and [2]. These commits explain in detail
      how the problem happens and the explanation is valid for any node or leaf
      that is not the root of a tree as well. This problem was also discussed
      and explained recently in a thread [3].
      
      The problem is very easy to reproduce when using send with large trees
      (snapshots) and just a few concurrent deduplication operations that target
      files in the trees used by send. A stress test case is being sent for
      fstests that triggers the issue easily. The most common error to hit is
      the send ioctl return -EIO with the following messages in dmesg/syslog:
      
       [1631617.204075] BTRFS error (device sdc): did not find backref in send_root. inode=63292, offset=0, disk_byte=5228134400 found extent=5228134400
       [1631633.251754] BTRFS error (device sdc): parent transid verify failed on 32243712 wanted 24 found 27
      
      The first one is very easy to hit while the second one happens much less
      frequently, except for very large trees (in that test case, snapshots
      with 100000 files having large xattrs to get deep and wide trees).
      Less frequently, at least one BUG_ON can be hit:
      
       [1631742.130080] ------------[ cut here ]------------
       [1631742.130625] kernel BUG at fs/btrfs/ctree.c:1806!
       [1631742.131188] invalid opcode: 0000 [#6] SMP DEBUG_PAGEALLOC PTI
       [1631742.131726] CPU: 1 PID: 13394 Comm: btrfs Tainted: G    B D W         5.0.0-rc8-btrfs-next-45 #1
       [1631742.132265] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
       [1631742.133399] RIP: 0010:read_node_slot+0x122/0x130 [btrfs]
       (...)
       [1631742.135061] RSP: 0018:ffffb530021ebaa0 EFLAGS: 00010246
       [1631742.135615] RAX: ffff93ac8912e000 RBX: 000000000000009d RCX: 0000000000000002
       [1631742.136173] RDX: 000000000000009d RSI: ffff93ac564b0d08 RDI: ffff93ad5b48c000
       [1631742.136759] RBP: ffffb530021ebb7d R08: 0000000000000001 R09: ffffb530021ebb7d
       [1631742.137324] R10: ffffb530021eba70 R11: 0000000000000000 R12: ffff93ac87d0a708
       [1631742.137900] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
       [1631742.138455] FS:  00007f4cdb1528c0(0000) GS:ffff93ad76a80000(0000) knlGS:0000000000000000
       [1631742.139010] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [1631742.139568] CR2: 00007f5acb3d0420 CR3: 000000012be3e006 CR4: 00000000003606e0
       [1631742.140131] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       [1631742.140719] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       [1631742.141272] Call Trace:
       [1631742.141826]  ? do_raw_spin_unlock+0x49/0xc0
       [1631742.142390]  tree_advance+0x173/0x1d0 [btrfs]
       [1631742.142948]  btrfs_compare_trees+0x268/0x690 [btrfs]
       [1631742.143533]  ? process_extent+0x1070/0x1070 [btrfs]
       [1631742.144088]  btrfs_ioctl_send+0x1037/0x1270 [btrfs]
       [1631742.144645]  _btrfs_ioctl_send+0x80/0x110 [btrfs]
       [1631742.145161]  ? trace_sched_stick_numa+0xe0/0xe0
       [1631742.145685]  btrfs_ioctl+0x13fe/0x3120 [btrfs]
       [1631742.146179]  ? account_entity_enqueue+0xd3/0x100
       [1631742.146662]  ? reweight_entity+0x154/0x1a0
       [1631742.147135]  ? update_curr+0x20/0x2a0
       [1631742.147593]  ? check_preempt_wakeup+0x103/0x250
       [1631742.148053]  ? do_vfs_ioctl+0xa2/0x6f0
       [1631742.148510]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
       [1631742.148942]  do_vfs_ioctl+0xa2/0x6f0
       [1631742.149361]  ? __fget+0x113/0x200
       [1631742.149767]  ksys_ioctl+0x70/0x80
       [1631742.150159]  __x64_sys_ioctl+0x16/0x20
       [1631742.150543]  do_syscall_64+0x60/0x1b0
       [1631742.150931]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
       [1631742.151326] RIP: 0033:0x7f4cd9f5add7
       (...)
       [1631742.152509] RSP: 002b:00007ffe91017708 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
       [1631742.152892] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f4cd9f5add7
       [1631742.153268] RDX: 00007ffe91017790 RSI: 0000000040489426 RDI: 0000000000000007
       [1631742.153633] RBP: 0000000000000007 R08: 00007f4cd9e79700 R09: 00007f4cd9e79700
       [1631742.153999] R10: 00007f4cd9e799d0 R11: 0000000000000202 R12: 0000000000000003
       [1631742.154365] R13: 0000555dfae53020 R14: 0000000000000000 R15: 0000000000000001
       (...)
       [1631742.156696] ---[ end trace 5dac9f96dcc3fd6b ]---
      
      That BUG_ON happens because while send is using a node, that node is COWed
      by a concurrent deduplication, gets freed and gets reused as a leaf (because
      a transaction commit happened in between), so when it attempts to read a
      slot from the extent buffer, at ctree.c:read_node_slot(), the extent buffer
      contents were wiped out and it now matches a leaf (which can even belong to
      some other tree now), hitting the BUG_ON(level == 0).
      
      Fix this concurrency issue by not allowing send and deduplication to run
      in parallel if both operate on the same readonly trees, returning EAGAIN
      to user space and logging an exlicit warning in dmesg/syslog.
      
      [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be6821f82c3cc36e026f5afd10249988852b35ea
      [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6f2f0b394b54e2b159ef969a0b5274e9bbf82ff2
      [3] https://lore.kernel.org/linux-btrfs/CAL3q7H7iqSEEyFaEtpRZw3cp613y+4k2Q8b4W7mweR3tZA05bQ@mail.gmail.com/
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      62d54f3a
    • Filipe Manana's avatar
      Btrfs: send, flush dellaloc in order to avoid data loss · 9f89d5de
      Filipe Manana authored
      When we set a subvolume to read-only mode we do not flush dellaloc for any
      of its inodes (except if the filesystem is mounted with -o flushoncommit),
      since it does not affect correctness for any subsequent operations - except
      for a future send operation. The send operation will not be able to see the
      delalloc data since the respective file extent items, inode item updates,
      backreferences, etc, have not hit yet the subvolume and extent trees.
      
      Effectively this means data loss, since the send stream will not contain
      any data from existing delalloc. Another problem from this is that if the
      writeback starts and finishes while the send operation is in progress, we
      have the subvolume tree being being modified concurrently which can result
      in send failing unexpectedly with EIO or hitting runtime errors, assertion
      failures or hitting BUG_ONs, etc.
      
      Simple reproducer:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ btrfs subvolume create /mnt/sv
        $ xfs_io -f -c "pwrite -S 0xea 0 108K" /mnt/sv/foo
      
        $ btrfs property set /mnt/sv ro true
        $ btrfs send -f /tmp/send.stream /mnt/sv
      
        $ od -t x1 -A d /mnt/sv/foo
        0000000 ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea
        *
        0110592
      
        $ umount /mnt
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
      
        $ btrfs receive -f /tmp/send.stream /mnt
        $ echo $?
        0
        $ od -t x1 -A d /mnt/sv/foo
        0000000
        # ---> empty file
      
      Since this a problem that affects send only, fix it in send by flushing
      dellaloc for all the roots used by the send operation before send starts
      to process the commit roots.
      
      This is a problem that affects send since it was introduced (commit
      31db9f7c ("Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive"))
      but backporting it to older kernels has some dependencies:
      
      - For kernels between 3.19 and 4.20, it depends on commit 3cd24c69
        ("btrfs: use tagged writepage to mitigate livelock of snapshot") because
        the function btrfs_start_delalloc_snapshot() does not exist before that
        commit. So one has to either pick that commit or replace the calls to
        btrfs_start_delalloc_snapshot() in this patch with calls to
        btrfs_start_delalloc_inodes().
      
      - For kernels older than 3.19 it also requires commit e5fa8f86
        ("Btrfs: ensure send always works on roots without orphans") because
        it depends on the function ensure_commit_roots_uptodate() which that
        commits introduced.
      
      - No dependencies for 5.0+ kernels.
      
      A test case for fstests follows soon.
      
      CC: stable@vger.kernel.org # 3.19+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9f89d5de
    • Filipe Manana's avatar
      Btrfs: do not start a transaction during fiemap · 03628cdb
      Filipe Manana authored
      During fiemap, for regular extents (non inline) we need to check if they
      are shared and if they are, set the shared bit. Checking if an extent is
      shared requires checking the delayed references of the currently running
      transaction, since some reference might have not yet hit the extent tree
      and be only in the in-memory delayed references.
      
      However we were using a transaction join for this, which creates a new
      transaction when there is no transaction currently running. That means
      that two more potential failures can happen: creating the transaction and
      committing it. Further, if no write activity is currently happening in the
      system, and fiemap calls keep being done, we end up creating and
      committing transactions that do nothing.
      
      In some extreme cases this can result in the commit of the transaction
      created by fiemap to fail with ENOSPC when updating the root item of a
      subvolume tree because a join does not reserve any space, leading to a
      trace like the following:
      
       heisenberg kernel: ------------[ cut here ]------------
       heisenberg kernel: BTRFS: Transaction aborted (error -28)
       heisenberg kernel: WARNING: CPU: 0 PID: 7137 at fs/btrfs/root-tree.c:136 btrfs_update_root+0x22b/0x320 [btrfs]
      (...)
       heisenberg kernel: CPU: 0 PID: 7137 Comm: btrfs-transacti Not tainted 4.19.0-4-amd64 #1 Debian 4.19.28-2
       heisenberg kernel: Hardware name: FUJITSU LIFEBOOK U757/FJNB2A5, BIOS Version 1.21 03/19/2018
       heisenberg kernel: RIP: 0010:btrfs_update_root+0x22b/0x320 [btrfs]
      (...)
       heisenberg kernel: RSP: 0018:ffffb5448828bd40 EFLAGS: 00010286
       heisenberg kernel: RAX: 0000000000000000 RBX: ffff8ed56bccef50 RCX: 0000000000000006
       heisenberg kernel: RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff8ed6bda166a0
       heisenberg kernel: RBP: 00000000ffffffe4 R08: 00000000000003df R09: 0000000000000007
       heisenberg kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff8ed63396a078
       heisenberg kernel: R13: ffff8ed092d7c800 R14: ffff8ed64f5db028 R15: ffff8ed6bd03d068
       heisenberg kernel: FS:  0000000000000000(0000) GS:ffff8ed6bda00000(0000) knlGS:0000000000000000
       heisenberg kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       heisenberg kernel: CR2: 00007f46f75f8000 CR3: 0000000310a0a002 CR4: 00000000003606f0
       heisenberg kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       heisenberg kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       heisenberg kernel: Call Trace:
       heisenberg kernel:  commit_fs_roots+0x166/0x1d0 [btrfs]
       heisenberg kernel:  ? _cond_resched+0x15/0x30
       heisenberg kernel:  ? btrfs_run_delayed_refs+0xac/0x180 [btrfs]
       heisenberg kernel:  btrfs_commit_transaction+0x2bd/0x870 [btrfs]
       heisenberg kernel:  ? start_transaction+0x9d/0x3f0 [btrfs]
       heisenberg kernel:  transaction_kthread+0x147/0x180 [btrfs]
       heisenberg kernel:  ? btrfs_cleanup_transaction+0x530/0x530 [btrfs]
       heisenberg kernel:  kthread+0x112/0x130
       heisenberg kernel:  ? kthread_bind+0x30/0x30
       heisenberg kernel:  ret_from_fork+0x35/0x40
       heisenberg kernel: ---[ end trace 05de912e30e012d9 ]---
      
      Since fiemap (and btrfs_check_shared()) is a read-only operation, do not do
      a transaction join to avoid the overhead of creating a new transaction (if
      there is currently no running transaction) and introducing a potential
      point of failure when the new transaction gets committed, instead use a
      transaction attach to grab a handle for the currently running transaction
      if any.
      Reported-by: default avatarChristoph Anton Mitterer <calestyo@scientia.net>
      Link: https://lore.kernel.org/linux-btrfs/b2a668d7124f1d3e410367f587926f622b3f03a4.camel@scientia.net/
      Fixes: afce772e ("btrfs: fix check_shared for fiemap ioctl")
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      03628cdb
    • David Sterba's avatar
    • David Sterba's avatar
    • David Sterba's avatar
    • David Sterba's avatar
    • David Sterba's avatar
    • David Sterba's avatar
    • David Sterba's avatar
    • David Sterba's avatar
    • David Sterba's avatar
      25263cd7
    • Qu Wenruo's avatar
      btrfs: qgroup: Don't scan leaf if we're modifying reloc tree · c4140cbf
      Qu Wenruo authored
      Since reloc tree doesn't contribute to qgroup numbers, just skip them.
      
      This should catch the final cause of unnecessary data ref processing
      when running balance of metadata with qgroups on.
      
      The 4G data 16 snapshots test (*) should explain it pretty well:
      
                   | delayed subtree | refactor delayed ref | this patch
      ---------------------------------------------------------------------
      relocated    |           22653 |                22673 |         22744
      qgroup dirty |          122792 |                48360 |            70
      time         |          24.494 |               11.606 |         3.944
      
      Finally, we're at the stage where qgroup + metadata balance cost no
      obvious overhead.
      
      Test environment:
      
      Test VM:
      - vRAM		8G
      - vCPU		8
      - block dev	vitrio-blk, 'unsafe' cache mode
      - host block	850evo
      
      Test workload:
      - Copy 4G data from /usr/ to one subvolume
      - Create 16 snapshots of that subvolume, and modify 3 files in each
        snapshot
      - Enable quota, rescan
      - Time "btrfs balance start -m"
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c4140cbf
    • Qu Wenruo's avatar
      btrfs: extent-tree: Use btrfs_ref to refactor btrfs_free_extent() · ffd4bb2a
      Qu Wenruo authored
      Similar to btrfs_inc_extent_ref(), use btrfs_ref to replace the long
      parameter list and the confusing @owner parameter.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ffd4bb2a
    • Qu Wenruo's avatar
      btrfs: extent-tree: Use btrfs_ref to refactor btrfs_inc_extent_ref() · 82fa113f
      Qu Wenruo authored
      Use the new btrfs_ref structure and replace parameter list to clean up
      the usage of owner and level to distinguish the extent types.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      82fa113f
    • Qu Wenruo's avatar
      btrfs: extent-tree: Use btrfs_ref to refactor add_pinned_bytes() · ddf30cf0
      Qu Wenruo authored
      Since add_pinned_bytes() only needs to know if the extent is metadata
      and if it's a chunk tree extent, btrfs_ref is a perfect match for it, as
      we don't need various owner/level trick to determine extent type.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ddf30cf0
    • Qu Wenruo's avatar
      btrfs: ref-verify: Use btrfs_ref to refactor btrfs_ref_tree_mod() · 8a5040f7
      Qu Wenruo authored
      It's a perfect match for btrfs_ref_tree_mod() to use btrfs_ref, as
      btrfs_ref describes a metadata/data reference update comprehensively.
      
      Now we have one less function use confusing owner/level trick.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8a5040f7
    • Qu Wenruo's avatar
      btrfs: delayed-ref: Use btrfs_ref to refactor btrfs_add_delayed_data_ref() · 76675593
      Qu Wenruo authored
      Just like btrfs_add_delayed_tree_ref(), use btrfs_ref to refactor
      btrfs_add_delayed_data_ref().
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      76675593
    • Qu Wenruo's avatar
      btrfs: delayed-ref: Use btrfs_ref to refactor btrfs_add_delayed_tree_ref() · ed4f255b
      Qu Wenruo authored
      btrfs_add_delayed_tree_ref() has a longer and longer parameter list, and
      some callers like btrfs_inc_extent_ref() are using @owner as level for
      delayed tree ref.
      
      Instead of making the parameter list longer, use btrfs_ref to refactor
      it, so each parameter assignment should be self-explaining without dirty
      level/owner trick, and provides the basis for later refactoring.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ed4f255b
    • Qu Wenruo's avatar
      btrfs: extent-tree: Open-code process_func in __btrfs_mod_ref · dd28b6a5
      Qu Wenruo authored
      The process_func function pointer is local to __btrfs_mod_ref() and
      points to either btrfs_inc_extent_ref() or btrfs_free_extent().
      
      Open code it to make later delayed ref refactor easier, so we can
      refactor btrfs_inc_extent_ref() and btrfs_free_extent() in different
      patches.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dd28b6a5
    • Qu Wenruo's avatar
      btrfs: delayed-ref: Introduce better documented delayed ref structures · b28b1f0c
      Qu Wenruo authored
      Current delayed ref interface has several problems:
      
      - Longer and longer parameter lists
        bytenr
        num_bytes
        parent
        ---------- so far so good
        ref_root
        owner
        offset
        ---------- I don't feel good now
      
      - Different interpretation of the same parameter
      
        Above @owner for data ref is inode number (u64),
        while for tree ref, it's level (int).
      
        They are even in different size range.
        For level we only need 0 ~ 8, while for ino it's
        BTRFS_FIRST_FREE_OBJECTID ~ BTRFS_LAST_FREE_OBJECTID.
      
        And @offset doesn't even make sense for tree ref.
      
        Such parameter reuse may look clever as an hidden union, but it
        destroys code readability.
      
      To solve both problems, we introduce a new structure, btrfs_ref to solve
      them:
      
      - Structure instead of long parameter list
        This makes later expansion easier, and is better documented.
      
      - Use btrfs_ref::type to distinguish data and tree ref
      
      - Use proper union to store data/tree ref specific structures.
      
      - Use separate functions to fill data/tree ref data, with a common generic
        function to fill common bytenr/num_bytes members.
      
      All parameters will find its place in btrfs_ref, and an extra member,
      @real_root, inspired by ref-verify code, is newly introduced for later
      qgroup code, to record which tree is triggered by this extent modification.
      
      This patch doesn't touch any code, but provides the basis for further
      refactoring.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b28b1f0c
    • Filipe Manana's avatar
      Btrfs: do not start a transaction at iterate_extent_inodes() · bfc61c36
      Filipe Manana authored
      When finding out which inodes have references on a particular extent, done
      by backref.c:iterate_extent_inodes(), from the BTRFS_IOC_LOGICAL_INO (both
      v1 and v2) ioctl and from scrub we use the transaction join API to grab a
      reference on the currently running transaction, since in order to give
      accurate results we need to inspect the delayed references of the currently
      running transaction.
      
      However, if there is currently no running transaction, the join operation
      will create a new transaction. This is inefficient as the transaction will
      eventually be committed, doing unnecessary IO and introducing a potential
      point of failure that will lead to a transaction abort due to -ENOSPC, as
      recently reported [1].
      
      That's because the join, creates the transaction but does not reserve any
      space, so when attempting to update the root item of the root passed to
      btrfs_join_transaction(), during the transaction commit, we can end up
      failling with -ENOSPC. Users of a join operation are supposed to actually
      do some filesystem changes and reserve space by some means, which is not
      the case of iterate_extent_inodes(), it is a read-only operation for all
      contextes from which it is called.
      
      The reported [1] -ENOSPC failure stack trace is the following:
      
       heisenberg kernel: ------------[ cut here ]------------
       heisenberg kernel: BTRFS: Transaction aborted (error -28)
       heisenberg kernel: WARNING: CPU: 0 PID: 7137 at fs/btrfs/root-tree.c:136 btrfs_update_root+0x22b/0x320 [btrfs]
      (...)
       heisenberg kernel: CPU: 0 PID: 7137 Comm: btrfs-transacti Not tainted 4.19.0-4-amd64 #1 Debian 4.19.28-2
       heisenberg kernel: Hardware name: FUJITSU LIFEBOOK U757/FJNB2A5, BIOS Version 1.21 03/19/2018
       heisenberg kernel: RIP: 0010:btrfs_update_root+0x22b/0x320 [btrfs]
      (...)
       heisenberg kernel: RSP: 0018:ffffb5448828bd40 EFLAGS: 00010286
       heisenberg kernel: RAX: 0000000000000000 RBX: ffff8ed56bccef50 RCX: 0000000000000006
       heisenberg kernel: RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff8ed6bda166a0
       heisenberg kernel: RBP: 00000000ffffffe4 R08: 00000000000003df R09: 0000000000000007
       heisenberg kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff8ed63396a078
       heisenberg kernel: R13: ffff8ed092d7c800 R14: ffff8ed64f5db028 R15: ffff8ed6bd03d068
       heisenberg kernel: FS:  0000000000000000(0000) GS:ffff8ed6bda00000(0000) knlGS:0000000000000000
       heisenberg kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       heisenberg kernel: CR2: 00007f46f75f8000 CR3: 0000000310a0a002 CR4: 00000000003606f0
       heisenberg kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       heisenberg kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       heisenberg kernel: Call Trace:
       heisenberg kernel:  commit_fs_roots+0x166/0x1d0 [btrfs]
       heisenberg kernel:  ? _cond_resched+0x15/0x30
       heisenberg kernel:  ? btrfs_run_delayed_refs+0xac/0x180 [btrfs]
       heisenberg kernel:  btrfs_commit_transaction+0x2bd/0x870 [btrfs]
       heisenberg kernel:  ? start_transaction+0x9d/0x3f0 [btrfs]
       heisenberg kernel:  transaction_kthread+0x147/0x180 [btrfs]
       heisenberg kernel:  ? btrfs_cleanup_transaction+0x530/0x530 [btrfs]
       heisenberg kernel:  kthread+0x112/0x130
       heisenberg kernel:  ? kthread_bind+0x30/0x30
       heisenberg kernel:  ret_from_fork+0x35/0x40
       heisenberg kernel: ---[ end trace 05de912e30e012d9 ]---
      
      So fix that by using the attach API, which does not create a transaction
      when there is currently no running transaction.
      
      [1] https://lore.kernel.org/linux-btrfs/b2a668d7124f1d3e410367f587926f622b3f03a4.camel@scientia.net/Reported-by: default avatarZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bfc61c36
    • David Sterba's avatar
      btrfs: get fs_info from device in btrfs_rm_dev_replace_free_srcdev · 65237ee3
      David Sterba authored
      We can read fs_info from the device and can drop it from the parameters.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      65237ee3