1. 09 May, 2019 2 commits
    • Filipe Manana's avatar
      Btrfs: do not abort transaction at btrfs_update_root() after failure to COW path · 72bd2323
      Filipe Manana authored
      Currently when we fail to COW a path at btrfs_update_root() we end up
      always aborting the transaction. However all the current callers of
      btrfs_update_root() are able to deal with errors returned from it, many do
      end up aborting the transaction themselves (directly or not, such as the
      transaction commit path), other BUG_ON() or just gracefully cancel whatever
      they were doing.
      
      When syncing the fsync log, we call btrfs_update_root() through
      tree-log.c:update_log_root(), and if it returns an -ENOSPC error, the log
      sync code does not abort the transaction, instead it gracefully handles
      the error and returns -EAGAIN to the fsync handler, so that it falls back
      to a transaction commit. Any other error different from -ENOSPC, makes the
      log sync code abort the transaction.
      
      So remove the transaction abort from btrfs_update_log() when we fail to
      COW a path to update the root item, so that if an -ENOSPC failure happens
      we avoid aborting the current transaction and have a chance of the fsync
      succeeding after falling back to a transaction commit.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203413
      Fixes: 79787eaa ("btrfs: replace many BUG_ONs with proper error handling")
      Cc: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      72bd2323
    • Josef Bacik's avatar
      btrfs: use the existing reserved items for our first prop for inheritance · d7400ee1
      Josef Bacik authored
      We're now reserving an extra items worth of space for property
      inheritance.  We only have one property at the moment so this covers us,
      but if we add more in the future this will allow us to not get bitten by
      the extra space reservation.  If we do add more properties in the future
      we should re-visit how we calculate the space reservation needs by the
      callers.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      [ refreshed on top of prop/xattr cleanups ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d7400ee1
  2. 03 May, 2019 2 commits
    • Josef Bacik's avatar
      btrfs: don't double unlock on error in btrfs_punch_hole · 8fca9550
      Josef Bacik authored
      If we have an error writing out a delalloc range in
      btrfs_punch_hole_lock_range we'll unlock the inode and then goto
      out_only_mutex, where we will again unlock the inode.  This is bad,
      don't do this.
      
      Fixes: f27451f2 ("Btrfs: add support for fallocate's zero range operation")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8fca9550
    • Johnny Chang's avatar
      btrfs: Check the compression level before getting a workspace · 2b90883c
      Johnny Chang authored
      When a file's compression property is set as zlib or zstd but leave
      the compression mount option not be set, that means btrfs will try
      to compress the file with default compression level. But in
      btrfs_compress_pages(), it calls get_workspace() with level = 0.
      This will return a workspace with a wrong compression level.
      For zlib, the compression level in the workspace will be 0
      (that means "store only"). And for zstd, the compression in the
      workspace will be 1, not the default level 3.
      
      How to reproduce:
        mkfs -t btrfs /dev/sdb
        mount /dev/sdb /mnt/
        mkdir /mnt/zlib
        btrfs property set /mnt/zlib/ compression zlib
        dd if=/dev/zero of=/mnt/zlib/compression-friendly-file-10M bs=1M count=10
        sync
        btrfs-debugfs -f /mnt/zlib/compression-friendly-file-10M
      
      btrfs-debugfs output:
      * before:
        ...
        (258 9961472): ram 524288 disk 1106247680 disk_size 524288
        file: ... extents 20 disk size 10485760 logical size 10485760 ratio 1.00
      
      * after:
       ...
       (258 10354688): ram 131072 disk 14217216 disk_size 4096
       file: ... extents 80 disk size 327680 logical size 10485760 ratio 32.00
      
      The steps for zstd are similar, but need to put a debugging message to
      show the level of the return workspace in zstd_get_workspace().
      
      This commit adds a check of the compression level before getting a
      workspace by set_level().
      
      CC: stable@vger.kernel.org # 5.1+
      Signed-off-by: default avatarJohnny Chang <johnnyc@synology.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2b90883c
  3. 02 May, 2019 9 commits
  4. 29 Apr, 2019 27 commits
    • Josef Bacik's avatar
      btrfs: track DIO bytes in flight · 4297ff84
      Josef Bacik authored
      When diagnosing a slowdown of generic/224 I noticed we were not doing
      anything when calling into shrink_delalloc().  This is because all
      writes in 224 are O_DIRECT, not delalloc, and thus our delalloc_bytes
      counter is 0, which short circuits most of the work inside of
      shrink_delalloc().  However O_DIRECT writes still consume metadata
      resources and generate ordered extents, which we can still wait on.
      
      Fix this by tracking outstanding DIO write bytes, and use this as well
      as the delalloc bytes counter to decide if we need to lookup and wait on
      any ordered extents.  If we have more DIO writes than delalloc bytes
      we'll go ahead and wait on any ordered extents regardless of our flush
      state as flushing delalloc is likely to not gain us anything.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      [ use dio instead of odirect in identifiers ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4297ff84
    • Anand Jain's avatar
      btrfs: merge calls of btrfs_setxattr and btrfs_setxattr_trans in btrfs_set_prop · da9b6ec8
      Anand Jain authored
      Since now the trans argument is never NULL in btrfs_set_prop we don't
      have to check. So delete it and use btrfs_setxattr that makes use of
      that.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      da9b6ec8
    • Anand Jain's avatar
      btrfs: delete unused function btrfs_set_prop_trans · 717ebdc3
      Anand Jain authored
      The last consumer of btrfs_set_prop_trans() was taken away by the patch
      ("btrfs: start transaction in xattr_handler_set_prop") so now this
      function can be deleted.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      717ebdc3
    • Anand Jain's avatar
      btrfs: start transaction in xattr_handler_set_prop · b3f6a4be
      Anand Jain authored
      btrfs specific extended attributes on the inode are set using
      btrfs_xattr_handler_set_prop(), and the required transaction for this
      update is started by btrfs_setxattr(). For better visibility of the
      transaction start and end, do this in btrfs_xattr_handler_set_prop().
      For which this patch copied code of btrfs_setxattr() as it is in the
      original, which needs proper error handling.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b3f6a4be
    • Anand Jain's avatar
      btrfs: drop local copy of inode i_mode · 44e5194b
      Anand Jain authored
      There isn't real use of making struct inode::i_mode a local copy, it
      saves a dereference one time, not much. Just use it directly.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      44e5194b
    • Anand Jain's avatar
      btrfs: drop old_fsflags in btrfs_ioctl_setflags · 3c8d8b63
      Anand Jain authored
      btrfs_inode_flags_to_fsflags() is copied into @old_fsflags and used only
      once. Instead used it directly.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3c8d8b63
    • Anand Jain's avatar
      btrfs: modify local copy of btrfs_inode flags · d2b8fcfe
      Anand Jain authored
      Instead of updating the binode::flags directly, update a local copy, and
      then at the point of no error, store copy it to the binode::flags.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d2b8fcfe
    • Anand Jain's avatar
      btrfs: drop useless inode i_flags copy and restore · 11d3cd5c
      Anand Jain authored
      The patch ("btrfs: start transaction in btrfs_ioctl_setflags()") used
      btrfs_set_prop() instead of btrfs_set_prop_trans() by which now the
      inode::i_flags update functions such as
      btrfs_sync_inode_flags_to_i_flags() and btrfs_update_inode() is called
      in btrfs_ioctl_setflags() instead of
      btrfs_set_prop_trans()->btrfs_setxattr() as earlier. So the
      inode::i_flags remains unmodified until the thread has checked all the
      conditions. So drop the saved inode::i_flags in out_i_flags.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      11d3cd5c
    • Anand Jain's avatar
      btrfs: start transaction in btrfs_ioctl_setflags() · ff9fef55
      Anand Jain authored
      Inode attribute can be set through the FS_IOC_SETFLAGS ioctl.  This
      flags also includes compression attribute for which we would set/reset
      the compression extended attribute. While doing this there is a bit of
      duplicate code, the following things happens twice:
      
      - start/end_transaction
      - inode_inc_iversion()
      - current_time update to inode->i_ctime
      - and btrfs_update_inode()
      
      These are updated both at btrfs_ioctl_setflags() and btrfs_set_props()
      as well.  This patch merges these two duplicate codes at
      btrfs_ioctl_setflags().
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ff9fef55
    • Anand Jain's avatar
      btrfs: export btrfs_set_prop · cd31af15
      Anand Jain authored
      Make btrfs_set_prop() a non-static function, so that it can be called
      from btrfs_ioctl_setflags(). We need btrfs_set_prop() instead of
      btrfs_set_prop_trans() so that we can use the transaction which is
      already started in the current thread.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cd31af15
    • Anand Jain's avatar
      btrfs: refactor btrfs_set_props to validate externally · f22125e5
      Anand Jain authored
      In preparation to merge multiple transactions when setting the
      compression flags, split btrfs_set_props() validation part outside of
      it.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f22125e5
    • Qu Wenruo's avatar
      btrfs: ctree: Dump the leaf before BUG_ON in btrfs_set_item_key_safe · 7c15d410
      Qu Wenruo authored
      We have a long standing problem with reversed keys that's detected by
      btrfs_set_item_key_safe. This is hard to reproduce so we'd like to
      capture more information for later analysis.
      
      Let's dump the leaf content before triggering BUG_ON() so that we can
      have some clue on what's going wrong.  The output of tree locks should
      help us to debug such problem.
      
      Sample stacktrace:
      
       generic/522             [00:07:05]
       [26946.113381] run fstests generic/522 at 2019-04-16 00:07:05
       [27161.474720] kernel BUG at fs/btrfs/ctree.c:3192!
       [27161.475923] invalid opcode: 0000 [#1] PREEMPT SMP
       [27161.477167] CPU: 0 PID: 15676 Comm: fsx Tainted: G        W         5.1.0-rc5-default+ #562
       [27161.478932] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c89-prebuilt.qemu.org 04/01/2014
       [27161.481099] RIP: 0010:btrfs_set_item_key_safe+0x146/0x1c0 [btrfs]
       [27161.485369] RSP: 0018:ffffb087499e39b0 EFLAGS: 00010286
       [27161.486464] RAX: 00000000ffffffff RBX: ffff941534d80e70 RCX: 0000000000024000
       [27161.487929] RDX: 0000000000013039 RSI: ffffb087499e3aa5 RDI: ffffb087499e39c7
       [27161.489289] RBP: 000000000000000e R08: ffff9414e0f49008 R09: 0000000000001000
       [27161.490807] R10: 0000000000000000 R11: 0000000000000003 R12: ffff9414e0f48e70
       [27161.492305] R13: ffffb087499e3aa5 R14: 0000000000000000 R15: 0000000000071000
       [27161.493845] FS:  00007f8ea58d0b80(0000) GS:ffff94153d400000(0000) knlGS:0000000000000000
       [27161.495608] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [27161.496717] CR2: 00007f8ea57a9000 CR3: 0000000016a33000 CR4: 00000000000006f0
       [27161.498100] Call Trace:
       [27161.498771]  __btrfs_drop_extents+0x6ec/0xdf0 [btrfs]
       [27161.499872]  btrfs_log_changed_extents.isra.26+0x3a2/0x9e0 [btrfs]
       [27161.501114]  btrfs_log_inode+0x7ff/0xdc0 [btrfs]
       [27161.502114]  ? __mutex_unlock_slowpath+0x4b/0x2b0
       [27161.503172]  btrfs_log_inode_parent+0x237/0x9c0 [btrfs]
       [27161.504348]  btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
       [27161.505374]  btrfs_sync_file+0x1b7/0x480 [btrfs]
       [27161.506371]  __x64_sys_msync+0x180/0x210
       [27161.507208]  do_syscall_64+0x54/0x180
       [27161.507932]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
       [27161.508839] RIP: 0033:0x7f8ea5aa9c61
       [27161.512616] RSP: 002b:00007ffea2a06498 EFLAGS: 00000246 ORIG_RAX: 000000000000001a
       [27161.514161] RAX: ffffffffffffffda RBX: 000000000002a938 RCX: 00007f8ea5aa9c61
       [27161.515376] RDX: 0000000000000004 RSI: 000000000001c9b2 RDI: 00007f8ea578d000
       [27161.516572] RBP: 000000000001c07a R08: fffffffffffffff8 R09: 000000000002a000
       [27161.517883] R10: 00007f8ea57a99b2 R11: 0000000000000246 R12: 0000000000000938
       [27161.519080] R13: 00007f8ea578d000 R14: 000000000001c9b2 R15: 0000000000000000
       [27161.520281] Modules linked in: btrfs libcrc32c xor zstd_decompress zstd_compress xxhash raid6_pq loop [last unloaded: scsi_debug]
       [27161.522272] ---[ end trace d5afec7ccac6a252 ]---
       [27161.523111] RIP: 0010:btrfs_set_item_key_safe+0x146/0x1c0 [btrfs]
       [27161.527253] RSP: 0018:ffffb087499e39b0 EFLAGS: 00010286
       [27161.528192] RAX: 00000000ffffffff RBX: ffff941534d80e70 RCX: 0000000000024000
       [27161.529392] RDX: 0000000000013039 RSI: ffffb087499e3aa5 RDI: ffffb087499e39c7
       [27161.530607] RBP: 000000000000000e R08: ffff9414e0f49008 R09: 0000000000001000
       [27161.531802] R10: 0000000000000000 R11: 0000000000000003 R12: ffff9414e0f48e70
       [27161.533018] R13: ffffb087499e3aa5 R14: 0000000000000000 R15: 0000000000071000
       [27161.534405] FS:  00007f8ea58d0b80(0000) GS:ffff94153d400000(0000) knlGS:0000000000000000
       [27161.536048] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [27161.537210] CR2: 00007f8ea57a9000 CR3: 0000000016a33000 CR4: 00000000000006f0
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7c15d410
    • Qu Wenruo's avatar
      btrfs: tree-checker: Allow error injection for tree-checker · 02529d7a
      Qu Wenruo authored
      Allowing error injection for btrfs_check_leaf_full() and
      btrfs_check_node() is useful to test the failure path of btrfs write
      time tree check.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      02529d7a
    • Nikolay Borisov's avatar
      btrfs: Document btrfs_csum_one_bio · 51d470ae
      Nikolay Borisov authored
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      51d470ae
    • Filipe Manana's avatar
      Btrfs: improve performance on fsync of files with multiple hardlinks · b8aa330d
      Filipe Manana authored
      Commit 41bd6067 ("Btrfs: fix fsync of files with multiple hard links
      in new directories") introduced a path that makes fsync fallback to a full
      transaction commit in order to avoid losing hard links and new ancestors
      of the fsynced inode. That path is triggered only when the inode has more
      than one hard link and either has a new hard link created in the current
      transaction or the inode was evicted and reloaded in the current
      transaction.
      
      That path ends up getting triggered very often (hundreds of times) during
      the course of pgbench benchmarks, resulting in performance drops of about
      20%.
      
      This change restores the performance by not triggering the full transaction
      commit in those cases, and instead iterate the fs/subvolume tree in search
      of all possible new ancestors, for all hard links, to log them.
      Reported-by: default avatarZhao Yuhu <zyuhu@suse.com>
      Tested-by: default avatarJames Wang <jnwang@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b8aa330d
    • Filipe Manana's avatar
      Btrfs: fix race between send and deduplication that lead to failures and crashes · 62d54f3a
      Filipe Manana authored
      Send operates on read only trees and expects them to never change while it
      is using them. This is part of its initial design, and this expection is
      due to two different reasons:
      
      1) When it was introduced, no operations were allowed to modifiy read-only
         subvolumes/snapshots (including defrag for example).
      
      2) It keeps send from having an impact on other filesystem operations.
         Namely send does not need to keep locks on the trees nor needs to hold on
         to transaction handles and delay transaction commits. This ends up being
         a consequence of the former reason.
      
      However the deduplication feature was introduced later (on September 2013,
      while send was introduced in July 2012) and it allowed for deduplication
      with destination files that belong to read-only trees (subvolumes and
      snapshots).
      
      That means that having a send operation (either full or incremental) running
      in parallel with a deduplication that has the destination inode in one of
      the trees used by the send operation, can result in tree nodes and leaves
      getting freed and reused while send is using them. This problem is similar
      to the problem solved for the root nodes getting freed and reused when a
      snapshot is made against one tree that is currenly being used by a send
      operation, fixed in commits [1] and [2]. These commits explain in detail
      how the problem happens and the explanation is valid for any node or leaf
      that is not the root of a tree as well. This problem was also discussed
      and explained recently in a thread [3].
      
      The problem is very easy to reproduce when using send with large trees
      (snapshots) and just a few concurrent deduplication operations that target
      files in the trees used by send. A stress test case is being sent for
      fstests that triggers the issue easily. The most common error to hit is
      the send ioctl return -EIO with the following messages in dmesg/syslog:
      
       [1631617.204075] BTRFS error (device sdc): did not find backref in send_root. inode=63292, offset=0, disk_byte=5228134400 found extent=5228134400
       [1631633.251754] BTRFS error (device sdc): parent transid verify failed on 32243712 wanted 24 found 27
      
      The first one is very easy to hit while the second one happens much less
      frequently, except for very large trees (in that test case, snapshots
      with 100000 files having large xattrs to get deep and wide trees).
      Less frequently, at least one BUG_ON can be hit:
      
       [1631742.130080] ------------[ cut here ]------------
       [1631742.130625] kernel BUG at fs/btrfs/ctree.c:1806!
       [1631742.131188] invalid opcode: 0000 [#6] SMP DEBUG_PAGEALLOC PTI
       [1631742.131726] CPU: 1 PID: 13394 Comm: btrfs Tainted: G    B D W         5.0.0-rc8-btrfs-next-45 #1
       [1631742.132265] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
       [1631742.133399] RIP: 0010:read_node_slot+0x122/0x130 [btrfs]
       (...)
       [1631742.135061] RSP: 0018:ffffb530021ebaa0 EFLAGS: 00010246
       [1631742.135615] RAX: ffff93ac8912e000 RBX: 000000000000009d RCX: 0000000000000002
       [1631742.136173] RDX: 000000000000009d RSI: ffff93ac564b0d08 RDI: ffff93ad5b48c000
       [1631742.136759] RBP: ffffb530021ebb7d R08: 0000000000000001 R09: ffffb530021ebb7d
       [1631742.137324] R10: ffffb530021eba70 R11: 0000000000000000 R12: ffff93ac87d0a708
       [1631742.137900] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
       [1631742.138455] FS:  00007f4cdb1528c0(0000) GS:ffff93ad76a80000(0000) knlGS:0000000000000000
       [1631742.139010] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [1631742.139568] CR2: 00007f5acb3d0420 CR3: 000000012be3e006 CR4: 00000000003606e0
       [1631742.140131] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       [1631742.140719] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       [1631742.141272] Call Trace:
       [1631742.141826]  ? do_raw_spin_unlock+0x49/0xc0
       [1631742.142390]  tree_advance+0x173/0x1d0 [btrfs]
       [1631742.142948]  btrfs_compare_trees+0x268/0x690 [btrfs]
       [1631742.143533]  ? process_extent+0x1070/0x1070 [btrfs]
       [1631742.144088]  btrfs_ioctl_send+0x1037/0x1270 [btrfs]
       [1631742.144645]  _btrfs_ioctl_send+0x80/0x110 [btrfs]
       [1631742.145161]  ? trace_sched_stick_numa+0xe0/0xe0
       [1631742.145685]  btrfs_ioctl+0x13fe/0x3120 [btrfs]
       [1631742.146179]  ? account_entity_enqueue+0xd3/0x100
       [1631742.146662]  ? reweight_entity+0x154/0x1a0
       [1631742.147135]  ? update_curr+0x20/0x2a0
       [1631742.147593]  ? check_preempt_wakeup+0x103/0x250
       [1631742.148053]  ? do_vfs_ioctl+0xa2/0x6f0
       [1631742.148510]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
       [1631742.148942]  do_vfs_ioctl+0xa2/0x6f0
       [1631742.149361]  ? __fget+0x113/0x200
       [1631742.149767]  ksys_ioctl+0x70/0x80
       [1631742.150159]  __x64_sys_ioctl+0x16/0x20
       [1631742.150543]  do_syscall_64+0x60/0x1b0
       [1631742.150931]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
       [1631742.151326] RIP: 0033:0x7f4cd9f5add7
       (...)
       [1631742.152509] RSP: 002b:00007ffe91017708 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
       [1631742.152892] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f4cd9f5add7
       [1631742.153268] RDX: 00007ffe91017790 RSI: 0000000040489426 RDI: 0000000000000007
       [1631742.153633] RBP: 0000000000000007 R08: 00007f4cd9e79700 R09: 00007f4cd9e79700
       [1631742.153999] R10: 00007f4cd9e799d0 R11: 0000000000000202 R12: 0000000000000003
       [1631742.154365] R13: 0000555dfae53020 R14: 0000000000000000 R15: 0000000000000001
       (...)
       [1631742.156696] ---[ end trace 5dac9f96dcc3fd6b ]---
      
      That BUG_ON happens because while send is using a node, that node is COWed
      by a concurrent deduplication, gets freed and gets reused as a leaf (because
      a transaction commit happened in between), so when it attempts to read a
      slot from the extent buffer, at ctree.c:read_node_slot(), the extent buffer
      contents were wiped out and it now matches a leaf (which can even belong to
      some other tree now), hitting the BUG_ON(level == 0).
      
      Fix this concurrency issue by not allowing send and deduplication to run
      in parallel if both operate on the same readonly trees, returning EAGAIN
      to user space and logging an exlicit warning in dmesg/syslog.
      
      [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be6821f82c3cc36e026f5afd10249988852b35ea
      [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6f2f0b394b54e2b159ef969a0b5274e9bbf82ff2
      [3] https://lore.kernel.org/linux-btrfs/CAL3q7H7iqSEEyFaEtpRZw3cp613y+4k2Q8b4W7mweR3tZA05bQ@mail.gmail.com/
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      62d54f3a
    • Filipe Manana's avatar
      Btrfs: send, flush dellaloc in order to avoid data loss · 9f89d5de
      Filipe Manana authored
      When we set a subvolume to read-only mode we do not flush dellaloc for any
      of its inodes (except if the filesystem is mounted with -o flushoncommit),
      since it does not affect correctness for any subsequent operations - except
      for a future send operation. The send operation will not be able to see the
      delalloc data since the respective file extent items, inode item updates,
      backreferences, etc, have not hit yet the subvolume and extent trees.
      
      Effectively this means data loss, since the send stream will not contain
      any data from existing delalloc. Another problem from this is that if the
      writeback starts and finishes while the send operation is in progress, we
      have the subvolume tree being being modified concurrently which can result
      in send failing unexpectedly with EIO or hitting runtime errors, assertion
      failures or hitting BUG_ONs, etc.
      
      Simple reproducer:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ btrfs subvolume create /mnt/sv
        $ xfs_io -f -c "pwrite -S 0xea 0 108K" /mnt/sv/foo
      
        $ btrfs property set /mnt/sv ro true
        $ btrfs send -f /tmp/send.stream /mnt/sv
      
        $ od -t x1 -A d /mnt/sv/foo
        0000000 ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea
        *
        0110592
      
        $ umount /mnt
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
      
        $ btrfs receive -f /tmp/send.stream /mnt
        $ echo $?
        0
        $ od -t x1 -A d /mnt/sv/foo
        0000000
        # ---> empty file
      
      Since this a problem that affects send only, fix it in send by flushing
      dellaloc for all the roots used by the send operation before send starts
      to process the commit roots.
      
      This is a problem that affects send since it was introduced (commit
      31db9f7c ("Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive"))
      but backporting it to older kernels has some dependencies:
      
      - For kernels between 3.19 and 4.20, it depends on commit 3cd24c69
        ("btrfs: use tagged writepage to mitigate livelock of snapshot") because
        the function btrfs_start_delalloc_snapshot() does not exist before that
        commit. So one has to either pick that commit or replace the calls to
        btrfs_start_delalloc_snapshot() in this patch with calls to
        btrfs_start_delalloc_inodes().
      
      - For kernels older than 3.19 it also requires commit e5fa8f86
        ("Btrfs: ensure send always works on roots without orphans") because
        it depends on the function ensure_commit_roots_uptodate() which that
        commits introduced.
      
      - No dependencies for 5.0+ kernels.
      
      A test case for fstests follows soon.
      
      CC: stable@vger.kernel.org # 3.19+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9f89d5de
    • Filipe Manana's avatar
      Btrfs: do not start a transaction during fiemap · 03628cdb
      Filipe Manana authored
      During fiemap, for regular extents (non inline) we need to check if they
      are shared and if they are, set the shared bit. Checking if an extent is
      shared requires checking the delayed references of the currently running
      transaction, since some reference might have not yet hit the extent tree
      and be only in the in-memory delayed references.
      
      However we were using a transaction join for this, which creates a new
      transaction when there is no transaction currently running. That means
      that two more potential failures can happen: creating the transaction and
      committing it. Further, if no write activity is currently happening in the
      system, and fiemap calls keep being done, we end up creating and
      committing transactions that do nothing.
      
      In some extreme cases this can result in the commit of the transaction
      created by fiemap to fail with ENOSPC when updating the root item of a
      subvolume tree because a join does not reserve any space, leading to a
      trace like the following:
      
       heisenberg kernel: ------------[ cut here ]------------
       heisenberg kernel: BTRFS: Transaction aborted (error -28)
       heisenberg kernel: WARNING: CPU: 0 PID: 7137 at fs/btrfs/root-tree.c:136 btrfs_update_root+0x22b/0x320 [btrfs]
      (...)
       heisenberg kernel: CPU: 0 PID: 7137 Comm: btrfs-transacti Not tainted 4.19.0-4-amd64 #1 Debian 4.19.28-2
       heisenberg kernel: Hardware name: FUJITSU LIFEBOOK U757/FJNB2A5, BIOS Version 1.21 03/19/2018
       heisenberg kernel: RIP: 0010:btrfs_update_root+0x22b/0x320 [btrfs]
      (...)
       heisenberg kernel: RSP: 0018:ffffb5448828bd40 EFLAGS: 00010286
       heisenberg kernel: RAX: 0000000000000000 RBX: ffff8ed56bccef50 RCX: 0000000000000006
       heisenberg kernel: RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff8ed6bda166a0
       heisenberg kernel: RBP: 00000000ffffffe4 R08: 00000000000003df R09: 0000000000000007
       heisenberg kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff8ed63396a078
       heisenberg kernel: R13: ffff8ed092d7c800 R14: ffff8ed64f5db028 R15: ffff8ed6bd03d068
       heisenberg kernel: FS:  0000000000000000(0000) GS:ffff8ed6bda00000(0000) knlGS:0000000000000000
       heisenberg kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       heisenberg kernel: CR2: 00007f46f75f8000 CR3: 0000000310a0a002 CR4: 00000000003606f0
       heisenberg kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       heisenberg kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       heisenberg kernel: Call Trace:
       heisenberg kernel:  commit_fs_roots+0x166/0x1d0 [btrfs]
       heisenberg kernel:  ? _cond_resched+0x15/0x30
       heisenberg kernel:  ? btrfs_run_delayed_refs+0xac/0x180 [btrfs]
       heisenberg kernel:  btrfs_commit_transaction+0x2bd/0x870 [btrfs]
       heisenberg kernel:  ? start_transaction+0x9d/0x3f0 [btrfs]
       heisenberg kernel:  transaction_kthread+0x147/0x180 [btrfs]
       heisenberg kernel:  ? btrfs_cleanup_transaction+0x530/0x530 [btrfs]
       heisenberg kernel:  kthread+0x112/0x130
       heisenberg kernel:  ? kthread_bind+0x30/0x30
       heisenberg kernel:  ret_from_fork+0x35/0x40
       heisenberg kernel: ---[ end trace 05de912e30e012d9 ]---
      
      Since fiemap (and btrfs_check_shared()) is a read-only operation, do not do
      a transaction join to avoid the overhead of creating a new transaction (if
      there is currently no running transaction) and introducing a potential
      point of failure when the new transaction gets committed, instead use a
      transaction attach to grab a handle for the currently running transaction
      if any.
      Reported-by: default avatarChristoph Anton Mitterer <calestyo@scientia.net>
      Link: https://lore.kernel.org/linux-btrfs/b2a668d7124f1d3e410367f587926f622b3f03a4.camel@scientia.net/
      Fixes: afce772e ("btrfs: fix check_shared for fiemap ioctl")
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      03628cdb
    • David Sterba's avatar
    • David Sterba's avatar
    • David Sterba's avatar
    • David Sterba's avatar
    • David Sterba's avatar
    • David Sterba's avatar
    • David Sterba's avatar
    • David Sterba's avatar
    • David Sterba's avatar
      25263cd7