1. 15 Oct, 2018 40 commits
    • Liu Bo's avatar
      Btrfs: preftree: use rb_first_cached · ecf160b4
      Liu Bo authored
      rb_first_cached() trades an extra pointer "leftmost" for doing the same
      job as rb_first() but in O(1).
      
      While resolving indirect refs and missing refs, it always looks for the
      first rb entry in a while loop, it's helpful to use rb_first_cached
      instead.
      
      For more details about the optimization see patch "Btrfs: delayed-refs:
      use rb_first_cached for href_root".
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ecf160b4
    • Liu Bo's avatar
      Btrfs: extent_map: use rb_first_cached · 07e1ce09
      Liu Bo authored
      rb_first_cached() trades an extra pointer "leftmost" for doing the
      same job as rb_first() but in O(1).
      
      As evict_inode_truncate_pages() removes all extent mapping by always
      looking for the first rb entry, it's helpful to use rb_first_cached
      instead.
      
      For more details about the optimization see patch "Btrfs: delayed-refs:
      use rb_first_cached for href_root".
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      07e1ce09
    • Liu Bo's avatar
      Btrfs: delayed-inode: use rb_first_cached for ins_root and del_root · 03a1d4c8
      Liu Bo authored
      rb_first_cached() trades an extra pointer "leftmost" for doing the same job as
      rb_first() but in O(1).
      
      Functions manipulating delayed_item need to get the first entry, this converts
      it to use rb_first_cached().
      
      For more details about the optimization see patch "Btrfs: delayed-refs:
      use rb_first_cached for href_root".
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      03a1d4c8
    • Liu Bo's avatar
      Btrfs: delayed-refs: use rb_first_cached for ref_tree · e3d03965
      Liu Bo authored
      rb_first_cached() trades an extra pointer "leftmost" for doing the same
      job as rb_first() but in O(1).
      
      Functions manipulating href->ref_tree need to get the first entry, this
      converts href->ref_tree to use rb_first_cached().
      
      For more details about the optimization see patch "Btrfs: delayed-refs:
      use rb_first_cached for href_root".
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e3d03965
    • Liu Bo's avatar
      Btrfs: delayed-refs: use rb_first_cached for href_root · 5c9d028b
      Liu Bo authored
      rb_first_cached() trades an extra pointer "leftmost" for doing the same
      job as rb_first() but in O(1).
      
      Functions manipulating href_root need to get the first entry, this
      converts href_root to use rb_first_cached().
      
      This patch is first in the sequenct of similar updates to other rbtrees
      and this is analysis of the expected behaviour and improvements.
      
      There's a common pattern:
      
      while (node = rb_first) {
              entry = rb_entry(node)
              next = rb_next(node)
              rb_erase(node)
              cleanup(entry)
      }
      
      rb_first needs to traverse the tree up to logN depth, rb_erase can
      completely reshuffle the tree. With the caching we'll skip the traversal
      in rb_first.  That's a cached memory access vs looped pointer
      dereference trade-off that IMHO has a clear winner.
      
      Measurements show there's not much difference in a sample tree with
      10000 nodes: 4.5s / rb_first and 4.8s / rb_first_cached. Real effects of
      caching and pointer chasing are unpredictable though.
      
      Further optimzations can be done to avoid the expensive rb_erase step.
      In some cases it's ok to process the nodes in any order, so the tree can
      be traversed in post-order, not rebalancing the children nodes and just
      calling free. Care must be taken regarding the next node.
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ update changelog from mail discussions ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5c9d028b
    • Josef Bacik's avatar
      btrfs: wait on caching when putting the bg cache · 3aa7c7a3
      Josef Bacik authored
      While testing my backport I noticed there was a panic if I ran
      generic/416 generic/417 generic/418 all in a row.  This just happened to
      uncover a race where we had outstanding IO after we destroy all of our
      workqueues, and then we'd go to queue the endio work on those free'd
      workqueues.
      
      This is because we aren't waiting for the caching threads to be done
      before freeing everything up, so to fix this make sure we wait on any
      outstanding caching that's being done before we free up the block group,
      so we're sure to be done with all IO by the time we get to
      btrfs_stop_all_workers().  This fixes the panic I was seeing
      consistently in testing.
      
      ------------[ cut here ]------------
      kernel BUG at fs/btrfs/volumes.c:6112!
      SMP PTI
      Modules linked in:
      CPU: 1 PID: 27165 Comm: kworker/u4:7 Not tainted 4.16.0-02155-g3553e54a578d-dirty #875
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
      Workqueue: btrfs-cache btrfs_cache_helper
      RIP: 0010:btrfs_map_bio+0x346/0x370
      RSP: 0000:ffffc900061e79d0 EFLAGS: 00010202
      RAX: 0000000000000000 RBX: ffff880071542e00 RCX: 0000000000533000
      RDX: ffff88006bb74380 RSI: 0000000000000008 RDI: ffff880078160000
      RBP: 0000000000000001 R08: ffff8800781cd200 R09: 0000000000503000
      R10: ffff88006cd21200 R11: 0000000000000000 R12: 0000000000000000
      R13: 0000000000000000 R14: ffff8800781cd200 R15: ffff880071542e00
      FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000000000817ffc4 CR3: 0000000078314000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       btree_submit_bio_hook+0x8a/0xd0
       submit_one_bio+0x5d/0x80
       read_extent_buffer_pages+0x18a/0x320
       btree_read_extent_buffer_pages+0xbc/0x200
       ? alloc_extent_buffer+0x359/0x3e0
       read_tree_block+0x3d/0x60
       read_block_for_search.isra.30+0x1a5/0x360
       btrfs_search_slot+0x41b/0xa10
       btrfs_next_old_leaf+0x212/0x470
       caching_thread+0x323/0x490
       normal_work_helper+0xc5/0x310
       process_one_work+0x141/0x340
       worker_thread+0x44/0x3c0
       kthread+0xf8/0x130
       ? process_one_work+0x340/0x340
       ? kthread_bind+0x10/0x10
       ret_from_fork+0x35/0x40
      RIP: btrfs_map_bio+0x346/0x370 RSP: ffffc900061e79d0
      ---[ end trace 827eb13e50846033 ]---
      Kernel panic - not syncing: Fatal exception
      Kernel Offset: disabled
      ---[ end Kernel panic - not syncing: Fatal exception
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3aa7c7a3
    • Jeff Mahoney's avatar
      btrfs: keep trim from interfering with transaction commits · fee7acc3
      Jeff Mahoney authored
      Commit 499f377f (btrfs: iterate over unused chunk space in FITRIM)
      fixed free space trimming, but introduced latency when it was running.
      This is due to it pinning the transaction using both a incremented
      refcount and holding the commit root sem for the duration of a single
      trim operation.
      
      This was to ensure safety but it's unnecessary.  We already hold the the
      chunk mutex so we know that the chunk we're using can't be allocated
      while we're trimming it.
      
      In order to check against chunks allocated already in this transaction,
      we need to check the pending chunks list.  To to that safely without
      joining the transaction (or attaching than then having to commit it) we
      need to ensure that the dev root's commit root doesn't change underneath
      us and the pending chunk lists stays around until we're done with it.
      
      We can ensure the former by holding the commit root sem and the latter
      by pinning the transaction.  We do this now, but the critical section
      covers the trim operation itself and we don't need to do that.
      
      This patch moves the pinning and unpinning logic into helpers and unpins
      the transaction after performing the search and check for pending
      chunks.
      
      Limiting the critical section of the transaction pinning improves the
      latency substantially on slower storage (e.g. image files over NFS).
      
      Fixes: 499f377f ("btrfs: iterate over unused chunk space in FITRIM")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarJeff Mahoney <jeffm@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fee7acc3
    • Jeff Mahoney's avatar
      btrfs: don't attempt to trim devices that don't support it · 0be88e36
      Jeff Mahoney authored
      We check whether any device the file system is using supports discard in
      the ioctl call, but then we attempt to trim free extents on every device
      regardless of whether discard is supported.  Due to the way we mask off
      EOPNOTSUPP, we can end up issuing the trim operations on each free range
      on devices that don't support it, just wasting time.
      
      Fixes: 499f377f ("btrfs: iterate over unused chunk space in FITRIM")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarJeff Mahoney <jeffm@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0be88e36
    • Jeff Mahoney's avatar
      btrfs: iterate all devices during trim, instead of fs_devices::alloc_list · d4e329de
      Jeff Mahoney authored
      btrfs_trim_fs iterates over the fs_devices->alloc_list while holding the
      device_list_mutex.  The problem is that ->alloc_list is protected by the
      chunk mutex.  We don't want to hold the chunk mutex over the trim of the
      entire file system.  Fortunately, the ->dev_list list is protected by
      the dev_list mutex and while it will give us all devices, including
      read-only devices, we already just skip the read-only devices.  Then we
      can continue to take and release the chunk mutex while scanning each
      device.
      
      Fixes: 499f377f ("btrfs: iterate over unused chunk space in FITRIM")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarJeff Mahoney <jeffm@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d4e329de
    • Qu Wenruo's avatar
      btrfs: Ensure btrfs_trim_fs can trim the whole filesystem · 6ba9fc8e
      Qu Wenruo authored
      [BUG]
      fstrim on some btrfs only trims the unallocated space, not trimming any
      space in existing block groups.
      
      [CAUSE]
      Before fstrim_range passed to btrfs_trim_fs(), it gets truncated to
      range [0, super->total_bytes).  So later btrfs_trim_fs() will only be
      able to trim block groups in range [0, super->total_bytes).
      
      While for btrfs, any bytenr aligned to sectorsize is valid, since btrfs
      uses its logical address space, there is nothing limiting the location
      where we put block groups.
      
      For filesystem with frequent balance, it's quite easy to relocate all
      block groups and bytenr of block groups will start beyond
      super->total_bytes.
      
      In that case, btrfs will not trim existing block groups.
      
      [FIX]
      Just remove the truncation in btrfs_ioctl_fitrim(), so btrfs_trim_fs()
      can get the unmodified range, which is normally set to [0, U64_MAX].
      Reported-by: default avatarChris Murphy <lists@colorremedies.com>
      Fixes: f4c697e6 ("btrfs: return EINVAL if start > total_bytes in fitrim ioctl")
      CC: <stable@vger.kernel.org> # v4.4+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6ba9fc8e
    • Qu Wenruo's avatar
      btrfs: Enhance btrfs_trim_fs function to handle error better · 93bba24d
      Qu Wenruo authored
      Function btrfs_trim_fs() doesn't handle errors in a consistent way. If
      error happens when trimming existing block groups, it will skip the
      remaining blocks and continue to trim unallocated space for each device.
      
      The return value will only reflect the final error from device trimming.
      
      This patch will fix such behavior by:
      
      1) Recording the last error from block group or device trimming
         The return value will also reflect the last error during trimming.
         Make developer more aware of the problem.
      
      2) Continuing trimming if possible
         If we failed to trim one block group or device, we could still try
         the next block group or device.
      
      3) Report number of failures during block group and device trimming
         It would be less noisy, but still gives user a brief summary of
         what's going wrong.
      
      Such behavior can avoid confusion for cases like failure to trim the
      first block group and then only unallocated space is trimmed.
      Reported-by: default avatarChris Murphy <lists@colorremedies.com>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ add bg_ret and dev_ret to the messages ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      93bba24d
    • Jeff Mahoney's avatar
      btrfs: fix error handling in btrfs_dev_replace_start · 5c061471
      Jeff Mahoney authored
      When we fail to start a transaction in btrfs_dev_replace_start, we leave
      dev_replace->replace_start set to STARTED but clear ->srcdev and
      ->tgtdev.  Later, that can result in an Oops in
      btrfs_dev_replace_progress when having state set to STARTED or SUSPENDED
      implies that ->srcdev is valid.
      
      Also fix error handling when the state is already STARTED or SUSPENDED
      while starting.  That, too, will clear ->srcdev and ->tgtdev even though
      it doesn't own them.  This should be an impossible case to hit since we
      should be protected by the BTRFS_FS_EXCL_OP bit being set.  Let's add an
      ASSERT there while we're at it.
      
      Fixes: e93c89c1 (Btrfs: add new sources for device replace code)
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarJeff Mahoney <jeffm@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5c061471
    • zhong jiang's avatar
      btrfs: change remove_extent_mapping to return void · c1766dd7
      zhong jiang authored
      remove_extent_mapping uses the variable "ret" for return value, but it
      is not modified after initialzation. Further, I find that any of the
      callers do not handle the return value and the callees are only simple
      functions so the return values does not need to be passed.
      Signed-off-by: default avatarzhong jiang <zhongjiang@huawei.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c1766dd7
    • Nikolay Borisov's avatar
      btrfs: handle error of get_old_root · 315bed43
      Nikolay Borisov authored
      In btrfs_search_old_slot get_old_root is always used with the assumption
      it cannot fail. However, this is not true in rare circumstance it can
      fail and return null. This will lead to null point dereference when the
      header is read. Fix this by checking the return value and properly
      handling NULL by setting ret to -EIO and returning gracefully.
      
      Coverity-id: 1087503
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarLu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      315bed43
    • Nikolay Borisov's avatar
      btrfs: Remove logically dead code from btrfs_orphan_cleanup · 28bee489
      Nikolay Borisov authored
      In btrfs_orphan_cleanup the final 'if (ret) goto out' cannot ever be
      executed. This is due to the last assignment to 'ret' depending on the
      return value of btrfs_iget. If an error other than -ENOENT is returned
      then the loop is prematurely terminated by 'goto out'.  On the other
      hand, if the error value is ENOENT then a subsequent if branch is
      executed that always re-assigns 'ret' and in case it's an error just
      terminates the loop. No functional changes.
      
      Coverity-id: 1437392
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarLu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      28bee489
    • Liu Bo's avatar
      Btrfs: remove wait_ordered_range in btrfs_evict_inode · 4183c52c
      Liu Bo authored
      When we delete an inode,
      
      btrfs_evict_inode() {
          truncate_inode_pages_final()
              truncate_inode_pages_range()
                  lock_page()
                  truncate_cleanup_page()
                       btrfs_invalidatepage()
                            wait_on_page_writeback
                                 btrfs_lookup_ordered_range()
                       cancel_dirty_page()
                 unlock_page()
           ...
           btrfs_wait_ordered_range()
           ...
      
      As VFS has called ->invalidatepage() to get all ordered extents done (if
      there are any) and truncated all page cache pages (no dirty pages to
      writeback after this step), wait_ordered_range() is just a noop.
      Signed-off-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4183c52c
    • Liu Bo's avatar
      Btrfs: skip set_page_dirty if eb pages are already dirty · abb57ef3
      Liu Bo authored
      As long as @eb is marked with EXTENT_BUFFER_DIRTY, all of its pages
      are dirty, so no need to set pages dirty again.
      
      Ftrace showed that the loop took 10us on my dev box, so removing this
      can save us at least 10us if eb is already dirty and otherwise avoid a
      potentially expensive calls to set_page_dirty.
      Signed-off-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      abb57ef3
    • Liu Bo's avatar
      Btrfs: assert page dirty bit on extent buffer pages · 51995c39
      Liu Bo authored
      Just in case that someone breaks the rule that pages are dirty as long
      as eb is dirty. The next patch will dirty the pages conditionally.
      Signed-off-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      51995c39
    • Liu Bo's avatar
      Btrfs: remove unnecessary level check in balance_level · 98e6b1eb
      Liu Bo authored
      In the callchain:
      
      btrfs_search_slot()
         if (level != 0)
            setup_nodes_for_search()
                balance_level()
      
      It is just impossible to have level=0 in balance_level, we can drop the
      check.
      Signed-off-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      98e6b1eb
    • Liu Bo's avatar
      Btrfs: unify error handling of btrfs_lookup_dir_item · 3cf5068f
      Liu Bo authored
      Unify the error handling of directory item lookups using IS_ERR_OR_NULL.
      No functional changes.
      Signed-off-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3cf5068f
    • Liu Bo's avatar
      Btrfs: use args in the correct order for kcalloc in btrfsic_read_block · 3b2fd801
      Liu Bo authored
      kcalloc is defined as:
      
        kcalloc(size_t n, size_t size, gfp_t flags)
      
      Although this won't cause problems in practice, btrfsic_read_block()
      uses kcalloc with n and size in the opposite order.
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3b2fd801
    • Nikolay Borisov's avatar
      btrfs: Make btrfs_find_device_by_devspec return btrfs_device directly · a27a94c2
      Nikolay Borisov authored
      Instead of returning an error value and using one of the parameters for
      returning the actual object we are interested in just refactor the
      function to directly return btrfs_device *. Also bubble up the error
      handling for the special BTRFS_ERROR_DEV_MISSING_NOT_FOUND value into
      btrfs_rm_device. No functional changes.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a27a94c2
    • Nikolay Borisov's avatar
      btrfs: Make btrfs_find_device_missing_or_by_path return directly a device · 6c050407
      Nikolay Borisov authored
      This function returns a numeric error value and additionally the
      device found in one of its input parameters. Simplify this by making
      the function directly return a pointer to btrfs_device. Additionally
      adjust the caller to handle the case when we want to remove the
      'missing' device and ENOENT is returned to return the expected
      positive error value, parsed by progs. Finally, unexport the function
      since it's not called outside of volume.c. No functional changes.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6c050407
    • Nikolay Borisov's avatar
      btrfs: Make btrfs_find_device_by_path return struct btrfs_device · b444ad46
      Nikolay Borisov authored
      Currently this function returns an error code as well as uses one of
      its arguments as a return value for struct btrfs_device. Change the
      function so that it returns btrfs_device directly and use the usual
      "encode error in pointer" mechanics if something goes wrong. No
      functional changes.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b444ad46
    • Jeff Mahoney's avatar
      btrfs: fix error handling in free_log_tree · 374b0e2d
      Jeff Mahoney authored
      When we hit an I/O error in free_log_tree->walk_log_tree during file system
      shutdown we can crash due to there not being a valid transaction handle.
      
      Use btrfs_handle_fs_error when there's no transaction handle to use.
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
        IP: free_log_tree+0xd2/0x140 [btrfs]
        PGD 0 P4D 0
        Oops: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
        Modules linked in: <modules>
        CPU: 2 PID: 23544 Comm: umount Tainted: G        W        4.12.14-kvmsmall #9 SLE15 (unreleased)
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
        task: ffff96bfd3478880 task.stack: ffffa7cf40d78000
        RIP: 0010:free_log_tree+0xd2/0x140 [btrfs]
        RSP: 0018:ffffa7cf40d7bd10 EFLAGS: 00010282
        RAX: 00000000fffffffb RBX: 00000000fffffffb RCX: 0000000000000002
        RDX: 0000000000000000 RSI: ffff96c02f07d4c8 RDI: 0000000000000282
        RBP: ffff96c013cf1000 R08: ffff96c02f07d4c8 R09: ffff96c02f07d4d0
        R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000000
        R13: ffff96c005e800c0 R14: ffffa7cf40d7bdb8 R15: 0000000000000000
        FS:  00007f17856bcfc0(0000) GS:ffff96c03f600000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000060 CR3: 0000000045ed6002 CR4: 00000000003606e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         ? wait_for_writer+0xb0/0xb0 [btrfs]
         btrfs_free_log+0x17/0x30 [btrfs]
         btrfs_drop_and_free_fs_root+0x9a/0xe0 [btrfs]
         btrfs_free_fs_roots+0xc0/0x130 [btrfs]
         ? wait_for_completion+0xf2/0x100
         close_ctree+0xea/0x2e0 [btrfs]
         ? kthread_stop+0x161/0x260
         generic_shutdown_super+0x6c/0x120
         kill_anon_super+0xe/0x20
         btrfs_kill_super+0x13/0x100 [btrfs]
         deactivate_locked_super+0x3f/0x70
         cleanup_mnt+0x3b/0x70
         task_work_run+0x78/0x90
         exit_to_usermode_loop+0x77/0xa6
         do_syscall_64+0x1c5/0x1e0
         entry_SYSCALL_64_after_hwframe+0x42/0xb7
        RIP: 0033:0x7f1784f90827
        RSP: 002b:00007ffdeeb03118 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        RAX: 0000000000000000 RBX: 0000556a60c62970 RCX: 00007f1784f90827
        RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556a60c62b50
        RBP: 0000000000000000 R08: 0000000000000005 R09: 00000000ffffffff
        R10: 0000556a60c63900 R11: 0000000000000246 R12: 0000556a60c62b50
        R13: 00007f17854a81c4 R14: 0000000000000000 R15: 0000000000000000
        RIP: free_log_tree+0xd2/0x140 [btrfs] RSP: ffffa7cf40d7bd10
        CR2: 0000000000000060
      
      Fixes: 681ae509 ("Btrfs: cleanup reserved space when freeing tree log on error")
      CC: <stable@vger.kernel.org> # v3.13
      Signed-off-by: default avatarJeff Mahoney <jeffm@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      374b0e2d
    • Misono Tomohiro's avatar
      btrfs: remove redundant variable from btrfs_cross_ref_exist · 380fd066
      Misono Tomohiro authored
      Since commit d7df2c79 ("Btrfs attach delayed ref updates to delayed
      ref heads"), check_delayed_ref() won't return -ENOENT.
      
      In btrfs_cross_ref_exist(), two variables 'ret' and 'ret2' are
      originally used to handle -ENOENT error case.  Since the code is not
      needed anymore, let's just remove 'ret2'.
      Signed-off-by: default avatarMisono Tomohiro <misono.tomohiro@jp.fujitsu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      380fd066
    • Liu Bo's avatar
      Btrfs: set leave_spinning in btrfs_get_extent · e49aabd9
      Liu Bo authored
      Unless it's going to read inline extents from btree leaf to page,
      btrfs_get_extent won't sleep during the period of holding path lock.
      
      This sets leave_spinning at first and sets path to blocking mode right
      before reading inline extent if that's the case.  The benefit is that a
      path in spinning mode typically has lower impact (faster) on waiters
      rather than that in the blocking mode.
      Signed-off-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e49aabd9
    • Liu Bo's avatar
      Btrfs: fix alignment in declaration and prototype of btrfs_get_extent · de2c6615
      Liu Bo authored
      This fixes btrfs_get_extent to be consistent with our existing
      declaration style.
      
      Note: For the record, indentation styles that are accepted are both,
      aligning under the opening ( and tab or double tab indentation on the
      next line.  Preferrably not spliting the type or long expressions in the
      argument lists.
      Signed-off-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      [ add note ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      de2c6615
    • Colin Ian King's avatar
      btrfs: remove unused pointer 'tree' in btrfs_submit_compressed_read · 29c5e5d4
      Colin Ian King authored
      Pointer 'tree' is being assigned but is never used hence it is redundant
      and can be removed. This is a leftover from cleanup patch
      00032d38 ("btrfs: drop extent_io_ops::merge_bio_hook
      callback").
      
      Cleans up clang warning:
      
        warning: variable 'tree' set but not used [-Wunused-but-set-variable]
      Signed-off-by: default avatarColin Ian King <colin.king@canonical.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      29c5e5d4
    • Colin Ian King's avatar
      btrfs: remove unused pointer inode in relink_file_extents · d005dbec
      Colin Ian King authored
      Pointer inode is being assigned but is never used hence it is redundant
      and can be removed. It's been unused since the introduction in
      38c227d8 ("Btrfs: snapshot-aware defrag").
      
      Cleans up clang warning:
      
        variable ‘inode’ set but not used [-Wunused-but-set-variable]
      Signed-off-by: default avatarColin Ian King <colin.king@canonical.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d005dbec
    • Su Yue's avatar
      btrfs: defrag: use btrfs_mod_outstanding_extents in cluster_pages_for_defrag · 28c4a3e2
      Su Yue authored
      Since commit 8b62f87b ("Btrfs: rework outstanding_extents"),
      manual operations of outstanding_extent in btrfs_inode are replaced by
      btrfs_mod_outstanding_extents().
      The one in cluster_pages_for_defrag seems to be lost, so replace it
      of btrfs_mod_outstanding_extents().
      
      Fixes: 8b62f87b ("Btrfs: rework outstanding_extents")
      Signed-off-by: default avatarSu Yue <suy.fnst@cn.fujitsu.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      28c4a3e2
    • Liu Bo's avatar
      Btrfs: remove confusing tracepoint in btrfs_add_reserved_bytes · 6aadd9eb
      Liu Bo authored
      Here we're not releasing any space, but transferring bytes from
      ->bytes_may_use to ->bytes_reserved. The last change to the code in
      commit 18513091 ("btrfs: update btrfs_space_info's
      bytes_may_use timely") removed a conditional tracepoint and the logic
      changed too but the tracepiont remained.
      Signed-off-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6aadd9eb
    • Liu Bo's avatar
      btrfs: free path at an earlier point in btrfs_get_extent · c6414280
      Liu Bo authored
      trace_btrfs_get_extent() has nothing to do with path, place
      btrfs_free_path ahead so that we can unlock path on error.
      Signed-off-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c6414280
    • Liu Bo's avatar
      Btrfs: use next_state in find_first_extent_bit · 9688e9a9
      Liu Bo authored
      As next_state() is already defined to get the next state, use it in
      find_first_extent_bit. No functional changes.
      Signed-off-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9688e9a9
    • Qu Wenruo's avatar
      btrfs: locking: Add extra check in btrfs_init_new_buffer() to avoid deadlock · b72c3aba
      Qu Wenruo authored
      [BUG]
      For certain crafted image, whose csum root leaf has missing backref, if
      we try to trigger write with data csum, it could cause deadlock with the
      following kernel WARN_ON():
      
        WARNING: CPU: 1 PID: 41 at fs/btrfs/locking.c:230 btrfs_tree_lock+0x3e2/0x400
        CPU: 1 PID: 41 Comm: kworker/u4:1 Not tainted 4.18.0-rc1+ #8
        Workqueue: btrfs-endio-write btrfs_endio_write_helper
        RIP: 0010:btrfs_tree_lock+0x3e2/0x400
        Call Trace:
         btrfs_alloc_tree_block+0x39f/0x770
         __btrfs_cow_block+0x285/0x9e0
         btrfs_cow_block+0x191/0x2e0
         btrfs_search_slot+0x492/0x1160
         btrfs_lookup_csum+0xec/0x280
         btrfs_csum_file_blocks+0x2be/0xa60
         add_pending_csums+0xaf/0xf0
         btrfs_finish_ordered_io+0x74b/0xc90
         finish_ordered_fn+0x15/0x20
         normal_work_helper+0xf6/0x500
         btrfs_endio_write_helper+0x12/0x20
         process_one_work+0x302/0x770
         worker_thread+0x81/0x6d0
         kthread+0x180/0x1d0
         ret_from_fork+0x35/0x40
      
      [CAUSE]
      That crafted image has missing backref for csum tree root leaf.  And
      when we try to allocate new tree block, since there is no
      EXTENT/METADATA_ITEM for csum tree root, btrfs consider it's free slot
      and use it.
      
      The extent tree of the image looks like:
      
        Normal image                      |       This fuzzed image
        ----------------------------------+--------------------------------
        BG 29360128                       | BG 29360128
         One empty slot                   |  One empty slot
        29364224: backref to UUID tree    | 29364224: backref to UUID tree
         Two empty slots                  |  Two empty slots
        29376512: backref to CSUM tree    |  One empty slot (bad type) <<<
        29380608: backref to D_RELOC tree | 29380608: backref to D_RELOC tree
        ...                               | ...
      
      Since bytenr 29376512 has no METADATA/EXTENT_ITEM, when btrfs try to
      alloc tree block, it's an valid slot for btrfs.
      
      And for finish_ordered_write, when we need to insert csum, we try to CoW
      csum tree root.
      
      By accident, empty slots at bytenr BG_OFFSET, BG_OFFSET + 8K,
      BG_OFFSET + 12K is already used by tree block COW for other trees, the
      next empty slot is BG_OFFSET + 16K, which should be the backref for CSUM
      tree.
      
      But due to the bad type, btrfs can recognize it and still consider it as
      an empty slot, and will try to use it for csum tree CoW.
      
      Then in the following call trace, we will try to lock the new tree
      block, which turns out to be the old csum tree root which is already
      locked:
      
      btrfs_search_slot() called on csum tree root, which is at 29376512
      |- btrfs_cow_block()
         |- btrfs_set_lock_block()
         |  |- Now locks tree block 29376512 (old csum tree root)
         |- __btrfs_cow_block()
            |- btrfs_alloc_tree_block()
               |- btrfs_reserve_extent()
                  | Now it returns tree block 29376512, which extent tree
                  | shows its empty slot, but it's already hold by csum tree
                  |- btrfs_init_new_buffer()
                     |- btrfs_tree_lock()
                        | Triggers WARN_ON(eb->lock_owner == current->pid)
                        |- wait_event()
                           Wait lock owner to release the lock, but it's
                           locked by ourself, so it will deadlock
      
      [FIX]
      This patch will do the lock_owner and current->pid check at
      btrfs_init_new_buffer().
      So above deadlock can be avoided.
      
      Since such problem can only happen in crafted image, we will still
      trigger kernel warning for later aborted transaction, but with a little
      more meaningful warning message.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=200405Reported-by: default avatarXu Wen <wen.xu@gatech.edu>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b72c3aba
    • Qu Wenruo's avatar
      btrfs: Handle owner mismatch gracefully when walking up tree · 65c6e82b
      Qu Wenruo authored
      [BUG]
      When mounting certain crafted image, btrfs will trigger kernel BUG_ON()
      when trying to recover balance:
      
        kernel BUG at fs/btrfs/extent-tree.c:8956!
        invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
        CPU: 1 PID: 662 Comm: mount Not tainted 4.18.0-rc1-custom+ #10
        RIP: 0010:walk_up_proc+0x336/0x480 [btrfs]
        RSP: 0018:ffffb53540c9b890 EFLAGS: 00010202
        Call Trace:
         walk_up_tree+0x172/0x1f0 [btrfs]
         btrfs_drop_snapshot+0x3a4/0x830 [btrfs]
         merge_reloc_roots+0xe1/0x1d0 [btrfs]
         btrfs_recover_relocation+0x3ea/0x420 [btrfs]
         open_ctree+0x1af3/0x1dd0 [btrfs]
         btrfs_mount_root+0x66b/0x740 [btrfs]
         mount_fs+0x3b/0x16a
         vfs_kern_mount.part.9+0x54/0x140
         btrfs_mount+0x16d/0x890 [btrfs]
         mount_fs+0x3b/0x16a
         vfs_kern_mount.part.9+0x54/0x140
         do_mount+0x1fd/0xda0
         ksys_mount+0xba/0xd0
         __x64_sys_mount+0x21/0x30
         do_syscall_64+0x60/0x210
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      [CAUSE]
      Extent tree corruption.  In this particular case, reloc tree root's
      owner is DATA_RELOC_TREE (should be TREE_RELOC), thus its backref is
      corrupted and we failed the owner check in walk_up_tree().
      
      [FIX]
      It's pretty hard to take care of every extent tree corruption, but at
      least we can remove such BUG_ON() and exit more gracefully.
      
      And since in this particular image, DATA_RELOC_TREE and TREE_RELOC share
      the same root (which is obviously invalid), we needs to make
      __del_reloc_root() more robust to detect such invalid sharing to avoid
      possible NULL dereference as root->node can be NULL in this case.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=200411Reported-by: default avatarXu Wen <wen.xu@gatech.edu>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      65c6e82b
    • zhong jiang's avatar
      btrfs: change btrfs_pin_log_trans to return void · 45128b08
      zhong jiang authored
      btrfs_pin_log_trans defines the variable "ret" for return value, but it
      is not modified after initialization. Further, I find that none of the
      callers do handles the return value, so it is safe to drop the unneeded
      "ret" and make it return void.
      Signed-off-by: default avatarzhong jiang <zhongjiang@huawei.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      45128b08
    • zhong jiang's avatar
      btrfs: change btrfs_free_reserved_bytes to return void · 556f3ca8
      zhong jiang authored
      btrfs_free_reserved_bytes uses the variable "ret" for return value,
      but it is not modified after initialzation. Further, I find that any of
      the callers do not handle the return value, so it is safe to drop the
      unneeded "ret" and return void. There are no callees that would need the
      function to handle or pass the value either.
      Signed-off-by: default avatarzhong jiang <zhongjiang@huawei.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      556f3ca8
    • Liu Bo's avatar
      Btrfs: remove always true if branch in btrfs_get_extent · bee6ec82
      Liu Bo authored
      @path is always NULL when it comes to the if branch.
      Signed-off-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bee6ec82
    • Qu Wenruo's avatar
      btrfs: qgroup: Dirty all qgroups before rescan · 9c7b0c2e
      Qu Wenruo authored
      [BUG]
      In the following case, rescan won't zero out the number of qgroup 1/0:
      
        $ mkfs.btrfs -fq $DEV
        $ mount $DEV /mnt
      
        $ btrfs quota enable /mnt
        $ btrfs qgroup create 1/0 /mnt
        $ btrfs sub create /mnt/sub
        $ btrfs qgroup assign 0/257 1/0 /mnt
      
        $ dd if=/dev/urandom of=/mnt/sub/file bs=1k count=1000
        $ btrfs sub snap /mnt/sub /mnt/snap
        $ btrfs quota rescan -w /mnt
        $ btrfs qgroup show -pcre /mnt
        qgroupid         rfer         excl     max_rfer     max_excl parent  child
        --------         ----         ----     --------     -------- ------  -----
        0/5          16.00KiB     16.00KiB         none         none ---     ---
        0/257      1016.00KiB     16.00KiB         none         none 1/0     ---
        0/258      1016.00KiB     16.00KiB         none         none ---     ---
        1/0        1016.00KiB     16.00KiB         none         none ---     0/257
      
      So far so good, but:
      
        $ btrfs qgroup remove 0/257 1/0 /mnt
        WARNING: quotas may be inconsistent, rescan needed
        $ btrfs quota rescan -w /mnt
        $ btrfs qgroup show -pcre  /mnt
        qgoupid         rfer         excl     max_rfer     max_excl parent  child
        --------         ----         ----     --------     -------- ------  -----
        0/5          16.00KiB     16.00KiB         none         none ---     ---
        0/257      1016.00KiB     16.00KiB         none         none ---     ---
        0/258      1016.00KiB     16.00KiB         none         none ---     ---
        1/0        1016.00KiB     16.00KiB         none         none ---     ---
      	     ^^^^^^^^^^     ^^^^^^^^ not cleared
      
      [CAUSE]
      Before rescan we call qgroup_rescan_zero_tracking() to zero out all
      qgroups' accounting numbers.
      
      However we don't mark all qgroups dirty, but rely on rescan to do so.
      
      If we have any high level qgroup without children, it won't be marked
      dirty during rescan, since we cannot reach that qgroup.
      
      This will cause QGROUP_INFO items of childless qgroups never get updated
      in the quota tree, thus their numbers will stay the same in "btrfs
      qgroup show" output.
      
      [FIX]
      Just mark all qgroups dirty in qgroup_rescan_zero_tracking(), so even if
      we have childless qgroups, their QGROUP_INFO items will still get
      updated during rescan.
      Reported-by: default avatarMisono Tomohiro <misono.tomohiro@jp.fujitsu.com>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarMisono Tomohiro <misono.tomohiro@jp.fujitsu.com>
      Tested-by: default avatarMisono Tomohiro <misono.tomohiro@jp.fujitsu.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9c7b0c2e