1. 03 Nov, 2015 2 commits
    • chandan's avatar
      Btrfs: find_free_extent: Do not erroneously skip LOOP_CACHING_WAIT state · 13a0db5a
      chandan authored
      When executing generic/001 in a loop on a ppc64 machine (with both sectorsize
      and nodesize set to 64k), the following call trace is observed,
      
      WARNING: at /root/repos/linux/fs/btrfs/locking.c:253
      Modules linked in:
      CPU: 2 PID: 8353 Comm: umount Not tainted 4.3.0-rc5-13676-ga5e681d9 #54
      task: c0000000f2b1f560 ti: c0000000f6008000 task.ti: c0000000f6008000
      NIP: c000000000520c88 LR: c0000000004a3b34 CTR: 0000000000000000
      REGS: c0000000f600a820 TRAP: 0700   Not tainted  (4.3.0-rc5-13676-ga5e681d9)
      MSR: 8000000102029032 <SF,VEC,EE,ME,IR,DR,RI>  CR: 24444884  XER: 00000000
      CFAR: c0000000004a3b30 SOFTE: 1
      GPR00: c0000000004a3b34 c0000000f600aaa0 c00000000108ac00 c0000000f5a808c0
      GPR04: 0000000000000000 c0000000f600ae60 0000000000000000 0000000000000005
      GPR08: 00000000000020a1 0000000000000001 c0000000f2b1f560 0000000000000030
      GPR12: 0000000084842882 c00000000fdc0900 c0000000f600ae60 c0000000f070b800
      GPR16: 0000000000000000 c0000000f3c8a000 0000000000000000 0000000000000049
      GPR20: 0000000000000001 0000000000000001 c0000000f5aa01f8 0000000000000000
      GPR24: 0f83e0f83e0f83e1 c0000000f5a808c0 c0000000f3c8d000 c000000000000000
      GPR28: c0000000f600ae74 0000000000000001 c0000000f3c8d000 c0000000f5a808c0
      NIP [c000000000520c88] .btrfs_tree_lock+0x48/0x2a0
      LR [c0000000004a3b34] .btrfs_lock_root_node+0x44/0x80
      Call Trace:
      [c0000000f600aaa0] [c0000000f600ab80] 0xc0000000f600ab80 (unreliable)
      [c0000000f600ab80] [c0000000004a3b34] .btrfs_lock_root_node+0x44/0x80
      [c0000000f600ac00] [c0000000004a99dc] .btrfs_search_slot+0xa8c/0xc00
      [c0000000f600ad40] [c0000000004ab878] .btrfs_insert_empty_items+0x98/0x120
      [c0000000f600adf0] [c00000000050da44] .btrfs_finish_chunk_alloc+0x1d4/0x620
      [c0000000f600af20] [c0000000004be854] .btrfs_create_pending_block_groups+0x1d4/0x2c0
      [c0000000f600b020] [c0000000004bf188] .do_chunk_alloc+0x3c8/0x420
      [c0000000f600b100] [c0000000004c27cc] .find_free_extent+0xbfc/0x1030
      [c0000000f600b260] [c0000000004c2ce8] .btrfs_reserve_extent+0xe8/0x250
      [c0000000f600b330] [c0000000004c2f90] .btrfs_alloc_tree_block+0x140/0x590
      [c0000000f600b440] [c0000000004a47b4] .__btrfs_cow_block+0x124/0x780
      [c0000000f600b530] [c0000000004a4fc0] .btrfs_cow_block+0xf0/0x250
      [c0000000f600b5e0] [c0000000004a917c] .btrfs_search_slot+0x22c/0xc00
      [c0000000f600b720] [c00000000050aa40] .btrfs_remove_chunk+0x1b0/0x9f0
      [c0000000f600b850] [c0000000004c4e04] .btrfs_delete_unused_bgs+0x434/0x570
      [c0000000f600b950] [c0000000004d3cb8] .close_ctree+0x2e8/0x3b0
      [c0000000f600ba20] [c00000000049d178] .btrfs_put_super+0x18/0x30
      [c0000000f600ba90] [c000000000243cd4] .generic_shutdown_super+0xa4/0x1a0
      [c0000000f600bb10] [c0000000002441d8] .kill_anon_super+0x18/0x30
      [c0000000f600bb90] [c00000000049c898] .btrfs_kill_super+0x18/0xc0
      [c0000000f600bc10] [c0000000002444f8] .deactivate_locked_super+0x98/0xe0
      [c0000000f600bc90] [c000000000269f94] .cleanup_mnt+0x54/0xa0
      [c0000000f600bd10] [c0000000000bd744] .task_work_run+0xc4/0x100
      [c0000000f600bdb0] [c000000000016334] .do_notify_resume+0x74/0x80
      [c0000000f600be30] [c0000000000098b8] .ret_from_except_lite+0x64/0x68
      Instruction dump:
      fba1ffe8 fbc1fff0 fbe1fff8 7c791b78 f8010010 f821ff21 e94d0290 81030040
      812a04e8 7d094a78 7d290034 5529d97e <0b090000> 3b400000 3be30050 3bc3004c
      
      The above call trace is seen even on x86_64; albeit very rarely and that too
      with nodesize set to 64k and with nospace_cache mount option being used.
      
      The reason for the above call trace is,
      btrfs_remove_chunk
        check_system_chunk
          Allocate chunk if required
        For each physical stripe on underlying device,
          btrfs_free_dev_extent
            ...
            Take lock on Device tree's root node
            btrfs_cow_block("dev tree's root node");
              btrfs_reserve_extent
                find_free_extent
      	    index = BTRFS_RAID_DUP;
      	    have_caching_bg = false;
      
                  When in LOOP_CACHING_NOWAIT state, Assume we find a block group
      	    which is being cached; Hence have_caching_bg is set to true
      
                  When repeating the search for the next RAID index, we set
      	    have_caching_bg to false.
      
      Hence right after completing the LOOP_CACHING_NOWAIT state, we incorrectly
      skip LOOP_CACHING_WAIT state and move to LOOP_ALLOC_CHUNK state where we
      allocate a chunk and try to add entries corresponding to the chunk's physical
      stripe into the device tree. When doing so the task deadlocks itself waiting
      for the blocking lock on the root node of the device tree.
      
      This commit fixes the issue by introducing a new local variable to help
      indicate as to whether a block group of any RAID type is being cached.
      Signed-off-by: default avatarChandan Rajendra <chandan@linux.vnet.ibm.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      13a0db5a
    • Qu Wenruo's avatar
      btrfs: Fix a data space underflow warning · 485290a7
      Qu Wenruo authored
      Even with quota disabled, generic/127 will trigger a kernel warning by
      underflow data space info.
      
      The bug is caused by buffered write, which in case of short copy, the
      start parameter for btrfs_delalloc_release_space() is wrong, and
      round_up/down() in btrfs_delalloc_release() extents the range to page
      aligned, decreasing one more page than expected.
      
      This patch will fix it by passing correct start.
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      485290a7
  2. 27 Oct, 2015 10 commits
  3. 25 Oct, 2015 2 commits
    • Filipe Manana's avatar
      Btrfs: fix regression running delayed references when using qgroups · b06c4bf5
      Filipe Manana authored
      In the kernel 4.2 merge window we had a big changes to the implementation
      of delayed references and qgroups which made the no_quota field of delayed
      references not used anymore. More specifically the no_quota field is not
      used anymore as of:
      
        commit 0ed4792a ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.")
      
      Leaving the no_quota field actually prevents delayed references from
      getting merged, which in turn cause the following BUG_ON(), at
      fs/btrfs/extent-tree.c, to be hit when qgroups are enabled:
      
        static int run_delayed_tree_ref(...)
        {
           (...)
           BUG_ON(node->ref_mod != 1);
           (...)
        }
      
      This happens on a scenario like the following:
      
        1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
      
        2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
           It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota.
      
        3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
           It's not merged with the reference at the tail of the list of refs
           for bytenr X because the reference at the tail, Ref2 is incompatible
           due to Ref2->no_quota != Ref3->no_quota.
      
        4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
           It's not merged with the reference at the tail of the list of refs
           for bytenr X because the reference at the tail, Ref3 is incompatible
           due to Ref3->no_quota != Ref4->no_quota.
      
        5) We run delayed references, trigger merging of delayed references,
           through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs().
      
        6) Ref1 and Ref3 are merged as Ref1->no_quota = Ref3->no_quota and
           all other conditions are satisfied too. So Ref1 gets a ref_mod
           value of 2.
      
        7) Ref2 and Ref4 are merged as Ref2->no_quota = Ref4->no_quota and
           all other conditions are satisfied too. So Ref2 gets a ref_mod
           value of 2.
      
        8) Ref1 and Ref2 aren't merged, because they have different values
           for their no_quota field.
      
        9) Delayed reference Ref1 is picked for running (select_delayed_ref()
           always prefers references with an action == BTRFS_ADD_DELAYED_REF).
           So run_delayed_tree_ref() is called for Ref1 which triggers the
           BUG_ON because Ref1->red_mod != 1 (equals 2).
      
      So fix this by removing the no_quota field, as it's not used anymore as
      of commit 0ed4792a ("btrfs: qgroup: Switch to new extent-oriented
      qgroup mechanism.").
      
      The use of no_quota was also buggy in at least two places:
      
      1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting
         no_quota to 0 instead of 1 when the following condition was true:
         is_fstree(ref_root) || !fs_info->quota_enabled
      
      2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to
         reset a node's no_quota when the condition "!is_fstree(root_objectid)
         || !root->fs_info->quota_enabled" was true but we did it only in
         an unused local stack variable, that is, we never reset the no_quota
         value in the node itself.
      
      This fixes the remainder of problems several people have been having when
      running delayed references, mostly while a balance is running in parallel,
      on a 4.2+ kernel.
      
      Very special thanks to Stéphane Lesimple for helping debugging this issue
      and testing this fix on his multi terabyte filesystem (which took more
      than one day to balance alone, plus fsck, etc).
      
      Also, this fixes deadlock issue when using the clone ioctl with qgroups
      enabled, as reported by Elias Probst in the mailing list. The deadlock
      happens because after calling btrfs_insert_empty_item we have our path
      holding a write lock on a leaf of the fs/subvol tree and then before
      releasing the path we called check_ref() which did backref walking, when
      qgroups are enabled, and tried to read lock the same leaf. The trace for
      this case is the following:
      
        INFO: task systemd-nspawn:6095 blocked for more than 120 seconds.
        (...)
        Call Trace:
          [<ffffffff86999201>] schedule+0x74/0x83
          [<ffffffff863ef64c>] btrfs_tree_read_lock+0xc0/0xea
          [<ffffffff86137ed7>] ? wait_woken+0x74/0x74
          [<ffffffff8639f0a7>] btrfs_search_old_slot+0x51a/0x810
          [<ffffffff863a129b>] btrfs_next_old_leaf+0xdf/0x3ce
          [<ffffffff86413a00>] ? ulist_add_merge+0x1b/0x127
          [<ffffffff86411688>] __resolve_indirect_refs+0x62a/0x667
          [<ffffffff863ef546>] ? btrfs_clear_lock_blocking_rw+0x78/0xbe
          [<ffffffff864122d3>] find_parent_nodes+0xaf3/0xfc6
          [<ffffffff86412838>] __btrfs_find_all_roots+0x92/0xf0
          [<ffffffff864128f2>] btrfs_find_all_roots+0x45/0x65
          [<ffffffff8639a75b>] ? btrfs_get_tree_mod_seq+0x2b/0x88
          [<ffffffff863e852e>] check_ref+0x64/0xc4
          [<ffffffff863e9e01>] btrfs_clone+0x66e/0xb5d
          [<ffffffff863ea77f>] btrfs_ioctl_clone+0x48f/0x5bb
          [<ffffffff86048a68>] ? native_sched_clock+0x28/0x77
          [<ffffffff863ed9b0>] btrfs_ioctl+0xabc/0x25cb
        (...)
      
      The problem goes away by eleminating check_ref(), which no longer is
      needed as its purpose was to get a value for the no_quota field of
      a delayed reference (this patch removes the no_quota field as mentioned
      earlier).
      Reported-by: default avatarStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Tested-by: default avatarStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Reported-by: default avatarElias Probst <mail@eliasprobst.eu>
      Reported-by: default avatarPeter Becker <floyd.net@gmail.com>
      Reported-by: default avatarMalte Schröder <malte@tnxip.de>
      Reported-by: default avatarDerek Dongray <derek@valedon.co.uk>
      Reported-by: default avatarErkki Seppala <flux-btrfs@inside.org>
      Cc: stable@vger.kernel.org  # 4.2+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      b06c4bf5
    • Filipe Manana's avatar
      Btrfs: fix regression when running delayed references · 2c3cf7d5
      Filipe Manana authored
      In the kernel 4.2 merge window we had a refactoring/rework of the delayed
      references implementation in order to fix certain problems with qgroups.
      However that rework introduced one more regression that leads to the
      following trace when running delayed references for metadata:
      
      [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
      [35908.065201] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc psmouse i2
      [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
      [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
      [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
      [35908.065201] task: ffff880114b7d780 ti: ffff88010c4c8000 task.ti: ffff88010c4c8000
      [35908.065201] RIP: 0010:[<ffffffffa04928b5>]  [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
      [35908.065201] RSP: 0018:ffff88010c4cbb08  EFLAGS: 00010293
      [35908.065201] RAX: 0000000000000000 RBX: ffff88008a661000 RCX: 0000000000000000
      [35908.065201] RDX: ffffffffa04dd58f RSI: 0000000000000001 RDI: 0000000000000000
      [35908.065201] RBP: ffff88010c4cbb40 R08: 0000000000001000 R09: ffff88010c4cb9f8
      [35908.065201] R10: 0000000000000000 R11: 000000000000002c R12: 0000000000000000
      [35908.065201] R13: ffff88020a74c578 R14: 0000000000000000 R15: 0000000000000000
      [35908.065201] FS:  0000000000000000(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000
      [35908.065201] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [35908.065201] CR2: 00000000015e8708 CR3: 0000000102185000 CR4: 00000000000006e0
      [35908.065201] Stack:
      [35908.065201]  ffff88010c4cbb18 0000000000000f37 ffff88020a74c578 ffff88015a408000
      [35908.065201]  ffff880154a44000 0000000000000000 0000000000000005 ffff88010c4cbbd8
      [35908.065201]  ffffffffa0492b9a 0000000000000005 0000000000000000 0000000000000000
      [35908.065201] Call Trace:
      [35908.065201]  [<ffffffffa0492b9a>] __btrfs_inc_extent_ref+0x8b/0x208 [btrfs]
      [35908.065201]  [<ffffffffa0497117>] ? __btrfs_run_delayed_refs+0x4d4/0xd33 [btrfs]
      [35908.065201]  [<ffffffffa049773d>] __btrfs_run_delayed_refs+0xafa/0xd33 [btrfs]
      [35908.065201]  [<ffffffffa04a976a>] ? join_transaction.isra.10+0x25/0x41f [btrfs]
      [35908.065201]  [<ffffffffa04a97ed>] ? join_transaction.isra.10+0xa8/0x41f [btrfs]
      [35908.065201]  [<ffffffffa049914d>] btrfs_run_delayed_refs+0x75/0x1dd [btrfs]
      [35908.065201]  [<ffffffffa04992f1>] delayed_ref_async_start+0x3c/0x7b [btrfs]
      [35908.065201]  [<ffffffffa04d4b4f>] normal_work_helper+0x14c/0x32a [btrfs]
      [35908.065201]  [<ffffffffa04d4e93>] btrfs_extent_refs_helper+0x12/0x14 [btrfs]
      [35908.065201]  [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
      [35908.065201]  [<ffffffff81064285>] worker_thread+0x206/0x2c2
      [35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
      [35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
      [35908.065201]  [<ffffffff8106904d>] kthread+0xef/0xf7
      [35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
      [35908.065201]  [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
      [35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
      [35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8 48 8d 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 77 02 <0f> 0b 4c 8b 45 30 8b 4d 28 45 31
      [35908.065201] RIP  [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
      [35908.065201]  RSP <ffff88010c4cbb08>
      [35908.310885] ---[ end trace fe4299baf0666457 ]---
      
      This happens because the new delayed references code no longer merges
      delayed references that have different sequence values. The following
      steps are an example sequence leading to this issue:
      
      1) Transaction N starts, fs_info->tree_mod_seq has value 0;
      
      2) Extent buffer (btree node) A is allocated, delayed reference Ref1 for
         bytenr A is created, with a value of 1 and a seq value of 0;
      
      3) fs_info->tree_mod_seq is incremented to 1;
      
      4) Extent buffer A is deleted through btrfs_del_items(), which calls
         btrfs_del_leaf(), which in turn calls btrfs_free_tree_block(). The
         later returns the metadata extent associated to extent buffer A to
         the free space cache (the range is not pinned), because the extent
         buffer was created in the current transaction (N) and writeback never
         happened for the extent buffer (flag BTRFS_HEADER_FLAG_WRITTEN not set
         in the extent buffer).
         This creates the delayed reference Ref2 for bytenr A, with a value
         of -1 and a seq value of 1;
      
      5) Delayed reference Ref2 is not merged with Ref1 when we create it,
         because they have different sequence numbers (decided at
         add_delayed_ref_tail_merge());
      
      6) fs_info->tree_mod_seq is incremented to 2;
      
      7) Some task attempts to allocate a new extent buffer (done at
         extent-tree.c:find_free_extent()), but due to heavy fragmentation
         and running low on metadata space the clustered allocation fails
         and we fall back to unclustered allocation, which finds the
         extent at offset A, so a new extent buffer at offset A is allocated.
         This creates delayed reference Ref3 for bytenr A, with a value of 1
         and a seq value of 2;
      
      8) Ref3 is not merged neither with Ref2 nor Ref1, again because they
         all have different seq values;
      
      9) We start running the delayed references (__btrfs_run_delayed_refs());
      
      10) The delayed Ref1 is the first one being applied, which ends up
          creating an inline extent backref in the extent tree;
      
      10) Next the delayed reference Ref3 is selected for execution, and not
          Ref2, because select_delayed_ref() always gives a preference for
          positive references (that have an action of BTRFS_ADD_DELAYED_REF);
      
      11) When running Ref3 we encounter alreay the inline extent backref
          in the extent tree at insert_inline_extent_backref(), which makes
          us hit the following BUG_ON:
      
              BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID);
      
          This is always true because owner corresponds to the level of the
          extent buffer/btree node in the btree.
      
      For the scenario described above we hit the BUG_ON because we never merge
      references that have different seq values.
      
      We used to do the merging before the 4.2 kernel, more specifically, before
      the commmits:
      
        c6fc2454 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
        c43d160f ("btrfs: delayed-ref: Cleanup the unneeded functions.")
      
      This issue became more exposed after the following change that was added
      to 4.2 as well:
      
        cffc3374 ("Btrfs: fix order by which delayed references are run")
      
      Which in turn fixed another regression by the two commits previously
      mentioned.
      
      So fix this by bringing back the delayed reference merge code, with the
      proper adaptations so that it operates against the new data structure
      (linked list vs old red black tree implementation).
      
      This issue was hit running fstest btrfs/063 in a loop. Several people have
      reported this issue in the mailing list when running on kernels 4.2+.
      
      Very special thanks to Stéphane Lesimple for helping debugging this issue
      and testing this fix on his multi terabyte filesystem (which took more
      than one day to balance alone, plus fsck, etc).
      
      Fixes: c6fc2454 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
      Reported-by: default avatarPeter Becker <floyd.net@gmail.com>
      Reported-by: default avatarStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Tested-by: default avatarStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Reported-by: default avatarMalte Schröder <malte@tnxip.de>
      Reported-by: default avatarDerek Dongray <derek@valedon.co.uk>
      Reported-by: default avatarErkki Seppala <flux-btrfs@inside.org>
      Cc: stable@vger.kernel.org  # 4.2+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      2c3cf7d5
  4. 22 Oct, 2015 26 commits