• Filipe Manana's avatar
    Btrfs: fix regression when running delayed references · 480d3c86
    Filipe Manana authored
    commit 2c3cf7d5 upstream.
    
    In the kernel 4.2 merge window we had a refactoring/rework of the delayed
    references implementation in order to fix certain problems with qgroups.
    However that rework introduced one more regression that leads to the
    following trace when running delayed references for metadata:
    
    [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
    [35908.065201] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc psmouse i2
    [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
    [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
    [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
    [35908.065201] task: ffff880114b7d780 ti: ffff88010c4c8000 task.ti: ffff88010c4c8000
    [35908.065201] RIP: 0010:[<ffffffffa04928b5>]  [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
    [35908.065201] RSP: 0018:ffff88010c4cbb08  EFLAGS: 00010293
    [35908.065201] RAX: 0000000000000000 RBX: ffff88008a661000 RCX: 0000000000000000
    [35908.065201] RDX: ffffffffa04dd58f RSI: 0000000000000001 RDI: 0000000000000000
    [35908.065201] RBP: ffff88010c4cbb40 R08: 0000000000001000 R09: ffff88010c4cb9f8
    [35908.065201] R10: 0000000000000000 R11: 000000000000002c R12: 0000000000000000
    [35908.065201] R13: ffff88020a74c578 R14: 0000000000000000 R15: 0000000000000000
    [35908.065201] FS:  0000000000000000(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000
    [35908.065201] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [35908.065201] CR2: 00000000015e8708 CR3: 0000000102185000 CR4: 00000000000006e0
    [35908.065201] Stack:
    [35908.065201]  ffff88010c4cbb18 0000000000000f37 ffff88020a74c578 ffff88015a408000
    [35908.065201]  ffff880154a44000 0000000000000000 0000000000000005 ffff88010c4cbbd8
    [35908.065201]  ffffffffa0492b9a 0000000000000005 0000000000000000 0000000000000000
    [35908.065201] Call Trace:
    [35908.065201]  [<ffffffffa0492b9a>] __btrfs_inc_extent_ref+0x8b/0x208 [btrfs]
    [35908.065201]  [<ffffffffa0497117>] ? __btrfs_run_delayed_refs+0x4d4/0xd33 [btrfs]
    [35908.065201]  [<ffffffffa049773d>] __btrfs_run_delayed_refs+0xafa/0xd33 [btrfs]
    [35908.065201]  [<ffffffffa04a976a>] ? join_transaction.isra.10+0x25/0x41f [btrfs]
    [35908.065201]  [<ffffffffa04a97ed>] ? join_transaction.isra.10+0xa8/0x41f [btrfs]
    [35908.065201]  [<ffffffffa049914d>] btrfs_run_delayed_refs+0x75/0x1dd [btrfs]
    [35908.065201]  [<ffffffffa04992f1>] delayed_ref_async_start+0x3c/0x7b [btrfs]
    [35908.065201]  [<ffffffffa04d4b4f>] normal_work_helper+0x14c/0x32a [btrfs]
    [35908.065201]  [<ffffffffa04d4e93>] btrfs_extent_refs_helper+0x12/0x14 [btrfs]
    [35908.065201]  [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
    [35908.065201]  [<ffffffff81064285>] worker_thread+0x206/0x2c2
    [35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
    [35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
    [35908.065201]  [<ffffffff8106904d>] kthread+0xef/0xf7
    [35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
    [35908.065201]  [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
    [35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
    [35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8 48 8d 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 77 02 <0f> 0b 4c 8b 45 30 8b 4d 28 45 31
    [35908.065201] RIP  [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
    [35908.065201]  RSP <ffff88010c4cbb08>
    [35908.310885] ---[ end trace fe4299baf0666457 ]---
    
    This happens because the new delayed references code no longer merges
    delayed references that have different sequence values. The following
    steps are an example sequence leading to this issue:
    
    1) Transaction N starts, fs_info->tree_mod_seq has value 0;
    
    2) Extent buffer (btree node) A is allocated, delayed reference Ref1 for
       bytenr A is created, with a value of 1 and a seq value of 0;
    
    3) fs_info->tree_mod_seq is incremented to 1;
    
    4) Extent buffer A is deleted through btrfs_del_items(), which calls
       btrfs_del_leaf(), which in turn calls btrfs_free_tree_block(). The
       later returns the metadata extent associated to extent buffer A to
       the free space cache (the range is not pinned), because the extent
       buffer was created in the current transaction (N) and writeback never
       happened for the extent buffer (flag BTRFS_HEADER_FLAG_WRITTEN not set
       in the extent buffer).
       This creates the delayed reference Ref2 for bytenr A, with a value
       of -1 and a seq value of 1;
    
    5) Delayed reference Ref2 is not merged with Ref1 when we create it,
       because they have different sequence numbers (decided at
       add_delayed_ref_tail_merge());
    
    6) fs_info->tree_mod_seq is incremented to 2;
    
    7) Some task attempts to allocate a new extent buffer (done at
       extent-tree.c:find_free_extent()), but due to heavy fragmentation
       and running low on metadata space the clustered allocation fails
       and we fall back to unclustered allocation, which finds the
       extent at offset A, so a new extent buffer at offset A is allocated.
       This creates delayed reference Ref3 for bytenr A, with a value of 1
       and a seq value of 2;
    
    8) Ref3 is not merged neither with Ref2 nor Ref1, again because they
       all have different seq values;
    
    9) We start running the delayed references (__btrfs_run_delayed_refs());
    
    10) The delayed Ref1 is the first one being applied, which ends up
        creating an inline extent backref in the extent tree;
    
    10) Next the delayed reference Ref3 is selected for execution, and not
        Ref2, because select_delayed_ref() always gives a preference for
        positive references (that have an action of BTRFS_ADD_DELAYED_REF);
    
    11) When running Ref3 we encounter alreay the inline extent backref
        in the extent tree at insert_inline_extent_backref(), which makes
        us hit the following BUG_ON:
    
            BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID);
    
        This is always true because owner corresponds to the level of the
        extent buffer/btree node in the btree.
    
    For the scenario described above we hit the BUG_ON because we never merge
    references that have different seq values.
    
    We used to do the merging before the 4.2 kernel, more specifically, before
    the commmits:
    
      c6fc2454 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
      c43d160f ("btrfs: delayed-ref: Cleanup the unneeded functions.")
    
    This issue became more exposed after the following change that was added
    to 4.2 as well:
    
      cffc3374 ("Btrfs: fix order by which delayed references are run")
    
    Which in turn fixed another regression by the two commits previously
    mentioned.
    
    So fix this by bringing back the delayed reference merge code, with the
    proper adaptations so that it operates against the new data structure
    (linked list vs old red black tree implementation).
    
    This issue was hit running fstest btrfs/063 in a loop. Several people have
    reported this issue in the mailing list when running on kernels 4.2+.
    
    Very special thanks to Stéphane Lesimple for helping debugging this issue
    and testing this fix on his multi terabyte filesystem (which took more
    than one day to balance alone, plus fsck, etc).
    
    Fixes: c6fc2454 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
    Reported-by: default avatarPeter Becker <floyd.net@gmail.com>
    Reported-by: default avatarStéphane Lesimple <stephane_btrfs@lesimple.fr>
    Tested-by: default avatarStéphane Lesimple <stephane_btrfs@lesimple.fr>
    Reported-by: default avatarMalte Schröder <malte@tnxip.de>
    Reported-by: default avatarDerek Dongray <derek@valedon.co.uk>
    Reported-by: default avatarErkki Seppala <flux-btrfs@inside.org>
    Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
    Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
    Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    480d3c86
extent-tree.c 273 KB