1. 03 Feb, 2015 10 commits
    • Naohiro Aota's avatar
      btrfs: clear bio reference after submit_one_bio() · 289454ad
      Naohiro Aota authored
      After submit_one_bio(), `bio' can go away. However submit_extent_page()
      leave `bio' referable if submit_one_bio() failed (e.g. -ENOMEM on OOM).
      It will cause invalid paging request when submit_extent_page() is called
      next time.
      
      I reproduced ENOMEM case with the following script (need
      CONFIG_FAIL_PAGE_ALLOC, and CONFIG_FAULT_INJECTION_DEBUG_FS).
      
        #!/bin/bash
      
        dmesgout=dmesg.txt
        start=100000
        end=300000
        step=1000
      
        # btrfs options
        device=/dev/vdb1
        directory=/mnt/btrfs
      
        # fault-injection options
        percent=100
        times=3
      
        mkdir -p $directory || exit 1
        mount -o compress $device $directory || exit 1
      
        rm -f $directory/file || exit 1
        dd if=/dev/zero of=$directory/file bs=1M count=512 || exit 1
      
        for interval in `seq $start $step $end`; do
                dmesg -C
                echo 1 > /proc/sys/vm/drop_caches
                sync
                export FAILCMD_TYPE=fail_page_alloc
                ./failcmd.sh -p $percent -t $times -i $interval \
                        --ignore-gfp-highmem=N --ignore-gfp-wait=N --min-order=0 \
                        -- \
                        cat $directory/file > /dev/null
                dmesg > ${dmesgout}
                if grep -q BUG: ${dmesgout}; then
                        cat ${dmesgout}
                        exit 1
                fi
        done
      
        umount $directory
        exit 0
      Signed-off-by: default avatarNaohiro Aota <naota@elisp.net>
      Tested-by: default avatarSatoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      289454ad
    • Filipe Manana's avatar
      Btrfs: fix scrub race leading to use-after-free · de554a4f
      Filipe Manana authored
      While running a scrub on a kernel with CONFIG_DEBUG_PAGEALLOC=y, I got
      the following trace:
      
      [68127.807663] BUG: unable to handle kernel paging request at ffff8803f8947a50
      [68127.807663] IP: [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122
      [68127.807663] PGD 3003067 PUD 43e1f5067 PMD 43e030067 PTE 80000003f8947060
      [68127.807663] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
      [68127.807663] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop parport_pc processor parpo
      [68127.807663] CPU: 2 PID: 3081 Comm: kworker/u8:5 Not tainted 3.18.0-rc6-btrfs-next-3+ #4
      [68127.807663] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [68127.807663] Workqueue: btrfs-btrfs-scrub btrfs_scrub_helper [btrfs]
      [68127.807663] task: ffff880101fc5250 ti: ffff8803f097c000 task.ti: ffff8803f097c000
      [68127.807663] RIP: 0010:[<ffffffff8107da31>]  [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122
      [68127.807663] RSP: 0018:ffff8803f097fbb8  EFLAGS: 00010093
      [68127.807663] RAX: 0000000028dd386c RBX: ffff8803f8947a50 RCX: 0000000028dd3854
      [68127.807663] RDX: 0000000000000018 RSI: 0000000000000002 RDI: 0000000000000001
      [68127.807663] RBP: ffff8803f097fbd8 R08: 0000000000000004 R09: 0000000000000001
      [68127.807663] R10: ffff880102620980 R11: ffff8801f3e8c900 R12: 000000000001d390
      [68127.807663] R13: 00000000cabd13c8 R14: ffff8803f8947800 R15: ffff88037c574f00
      [68127.807663] FS:  0000000000000000(0000) GS:ffff88043dd00000(0000) knlGS:0000000000000000
      [68127.807663] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [68127.807663] CR2: ffff8803f8947a50 CR3: 00000000b6481000 CR4: 00000000000006e0
      [68127.807663] Stack:
      [68127.807663]  ffffffff823942a8 ffff8803f8947a50 ffff8802a3416f80 0000000000000000
      [68127.807663]  ffff8803f097fc18 ffffffff8141e7c0 ffffffff81072948 000000000034f314
      [68127.807663]  ffff8803f097fc08 0000000000000292 ffff8803f097fc48 ffff8803f8947a50
      [68127.807663] Call Trace:
      [68127.807663]  [<ffffffff8141e7c0>] _raw_spin_lock_irqsave+0x4b/0x55
      [68127.807663]  [<ffffffff81072948>] ? __wake_up+0x22/0x4b
      [68127.807663]  [<ffffffff81072948>] __wake_up+0x22/0x4b
      [68127.807663]  [<ffffffffa0392327>] scrub_pending_bio_dec+0x32/0x36 [btrfs]
      [68127.807663]  [<ffffffffa0395e70>] scrub_bio_end_io_worker+0x5a3/0x5c9 [btrfs]
      [68127.807663]  [<ffffffff810e0c7c>] ? time_hardirqs_off+0x15/0x28
      [68127.807663]  [<ffffffff81078106>] ? trace_hardirqs_off_caller+0x4c/0xb9
      [68127.807663]  [<ffffffffa0372a7c>] normal_work_helper+0xf1/0x238 [btrfs]
      [68127.807663]  [<ffffffffa0372d3d>] btrfs_scrub_helper+0x12/0x14 [btrfs]
      [68127.807663]  [<ffffffff810582d2>] process_one_work+0x1e4/0x3b6
      [68127.807663]  [<ffffffff81078180>] ? trace_hardirqs_off+0xd/0xf
      [68127.807663]  [<ffffffff81058dc9>] worker_thread+0x1fb/0x2a8
      [68127.807663]  [<ffffffff81058bce>] ? rescuer_thread+0x219/0x219
      [68127.807663]  [<ffffffff8105cd75>] kthread+0xdb/0xe3
      [68127.807663]  [<ffffffff8105cc9a>] ? __kthread_parkme+0x67/0x67
      [68127.807663]  [<ffffffff8141f1ec>] ret_from_fork+0x7c/0xb0
      [68127.807663]  [<ffffffff8105cc9a>] ? __kthread_parkme+0x67/0x67
      [68127.807663] Code: 39 c2 75 14 8d 8a 00 00 01 00 89 d0 f0 0f b1 0b 39 d0 0f 84 81 00 00 00 4c 69 2d 27 86 99 00 fa 00 00 00 45 31 e4 4d 39 ec 74 2b <8b> 13 89 d0 c1 e8 10 66 39 c2 75
      [68127.807663] RIP  [<ffffffff8107da31>] do_raw_spin_lock+0x94/0x122
      [68127.807663]  RSP <ffff8803f097fbb8>
      [68127.807663] CR2: ffff8803f8947a50
      [68127.807663] ---[ end trace d7045aac00a66cd8 ]---
      
      This is due to a race that can happen in a very tiny time window and is
      illustrated by the following sequence diagram:
      
               CPU 1                                                     CPU 2
      
                                                                      btrfs_scrub_dev()
      scrub_bio_end_io_worker()
         scrub_pending_bio_dec()
             atomic_dec(&sctx->bios_in_flight)
                                                                         wait sctx->bios_in_flight == 0
                                                                         wait sctx->workers_pending == 0
                                                                         mutex_lock(&fs_info->scrub_lock)
                                                                         (...)
                                                                         mutex_lock(&fs_info->scrub_lock)
                                                                         scrub_free_ctx(sctx)
                                                                            kfree(sctx)
             wake_up(&sctx->list_wait)
                __wake_up()
                    spin_lock_irqsave(&sctx->list_wait->lock, flags)
      
      Another variation of this scenario that results in the same use-after-free
      issue is:
      
               CPU 1                                                     CPU 2
      
                                                                      btrfs_scrub_dev()
                                                                         wait sctx->bios_in_flight == 0
      scrub_bio_end_io_worker()
         scrub_pending_bio_dec()
             __wake_up(&sctx->list_wait)
                spin_lock_irqsave(&sctx->list_wait->lock, flags)
                default_wake_function()
                    wake up task at CPU 2
                                                                         wait sctx->workers_pending == 0
                                                                         mutex_lock(&fs_info->scrub_lock)
                                                                         (...)
                                                                         mutex_lock(&fs_info->scrub_lock)
                                                                         scrub_free_ctx(sctx)
                                                                            kfree(sctx)
                spin_unlock_irqrestore(&sctx->list_wait->lock, flags)
      
      Fix this by holding the scrub lock while doing the wakeup.
      
      This isn't a recent regression, the issue as been around since the scrub
      feature was added (2011, commit a2de733c).
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      de554a4f
    • Filipe Manana's avatar
      Btrfs: add missing cleanup on sysfs init failure · 001a648d
      Filipe Manana authored
      If we failed during initialization of sysfs, we weren't unregistering the
      top level btrfs sysfs entry nor the debugfs stuff.
      Not unregistering the top level sysfs entry makes future attempts to reload
      the btrfs module impossible and the following is reported in dmesg:
      
      [ 2246.451296] WARNING: CPU: 3 PID: 10999 at fs/sysfs/dir.c:486 sysfs_warn_dup+0x91/0xb0()
      [ 2246.451298] sysfs: cannot create duplicate filename '/fs/btrfs'
      [ 2246.451298] Modules linked in: btrfs(+) raid6_pq xor bnep rfcomm bluetooth binfmt_misc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc parport_pc parport psmouse serio_raw pcspkr evbug i2c_piix4 e1000 floppy [last unloaded: btrfs]
      [ 2246.451310] CPU: 3 PID: 10999 Comm: modprobe Tainted: G        W    3.13.0-fdm-btrfs-next-24+ #7
      [ 2246.451311] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [ 2246.451312]  0000000000000009 ffff8800d353fa08 ffffffff816f1da6 0000000000000410
      [ 2246.451314]  ffff8800d353fa58 ffff8800d353fa48 ffffffff8104a32c ffff88020821a290
      [ 2246.451316]  ffff88020821a290 ffff88020821a290 ffff8802148f0000 ffff8800d353fb80
      [ 2246.451318] Call Trace:
      [ 2246.451322]  [<ffffffff816f1da6>] dump_stack+0x4e/0x68
      [ 2246.451324]  [<ffffffff8104a32c>] warn_slowpath_common+0x8c/0xc0
      [ 2246.451325]  [<ffffffff8104a416>] warn_slowpath_fmt+0x46/0x50
      [ 2246.451328]  [<ffffffff81367dc5>] ? strlcat+0x65/0x90
      (....)
      
      This fixes the following change:
      
          btrfs: add simple debugfs interface
          commit 1bae3098Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      001a648d
    • Filipe Manana's avatar
      Btrfs: fix race between transaction commit and empty block group removal · d4b450cd
      Filipe Manana authored
      Committing a transaction can race with automatic removal of empty block
      groups (cleaner kthread), leading to a BUG_ON() in the transaction
      commit code while running btrfs_finish_extent_commit(). The following
      sequence diagram shows how it can happen:
      
                 CPU 1                                       CPU 2
      
      btrfs_commit_transaction()
        fs_info->running_transaction = NULL
        btrfs_finish_extent_commit()
          find_first_extent_bit()
            -> found range for block group X
               in fs_info->freed_extents[]
      
                                                     btrfs_delete_unused_bgs()
                                                       -> found block group X
      
                                                       Removed block group X's range
                                                       from fs_info->freed_extents[]
      
                                                       btrfs_remove_chunk()
                                                          btrfs_remove_block_group(bg X)
      
          unpin_extent_range(bg X range)
             btrfs_lookup_block_group(bg X)
                -> returns NULL
                  -> BUG_ON()
      
      The trace that results from the BUG_ON() is:
      
      [48665.187808] ------------[ cut here ]------------
      [48665.188032] kernel BUG at fs/btrfs/extent-tree.c:5675!
      [48665.188032] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
      [48665.188032] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop parport_pc evdev microcode
      [48665.197388] CPU: 2 PID: 31211 Comm: kworker/u32:16 Tainted: G        W      3.19.0-rc5-btrfs-next-4+ #1
      [48665.197388] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [48665.197388] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
      [48665.197388] task: ffff880222011810 ti: ffff8801b56a4000 task.ti: ffff8801b56a4000
      [48665.197388] RIP: 0010:[<ffffffffa0350d05>]  [<ffffffffa0350d05>] unpin_extent_range+0x6a/0x1ba [btrfs]
      [48665.197388] RSP: 0018:ffff8801b56a7b88  EFLAGS: 00010246
      [48665.197388] RAX: 0000000000000000 RBX: ffff8802143a6000 RCX: ffff8802220120c8
      [48665.197388] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8800a3c140b0
      [48665.197388] RBP: ffff8801b56a7bd8 R08: 0000000000000003 R09: 0000000000000000
      [48665.197388] R10: 0000000000000000 R11: 000000000000bbac R12: 0000000012e8e000
      [48665.197388] R13: ffff8800a3c14000 R14: 0000000000000000 R15: 0000000000000000
      [48665.197388] FS:  0000000000000000(0000) GS:ffff88023ec40000(0000) knlGS:0000000000000000
      [48665.197388] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [48665.197388] CR2: 00007f065e42f270 CR3: 0000000206f70000 CR4: 00000000000006e0
      [48665.197388] Stack:
      [48665.197388]  ffff8801b56a7bd8 0000000012ea0000 01ff8800a3c14138 0000000012e9ffff
      [48665.197388]  ffff880141df3dd8 ffff8802143a6000 ffff8800a3c14138 ffff880141df3df0
      [48665.197388]  ffff880141df3dd8 0000000000000000 ffff8801b56a7c08 ffffffffa0354227
      [48665.197388] Call Trace:
      [48665.197388]  [<ffffffffa0354227>] btrfs_finish_extent_commit+0xb0/0xd9 [btrfs]
      [48665.197388]  [<ffffffffa0366b4b>] btrfs_commit_transaction+0x791/0x92c [btrfs]
      [48665.197388]  [<ffffffffa0352432>] flush_space+0x43d/0x452 [btrfs]
      [48665.197388]  [<ffffffff814295c3>] ? _raw_spin_unlock+0x28/0x33
      [48665.197388]  [<ffffffffa035255f>] btrfs_async_reclaim_metadata_space+0x118/0x164 [btrfs]
      [48665.197388]  [<ffffffff81059917>] ? process_one_work+0x14b/0x3ab
      [48665.197388]  [<ffffffff810599ac>] process_one_work+0x1e0/0x3ab
      [48665.197388]  [<ffffffff81079fa9>] ? trace_hardirqs_off+0xd/0xf
      [48665.197388]  [<ffffffff8105a55b>] worker_thread+0x210/0x2d0
      [48665.197388]  [<ffffffff8105a34b>] ? rescuer_thread+0x2c3/0x2c3
      [48665.197388]  [<ffffffff8105e5c0>] kthread+0xef/0xf7
      [48665.197388]  [<ffffffff81429682>] ? _raw_spin_unlock_irq+0x2d/0x39
      [48665.197388]  [<ffffffff8105e4d1>] ? __kthread_parkme+0xad/0xad
      [48665.197388]  [<ffffffff81429dec>] ret_from_fork+0x7c/0xb0
      [48665.197388]  [<ffffffff8105e4d1>] ? __kthread_parkme+0xad/0xad
      [48665.197388] Code: 85 f6 74 14 49 8b 06 49 03 46 09 49 39 c4 72 1d 4c 89 f7 e8 83 ec ff ff 4c 89 e6 4c 89 ef e8 1e f1 ff ff 48 85 c0 49 89 c6 75 02 <0f> 0b 49 8b 1e 49 03 5e 09 48 8b
      [48665.197388] RIP  [<ffffffffa0350d05>] unpin_extent_range+0x6a/0x1ba [btrfs]
      [48665.197388]  RSP <ffff8801b56a7b88>
      [48665.272246] ---[ end trace b9c6ab9957521376 ]---
      
      Fix this by ensuring that unpining the block group's range in
      btrfs_finish_extent_commit() is done in a synchronized fashion
      with removing the block group's range from freed_extents[]
      in btrfs_delete_unused_bgs()
      
      This race got introduced with the change:
      
          Btrfs: remove empty block groups automatically
          commit 47ab2a6cSigned-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      d4b450cd
    • David Sterba's avatar
      btrfs: add more checks to btrfs_read_sys_array · e3540eab
      David Sterba authored
      Verify that the sys_array has enough bytes to read the next item.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      e3540eab
    • David Sterba's avatar
      btrfs: cleanup, rename a few variables in btrfs_read_sys_array · 1ffb22cf
      David Sterba authored
      There's a pointer to buffer, integer offset and offset passed as
      pointer, try to find matching names for them.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      1ffb22cf
    • David Sterba's avatar
      btrfs: add checks for sys_chunk_array sizes · ce7fca5f
      David Sterba authored
      Verify that possible minimum and maximum size is set, validity of
      contents is checked in btrfs_read_sys_array.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      ce7fca5f
    • David Sterba's avatar
      btrfs: more superblock checks, lower bounds on devices and sectorsize/nodesize · 75d6ad38
      David Sterba authored
      I received a few crafted images from Jiri, all got through the recently
      added superblock checks. The lower bounds checks for num_devices and
      sector/node -sizes were missing and caused a crash during mount.
      
      Tools for symbolic code execution were used to prepare the images
      contents.
      Reported-by: default avatarJiri Slaby <jslaby@suse.cz>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      75d6ad38
    • chandan r's avatar
      Btrfs: Add code to support file creation time · 9cc97d64
      chandan r authored
      This patch adds a new member to the 'struct btrfs_inode' structure to hold
      the file creation time.
      Signed-off-by: default avatarchandan <chandanrmail@gmail.com>
      [refreshed, removed btrfs_inode_otime]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      9cc97d64
    • David Sterba's avatar
      btrfs: kill btrfs_inode_*time helpers · a937b979
      David Sterba authored
      They just opencode taking address of the timespec member.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      a937b979
  2. 22 Jan, 2015 30 commits
    • chandan's avatar
      Btrfs: insert_new_root: Fix lock type of the extent buffer. · 95449a16
      chandan authored
      btrfs_alloc_tree_block() returns an extent buffer on which a blocked lock has
      been taken. Hence assign the appropriate value to path->locks[level].
      Signed-off-by: default avatarChandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      95449a16
    • Anand Jain's avatar
      Btrfs: fix unused members in struct btrfs_root · 78f55e5e
      Anand Jain authored
      There isn't any real use of following members of struct btrfs_root
      so delete them.
      
      struct kobject root_kobj;
      struct completion kobj_unregister;
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      78f55e5e
    • Yang Dongsheng's avatar
      btrfs: qgroup: move WARN_ON() to the correct location. · 0ee13fe2
      Yang Dongsheng authored
      In function qgroup_excl_accounting(), we need to WARN when
      qg->excl is less than what we want to free, same to child
      and parents. But currently, for parent qgroup, the WARN_ON()
      is located after freeing qg->excl. It will WARN out even we
      free it normally.
      
      This patch move this WARN_ON() before freeing qg->excl.
      Signed-off-by: default avatarDongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Reviewed-by: default avatarSatoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      0ee13fe2
    • Liu Bo's avatar
      Btrfs: cleanup unused run_most · 26455d33
      Liu Bo authored
      "run_most" is not used anymore.
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: default avatarSatoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      26455d33
    • Zhao Lei's avatar
      Rename all ref_count to refs in struct · 57019345
      Zhao Lei authored
      refs is better than ref_count to record a struct's ref count.
      Signed-off-by: default avatarZhao Lei <zhaolei@cn.fujitsu.com>
      Suggested-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      57019345
    • Zhao Lei's avatar
      Btrfs: Introduce BTRFS_BLOCK_GROUP_RAID56_MASK to check raid56 simply · ffe2d203
      Zhao Lei authored
      So we can check raid56 with:
       (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK)
      instead of long:
       (map->type & (BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6))
      Signed-off-by: default avatarZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      ffe2d203
    • Zhao Lei's avatar
      Btrfs: Include map_type in raid_bio · 10f11900
      Zhao Lei authored
      Corrent code use many kinds of "clever" way to determine operation
      target's raid type, as:
        raid_map != NULL
        or
        raid_map[MAX_NR] == RAID[56]_Q_STRIPE
      
      To make code easy to maintenance, this patch put raid type into
      bbio, and we can always get raid type from bbio with a "stupid"
      way.
      Signed-off-by: default avatarZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      10f11900
    • Zhao Lei's avatar
      Btrfs: Simplify scrub_setup_recheck_block()'s argument · be50a8dd
      Zhao Lei authored
      scrub_setup_recheck_block() have many arguments but most of them
      can be get from one of them, we can remove them to make code clean.
      Some other cleanup for that function also included in this patch.
      Signed-off-by: default avatarZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      be50a8dd
    • Zhao Lei's avatar
      Btrfs: Combine per-page recover in dev-replace and scrub · b968fed1
      Zhao Lei authored
      The code are similar, combine them to make code clean and easy to maintenance.
      Some lost condition are also completed with benefit of this combination.
      Signed-off-by: default avatarZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      b968fed1
    • Zhao Lei's avatar
      Btrfs: Separate finding-right-mirror and writing-to-target's process in... · 8d6738c1
      Zhao Lei authored
      Btrfs: Separate finding-right-mirror and writing-to-target's process in scrub_handle_errored_block()
      
      In corrent code, code of finding-right-mirror and writing-to-target
      are mixed in logic, if we find a right mirror but failed in writing
      to target, it will treat as "hadn't found right block", and fill the
      target with sblock_bad.
      
      Actually, "failed in writing to target" does not mean "source
      block is wrong", this patch separate above two condition in logic,
      and do some cleanup to make code clean.
      Signed-off-by: default avatarZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      8d6738c1
    • Zhao Lei's avatar
      Btrfs: Break loop when reach BTRFS_MAX_MIRRORS in scrub_setup_recheck_block() · dc5f7a3b
      Zhao Lei authored
      Use break instead of useless loop should be more suitable in this
      case.
      Signed-off-by: default avatarZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      dc5f7a3b
    • Zhao Lei's avatar
      7653947f
    • Zhao Lei's avatar
      Btrfs: Cleanup btrfs_bio_counter_inc_blocked() · 09dd7a01
      Zhao Lei authored
      1: Remove no-need DEFINE_WAIT(wait)
      2: Add likely() for BTRFS_FS_STATE_DEV_REPLACING condition
      3: Use while loop instead of goto
      Signed-off-by: default avatarZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      09dd7a01
    • Zhao Lei's avatar
      Btrfs: Remove noneed force_write in scrub_write_block_to_dev_replace · 114ab50d
      Zhao Lei authored
      It is always 1 in this place, because !1 case was already jumped
      out in previous code.
      Signed-off-by: default avatarZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      114ab50d
    • Zhao Lei's avatar
      Btrfs: Fix a jump typo of nodatasum_case to avoid wrong WARN_ON() · b25c94c5
      Zhao Lei authored
      if (sctx->is_dev_replace && !is_metadata && !have_csum) {
          ...
          goto nodatasum_case;
      }
      ...
      nodatasum_case:
          WARN_ON(sctx->is_dev_replace);
      
      In above code, nodatasum_case marker should be moved after
      WARN_ON().
      Signed-off-by: default avatarZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      b25c94c5
    • Zhao Lei's avatar
      Btrfs: add ref_count and free function for btrfs_bio · 6e9606d2
      Zhao Lei authored
      1: ref_count is simple than current RBIO_HOLD_BBIO_MAP_BIT flag
         to keep btrfs_bio's memory in raid56 recovery implement.
      2: free function for bbio will make code clean and flexible, plus
         forced data type checking in compile.
      
      Changelog v1->v2:
       Rename following by David Sterba's suggestion:
       put_btrfs_bio() -> btrfs_put_bio()
       get_btrfs_bio() -> btrfs_get_bio()
       bbio->ref_count -> bbio->refs
      Signed-off-by: default avatarZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      6e9606d2
    • Zhao Lei's avatar
      Btrfs: Make raid_map array be inlined in btrfs_bio structure · 8e5cfb55
      Zhao Lei authored
      It can make code more simple and clear, we need not care about
      free bbio and raid_map together.
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      8e5cfb55
    • Zhao Lei's avatar
      Btrfs: sort raid_map before adding tgtdev stripes · cc7539ed
      Zhao Lei authored
      It can avoid complex calculation of real stripes in sort,
      moreover, we can clean up code of sorting tgtdev_map because it
      will be in order initially.
      Signed-off-by: default avatarZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      cc7539ed
    • Zhao Lei's avatar
      Btrfs: fix a out-of-bound access of raid_map · e34c330d
      Zhao Lei authored
      We add the number of stripes on target devices into bbio->num_stripes
      if we are under device replacement, and we just sort the raid_map of
      those stripes that not on the target devices, so if when we need
      real raid_map, we need skip the stripes on the target devices.
      Signed-off-by: default avatarZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      e34c330d
    • Filipe Manana's avatar
      Btrfs: fix fsync log replay for inodes with a mix of regular refs and extrefs · df8d116f
      Filipe Manana authored
      If we have an inode with a large number of hard links, some of which may
      be extrefs, turn a regular ref into an extref, fsync the inode and then
      replay the fsync log (after a crash/reboot), we can endup with an fsync
      log that makes the replay code always fail with -EOVERFLOW when processing
      the inode's references.
      
      This is easy to reproduce with the test case I made for xfstests. Its steps
      are the following:
      
         _scratch_mkfs "-O extref" >> $seqres.full 2>&1
         _init_flakey
         _mount_flakey
      
         # Create a test file with 3001 hard links. This number is large enough to
         # make btrfs start using extrefs at some point even if the fs has the maximum
         # possible leaf/node size (64Kb).
         echo "hello world" > $SCRATCH_MNT/foo
         for i in `seq 1 3000`; do
             ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_`printf "%04d" $i`
         done
      
         # Make sure all metadata and data are durably persisted.
         sync
      
         # Now remove one link, add a new one with a new name, add another new one with
         # the same name as the one we just removed and fsync the inode.
         rm -f $SCRATCH_MNT/foo_link_0001
         ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3001
         ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_0001
         rm -f $SCRATCH_MNT/foo_link_0002
         ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3002
         ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3003
         $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
      
         # Simulate a crash/power loss. This makes sure the next mount
         # will see an fsync log and will replay that log.
      
         _load_flakey_table $FLAKEY_DROP_WRITES
         _unmount_flakey
      
         _load_flakey_table $FLAKEY_ALLOW_WRITES
         _mount_flakey
      
         # Check that the number of hard links is correct, we are able to remove all
         # the hard links and read the file's data. This is just to verify we don't
         # get stale file handle errors (due to dangling directory index entries that
         # point to inodes that no longer exist).
         echo "Link count: $(stat --format=%h $SCRATCH_MNT/foo)"
         [ -f $SCRATCH_MNT/foo ] || echo "Link foo is missing"
         for ((i = 1; i <= 3003; i++)); do
             name=foo_link_`printf "%04d" $i`
             if [ $i -eq 2 ]; then
                 [ -f $SCRATCH_MNT/$name ] && echo "Link $name found"
             else
                 [ -f $SCRATCH_MNT/$name ] || echo "Link $name is missing"
             fi
         done
         rm -f $SCRATCH_MNT/foo_link_*
         cat $SCRATCH_MNT/foo
         rm -f $SCRATCH_MNT/foo
      
         status=0
         exit
      
      The fix is simply to correct the overflow condition when overwriting a
      reference item because it was wrong, trying to increase the item in the
      fs/subvol tree by an impossible amount. Also ensure that we don't insert
      one normal ref and one ext ref for the same dentry - this happened because
      processing a dir index entry from the parent in the log happened when
      the normal ref item was full, which made the logic insert an extref and
      later when the normal ref had enough room, it would be inserted again
      when processing the ref item from the child inode in the log.
      
      This issue has been present since the introduction of the extrefs feature
      (2012).
      
      A test case for xfstests follows soon. This test only passes if the previous
      patch titled "Btrfs: fix fsync when extend references are added to an inode"
      is applied too.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      df8d116f
    • Filipe Manana's avatar
      Btrfs: fix fsync when extend references are added to an inode · 2c2c452b
      Filipe Manana authored
      If we added an extended reference to an inode and fsync'ed it, the log
      replay code would make our inode have an incorrect link count, which
      was lower then the expected/correct count.
      This resulted in stale directory index entries after deleting some of
      the hard links, and any access to the dangling directory entries resulted
      in -ESTALE errors because the entries pointed to inode items that don't
      exist anymore.
      
      This is easy to reproduce with the test case I made for xfstests, and
      the bulk of that test is:
      
          _scratch_mkfs "-O extref" >> $seqres.full 2>&1
          _init_flakey
          _mount_flakey
      
          # Create a test file with 3001 hard links. This number is large enough to
          # make btrfs start using extrefs at some point even if the fs has the maximum
          # possible leaf/node size (64Kb).
          echo "hello world" > $SCRATCH_MNT/foo
          for i in `seq 1 3000`; do
              ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_`printf "%04d" $i`
          done
      
          # Make sure all metadata and data are durably persisted.
          sync
      
          # Add one more link to the inode that ends up being a btrfs extref and fsync
          # the inode.
          ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3001
          $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
      
          # Simulate a crash/power loss. This makes sure the next mount
          # will see an fsync log and will replay that log.
      
          _load_flakey_table $FLAKEY_DROP_WRITES
          _unmount_flakey
      
          _load_flakey_table $FLAKEY_ALLOW_WRITES
          _mount_flakey
      
          # Now after the fsync log replay btrfs left our inode with a wrong link count N,
          # which was smaller than the correct link count M (N < M).
          # So after removing N hard links, the remaining M - N directory entries were
          # still visible to user space but it was impossible to do anything with them
          # because they pointed to an inode that didn't exist anymore. This resulted in
          # stale file handle errors (-ESTALE) when accessing those dentries for example.
          #
          # So remove all hard links except the first one and then attempt to read the
          # file, to verify we don't get an -ESTALE error when accessing the inodel
          #
          # The btrfs fsck tool also detected the incorrect inode link count and it
          # reported an error message like the following:
          #
          # root 5 inode 257 errors 2001, no inode item, link count wrong
          #   unresolved ref dir 256 index 2978 namelen 13 name foo_link_2976 filetype 1 errors 4, no inode ref
          #
          # The fstests framework automatically calls fsck after a test is run, so we
          # don't need to call fsck explicitly here.
      
          rm -f $SCRATCH_MNT/foo_link_*
          cat $SCRATCH_MNT/foo
      
          status=0
          exit
      
      So make sure an fsync always flushes the delayed inode item, so that the
      fsync log contains it (needed in order to trigger the link count fixup
      code) and fix the extref counting function, which always return -ENOENT
      to its caller (and made it assume there were always 0 extrefs).
      
      This issue has been present since the introduction of the extrefs feature
      (2012).
      
      A test case for xfstests follows soon.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      2c2c452b
    • Filipe Manana's avatar
      Btrfs: fix directory inconsistency after fsync log replay · d36808e0
      Filipe Manana authored
      If we have an inode (file) with a link count greater than 1, remove
      one of its hard links, fsync the inode, power fail/crash and then
      replay the fsync log on the next mount, we end up getting the parent
      directory's metadata inconsistent - its i_size still reflects the
      deleted hard link and has dangling index entries (with no matching
      inode reference entries). This prevents the directory from ever being
      deletable, as its i_size can never decrease to BTRFS_EMPTY_DIR_SIZE
      even if all of its children inodes are deleted, and the dangling index
      entries can never be removed (as they point to an inode that does not
      exist anymore).
      
      This is easy to reproduce with the following excerpt from the test case
      for xfstests that I just made:
      
          _scratch_mkfs >> $seqres.full 2>&1
      
          _init_flakey
          _mount_flakey
      
          # Create a test file with 2 hard links in the same directory.
          mkdir -p $SCRATCH_MNT/a/b
          echo "hello world" > $SCRATCH_MNT/a/b/foo
          ln $SCRATCH_MNT/a/b/foo $SCRATCH_MNT/a/b/bar
      
          # Make sure all metadata and data are durably persisted.
          sync
      
          # Now remove one of the hard links and fsync the inode.
          rm -f $SCRATCH_MNT/a/b/bar
          $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a/b/foo
      
          # Simulate a crash/power loss. This makes sure the next mount
          # will see an fsync log and will replay that log.
      
          _load_flakey_table $FLAKEY_DROP_WRITES
          _unmount_flakey
      
          _load_flakey_table $FLAKEY_ALLOW_WRITES
          _mount_flakey
      
          # Remove the last hard link of the file and attempt to remove its parent
          # directory - this failed in btrfs because the fsync log and replay code
          # didn't decrement the parent directory's i_size and left dangling directory
          # index entries - this made the btrfs rmdir implementation always fail with
          # the error -ENOTEMPTY.
          #
          # The dangling directory index entries were visible to user space, but it was
          # impossible to do anything on them (unlink, open, read, write, stat, etc)
          # because the inode they pointed to did not exist anymore.
          #
          # The parent directory's metadata inconsistency (stale index entries) was
          # also detected by btrfs' fsck tool, which is run automatically by the fstests
          # framework when the test finishes. The error message reported by fsck was:
          #
          # root 5 inode 259 errors 2001, no inode item, link count wrong
          #   unresolved ref dir 258 index 3 namelen 3 name bar filetype 1 errors 4, no inode ref
          #
          rm -f $SCRATCH_MNT/a/b/*
          rmdir $SCRATCH_MNT/a/b
          rmdir $SCRATCH_MNT/a
      
      To fix this just make sure that after an unlink, if the inode is fsync'ed,
      he parent inode is fully logged in the fsync log.
      
      A test case for xfstests follows soon.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      d36808e0
    • Filipe Manana's avatar
      Btrfs: lookup for block group only if needed when freeing a tree block · 6219872d
      Filipe Manana authored
      Very often our extent buffer's header generation doesn't match the current
      transaction's id or it is also referenced by other trees (snapshots), so
      we don't need the corresponding block group cache object. Therefore only
      search for it if we are going to use it, so we avoid an unnecessary search
      in the block groups rbtree (and acquiring and releasing its spinlock).
      
      Freeing a tree block is performed when COWing or deleting a node/leaf,
      which implies we are holding the node/leaf's parent node lock, therefore
      reducing the amount of time spent when freeing a tree block helps reducing
      the amount of time we are holding the parent node's lock.
      
      For example, for a run of xfstests/generic/083, the block group cache
      object was needed only 682 times for a total of 226691 calls to free
      a tree block.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      6219872d
    • David Sterba's avatar
      btrfs: remove a no-op unfreeze superbock callback · 730a78c7
      David Sterba authored
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      730a78c7
    • David Sterba's avatar
      btrfs: switch extent_state state to unsigned · 9ee49a04
      David Sterba authored
      Currently there's a 4B hole in the structure between refs and state and there
      are only 16 bits used so we can make it unsigned. This will get a better
      packing and may save some stack space for local variables.
      
      The size of extent_state gets reduced by 8B and there are usually a lot
      of slab objects.
      
      struct extent_state {
      	u64                        start;                /*     0     8 */
      	u64                        end;                  /*     8     8 */
      	struct rb_node             rb_node;              /*    16    24 */
      	wait_queue_head_t          wq;                   /*    40    24 */
      	/* --- cacheline 1 boundary (64 bytes) --- */
      	atomic_t                   refs;                 /*    64     4 */
      
      	/* XXX 4 bytes hole, try to pack */
      
      	long unsigned int          state;                /*    72     8 */
      	u64                        private;              /*    80     8 */
      
      	/* size: 88, cachelines: 2, members: 7 */
      	/* sum members: 84, holes: 1, sum holes: 4 */
      	/* last cacheline: 24 bytes */
      };
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      9ee49a04
    • David Sterba's avatar
      btrfs: set proper message level for skinny metadata · 5efa0490
      David Sterba authored
      This has been confusing people for too long, the message is really just
      informative.
      
      CC: <stable@vger.kernel.org> # 3.10+
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      5efa0490
    • David Sterba's avatar
      btrfs: update message levels after checksum errors · f0954c66
      David Sterba authored
      The errors are worth noting and might get missed with INFO level.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      f0954c66
    • David Sterba's avatar
      btrfs: update message levels during failed mount · aa8ee312
      David Sterba authored
      All error conditions from open_ctree shall be ERR. Warning would
      suggest that something's wrong and we can continue.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      aa8ee312
    • David Sterba's avatar
      btrfs: update message levels for errors · 68b663d1
      David Sterba authored
      Several messages that point to some internal problem, level INFO is
      wrong here.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      68b663d1
    • Filipe Manana's avatar
      Btrfs: fix setup_leaf_for_split() to avoid leaf corruption · a8df6fe6
      Filipe Manana authored
      We were incorrectly detecting when the target key didn't exist anymore
      after releasing the path and re-searching the tree. This could make
      us split or duplicate (btrfs_split_item() and btrfs_duplicate_item()
      are its only callers at the moment) an item when we should not.
      
      For the case of duplicating an item, we currently only duplicate
      checksum items (csum tree) and file extent items (fs/subvol trees).
      For the checksum items we end up overriding the item completely,
      but for file extent items we update only some of their fields in
      the copy (done in __btrfs_drop_extents), which means we can end up
      having a logical corruption for some values.
      
      Also for the case where we duplicate a file extent item it will make
      us produce a leaf with a wrong key order, as btrfs_duplicate_item()
      advances us to the next slot and then its caller sets a smaller key
      on the new item at that slot (like in __btrfs_drop_extents() e.g.).
      Alternatively if the tree search in setup_leaf_for_split() leaves
      with path->slots[0] == btrfs_header_nritems(path->nodes[0]), we end
      up accessing beyond the leaf's end (when we check if the item's size
      has changed) and make our caller insert an item at the invalid slot
      btrfs_header_nritems(path->nodes[0]) + 1, causing an invalid memory
      access if the leaf is full or nearly full.
      
      This issue has been present since the introduction of this function
      in 2009:
      
          Btrfs: Add btrfs_duplicate_item
          commit ad48fd75Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      a8df6fe6