1. 29 Jul, 2015 4 commits
    • Jeff Mahoney's avatar
      btrfs: iterate over unused chunk space in FITRIM · 499f377f
      Jeff Mahoney authored
      Since we now clean up block groups automatically as they become
      empty, iterating over block groups is no longer sufficient to discard
      unused space.
      
      This patch iterates over the unused chunk space and discards any regions
      that are unallocated, regardless of whether they were ever used.  This is
      a change for btrfs but is consistent with other file systems.
      
      We do this in a transactionless manner since the discard process can take
      a substantial amount of time and a transaction would need to be started
      before the acquisition of the device list lock.  That would mean a
      transaction would be held open across /all/ of the discards collectively.
      In order to prevent other threads from allocating or freeing chunks, we
      hold the chunks lock across the search and discard calls.  We release it
      between searches to allow the file system to perform more-or-less
      normally.  Since the running transaction can commit and disappear while
      we're using the transaction pointer, we take a reference to it and
      release it after the search.  This is safe since it would happen normally
      at the end of the transaction commit after any locks are released anyway.
      We also take the commit_root_sem to protect against a transaction starting
      and committing while we're running.
      Signed-off-by: default avatarJeff Mahoney <jeffm@suse.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Tested-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      499f377f
    • Jeff Mahoney's avatar
      btrfs: skip superblocks during discard · 86557861
      Jeff Mahoney authored
      Btrfs doesn't track superblocks with extent records so there is nothing
      persistent on-disk to indicate that those blocks are in use.  We track
      the superblocks in memory to ensure they don't get used by removing them
      from the free space cache when we load a block group from disk.  Prior
      to 47ab2a6c6a (Btrfs: remove empty block groups automatically), that
      was fine since the block group would never be reclaimed so the superblock
      was always safe.  Once we started removing the empty block groups, we
      were protected by the fact that discards weren't being properly issued
      for unused space either via FITRIM or -odiscard.  The block groups were
      still being released, but the blocks remained on disk.
      
      In order to properly discard unused block groups, we need to filter out
      the superblocks from the discard range.  Superblocks are located at fixed
      locations on each device, so it makes sense to filter them out in
      btrfs_issue_discard, which is used by both -odiscard and FITRIM.
      Signed-off-by: default avatarJeff Mahoney <jeffm@suse.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Tested-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      86557861
    • Jeff Mahoney's avatar
      btrfs: btrfs_issue_discard ensure offset/length are aligned to sector boundaries · 4d89d377
      Jeff Mahoney authored
      It's possible, though unexpected, to pass unaligned offsets and lengths
      to btrfs_issue_discard.  We then shift the offset/length values to sector
      units.  If an unaligned offset has been passed, it will result in the
      entire sector being discarded, possibly losing data.  An unaligned
      length is safe but we'll end up returning an inaccurate number of
      discarded bytes.
      
      This patch aligns the offset to the 512B boundary, adjusts the length,
      and warns, since we shouldn't be discarding on an offset that isn't
      aligned with our sector size.
      Signed-off-by: default avatarJeff Mahoney <jeffm@suse.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Tested-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      4d89d377
    • Jeff Mahoney's avatar
      btrfs: make btrfs_issue_discard return bytes discarded · d04c6b88
      Jeff Mahoney authored
      Initially this will just be the length argument passed to it,
      but the following patches will adjust that to reflect re-alignment
      and skipped blocks.
      Signed-off-by: default avatarJeff Mahoney <jeffm@suse.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Tested-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      d04c6b88
  2. 23 Jul, 2015 4 commits
    • Filipe Manana's avatar
      Btrfs: fix quick exhaustion of the system array in the superblock · 00d80e34
      Filipe Manana authored
      Omar reported that after commit 4fbcdf66 ("Btrfs: fix -ENOSPC when
      finishing block group creation"), introduced in 4.2-rc1, the following
      test was failing due to exhaustion of the system array in the superblock:
      
        #!/bin/bash
      
        truncate -s 100T big.img
        mkfs.btrfs big.img
        mount -o loop big.img /mnt/loop
      
        num=5
        sz=10T
        for ((i = 0; i < $num; i++)); do
            echo fallocate $i $sz
            fallocate -l $sz /mnt/loop/testfile$i
        done
        btrfs filesystem sync /mnt/loop
      
        for ((i = 0; i < $num; i++)); do
              echo rm $i
              rm /mnt/loop/testfile$i
              btrfs filesystem sync /mnt/loop
        done
        umount /mnt/loop
      
      This made btrfs_add_system_chunk() fail with -EFBIG due to excessive
      allocation of system block groups. This happened because the test creates
      a large number of data block groups per transaction and when committing
      the transaction we start the writeout of the block group caches for all
      the new new (dirty) block groups, which results in pre-allocating space
      for each block group's free space cache using the same transaction handle.
      That in turn often leads to creation of more block groups, and all get
      attached to the new_bgs list of the same transaction handle to the point
      of getting a list with over 1500 elements, and creation of new block groups
      leads to the need of reserving space in the chunk block reserve and often
      creating a new system block group too.
      
      So that made us quickly exhaust the chunk block reserve/system space info,
      because as of the commit mentioned before, we do reserve space for each
      new block group in the chunk block reserve, unlike before where we would
      not and would at most allocate one new system block group and therefore
      would only ensure that there was enough space in the system space info to
      allocate 1 new block group even if we ended up allocating thousands of
      new block groups using the same transaction handle. That worked most of
      the time because the computed required space at check_system_chunk() is
      very pessimistic (assumes a chunk tree height of BTRFS_MAX_LEVEL/8 and
      that all nodes/leafs in a path will be COWed and split) and since the
      updates to the chunk tree all happen at btrfs_create_pending_block_groups
      it is unlikely that a path needs to be COWed more than once (unless
      writepages() for the btree inode is called by mm in between) and that
      compensated for the need of creating any new nodes/leads in the chunk
      tree.
      
      So fix this by ensuring we don't accumulate a too large list of new block
      groups in a transaction's handles new_bgs list, inserting/updating the
      chunk tree for all accumulated new block groups and releasing the unused
      space from the chunk block reserve whenever the list becomes sufficiently
      large. This is a generic solution even though the problem currently can
      only happen when starting the writeout of the free space caches for all
      dirty block groups (btrfs_start_dirty_block_groups()).
      Reported-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Tested-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      00d80e34
    • Anand Jain's avatar
      btrfs: its btrfs_err() instead of btrfs_error() · 3e303ea6
      Anand Jain authored
      sorry I indented to use btrfs_err() and I have no idea
      how btrfs_error() got there.
      infact I was thinking about these kind of oversights
      since these two func are too closely named.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      3e303ea6
    • Zhao Lei's avatar
      btrfs: Avoid NULL pointer dereference of free_extent_buffer when read_tree_block() fail · 95ab1f64
      Zhao Lei authored
      When read_tree_block() failed, we can see following dmesg:
       [  134.371389] BUG: unable to handle kernel NULL pointer dereference at 0000000000000063
       [  134.372236] IP: [<ffffffff813a4a51>] free_extent_buffer+0x21/0x90
       [  134.372236] PGD 0
       [  134.372236] Oops: 0000 [#1] SMP
       [  134.372236] Modules linked in:
       [  134.372236] CPU: 0 PID: 2289 Comm: mount Not tainted 4.2.0-rc1_HEAD_c65b99f0_+ #115
       [  134.372236] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014
       [  134.372236] task: ffff88003b6e1a00 ti: ffff880011e60000 task.ti: ffff880011e60000
       [  134.372236] RIP: 0010:[<ffffffff813a4a51>]  [<ffffffff813a4a51>] free_extent_buffer+0x21/0x90
       ...
       [  134.372236] Call Trace:
       [  134.372236]  [<ffffffff81379aa1>] free_root_extent_buffers+0x91/0xb0
       [  134.372236]  [<ffffffff81379c3d>] free_root_pointers+0x17d/0x190
       [  134.372236]  [<ffffffff813801b0>] open_ctree+0x1ca0/0x25b0
       [  134.372236]  [<ffffffff8144d017>] ? disk_name+0x97/0xb0
       [  134.372236]  [<ffffffff813558aa>] btrfs_mount+0x8fa/0xab0
       ...
      
      Reason:
       read_tree_block() changed to return error number on fail,
       and this value(not NULL) is set to tree_root->node, then subsequent
       code will run to:
        free_root_pointers()
        ->free_root_extent_buffers()
        ->free_extent_buffer()
        ->atomic_read((extent_buffer *)(-E_XXX)->refs);
       and trigger above error.
      
      Fix:
       Set tree_root->node to NULL on fail to make error_handle code
       happy.
      Signed-off-by: default avatarZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      95ab1f64
    • Zhao Lei's avatar
      btrfs: Fix lockdep warning of btrfs_run_delayed_iputs() · 8a733013
      Zhao Lei authored
      Liu Bo <bo.li.liu@oracle.com> reported a lockdep warning of
      delayed_iput_sem in xfstests generic/241:
        [ 2061.345955] =============================================
        [ 2061.346027] [ INFO: possible recursive locking detected ]
        [ 2061.346027] 4.1.0+ #268 Tainted: G        W
        [ 2061.346027] ---------------------------------------------
        [ 2061.346027] btrfs-cleaner/3045 is trying to acquire lock:
        [ 2061.346027]  (&fs_info->delayed_iput_sem){++++..}, at:
        [<ffffffff814063ab>] btrfs_run_delayed_iputs+0x6b/0x100
        [ 2061.346027] but task is already holding lock:
        [ 2061.346027]  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffff814063ab>] btrfs_run_delayed_iputs+0x6b/0x100
        [ 2061.346027] other info that might help us debug this:
        [ 2061.346027]  Possible unsafe locking scenario:
      
        [ 2061.346027]        CPU0
        [ 2061.346027]        ----
        [ 2061.346027]   lock(&fs_info->delayed_iput_sem);
        [ 2061.346027]   lock(&fs_info->delayed_iput_sem);
        [ 2061.346027]
         *** DEADLOCK ***
      It is rarely happened, about 1/400 in my test env.
      
      The reason is recursion of btrfs_run_delayed_iputs():
        cleaner_kthread
        -> btrfs_run_delayed_iputs() *1
        -> get delayed_iput_sem lock *2
        -> iput()
        -> ...
        -> btrfs_commit_transaction()
        -> btrfs_run_delayed_iputs() *1
        -> get delayed_iput_sem lock (dead lock) *2
        *1: recursion of btrfs_run_delayed_iputs()
        *2: warning of lockdep about delayed_iput_sem
      
      When fs is in high stress, new iputs may added into fs_info->delayed_iputs
      list when btrfs_run_delayed_iputs() is running, which cause
      second btrfs_run_delayed_iputs() run into down_read(&fs_info->delayed_iput_sem)
      again, and cause above lockdep warning.
      
      Actually, it will not cause real problem because both locks are read lock,
      but to avoid lockdep warning, we can do a fix.
      
      Fix:
        Don't do btrfs_run_delayed_iputs() in btrfs_commit_transaction() for
        cleaner_kthread thread to break above recursion path.
        cleaner_kthread is calling btrfs_run_delayed_iputs() explicitly in code,
        and don't need to call btrfs_run_delayed_iputs() again in
        btrfs_commit_transaction(), it also give us a bonus to avoid stack overflow.
      
      Test:
        No above lockdep warning after patch in 1200 generic/241 tests.
      Reported-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarZhao Lei <zhaolei@cn.fujitsu.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      8a733013
  3. 14 Jul, 2015 1 commit
    • Filipe Manana's avatar
      Btrfs: fix file corruption after cloning inline extents · ed958762
      Filipe Manana authored
      Using the clone ioctl (or extent_same ioctl, which calls the same extent
      cloning function as well) we end up allowing copy an inline extent from
      the source file into a non-zero offset of the destination file. This is
      something not expected and that the btrfs code is not prepared to deal
      with - all inline extents must be at a file offset equals to 0.
      
      For example, the following excerpt of a test case for fstests triggers
      a crash/BUG_ON() on a write operation after an inline extent is cloned
      into a non-zero offset:
      
        _scratch_mkfs >>$seqres.full 2>&1
        _scratch_mount
      
        # Create our test files. File foo has the same 2K of data at offset 4K
        # as file bar has at its offset 0.
        $XFS_IO_PROG -f -s -c "pwrite -S 0xaa 0 4K" \
            -c "pwrite -S 0xbb 4k 2K" \
            -c "pwrite -S 0xcc 8K 4K" \
            $SCRATCH_MNT/foo | _filter_xfs_io
      
        # File bar consists of a single inline extent (2K size).
        $XFS_IO_PROG -f -s -c "pwrite -S 0xbb 0 2K" \
           $SCRATCH_MNT/bar | _filter_xfs_io
      
        # Now call the clone ioctl to clone the extent of file bar into file
        # foo at its offset 4K. This made file foo have an inline extent at
        # offset 4K, something which the btrfs code can not deal with in future
        # IO operations because all inline extents are supposed to start at an
        # offset of 0, resulting in all sorts of chaos.
        # So here we validate that clone ioctl returns an EOPNOTSUPP, which is
        # what it returns for other cases dealing with inlined extents.
        $CLONER_PROG -s 0 -d $((4 * 1024)) -l $((2 * 1024)) \
            $SCRATCH_MNT/bar $SCRATCH_MNT/foo
      
        # Because of the inline extent at offset 4K, the following write made
        # the kernel crash with a BUG_ON().
        $XFS_IO_PROG -c "pwrite -S 0xdd 6K 2K" $SCRATCH_MNT/foo | _filter_xfs_io
      
        status=0
        exit
      
      The stack trace of the BUG_ON() triggered by the last write is:
      
        [152154.035903] ------------[ cut here ]------------
        [152154.036424] kernel BUG at mm/page-writeback.c:2286!
        [152154.036424] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        [152154.036424] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc acpi_cpu$
        [152154.036424] CPU: 2 PID: 17873 Comm: xfs_io Tainted: G        W       4.1.0-rc6-btrfs-next-11+ #2
        [152154.036424] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
        [152154.036424] task: ffff880429f70990 ti: ffff880429efc000 task.ti: ffff880429efc000
        [152154.036424] RIP: 0010:[<ffffffff8111a9d5>]  [<ffffffff8111a9d5>] clear_page_dirty_for_io+0x1e/0x90
        [152154.036424] RSP: 0018:ffff880429effc68  EFLAGS: 00010246
        [152154.036424] RAX: 0200000000000806 RBX: ffffea0006a6d8f0 RCX: 0000000000000001
        [152154.036424] RDX: 0000000000000000 RSI: ffffffff81155d1b RDI: ffffea0006a6d8f0
        [152154.036424] RBP: ffff880429effc78 R08: ffff8801ce389fe0 R09: 0000000000000001
        [152154.036424] R10: 0000000000002000 R11: ffffffffffffffff R12: ffff8800200dce68
        [152154.036424] R13: 0000000000000000 R14: ffff8800200dcc88 R15: ffff8803d5736d80
        [152154.036424] FS:  00007fbf119f6700(0000) GS:ffff88043d280000(0000) knlGS:0000000000000000
        [152154.036424] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [152154.036424] CR2: 0000000001bdc000 CR3: 00000003aa555000 CR4: 00000000000006e0
        [152154.036424] Stack:
        [152154.036424]  ffff8803d5736d80 0000000000000001 ffff880429effcd8 ffffffffa04e97c1
        [152154.036424]  ffff880429effd68 ffff880429effd60 0000000000000001 ffff8800200dc9c8
        [152154.036424]  0000000000000001 ffff8800200dcc88 0000000000000000 0000000000001000
        [152154.036424] Call Trace:
        [152154.036424]  [<ffffffffa04e97c1>] lock_and_cleanup_extent_if_need+0x147/0x18d [btrfs]
        [152154.036424]  [<ffffffffa04ea82c>] __btrfs_buffered_write+0x245/0x4c8 [btrfs]
        [152154.036424]  [<ffffffffa04ed14b>] ? btrfs_file_write_iter+0x150/0x3e0 [btrfs]
        [152154.036424]  [<ffffffffa04ed15a>] ? btrfs_file_write_iter+0x15f/0x3e0 [btrfs]
        [152154.036424]  [<ffffffffa04ed2c7>] btrfs_file_write_iter+0x2cc/0x3e0 [btrfs]
        [152154.036424]  [<ffffffff81165a4a>] __vfs_write+0x7c/0xa5
        [152154.036424]  [<ffffffff81165f89>] vfs_write+0xa0/0xe4
        [152154.036424]  [<ffffffff81166855>] SyS_pwrite64+0x64/0x82
        [152154.036424]  [<ffffffff81465197>] system_call_fastpath+0x12/0x6f
        [152154.036424] Code: 48 89 c7 e8 0f ff ff ff 5b 41 5c 5d c3 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb e8 ae ef 00 00 49 89 c4 48 8b 03 a8 01 75 02 <0f> 0b 4d 85 e4 74 59 49 8b 3c 2$
        [152154.036424] RIP  [<ffffffff8111a9d5>] clear_page_dirty_for_io+0x1e/0x90
        [152154.036424]  RSP <ffff880429effc68>
        [152154.242621] ---[ end trace e3d3376b23a57041 ]---
      
      Fix this by returning the error EOPNOTSUPP if an attempt to copy an
      inline extent into a non-zero offset happens, just like what is done for
      other scenarios that would require copying/splitting inline extents,
      which were introduced by the following commits:
      
         00fdf13a ("Btrfs: fix a crash of clone with inline extents's split")
         3f9e3df8 ("btrfs: replace error code from btrfs_drop_extents")
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      ed958762
  4. 11 Jul, 2015 4 commits
    • Filipe Manana's avatar
      Btrfs: fix order by which delayed references are run · cffc3374
      Filipe Manana authored
      When we have an extent that got N references removed and N new references
      added in the same transaction, we must run the insertion of the references
      first because otherwise the last removed reference will remove the extent
      item from the extent tree, resulting in a failure for the insertions.
      
      This is a regression introduced in the 4.2-rc1 release and this fix just
      brings back the behaviour of selecting reference additions before any
      reference removals.
      
      The following test case for fstests reproduces the issue:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            _cleanup_flakey
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
        . ./common/dmflakey
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
        _require_dm_flakey
        _require_cloner
        _require_metadata_journaling $SCRATCH_DEV
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create prealloc extent covering range [160K, 620K[
        $XFS_IO_PROG -f -c "falloc 160K 460K" $SCRATCH_MNT/foo
      
        # Now write to the last 80K of the prealloc extent plus 40K to the unallocated
        # space that immediately follows it. This creates a new extent of 40K that spans
        # the range [620K, 660K[.
        $XFS_IO_PROG -c "pwrite -S 0xaa 540K 120K" $SCRATCH_MNT/foo | _filter_xfs_io
      
        # At this point, there are now 2 back references to the prealloc extent in our
        # extent tree. Both are for our file offset 160K and one relates to a file
        # extent item with a data offset of 0 and a length of 380K, while the other
        # relates to a file extent item with a data offset of 380K and a length of 80K.
      
        # Make sure everything done so far is durably persisted (all back references are
        # in the extent tree, etc).
        sync
      
        # Now clone all extents of our file that cover the offset 160K up to its eof
        # (660K at this point) into itself at offset 2M. This leaves a hole in the file
        # covering the range [660K, 2M[. The prealloc extent will now be referenced by
        # the file twice, once for offset 160K and once for offset 2M. The 40K extent
        # that follows the prealloc extent will also be referenced twice by our file,
        # once for offset 620K and once for offset 2M + 460K.
        $CLONER_PROG -s $((160 * 1024)) -d $((2 * 1024 * 1024)) -l 0 $SCRATCH_MNT/foo \
      	$SCRATCH_MNT/foo
      
        # Now create one new extent in our file with a size of 100Kb. It will span the
        # range [3M, 3M + 100K[. It also will cause creation of a hole spanning the
        # range [2M + 460K, 3M[. Our new file size is 3M + 100K.
        $XFS_IO_PROG -c "pwrite -S 0xbb 3M 100K" $SCRATCH_MNT/foo | _filter_xfs_io
      
        # At this point, there are now (in memory) 4 back references to the prealloc
        # extent.
        #
        # Two of them are for file offset 160K, related to file extent items
        # matching the file offsets 160K and 540K respectively, with data offsets of
        # 0 and 380K respectively, and with lengths of 380K and 80K respectively.
        #
        # The other two references are for file offset 2M, related to file extent items
        # matching the file offsets 2M and 2M + 380K respectively, with data offsets of
        # 0 and 380K respectively, and with lengths of 389K and 80K respectively.
        #
        # The 40K extent has 2 back references, one for file offset 620K and the other
        # for file offset 2M + 460K.
        #
        # The 100K extent has a single back reference and it relates to file offset 3M.
      
        # Now clone our 100K extent into offset 600K. That offset covers the last 20K
        # of the prealloc extent, the whole 40K extent and 40K of the hole starting at
        # offset 660K.
        $CLONER_PROG -s $((3 * 1024 * 1024)) -d $((600 * 1024)) -l $((100 * 1024)) \
            $SCRATCH_MNT/foo $SCRATCH_MNT/foo
      
        # At this point there's only one reference to the 40K extent, at file offset
        # 2M + 460K, we have 4 references for the prealloc extent (2 for file offset
        # 160K and 2 for file offset 2M) and 2 references for the 100K extent (1 for
        # file offset 3M and a new one for file offset 600K).
      
        # Now fsync our file to make all its new data and metadata updates are durably
        # persisted and present if a power failure/crash happens after a successful
        # fsync and before the next transaction commit.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
      
        echo "File digest before power failure:"
        md5sum $SCRATCH_MNT/foo | _filter_scratch
      
        # Silently drop all writes and ummount to simulate a crash/power failure.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        # Allow writes again, mount to trigger log replay and validate file contents.
        # During log replay, the btrfs delayed references implementation used to run the
        # deletion of back references before the addition of new back references, which
        # made the addition fail as it didn't find the key in the extent tree that it
        # was looking for. The failure triggered by this test was related to the 40K
        # extent, which got 1 reference dropped and 1 reference added during the fsync
        # log replay - when running the delayed references at transaction commit time,
        # btrfs was applying the deletion before the insertion, resulting in a failure
        # of the insertion that ended up turning the fs into read-only mode.
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        echo "File digest after log replay:"
        md5sum $SCRATCH_MNT/foo | _filter_scratch
      
        _unmount_flakey
      
        status=0
        exit
      
      This issue turned the filesystem into read-only mode (current transaction
      aborted) and produced the following traces:
      
        [ 8247.578385] ------------[ cut here ]------------
        [ 8247.579947] WARNING: CPU: 0 PID: 11341 at fs/btrfs/extent-tree.c:1547 lookup_inline_extent_backref+0x17d/0x45d [btrfs]()
        (...)
        [ 8247.601697] Call Trace:
        [ 8247.602222]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
        [ 8247.604320]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
        [ 8247.605488]  [<ffffffffa0506c8d>] ? lookup_inline_extent_backref+0x17d/0x45d [btrfs]
        [ 8247.608226]  [<ffffffffa0506c8d>] lookup_inline_extent_backref+0x17d/0x45d [btrfs]
        [ 8247.617061]  [<ffffffffa0507957>] insert_inline_extent_backref+0x41/0xb2 [btrfs]
        [ 8247.621856]  [<ffffffffa0507c4f>] __btrfs_inc_extent_ref+0x8c/0x20a [btrfs]
        [ 8247.624366]  [<ffffffffa050ee60>] __btrfs_run_delayed_refs+0xb0c/0xd49 [btrfs]
        [ 8247.626176]  [<ffffffffa0510dcd>] btrfs_run_delayed_refs+0x6d/0x1d4 [btrfs]
        [ 8247.627435]  [<ffffffff81155c9b>] ? __cache_free+0x4a7/0x4b6
        [ 8247.628531]  [<ffffffffa0520482>] btrfs_commit_transaction+0x4c/0xa20 [btrfs]
        (...)
        [ 8247.648430] ---[ end trace 2461e55f92c2ac2d ]---
      
        [ 8247.727263] WARNING: CPU: 3 PID: 11341 at fs/btrfs/extent-tree.c:2771 btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]()
        [ 8247.728954] BTRFS: Transaction aborted (error -5)
        (...)
        [ 8247.760866] Call Trace:
        [ 8247.761534]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
        [ 8247.764271]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
        [ 8247.767582]  [<ffffffffa0510e04>] ? btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]
        [ 8247.769373]  [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48
        [ 8247.770836]  [<ffffffffa0510e04>] btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]
        [ 8247.772532]  [<ffffffff81155c9b>] ? __cache_free+0x4a7/0x4b6
        [ 8247.773664]  [<ffffffffa0520482>] btrfs_commit_transaction+0x4c/0xa20 [btrfs]
        [ 8247.775047]  [<ffffffff81087310>] ? trace_hardirqs_on+0xd/0xf
        [ 8247.776176]  [<ffffffff81155dd5>] ? kmem_cache_free+0x12b/0x189
        [ 8247.777427]  [<ffffffffa055a920>] btrfs_recover_log_trees+0x2da/0x33d [btrfs]
        [ 8247.778575]  [<ffffffffa055898e>] ? replay_one_extent+0x4fc/0x4fc [btrfs]
        [ 8247.779838]  [<ffffffffa051e265>] open_ctree+0x1cc0/0x201a [btrfs]
        [ 8247.781020]  [<ffffffff81120f48>] ? register_shrinker+0x56/0x81
        [ 8247.782285]  [<ffffffffa04fb12c>] btrfs_mount+0x5f0/0x734 [btrfs]
        (...)
        [ 8247.793394] ---[ end trace 2461e55f92c2ac2e ]---
        [ 8247.794276] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2771: errno=-5 IO failure
        [ 8247.797335] BTRFS: error (device dm-0) in btrfs_replay_log:2375: errno=-5 IO failure (Failed to recover log tree)
      
      Fixes: c6fc2454 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Acked-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      cffc3374
    • Filipe Manana's avatar
      Btrfs: fix list transaction->pending_ordered corruption · d3efe084
      Filipe Manana authored
      When we call btrfs_commit_transaction(), we splice the list "ordered"
      of our transaction handle into the transaction's "pending_ordered"
      list, but we don't re-initialize the "ordered" list of our transaction
      handle, this means it still points to the same elements it used to
      before the splice. Then we check if the current transaction's state is
      >= TRANS_STATE_COMMIT_START and if it is we end up calling
      btrfs_end_transaction() which simply splices again the "ordered" list
      of our handle into the transaction's "pending_ordered" list, leaving
      multiple pointers to the same ordered extents which results in list
      corruption when we are iterating, removing and freeing ordered extents
      at btrfs_wait_pending_ordered(), resulting in access to dangling
      pointers / use-after-free issues.
      Similarly, btrfs_end_transaction() can end up in some cases calling
      btrfs_commit_transaction(), and both did a list splice of the transaction
      handle's "ordered" list into the transaction's "pending_ordered" without
      re-initializing the handle's "ordered" list, resulting in exactly the
      same problem.
      
      This produces the following warning on a kernel with linked list
      debugging enabled:
      
      [109749.265416] ------------[ cut here ]------------
      [109749.266410] WARNING: CPU: 7 PID: 324 at lib/list_debug.c:59 __list_del_entry+0x5a/0x98()
      [109749.267969] list_del corruption. prev->next should be ffff8800ba087e20, but was fffffff8c1f7c35d
      (...)
      [109749.287505] Call Trace:
      [109749.288135]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
      [109749.298080]  [<ffffffff81095de5>] ? console_unlock+0x356/0x3a2
      [109749.331605]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
      [109749.334849]  [<ffffffff81260642>] ? __list_del_entry+0x5a/0x98
      [109749.337093]  [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48
      [109749.337847]  [<ffffffff81260642>] __list_del_entry+0x5a/0x98
      [109749.338678]  [<ffffffffa053e8bf>] btrfs_wait_pending_ordered+0x46/0xdb [btrfs]
      [109749.340145]  [<ffffffffa058a65f>] ? __btrfs_run_delayed_items+0x149/0x163 [btrfs]
      [109749.348313]  [<ffffffffa054077d>] btrfs_commit_transaction+0x36b/0xa10 [btrfs]
      [109749.349745]  [<ffffffff81087310>] ? trace_hardirqs_on+0xd/0xf
      [109749.350819]  [<ffffffffa055370d>] btrfs_sync_file+0x36f/0x3fc [btrfs]
      [109749.351976]  [<ffffffff8118ec98>] vfs_fsync_range+0x8f/0x9e
      [109749.360341]  [<ffffffff8118ecc3>] vfs_fsync+0x1c/0x1e
      [109749.368828]  [<ffffffff8118ee1d>] do_fsync+0x34/0x4e
      [109749.369790]  [<ffffffff8118f045>] SyS_fsync+0x10/0x14
      [109749.370925]  [<ffffffff81465197>] system_call_fastpath+0x12/0x6f
      [109749.382274] ---[ end trace 48e0d07f7c03d95a ]---
      
      On a non-debug kernel this leads to invalid memory accesses, causing a
      crash. Fix this by using list_splice_init() instead of list_splice() in
      btrfs_commit_transaction() and btrfs_end_transaction().
      
      Cc: stable@vger.kernel.org
      Fixes: 50d9aa99 ("Btrfs: make sure logged extents complete in the current transaction V3"
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      d3efe084
    • Filipe Manana's avatar
      Btrfs: fix memory leak in the extent_same ioctl · 497b4050
      Filipe Manana authored
      We were allocating memory with memdup_user() but we were never releasing
      that memory. This affected pretty much every call to the ioctl, whether
      it deduplicated extents or not.
      
      This issue was reported on IRC by Julian Taylor and on the mailing list
      by Marcel Ritter, credit goes to them for finding the issue.
      Reported-by: default avatarJulian Taylor <jtaylor.debian@googlemail.com>
      Reported-by: default avatarMarcel Ritter <ritter.marcel@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarMark Fasheh <mfasheh@suse.de>
      497b4050
    • Filipe Manana's avatar
      Btrfs: fix shrinking truncate when the no_holes feature is enabled · c1aa4575
      Filipe Manana authored
      If the no_holes feature is enabled, we attempt to shrink a file to a size
      that ends up in the middle of a hole and we don't have any file extent
      items in the fs/subvol tree that go beyond the new file size (or any
      ordered extents that will insert such file extent items), we end up not
      updating the inode's disk_i_size, we only update the inode's i_size.
      
      This means that after unmounting and mounting the filesystem, or after
      the inode is evicted and reloaded, its i_size ends up being incorrect
      (an inode's i_size is set to the disk_i_size field when an inode is
      loaded). This happens when btrfs_truncate_inode_items() doesn't find
      any file extent items to drop - in this case it never makes a call to
      btrfs_ordered_update_i_size() in order to update the inode's disk_i_size.
      
      Example reproducer:
      
        $ mkfs.btrfs -O no-holes -f /dev/sdd
        $ mount /dev/sdd /mnt
      
        # Create our test file with some data and durably persist it.
        $ xfs_io -f -c "pwrite -S 0xaa 0 128K" /mnt/foo
        $ sync
      
        # Append some data to the file, increasing its size, and leave a hole
        # between the old size and the start offset if the following write. So
        # our file gets a hole in the range [128Kb, 256Kb[.
        $ xfs_io -c "truncate 160K" /mnt/foo
      
        # We expect to see our file with a size of 160Kb, with the first 128Kb
        # of data all having the value 0xaa and the remaining 32Kb of data all
        # having the value 0x00.
        $ od -t x1 /mnt/foo
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0400000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        *
        0500000
      
        # Now cleanly unmount and mount again the filesystem.
        $ umount /mnt
        $ mount /dev/sdd /mnt
      
        # We expect to get the same result as before, a file with a size of
        # 160Kb, with the first 128Kb of data all having the value 0xaa and the
        # remaining 32Kb of data all having the value 0x00.
        $ od -t x1 /mnt/foo
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0400000
      
      In the example above the file size/data do not match what they were before
      the remount.
      
      Fix this by always calling btrfs_ordered_update_i_size() with a size
      matching the size the file was truncated to if btrfs_truncate_inode_items()
      is not called for a log tree and no file extent items were dropped. This
      ensures the same behaviour as when the no_holes feature is not enabled.
      
      A test case for fstests follows soon.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      c1aa4575
  5. 02 Jul, 2015 10 commits
    • Shilong Wang's avatar
      Btrfs: fix wrong check for btrfs_force_chunk_alloc() · 9689457b
      Shilong Wang authored
      btrfs_force_chunk_alloc() return 1 for allocation chunk successfully.
      This problem exists since commit c87f08ca.
      
      With this patch, we might fix some enospc problems for balances.
      Signed-off-by: default avatarWang Shilong <wangshilong1991@gmail.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Tested-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      9689457b
    • Liu Bo's avatar
      Btrfs: fix warning of bytes_may_use · ddba1bfc
      Liu Bo authored
      While running generic/019, dmesg got several warnings from
      btrfs_free_reserved_data_space().
      
      Test generic/019 produces some disk failures so sumbit dio will get errors,
      in which case, btrfs_direct_IO() goes to the error handling and free
      bytes_may_use, but the problem is that bytes_may_use has been free'd
      during get_block().
      
      This adds a runtime flag to show if we've gone through get_block(), if so,
      don't do the cleanup work.
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Tested-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      ddba1bfc
    • Liu Bo's avatar
      Btrfs: fix hang when failing to submit bio of directIO · ad9ee205
      Liu Bo authored
      The hang is uncoverd by generic/019.
      
      btrfs_endio_direct_write() skips the "finish_ordered_fn" part when it hits
      an error, thus those added ordered extents will never get processed, which
      block processes that waiting for them via btrfs_start_ordered_extent().
      
      This fixes the above, and meanwhile finish_ordered_fn will do the space
      accounting work.
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Tested-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      ad9ee205
    • Filipe Manana's avatar
      Btrfs: fix a comment in inode.c:evict_inode_truncate_pages() · 9c6429d9
      Filipe Manana authored
      The comment was not correct about the part where it says the endio
      callback of the bio might have not yet been called - update it
      to mention that by that time the endio callback execution might
      still be in progress only.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      9c6429d9
    • Filipe Manana's avatar
      Btrfs: fix memory corruption on failure to submit bio for direct IO · 61de718f
      Filipe Manana authored
      If we fail to submit a bio for a direct IO request, we were grabbing the
      corresponding ordered extent and decrementing its reference count twice,
      once for our lookup reference and once for the ordered tree reference.
      This was a problem because it caused the ordered extent to be freed
      without removing it from the ordered tree and any lists it might be
      attached to, leaving dangling pointers to the ordered extent around.
      Example trace with CONFIG_DEBUG_PAGEALLOC=y:
      
      [161779.858707] BUG: unable to handle kernel paging request at 0000000087654330
      [161779.859983] IP: [<ffffffff8124ca68>] rb_prev+0x22/0x3b
      [161779.860636] PGD 34d818067 PUD 0
      [161779.860636] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      (...)
      [161779.860636] Call Trace:
      [161779.860636]  [<ffffffffa06b36a6>] __tree_search+0xd9/0xf9 [btrfs]
      [161779.860636]  [<ffffffffa06b3708>] tree_search+0x42/0x63 [btrfs]
      [161779.860636]  [<ffffffffa06b4868>] ? btrfs_lookup_ordered_range+0x2d/0xa5 [btrfs]
      [161779.860636]  [<ffffffffa06b4873>] btrfs_lookup_ordered_range+0x38/0xa5 [btrfs]
      [161779.860636]  [<ffffffffa06aab8e>] btrfs_get_blocks_direct+0x11b/0x615 [btrfs]
      [161779.860636]  [<ffffffff8119727f>] do_blockdev_direct_IO+0x5ff/0xb43
      [161779.860636]  [<ffffffffa06aaa73>] ? btrfs_page_exists_in_range+0x1ad/0x1ad [btrfs]
      [161779.860636]  [<ffffffffa06a2c9a>] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
      [161779.860636]  [<ffffffff811977f5>] __blockdev_direct_IO+0x32/0x34
      [161779.860636]  [<ffffffffa06a2c9a>] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
      [161779.860636]  [<ffffffffa06a10ae>] btrfs_direct_IO+0x198/0x21f [btrfs]
      [161779.860636]  [<ffffffffa06a2c9a>] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
      [161779.860636]  [<ffffffff81112ca1>] generic_file_direct_write+0xb3/0x128
      [161779.860636]  [<ffffffffa06affaa>] ? btrfs_file_write_iter+0x15f/0x3e0 [btrfs]
      [161779.860636]  [<ffffffffa06b004c>] btrfs_file_write_iter+0x201/0x3e0 [btrfs]
      (...)
      
      We were also not freeing the btrfs_dio_private we allocated previously,
      which kmemleak reported with the following trace in its sysfs file:
      
      unreferenced object 0xffff8803f553bf80 (size 96):
        comm "xfs_io", pid 4501, jiffies 4295039588 (age 173.936s)
        hex dump (first 32 bytes):
          88 6c 9b f5 02 88 ff ff 00 00 00 00 00 00 00 00  .l..............
          00 00 00 00 00 00 00 00 00 00 c4 00 00 00 00 00  ................
        backtrace:
          [<ffffffff81161ffe>] create_object+0x172/0x29a
          [<ffffffff8145870f>] kmemleak_alloc+0x25/0x41
          [<ffffffff81154e64>] kmemleak_alloc_recursive.constprop.40+0x16/0x18
          [<ffffffff811579ed>] kmem_cache_alloc_trace+0xfb/0x148
          [<ffffffffa03d8cff>] btrfs_submit_direct+0x65/0x16a [btrfs]
          [<ffffffff811968dc>] dio_bio_submit+0x62/0x8f
          [<ffffffff811975fe>] do_blockdev_direct_IO+0x97e/0xb43
          [<ffffffff811977f5>] __blockdev_direct_IO+0x32/0x34
          [<ffffffffa03d70ae>] btrfs_direct_IO+0x198/0x21f [btrfs]
          [<ffffffff81112ca1>] generic_file_direct_write+0xb3/0x128
          [<ffffffffa03e604d>] btrfs_file_write_iter+0x201/0x3e0 [btrfs]
          [<ffffffff8116586a>] __vfs_write+0x7c/0xa5
          [<ffffffff81165da9>] vfs_write+0xa0/0xe4
          [<ffffffff81166675>] SyS_pwrite64+0x64/0x82
          [<ffffffff81464fd7>] system_call_fastpath+0x12/0x6f
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      For read requests we weren't doing any cleanup either (none of the work
      done by btrfs_endio_direct_read()), so a failure submitting a bio for a
      read request would leave a range in the inode's io_tree locked forever,
      blocking any future operations (both reads and writes) against that range.
      
      So fix this by making sure we do the same cleanup that we do for the case
      where the bio submission succeeds.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      61de718f
    • Mark Fasheh's avatar
      btrfs: don't update mtime/ctime on deduped inodes · 1c919a5e
      Mark Fasheh authored
      One issue users have reported is that dedupe changes mtime on files,
      resulting in tools like rsync thinking that their contents have changed when
      in fact the data is exactly the same. We also skip the ctime update as no
      user-visible metadata changes here and we want dedupe to be transparent to
      the user.
      
      Clone still wants time changes, so we special case this in the code.
      
      This was tested with the btrfs-extent-same tool.
      Signed-off-by: default avatarMark Fasheh <mfasheh@suse.de>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      1c919a5e
    • Mark Fasheh's avatar
      btrfs: allow dedupe of same inode · 0efa9f48
      Mark Fasheh authored
      clone() supports cloning within an inode so extent-same can do
      the same now. This patch fixes up the locking in extent-same to
      know about the single-inode case. In addition to that, we add a
      check for overlapping ranges, which clone does not allow.
      Signed-off-by: default avatarMark Fasheh <mfasheh@suse.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      0efa9f48
    • Mark Fasheh's avatar
      btrfs: fix deadlock with extent-same and readpage · f4414602
      Mark Fasheh authored
      ->readpage() does page_lock() before extent_lock(), we do the opposite in
      extent-same. We want to reverse the order in btrfs_extent_same() but it's
      not quite straightforward since the page locks are taken inside btrfs_cmp_data().
      
      So I split btrfs_cmp_data() into 3 parts with a small context structure that
      is passed between them. The first, btrfs_cmp_data_prepare() gathers up the
      pages needed (taking page lock as required) and puts them on our context
      structure. At this point, we are safe to lock the extent range. Afterwards,
      we use btrfs_cmp_data() to do the data compare as usual and btrfs_cmp_data_free()
      to clean up our context.
      Signed-off-by: default avatarMark Fasheh <mfasheh@suse.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      f4414602
    • Mark Fasheh's avatar
      btrfs: pass unaligned length to btrfs_cmp_data() · 207910dd
      Mark Fasheh authored
      In the case that we dedupe the tail of a file, we might expand the dedupe
      len out to the end of our last block. We don't want to compare data past
      i_size however, so pass the original length to btrfs_cmp_data().
      Signed-off-by: default avatarMark Fasheh <mfasheh@suse.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      207910dd
    • Filipe Manana's avatar
      Btrfs: fix fsync after truncate when no_holes feature is enabled · a89ca6f2
      Filipe Manana authored
      When we have the no_holes feature enabled, if a we truncate a file to a
      smaller size, truncate it again but to a size greater than or equals to
      its original size and fsync it, the log tree will not have any information
      about the hole covering the range [truncate_1_offset, new_file_size[.
      Which means if the fsync log is replayed, the file will remain with the
      state it had before both truncate operations.
      
      Without the no_holes feature this does not happen, since when the inode
      is logged (full sync flag is set) it will find in the fs/subvol tree a
      leaf with a generation matching the current transaction id that has an
      explicit extent item representing the hole.
      
      Fix this by adding an explicit extent item representing a hole between
      the last extent and the inode's i_size if we are doing a full sync.
      
      The issue is easy to reproduce with the following test case for fstests:
      
        . ./common/rc
        . ./common/filter
        . ./common/dmflakey
      
        _need_to_be_root
        _supported_fs generic
        _supported_os Linux
        _require_scratch
        _require_dm_flakey
      
        # This test was motivated by an issue found in btrfs when the btrfs
        # no-holes feature is enabled (introduced in kernel 3.14). So enable
        # the feature if the fs being tested is btrfs.
        if [ $FSTYP == "btrfs" ]; then
            _require_btrfs_fs_feature "no_holes"
            _require_btrfs_mkfs_feature "no-holes"
            MKFS_OPTIONS="$MKFS_OPTIONS -O no-holes"
        fi
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create our test files and make sure everything is durably persisted.
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 64K"         \
                        -c "pwrite -S 0xbb 64K 61K"       \
                        $SCRATCH_MNT/foo | _filter_xfs_io
        $XFS_IO_PROG -f -c "pwrite -S 0xee 0 64K"         \
                        -c "pwrite -S 0xff 64K 61K"       \
                        $SCRATCH_MNT/bar | _filter_xfs_io
        sync
      
        # Now truncate our file foo to a smaller size (64Kb) and then truncate
        # it to the size it had before the shrinking truncate (125Kb). Then
        # fsync our file. If a power failure happens after the fsync, we expect
        # our file to have a size of 125Kb, with the first 64Kb of data having
        # the value 0xaa and the second 61Kb of data having the value 0x00.
        $XFS_IO_PROG -c "truncate 64K" \
                     -c "truncate 125K" \
                     -c "fsync" \
                     $SCRATCH_MNT/foo
      
        # Do something similar to our file bar, but the first truncation sets
        # the file size to 0 and the second truncation expands the size to the
        # double of what it was initially.
        $XFS_IO_PROG -c "truncate 0" \
                     -c "truncate 253K" \
                     -c "fsync" \
                     $SCRATCH_MNT/bar
      
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        # Allow writes again, mount to trigger log replay and validate file
        # contents.
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        # We expect foo to have a size of 125Kb, the first 64Kb of data all
        # having the value 0xaa and the remaining 61Kb to be a hole (all bytes
        # with value 0x00).
        echo "File foo content after log replay:"
        od -t x1 $SCRATCH_MNT/foo
      
        # We expect bar to have a size of 253Kb and no extents (any byte read
        # from bar has the value 0x00).
        echo "File bar content after log replay:"
        od -t x1 $SCRATCH_MNT/bar
      
        status=0
        exit
      
      The expected file contents in the golden output are:
      
        File foo content after log replay:
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0200000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        *
        0372000
        File bar content after log replay:
        0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        *
        0772000
      
      Without this fix, their contents are:
      
        File foo content after log replay:
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0200000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
        *
        0372000
        File bar content after log replay:
        0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
        *
        0200000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
        *
        0372000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        *
        0772000
      
      A test case submission for fstests follows soon.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      a89ca6f2
  6. 30 Jun, 2015 9 commits
    • Filipe Manana's avatar
      Btrfs: fix fsync xattr loss in the fast fsync path · 36283bf7
      Filipe Manana authored
      After commit 4f764e51 ("Btrfs: remove deleted xattrs on fsync log
      replay"), we can end up in a situation where during log replay we end up
      deleting xattrs that were never deleted when their file was last fsynced.
      
      This happens in the fast fsync path (flag BTRFS_INODE_NEEDS_FULL_SYNC is
      not set in the inode) if the inode has the flag BTRFS_INODE_COPY_EVERYTHING
      set, the xattr was added in a past transaction and the leaf where the
      xattr is located was not updated (COWed or created) in the current
      transaction. In this scenario the xattr item never ends up in the log
      tree and therefore at log replay time, which makes the replay code delete
      the xattr from the fs/subvol tree as it thinks that xattr was deleted
      prior to the last fsync.
      
      Fix this by always logging all xattrs, which is the simplest and most
      reliable way to detect deleted xattrs and replay the deletes at log replay
      time.
      
      This issue is reproducible with the following test case for fstests:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
      
        here=`pwd`
        tmp=/tmp/$$
        status=1	# failure is the default!
      
        _cleanup()
        {
            _cleanup_flakey
            rm -f $tmp.*
        }
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
        . ./common/dmflakey
        . ./common/attr
      
        # real QA test starts here
      
        # We create a lot of xattrs for a single file. Only btrfs and xfs are currently
        # able to store such a large mount of xattrs per file, other filesystems such
        # as ext3/4 and f2fs for example, fail with ENOSPC even if we attempt to add
        # less than 1000 xattrs with very small values.
        _supported_fs btrfs xfs
        _supported_os Linux
        _need_to_be_root
        _require_scratch
        _require_dm_flakey
        _require_attrs
        _require_metadata_journaling $SCRATCH_DEV
      
        rm -f $seqres.full
      
        _scratch_mkfs >> $seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create the test file with some initial data and make sure everything is
        # durably persisted.
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" $SCRATCH_MNT/foo | _filter_xfs_io
        sync
      
        # Add many small xattrs to our file.
        # We create such a large amount because it's needed to trigger the issue found
        # in btrfs - we need to have an amount that causes the fs to have at least 3
        # btree leafs with xattrs stored in them, and it must work on any leaf size
        # (maximum leaf/node size is 64Kb).
        num_xattrs=2000
        for ((i = 1; i <= $num_xattrs; i++)); do
            name="user.attr_$(printf "%04d" $i)"
            $SETFATTR_PROG -n $name -v "val_$(printf "%04d" $i)" $SCRATCH_MNT/foo
        done
      
        # Sync the filesystem to force a commit of the current btrfs transaction, this
        # is a necessary condition to trigger the bug on btrfs.
        sync
      
        # Now update our file's data and fsync the file.
        # After a successful fsync, if the fsync log/journal is replayed we expect to
        # see all the xattrs we added before with the same values (and the updated file
        # data of course). Btrfs used to delete some of these xattrs when it replayed
        # its fsync log/journal.
        $XFS_IO_PROG -c "pwrite -S 0xbb 8K 16K" \
                     -c "fsync" \
                     $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Simulate a crash/power loss.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        # Allow writes again and mount. This makes the fs replay its fsync log.
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        echo "File content after crash and log replay:"
        od -t x1 $SCRATCH_MNT/foo
      
        echo "File xattrs after crash and log replay:"
        for ((i = 1; i <= $num_xattrs; i++)); do
            name="user.attr_$(printf "%04d" $i)"
            echo -n "$name="
            $GETFATTR_PROG --absolute-names -n $name --only-values $SCRATCH_MNT/foo
            echo
        done
      
        status=0
        exit
      
      The golden output expects all xattrs to be available, and with the correct
      values, after the fsync log is replayed.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      36283bf7
    • Filipe Manana's avatar
      Btrfs: fix fsync data loss after append write · e4545de5
      Filipe Manana authored
      If we do an append write to a file (which increases its inode's i_size)
      that does not have the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode,
      and the previous transaction added a new hard link to the file, which sets
      the flag BTRFS_INODE_COPY_EVERYTHING in the file's inode, and then fsync
      the file, the inode's new i_size isn't logged. This has the consequence
      that after the fsync log is replayed, the file size remains what it was
      before the append write operation, which means users/applications will
      not be able to read the data that was successsfully fsync'ed before.
      
      This happens because neither the inode item nor the delayed inode get
      their i_size updated when the append write is made - doing so would
      require starting a transaction in the buffered write path, something that
      we do not do intentionally for performance reasons.
      
      Fix this by making sure that when the flag BTRFS_INODE_COPY_EVERYTHING is
      set the inode is logged with its current i_size (log the in-memory inode
      into the log tree).
      
      This issue is not a recent regression and is easy to reproduce with the
      following test case for fstests:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
      
        here=`pwd`
        tmp=/tmp/$$
        status=1	# failure is the default!
      
        _cleanup()
        {
                _cleanup_flakey
                rm -f $tmp.*
        }
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
        . ./common/dmflakey
      
        # real QA test starts here
        _supported_fs generic
        _supported_os Linux
        _need_to_be_root
        _require_scratch
        _require_dm_flakey
        _require_metadata_journaling $SCRATCH_DEV
      
        _crash_and_mount()
        {
                # Simulate a crash/power loss.
                _load_flakey_table $FLAKEY_DROP_WRITES
                _unmount_flakey
                # Allow writes again and mount. This makes the fs replay its fsync log.
                _load_flakey_table $FLAKEY_ALLOW_WRITES
                _mount_flakey
        }
      
        rm -f $seqres.full
      
        _scratch_mkfs >> $seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create the test file with some initial data and then fsync it.
        # The fsync here is only needed to trigger the issue in btrfs, as it causes the
        # the flag BTRFS_INODE_NEEDS_FULL_SYNC to be removed from the btrfs inode.
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" \
                        -c "fsync" \
                        $SCRATCH_MNT/foo | _filter_xfs_io
        sync
      
        # Add a hard link to our file.
        # On btrfs this sets the flag BTRFS_INODE_COPY_EVERYTHING on the btrfs inode,
        # which is a necessary condition to trigger the issue.
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/bar
      
        # Sync the filesystem to force a commit of the current btrfs transaction, this
        # is a necessary condition to trigger the bug on btrfs.
        sync
      
        # Now append more data to our file, increasing its size, and fsync the file.
        # In btrfs because the inode flag BTRFS_INODE_COPY_EVERYTHING was set and the
        # write path did not update the inode item in the btree nor the delayed inode
        # item (in memory struture) in the current transaction (created by the fsync
        # handler), the fsync did not record the inode's new i_size in the fsync
        # log/journal. This made the data unavailable after the fsync log/journal is
        # replayed.
        $XFS_IO_PROG -c "pwrite -S 0xbb 32K 32K" \
                     -c "fsync" \
                     $SCRATCH_MNT/foo | _filter_xfs_io
      
        echo "File content after fsync and before crash:"
        od -t x1 $SCRATCH_MNT/foo
      
        _crash_and_mount
      
        echo "File content after crash and log replay:"
        od -t x1 $SCRATCH_MNT/foo
      
        status=0
        exit
      
      The expected file output before and after the crash/power failure expects the
      appended data to be available, which is:
      
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0100000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
        *
        0200000
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      e4545de5
    • Filipe Manana's avatar
      Btrfs: fix crash on close_ctree() if cleaner starts new transaction · da288d28
      Filipe Manana authored
      Often when running fstests btrfs/079 I was running into the following
      trace during umount on one of my qemu/kvm test vms:
      
      [ 8245.682441] WARNING: CPU: 8 PID: 25064 at fs/btrfs/extent-tree.c:138 btrfs_put_block_group+0x51/0x69 [btrfs]()
      [ 8245.685039] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 acpi_cpufreq processor psmouse i2c_core thermal_sys parport evdev serio_raw button pcspkr microcode ext4 crc16 jbd2 mbcache sg sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata floppy virtio_pci virtio_ring scsi_mod virtio e1000 [last unloaded: btrfs]
      [ 8245.693860] CPU: 8 PID: 25064 Comm: umount Tainted: G        W       4.1.0-rc5-btrfs-next-10+ #1
      [ 8245.695081] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
      [ 8245.697583]  0000000000000009 ffff88020d047ce8 ffffffff8145eec7 ffffffff81095dce
      [ 8245.699234]  0000000000000000 ffff88020d047d28 ffffffff8104b399 0000000000000028
      [ 8245.700995]  ffffffffa04db07b ffff8801c6036c00 ffff8801c6036d68 ffff880202eb40b0
      [ 8245.702510] Call Trace:
      [ 8245.703006]  [<ffffffff8145eec7>] dump_stack+0x4f/0x7b
      [ 8245.705393]  [<ffffffff81095dce>] ? console_unlock+0x356/0x3a2
      [ 8245.706569]  [<ffffffff8104b399>] warn_slowpath_common+0xa1/0xbb
      [ 8245.707747]  [<ffffffffa04db07b>] ? btrfs_put_block_group+0x51/0x69 [btrfs]
      [ 8245.709101]  [<ffffffff8104b456>] warn_slowpath_null+0x1a/0x1c
      [ 8245.710274]  [<ffffffffa04db07b>] btrfs_put_block_group+0x51/0x69 [btrfs]
      [ 8245.711823]  [<ffffffffa04e3473>] btrfs_free_block_groups+0x145/0x322 [btrfs]
      [ 8245.713251]  [<ffffffffa04ef31a>] close_ctree+0x1ef/0x325 [btrfs]
      [ 8245.714448]  [<ffffffff8117d26e>] ? evict_inodes+0xdc/0xeb
      [ 8245.715539]  [<ffffffffa04cb3ad>] btrfs_put_super+0x19/0x1b [btrfs]
      [ 8245.716835]  [<ffffffff81167607>] generic_shutdown_super+0x73/0xef
      [ 8245.718015]  [<ffffffff81167a3a>] kill_anon_super+0x13/0x1e
      [ 8245.719101]  [<ffffffffa04cb1b6>] btrfs_kill_super+0x17/0x23 [btrfs]
      [ 8245.720316]  [<ffffffff81167544>] deactivate_locked_super+0x3b/0x68
      [ 8245.721517]  [<ffffffff81167dd6>] deactivate_super+0x3f/0x43
      [ 8245.722581]  [<ffffffff8117fbb9>] cleanup_mnt+0x59/0x78
      [ 8245.723538]  [<ffffffff8117fc18>] __cleanup_mnt+0x12/0x14
      [ 8245.724572]  [<ffffffff81065371>] task_work_run+0x8f/0xbc
      [ 8245.725598]  [<ffffffff810028fb>] do_notify_resume+0x45/0x53
      [ 8245.726892]  [<ffffffff814651ac>] int_signal+0x12/0x17
      [ 8245.737887] ---[ end trace a01d038397e99b92 ]---
      [ 8245.769363] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [ 8245.770737] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 acpi_cpufreq processor psmouse i2c_core thermal_sys parport evdev serio_raw button pcspkr microcode ext4 crc16 jbd2 mbcache sg sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata floppy virtio_pci virtio_ring scsi_mod virtio e1000 [last unloaded: btrfs]
      [ 8245.772641] CPU: 2 PID: 25064 Comm: umount Tainted: G        W       4.1.0-rc5-btrfs-next-10+ #1
      [ 8245.772641] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
      [ 8245.772641] task: ffff880013005810 ti: ffff88020d044000 task.ti: ffff88020d044000
      [ 8245.772641] RIP: 0010:[<ffffffffa051c8e6>]  [<ffffffffa051c8e6>] btrfs_queue_work+0x2c/0x14d [btrfs]
      [ 8245.772641] RSP: 0018:ffff88020d0478b8  EFLAGS: 00010202
      [ 8245.772641] RAX: 0000000000000004 RBX: 6b6b6b6b6b6b6b6b RCX: ffffffffa0581488
      [ 8245.772641] RDX: 0000000000000000 RSI: ffff880194b7bf48 RDI: ffff880144b6a7a0
      [ 8245.772641] RBP: ffff88020d0478d8 R08: 0000000000000000 R09: 000000000000ffff
      [ 8245.772641] R10: 0000000000000004 R11: 0000000000000005 R12: ffff880194b7bf48
      [ 8245.772641] R13: ffff880194b7bf48 R14: 0000000000000410 R15: 0000000000000000
      [ 8245.772641] FS:  00007f991e77d840(0000) GS:ffff88023e280000(0000) knlGS:0000000000000000
      [ 8245.772641] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [ 8245.772641] CR2: 00007fbbd325ee68 CR3: 000000021de8e000 CR4: 00000000000006e0
      [ 8245.772641] Stack:
      [ 8245.772641]  ffff880194b7bf00 ffff880202eb4000 ffff880194b7bf48 0000000000000410
      [ 8245.772641]  ffff88020d047958 ffffffffa04ec6d5 ffff8801629b2ee8 0000000082987570
      [ 8245.772641]  0000000000a5813f 0000000000000001 ffff880013006100 0000000000000002
      [ 8245.772641] Call Trace:
      [ 8245.772641]  [<ffffffffa04ec6d5>] btrfs_wq_submit_bio+0xe1/0x17b [btrfs]
      [ 8245.772641]  [<ffffffff81086bff>] ? check_irq_usage+0x76/0x87
      [ 8245.772641]  [<ffffffffa04ec825>] btree_submit_bio_hook+0xb6/0xd9 [btrfs]
      [ 8245.772641]  [<ffffffffa04ebb7c>] ? btree_csum_one_bio+0xad/0xad [btrfs]
      [ 8245.772641]  [<ffffffffa04eb1a6>] ? btree_io_failed_hook+0x5e/0x5e [btrfs]
      [ 8245.772641]  [<ffffffffa050a6e7>] submit_one_bio+0x8c/0xc7 [btrfs]
      [ 8245.772641]  [<ffffffffa050d75b>] submit_extent_page.isra.18+0x9d/0x186 [btrfs]
      [ 8245.772641]  [<ffffffffa050d95b>] write_one_eb+0x117/0x1ae [btrfs]
      [ 8245.772641]  [<ffffffffa050a79b>] ? end_extent_buffer_writeback+0x21/0x21 [btrfs]
      [ 8245.772641]  [<ffffffffa0510510>] btree_write_cache_pages+0x2ab/0x385 [btrfs]
      [ 8245.772641]  [<ffffffffa04eb2b8>] btree_writepages+0x23/0x5c [btrfs]
      [ 8245.772641]  [<ffffffff8111c661>] do_writepages+0x23/0x2c
      [ 8245.772641]  [<ffffffff81189cd4>] __writeback_single_inode+0xda/0x5bd
      [ 8245.772641]  [<ffffffff8118aa60>] ? writeback_single_inode+0x2b/0x173
      [ 8245.772641]  [<ffffffff8118aafd>] writeback_single_inode+0xc8/0x173
      [ 8245.772641]  [<ffffffff8118ac95>] write_inode_now+0x8a/0x95
      [ 8245.772641]  [<ffffffff81247bf0>] ? _atomic_dec_and_lock+0x30/0x4e
      [ 8245.772641]  [<ffffffff8117cc5e>] iput+0x17d/0x26a
      [ 8245.772641]  [<ffffffffa04ef355>] close_ctree+0x22a/0x325 [btrfs]
      [ 8245.772641]  [<ffffffff8117d26e>] ? evict_inodes+0xdc/0xeb
      [ 8245.772641]  [<ffffffffa04cb3ad>] btrfs_put_super+0x19/0x1b [btrfs]
      [ 8245.772641]  [<ffffffff81167607>] generic_shutdown_super+0x73/0xef
      [ 8245.772641]  [<ffffffff81167a3a>] kill_anon_super+0x13/0x1e
      [ 8245.772641]  [<ffffffffa04cb1b6>] btrfs_kill_super+0x17/0x23 [btrfs]
      [ 8245.772641]  [<ffffffff81167544>] deactivate_locked_super+0x3b/0x68
      [ 8245.772641]  [<ffffffff81167dd6>] deactivate_super+0x3f/0x43
      [ 8245.772641]  [<ffffffff8117fbb9>] cleanup_mnt+0x59/0x78
      [ 8245.772641]  [<ffffffff8117fc18>] __cleanup_mnt+0x12/0x14
      [ 8245.772641]  [<ffffffff81065371>] task_work_run+0x8f/0xbc
      [ 8245.772641]  [<ffffffff810028fb>] do_notify_resume+0x45/0x53
      [ 8245.772641]  [<ffffffff814651ac>] int_signal+0x12/0x17
      [ 8245.772641] Code: 1f 44 00 00 55 48 89 e5 41 56 41 55 41 54 53 49 89 f4 48 8b 46 70 a8 04 74 09 48 8b 5f 08 48 85 db 75 03 48 8b 1f 49 89 5c 24 68 <83> 7b 5c ff 74 04 f0 ff 43 50 49 83 7c 24 08 00 74 2c 4c 8d 6b
      [ 8245.772641] RIP  [<ffffffffa051c8e6>] btrfs_queue_work+0x2c/0x14d [btrfs]
      [ 8245.772641]  RSP <ffff88020d0478b8>
      [ 8245.845040] ---[ end trace a01d038397e99b93 ]---
      
      For logical reasons such as the phase of the moon, this happened more
      often with "-o inode_cache" than without any mount options.
      
      After some debugging it turned out to be simple to understand what was
      happening:
      
      1) close_ctree() is called;
      
      2) It then stops the transaction kthread, which commits the current
         transaction;
      
      3) It asks the cleaner kthread to stop, which is currently running
         btrfs_delete_unused_bgs();
      
      4) btrfs_delete_unused_bgs() finds an unused block group, starts a new
         transaction, deletes the block group, which implies COWing some
         tree nodes and leafs and dirtying their respective pages, and then
         finally it ends the transaction it started, without committing it;
      
      5) The cleaner kthread stops;
      
      6) close_ctree() releases (from memory) the block group objects, which
         produces the warning in the trace pasted above;
      
      7) Then it invalidates all pages of the btree inode, by calling
         invalidate_inode_pages2(), which waits for any pages under writeback,
         and releases any non-dirty pages;
      
      8) All work queues are destroyed (waiting first for their current tasks
         to finish execution);
      
      9) A final iput() is called against the btree inode;
      
      10) This iput triggers a writeback of the btree inode because it still
          has dirty pages;
      
      11) This starts the whole chain of callbacks for the btree inode until
          it eventually reaches btrfs_wq_submit_bio() where it leads to a
          NULL pointer dereference because the work queues were already
          destroyed.
      
      Fix this by making the cleaner commit any transaction that it started
      after the transaction kthread was stopped.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      da288d28
    • Filipe Manana's avatar
      Btrfs: fix race between caching kthread and returning inode to inode cache · ae9d8f17
      Filipe Manana authored
      While the inode cache caching kthread is calling btrfs_unpin_free_ino(),
      we could have a concurrent call to btrfs_return_ino() that adds a new
      entry to the root's free space cache of pinned inodes. This concurrent
      call does not acquire the fs_info->commit_root_sem before adding a new
      entry if the caching state is BTRFS_CACHE_FINISHED, which is a problem
      because the caching kthread calls btrfs_unpin_free_ino() after setting
      the caching state to BTRFS_CACHE_FINISHED and therefore races with
      the task calling btrfs_return_ino(), which is adding a new entry, while
      the former (caching kthread) is navigating the cache's rbtree, removing
      and freeing nodes from the cache's rbtree without acquiring the spinlock
      that protects the rbtree.
      
      This race resulted in memory corruption due to double free of struct
      btrfs_free_space objects because both tasks can end up doing freeing the
      same objects. Note that adding a new entry can result in merging it with
      other entries in the cache, in which case those entries are freed.
      This is particularly important as btrfs_free_space structures are also
      used for the block group free space caches.
      
      This memory corruption can be detected by a debugging kernel, which
      reports it with the following trace:
      
      [132408.501148] slab error in verify_redzone_free(): cache `btrfs_free_space': double free detected
      [132408.505075] CPU: 15 PID: 12248 Comm: btrfs-ino-cache Tainted: G        W       4.1.0-rc5-btrfs-next-10+ #1
      [132408.505075] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
      [132408.505075]  ffff880023e7d320 ffff880163d73cd8 ffffffff8145eec7 ffffffff81095dce
      [132408.505075]  ffff880009735d40 ffff880163d73ce8 ffffffff81154e1e ffff880163d73d68
      [132408.505075]  ffffffff81155733 ffffffffa054a95a ffff8801b6099f00 ffffffffa0505b5f
      [132408.505075] Call Trace:
      [132408.505075]  [<ffffffff8145eec7>] dump_stack+0x4f/0x7b
      [132408.505075]  [<ffffffff81095dce>] ? console_unlock+0x356/0x3a2
      [132408.505075]  [<ffffffff81154e1e>] __slab_error.isra.28+0x25/0x36
      [132408.505075]  [<ffffffff81155733>] __cache_free+0xe2/0x4b6
      [132408.505075]  [<ffffffffa054a95a>] ? __btrfs_add_free_space+0x2f0/0x343 [btrfs]
      [132408.505075]  [<ffffffffa0505b5f>] ? btrfs_unpin_free_ino+0x8e/0x99 [btrfs]
      [132408.505075]  [<ffffffff810f3b30>] ? time_hardirqs_off+0x15/0x28
      [132408.505075]  [<ffffffff81084d42>] ? trace_hardirqs_off+0xd/0xf
      [132408.505075]  [<ffffffff811563a1>] ? kfree+0xb6/0x14e
      [132408.505075]  [<ffffffff811563d0>] kfree+0xe5/0x14e
      [132408.505075]  [<ffffffffa0505b5f>] btrfs_unpin_free_ino+0x8e/0x99 [btrfs]
      [132408.505075]  [<ffffffffa0505e08>] caching_kthread+0x29e/0x2d9 [btrfs]
      [132408.505075]  [<ffffffffa0505b6a>] ? btrfs_unpin_free_ino+0x99/0x99 [btrfs]
      [132408.505075]  [<ffffffff8106698f>] kthread+0xef/0xf7
      [132408.505075]  [<ffffffff810f3b08>] ? time_hardirqs_on+0x15/0x28
      [132408.505075]  [<ffffffff810668a0>] ? __kthread_parkme+0xad/0xad
      [132408.505075]  [<ffffffff814653d2>] ret_from_fork+0x42/0x70
      [132408.505075]  [<ffffffff810668a0>] ? __kthread_parkme+0xad/0xad
      [132408.505075] ffff880023e7d320: redzone 1:0x9f911029d74e35b, redzone 2:0x9f911029d74e35b.
      [132409.501654] slab: double free detected in cache 'btrfs_free_space', objp ffff880023e7d320
      [132409.503355] ------------[ cut here ]------------
      [132409.504241] kernel BUG at mm/slab.c:2571!
      
      Therefore fix this by having btrfs_unpin_free_ino() acquire the lock
      that protects the rbtree while doing the searches and removing entries.
      
      Fixes: 1c70d8fb ("Btrfs: fix inode caching vs tree log")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      ae9d8f17
    • Filipe Manana's avatar
      Btrfs: use kmem_cache_free when freeing entry in inode cache · c3f4a168
      Filipe Manana authored
      The free space entries are allocated using kmem_cache_zalloc(),
      through __btrfs_add_free_space(), therefore we should use
      kmem_cache_free() and not kfree() to avoid any confusion and
      any potential problem. Looking at the kfree() definition at
      mm/slab.c it has the following comment:
      
        /*
         * (...)
         *
         * Don't free memory not originally allocated by kmalloc()
         * or you will run into trouble.
         */
      
      So better be safe and use kmem_cache_free().
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      c3f4a168
    • Filipe Manana's avatar
      Btrfs: fix race between balance and unused block group deletion · 67c5e7d4
      Filipe Manana authored
      We have a race between deleting an unused block group and balancing the
      same block group that leads to an assertion failure/BUG(), producing the
      following trace:
      
      [181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
      [181631.220591] ------------[ cut here ]------------
      [181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
      [181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
      [181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G        W       4.1.0-rc5-btrfs-next-10+ #1
      [181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
      [181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
      [181631.224566] RIP: 0010:[<ffffffffa03f19f6>]  [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
      [181631.224566] RSP: 0018:ffff8800b5827ae8  EFLAGS: 00010246
      [181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
      [181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
      [181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
      [181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
      [181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
      [181631.224566] FS:  00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
      [181631.224566] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
      [181631.224566] Stack:
      [181631.224566]  ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
      [181631.224566]  ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
      [181631.224566]  ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
      [181631.224566] Call Trace:
      [181631.224566]  [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
      [181631.224566]  [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
      [181631.224566]  [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
      [181631.224566]  [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
      [181631.224566]  [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
      [181631.224566]  [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
      [181631.224566]  [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
      [181631.224566]  [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
      [181631.224566]  [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
      [181631.224566]  [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
      [181631.224566]  [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
      [181631.224566]  [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
      [181631.224566]  [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
      [181631.224566]  [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
      (...)
      
      The sequence of steps leading to this are:
      
                 CPU 0                                         CPU 1
      
        btrfs_balance()
          btrfs_relocate_chunk()
      
            btrfs_relocate_block_group(bg X)
              btrfs_lookup_block_group(bg X)
      
                                                     cleaner_kthread
                                                        locks fs_info->cleaner_mutex
      
                                                        btrfs_delete_unused_bgs()
                                                          finds bg X, which became
                                                          unused in the previous
                                                          transaction
      
                                                          checks bg X ->ro == 0,
                                                          so it proceeds
              sets bg X ->ro to 1
              (btrfs_set_block_group_ro(bg X))
      
              blocks on fs_info->cleaner_mutex
                                                          btrfs_remove_chunk(bg X)
                                                        unlocks fs_info->cleaner_mutex
      
              acquires fs_info->cleaner_mutex
              relocate_block_group()
                --> does nothing, no extents found in
                    the extent tree from bg X
              unlocks fs_info->cleaner_mutex
      
            btrfs_relocate_block_group(bg X) returns
      
          btrfs_remove_chunk(bg X)
             extent map not found
                --> ASSERT(0)
      
      Fix this by using a new mutex to make sure these 2 operations, block
      group relocation and removal, are serialized.
      
      This issue is reproducible by running fstests generic/038 (which stresses
      chunk allocation and automatic removal of unused block groups) together
      with the following balance loop:
      
          while true; do btrfs balance start -dusage=0 <mountpoint> ; done
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      67c5e7d4
    • Zhao Lei's avatar
      btrfs: add error handling for scrub_workers_get() · e82afc52
      Zhao Lei authored
      Although it is a rare case, we'd better free previous allocated
      memory on error.
      Signed-off-by: default avatarZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      e82afc52
    • Zhao Lei's avatar
      btrfs: cleanup noused initialization of dev in btrfs_end_bio() · 65f53338
      Zhao Lei authored
      It is introduced by:
       c404e0dc
       Btrfs: fix use-after-free in the finishing procedure of the device replace
      
      But seems no relationship with that bug, this patch revirt these
      code block for cleanup.
      Signed-off-by: default avatarZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      65f53338
    • Yang Dongsheng's avatar
      btrfs: qgroup: allow user to clear the limitation on qgroup · fe759907
      Yang Dongsheng authored
      Currently, we can only set a limitation on a qgroup, but we
      can not clear it.
      
      This patch provide a choice to user to clear a limitation on
      qgroup by passing a value of CLEAR_VALUE(-1) to kernel.
      Reported-by: default avatarTsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: default avatarDongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Tested-by: default avatarTsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      fe759907
  7. 24 Jun, 2015 1 commit
  8. 23 Jun, 2015 1 commit
  9. 22 Jun, 2015 1 commit
  10. 19 Jun, 2015 3 commits
  11. 12 Jun, 2015 2 commits