1. 02 Jul, 2015 3 commits
    • btrfs: fix deadlock with extent-same and readpage · f4414602
      Mark Fasheh authored
      ->readpage() does page_lock() before extent_lock(); we do the opposite in
      extent-same. We want to reverse the order in btrfs_extent_same(), but it's
      not quite straightforward since the page locks are taken inside btrfs_cmp_data().
      
      So I split btrfs_cmp_data() into 3 parts with a small context structure that
      is passed between them. The first, btrfs_cmp_data_prepare() gathers up the
      pages needed (taking page lock as required) and puts them on our context
      structure. At this point, we are safe to lock the extent range. Afterwards,
      we use btrfs_cmp_data() to do the data compare as usual and btrfs_cmp_data_free()
      to clean up our context.
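
      The ordering rule itself is generic. A minimal userspace sketch of the idea
      (plain pthreads, not the btrfs code; the "page" and "extent" locks are just
      stand-ins) shows both paths taking the locks in the same order, which is what
      rules out the ABBA deadlock:

        #include <pthread.h>
        #include <stdio.h>

        static pthread_mutex_t page_lock   = PTHREAD_MUTEX_INITIALIZER;
        static pthread_mutex_t extent_lock = PTHREAD_MUTEX_INITIALIZER;

        /* Mirrors ->readpage(): page lock first, then the extent lock. */
        static void *readpage_like(void *arg)
        {
                (void)arg;
                pthread_mutex_lock(&page_lock);
                pthread_mutex_lock(&extent_lock);
                /* ... fill the page from the extent ... */
                pthread_mutex_unlock(&extent_lock);
                pthread_mutex_unlock(&page_lock);
                return NULL;
        }

        /* Mirrors the fixed extent-same path: same order as above. Taking
         * extent_lock first here is the old behaviour that could deadlock. */
        static void *extent_same_like(void *arg)
        {
                (void)arg;
                pthread_mutex_lock(&page_lock);
                pthread_mutex_lock(&extent_lock);
                /* ... compare the data ... */
                pthread_mutex_unlock(&extent_lock);
                pthread_mutex_unlock(&page_lock);
                return NULL;
        }

        int main(void)
        {
                pthread_t a, b;

                pthread_create(&a, NULL, readpage_like, NULL);
                pthread_create(&b, NULL, extent_same_like, NULL);
                pthread_join(a, NULL);
                pthread_join(b, NULL);
                puts("both paths honour the same lock order, no deadlock");
                return 0;
        }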
      Signed-off-by: Mark Fasheh <mfasheh@suse.de>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
      f4414602
    • btrfs: pass unaligned length to btrfs_cmp_data() · 207910dd
      Mark Fasheh authored
      In the case that we dedupe the tail of a file, we might expand the dedupe
      len out to the end of our last block. We don't want to compare data past
      i_size however, so pass the original length to btrfs_cmp_data().
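
      A tiny standalone illustration of the length handling (ordinary C; the block
      size and numbers are chosen to match the 61K example in the test case above)
      makes the distinction explicit: the block-aligned length is what gets locked,
      while the original, unaligned length is what gets compared:

        #include <stdio.h>
        #include <stdint.h>

        #define BLOCK_SIZE 4096ULL

        /* Round a length up to block granularity, as done for locking/IO. */
        static uint64_t block_aligned(uint64_t len)
        {
                return (len + BLOCK_SIZE - 1) / BLOCK_SIZE * BLOCK_SIZE;
        }

        int main(void)
        {
                uint64_t olen = 61 * 1024;                 /* unaligned tail of the file */
                uint64_t lock_len = block_aligned(olen);   /* 64K: lock/IO length */
                uint64_t cmp_len = olen;                   /* 61K: never compare past i_size */

                printf("lock %llu bytes, compare only %llu bytes\n",
                       (unsigned long long)lock_len,
                       (unsigned long long)cmp_len);
                return 0;
        }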
      Signed-off-by: Mark Fasheh <mfasheh@suse.de>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
      207910dd
    • Btrfs: fix fsync after truncate when no_holes feature is enabled · a89ca6f2
      Filipe Manana authored
      When the no_holes feature is enabled, if we truncate a file to a smaller
      size, truncate it again to a size greater than or equal to its original
      size, and then fsync it, the log tree will not have any information about
      the hole covering the range [truncate_1_offset, new_file_size[. This means
      that if the fsync log is replayed, the file will remain in the state it had
      before both truncate operations.
      
      Without the no_holes feature this does not happen, since when the inode
      is logged (full sync flag is set) it will find in the fs/subvol tree a
      leaf with a generation matching the current transaction id that has an
      explicit extent item representing the hole.
      
      Fix this by adding an explicit extent item representing a hole between
      the last extent and the inode's i_size if we are doing a full sync.
      
      The issue is easy to reproduce with the following test case for fstests:
      
        . ./common/rc
        . ./common/filter
        . ./common/dmflakey
      
        _need_to_be_root
        _supported_fs generic
        _supported_os Linux
        _require_scratch
        _require_dm_flakey
      
        # This test was motivated by an issue found in btrfs when the btrfs
        # no-holes feature is enabled (introduced in kernel 3.14). So enable
        # the feature if the fs being tested is btrfs.
        if [ $FSTYP == "btrfs" ]; then
            _require_btrfs_fs_feature "no_holes"
            _require_btrfs_mkfs_feature "no-holes"
            MKFS_OPTIONS="$MKFS_OPTIONS -O no-holes"
        fi
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create our test files and make sure everything is durably persisted.
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 64K"         \
                        -c "pwrite -S 0xbb 64K 61K"       \
                        $SCRATCH_MNT/foo | _filter_xfs_io
        $XFS_IO_PROG -f -c "pwrite -S 0xee 0 64K"         \
                        -c "pwrite -S 0xff 64K 61K"       \
                        $SCRATCH_MNT/bar | _filter_xfs_io
        sync
      
        # Now truncate our file foo to a smaller size (64Kb) and then truncate
        # it to the size it had before the shrinking truncate (125Kb). Then
        # fsync our file. If a power failure happens after the fsync, we expect
        # our file to have a size of 125Kb, with the first 64Kb of data having
        # the value 0xaa and the second 61Kb of data having the value 0x00.
        $XFS_IO_PROG -c "truncate 64K" \
                     -c "truncate 125K" \
                     -c "fsync" \
                     $SCRATCH_MNT/foo
      
        # Do something similar to our file bar, but the first truncation sets
        # the file size to 0 and the second truncation expands the size to the
        # double of what it was initially.
        $XFS_IO_PROG -c "truncate 0" \
                     -c "truncate 253K" \
                     -c "fsync" \
                     $SCRATCH_MNT/bar
      
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        # Allow writes again, mount to trigger log replay and validate file
        # contents.
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        # We expect foo to have a size of 125Kb, the first 64Kb of data all
        # having the value 0xaa and the remaining 61Kb to be a hole (all bytes
        # with value 0x00).
        echo "File foo content after log replay:"
        od -t x1 $SCRATCH_MNT/foo
      
        # We expect bar to have a size of 253Kb and no extents (any byte read
        # from bar has the value 0x00).
        echo "File bar content after log replay:"
        od -t x1 $SCRATCH_MNT/bar
      
        status=0
        exit
      
      The expected file contents in the golden output are:
      
        File foo content after log replay:
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0200000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        *
        0372000
        File bar content after log replay:
        0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        *
        0772000
      
      Without this fix, their contents are:
      
        File foo content after log replay:
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0200000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
        *
        0372000
        File bar content after log replay:
        0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
        *
        0200000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
        *
        0372000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        *
        0772000
      
      A test case submission for fstests follows soon.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      a89ca6f2
  2. 30 Jun, 2015 9 commits
    • Btrfs: fix fsync xattr loss in the fast fsync path · 36283bf7
      Filipe Manana authored
      After commit 4f764e51 ("Btrfs: remove deleted xattrs on fsync log
      replay"), we can end up in a situation where, during log replay, we delete
      xattrs that were never deleted when their file was last fsynced.
      
      This happens in the fast fsync path (flag BTRFS_INODE_NEEDS_FULL_SYNC is
      not set in the inode) if the inode has the flag BTRFS_INODE_COPY_EVERYTHING
      set, the xattr was added in a past transaction and the leaf where the
      xattr is located was not updated (COWed or created) in the current
      transaction. In this scenario the xattr item never ends up in the log
      tree, so at log replay time the replay code deletes the xattr from the
      fs/subvol tree, thinking the xattr was deleted prior to the last fsync.
      
      Fix this by always logging all xattrs, which is the simplest and most
      reliable way to detect deleted xattrs and replay the deletes at log replay
      time.
      
      This issue is reproducible with the following test case for fstests:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
      
        here=`pwd`
        tmp=/tmp/$$
        status=1	# failure is the default!
      
        _cleanup()
        {
            _cleanup_flakey
            rm -f $tmp.*
        }
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
        . ./common/dmflakey
        . ./common/attr
      
        # real QA test starts here
      
        # We create a lot of xattrs for a single file. Only btrfs and xfs are currently
        # able to store such a large amount of xattrs per file; other filesystems such
        # as ext3/4 and f2fs, for example, fail with ENOSPC even if we attempt to add
        # less than 1000 xattrs with very small values.
        _supported_fs btrfs xfs
        _supported_os Linux
        _need_to_be_root
        _require_scratch
        _require_dm_flakey
        _require_attrs
        _require_metadata_journaling $SCRATCH_DEV
      
        rm -f $seqres.full
      
        _scratch_mkfs >> $seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create the test file with some initial data and make sure everything is
        # durably persisted.
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" $SCRATCH_MNT/foo | _filter_xfs_io
        sync
      
        # Add many small xattrs to our file.
        # We create such a large amount because it's needed to trigger the issue found
        # in btrfs - we need to have an amount that causes the fs to have at least 3
        # btree leafs with xattrs stored in them, and it must work on any leaf size
        # (maximum leaf/node size is 64Kb).
        num_xattrs=2000
        for ((i = 1; i <= $num_xattrs; i++)); do
            name="user.attr_$(printf "%04d" $i)"
            $SETFATTR_PROG -n $name -v "val_$(printf "%04d" $i)" $SCRATCH_MNT/foo
        done
      
        # Sync the filesystem to force a commit of the current btrfs transaction, this
        # is a necessary condition to trigger the bug on btrfs.
        sync
      
        # Now update our file's data and fsync the file.
        # After a successful fsync, if the fsync log/journal is replayed we expect to
        # see all the xattrs we added before with the same values (and the updated file
        # data of course). Btrfs used to delete some of these xattrs when it replayed
        # its fsync log/journal.
        $XFS_IO_PROG -c "pwrite -S 0xbb 8K 16K" \
                     -c "fsync" \
                     $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Simulate a crash/power loss.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        # Allow writes again and mount. This makes the fs replay its fsync log.
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        echo "File content after crash and log replay:"
        od -t x1 $SCRATCH_MNT/foo
      
        echo "File xattrs after crash and log replay:"
        for ((i = 1; i <= $num_xattrs; i++)); do
            name="user.attr_$(printf "%04d" $i)"
            echo -n "$name="
            $GETFATTR_PROG --absolute-names -n $name --only-values $SCRATCH_MNT/foo
            echo
        done
      
        status=0
        exit
      
      The golden output expects all xattrs to be available, and with the correct
      values, after the fsync log is replayed.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      36283bf7
    • Btrfs: fix fsync data loss after append write · e4545de5
      Filipe Manana authored
      If we do an append write to a file (which increases its inode's i_size)
      that does not have the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode,
      and the previous transaction added a new hard link to the file, which sets
      the flag BTRFS_INODE_COPY_EVERYTHING in the file's inode, and we then fsync
      the file, the inode's new i_size isn't logged. As a consequence, after the
      fsync log is replayed, the file size remains what it was before the append
      write operation, which means users/applications will not be able to read
      the data that was successfully fsync'ed before.
      
      This happens because neither the inode item nor the delayed inode get
      their i_size updated when the append write is made - doing so would
      require starting a transaction in the buffered write path, something that
      we do not do intentionally for performance reasons.
      
      Fix this by making sure that when the flag BTRFS_INODE_COPY_EVERYTHING is
      set the inode is logged with its current i_size (log the in-memory inode
      into the log tree).
      
      This issue is not a recent regression and is easy to reproduce with the
      following test case for fstests:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
      
        here=`pwd`
        tmp=/tmp/$$
        status=1	# failure is the default!
      
        _cleanup()
        {
                _cleanup_flakey
                rm -f $tmp.*
        }
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
        . ./common/dmflakey
      
        # real QA test starts here
        _supported_fs generic
        _supported_os Linux
        _need_to_be_root
        _require_scratch
        _require_dm_flakey
        _require_metadata_journaling $SCRATCH_DEV
      
        _crash_and_mount()
        {
                # Simulate a crash/power loss.
                _load_flakey_table $FLAKEY_DROP_WRITES
                _unmount_flakey
                # Allow writes again and mount. This makes the fs replay its fsync log.
                _load_flakey_table $FLAKEY_ALLOW_WRITES
                _mount_flakey
        }
      
        rm -f $seqres.full
      
        _scratch_mkfs >> $seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create the test file with some initial data and then fsync it.
        # The fsync here is only needed to trigger the issue in btrfs, as it causes
        # the flag BTRFS_INODE_NEEDS_FULL_SYNC to be removed from the btrfs inode.
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" \
                        -c "fsync" \
                        $SCRATCH_MNT/foo | _filter_xfs_io
        sync
      
        # Add a hard link to our file.
        # On btrfs this sets the flag BTRFS_INODE_COPY_EVERYTHING on the btrfs inode,
        # which is a necessary condition to trigger the issue.
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/bar
      
        # Sync the filesystem to force a commit of the current btrfs transaction, this
        # is a necessary condition to trigger the bug on btrfs.
        sync
      
        # Now append more data to our file, increasing its size, and fsync the file.
        # In btrfs because the inode flag BTRFS_INODE_COPY_EVERYTHING was set and the
        # write path did not update the inode item in the btree nor the delayed inode
        # item (in-memory structure) in the current transaction (created by the fsync
        # handler), the fsync did not record the inode's new i_size in the fsync
        # log/journal. This made the data unavailable after the fsync log/journal is
        # replayed.
        $XFS_IO_PROG -c "pwrite -S 0xbb 32K 32K" \
                     -c "fsync" \
                     $SCRATCH_MNT/foo | _filter_xfs_io
      
        echo "File content after fsync and before crash:"
        od -t x1 $SCRATCH_MNT/foo
      
        _crash_and_mount
      
        echo "File content after crash and log replay:"
        od -t x1 $SCRATCH_MNT/foo
      
        status=0
        exit
      
      The expected file output, before and after the crash/power failure, shows the
      appended data available:
      
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0100000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
        *
        0200000
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      e4545de5
    • Btrfs: fix crash on close_ctree() if cleaner starts new transaction · da288d28
      Filipe Manana authored
      When running fstests btrfs/079 I often ran into the following trace during
      umount on one of my qemu/kvm test VMs:
      
      [ 8245.682441] WARNING: CPU: 8 PID: 25064 at fs/btrfs/extent-tree.c:138 btrfs_put_block_group+0x51/0x69 [btrfs]()
      [ 8245.685039] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 acpi_cpufreq processor psmouse i2c_core thermal_sys parport evdev serio_raw button pcspkr microcode ext4 crc16 jbd2 mbcache sg sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata floppy virtio_pci virtio_ring scsi_mod virtio e1000 [last unloaded: btrfs]
      [ 8245.693860] CPU: 8 PID: 25064 Comm: umount Tainted: G        W       4.1.0-rc5-btrfs-next-10+ #1
      [ 8245.695081] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
      [ 8245.697583]  0000000000000009 ffff88020d047ce8 ffffffff8145eec7 ffffffff81095dce
      [ 8245.699234]  0000000000000000 ffff88020d047d28 ffffffff8104b399 0000000000000028
      [ 8245.700995]  ffffffffa04db07b ffff8801c6036c00 ffff8801c6036d68 ffff880202eb40b0
      [ 8245.702510] Call Trace:
      [ 8245.703006]  [<ffffffff8145eec7>] dump_stack+0x4f/0x7b
      [ 8245.705393]  [<ffffffff81095dce>] ? console_unlock+0x356/0x3a2
      [ 8245.706569]  [<ffffffff8104b399>] warn_slowpath_common+0xa1/0xbb
      [ 8245.707747]  [<ffffffffa04db07b>] ? btrfs_put_block_group+0x51/0x69 [btrfs]
      [ 8245.709101]  [<ffffffff8104b456>] warn_slowpath_null+0x1a/0x1c
      [ 8245.710274]  [<ffffffffa04db07b>] btrfs_put_block_group+0x51/0x69 [btrfs]
      [ 8245.711823]  [<ffffffffa04e3473>] btrfs_free_block_groups+0x145/0x322 [btrfs]
      [ 8245.713251]  [<ffffffffa04ef31a>] close_ctree+0x1ef/0x325 [btrfs]
      [ 8245.714448]  [<ffffffff8117d26e>] ? evict_inodes+0xdc/0xeb
      [ 8245.715539]  [<ffffffffa04cb3ad>] btrfs_put_super+0x19/0x1b [btrfs]
      [ 8245.716835]  [<ffffffff81167607>] generic_shutdown_super+0x73/0xef
      [ 8245.718015]  [<ffffffff81167a3a>] kill_anon_super+0x13/0x1e
      [ 8245.719101]  [<ffffffffa04cb1b6>] btrfs_kill_super+0x17/0x23 [btrfs]
      [ 8245.720316]  [<ffffffff81167544>] deactivate_locked_super+0x3b/0x68
      [ 8245.721517]  [<ffffffff81167dd6>] deactivate_super+0x3f/0x43
      [ 8245.722581]  [<ffffffff8117fbb9>] cleanup_mnt+0x59/0x78
      [ 8245.723538]  [<ffffffff8117fc18>] __cleanup_mnt+0x12/0x14
      [ 8245.724572]  [<ffffffff81065371>] task_work_run+0x8f/0xbc
      [ 8245.725598]  [<ffffffff810028fb>] do_notify_resume+0x45/0x53
      [ 8245.726892]  [<ffffffff814651ac>] int_signal+0x12/0x17
      [ 8245.737887] ---[ end trace a01d038397e99b92 ]---
      [ 8245.769363] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [ 8245.770737] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 acpi_cpufreq processor psmouse i2c_core thermal_sys parport evdev serio_raw button pcspkr microcode ext4 crc16 jbd2 mbcache sg sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata floppy virtio_pci virtio_ring scsi_mod virtio e1000 [last unloaded: btrfs]
      [ 8245.772641] CPU: 2 PID: 25064 Comm: umount Tainted: G        W       4.1.0-rc5-btrfs-next-10+ #1
      [ 8245.772641] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
      [ 8245.772641] task: ffff880013005810 ti: ffff88020d044000 task.ti: ffff88020d044000
      [ 8245.772641] RIP: 0010:[<ffffffffa051c8e6>]  [<ffffffffa051c8e6>] btrfs_queue_work+0x2c/0x14d [btrfs]
      [ 8245.772641] RSP: 0018:ffff88020d0478b8  EFLAGS: 00010202
      [ 8245.772641] RAX: 0000000000000004 RBX: 6b6b6b6b6b6b6b6b RCX: ffffffffa0581488
      [ 8245.772641] RDX: 0000000000000000 RSI: ffff880194b7bf48 RDI: ffff880144b6a7a0
      [ 8245.772641] RBP: ffff88020d0478d8 R08: 0000000000000000 R09: 000000000000ffff
      [ 8245.772641] R10: 0000000000000004 R11: 0000000000000005 R12: ffff880194b7bf48
      [ 8245.772641] R13: ffff880194b7bf48 R14: 0000000000000410 R15: 0000000000000000
      [ 8245.772641] FS:  00007f991e77d840(0000) GS:ffff88023e280000(0000) knlGS:0000000000000000
      [ 8245.772641] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [ 8245.772641] CR2: 00007fbbd325ee68 CR3: 000000021de8e000 CR4: 00000000000006e0
      [ 8245.772641] Stack:
      [ 8245.772641]  ffff880194b7bf00 ffff880202eb4000 ffff880194b7bf48 0000000000000410
      [ 8245.772641]  ffff88020d047958 ffffffffa04ec6d5 ffff8801629b2ee8 0000000082987570
      [ 8245.772641]  0000000000a5813f 0000000000000001 ffff880013006100 0000000000000002
      [ 8245.772641] Call Trace:
      [ 8245.772641]  [<ffffffffa04ec6d5>] btrfs_wq_submit_bio+0xe1/0x17b [btrfs]
      [ 8245.772641]  [<ffffffff81086bff>] ? check_irq_usage+0x76/0x87
      [ 8245.772641]  [<ffffffffa04ec825>] btree_submit_bio_hook+0xb6/0xd9 [btrfs]
      [ 8245.772641]  [<ffffffffa04ebb7c>] ? btree_csum_one_bio+0xad/0xad [btrfs]
      [ 8245.772641]  [<ffffffffa04eb1a6>] ? btree_io_failed_hook+0x5e/0x5e [btrfs]
      [ 8245.772641]  [<ffffffffa050a6e7>] submit_one_bio+0x8c/0xc7 [btrfs]
      [ 8245.772641]  [<ffffffffa050d75b>] submit_extent_page.isra.18+0x9d/0x186 [btrfs]
      [ 8245.772641]  [<ffffffffa050d95b>] write_one_eb+0x117/0x1ae [btrfs]
      [ 8245.772641]  [<ffffffffa050a79b>] ? end_extent_buffer_writeback+0x21/0x21 [btrfs]
      [ 8245.772641]  [<ffffffffa0510510>] btree_write_cache_pages+0x2ab/0x385 [btrfs]
      [ 8245.772641]  [<ffffffffa04eb2b8>] btree_writepages+0x23/0x5c [btrfs]
      [ 8245.772641]  [<ffffffff8111c661>] do_writepages+0x23/0x2c
      [ 8245.772641]  [<ffffffff81189cd4>] __writeback_single_inode+0xda/0x5bd
      [ 8245.772641]  [<ffffffff8118aa60>] ? writeback_single_inode+0x2b/0x173
      [ 8245.772641]  [<ffffffff8118aafd>] writeback_single_inode+0xc8/0x173
      [ 8245.772641]  [<ffffffff8118ac95>] write_inode_now+0x8a/0x95
      [ 8245.772641]  [<ffffffff81247bf0>] ? _atomic_dec_and_lock+0x30/0x4e
      [ 8245.772641]  [<ffffffff8117cc5e>] iput+0x17d/0x26a
      [ 8245.772641]  [<ffffffffa04ef355>] close_ctree+0x22a/0x325 [btrfs]
      [ 8245.772641]  [<ffffffff8117d26e>] ? evict_inodes+0xdc/0xeb
      [ 8245.772641]  [<ffffffffa04cb3ad>] btrfs_put_super+0x19/0x1b [btrfs]
      [ 8245.772641]  [<ffffffff81167607>] generic_shutdown_super+0x73/0xef
      [ 8245.772641]  [<ffffffff81167a3a>] kill_anon_super+0x13/0x1e
      [ 8245.772641]  [<ffffffffa04cb1b6>] btrfs_kill_super+0x17/0x23 [btrfs]
      [ 8245.772641]  [<ffffffff81167544>] deactivate_locked_super+0x3b/0x68
      [ 8245.772641]  [<ffffffff81167dd6>] deactivate_super+0x3f/0x43
      [ 8245.772641]  [<ffffffff8117fbb9>] cleanup_mnt+0x59/0x78
      [ 8245.772641]  [<ffffffff8117fc18>] __cleanup_mnt+0x12/0x14
      [ 8245.772641]  [<ffffffff81065371>] task_work_run+0x8f/0xbc
      [ 8245.772641]  [<ffffffff810028fb>] do_notify_resume+0x45/0x53
      [ 8245.772641]  [<ffffffff814651ac>] int_signal+0x12/0x17
      [ 8245.772641] Code: 1f 44 00 00 55 48 89 e5 41 56 41 55 41 54 53 49 89 f4 48 8b 46 70 a8 04 74 09 48 8b 5f 08 48 85 db 75 03 48 8b 1f 49 89 5c 24 68 <83> 7b 5c ff 74 04 f0 ff 43 50 49 83 7c 24 08 00 74 2c 4c 8d 6b
      [ 8245.772641] RIP  [<ffffffffa051c8e6>] btrfs_queue_work+0x2c/0x14d [btrfs]
      [ 8245.772641]  RSP <ffff88020d0478b8>
      [ 8245.845040] ---[ end trace a01d038397e99b93 ]---
      
      For logical reasons such as the phase of the moon, this happened more
      often with "-o inode_cache" than without any mount options.
      
      After some debugging it turned out to be simple to understand what was
      happening:
      
      1) close_ctree() is called;
      
      2) It then stops the transaction kthread, which commits the current
         transaction;
      
      3) It asks the cleaner kthread to stop, which is currently running
         btrfs_delete_unused_bgs();
      
      4) btrfs_delete_unused_bgs() finds an unused block group, starts a new
         transaction, deletes the block group, which implies COWing some
         tree nodes and leafs and dirtying their respective pages, and then
         finally it ends the transaction it started, without committing it;
      
      5) The cleaner kthread stops;
      
      6) close_ctree() releases (from memory) the block group objects, which
         produces the warning in the trace pasted above;
      
      7) Then it invalidates all pages of the btree inode, by calling
         invalidate_inode_pages2(), which waits for any pages under writeback,
         and releases any non-dirty pages;
      
      8) All work queues are destroyed (waiting first for their current tasks
         to finish execution);
      
      9) A final iput() is called against the btree inode;
      
      10) This iput triggers a writeback of the btree inode because it still
          has dirty pages;
      
      11) This starts the whole chain of callbacks for the btree inode until
          it eventually reaches btrfs_wq_submit_bio() where it leads to a
          NULL pointer dereference because the work queues were already
          destroyed.
      
      Fix this by making the cleaner commit any transaction that it started
      after the transaction kthread was stopped.
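
      One way to picture the fix (a sketch only, based on the description above;
      the exact call site and error handling in the real patch may differ) is the
      cleaner attaching to any still-open transaction on its way out and
      committing it:

        /* Cleaner kthread exit path (sketch): if btrfs_delete_unused_bgs()
         * left a transaction open after the transaction kthread has already
         * stopped, commit it ourselves so the btree inode has no dirty pages
         * left when close_ctree() does its final iput(). */
        trans = btrfs_attach_transaction(root);
        if (IS_ERR(trans)) {
                if (PTR_ERR(trans) != -ENOENT)
                        pr_err("BTRFS: cleaner could not attach to transaction\n");
        } else {
                btrfs_commit_transaction(trans, root);
        }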
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      da288d28
    • Btrfs: fix race between caching kthread and returning inode to inode cache · ae9d8f17
      Filipe Manana authored
      While the inode cache caching kthread is calling btrfs_unpin_free_ino(),
      we could have a concurrent call to btrfs_return_ino() that adds a new
      entry to the root's free space cache of pinned inodes. This concurrent
      call does not acquire the fs_info->commit_root_sem before adding a new
      entry if the caching state is BTRFS_CACHE_FINISHED, which is a problem
      because the caching kthread calls btrfs_unpin_free_ino() after setting
      the caching state to BTRFS_CACHE_FINISHED and therefore races with
      the task calling btrfs_return_ino(), which is adding a new entry, while
      the former (caching kthread) is navigating the cache's rbtree, removing
      and freeing nodes from the cache's rbtree without acquiring the spinlock
      that protects the rbtree.
      
      This race resulted in memory corruption due to double frees of struct
      btrfs_free_space objects, because both tasks can end up freeing the
      same objects. Note that adding a new entry can result in merging it with
      other entries in the cache, in which case those entries are freed.
      This is particularly important as btrfs_free_space structures are also
      used for the block group free space caches.
      
      This memory corruption can be detected by a debugging kernel, which
      reports it with the following trace:
      
      [132408.501148] slab error in verify_redzone_free(): cache `btrfs_free_space': double free detected
      [132408.505075] CPU: 15 PID: 12248 Comm: btrfs-ino-cache Tainted: G        W       4.1.0-rc5-btrfs-next-10+ #1
      [132408.505075] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
      [132408.505075]  ffff880023e7d320 ffff880163d73cd8 ffffffff8145eec7 ffffffff81095dce
      [132408.505075]  ffff880009735d40 ffff880163d73ce8 ffffffff81154e1e ffff880163d73d68
      [132408.505075]  ffffffff81155733 ffffffffa054a95a ffff8801b6099f00 ffffffffa0505b5f
      [132408.505075] Call Trace:
      [132408.505075]  [<ffffffff8145eec7>] dump_stack+0x4f/0x7b
      [132408.505075]  [<ffffffff81095dce>] ? console_unlock+0x356/0x3a2
      [132408.505075]  [<ffffffff81154e1e>] __slab_error.isra.28+0x25/0x36
      [132408.505075]  [<ffffffff81155733>] __cache_free+0xe2/0x4b6
      [132408.505075]  [<ffffffffa054a95a>] ? __btrfs_add_free_space+0x2f0/0x343 [btrfs]
      [132408.505075]  [<ffffffffa0505b5f>] ? btrfs_unpin_free_ino+0x8e/0x99 [btrfs]
      [132408.505075]  [<ffffffff810f3b30>] ? time_hardirqs_off+0x15/0x28
      [132408.505075]  [<ffffffff81084d42>] ? trace_hardirqs_off+0xd/0xf
      [132408.505075]  [<ffffffff811563a1>] ? kfree+0xb6/0x14e
      [132408.505075]  [<ffffffff811563d0>] kfree+0xe5/0x14e
      [132408.505075]  [<ffffffffa0505b5f>] btrfs_unpin_free_ino+0x8e/0x99 [btrfs]
      [132408.505075]  [<ffffffffa0505e08>] caching_kthread+0x29e/0x2d9 [btrfs]
      [132408.505075]  [<ffffffffa0505b6a>] ? btrfs_unpin_free_ino+0x99/0x99 [btrfs]
      [132408.505075]  [<ffffffff8106698f>] kthread+0xef/0xf7
      [132408.505075]  [<ffffffff810f3b08>] ? time_hardirqs_on+0x15/0x28
      [132408.505075]  [<ffffffff810668a0>] ? __kthread_parkme+0xad/0xad
      [132408.505075]  [<ffffffff814653d2>] ret_from_fork+0x42/0x70
      [132408.505075]  [<ffffffff810668a0>] ? __kthread_parkme+0xad/0xad
      [132408.505075] ffff880023e7d320: redzone 1:0x9f911029d74e35b, redzone 2:0x9f911029d74e35b.
      [132409.501654] slab: double free detected in cache 'btrfs_free_space', objp ffff880023e7d320
      [132409.503355] ------------[ cut here ]------------
      [132409.504241] kernel BUG at mm/slab.c:2571!
      
      Therefore fix this by having btrfs_unpin_free_ino() acquire the lock
      that protects the rbtree while doing the searches and removing entries.
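
      In other words, the rbtree walk has to happen under the same spinlock that
      btrfs_return_ino() takes before inserting entries. A rough sketch of that
      shape (lock and variable names are assumptions based on the text above, not
      a copy of the patch):

        spin_lock(&ctl->tree_lock);
        while ((n = rb_first(rbroot)) != NULL) {
                struct btrfs_free_space *info;

                info = rb_entry(n, struct btrfs_free_space, offset_index);
                rb_erase(n, rbroot);
                /* ... hand the range back via __btrfs_add_free_space() ... */
                kmem_cache_free(btrfs_free_space_cachep, info);
        }
        spin_unlock(&ctl->tree_lock);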
      
      Fixes: 1c70d8fb ("Btrfs: fix inode caching vs tree log")
      Cc: stable@vger.kernel.org
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      ae9d8f17
    • Btrfs: use kmem_cache_free when freeing entry in inode cache · c3f4a168
      Filipe Manana authored
      The free space entries are allocated using kmem_cache_zalloc(),
      through __btrfs_add_free_space(), therefore we should use
      kmem_cache_free() and not kfree() to avoid any confusion and
      any potential problem. Looking at the kfree() definition at
      mm/slab.c it has the following comment:
      
        /*
         * (...)
         *
         * Don't free memory not originally allocated by kmalloc()
         * or you will run into trouble.
         */
      
      So better be safe and use kmem_cache_free().
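
      For reference, the pairing in question looks roughly like this (illustrative
      snippet only; btrfs_free_space_cachep is the dedicated slab cache these
      entries come from):

        static int example_alloc_and_free(void)
        {
                struct btrfs_free_space *info;

                info = kmem_cache_zalloc(btrfs_free_space_cachep, GFP_NOFS);
                if (!info)
                        return -ENOMEM;
                /* ... use the entry ... */
                kmem_cache_free(btrfs_free_space_cachep, info);  /* not kfree(info) */
                return 0;
        }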
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
      c3f4a168
    • Btrfs: fix race between balance and unused block group deletion · 67c5e7d4
      Filipe Manana authored
      We have a race between deleting an unused block group and balancing the
      same block group that leads to an assertion failure/BUG(), producing the
      following trace:
      
      [181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
      [181631.220591] ------------[ cut here ]------------
      [181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
      [181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
      [181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G        W       4.1.0-rc5-btrfs-next-10+ #1
      [181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
      [181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
      [181631.224566] RIP: 0010:[<ffffffffa03f19f6>]  [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
      [181631.224566] RSP: 0018:ffff8800b5827ae8  EFLAGS: 00010246
      [181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
      [181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
      [181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
      [181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
      [181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
      [181631.224566] FS:  00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
      [181631.224566] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
      [181631.224566] Stack:
      [181631.224566]  ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
      [181631.224566]  ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
      [181631.224566]  ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
      [181631.224566] Call Trace:
      [181631.224566]  [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
      [181631.224566]  [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
      [181631.224566]  [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
      [181631.224566]  [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
      [181631.224566]  [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
      [181631.224566]  [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
      [181631.224566]  [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
      [181631.224566]  [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
      [181631.224566]  [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
      [181631.224566]  [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
      [181631.224566]  [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
      [181631.224566]  [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
      [181631.224566]  [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
      [181631.224566]  [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
      (...)
      
      The sequence of steps leading to this are:
      
                 CPU 0                                         CPU 1
      
        btrfs_balance()
          btrfs_relocate_chunk()
      
            btrfs_relocate_block_group(bg X)
              btrfs_lookup_block_group(bg X)
      
                                                     cleaner_kthread
                                                        locks fs_info->cleaner_mutex
      
                                                        btrfs_delete_unused_bgs()
                                                          finds bg X, which became
                                                          unused in the previous
                                                          transaction
      
                                                          checks bg X ->ro == 0,
                                                          so it proceeds
              sets bg X ->ro to 1
              (btrfs_set_block_group_ro(bg X))
      
              blocks on fs_info->cleaner_mutex
                                                          btrfs_remove_chunk(bg X)
                                                        unlocks fs_info->cleaner_mutex
      
              acquires fs_info->cleaner_mutex
              relocate_block_group()
                --> does nothing, no extents found in
                    the extent tree from bg X
              unlocks fs_info->cleaner_mutex
      
            btrfs_relocate_block_group(bg X) returns
      
          btrfs_remove_chunk(bg X)
             extent map not found
                --> ASSERT(0)
      
      Fix this by using a new mutex to make sure these 2 operations, block
      group relocation and removal, are serialized.
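
      Sketched out, both paths end up bracketing their block group lookup and
      removal with the same fs_info mutex (the mutex name below is illustrative,
      taken from memory of this area rather than quoted from the patch):

        /* balance/relocation path */
        mutex_lock(&fs_info->delete_unused_bgs_mutex);
        /* ... look up block group X, mark it read-only, relocate it ... */
        mutex_unlock(&fs_info->delete_unused_bgs_mutex);

        /* cleaner path, btrfs_delete_unused_bgs() */
        mutex_lock(&fs_info->delete_unused_bgs_mutex);
        if (block_group->ro) {
                /* relocation marked it read-only first; skip it */
                mutex_unlock(&fs_info->delete_unused_bgs_mutex);
                continue;
        }
        /* ... remove the now-unused block group and its chunk ... */
        mutex_unlock(&fs_info->delete_unused_bgs_mutex);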
      
      This issue is reproducible by running fstests generic/038 (which stresses
      chunk allocation and automatic removal of unused block groups) together
      with the following balance loop:
      
          while true; do btrfs balance start -dusage=0 <mountpoint> ; done
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      67c5e7d4
    • btrfs: add error handling for scrub_workers_get() · e82afc52
      Zhao Lei authored
      Although it is a rare case, we'd better free the previously allocated
      memory on error.
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      e82afc52
    • btrfs: cleanup noused initialization of dev in btrfs_end_bio() · 65f53338
      Zhao Lei authored
      It was introduced by:
       c404e0dc
       Btrfs: fix use-after-free in the finishing procedure of the device replace
      
      But it seems to have no relationship with that bug, so this patch reverts
      the code block as a cleanup.
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      65f53338
    • btrfs: qgroup: allow user to clear the limitation on qgroup · fe759907
      Yang Dongsheng authored
      Currently, we can only set a limitation on a qgroup, but we
      cannot clear it.
      
      This patch provides a way for the user to clear a limitation on a
      qgroup by passing a value of CLEAR_VALUE (-1) to the kernel.
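
      From user space this maps onto the qgroup limit ioctl; a rough sketch of
      clearing the referenced-bytes limit could look like the following (structure
      and flag names taken from the uapi header as I understand them, and qgroup
      0/5 is an arbitrary example - treat the details as an assumption rather than
      a verified recipe):

        #include <stdio.h>
        #include <fcntl.h>
        #include <sys/ioctl.h>
        #include <linux/btrfs.h>

        #ifndef BTRFS_QGROUP_LIMIT_MAX_RFER
        #define BTRFS_QGROUP_LIMIT_MAX_RFER (1ULL << 0)  /* mirror of the kernel flag */
        #endif

        /* Clear the max referenced-bytes limit of one qgroup. */
        static int clear_rfer_limit(int mnt_fd, __u64 qgroupid)
        {
                struct btrfs_ioctl_qgroup_limit_args args = { .qgroupid = qgroupid };

                args.lim.flags = BTRFS_QGROUP_LIMIT_MAX_RFER;
                args.lim.max_rfer = (__u64)-1;     /* CLEAR_VALUE: remove the limit */
                return ioctl(mnt_fd, BTRFS_IOC_QGROUP_LIMIT, &args);
        }

        int main(int argc, char **argv)
        {
                int fd;

                if (argc < 2) {
                        fprintf(stderr, "usage: %s <btrfs mountpoint>\n", argv[0]);
                        return 1;
                }
                fd = open(argv[1], O_RDONLY);
                if (fd < 0 || clear_rfer_limit(fd, 5ULL /* qgroup 0/5 */) < 0) {
                        perror("clear qgroup limit");
                        return 1;
                }
                return 0;
        }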
      Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      fe759907
  3. 24 Jun, 2015 1 commit
  4. 23 Jun, 2015 1 commit
  5. 22 Jun, 2015 1 commit
  6. 19 Jun, 2015 3 commits
  7. 12 Jun, 2015 2 commits
  8. 10 Jun, 2015 20 commits
    • btrfs: wait for delayed iputs on no space · 9a4e7276
      Zhao Lei authored
      btrfs will report no space left when we run the following write and
      delete file loop:
       # FILE_SIZE_M=[ 75% of fs space ]
       # DEV=[ some dev ]
       # MNT=[ some dir ]
       #
       # mkfs.btrfs -f "$DEV"
       # mount -o nodatacow "$DEV" "$MNT"
       # for ((i = 0; i < 100; i++)); do dd if=/dev/zero of="$MNT"/file0 bs=1M count="$FILE_SIZE_M"; rm -f "$MNT"/file0; done
       #
      
      Reason:
       iput() and evict() run after the pages are written to the block device.
       If that page writeback has not finished before the next write, the space
       of the removed file is not yet freed, which causes the bug above.
      
      Fix:
       We could add the "-o flushoncommit" mount option to avoid the bug, but
       that has a performance cost. Instead, wait for the on-the-fly writes
       only when no-space happens, which is what this patch does.
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      9a4e7276
    • btrfs: qgroup: Make snapshot accounting work with new extent-oriented qgroup. · d6726335
      Qu Wenruo authored
      
      Make snapshot accounting work with new extent-oriented mechanism by
      skipping given root in new/old_roots in create_pending_snapshot().
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      d6726335
    • btrfs: qgroup: Add the ability to skip given qgroup for old/new_roots. · 9086db86
      Qu Wenruo authored
      This is used by later qgroup fix patches for snapshots.
      
      Current snapshot accounting is done by btrfs_qgroup_inherit(), but the
      new extent-oriented quota mechanism will also account extents from
      btrfs_copy_root() and other snapshot operations, causing wrong results.
      
      So add this ability to handle snapshot accounting.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      9086db86
    • btrfs: ulist: Add ulist_del() function. · d4b80404
      Qu Wenruo authored
      This function will delete the unode with the given (val, aux) pair.
      With this patch, the seqnum used for debugging doesn't have any meaning
      anymore, so remove it.
      
      This is used by later patches to skip the snapshot root.
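
      Usage is symmetric with ulist_add(); a sketch of the intended pattern
      (signature as described above, kernel context assumed, error handling
      omitted):

        struct ulist *roots = ulist_alloc(GFP_NOFS);

        ulist_add(roots, root_objectid, 0, GFP_NOFS);
        /* ... */
        /* Later, drop exactly the (val, aux) pair again, e.g. to skip the
         * snapshotted root from the old/new_roots sets: */
        ulist_del(roots, root_objectid, 0);
        ulist_free(roots);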
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      d4b80404
    • btrfs: qgroup: Cleanup the old ref_node-oriented mechanism. · e69bcee3
      Qu Wenruo authored
      Goodbye, the old mechanism.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      e69bcee3
    • btrfs: qgroup: Switch self test to extent-oriented qgroup mechanism. · 442244c9
      Qu Wenruo authored
      Since the self-test transactions don't have delayed_ref_roots, use
      find_all_roots() and export btrfs_qgroup_account_extent() to simulate it.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      442244c9
    • btrfs: qgroup: Switch to new extent-oriented qgroup mechanism. · 0ed4792a
      Qu Wenruo authored
      Switch from the old ref_node-based qgroup mechanism to the extent-based
      one for normal operations.
      
      The new mechanism should hugely reduce the overhead of the btrfs quota
      system, and furthermore, the code and logic should be cleaner and easier
      to maintain.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      0ed4792a
    • btrfs: qgroup: Switch rescan to new mechanism. · 9d220c95
      Qu Wenruo authored
      Switch rescan to use the new extent-oriented mechanism.
      
      As rescan is also extent based, the new mechanism is a perfect match
      for it.
      
      With the re-designed internal functions, rescan is quite easy: just call
      btrfs_find_all_roots() and then btrfs_qgroup_account_one_extent().
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      9d220c95
    • btrfs: qgroup: Add new qgroup calculation function btrfs_qgroup_account_extents(). · 550d7a2e
      Qu Wenruo authored
      
      The new btrfs_qgroup_account_extents() function should be called in
      btrfs_commit_transaction(), and it will update all the qgroups according
      to delayed_ref_root->dirty_extent_root.
      
      The new function can handle both normal operation during
      commit_transaction() and rescan, in a unified way with clearer logic.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      550d7a2e
    • btrfs: backref: Add special time_seq == (u64)-1 case for btrfs_find_all_roots(). · 21633fc6
      Qu Wenruo authored
      
      Allow btrfs_find_all_roots() to skip the delayed_ref_head lock and tree
      lock when doing the tree search.
      
      This is important for the later qgroup implementation, which will call
      find_all_roots() after the fs trees are committed.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      21633fc6
    • btrfs: qgroup: Add new function to record old_roots. · 3b7d00f9
      Qu Wenruo authored
      Add the function btrfs_qgroup_prepare_account_extents() to get the
      old_roots, which are needed for qgroup accounting.
      
      We do it in commit_transaction(), before switch_roots(), and only search
      the commit_root, so it gives a quite accurate view of the previous
      transaction.
      
      With the old_roots from the previous transaction, we can do accurate
      accounting against the current transaction.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      3b7d00f9
    • btrfs: qgroup: Record possible quota-related extent for qgroup. · 3368d001
      Qu Wenruo authored
      Add a hook in add_delayed_ref_head() to record quota-related extents
      into the delayed_ref_root->dirty_extent_root rb-tree for later qgroup
      accounting.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      3368d001
    • btrfs: qgroup: Add function qgroup_update_counters(). · 823ae5b8
      Qu Wenruo authored
      Add the function qgroup_update_counters(), which will update the related
      qgroups' rfer/excl according to old/new_roots.
      
      This is one of the two core functions of the new qgroup implementation.
      
      It is based on btrfs_adjust_coutners() but with clearer logic and
      comments.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      823ae5b8
    • btrfs: qgroup: Add function qgroup_update_refcnt(). · d810ef2b
      Qu Wenruo authored
      This function is used to update the refcnt for qgroups, and is one of
      the two core functions used in the new qgroup implementation.
      
      It is based on the old update_old/new_refcnt, but provides unified
      logic and behavior.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      d810ef2b
    • btrfs: extent-tree: Use ref_node to replace unneeded parameters in __inc_extent_ref() and __free_extent() · c682f9b3
      Qu Wenruo authored
      
      __btrfs_inc_extent_ref() and __btrfs_free_extent() have already had too
      many parameters, but three of them can be extracted from
      btrfs_delayed_ref_node struct.
      
      So use the btrfs_delayed_ref_node struct as a single parameter to replace
      the bytenr/num_bytes/no_quota parameters.
      
      The real objective of this patch is to allow btrfs_qgroup_record_ref()
      get the delayed_ref_node in incoming qgroup patches.
      
      Other functions calling btrfs_qgroup_record_ref() are not affected since
      the rest will only add/sub exclusive extents, where node is not used.
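
      The shape of the change, heavily trimmed and only to illustrate the
      direction described above (argument lists are not the real ones):

        /* before: the values travel as separate arguments */
        ret = __btrfs_free_extent(trans, root, bytenr, num_bytes, no_quota /* , ... */);

        /* after: the delayed ref node carries them itself */
        ret = __btrfs_free_extent(trans, root, node /* , ... */);
        /* ... and inside the helper: */
        u64 bytenr    = node->bytenr;
        u64 num_bytes = node->num_bytes;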
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      c682f9b3
    • btrfs: qgroup: Cleanup open-coded old/new_refcnt update and read. · 9c542136
      Qu Wenruo authored
      Use inline functions to do such things, to improve readability.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Acked-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
      9c542136
    • btrfs: delayed-ref: Cleanup the unneeded functions. · c43d160f
      Qu Wenruo authored
      Clean up the rb_tree merge/insert/update functions, since we now use a
      list instead of an rb_tree.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      c43d160f
    • btrfs: delayed-ref: Use list to replace the ref_root in ref_head. · c6fc2454
      Qu Wenruo authored
      This patch replaces the rbtree used in ref_head with a list.
      This has the following advantage:
      1) Easier merge logic.
      With the new list implementation, we only need to care about merging the
      tail ref_node with the new ref_node, and this can be done easily at
      insert time; there is no need to do a dedicated merge at
      run_delayed_refs().
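
      The merge-at-insert idea can be sketched with the kernel list API (the
      struct and the can_merge()/free_ref() helpers here are made up for
      illustration, not the real btrfs names):

        static void insert_ref(struct list_head *refs, struct ref_node *new)
        {
                if (!list_empty(refs)) {
                        struct ref_node *tail;

                        tail = list_last_entry(refs, struct ref_node, list);
                        if (can_merge(tail, new)) {     /* hypothetical helper */
                                tail->ref_mod += new->ref_mod;
                                free_ref(new);          /* hypothetical helper */
                                return;
                        }
                }
                list_add_tail(&new->list, refs);
        }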
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      c6fc2454
    • btrfs: backref: Don't merge refs which are not for same block. · 00db646d
      Qu Wenruo authored
      The old __merge_refs() in backref.c would even merge refs whose root_ids
      are different, which makes qgroup give wrong results.
      
      Fix it by checking ref_for_same_block() before doing any mode-specific
      work.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      00db646d
    • btrfs: Fix lockdep warning of wr_ctx->wr_lock in scrub_free_wr_ctx() · 20b2e302
      Zhao Lei authored
      lockdep reports the following warning during testing:
       [25176.843958] =================================
       [25176.844519] [ INFO: inconsistent lock state ]
       [25176.845047] 4.1.0-rc3 #22 Tainted: G        W
       [25176.845591] ---------------------------------
       [25176.846153] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
       [25176.846713] fsstress/26661 [HC0[0]:SC1[1]:HE1:SE0] takes:
       [25176.847246]  (&wr_ctx->wr_lock){+.?...}, at: [<ffffffffa04cdc6d>] scrub_free_ctx+0x2d/0xf0 [btrfs]
       [25176.847838] {SOFTIRQ-ON-W} state was registered at:
       [25176.848396]   [<ffffffff810bf460>] __lock_acquire+0x6a0/0xe10
       [25176.848955]   [<ffffffff810bfd1e>] lock_acquire+0xce/0x2c0
       [25176.849491]   [<ffffffff816489af>] mutex_lock_nested+0x7f/0x410
       [25176.850029]   [<ffffffffa04d04ff>] scrub_stripe+0x4df/0x1080 [btrfs]
       [25176.850575]   [<ffffffffa04d11b1>] scrub_chunk.isra.19+0x111/0x130 [btrfs]
       [25176.851110]   [<ffffffffa04d144c>] scrub_enumerate_chunks+0x27c/0x510 [btrfs]
       [25176.851660]   [<ffffffffa04d3b87>] btrfs_scrub_dev+0x1c7/0x6c0 [btrfs]
       [25176.852189]   [<ffffffffa04e918e>] btrfs_dev_replace_start+0x36e/0x450 [btrfs]
       [25176.852771]   [<ffffffffa04a98e0>] btrfs_ioctl+0x1e10/0x2d20 [btrfs]
       [25176.853315]   [<ffffffff8121c5b8>] do_vfs_ioctl+0x318/0x570
       [25176.853868]   [<ffffffff8121c851>] SyS_ioctl+0x41/0x80
       [25176.854406]   [<ffffffff8164da17>] system_call_fastpath+0x12/0x6f
       [25176.854935] irq event stamp: 51506
       [25176.855511] hardirqs last  enabled at (51506): [<ffffffff810d4ce5>] vprintk_emit+0x225/0x5e0
       [25176.856059] hardirqs last disabled at (51505): [<ffffffff810d4b77>] vprintk_emit+0xb7/0x5e0
       [25176.856642] softirqs last  enabled at (50886): [<ffffffff81067a23>] __do_softirq+0x363/0x640
       [25176.857184] softirqs last disabled at (50949): [<ffffffff8106804d>] irq_exit+0x10d/0x120
       [25176.857746]
       other info that might help us debug this:
       [25176.858845]  Possible unsafe locking scenario:
       [25176.859981]        CPU0
       [25176.860537]        ----
       [25176.861059]   lock(&wr_ctx->wr_lock);
       [25176.861705]   <Interrupt>
       [25176.862272]     lock(&wr_ctx->wr_lock);
       [25176.862881]
        *** DEADLOCK ***
      
      Reason:
       Above warning is caused by:
       Interrupt
       -> bio_endio()
       -> ...
       -> scrub_put_ctx()
       -> scrub_free_ctx() *1
       -> ...
       -> mutex_lock(&wr_ctx->wr_lock);
      
       scrub_put_ctx() is allowed to be called from end_bio interrupt context,
       but by design it should never call scrub_free_ctx(sctx) in interrupt
       context (above *1), because btrfs_scrub_dev() holds one additional
       reference on sctx->refs, which makes scrub_free_ctx() only get called
       within btrfs_scrub_dev().
      
       However, the code does not run as we wish, because the free sequence in
       scrub_pending_bio_dec() has a gap.
      
       Current code:
       -----------------------------------+-----------------------------------
       scrub_pending_bio_dec()            |  btrfs_scrub_dev
       -----------------------------------+-----------------------------------
       atomic_dec(&sctx->bios_in_flight); |
       wake_up(&sctx->list_wait);         |
                                          | scrub_put_ctx()
                                          | -> atomic_dec_and_test(&sctx->refs)
       scrub_put_ctx(sctx);               |
       -> atomic_dec_and_test(&sctx->refs)|
       -> scrub_free_ctx()                |
       -----------------------------------+-----------------------------------
      
       We expected:
       -----------------------------------+-----------------------------------
       scrub_pending_bio_dec()            |  btrfs_scrub_dev
       -----------------------------------+-----------------------------------
       atomic_dec(&sctx->bios_in_flight); |
       wake_up(&sctx->list_wait);         |
       scrub_put_ctx(sctx);               |
       -> atomic_dec_and_test(&sctx->refs)|
                                          | scrub_put_ctx()
                                          | -> atomic_dec_and_test(&sctx->refs)
                                          | -> scrub_free_ctx()
       -----------------------------------+-----------------------------------
      
      Fix:
       Move scrub_pending_bio_dec() to a workqueue, to avoid running this
       function in interrupt context.
       Tested by checking the tracelog in a debug build.
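
      The general pattern relied on here (not the btrfs patch itself, just the
      standard workqueue deferral) looks like this:

        #include <linux/kernel.h>
        #include <linux/workqueue.h>
        #include <linux/slab.h>

        struct completion_ctx {
                struct work_struct work;
                /* ... whatever the completion handler needs ... */
        };

        /* Runs in process context: taking mutexes such as wr_ctx->wr_lock is fine here. */
        static void completion_worker(struct work_struct *work)
        {
                struct completion_ctx *ctx =
                        container_of(work, struct completion_ctx, work);

                /* ... final put / free ... */
                kfree(ctx);
        }

        /* Called from the bio end_io path, i.e. interrupt context. */
        static void on_bio_done(struct completion_ctx *ctx)
        {
                INIT_WORK(&ctx->work, completion_worker);
                schedule_work(&ctx->work);      /* defer; never sleep or take a mutex here */
        }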
      
      Changelog v1->v2:
       Use a workqueue instead of adjusting the function call sequence as in v1,
       because v1 would introduce a bug pointed out by:
       Filipe David Manana <fdmanana@gmail.com>
      Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      20b2e302