1. 31 Aug, 2015 5 commits
  2. 21 Aug, 2015 1 commit
    • Chris Mason's avatar
      btrfs: fix compile when block cgroups are not enabled · 3a9508b0
      Chris Mason authored
      bio->bi_css and bio->bi_ioc don't exist when block cgroups are not on.
      This adds an ifdef around them.  It's not perfect, but our
      use of bi_ioc is being removed in the 4.3 merge window.
      
      The bi_css usage really should go into bio_clone, but I want to make
      sure that doesn't introduce problems for other bio_clone use cases.
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      3a9508b0
  3. 19 Aug, 2015 6 commits
    • Filipe Manana's avatar
      Btrfs: fix file read corruption after extent cloning and fsync · b84b8390
      Filipe Manana authored
      If we partially clone one extent of a file into a lower offset of the
      file, fsync the file, power fail and then mount the fs to trigger log
      replay, we can get multiple checksum items in the csum tree that overlap
      each other and result in checksum lookup failures later. Those failures
      can make file data read requests assume a checksum value of 0, but they
      will not return an error (-EIO for example) to userspace exactly because
      the expected checksum value 0 is a special value that makes the read bio
      endio callback return success and set all the bytes of the corresponding
      page with the value 0x01 (at fs/btrfs/inode.c:__readpage_endio_check()).
      From a userspace perspective this is equivalent to file corruption
      because we are not returning what was written to the file.
      
      Details about how this can happen, and why, are included inline in the
      following reproducer test case for fstests and the comment added to
      tree-log.c.
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            _cleanup_flakey
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
        . ./common/dmflakey
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
        _require_dm_flakey
        _require_cloner
        _require_metadata_journaling $SCRATCH_DEV
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create our test file with a single 100K extent starting at file
        # offset 800K. We fsync the file here to make the fsync log tree gets
        # a single csum item that covers the whole 100K extent, which causes
        # the second fsync, done after the cloning operation below, to not
        # leave in the log tree two csum items covering two sub-ranges
        # ([0, 20K[ and [20K, 100K[)) of our extent.
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 800K 100K"  \
                        -c "fsync"                     \
                         $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Now clone part of our extent into file offset 400K. This adds a file
        # extent item to our inode's metadata that points to the 100K extent
        # we created before, using a data offset of 20K and a data length of
        # 20K, so that it refers to the sub-range [20K, 40K[ of our original
        # extent.
        $CLONER_PROG -s $((800 * 1024 + 20 * 1024)) -d $((400 * 1024)) \
            -l $((20 * 1024)) $SCRATCH_MNT/foo $SCRATCH_MNT/foo
      
        # Now fsync our file to make sure the extent cloning is durably
        # persisted. This fsync will not add a second csum item to the log
        # tree containing the checksums for the blocks in the sub-range
        # [20K, 40K[ of our extent, because there was already a csum item in
        # the log tree covering the whole extent, added by the first fsync
        # we did before.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
      
        echo "File digest before power failure:"
        md5sum $SCRATCH_MNT/foo | _filter_scratch
      
        # Silently drop all writes and ummount to simulate a crash/power
        # failure.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        # Allow writes again, mount to trigger log replay and validate file
        # contents.
        # The fsync log replay first processes the file extent item
        # corresponding to the file offset 400K (the one which refers to the
        # [20K, 40K[ sub-range of our 100K extent) and then processes the file
        # extent item for file offset 800K. It used to happen that when
        # processing the later, it erroneously left in the csum tree 2 csum
        # items that overlapped each other, 1 for the sub-range [20K, 40K[ and
        # 1 for the whole range of our extent. This introduced a problem where
        # subsequent lookups for the checksums of blocks within the range
        # [40K, 100K[ of our extent would not find anything because lookups in
        # the csum tree ended up looking only at the smaller csum item, the
        # one covering the subrange [20K, 40K[. This made read requests assume
        # an expected checksum with a value of 0 for those blocks, which caused
        # checksum verification failure when the read operations finished.
        # However those checksum failure did not result in read requests
        # returning an error to user space (like -EIO for e.g.) because the
        # expected checksum value had the special value 0, and in that case
        # btrfs set all bytes of the corresponding pages with the value 0x01
        # and produce the following warning in dmesg/syslog:
        #
        #  "BTRFS warning (device dm-0): csum failed ino 257 off 917504 csum\
        #   1322675045 expected csum 0"
        #
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        echo "File digest after log replay:"
        # Must match the same digest he had after cloning the extent and
        # before the power failure happened.
        md5sum $SCRATCH_MNT/foo | _filter_scratch
      
        _unmount_flakey
      
        status=0
        exit
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      b84b8390
    • Filipe Manana's avatar
      Btrfs: check if previous transaction aborted to avoid fs corruption · 1f9b8c8f
      Filipe Manana authored
      While we are committing a transaction, it's possible the previous one is
      still finishing its commit and therefore we wait for it to finish first.
      However we were not checking if that previous transaction ended up getting
      aborted after we waited for it to commit, so we ended up committing the
      current transaction which can lead to fs corruption because the new
      superblock can point to trees that have had one or more nodes/leafs that
      were never durably persisted.
      The following sequence diagram exemplifies how this is possible:
      
                CPU 0                                                        CPU 1
      
        transaction N starts
      
        (...)
      
        btrfs_commit_transaction(N)
      
          cur_trans->state = TRANS_STATE_COMMIT_START;
          (...)
          cur_trans->state = TRANS_STATE_COMMIT_DOING;
          (...)
      
          cur_trans->state = TRANS_STATE_UNBLOCKED;
          root->fs_info->running_transaction = NULL;
      
                                                                    btrfs_start_transaction()
                                                                       --> starts transaction N + 1
      
          btrfs_write_and_wait_transaction(trans, root);
            --> starts writing all new or COWed ebs created
                at transaction N
      
                                                                    creates some new ebs, COWs some
                                                                    existing ebs but doesn't COW or
                                                                    deletes eb X
      
                                                                    btrfs_commit_transaction(N + 1)
                                                                      (...)
                                                                      cur_trans->state = TRANS_STATE_COMMIT_START;
                                                                      (...)
                                                                      wait_for_commit(root, prev_trans);
                                                                        --> prev_trans == transaction N
      
          btrfs_write_and_wait_transaction() continues
          writing ebs
             --> fails writing eb X, we abort transaction N
                 and set bit BTRFS_FS_STATE_ERROR on
                 fs_info->fs_state, so no new transactions
                 can start after setting that bit
      
             cleanup_transaction()
               btrfs_cleanup_one_transaction()
                 wakes up task at CPU 1
      
                                                                      continues, doesn't abort because
                                                                      cur_trans->aborted (transaction N + 1)
                                                                      is zero, and no checks for bit
                                                                      BTRFS_FS_STATE_ERROR in fs_info->fs_state
                                                                      are made
      
                                                                      btrfs_write_and_wait_transaction(trans, root);
                                                                        --> succeeds, no errors during writeback
      
                                                                      write_ctree_super(trans, root, 0);
                                                                        --> succeeds
                                                                        --> we have now a superblock that points us
                                                                            to some root that uses eb X, which was
                                                                            never written to disk
      
      In this scenario future attempts to read eb X from disk results in an
      error message like "parent transid verify failed on X wanted Y found Z".
      
      So fix this by aborting the current transaction if after waiting for the
      previous transaction we verify that it was aborted.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      1f9b8c8f
    • Michal Hocko's avatar
      btrfs: use __GFP_NOFAIL in alloc_btrfs_bio · 277fb5fc
      Michal Hocko authored
      alloc_btrfs_bio relies on GFP_NOFS allocation when committing the
      transaction but this allocation context is rather weak wrt. reclaim
      capabilities. The page allocator currently tries hard to not fail these
      allocations if they are small (<=PAGE_ALLOC_COSTLY_ORDER) but it can
      still fail if the _current_ process is the OOM killer victim. Moreover
      there is an attempt to move away from the default no-fail behavior and
      allow these allocation to fail more eagerly. This would lead to:
      
      [   37.928625] kernel BUG at fs/btrfs/extent_io.c:4045
      
      which is clearly undesirable and the nofail behavior should be explicit
      if the allocation failure cannot be tolerated.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      277fb5fc
    • Michal Hocko's avatar
      btrfs: Prevent from early transaction abort · d1b5c567
      Michal Hocko authored
      Btrfs relies on GFP_NOFS allocation when committing the transaction but
      this allocation context is rather weak wrt. reclaim capabilities. The
      page allocator currently tries hard to not fail these allocations if
      they are small (<=PAGE_ALLOC_COSTLY_ORDER) so this is not a problem
      currently but there is an attempt to move away from the default no-fail
      behavior and allow these allocation to fail more eagerly. And this would
      lead to a pre-mature transaction abort as follows:
      
      [   55.328093] Call Trace:
      [   55.328890]  [<ffffffff8154e6f0>] dump_stack+0x4f/0x7b
      [   55.330518]  [<ffffffff8108fa28>] ? console_unlock+0x334/0x363
      [   55.332738]  [<ffffffff8110873e>] __alloc_pages_nodemask+0x81d/0x8d4
      [   55.334910]  [<ffffffff81100752>] pagecache_get_page+0x10e/0x20c
      [   55.336844]  [<ffffffffa007d916>] alloc_extent_buffer+0xd0/0x350 [btrfs]
      [   55.338973]  [<ffffffffa0059d8c>] btrfs_find_create_tree_block+0x15/0x17 [btrfs]
      [   55.341329]  [<ffffffffa004f728>] btrfs_alloc_tree_block+0x18c/0x405 [btrfs]
      [   55.343566]  [<ffffffffa003fa34>] split_leaf+0x1e4/0x6a6 [btrfs]
      [   55.345577]  [<ffffffffa0040567>] btrfs_search_slot+0x671/0x831 [btrfs]
      [   55.347679]  [<ffffffff810682d7>] ? get_parent_ip+0xe/0x3e
      [   55.349434]  [<ffffffffa0041cb2>] btrfs_insert_empty_items+0x5d/0xa8 [btrfs]
      [   55.351681]  [<ffffffffa004ecfb>] __btrfs_run_delayed_refs+0x7a6/0xf35 [btrfs]
      [   55.353979]  [<ffffffffa00512ea>] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
      [   55.356212]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
      [   55.358378]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
      [   55.360626]  [<ffffffffa0060221>] btrfs_commit_transaction+0x4c/0xaba [btrfs]
      [   55.362894]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
      [   55.365221]  [<ffffffffa0073428>] btrfs_sync_file+0x29c/0x310 [btrfs]
      [   55.367273]  [<ffffffff81186808>] vfs_fsync_range+0x8f/0x9e
      [   55.369047]  [<ffffffff81186833>] vfs_fsync+0x1c/0x1e
      [   55.370654]  [<ffffffff81186869>] do_fsync+0x34/0x4e
      [   55.372246]  [<ffffffff81186ab3>] SyS_fsync+0x10/0x14
      [   55.373851]  [<ffffffff81554f97>] system_call_fastpath+0x12/0x6f
      [   55.381070] BTRFS: error (device hdb1) in btrfs_run_delayed_refs:2821: errno=-12 Out of memory
      [   55.382431] BTRFS warning (device hdb1): Skipping commit of aborted transaction.
      [   55.382433] BTRFS warning (device hdb1): cleanup_transaction:1692: Aborting unused transaction(IO failure).
      [   55.384280] ------------[ cut here ]------------
      [   55.384312] WARNING: CPU: 0 PID: 3010 at fs/btrfs/delayed-ref.c:438 btrfs_select_ref_head+0xd9/0xfe [btrfs]()
      [...]
      [   55.384337] Call Trace:
      [   55.384353]  [<ffffffff8154e6f0>] dump_stack+0x4f/0x7b
      [   55.384357]  [<ffffffff8107f717>] ? down_trylock+0x2d/0x37
      [   55.384359]  [<ffffffff81046977>] warn_slowpath_common+0xa1/0xbb
      [   55.384398]  [<ffffffffa00a1d6b>] ? btrfs_select_ref_head+0xd9/0xfe [btrfs]
      [   55.384400]  [<ffffffff81046a34>] warn_slowpath_null+0x1a/0x1c
      [   55.384423]  [<ffffffffa00a1d6b>] btrfs_select_ref_head+0xd9/0xfe [btrfs]
      [   55.384446]  [<ffffffffa004e5f7>] ? __btrfs_run_delayed_refs+0xa2/0xf35 [btrfs]
      [   55.384455]  [<ffffffffa004e600>] __btrfs_run_delayed_refs+0xab/0xf35 [btrfs]
      [   55.384476]  [<ffffffffa00512ea>] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
      [   55.384499]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
      [   55.384521]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
      [   55.384543]  [<ffffffffa0060221>] btrfs_commit_transaction+0x4c/0xaba [btrfs]
      [   55.384565]  [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs]
      [   55.384588]  [<ffffffffa0073428>] btrfs_sync_file+0x29c/0x310 [btrfs]
      [   55.384591]  [<ffffffff81186808>] vfs_fsync_range+0x8f/0x9e
      [   55.384592]  [<ffffffff81186833>] vfs_fsync+0x1c/0x1e
      [   55.384593]  [<ffffffff81186869>] do_fsync+0x34/0x4e
      [   55.384594]  [<ffffffff81186ab3>] SyS_fsync+0x10/0x14
      [   55.384595]  [<ffffffff81554f97>] system_call_fastpath+0x12/0x6f
      [...]
      [   55.384608] ---[ end trace c29799da1d4dd621 ]---
      [   55.437323] BTRFS info (device hdb1): forced readonly
      [   55.438815] BTRFS info (device hdb1): delayed_refs has NO entry
      
      Fix this by being explicit about the no-fail behavior of this allocation
      path and use __GFP_NOFAIL.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      d1b5c567
    • Zhaolei's avatar
      btrfs: Remove unused arguments in tree-log.c · 60d53eb3
      Zhaolei authored
      Following arguments are not used in tree-log.c:
       insert_one_name(): path, type
       wait_log_commit(): trans
       wait_for_writer(): trans
      
      This patch remove them.
      Signed-off-by: default avatarZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      60d53eb3
    • Zhaolei's avatar
      btrfs: Remove useless condition in start_log_trans() · 34eb2a52
      Zhaolei authored
      Dan Carpenter <dan.carpenter@oracle.com> reported a smatch warning
      for start_log_trans():
       fs/btrfs/tree-log.c:178 start_log_trans()
       warn: we tested 'root->log_root' before and it was 'false'
      
       fs/btrfs/tree-log.c
       147          if (root->log_root) {
       We test "root->log_root" here.
       ...
      
      Reason:
       Condition of:
       fs/btrfs/tree-log.c:178: if (!root->log_root) {
       is not necessary after commit: 7237f183
      
       It caused a smatch warning, and no functionally error.
      
      Fix:
       Deleting above condition will make smatch shut up,
       but a better way is to do cleanup for start_log_trans()
       to remove duplicated code and make code more readable.
      Reported-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      34eb2a52
  4. 09 Aug, 2015 28 commits