1. 12 Nov, 2013 40 commits
    • Stefan Behrens's avatar
      Btrfs: Don't allocate inode that is already in use · ff76b056
      Stefan Behrens authored
      Due to an off-by-one error, it is possible to reproduce a bug
      when the inode cache is used.
      
      The same inode number is assigned twice, the second time this
      leads to an EEXIST in btrfs_insert_empty_items().
      
      The issue can happen when a file is removed right after a subvolume
      is created and then a new inode number is created before the
      inodes in free_inode_pinned are processed.
      unlink() calls btrfs_return_ino() which calls start_caching() in this
      case which adds [highest_ino + 1, BTRFS_LAST_FREE_OBJECTID] by
      searching for the highest inode (which already cannot find the
      unlinked one anymore in btrfs_find_free_objectid()). So if this
      unlinked inode's number is equal to the highest_ino + 1 (or >= this value
      instead of > this value which was the off-by-one error), we mustn't add
      the inode number to free_ino_pinned (caching_thread() does it right).
      In this case we need to try directly to add the number to the inode_cache
      which will fail in this case.
      
      When this inode number is allocated while it is still in free_ino_pinned,
      it is allocated and still added to the free inode cache when the
      pinned inodes are processed, thus one of the following inode number
      allocations will get an inode that is already in use and fail with EEXIST
      in btrfs_insert_empty_items().
      
      One example which was created with the reproducer below:
      Create a snapshot, work in the newly created snapshot for the rest.
      In unlink(inode 34284) call btrfs_return_ino() which calls start_caching().
      start_caching() calls add_free_space [34284, 18446744073709517077].
      In btrfs_return_ino(), call start_caching pinned [34284, 1] which is wrong.
      mkdir() call btrfs_find_ino_for_alloc() which returns the number 34284.
      btrfs_unpin_free_ino calls add_free_space [34284, 1].
      mkdir() call btrfs_find_ino_for_alloc() which returns the number 34284.
      EEXIST when the new inode is inserted.
      
      One possible reproducer is this one:
       #!/bin/sh
       # preparation
      TEST_DEV=/dev/sdc1
      TEST_MNT=/mnt
      umount ${TEST_MNT} 2>/dev/null || true
      mkfs.btrfs -f ${TEST_DEV}
      mount ${TEST_DEV} ${TEST_MNT} -o \
       rw,relatime,compress=lzo,space_cache,inode_cache
      btrfs subv create ${TEST_MNT}/s1
      for i in `seq 34027`; do touch ${TEST_MNT}/s1/${i}; done
      btrfs subv snap ${TEST_MNT}/s1 ${TEST_MNT}/s2
      FILENAME=`find ${TEST_MNT}/s1/ -inum 4085 | sed 's|^.*/\([^/]*\)$|\1|'`
      rm ${TEST_MNT}/s2/$FILENAME
      touch ${TEST_MNT}/s2/$FILENAME
       # the following steps can be repeated to reproduce the issue again and again
      [ -e ${TEST_MNT}/s3 ] && btrfs subv del ${TEST_MNT}/s3
      btrfs subv snap ${TEST_MNT}/s2 ${TEST_MNT}/s3
      rm ${TEST_MNT}/s3/$FILENAME
      touch ${TEST_MNT}/s3/$FILENAME
      ls -alFi ${TEST_MNT}/s?/$FILENAME
      touch ${TEST_MNT}/s3/_1 || logger FAILED
      ls -alFi ${TEST_MNT}/s?/_1
      touch ${TEST_MNT}/s3/_2 || logger FAILED
      ls -alFi ${TEST_MNT}/s?/_2
      touch ${TEST_MNT}/s3/__1 || logger FAILED
      ls -alFi ${TEST_MNT}/s?/__1
      touch ${TEST_MNT}/s3/__2 || logger FAILED
      ls -alFi ${TEST_MNT}/s?/__2
       # if the above is not enough, add the following loop:
      for i in `seq 3 9`; do touch ${TEST_MNT}/s3/__${i} || logger FAILED; done
       #for i in `seq 3 34027`; do touch ${TEST_MNT}/s3/__${i} || logger FAILED; done
       # one of the touch(1) calls in s3 fail due to EEXIST because the inode is
       # already in use that btrfs_find_ino_for_alloc() returns.
      Signed-off-by: default avatarStefan Behrens <sbehrens@giantdisaster.de>
      Reviewed-by: default avatarJan Schmidt <list.btrfs@jan-o-sch.net>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      ff76b056
    • Filipe David Borba Manana's avatar
      Btrfs: fix btrfs_prev_leaf() previous key computation · e8b0d724
      Filipe David Borba Manana authored
      If we decrement the key type, we must reset its offset to the largest
      possible offset (u64)-1. If we decrement the key's objectid, then we
      must reset the key's type and offset to their largest possible values,
      (u8)-1 and (u64)-1 respectively. Not doing so can make us miss an
      items in the tree.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      e8b0d724
    • Filipe David Borba Manana's avatar
      Btrfs: optimize tree-log.c:count_inode_refs() · e93ae26f
      Filipe David Borba Manana authored
      Avoid repeated tree searches by processing all inode ref items in
      a leaf at once instead of processing one at a time, followed by a
      path release and a tree search for a key with a decremented offset.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      e93ae26f
    • Geyslan G. Bem's avatar
      btrfs: simplify kmalloc+copy_from_user to memdup_user · 229eed43
      Geyslan G. Bem authored
      Use memdup_user rather than duplicating its implementation
      This is a little bit restricted to reduce false positives
      
      The semantic patch that makes this report is available
      in scripts/coccinelle/api/memdup_user.cocci.
      
      More information about semantic patching is available at
      http://coccinelle.lip6.fr/Signed-off-by: default avatarGeyslan G. Bem <geyslan@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      229eed43
    • chandan's avatar
      Btrfs: btrfs_add_ordered_operation: Fix last modified transaction comparison. · 5ede859b
      chandan authored
      Comparison of an inode's last modified transaction with the last committed
      transaction is incorrect. Fix it.
      Signed-off-by: default avatarchandan <chandan@linux.vnet.ibm.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      5ede859b
    • Filipe David Borba Manana's avatar
      Btrfs: don't leak delayed node on path allocation failure · 3c77bd94
      Filipe David Borba Manana authored
      If the path allocation failed, we would return without decrementing
      the reference count in the delayed node we got before, resulting
      in a leak.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      3c77bd94
    • Stefan Behrens's avatar
      Btrfs: Wait for uuid-tree rebuild task on remount read-only · 361c093d
      Stefan Behrens authored
      If the user remounts the filesystem read-only while the uuid-tree
      scan and rebuild task is still running (this happens once after the
      filesystem was mounted with an old kernel, or when forced with the
      mount options), the remount should wait on the tasks completion
      before setting the filesystem read-only. Otherwise the background
      task continues to write to the filesystem which is apparently not
      what users expect.
      
      The reproducer:
      
      TEST_DEV=/dev/sdzzzzz1
      TEST_MNT=/mnt
      mkfs.btrfs -f $TEST_DEV
      mount $TEST_DEV $TEST_MNT
      for i in `seq 50000`; do btrfs subvolume create ${TEST_MNT}/$i; done
      umount $TEST_MNT
      mount $TEST_DEV $TEST_MNT -o rescan_uuid_tree
      sleep 1
      ps -elf | fgrep '[btrfs-uuid]' | grep -v grep
      mount $TEST_DEV $TEST_MNT -o ro,remount
      ps -elf | fgrep '[btrfs-uuid]' | grep -v grep
      sleep 1
      umount $TEST_MNT
      Signed-off-by: default avatarStefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      361c093d
    • Stefan Behrens's avatar
      Btrfs: init device stats for new devices · 27087f37
      Stefan Behrens authored
      Device stats are only initialized (read from tree items) on mount.
      Trying to read device stats after adding or replacing new devices will
      return errors.
      
      btrfs_init_new_device() and btrfs_init_dev_replace_tgtdev() are the two
      functions that allocate and initialize new btrfs_device structures after
      a filesystem is mounted. They set the device stats to zero by using
      kzalloc() which is correct for new devices. The only missing thing was
      to declare these stats as being valid (device->dev_stats_valid = 1) and
      this patch adds this missing code.
      
      This is the reproducer:
      
      TEST_DEV1=/dev/sdzzzzz1
      TEST_DEV2=/dev/sdzzzzz2
      TEST_DEV3=/dev/sdzzzzz3
      TEST_MNT=/mnt
      mkfs.btrfs $TEST_DEV1
      mount $TEST_DEV1 $TEST_MNT
      btrfs device add $TEST_DEV2 $TEST_MNT
      btrfs device stat $TEST_MNT
      btrfs replace start -B $TEST_DEV2 $TEST_DEV3 $TEST_MNT
      btrfs device stat $TEST_MNT
      umount $TEST_MNT
      Reported-by: default avatarOndrej Kunc <kunc88@gmail.com>
      Signed-off-by: default avatarStefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      27087f37
    • Liu Bo's avatar
      Btrfs: fixup error path in __btrfs_inc_extent_ref · 30d133fc
      Liu Bo authored
      When we fail to add a reference after a non-inline insertion by some reasons,
      eg. ENOSPC, we'll abort the transaction, but we don't return this error to
      the caller who has to walk around again to find something wrong, that's
      unnecessary.
      
      Also fixup other error paths to keep it simple.
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      30d133fc
    • Ilya Dryomov's avatar
      Btrfs: disallow 'btrfs {balance,replace} cancel' on ro mounts · e649e587
      Ilya Dryomov authored
      For both balance and replace, cancelling involves changing the on-disk
      state and committing a transaction, which is not a good thing to do on
      read-only filesystems.
      
      Cc: Stefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: default avatarIlya Dryomov <idryomov@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      e649e587
    • Ilya Dryomov's avatar
      Btrfs: don't leak ioctl args in btrfs_ioctl_dev_replace · adfa97cb
      Ilya Dryomov authored
      struct btrfs_ioctl_dev_replace_args memory is leaked if replace is
      requested on a read-only filesystem.  Fix it.
      Signed-off-by: default avatarIlya Dryomov <idryomov@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      adfa97cb
    • Ilya Dryomov's avatar
      Btrfs: nuke a bogus rw_devices decrement in __btrfs_close_devices · f747cab7
      Ilya Dryomov authored
      On mount failures, __btrfs_close_devices can be called well before
      dev-replace state is read and ->is_tgtdev_for_dev_replace is set.  This
      leads to a bogus decrement of ->rw_devices and sets off a WARN_ON in
      __btrfs_close_devices if replace target device happens to be on the
      lists and we fail early in the mount sequence.  Fix this by checking
      the devid instead of ->is_tgtdev_for_dev_replace before the decrement:
      for replace targets devid is always equal to BTRFS_DEV_REPLACE_DEVID.
      
      Cc: Stefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: default avatarIlya Dryomov <idryomov@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      f747cab7
    • Geyslan G. Bem's avatar
      btrfs: Fix memory leakage in the tree-log.c · 03b2f08b
      Geyslan G. Bem authored
      In add_inode_ref() function:
      
      Initializes local pointers.
      
      Reduces the logical condition with the __add_inode_ref() return
      value by using only one 'goto out'.
      
      Centralizes the exiting, ensuring the freeing of all used memory.
      Signed-off-by: default avatarGeyslan G. Bem <geyslan@gmail.com>
      Reviewed-by: default avatarStefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      03b2f08b
    • Liu Bo's avatar
      Btrfs: kill unused code in btrfs_search_forward · 498456d3
      Liu Bo authored
      After commit de78b51a
      (btrfs: remove cache only arguments from defrag path), @blockptr is no more
      used.
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      498456d3
    • Liu Bo's avatar
      Btrfs: cleanup dead code of defragment · 8319bfe1
      Liu Bo authored
      @is_extent is no more needed since we don't defrag extent root.
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      8319bfe1
    • Filipe David Borba Manana's avatar
      Btrfs: remove unnecessary key copy when logging inode · efd0c405
      Filipe David Borba Manana authored
      The btrfs_insert_empty_item() function doesn't modify its
      key argument.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Reviewed-by: default avatarZach Brown <zab@redhat.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      efd0c405
    • Chandra Seetharaman's avatar
      Btrfs: Simplify the logic in alloc_extent_buffer() for existing extent buffer case · 452c75c3
      Chandra Seetharaman authored
      alloc_extent_buffer() uses radix_tree_lookup() when radix_tree_insert()
      fails with EEXIST. That part of the code is very similar to the code in
      find_extent_buffer(). This patch replaces radix_tree_lookup() and
      surrounding code in alloc_extent_buffer() with find_extent_buffer().
      
      Note that radix_tree_lookup() does not need to be protected by
      tree->buffer_lock. It is protected by eb->refs.
      
      While at it, this patch
        - changes the other usage of radix_tree_lookup() in alloc_extent_buffer()
          with find_extent_buffer() to reduce redundancy.
        - removes the unused argument 'len' to find_extent_buffer().
      Signed-Off-by: default avatarChandra Seetharaman <sekharan@us.ibm.com>
      Reviewed-by: default avatarZach Brown <zab@redhat.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      452c75c3
    • Josef Bacik's avatar
      Btrfs: fix up seek_hole/seek_data handling · 7f4ca37c
      Josef Bacik authored
      Whoever wrote this was braindead.  Also it doesn't work right if you have
      VACANCY's since we assumed you would only have that at the end of the file,
      which won't be the case in the near future.  I tested this with generic/285 and
      generic/286 as well as the btrfs tests that use fssum since it uses
      seek_hole/seek_data to verify things are ok.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      7f4ca37c
    • Josef Bacik's avatar
      Btrfs: add an assert to btrfs_lookup_csums_range for alignment · 4277a9c3
      Josef Bacik authored
      I was hitting weird issues when trying to remove hole extents and it turned out
      it was because I was sending non-aligned offsets down to
      btrfs_lookup_csums_range.  So add an assert for this in case somebody trips over
      this in the future.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      4277a9c3
    • Josef Bacik's avatar
      Btrfs: fix hole check in log_one_extent · ed9e8af8
      Josef Bacik authored
      I added an assert to make sure we were looking up aligned offsets for csums and
      I tripped it when running xfstests.  This is because log_one_extent was checking
      if block_start == 0 for a hole instead of EXTENT_MAP_HOLE.  This worked out fine
      in practice it seems, but it adds a lot of extra work that is uneeded.  With
      this fix I'm no longer tripping my assert.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      ed9e8af8
    • Josef Bacik's avatar
      Btrfs: add a sanity test for a vacant extent at the front of a file · 0e30db86
      Josef Bacik authored
      Btrfs_get_extent was not handling this case properly, add a test to make sure we
      don't regress.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      0e30db86
    • Josef Bacik's avatar
      Btrfs: handle a missing extent for the first file extent · 25a50341
      Josef Bacik authored
      While trying to kill our hole extents I noticed I was seeing problems where we
      seek into a file and then start writing and then try to fiemap that file later.
      This is because we search for offset 0, don't find anything and so back up one
      slot, which puts us at the inode ref or something like that, which means we goto
      not_found and create an extent map for our entire search area.  This isn't quite
      what we want, we want to move forward one slot and see if there is an extent
      there so we can limit our hole extent.  This patch fixes this problem, I will
      add a testcase for this as well.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      25a50341
    • Josef Bacik's avatar
      Btrfs: stop all workers after we free block groups · 96192499
      Josef Bacik authored
      Stefan was hitting a panic in the async worker stuff because we had outstanding
      read bios while we were stopping the worker threads.  You could reproduce this
      easily if you mount -o nospace_cache and ran generic/273.  This is because the
      caching thread stuff is still going and we were stopping all the worker threads.
      We need to stop the workers after this work is done, and the free block groups
      code will wait for all the caching threads to stop first so we don't run into
      this problem.  With this patch we no longer panic.  Thanks,
      
      Cc: stable@vger.kernel.org
      Reported-by: default avatarStefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      96192499
    • Josef Bacik's avatar
      Btrfs: add tests for btrfs_get_extent · aaedb55b
      Josef Bacik authored
      I'm going to be removing hole extents in the near future so I wanted to make a
      sanity test for btrfs_get_extent to make sure I don't break anything in the
      meantime.  This patch just puts btrfs_get_extent through its paces by giving it
      a completely unreasonable mapping to look at and make sure it is giving us back
      maps that make sense.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      aaedb55b
    • Josef Bacik's avatar
      Btrfs: add tests for find_lock_delalloc_range · 294e30fe
      Josef Bacik authored
      So both Liu and I made huge messes of find_lock_delalloc_range trying to fix
      stuff, me first by fixing extent size, then him by fixing something I broke and
      then me again telling him to fix it a different way.  So this is obviously a
      candidate for some testing.  This patch adds a pseudo fs so we can allocate fake
      inodes for tests that need an inode or pages.  Then it addes a bunch of tests to
      make sure find_lock_delalloc_range is acting the way it is supposed to.  With
      this patch and all of our previous patches to find_lock_delalloc_range I am sure
      it is working as expected now.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      294e30fe
    • Josef Bacik's avatar
      Btrfs: free reserved space on error in a few places · 857cc2fc
      Josef Bacik authored
      While trying to track down a reserved space leak I noticed a few places where we
      won't properly clean up reserved space if we have an error, this patch fixes
      those up.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      857cc2fc
    • Josef Bacik's avatar
      Btrfs: fixup reserved trace points · 0be5dc67
      Josef Bacik authored
      In trying to track down where we were leaking reserved space I noticed our
      reserve extent tracepoints are a little off.  First we were saying that the
      reserved space had been alloced in btrfs_reserve_extent, which isn't the case,
      this needs to be triggered when we actually allocate the space when we run the
      delayed ref.  We were also missing a few places where we should have been
      tracing the btrfs_reserve_extent_free tracepoint.  With these in place I was
      able to put together where we were leaking reserved space.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      0be5dc67
    • Josef Bacik's avatar
      Btrfs: free up block groups after everything · 2b1360da
      Josef Bacik authored
      If we abort a transaction we will do the tree log cleanup at unmount, but this
      happens after we free up the block groups.  This makes all the leak detection
      warnings go off because we think we've leaked space but in reality we just
      haven't cleaned it up yet.  So instead do the block group cleanup stuff after
      free'ing the fs roots so we don't get these warnings.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      2b1360da
    • Josef Bacik's avatar
      Btrfs: cleanup reserved space when freeing tree log on error · 681ae509
      Josef Bacik authored
      On error we will wait and free the tree log at unmount without a transaction.
      This means that the actual freeing of the blocks doesn't happen which means we
      complain about space leaks on unmount.  So to fix this just skip the transaction
      specific cleanup part of the tree log free'ing if we don't have a transaction
      and that way we can free up our reserved space and our counters stay happy.
      Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      681ae509
    • Josef Bacik's avatar
      Btrfs: do not free the dirty bytes from the trans block rsv on cleanup · eb58bb37
      Josef Bacik authored
      The transactions should be cleaning up their reservations on failure, this just
      causes us to have warnings on unmount because we go negative by free'ing
      reservations that have already been free'ed.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      eb58bb37
    • Filipe David Borba Manana's avatar
      Btrfs: fix memory leaks on transaction commit failure · 80d94fb3
      Filipe David Borba Manana authored
      Structures of the types tree_mod_elem and qgroup_update are allocated
      during transaction commit but were not being released if the call to
      btrfs_run_delayed_items() returned an error.
      
      Stack trace reported by kmemleak:
      
      unreferenced object 0xffff880679f0b398 (size 128):
        comm "umount", pid 21508, jiffies 4295967793 (age 36718.112s)
        hex dump (first 32 bytes):
          60 b5 f0 79 06 88 ff ff 00 00 00 00 00 00 00 00  `..y............
          00 00 00 00 00 00 00 00 50 1c 00 00 00 00 00 00  ........P.......
        backtrace:
          [<ffffffff81742d26>] kmemleak_alloc+0x26/0x50
          [<ffffffff811889c2>] kmem_cache_alloc_trace+0x112/0x200
          [<ffffffffa046f2d3>] tree_mod_log_insert_key.constprop.45+0x93/0x150 [btrfs]
          [<ffffffffa04720f9>] __btrfs_cow_block+0x299/0x4f0 [btrfs]
          [<ffffffffa0472510>] btrfs_cow_block+0x120/0x1f0 [btrfs]
          [<ffffffffa0476679>] btrfs_search_slot+0x449/0x930 [btrfs]
          [<ffffffffa048eecf>] btrfs_lookup_inode+0x2f/0xa0 [btrfs]
          [<ffffffffa04eb49c>] __btrfs_update_delayed_inode+0x1c/0x1d0 [btrfs]
          [<ffffffffa04eb9e2>] __btrfs_run_delayed_items+0x162/0x1e0 [btrfs]
          [<ffffffffa04eba63>] btrfs_delayed_inode_exit+0x3/0x20 [btrfs]
          [<ffffffffa0499c63>] btrfs_commit_transaction+0x203/0xa50 [btrfs]
          [<ffffffffa046b519>] btrfs_sync_fs+0x69/0x110 [btrfs]
          [<ffffffff811cb210>] __sync_filesystem+0x30/0x60
          [<ffffffff811cb2bb>] sync_filesystem+0x4b/0x70
          [<ffffffff8119ce7b>] generic_shutdown_super+0x3b/0xf0
          [<ffffffff8119cfc6>] kill_anon_super+0x16/0x30
      unreferenced object 0xffff880677e0dd88 (size 32):
        comm "umount", pid 21508, jiffies 4295967793 (age 36718.112s)
        hex dump (first 32 bytes):
          78 75 11 a9 06 88 ff ff 00 c0 e0 77 06 88 ff ff  xu.........w....
          40 c3 a2 70 06 88 ff ff 00 00 00 00 00 00 00 00  @..p............
        backtrace:
          [<ffffffff81742d26>] kmemleak_alloc+0x26/0x50
          [<ffffffff811889c2>] kmem_cache_alloc_trace+0x112/0x200
          [<ffffffffa04fa54f>] btrfs_qgroup_record_ref+0xf/0x90 [btrfs]
          [<ffffffffa04e1914>] btrfs_add_delayed_tree_ref+0xf4/0x170 [btrfs]
          [<ffffffffa048518a>] btrfs_free_tree_block+0x9a/0x220 [btrfs]
          [<ffffffffa0472163>] __btrfs_cow_block+0x303/0x4f0 [btrfs]
          [<ffffffffa0472510>] btrfs_cow_block+0x120/0x1f0 [btrfs]
          [<ffffffffa0476679>] btrfs_search_slot+0x449/0x930 [btrfs]
          [<ffffffffa048eecf>] btrfs_lookup_inode+0x2f/0xa0 [btrfs]
          [<ffffffffa04eb49c>] __btrfs_update_delayed_inode+0x1c/0x1d0 [btrfs]
          [<ffffffffa04eb9e2>] __btrfs_run_delayed_items+0x162/0x1e0 [btrfs]
          [<ffffffffa04eba63>] btrfs_delayed_inode_exit+0x3/0x20 [btrfs]
          [<ffffffffa0499c63>] btrfs_commit_transaction+0x203/0xa50 [btrfs]
          [<ffffffffa046b519>] btrfs_sync_fs+0x69/0x110 [btrfs]
          [<ffffffff811cb210>] __sync_filesystem+0x30/0x60
          [<ffffffff811cb2bb>] sync_filesystem+0x4b/0x70
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      80d94fb3
    • Ilya Dryomov's avatar
      Btrfs: fix the dev-replace suspend sequence · 539f358a
      Ilya Dryomov authored
      Replace progresses strictly from lower to higher offsets, and the
      progress is tracked in chunks, by storing the physical offset of the
      dev_extent which is being copied in the cursor_left field of
      btrfs_dev_replace_item.  When we are done copying the chunk,
      left_cursor is updated to point one byte past the dev_extent, so that
      on resume we can skip the dev_extents that have already been copied.
      
      There is a major bug (which goes all the way back to the inception of
      dev-replace in 3.8) in the way left_cursor is bumped: the bump is done
      unconditionally, without any regard to the scrub_chunk return value.
      On suspend (and also on any kind of error) scrub_chunk returns early,
      i.e. without completing the copy.  This leads to us skipping the chunk
      that hasn't been fully copied yet when resuming.
      
      Fix this by doing the cursor_left update only if scrub_chunk ret is 0.
      (On suspend scrub_chunk returns with -ECANCELED, so this fix covers
      both suspend and error cases.)
      
      Cc: Stefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: default avatarIlya Dryomov <idryomov@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      539f358a
    • Filipe David Borba Manana's avatar
      Btrfs: improve inode hash function/inode lookup · 778ba82b
      Filipe David Borba Manana authored
      Currently the hash value used for adding an inode to the VFS's inode
      hash table consists of the plain inode number, which is a 64 bits
      integer. This results in hash table buckets (hlist_head lists) with
      too many elements for at least 2 important scenarios:
      
      1) When we have many subvolumes. Each subvolume has its own btree
         where its files and directories are added to, and each has its
         own objectid (inode number) namespace. This means that if we have
         N subvolumes, and all have inode number X associated to a file or
         directory, the corresponding inodes all map to the same hash table
         entry, resulting in a bucket (hlist_head list) with N elements;
      
      2) On 32 bits machines. Th VFS hash values are unsigned longs, which
         are 32 bits wide on 32 bits machines, and the inode (objectid)
         numbers are 64 bits unsigned integers. We simply cast the inode
         numbers to hash values, which means that for all inodes with the
         same 32 bits lower half, the same hash bucket is used for all of
         them. For example, all inodes with a number (objectid) between
         0x0000_0000_ffff_ffff and 0xffff_ffff_ffff_ffff will end up in
         the same hash table bucket.
      
      This change ensures the inode's hash value depends both on the
      objectid (inode number) and its subvolume's (btree root) objectid.
      For 32 bits machines, this change gives better entropy by making
      the hash value depend on both the upper and lower 32 bits of the
      64 bits hash previously computed.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      778ba82b
    • Filipe David Borba Manana's avatar
      Btrfs: remove unnecessary tree search when logging inode · 3d41d702
      Filipe David Borba Manana authored
      In tree-log.c:btrfs_log_inode(), we keep calling btrfs_search_forward()
      until it returns a key whose objectid is higher than our inode or until
      the key's type is higher than our maximum allowed type.
      
      At the end of the loop, we increment our mininum search key's objectid
      and type regardless of our desired target objectid and maximum desired
      type, which causes another loop iteration that will call again
      btrfs_search_forward() just to figure out we've gone beyond our maximum
      key and exit the loop. Therefore while incrementing our minimum key,
      don't do it blindly and exit the loop immiediately if the next search
      key's objectid or type is beyond what we seek.
      
      Also after incrementing the type, set the key's offset to 0, which was
      missing and could make us loose some of the inode's items.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      3d41d702
    • Filipe David Borba Manana's avatar
      6174d3cb
    • Liu Bo's avatar
      Btrfs: fix memory leak of chunks' extent map · 7d3d1744
      Liu Bo authored
      As we're hold a ref on looking up the extent map, we need to drop the ref
      before returning to callers.
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      7d3d1744
    • Miao Xie's avatar
      Btrfs: improve jitter performance of the sequential buffered write · fa7c1494
      Miao Xie authored
      The performance was slowed down sometimes when we ran sysbench to measure
      the performance of the sequential buffered write by 2 or more threads.
      
      It was because the write order of the test threads might be confused
      by the task scheduler, and the coming write would be beyond the end of
      the file, in this case, we need insert dummy file extents and create
      a hole for the area we skip. But in order to avoid the ongoing ordered
      extents which are in the area, we need wait for them. Unfortunately,
      the current code doesn't check if there are ordered extents in the area
      or not, try to find and flush the dirty pages directly, but in fact,
      there is no dirty page in that area, this step of the current code is
      unnecessary, and just wastes time. Sometimes, it would increase
      the contention of some locks, and makes the performance slow down suddenly.
      
      So we remove the ordered extent flush function before the check, and flush
      the dirty pages and wait for the ordered extents only when we find them.
      
      According to my test, we got 1-2 times of the performance regression when
      we ran the test by 10 times before applying this patch. After applying
      this patch, the regression went away.
      
      Test Environment:
       CPU:		1CPU * 4Cores
       Memory:	6GB
       Partition:	20GB
      
      Test Command:
       # sysbench --test=fileio --file-total-size=16G --file-test-mode=seqwr \
       > --num-threads=512 --file-block-size=16384 --max-time=60 --max-requests=0 run
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      fa7c1494
    • Miao Xie's avatar
      Btrfs: fix BUG_ON() casued by the reserved space migration · 20dd2cbf
      Miao Xie authored
      When we did space balance and snapshot creation at the same time, we might
      meet the following oops:
       kernel BUG at fs/btrfs/inode.c:3038!
       [SNIP]
       Call Trace:
       [<ffffffffa0411ec7>] btrfs_orphan_cleanup+0x293/0x407 [btrfs]
       [<ffffffffa042dc45>] btrfs_mksubvol.isra.28+0x259/0x373 [btrfs]
       [<ffffffffa042de85>] btrfs_ioctl_snap_create_transid+0x126/0x156 [btrfs]
       [<ffffffffa042dff1>] btrfs_ioctl_snap_create_v2+0xd0/0x121 [btrfs]
       [<ffffffffa0430b2c>] btrfs_ioctl+0x414/0x1854 [btrfs]
       [<ffffffff813b60b7>] ? __do_page_fault+0x305/0x379
       [<ffffffff811215a9>] vfs_ioctl+0x1d/0x39
       [<ffffffff81121d7c>] do_vfs_ioctl+0x32d/0x3e2
       [<ffffffff81057fe7>] ? finish_task_switch+0x80/0xb8
       [<ffffffff81121e88>] SyS_ioctl+0x57/0x83
       [<ffffffff813b39ff>] ? do_device_not_available+0x12/0x14
       [<ffffffff813b99c2>] system_call_fastpath+0x16/0x1b
       [SNIP]
       RIP  [<ffffffffa040da40>] btrfs_orphan_add+0xc3/0x126 [btrfs]
      
      The reason of the problem is that the relocation root creation stole
      the reserved space, which was reserved for orphan item deletion.
      
      There are several ways to fix this problem, one is to increasing
      the reserved space size of the space balace, and then we can use
      that space to create the relocation tree for each fs/file trees.
      But it is hard to calculate the suitable size because we doesn't
      know how many fs/file trees we need relocate.
      
      We fixed this problem by reserving the space for relocation root creation
      actively since the space it need is very small (one tree block, used for
      root node copy), then we use that reserved space to create the
      relocation tree. If we don't reserve space for relocation tree creation,
      we will use the reserved space of the balance.
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      20dd2cbf
    • Ross Kirk's avatar
      btrfs: remove unused parameter from btrfs_header_fsid · 0a4e5586
      Ross Kirk authored
      Remove unused parameter, 'eb'. Unused since introduction in
      5f39d397
      
      Updated to be rebased against current upstream and correct diff supplied this time!
      Signed-off-by: default avatarRoss Kirk <ross.kirk@gmail.com>
      Reviewed-by: default avatarEric Sandeen <sandeen@redhat.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      0a4e5586
    • Josef Bacik's avatar
      Btrfs: fix two use-after-free bugs with transaction cleanup · 724e2315
      Josef Bacik authored
      I was noticing the slab redzone stuff going off every once and a while during
      transaction aborts.  This was caused by two things
      
      1) We would walk the pending snapshots and set their error to -ECANCELED.  We
      don't need to do this, the snapshot stuff waits for a transaction commit and if
      there is a problem we just free our pending snapshot object and exit.  Doing
      this was causing us to touch the pending snapshot object after the thing had
      already been freed.
      
      2) We were freeing the transaction manually with wanton disregard for it's
      use_count reference counter.  To fix this I cleaned up the transaction freeing
      loop to either wait for the transaction commit to finish if it was in the middle
      of that (since it will be cleaned and freed up there) or to do the cleanup
      oursevles.
      
      I also moved the global "kill all things dirty everywhere" stuff outside of the
      transaction cleanup loop since that only needs to be done once.  With this patch
      I'm no longer seeing slab corruption because of use after frees.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      724e2315