1. 21 Sep, 2013 13 commits
    • Josef Bacik's avatar
      Btrfs: improve replacing nocow extents · 652f25a2
      Josef Bacik authored
      Various people have hit a deadlock when running btrfs/011.  This is because when
      replacing nocow extents we will take the i_mutex to make sure nobody messes with
      the file while we are replacing the extent.  The problem is we are already
      holding a transaction open, which is a locking inversion, so instead we need to
      save these inodes we find and then process them outside of the transaction.
      
      Further we can't just lock the inode and assume we are good to go.  We need to
      lock the extent range and then read back the extent cache for the inode to make
      sure the extent really still points at the physical block we want.  If it
      doesn't we don't have to copy it.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      652f25a2
    • Josef Bacik's avatar
      Btrfs: drop dir i_size when adding new names on replay · d555438b
      Josef Bacik authored
      So if we have dir_index items in the log that means we also have the inode item
      as well, which means that the inode's i_size is correct.  However when we
      process dir_index'es we call btrfs_add_link() which will increase the
      directory's i_size for the new entry.  To fix this we need to just set the dir
      item's i_size to 0, and then as we find dir_index items we adjust the i_size.
      btrfs_add_link() will do it for new entries, and if the entry already exists we
      can just add the name_len to the i_size ourselves.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      d555438b
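The i_size accounting described in d555438b can be illustrated with a small user-space model. This is only a sketch, not kernel code: the function name and arguments are hypothetical, and each name is assumed to contribute `len(name)` to i_size purely for illustration.

```python
def replay_dir_index_items(existing_names, logged_names, logged_i_size, reset_first):
    """Toy model of replaying dir_index items for one directory.

    Assumption for this sketch: each name contributes len(name) to i_size.
    """
    names = set(existing_names)
    # Fixed behaviour starts from 0; old behaviour starts from the logged i_size.
    i_size = 0 if reset_first else logged_i_size
    for name in logged_names:
        if name not in names:
            # Stands in for btrfs_add_link(): inserts the entry and bumps i_size.
            names.add(name)
            i_size += len(name)
        elif reset_first:
            # Entry already exists, so account for its name length ourselves.
            i_size += len(name)
    return i_size
```

With one existing entry `"a"` and a log containing `"a"` and `"bb"` (logged i_size 3), the old behaviour overshoots to 5 while the reset-then-accumulate behaviour ends at 3.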
    • Josef Bacik's avatar
      Btrfs: replay dir_index items before other items · dd8e7217
      Josef Bacik authored
      A user reported a bug where his log would not replay because he was getting
      -EEXIST back.  This was because he had a file moved into a directory that was
      logged.  What happens is the file had a lower inode number, and so it is
      processed first when replaying the log, and so we add the inode ref in for the
      directory it was moved to.  But then we process the directory's DIR_INDEX item
      and try to add the inode ref for that inode and it fails because we already
      added it when we replayed the inode.  To solve this problem we need to just
      process any DIR_INDEX items we have in the log first so this all is taken care
      of, and then we can replay the rest of the items.  With this patch my reproducer
      can remount the file system properly instead of erroring out.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      dd8e7217
    • Josef Bacik's avatar
      Btrfs: check roots last log commit when checking if an inode has been logged · a5874ce6
      Josef Bacik authored
      Liu introduced a local copy of the last log commit for an inode to make sure we
      actually log an inode even if a log commit has already taken place.  In order to
      make sure we didn't relog the same inode multiple times he set this local copy
      to the current trans when we log the inode, because usually we log the inode and
      then sync the log.  The exception to this is during rename, we will relog an
      inode if the name changed and it is already in the log.  The problem with this
      is then we go to sync the inode, and our check to see if the inode has already
      been logged is tripped and we don't sync the log.  To fix this we need to _also_
      check against the root's last log commit, because it could be less than what is
      in our local copy of the log commit.  This fixes a bug where we rename a file
      into a directory and then fsync the directory and then on remount the directory
      is no longer there.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      a5874ce6
    • Josef Bacik's avatar
      Btrfs: actually log directory we are fsync()'ing · de2b530b
      Josef Bacik authored
      If you just create a directory and then fsync that directory and then pull the
      power plug you will come back up and the directory will not be there.  That is
      because we won't actually create directories if we've logged files inside of
      them since they will be created on replay, but in this check we will set our
      logged_trans of our current directory if it happens to be a directory, making us
      think it doesn't need to be logged.  Fix the logic to only do this to parent
      directories.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      de2b530b
    • Josef Bacik's avatar
      Btrfs: actually limit the size of delalloc range · 573aecaf
      Josef Bacik authored
      So forever we have had this thing to limit the amount of delalloc pages we'll
      set up to be written out to 128MB.  This is because we have to lock all the pages
      in this range, so anything above this gets a bit unwieldy, and also without a
      limit we'll happily allocate gigantic chunks of disk space.  Turns out our check
      for this wasn't quite right: we wouldn't actually limit the chunk we wanted to
      write out, we'd just stop looking for more space after we went over the limit.
      So if you do a giant 20GB dd on my box with lots of RAM I could get 2GB
      extents.  This is fine normally, except when you go to relocate these extents
      and we can't find enough space to relocate these monster extents, since we have
      to be able to allocate exactly the same sized extent to move it around.  So fix
      this by actually enforcing the limit.  With this patch I'm no longer seeing
      giant 1.5GB extents.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      573aecaf
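The difference between "stop looking once over the limit" and "actually enforce the limit" in 573aecaf can be shown with a toy model. This is a sketch under assumptions, not the kernel implementation: the function names are hypothetical, and contiguous delalloc pieces are represented simply as a list of byte lengths.

```python
MAX_DELALLOC_BYTES = 128 * 1024 * 1024  # the 128MB limit from the commit message

def find_range_unclamped(start, pieces, limit=MAX_DELALLOC_BYTES):
    """Model of the old check: stop extending once over the limit, keep the overshoot."""
    end = start
    for length in pieces:
        end += length
        if end - start >= limit:
            break  # stops searching, but `end` may already be far past the limit
    return start, end

def find_range_clamped(start, pieces, limit=MAX_DELALLOC_BYTES):
    """Model of the fix: the returned range never exceeds the limit."""
    s, end = find_range_unclamped(start, pieces, limit)
    return s, min(end, s + limit)
```

In this model a single 2GB contiguous piece yields a 2GB range from the unclamped version and a 128MB range from the clamped one.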
    • Miao Xie's avatar
      Btrfs: allocate the free space by the existed max extent size when ENOSPC · a4820398
      Miao Xie authored
      With the current code, if the requested size is very large and all the extents
      in the free space cache are small, we waste a lot of CPU time cutting
      the requested size in half and searching the cache again and again until it gets
      down to a size the allocator can return. In fact, we know the max extent
      size in the cache after the first search, so we needn't cut the size in half
      repeatedly and can just use the max extent size directly. This saves
      a lot of CPU time and improves performance when there are only fragments
      in the free space cache.

      According to my test, if there are only 4KB free space extents in the fs,
      and the total size of those extents is 256MB, we can reduce the execution
      time of the following test from 5.4s to 1.4s.
        dd if=/dev/zero of=<testfile> bs=1MB count=1 oflag=sync
      
      Changelog v2 -> v3:
      - fix the problem that we skip the block group with the space which is
        less than we need.
      
      Changelog v1 -> v2:
      - address the problem that we return a wrong start position when searching
        the free space in a bitmap.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      a4820398
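The saving described in a4820398 comes from the number of cache searches. The sketch below is a simplified model, not the kernel allocator: the free space cache is assumed to be a plain list of extent sizes, all function names are hypothetical, and a "search" just reports whether any extent can hold the request along with the largest extent seen.

```python
def search(free_extents, want):
    """Return (found_size, max_seen); found_size is 0 when nothing can hold `want`."""
    max_seen = max(free_extents, default=0)
    return (want if max_seen >= want else 0), max_seen

def alloc_by_halving(free_extents, want, min_size=4096):
    """Old strategy (model): halve the request and search again until it fits."""
    searches = 0
    while want >= min_size:
        searches += 1
        found, _ = search(free_extents, want)
        if found:
            return found, searches
        want //= 2
    return 0, searches

def alloc_by_max_extent(free_extents, want, min_size=4096):
    """New strategy (model): after one failed search, retry with the max extent size."""
    searches = 1
    found, max_seen = search(free_extents, want)
    if found:
        return found, searches
    if max_seen < min_size:
        return 0, searches
    searches += 1
    found, _ = search(free_extents, max_seen)
    return found, searches
```

With only 4KB extents available and a 1MiB request, the halving model needs 9 searches while the max-extent model needs 2. Real search cost depends on the cache structure, which this sketch does not model.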
    • David Sterba's avatar
    • Stefan Behrens's avatar
      btrfs: show compiled-in config features at module load time · 79556c3d
      Stefan Behrens authored
      We want to know if there are debugging features compiled in, this may
      affect performance. The message is printed before the sanity checks.
      
      (This commit message is a copy of David Sterba's commit message when
      he introduced btrfs_print_info()).
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      79556c3d
    • Filipe David Borba Manana's avatar
      Btrfs: more efficient inode tree replace operation · cef21937
      Filipe David Borba Manana authored
      Instead of removing the current inode from the red black tree
      and then adding the new one, just use the red black tree replace
      operation, which is more efficient.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Reviewed-by: Zach Brown <zab@redhat.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      cef21937
    • Ilya Dryomov's avatar
      Btrfs: do not add replace target to the alloc_list · 55e50e45
      Ilya Dryomov authored
      If a replace was suspended by umount, the replace target device is added
      to the fs_devices->alloc_list during a later mount.  This is obviously
      wrong.  ->is_tgtdev_for_dev_replace is supposed to guard against that,
      but ->is_tgtdev_for_dev_replace is (and can only ever be) initialized
      *after* everything is opened and fs_devices lists are populated.  Fix
      this by checking the devid instead: for replace targets it's always
      equal to BTRFS_DEV_REPLACE_DEVID.
      
      Cc: Stefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Reviewed-by: Stefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      55e50e45
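The devid-based guard from 55e50e45 can be sketched as a filter applied while the allocation list is built. This is a toy model: devices are plain dictionaries, `build_alloc_list` is a hypothetical name, and the numeric value given to `BTRFS_DEV_REPLACE_DEVID` is an assumption made for the example.

```python
BTRFS_DEV_REPLACE_DEVID = 0  # assumed value for this sketch

def build_alloc_list(devices):
    """Collect devids eligible for allocation, skipping the replace target."""
    alloc_list = []
    for dev in devices:
        if not dev["writeable"]:
            continue
        # The devid is known as soon as the device is opened, unlike a flag
        # that is only initialized after the lists are populated.
        if dev["devid"] == BTRFS_DEV_REPLACE_DEVID:
            continue
        alloc_list.append(dev["devid"])
    return alloc_list
```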
    • Josef Bacik's avatar
      Btrfs: fixup error handling in btrfs_reloc_cow · 83d4cfd4
      Josef Bacik authored
      If we failed to actually allocate the correct size of the extent to relocate we
      will end up in an infinite loop because we won't return an error, we'll just
      move on to the next extent.  So fix this up by returning an error, and then fix
      all the callers to return an error up the stack rather than BUG_ON()'ing.
      Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      83d4cfd4
    • Chris Mason's avatar
      Merge tag 'v3.11' into for-linus · 07f0e62e
      Chris Mason authored
      Linux 3.11
      07f0e62e
  2. 02 Sep, 2013 4 commits
  3. 01 Sep, 2013 23 commits
    • Filipe David Borba Manana's avatar
      Btrfs: optimize key searches in btrfs_search_slot · d7396f07
      Filipe David Borba Manana authored
      When the binary search returns 0 (exact match), the target key
      will necessarily be at slot 0 of all nodes below the current one,
      so in this case the binary search is not needed because it will
      always return 0, and we waste time doing it, holding node locks
      for longer than necessary, etc.
      
      Below follow histograms with the times spent on the current approach of
      doing a binary search when the previous binary search returned 0, and
      times for the new approach, which directly picks the first item/child
      node in the leaf/node.
      
      Current approach:
      
      Count: 6682
      Range: 35.000 - 8370.000; Mean: 85.837; Median: 75.000; Stddev: 106.429
      Percentiles:  90th: 124.000; 95th: 145.000; 99th: 206.000
        35.000 -   61.080:  1235 ################
        61.080 -  106.053:  4207 #####################################################
       106.053 -  183.606:  1122 ##############
       183.606 -  317.341:   111 #
       317.341 -  547.959:     6 |
       547.959 - 8370.000:     1 |
      
      Approach proposed by this patch:
      
      Count: 6682
      Range:  6.000 - 135.000; Mean: 16.690; Median: 16.000; Stddev:  7.160
      Percentiles:  90th: 23.000; 95th: 27.000; 99th: 40.000
         6.000 -    8.418:    58 #
         8.418 -   11.670:  1149 #########################
        11.670 -   16.046:  2418 #####################################################
        16.046 -   21.934:  2098 ##############################################
        21.934 -   29.854:   744 ################
        29.854 -   40.511:   154 ###
        40.511 -   54.848:    41 #
        54.848 -   74.136:     5 |
        74.136 -  100.087:     9 |
       100.087 -  135.000:     6 |
      
      These samples were captured during a run of the btrfs tests 001, 002 and
      004 in the xfstests, with a leaf/node size of 4Kb.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      d7396f07
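The shortcut in d7396f07 relies on each key in an internal node being the smallest key of the corresponding child. The sketch below models that property in user space; the `Node` class and `search_slot` function are hypothetical simplifications and ignore locking, COW and everything else the real btrfs_search_slot does.

```python
from bisect import bisect_left

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys          # sorted; keys[i] is the smallest key under children[i]
        self.children = children  # None for a leaf

def bin_search(keys, key):
    """Return (exact, slot); when not exact, slot is the insertion point."""
    slot = bisect_left(keys, key)
    return (slot < len(keys) and keys[slot] == key), slot

def search_slot(root, key):
    """Return (found, path_of_slots, number_of_binary_searches)."""
    node, path, searches, prev_exact = root, [], 0, False
    while True:
        if prev_exact:
            # Exact match one level up: the key must sit at slot 0 here.
            exact, slot = True, 0
        else:
            searches += 1
            exact, slot = bin_search(node.keys, key)
        if node.children is None:
            path.append(slot)
            return exact, path, searches
        if not exact and slot > 0:
            slot -= 1  # descend into the child whose range covers the key
        path.append(slot)
        prev_exact = exact
        node = node.children[slot]
```

In a two-level tree with root keys `[10, 30, 50]`, looking up 30 needs one binary search (the leaf lookup is skipped), while looking up 40 still needs two.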
    • Josef Bacik's avatar
      Btrfs: don't use an async starter for most of our workers · 45d5fd14
      Josef Bacik authored
      We only need an async starter if we can't make a GFP_NOFS allocation in our
      current path.  This is the case for the endio stuff since it happens in IRQ
      context, but things like the caching thread workers and the delalloc flushers we
      can easily make this allocation and start threads right away.  Also change the
      worker count for the caching thread pool.  Traditionally we limited this to 2
      since we took read locks while caching, but nowadays we do this lockless so
      there's no reason to limit the number of caching threads.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      45d5fd14
    • Josef Bacik's avatar
      Btrfs: only update disk_i_size as we remove extents · 7f4f6e0a
      Josef Bacik authored
      This fixes a problem where if we fail a truncate we will leave the i_size set
      where we wanted to truncate to instead of where we were able to truncate to.
      Fix this by making btrfs_truncate_inode_items do the disk_i_size update as it
      removes extents, that way it will always be consistent with where its extents
      are.  Then if the truncate fails at all we can update the in-ram i_size with
      what we have on disk and delete the orphan item.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      7f4f6e0a
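The idea in 7f4f6e0a of keeping disk_i_size in step with extent removal can be modelled as follows. This is a sketch under simplifying assumptions: extents are `(offset, length)` tuples sorted by offset, the new size is extent-aligned, only shrinking is handled, and `fail_after` is a hypothetical knob that simulates a failure after that many removals.

```python
def truncate_items(extents, new_size, fail_after=None):
    """Remove extents at or beyond new_size from the end, tracking disk_i_size.

    Returns (remaining_extents, disk_i_size, succeeded).
    """
    remaining = list(extents)
    disk_i_size = max((off + length for off, length in remaining), default=0)
    removed = 0
    while remaining and remaining[-1][0] >= new_size:
        if fail_after is not None and removed == fail_after:
            # Failure: disk_i_size still matches the extents that are left.
            return remaining, disk_i_size, False
        off, _length = remaining.pop()
        disk_i_size = off  # updated as each extent goes away
        removed += 1
    return remaining, disk_i_size, True
```

On failure the caller in this model can copy the returned disk_i_size into the in-memory i_size, so the file size reflects how far the truncate actually got rather than the requested target.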
    • Filipe David Borba Manana's avatar
      Btrfs: fix deadlock in uuid scan kthread · f45388f3
      Filipe David Borba Manana authored
      If there's an ongoing transaction when the uuid scan kthread attempts
      to create one, the kthread will block, waiting for that transaction to
      finish while it's keeping locks on the tree root, and in turn the existing
      transaction is waiting for those locks to be free.
      
      The stack trace reported by the kernel follows.
      
      [36700.671601] INFO: task btrfs-uuid:15480 blocked for more than 120 seconds.
      [36700.671602] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [36700.671602] btrfs-uuid      D 0000000000000000     0 15480      2 0x00000000
      [36700.671604]  ffff880710bd5b88 0000000000000046 ffff8803d36ba850 0000000000030000
      [36700.671605]  ffff8806d76dc530 ffff880710bd5fd8 ffff880710bd5fd8 ffff880710bd5fd8
      [36700.671607]  ffff8808098ac530 ffff8806d76dc530 ffff880710bd5b98 ffff8805e4508e40
      [36700.671608] Call Trace:
      [36700.671610]  [<ffffffff816f36b9>] schedule+0x29/0x70
      [36700.671620]  [<ffffffffa05a3bdf>] wait_current_trans.isra.33+0xbf/0x120 [btrfs]
      [36700.671623]  [<ffffffff81066760>] ? add_wait_queue+0x60/0x60
      [36700.671629]  [<ffffffffa05a5b06>] start_transaction+0x3d6/0x530 [btrfs]
      [36700.671636]  [<ffffffffa05bb1f4>] ? btrfs_get_token_32+0x64/0xf0 [btrfs]
      [36700.671642]  [<ffffffffa05a5fbb>] btrfs_start_transaction+0x1b/0x20 [btrfs]
      [36700.671649]  [<ffffffffa05c8a81>] btrfs_uuid_scan_kthread+0x211/0x3d0 [btrfs]
      [36700.671655]  [<ffffffffa05c8870>] ? __btrfs_open_devices+0x2a0/0x2a0 [btrfs]
      [36700.671657]  [<ffffffff81065fa0>] kthread+0xc0/0xd0
      [36700.671659]  [<ffffffff81065ee0>] ? flush_kthread_worker+0xb0/0xb0
      [36700.671661]  [<ffffffff816fcd1c>] ret_from_fork+0x7c/0xb0
      [36700.671662]  [<ffffffff81065ee0>] ? flush_kthread_worker+0xb0/0xb0
      [36700.671663] INFO: task btrfs:15481 blocked for more than 120 seconds.
      [36700.671664] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [36700.671665] btrfs           D 0000000000000000     0 15481  15212 0x00000004
      [36700.671666]  ffff880248cbf4c8 0000000000000086 ffff8803d36ba700 ffff8801dbd5c280
      [36700.671668]  ffff880807815c40 ffff880248cbffd8 ffff880248cbffd8 ffff880248cbffd8
      [36700.671669]  ffff8805e86a0000 ffff880807815c40 ffff880248cbf4d8 ffff8801dbd5c280
      [36700.671670] Call Trace:
      [36700.671672]  [<ffffffff816f36b9>] schedule+0x29/0x70
      [36700.671679]  [<ffffffffa05d9b0d>] btrfs_tree_lock+0x6d/0x230 [btrfs]
      [36700.671680]  [<ffffffff81066760>] ? add_wait_queue+0x60/0x60
      [36700.671685]  [<ffffffffa0582829>] btrfs_search_slot+0x999/0xb00 [btrfs]
      [36700.671691]  [<ffffffffa05bd9de>] ? btrfs_lookup_first_ordered_extent+0x5e/0xb0 [btrfs]
      [36700.671698]  [<ffffffffa05e3e54>] __btrfs_write_out_cache+0x8c4/0xa80 [btrfs]
      [36700.671704]  [<ffffffffa05e4362>] btrfs_write_out_cache+0xb2/0xf0 [btrfs]
      [36700.671710]  [<ffffffffa05c4441>] ? free_extent_buffer+0x61/0xc0 [btrfs]
      [36700.671716]  [<ffffffffa0594c82>] btrfs_write_dirty_block_groups+0x562/0x650 [btrfs]
      [36700.671723]  [<ffffffffa0610092>] commit_cowonly_roots+0x171/0x24b [btrfs]
      [36700.671729]  [<ffffffffa05a4dde>] btrfs_commit_transaction+0x4fe/0xa10 [btrfs]
      [36700.671735]  [<ffffffffa0610af3>] create_subvol+0x5c0/0x636 [btrfs]
      [36700.671742]  [<ffffffffa05d49ff>] btrfs_mksubvol.isra.60+0x33f/0x3f0 [btrfs]
      [36700.671747]  [<ffffffffa05d4bf2>] btrfs_ioctl_snap_create_transid+0x142/0x190 [btrfs]
      [36700.671752]  [<ffffffffa05d4c6c>] ? btrfs_ioctl_snap_create+0x2c/0x80 [btrfs]
      [36700.671757]  [<ffffffffa05d4c9e>] btrfs_ioctl_snap_create+0x5e/0x80 [btrfs]
      [36700.671759]  [<ffffffff8113a764>] ? handle_pte_fault+0x84/0x920
      [36700.671764]  [<ffffffffa05d87eb>] btrfs_ioctl+0xf0b/0x1d00 [btrfs]
      [36700.671766]  [<ffffffff8113c120>] ? handle_mm_fault+0x210/0x310
      [36700.671768]  [<ffffffff816f83a4>] ? __do_page_fault+0x284/0x4e0
      [36700.671770]  [<ffffffff81180aa6>] do_vfs_ioctl+0x96/0x550
      [36700.671772]  [<ffffffff81170fe3>] ? __sb_end_write+0x33/0x70
      [36700.671774]  [<ffffffff81180ff1>] SyS_ioctl+0x91/0xb0
      [36700.671775]  [<ffffffff816fcdc2>] system_call_fastpath+0x16/0x1b
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      f45388f3
    • Ilya Dryomov's avatar
      Btrfs: stop refusing the relocation of chunk 0 · 795a3321
      Ilya Dryomov authored
      AFAICT chunk 0 is no longer special, and so it should be restriped just
      like every other chunk.  One reason for this change is us refusing the
      relocation can lead to filesystems that can only be mounted ro, and
      never rw -- see the bugzilla [1] for details.  The other reason is that
      device removal code is already doing this: it will happily relocate
      chunk 0 as part of shrinking the device.

      [1] https://bugzilla.kernel.org/show_bug.cgi?id=60594

      Reported-by: Xavier Bassery <xavier@bartica.org>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      795a3321
    • Filipe David Borba Manana's avatar
    • Andy Shevchenko's avatar
      btrfs: reuse kbasename helper · ed84885d
      Andy Shevchenko authored
      To get the name of the file from a pathname, let's use the kbasename() helper. It
      simplifies the code a bit.
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      ed84885d
    • Anand Jain's avatar
      btrfs: return btrfs error code for dev excl ops err · e57138b3
      Anand Jain authored
      Threads can now return BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS,
      as defined in btrfs.h, for a dev excl operation error in
      the FS. With this, the kernel stops logging what is
      almost a user error into /var/log/messages.

      v2: addresses Josef's comment
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      e57138b3
    • Josef Bacik's avatar
      Btrfs: allow partial ordered extent completion · 77cef2ec
      Josef Bacik authored
      We currently have this problem where you can truncate pages that have not yet
      been written for an ordered extent.  We do this because the truncate will be
      coming behind to clean us up anyway so what's the harm right?  Well if truncate
      fails for whatever reason we leave an orphan item around for the file to be
      cleaned up later.  But if the user goes and truncates up the file and tries to
      read from the area that had been discarded previously they will get a csum error
      because we never actually wrote that data out.
      
      This patch fixes this by allowing us to either discard the ordered extent
      completely, by which I mean we just free up the space we had allocated and not
      add the file extent, or adjust the length of the file extent we write.  We do
      this by setting the length we truncated down to in the ordered extent, and then
      we set the file extent length and ram bytes to this length.  The total disk
      space stays unchanged since we may be compressed and we can't just chop off the
      disk space, but at least this way the file extent only points to the valid data.
      Then when the file extent is free'd the extent and csums will be freed normally.
      
      This patch is needed for the next series which will give us more graceful
      recovery of failed truncates.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      77cef2ec
    • Josef Bacik's avatar
      Btrfs: convert all bug_ons in free-space-cache.c · b12d6869
      Josef Bacik authored
      All of these are logic checks to make sure we're not breaking anything, so
      convert them over to ASSERT().  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      b12d6869
    • Josef Bacik's avatar
      Btrfs: add support for asserts · 2e17c7c6
      Josef Bacik authored
      One of the complaints we get a lot is how many BUG_ON()'s we have.  So to help
      with this I'm introducing a kconfig option to enable/disable a new ASSERT()
      mechanism much like what XFS does.  This will allow us developers to still get
      our nice panics but allow users/distros to compile them out.  With this we can
      go through and convert any BUG_ON()'s that we have to catch actual programming
      mistakes to the new ASSERT() and then fix everybody else to return errors.  This
      will also allow developers to leave sanity checks in their new code to make sure
      we don't trip over problems while testing stuff and vetting new features.
      Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      2e17c7c6
    • Josef Bacik's avatar
      Btrfs: adjust the fs_devices->missing count on unmount · 726551eb
      Josef Bacik authored
      I noticed that if I tried to mount a file system with -o degraded after having
      done it once already we would fail to mount.  This is because the
      fs_devices->missing count was getting bumped every time we mounted, but not
      getting reset whenever we unmounted.  To fix this we just drop the missing count
      as we're closing devices to make sure this doesn't happen.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      726551eb
    • Stefan Behrens's avatar
      Btrf: cleanup: don't check for root_refs == 0 twice · 23fa76b0
      Stefan Behrens authored
      btrfs_read_fs_root_no_name() already checks if btrfs_root_refs()
      is zero and returns ENOENT in this case. There is no need to do
      it again in three more places.
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      23fa76b0
    • Stefan Behrens's avatar
      Btrfs: fix for patch "cleanup: don't check the same thing twice" · 48475471
      Stefan Behrens authored
      Mitch Harder noticed that the patch 3c64a1ab mentioned in the subject
      line was causing a kernel BUG() on snapshot deletion.
      
      The patch was wrong. It did not handle cached roots correctly. The
      check for root_refs == 0 was removed everywhere where
      btrfs_read_fs_root_no_name() had been used to retrieve the root,
      because this check was already dealt with in
      btrfs_read_fs_root_no_name(). But in the case when the root was
      found in the cache, there was no such check.
      
      This patch adds the missing check in the case where the root is
      found in the cache.
      Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
      Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      48475471
    • Stefan Behrens's avatar
      Btrfs: get rid of one BUG() in write_all_supers() · 9d565ba4
      Stefan Behrens authored
      The second round uses btrfs_error() and returns -EIO; the first round
      can handle write errors the same way.
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      9d565ba4
    • Wang Shilong's avatar
      Btrfs: allocate prelim_ref with a slab allocater · b9e9a6cb
      Wang Shilong authored
      struct __prelim_ref is allocated and freed frequently when
      walking the backref tree. Using a slab allocator not only
      speeds up allocation but also helps detect memory leaks.
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
      Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      b9e9a6cb
    • Wang Shilong's avatar
      Btrfs: pass gfp_t to __add_prelim_ref() to avoid always using GFP_ATOMIC · 742916b8
      Wang Shilong authored
      Currently, only add_delayed_refs() has to allocate with GFP_ATOMIC,
      so just pass a 'gfp_t' argument to decide which allocation mode to use.
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
      Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      742916b8
    • Filipe David Borba Manana's avatar
      Btrfs: fix race conditions in BTRFS_IOC_FS_INFO ioctl · f7171750
      Filipe David Borba Manana authored
      The handler for the ioctl BTRFS_IOC_FS_INFO was reading the
      number of devices before acquiring the device list mutex.
      
      This could lead to inconsistent results because the device list
      and the number-of-devices counter (amongst other counters related
      to the device list) are updated in volumes.c while holding the
      device list mutex - except for 2 places, one was
      volumes.c:btrfs_prepare_sprout() and the other was
      volumes.c:device_list_add().

      For example, if we have 2 devices, with IDs 1 and 2, and then add
      a new device, with ID 3, and while adding the device is in progress
      a BTRFS_IOC_FS_INFO ioctl arrives, it could return a number of
      devices of 2 and a max dev id of 3. This would be incorrect.

      Also, this ioctl handler was reading the fsid while it can be
      updated concurrently. This can happen while a new device is
      being added and the current filesystem is in seeding mode.
      Example:
      
      $ mkfs.btrfs -f /dev/sdb1
      $ mkfs.btrfs -f /dev/sdb2
      $ btrfstune -S 1 /dev/sdb1
      $ mount /dev/sdb1 /mnt/test
      $ btrfs device add /dev/sdb2 /mnt/test
      
      If during the last step a BTRFS_IOC_FS_INFO ioctl was requested, it
      could read an fsid that was never valid (some bits part of the old
      fsid and others part of the new fsid). Also, it could read a number
      of devices that doesn't match the number of devices in the list and
      the max device id, as explained before.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      f7171750
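The fix in f7171750 amounts to reading every related field under the same lock that writers hold. A minimal user-space sketch of that pattern follows; the `FsDevices` class is hypothetical, and `threading.Lock` merely stands in for the device list mutex.

```python
import threading

class FsDevices:
    """Toy stand-in for the device list and its counters."""

    def __init__(self, devids):
        self.lock = threading.Lock()  # plays the role of device_list_mutex
        self.devids = list(devids)
        self.num_devices = len(self.devids)

    def add_device(self, devid):
        # List and counter change together while the lock is held.
        with self.lock:
            self.devids.append(devid)
            self.num_devices += 1

    def fs_info(self):
        # Take the lock first, then read every field, so the snapshot is
        # never a mix of pre-update and post-update values.
        with self.lock:
            return {"num_devices": self.num_devices, "max_id": max(self.devids)}
```

Reading `num_devices` before taking the lock, as the old handler did, is what allowed a result such as 2 devices with a max id of 3.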
    • Filipe David Borba Manana's avatar
      Btrfs: fix race between removing a dev and writing sbs · d7306801
      Filipe David Borba Manana authored
      This change fixes an issue when removing a device and writing
      all super blocks run simultaneously. Here are the steps necessary
      for the issue to happen:
      
      1) disk-io.c:write_all_supers() gets a number of N devices from the
         super_copy, so it will not panic if it fails to write super blocks
         for N - 1 devices;
      
      2) Then it tries to acquire the device_list_mutex, but blocks because
         volumes.c:btrfs_rm_device() got it first;
      
      3) btrfs_rm_device() removes the device from the list, then unlocks the
         mutex and after the unlock it updates the number of devices in
         super_copy to N - 1.
      
      4) write_all_supers() finally acquires the mutex, iterates over all the
         devices in the list and gets N - 1 errors, that is, it failed to write
         super blocks to all the devices;
      
      5) Because write_all_supers() thinks there are a total of N devices, it
         considers N - 1 errors to be ok, and therefore won't panic.
      
      So this change just makes sure that write_all_supers() reads the number
      of devices from super_copy after it acquires the device_list_mutex.
      Conversely, it changes btrfs_rm_device() to update the number of devices
      in super_copy before it releases the device list mutex.
      
      The code path to add a new device (volumes.c:btrfs_init_new_device),
      already has the right behaviour: it updates the number of devices in
      super_copy while holding the device_list_mutex.
      
      The only code path that doesn't lock the device list mutex
      before updating the number of devices in the super copy is
      disk-io.c:next_root_backup(), called by open_ctree() during
      mount time where concurrency issues can't happen.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      d7306801
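The arithmetic behind steps 1 to 5 of d7306801 can be checked with a tiny model. This is only a sketch: `tolerates` is a hypothetical helper that captures the "at most N - 1 failed writes is acceptable" rule described above and nothing else about write_all_supers().

```python
def tolerates(total_devices, write_errors):
    """Model of the check: acceptable as long as at most total_devices - 1 writes failed."""
    max_errors = total_devices - 1
    return write_errors <= max_errors

# Scenario from the steps above, with N = 3 devices and one removed concurrently.
stale_count = 3   # read from super_copy before taking the mutex
fresh_count = 2   # read after the removal finished
errors = 2        # every remaining device failed to write

stale_verdict = tolerates(stale_count, errors)  # wrongly accepted
fresh_verdict = tolerates(fresh_count, errors)  # correctly rejected
```

With the stale count, losing every super block write goes unnoticed; reading the count after acquiring the mutex makes the same situation fail the check.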
    • Josef Bacik's avatar
      Btrfs: remove ourselves from the cluster list under lock · b8d0c69b
      Josef Bacik authored
      A user was reporting weird warnings from btrfs_put_delayed_ref() and I noticed
      that we were doing this list_del_init() on our head ref outside of
      delayed_refs->lock.  This is a problem if we have people still on the list, we
      could end up modifying old pointers and such.  Fix this by removing us from the
      list before we do our run_delayed_ref on our head ref.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      b8d0c69b
    • Josef Bacik's avatar
      Btrfs: do not clear our orphan item runtime flag on eexist · e8e7cff6
      Josef Bacik authored
      We were unconditionally clearing our runtime flag on the inode on error when
      trying to insert an orphan item.  This is wrong in the case of -EEXIST since we
      obviously have an orphan item.  This was causing us to not do the correct
      cleanup of our orphan items which caused issues on cleanup.  This happens
      because currently when truncate fails we just leave the orphan item on there so
      it can be cleaned up, so if we go to remove the file later we will hit this
      issue.  What we do for truncate isn't right either, but we shouldn't screw this
      sort of thing up on error either, so fix this and then I'll fix truncate in a
      different patch.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      e8e7cff6
    • Josef Bacik's avatar
      Btrfs: fix send to deal with sparse files properly · 57cfd462
      Josef Bacik authored
      Send was just sending everything it found, even if the extent was a hole.  This
      is unpleasant for users, so just skip holes when we are sending.  This will also
      skip sending prealloc extents since the send spec doesn't have a prealloc
      command.  Eventually we will add a prealloc command and rev the send version so
      we can send down the prealloc info.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      57cfd462
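The filtering described in 57cfd462 can be sketched as follows. This is a toy model rather than the send code: extents are plain dictionaries, `extents_to_send` is a hypothetical name, and a hole is assumed to be represented as a regular extent whose `disk_bytenr` is 0.

```python
def extents_to_send(extents):
    """Return only the extents that carry data worth sending."""
    out = []
    for ext in extents:
        if ext["type"] == "prealloc":
            continue  # no prealloc command in the send stream yet
        if ext["type"] == "regular" and ext["disk_bytenr"] == 0:
            continue  # hole (assumed representation): nothing to send
        out.append(ext)
    return out
```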
    • Filipe David Borba Manana's avatar
      Btrfs: fix printing of non NULL terminated string · bdab49d7
      Filipe David Borba Manana authored
      The name buffer is not terminated by a '\0' character,
      therefore it needs to be printed with %.*s and use the
      length of the buffer.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      bdab49d7
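The `%.*s` precision form mentioned in bdab49d7 takes the maximum number of characters to print from an argument. Python's printf-style formatting accepts the same syntax, so the effect can be illustrated in user space. This is an analogy only: in C the point is to avoid reading past an unterminated buffer, which the sketch models with trailing bytes that do not belong to the name, and `format_name` is a hypothetical helper.

```python
def format_name(buf: bytes, name_len: int) -> str:
    """Print at most name_len characters of buf, like C's printf("%.*s", len, buf)."""
    text = buf.decode("ascii", errors="replace")
    # The '*' pulls the precision (maximum length) from the argument tuple.
    return "name=%.*s" % (name_len, text)
```

For a buffer holding `b"fooXYZ"` with a name length of 3, only `foo` is printed; the bytes after the name are ignored.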