1. 26 Feb, 2013 4 commits
    • Btrfs: use reserved space for creating a snapshot · 2382c5cc
      Liu Bo authored
      While inserting the dir index and updating the inode for a snapshot, we
      add delayed items which consume trans->block_rsv. If we don't have any
      space reserved in this trans handle, we either just return or reserve
      space again.
      
      But before creating pending snapshots while committing the transaction,
      we have already done a release on this trans handle, so we don't have any
      space reserved in it at this stage.
      
      What we're using is the block_rsv of the pending snapshot, which has
      already reserved enough space for both inserting the dir index and
      updating the inode, so we need to set the trans handle to indicate that
      we have space now.
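      A minimal sketch of the idea (assuming the pending snapshot's reservation
      lives in pending->block_rsv; names are illustrative, not the literal patch):
      
        /* sketch: point the trans handle at the space already reserved for
         * the pending snapshot before inserting the dir index and updating
         * the inode */
        trans->block_rsv = &pending->block_rsv;
        trans->bytes_reserved = trans->block_rsv->reserved;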
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • clear chunk_alloc flag on retryable failure · a81cb9a2
      Alexandre Oliva authored
      I've experienced filesystem freezes with permanent spikes in the active
      process count for quite a while, particularly on filesystems whose
      available raw space has already been fully allocated to chunks.
      
      While looking into this, I found a pretty obvious error in
      do_chunk_alloc: it sets space_info->chunk_alloc, but if
      btrfs_alloc_chunk returns an error other than ENOSPC, it returns leaving
      that flag set, which causes any other threads waiting for
      space_info->chunk_alloc to become zero to spin indefinitely.
      
      I haven't double-checked that this patch fixes the failure I've observed
      fully (it's not exactly trivial to trigger), but it surely is a bug and
      the fix is trivial, so...  Please put it in :-)
      
      What I saw in that function also happens to explain why in some cases I
      see filesystems allocate a huge number of chunks that remain unused
      (leading to the scenario above, of not having more chunks to allocate).
      It happens for data and metadata, but not necessarily both.  I'm
      guessing some thread sets the force_alloc flag on the corresponding
      space_info, and then several threads trying to get disk space end up
      attempting to allocate a new chunk concurrently.  All of them will see
      the force_alloc flag and bump their local copy of force up to the level
      they see first, and they won't clear it even if another thread succeeds
      in allocating a chunk, thus clearing the force flag.  Then each thread
      that observed the force flag will, on its turn, force the allocation of
      a new chunk.  And any threads that come in while it does that will see
      the force flag still set and pick it up, and so on.  This sounds like a
      problem to me, but...  what should the correct behavior be?  Clear
      force_flag once we copy it to a local force?  Reset force to the
      incoming value on every loop?  Set the flag to our incoming force if we
      have it at first, clear our local flag, and move it from the space_info
      when we determined that we are the thread that's going to perform the
      allocation?
      
      btrfs: clear chunk_alloc flag on retryable failure
      
      From: Alexandre Oliva <oliva@gnu.org>
      
      If btrfs_alloc_chunk fails with e.g. ENOMEM, we exit do_chunk_alloc
      without clearing chunk_alloc in space_info.  As a result, any further
      calls to do_chunk_alloc on that filesystem will start busy-waiting for
      chunk_alloc to be cleared, but it never will be.  This patch adjusts
      do_chunk_alloc so that it clears this flag in case of an error.
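      A rough sketch of the shape of the fix (locking and the other exit paths
      are simplified; not the literal patch):
      
        /* sketch: in do_chunk_alloc(), make sure chunk_alloc is cleared even
         * when btrfs_alloc_chunk() fails with something other than ENOSPC,
         * so threads waiting for the flag don't spin forever */
        ret = btrfs_alloc_chunk(trans, extent_root, flags);
        if (ret < 0 && ret != -ENOSPC)
                goto out;
        /* ... */
        out:
                spin_lock(&space_info->lock);
                space_info->chunk_alloc = 0;
                spin_unlock(&space_info->lock);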
      Signed-off-by: Alexandre Oliva <oliva@gnu.org>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: fix backref walking race with tree deletions · ca60ebfa
      Jan Schmidt authored
      When a subvolume is removed, we remove the root item from the root tree,
      while the tree blocks and backrefs remain for a while. When backref walking
      comes across one of those orphan tree blocks, it can find a backref for a
      root that no longer exists. This is all good; we only have to tolerate
      __resolve_indirect_ref returning an error and continue with the good refs
      found.
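      A hedged sketch of that tolerance in the backref resolution loop (the
      exact error handling in the patch may differ):
      
        /* sketch: in __resolve_indirect_refs(), skip a ref whose root is gone
         * instead of failing the whole walk; real errors still abort */
        err = __resolve_indirect_ref(fs_info, search_commit_root, time_seq,
                                     ref, parents, extent_item_pos);
        if (err == -ENOMEM)
                goto out;
        if (err)
                continue;       /* orphan block of a deleted root: ignore it */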
      Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
      Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: make sure NODATACOW also gets NODATASUM set · f2bdf9a8
      Josef Bacik authored
      A user reported hitting the BUG_ON() in btrfs_finished_ordered_io() where we had
      csums on a NOCOW extent.  This can happen if we have NODATACOW set but not
      NODATASUM set, which can happen in two cases, either we mount with -o nodatacow
      and then write into preallocated space, or chattr +C a directory and move a file
      into that directory.  Liu has fixed the move case in a different place, but this
      fixes the mount -o nodatacow case.  Thanks,
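      A minimal sketch of the -o nodatacow part of the fix, assuming it is
      applied where the flags of a newly created inode are set up:
      
        /* sketch: on a nodatacow mount, new inodes get NODATASUM together
         * with NODATACOW, so nocow extents never end up with csums */
        if (btrfs_test_opt(root, NODATACOW))
                BTRFS_I(inode)->flags |= BTRFS_INODE_NODATACOW |
                                         BTRFS_INODE_NODATASUM;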
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
  2. 21 Feb, 2013 2 commits
    • Btrfs: fix remount vs autodefrag · dc81cdc5
      Miao Xie authored
      If we remount the fs to turn off auto defragment or to make the fs R/O,
      we should stop the auto defragment.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • Btrfs: fix wrong outstanding_extents when doing DIO write · 172a5049
      Miao Xie authored
      When running xfstests case 083 on a filesystem mounted with
      "compress-force=lzo", the following WARNINGs were triggered.
        WARNING: at fs/btrfs/inode.c:7908
        WARNING: at fs/btrfs/inode.c:7909
        WARNING: at fs/btrfs/inode.c:7911
        WARNING: at fs/btrfs/extent-tree.c:4510
        WARNING: at fs/btrfs/extent-tree.c:4511
      
      This problem was introduced by the patch "Btrfs: fix deadlock due
      to unsubmitted". That patch contains two bugs which caused
      the above problem.
      
      The 1st one is an off-by-one bug: if the DIO write returns 0, it is
      also a short write and we need to release the reserved space for it,
      but we didn't do that in that patch. Fix it by changing "ret > 0" to
      "ret >= 0".
      
      The 2nd one is that ->outstanding_extents was increased twice when
      a short write happened. As we know, ->outstanding_extents is
      a counter to keep track of the number of extent items we may
      use due to delalloc: when we reserve the free space for a
      delalloc write, we assume that the write will introduce just
      one extent item, so we increase ->outstanding_extents by 1 at
      that time. And then we increase it every time we split the
      write, which is done at the beginning of btrfs_get_blocks_direct().
      So when a short write happens, we needn't increase
      ->outstanding_extents again, but that patch did.
      
      In order to fix the 2nd problem, I re-wrote the logic for the
      ->outstanding_extents operations. We don't increase it at the
      beginning of btrfs_get_blocks_direct(); instead, we just
      increase it when the split actually happens.
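      A rough sketch of the 1st fix, the short-write release after the DIO
      write returns (simplified; the exact code in the patch may differ):
      
        /* sketch: ret == 0 is also a short write, so release whatever part
         * of the reservation was not consumed */
        if (ret < 0 && ret != -EIOCBQUEUED)
                btrfs_delalloc_release_space(inode, count);
        else if (ret >= 0 && (size_t)ret < count)
                btrfs_delalloc_release_space(inode, count - (size_t)ret);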
      Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
  3. 20 Feb, 2013 34 commits
    • Btrfs: snapshot-aware defrag · 38c227d8
      Liu Bo authored
      This comes from one of btrfs's project ideas: as we defragment files,
      we break any sharing with other snapshots. The balancing code preserves
      that sharing, and defrag needs to grow this ability as well.
      
      Now we're able to fill that gap with this patch, in which we make full
      use of the backref walking code.
      
      Here is the basic idea:
      o  tag the writeback ranges started by defragment with the flag EXTENT_DEFRAG
      o  at endio, after we finish updating the fs tree, use backref walking to find
         all parents of the ranges and re-link them with the new COWed file layout by
         adding the corresponding backrefs.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • Btrfs: fix max chunk size on raid5/6 · 86db2578
      Chris Mason authored
      We try to limit the size of a chunk to 10GB, which keeps the unit of
      work reasonable during balance and resize operations.  The limit checks
      were taking into account the number of copies of the data we had but
      what they really should be doing is comparing against the logical
      size of the chunk we're creating.
      
      This moves the code around a little to use the count of data stripes
      from raid5/6.
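      A rough sketch of the comparison, assuming the raid5/6 stripe count is
      reduced by the parity stripes (variable names are illustrative):
      
        /* sketch: the logical chunk size is stripe_size times the number of
         * data stripes, so that is what the 10GB cap must be applied to */
        data_stripes = num_stripes - nparity;  /* raid5: 1 parity, raid6: 2 */
        if (stripe_size * data_stripes > max_chunk_size)
                stripe_size = div_u64(max_chunk_size, data_stripes);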
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • btrfs: limit fallocate extent reservation to 256MB · 24542bf7
      Zach Brown authored
      Very large fallocate requests are cpu bound and result in extents with a
      repeating pattern of ever decreasing size:
      
      $ time fallocate -l 1T file
      real	0m13.039s
      
      ( an excerpt of the extents from btrfs-debug-tree: )
        prealloc data disk byte 1536292564992 nr 397312
        prealloc data disk byte 1536292962304 nr 196608
        prealloc data disk byte 1536293158912 nr 98304
        prealloc data disk byte 1536293257216 nr 49152
        prealloc data disk byte 1536293306368 nr 24576
        prealloc data disk byte 1536293330944 nr 12288
        prealloc data disk byte 1536293343232 nr 8192
        prealloc data disk byte 1536293351424 nr 4096
        prealloc data disk byte 1536293355520 nr 4096
        prealloc data disk byte 1536293359616 nr 4096
      
      The excessive cpu use comes from __btrfs_prealloc_file_range() trying to
      allocate the entire remaining size after each extent is allocated.
      btrfs_reserve_extent() repeatedly cuts this requested size in half until
      it gets down to the size that the allocators can return.  We limit the
      problem for now by capping each reservation at 256 meg.
      
      The small extents come from a masking bug when decreasing the requested
      reservation size.  The high 32bits are cleared and the remaining low
      bits might happen to reserve a small size.   Fix this by using
      round_down() which properly casts the mask.
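      A sketch of the two changes described above (illustrative, with the
      surrounding code simplified):
      
        /* sketch 1: in __btrfs_prealloc_file_range(), cap each reservation */
        cur_bytes = min(num_bytes, 256ULL * 1024 * 1024);
      
        /* sketch 2: when the requested size is cut down, use round_down(),
         * whose mask is cast to the type of num_bytes; a plain int-sized
         * mask silently clears the high 32 bits of a u64 size */
        num_bytes = round_down(num_bytes, sectorsize);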
      
      After these fixes huge fallocate requests are fast and result in nice
      large extents:
      
      $ time fallocate -l 1T file
      real	0m0.082s
      
        prealloc data disk byte 1112425889792 nr 268435456
        prealloc data disk byte 1112694325248 nr 268435456
        prealloc data disk byte 1112962760704 nr 268435456
      Reported-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Zach Brown <zab@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • btrfs: Init io_lock after cloning btrfs device struct · 1cba0cdf
      Thomas Gleixner authored
      __btrfs_close_devices() clones btrfs device structs with
      memcpy(). Some of the fields in the clone are reinitialized, but
      io_lock is missed. In mainline this goes unnoticed, but on RT it
      leaves the plist pointing to the original, about-to-be-freed lock
      struct.
      
      Initialize io_lock after cloning, so no references to the original
      struct are left.
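      A minimal sketch of the fix (io_lock is the spinlock in struct
      btrfs_device; error handling omitted):
      
        /* sketch: after cloning the struct, reinitialize the lock so nothing
         * keeps pointing into the original, soon-to-be-freed structure */
        memcpy(new_device, device, sizeof(*new_device));
        spin_lock_init(&new_device->io_lock);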
      Reported-and-tested-by: Mike Galbraith <efault@gmx.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • Merge branch 'raid56-experimental' into for-linus-3.9 · e942f883
      Chris Mason authored
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      
      Conflicts:
      	fs/btrfs/ctree.h
      	fs/btrfs/extent-tree.c
      	fs/btrfs/inode.c
      	fs/btrfs/volumes.c
    • Merge branch 'master' of... · b2c6b3e0
      Chris Mason authored
      Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next into for-linus-3.9
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      
      Conflicts:
      	fs/btrfs/disk-io.c
    • Btrfs: fix missing release of qgroup reservation in commit_transaction() · 272d26d0
      Miao Xie authored
      We forget to free the qgroup reservation in commit_transaction(); fix it.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
      Cc: Arne Jansen <sensille@gmx.net>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: fix missing check before disabling quota · 683cebda
      Wang Shilong authored
      The original code forgot to check whether quota had already been disabled
      first, and would return -EINVAL to users in that case, which is unfriendly
      and confusing for users to see.
      So just return directly if quota has been disabled.
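      A minimal sketch of the check, assuming it sits at the top of the
      quota-disable path:
      
        /* sketch: quota is already disabled, there is nothing to do; don't
         * bother the user with an error */
        if (!fs_info->quota_root)
                goto out;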
      Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
      Cc: Arne Jansen <sensille@gmx.net>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: fix cleaner thread not working with inode cache option · fa6ac876
      Liu Bo authored
      Right now the inode cache inode is treated the same as the space cache
      inode, i.e. the inode is kept in memory until the super is put.
      
      But this leads to an awkward situation.
      
      If we're going to delete a snapshot/subvolume, btrfs will not
      actually delete it and return the free space, but will only add it to the
      dead roots list once the last inode on this snap/subvol is destroyed.
      Then we fetch the deleted roots and clean them up via the cleaner thread.
      
      So here is the problem: if we enable the inode cache option, each
      snap/subvol has a cached inode which is used to store inode allocation
      information.  And this cache inode will be kept in memory, as said above.
      So with the inode cache, a snap/subvol can only be added to the dead roots
      list during the root-freeing stage of umount, which means we can ONLY get
      the space back after another remount (we clean up dead roots on mount).
      
      But the truth is that we will no longer use the snap/subvol once we mark it
      deleted, so we can safely iput its cache inode when we delete the snap/subvol.
      
      Another thing is that we need to change the rules for dropping the inode: we
      don't keep the snap/subvol's cache inode in memory until the end, so that we
      can add the snap/subvol to the dead roots list in time.
      Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: fix uncompleted transaction · d4edf39b
      Miao Xie authored
      In some cases, we need commit the current transaction, but don't want
      to start a new one if there is no running transaction, so we introduce
      the function - btrfs_attach_transaction(), which can catch the current
      transaction, and return -ENOENT if there is no running transaction.
      
      But no running transaction doesn't mean the current transaction has
      completed, because we remove the running transaction from the list before
      it completes. In some cases that doesn't matter, but in some special cases,
      such as freezing the fs, we want the transaction to be fully on disk,
      otherwise bugs can show up: for example, we may freeze the fs and dump the
      data on the disk, and if the transaction hasn't completed, we would dump
      inconsistent data. So we need to fix the above problem for those cases.
      
      We fix this problem by introducing a function:
      	btrfs_attach_transaction_barrier()
      If we want the transaction to be fully on disk even though it is no longer
      running, we can use this function.
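      A hedged sketch of what such a barrier variant can look like, built on the
      attach semantics described above (not necessarily the literal
      implementation):
      
        /* sketch: like btrfs_attach_transaction(), but if no transaction is
         * running, wait until the previous one is fully on disk */
        struct btrfs_trans_handle *
        btrfs_attach_transaction_barrier(struct btrfs_root *root)
        {
                struct btrfs_trans_handle *trans;
      
                trans = btrfs_attach_transaction(root);
                if (IS_ERR(trans) && PTR_ERR(trans) == -ENOENT)
                        btrfs_wait_for_commit(root, 0);
                return trans;
        }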
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: fix the deadlock between the transaction start/attach and commit · 178260b2
      Miao Xie authored
      Now btrfs_commit_transaction() does this
      
      ret = btrfs_run_ordered_operations(root, 0)
      
      which asynchronously flushes all the inodes on the ordered operations list.
      This introduced a deadlock in which the transaction-start task, the
      transaction-commit task and the flush workers wait for each other.
      (See the following URL for the details:
       http://marc.info/?l=linux-btrfs&m=136070705732646&w=2)
      
      As we know, if ->in_commit is set, it means someone is committing the
      current transaction; we should not try to join it if we are not JOIN
      or JOIN_NOLOCK, and waiting is the best choice. In this way, we can avoid
      the above problem. There is also another benefit: there is no new
      transaction handle to block the transaction which is on its way to commit,
      once we set ->in_commit.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: fix the qgroup reserved space is released prematurely · 4b824906
      Miao Xie authored
      In start_transaction(), we will try to join the transaction again after
      the current transaction is committed, so we should not release the
      reserved space of the qgroup. Fix it.
      
      Cc: Arne Jansen <sensille@gmx.net>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • btrfs: define BTRFS_MAGIC as a u64 value · cdb4c574
      Zach Brown authored
      super.magic is an le64 but it's treated as an unterminated string when
      compared against BTRFS_MAGIC which is defined as a string.  Instead
      define BTRFS_MAGIC as a normal hex value and use endian helpers to
      compare it to the super's magic.
      
      I tested this by mounting an fs made before the change and made sure
      that it didn't introduce sparse errors.  This matches a similar cleanup
      that is pending in btrfs-progs.  David Sterba pointed out that we should
      fix the kernel side as well :).
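      A sketch of the change (the hex constant below is the little-endian
      encoding of the ascii string "_BHRfS_M"; the comparison site is
      illustrative):
      
        /* sketch: magic as a plain integer instead of an unterminated string */
        #define BTRFS_MAGIC 0x4D5F53665248425FULL /* ascii _BHRfS_M, no null */
      
        /* and compare with an endian helper instead of memcmp() */
        if (le64_to_cpu(disk_super->magic) != BTRFS_MAGIC)
                return -EINVAL;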
      Signed-off-by: Zach Brown <zab@redhat.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: set/change the label of a mounted file system · a8bfd4ab
      jeff.liu authored
      With this new ioctl(2) BTRFS_IOC_SET_FSLABEL, we can set/change the label of a mounted file system.
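      A hedged userspace sketch of using the new ioctl together with its GET
      counterpart (assuming both are exposed via linux/btrfs.h; the mount point
      and label are made up, error handling omitted):
      
        #include <fcntl.h>
        #include <string.h>
        #include <sys/ioctl.h>
        #include <linux/btrfs.h>
      
        char label[BTRFS_LABEL_SIZE] = "";
        int fd = open("/mnt", O_RDONLY);
      
        ioctl(fd, BTRFS_IOC_GET_FSLABEL, label);           /* read current label */
        strncpy(label, "backup-pool", BTRFS_LABEL_SIZE - 1);
        ioctl(fd, BTRFS_IOC_SET_FSLABEL, label);           /* set the new label */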
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
      Reviewed-by: Goffredo Baroncelli <kreijack@inwind.it>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: Add a new ioctl to get the label of a mounted file system · 867ab667
      jeff.liu authored
      Add a new ioctl(2), BTRFS_IOC_GET_FSLABEL, so that we can get the label of a mounted filesystem.
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Cc: Goffredo Baroncelli <kreijack@inwind.it>
      Cc: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: place ordered operations on a per transaction list · 569e0f35
      Josef Bacik authored
      Miao made the ordered operations stuff run async, which introduced a
      deadlock where we could get somebody (sync) racing in and committing the
      transaction while a commit was already happening.  The new committer would
      try and flush ordered operations which would hang waiting for the commit to
      finish because it is done asynchronously and no longer inherits the caller's
      trans handle.  To fix this we need to make the ordered operations list a per
      transaction list.  We can get new inodes added to the ordered operation list
      by truncating them and then having another process writing to them, so this
      makes it so that anybody trying to add an ordered operation _must_ start a
      transaction in order to add itself to the list, which will keep new inodes
      from getting added to the ordered operations list after we start committing.
      This should fix the deadlock and also keeps us from doing a lot more work
      than we need to during commit.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: relax the block group size limit for bitmaps · dde5740f
      Josef Bacik authored
      Dave pointed out that xfstests 273 will tell you that it failed to load the
      space cache for a block group when it remounts.  This is because we run out
      of space writing out the block group cache.  This is ok and is working as it
      should, but let's try to be a bit nicer.  This happens because the block
      group was 100mb, but bitmap entries cover 128mb, so we were only getting
      extent entries for this block group, which ended up being too many to fit in
      the free space cache.  So relax the bitmap size requirements to block groups
      that are at least half the size a bitmap will cover or larger, that way we
      can still keep the amount of space used in the free space cache low enough
      to be able to write it out.  With this patch I no longer fail to write out
      the free space cache.  Thanks,
      Reported-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: allow for selecting only completely empty chunks · 3e39cea6
      Ilya Dryomov authored
      Enhance balance usage filter by making it possible to balance out only
      completely empty chunks.  Today, usage filter properly acts on values
      from 1 to 99 inclusive, usage=100 selects all chunks, and usage=0
      selects no chunks.  This commit changes the usage=0 case: the new
      meaning is to restripe only completely empty chunks and nothing else.
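      For example, after this change the following restripes only chunks that
      contain no data at all (mount point is illustrative):
      
        # relocate only completely empty data and metadata chunks
        btrfs balance start -dusage=0 -musage=0 /mnt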
      Suggested-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: eliminate a use-after-free in btrfs_balance() · bf023ecf
      Ilya Dryomov authored
      Commit 5af3e8cc introduced a use-after-free at volumes.c:3139: bctl is freed
      above in __cancel_balance() in all cases except for balance pause.  Fix this
      by moving the offending check a couple of statements up; the meaning of the
      check is preserved.
      Reported-by: Chris Mason <chris.mason@fusionio.com>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: remove unused extent io tree ops V2 · c8f2f24b
      Josef Bacik authored
      Nobody uses these io tree ops anymore so just remove them and clean up the code
      a bit.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • btrfs: add cancellation points to defrag · 210549eb
      David Sterba authored
      The defrag operation can take very long, so we want a way to cancel it. The
      code checks for a pending signal at safe points in the defrag loops and
      returns EAGAIN. This means a user can press ^C after running
      'btrfs fi defrag'; it works for both defrag modes, files and root.
      
      Returning from the command was instant in my light tests, but may take
      longer depending on the aging factor of the filesystem.
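      A hedged sketch of the kind of check placed at those safe points (the real
      helper may also look at other state):
      
        /* sketch: bail out of the defrag loop if the user hit ^C */
        if (signal_pending(current)) {
                ret = -EAGAIN;
                break;
        }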
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • btrfs: put some enospc messages under enospc_debug · b069e0c3
      David Sterba authored
      The warning in use_block_rsv is not useful for users and may fill
      the logs unnecessarily.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: implement unlocked dio write · 38851cc1
      Miao Xie authored
      This idea is from ext4. With this patch, we can make dio writes parallel
      and improve the performance. But because we can not update i_size without
      i_mutex, the unlocked dio write can only be done in front of the EOF.
      
      We needn't worry about the race between dio write and truncate, because
      truncate needs to wait until all the dio writes end.
      
      And we also needn't worry about the race between dio write and punch hole,
      because we have the extent lock to protect our operation.
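      A hedged sketch of that locking rule around the dio write (simplified; the
      relock variable is illustrative):
      
        /* sketch: only a write entirely in front of EOF may drop i_mutex,
         * because an extending write must update i_size under the mutex */
        if (offset + count <= i_size_read(inode)) {
                mutex_unlock(&inode->i_mutex);
                relock = true;
        }
        ret = __blockdev_direct_IO( ... );
        if (relock)
                mutex_lock(&inode->i_mutex);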
      
      I ran fio to test the performance of this feature.
      
      == Hardware ==
      CPU: Intel(R) Core(TM)2 Duo CPU     E7500  @ 2.93GHz
      Mem: 2GB
      SSD: Intel X25-M 120GB (Test Partition: 60GB)
      
      == config file ==
      [global]
      ioengine=psync
      direct=1
      bs=4k
      size=32G
      runtime=60
      directory=/mnt/btrfs/
      filename=testfile
      group_reporting
      thread
      
      [file1]
      numjobs=1 # 2 4
      rw=randwrite
      
      == result (KBps) ==
      write	1	2	4
      lock	24936	24738	24726
      nolock	24962	30866	32101
      
      == result (iops) ==
      write	1	2	4
      lock	6234	6184	6181
      nolock	6240	7716	8025
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: serialize unlocked dio reads with truncate · 2e60a51e
      Miao Xie authored
      Currently, we can do unlocked dio reads, but the following race
      is possible:
      
      dio_read_task			truncate_task
      				->btrfs_setattr()
      ->btrfs_direct_IO
          ->__blockdev_direct_IO
            ->btrfs_get_block
      				  ->btrfs_truncate()
      				 #alloc truncated blocks
      				 #to other inode
            ->submit_io()
           #INFORMATION LEAK
      
      In order to avoid this problem, we must serialize unlocked dio reads with
      truncate. There are two approaches:
      - use extent lock to protect the extent that we truncate
      - use inode_dio_wait() to make sure the truncating task will wait for
        the read DIO.
      
      If we use the 1st one, we will meet an endless truncation problem due to
      the nonlocked read DIO after we implement the nonlocked write DIO. That is
      because we still need to invoke inode_dio_wait() to avoid the race between
      write DIO and truncation. By that time, we would have to introduce
      
        btrfs_inode_{block, resume}_nolock_dio()
      
      again; that is, we would have to implement this patch again. So I chose the
      2nd way to fix the problem.
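      A minimal sketch of the 2nd approach in the truncate path (placement and
      surrounding code are illustrative):
      
        /* sketch: before shrinking the file, wait for every outstanding
         * (unlocked) direct read so it cannot touch freed blocks */
        inode_dio_wait(inode);
        truncate_setsize(inode, newsize);
        ret = btrfs_truncate(inode);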
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: fix deadlock due to unsubmitted · 0934856d
      Miao Xie authored
      The deadlock problem happened when running fsstress (a test program in LTP).
      
      Steps to reproduce:
       # mkfs.btrfs -b 100M <partition>
       # mount <partition> <mnt>
       # <Path>/fsstress -p 3 -n 10000000 -d <mnt>
      
      The reason is:
      btrfs_direct_IO()
       |->do_direct_IO()
           |->get_page()
           |->get_blocks()
           |     |->btrfs_delalloc_reserve_space()
           |	 |->btrfs_add_ordered_extent() -------	Add a new ordered extent
           |->dio_send_cur_page(page0) --------------	We didn't submit bio here
           |->get_page()
           |->get_blocks()
                |->btrfs_delalloc_reserve_space()
      	     |->flush_space()
      		 |->btrfs_start_ordered_extent()
      		     |->wait_event() ----------	Wait the completion of
      						the ordered extent that is
      						mentioned above
      
      But because we didn't submit the bio that is mentioned above, the ordered
      extent can not complete, and we would wait for its completion forever.
      
      There are two methods which can fix this deadlock problem:
      1. submit the bio before we invoke get_blocks()
      2. reserve the space before we do dio
      
      Though the 1st is the simplest way, we would need to modify the VFS code,
      it is likely to break up contiguous requests, and it would introduce a
      performance regression for the other filesystems.
      
      So we have to choose the 2nd way.
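      A hedged sketch of the 2nd way (reserve the whole range up front instead of
      piece by piece inside get_blocks; simplified, not the literal patch):
      
        /* sketch: in btrfs_direct_IO(), reserve delalloc space for the whole
         * write before __blockdev_direct_IO() runs, so get_blocks() never has
         * to flush and wait on our own unsubmitted ordered extent */
        ret = btrfs_delalloc_reserve_space(inode, count);
        if (ret)
                return ret;
        ret = __blockdev_direct_IO( ... );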
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Cc: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: cleanup orphan reservation if truncate fails · 4a7d0f68
      Josef Bacik authored
      I noticed we were getting lots of warnings with xfstest 83 because we have
      reservations outstanding.  This is because we moved the orphan add outside
      of the truncate, but we don't actually cleanup our reservation if something
      fails.  This fixes the problem and I no longer see warnings.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: steal from global reserve if we are cleaning up orphans · 5d80366e
      Josef Bacik authored
      Sometimes xfstest 83 will fail to remount the scratch device because we've
      gotten ourselves so full that we cannot cleanup the orphan items.  In this
      case check to see if we're doing the orphan cleanup and if we are allow us
      to steal our reservation from the global block rsv.  With this patch I've
      not been able to reproduce the failed mount problem.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: fix memory leak of pending_snapshot->inherit · 8696c533
      Miao Xie authored
      The argument "inherit" of btrfs_ioctl_snap_create_transid() was assigned
      to NULL during we created the snapshots, so we didn't free it though we
      called kfree() in the caller.
      
      But since we are sure the snapshot creation is done after the function -
      btrfs_ioctl_snap_create_transid() - completes, it is safe that we don't
      assign the pointer "inherit" to NULL, and just free it in the caller of
      btrfs_ioctl_snap_create_transid(). In this way, the code can become more
      readable.
      Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
      Cc: Arne Jansen <sensille@gmx.net>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: fix the race between bio and btrfs_stop_workers · 2b8195bb
      Miao Xie authored
      open_ctree() needs to read the metadata to initialize the global
      information of btrfs. But it may fail after it has submitted some bios, and
      then it jumps to the error path. Unfortunately, it doesn't check whether
      there are still bios in flight, and just stops all the worker threads. As a
      result, when the submitted bios end, they can not find any worker thread
      which can deal with the subsequent work, and an oops happens.
      
      kernel BUG at fs/btrfs/async-thread.c:605!
      
      Fix this problem by invoking invalidate_inode_pages2() before we stop the
      worker threads. This function will wait until the bios end, because it needs
      to lock the pages which are going to be invalidated, and if a page is under
      disk read IO, it is locked; invalidate_inode_pages2() has to wait for the
      end-bio handler to unlock it.
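      A minimal sketch of the fix in the open_ctree() error path (the label is
      illustrative):
      
        fail_sb_buffer:
                /* wait for in-flight metadata reads before the workers that
                 * complete them are stopped */
                invalidate_inode_pages2(fs_info->btree_inode->i_mapping);
                btrfs_stop_workers(&fs_info->workers);
                /* ... stop the remaining worker pools ... */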
      Reported-and-Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • btrfs: add "no file data" flag to btrfs send ioctl · cb95e7bf
      Mark Fasheh authored
      This patch adds the flag, BTRFS_SEND_FLAG_NO_FILE_DATA to the btrfs send
      ioctl code. When this flag is set, the btrfs send code will never write file
      data into the stream (thus also avoiding expensive reads of that data in the
      first place). BTRFS_SEND_C_UPDATE_EXTENT commands will be sent (instead of
      BTRFS_SEND_C_WRITE) with an offset, length pair indicating the extent in
      question.
      
      This patch does not affect the operation of BTRFS_SEND_C_CLONE commands -
      they will continue to be sent when a search finds an appropriate extent to
      clone from.
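      A hedged userspace sketch of asking for a metadata-only stream with this
      flag (assuming it is exposed via linux/btrfs.h; paths and fds are made up,
      error handling omitted):
      
        #include <fcntl.h>
        #include <sys/ioctl.h>
        #include <linux/btrfs.h>
      
        struct btrfs_ioctl_send_args args = { 0 };
        int subvol_fd = open("/mnt/snap", O_RDONLY);
      
        args.send_fd = 1;                            /* stream goes to stdout */
        args.flags = BTRFS_SEND_FLAG_NO_FILE_DATA;   /* no file data writes */
        ioctl(subvol_fd, BTRFS_IOC_SEND, &args);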
      Signed-off-by: Mark Fasheh <mfasheh@suse.de>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: extend the checksum item as much as possible · 2f697dc6
      Liu Bo authored
      For writes, we also reserve some space for COWing blocks while updating
      the checksum tree, and we calculate the number of blocks by checking
      whether the number of outstanding bytes that are going to need csums
      requires one more block for csums.
      
      When we add these checksums into the checksum tree, we use the ordered
      sums list.
      Every ordered sum contains csums for each sector, and we'll first try
      to look up an existing csum item,
      a) if we don't yet have a proper csum item, then we need to insert one,
      b) or if we find one but the csum item is not big enough, then we need
      to extend it.
      
      The point is we'll unlock the whole path and then insert or extend.
      So others can hack in and update the tree.
      
      Each insert or extend needs to update the tree with COW on, and we may
      need to insert/extend many times.
      
      That means what we've reserved for updating the checksum tree is NOT
      really enough.
      
      The case is even more serious with several write threads running at the
      same time: it can end up eating our reserved space quickly and start
      eating the global reserve pool instead.
      
      I haven't yet come up with a way to calculate the worst case for updating
      csums, but extending the checksum item as much as possible can be helpful
      in my tests.
      
      The idea behind is that it can reduce the times we insert/extend so that
      it saves us precious reserved space.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • btrfs: remove cache only arguments from defrag path · de78b51a
      Eric Sandeen authored
      The entry point at the defrag ioctl always sets "cache only" to 0;
      the codepaths haven't run for a long time as far as I can
      tell.  Chris says they're dead code, so remove them.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: if we aren't committing just end the transaction if we error out · e4a2bcac
      Josef Bacik authored
      I hit a deadlock where transaction commit was waiting on num_writers to be
      0.  This happened because somebody came into btrfs_commit_transaction and
      noticed we had aborted and it went to cleanup_transaction.  This shouldn't
      happen because cleanup_transaction is really to fixup a bad commit, it
      doesn't do the normal trans handle cleanup things.  So if we have an error
      just do the normal btrfs_end_transaction dance and return.  Once we are in
      the actual commit path we can use cleanup_transaction and be good to go.
      Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: handle errors in compression submission path · 3e04e7f1
      Josef Bacik authored
      I noticed we would deadlock if we aborted a transaction while doing
      compressed io.  This is because we don't unlock our pages if something goes
      horribly wrong.  To fix this we need to make sure that we call
      extent_clear_unlock_delalloc in order to unlock all the pages.  If we have
      to cow in the async submission thread we need to make sure to unlock our
      locked_page as the cow error path will not unlock the locked page as it
      depends on the caller to unlock that page.  With this patch we no longer
      deadlock on the page lock when we have an aborted transaction.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>