1. 11 Nov, 2009 10 commits
    • Btrfs: fix panic when trying to destroy a newly allocated · a6dbd429
      Josef Bacik authored
      There is a problem where iget5_locked will look for an inode, not find it, and
      then subsequently try to allocate it.  Another CPU will have raced in and
      allocated the inode instead, so when iget5_locked gets the inode spin lock again
      and does a search, it finds the new inode.  So it goes ahead and calls
      destroy_inode on the inode it just allocated.  The problem is we don't set
      BTRFS_I(inode)->root until the new inode is completely initialized.  This patch
      makes us set root to NULL when alloc'ing a new inode, so when we get to
      btrfs_destroy_inode and we see that root is NULL we can just free up the memory
      and continue on.  This fixes the panic
      
      http://www.kerneloops.org/submitresult.php?number=812690
      
      Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
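      A minimal user-space sketch of the idea in this message, not the kernel
      code: allocation leaves the root pointer NULL, and teardown treats NULL as
      "never fully initialized" and only frees the memory.  Every demo_* name
      below is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the root and the in-memory inode. */
struct demo_root { int live_inodes; };
struct demo_inode { struct demo_root *root; };

/* Allocation marks the inode as "not initialized yet" by leaving root NULL. */
struct demo_inode *demo_alloc_inode(void)
{
    struct demo_inode *inode = malloc(sizeof(*inode));

    if (!inode)
        return NULL;
    inode->root = NULL;
    return inode;
}

/* Full initialization is the only place that sets root. */
void demo_init_inode(struct demo_inode *inode, struct demo_root *root)
{
    assert(inode && root);
    inode->root = root;
    root->live_inodes++;
}

/* Returns 1 if the full teardown ran, 0 if the inode was only freed. */
int demo_destroy_inode(struct demo_inode *inode)
{
    int full = 0;

    assert(inode);
    if (inode->root) {
        /* Only touch root-dependent state when root was actually set. */
        inode->root->live_inodes--;
        full = 1;
    }
    free(inode);
    return full;
}
```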
    • Btrfs: allow more metadata chunk preallocation · 33b25808
      Chris Mason authored
      On an FS where all of the space has not been allocated into chunks yet,
      the enospc code can return ENOSPC just because the existing metadata chunks
      are full.
      
      We get around this by allowing more metadata chunks to be allocated up
      to a certain limit, and finding the right limit is a little fuzzy.  The
      problem is the reservations for delalloc would preallocate way too much
      of the FS as metadata.  We need to start saying no and just force some
      IO to happen.
      
      But we also need to let a reasonable amount of the FS become metadata.
      This bumps the hard limit up; later releases will have a better system.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: fallback on uncompressed io if compressed io fails · f5a84ee3
      Josef Bacik authored
      Currently compressed IO does not deal with not having its entire extent able to
      be allocated.  So if we have enough free space to allocate for the extent, but
      it's not contiguous, it will fail spectacularly.  This patch fixes this by
      falling back on uncompressed IO, which lets us spread the delalloc extent across
      multiple extents.  I tested this by making us randomly think the reservation had
      failed, to force the fallback to the uncompressed IO path, and it seemed to work
      fine.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
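      The fallback described above can be sketched in user space roughly as
      follows.  This is an illustration under simplifying assumptions, not the
      btrfs code: free space is a plain array of extent sizes, the compressed
      path needs one extent that holds the whole request, and both paths ask for
      the same number of bytes.  All demo_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum demo_io_mode { DEMO_IO_COMPRESSED, DEMO_IO_UNCOMPRESSED, DEMO_IO_NOSPC };

/*
 * Decide how a range of `want` bytes could be written given the sizes of the
 * free extents.  *pieces is set to the number of extents that would be used.
 */
enum demo_io_mode demo_pick_io(const unsigned long *free_ext, size_t n,
                               unsigned long want, size_t *pieces)
{
    unsigned long left = want;
    size_t i, used = 0;

    assert(pieces != NULL && want > 0);

    /* Compressed path: needs a single free extent that is big enough. */
    for (i = 0; i < n; i++) {
        if (free_ext[i] >= want) {
            *pieces = 1;
            return DEMO_IO_COMPRESSED;
        }
    }

    /* Fallback: spread the range across several smaller free extents. */
    for (i = 0; i < n && left > 0; i++) {
        unsigned long take;

        if (free_ext[i] == 0)
            continue;
        take = free_ext[i] < left ? free_ext[i] : left;
        left -= take;
        used++;
    }

    if (left > 0) {
        *pieces = 0;
        return DEMO_IO_NOSPC;   /* not enough space even when fragmented */
    }
    *pieces = used;
    return DEMO_IO_UNCOMPRESSED;
}
```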
    • Btrfs: find ideal block group for caching · ccf0e725
      Josef Bacik authored
      This patch changes a few things.  Hopefully the comments are helpful, but
      I'll try to be verbose here as well.
      
      Problem:
      
      My fedora box was taking 1 minute and 21 seconds to boot with btrfs as root.
      Part of this problem was we pick the first block group we can find and start
      caching it, even if it may not have enough free space.  The other problem is
      we only search for cached block groups the first time around, and we won't
      find any because this is a newly mounted fs.  So we end up caching several
      block groups during bootup, which with a lot of fragmentation takes around
      30-45 seconds to complete and bogs down the system.
      
      Solution:
      
      1) Don't cache block groups willy-nilly at first.  Instead try and figure out
      which block group has the most free, and therefore will take the least amount
      of time to cache.
      
      2) Don't be so picky about cached block groups.  The other problem is once
      we've filled up a cluster, if the block group isn't finished caching the next
      time we try and do the allocation we'll completely ignore the cluster and
      start searching from the beginning of the space, which makes us cache more
      block groups, which slows us down even more.  So instead of skipping block
      groups that are not finished caching when we have a hint, only skip the block
      group if it hasn't started caching yet.
      
      There is one other tweak in here.  Before if we allocated a chunk and still
      couldn't find new space, we'd end up switching the space info to force another
      chunk allocation.  This could make us end up with way too many chunks, so keep
      track of this particular case.
      
      With this patch and my previous cluster fixes my fedora box now boots in 43
      seconds, and according to the bootchart is not held up by our block group
      caching at all.
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
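      Points 1 and 2 of the solution can be sketched like this.  It is a
      simplified illustration with hypothetical demo_* names, and it assumes the
      "ideal" group is picked only among groups whose caching has not started.

```c
#include <assert.h>
#include <stddef.h>

enum demo_cache_state { DEMO_CACHE_NO, DEMO_CACHE_STARTED, DEMO_CACHE_FINISHED };

struct demo_block_group {
    unsigned long free_bytes;
    enum demo_cache_state cached;
};

/*
 * Point 1: among the groups that are not cached yet, pick the one with the
 * most free space.  Returns its index, or -1 if there is no such group.
 */
int demo_ideal_group(const struct demo_block_group *bg, int n)
{
    int i, best = -1;

    assert(bg != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (bg[i].cached != DEMO_CACHE_NO)
            continue;
        if (best < 0 || bg[i].free_bytes > bg[best].free_bytes)
            best = i;
    }
    return best;
}

/*
 * Point 2: when we arrive with a hint, a group is only skipped if caching
 * has not even started; a partially cached group is still worth trying.
 */
int demo_hint_usable(const struct demo_block_group *bg)
{
    assert(bg != NULL);
    return bg->cached != DEMO_CACHE_NO;
}
```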
    • Btrfs: avoid null deref in unpin_extent_cache() · 4eb3991c
      Dan Carpenter authored
      I reordered the checks to avoid dereferencing "em" if it was null.
      
      Found by smatch static checker.
      Signed-off-by: Dan Carpenter <error27@gmail.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
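      The kind of reordering described here can be illustrated with a small
      stand-alone check.  The struct and function names are hypothetical and
      this is not the actual unpin_extent_cache() code; the point is only that
      the NULL test has to come first in the short-circuit expression.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an extent map entry. */
struct demo_extent_map { uint64_t start; };

/*
 * Returns nonzero when the lookup result is unexpected.  Because || stops at
 * the first true operand, em->start is only read when em is non-NULL.
 * Writing the operands the other way round would dereference NULL.
 */
int demo_em_unexpected(const struct demo_extent_map *em, uint64_t start)
{
    return !em || em->start != start;
}
```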
    • Btrfs: skip btrfs_release_path in btrfs_update_root and btrfs_del_root · df66916e
      Li Dongyang authored
      We don't need to call btrfs_release_path because btrfs_free_path will do
      that for us.
      Signed-off-by: Li Dongyang <Jerry87905@gmail.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: fix some metadata enospc issues · 5df6a9f6
      Josef Bacik authored
      We weren't reserving metadata space for rename, rmdir and unlink, which could
      cause problems.
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: fix how we set max_size for free space clusters · 01dea1ef
      Josef Bacik authored
      This patch fixes a problem where max_size can be set to 0 even though we
      filled the cluster properly.  We set max_size to 0 if we restart the cluster
      window, but if the new start entry is big enough to be our new cluster then we
      could return with a max_size set to 0, which will mean the next time we try to
      allocate from this cluster it will fail.  So set max_extent to the entry's
      size.  Tested this on my box and now we actually allocate from the cluster
      after we fill it.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
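      A rough sketch of the window restart described above, assuming the free
      space entries are sorted by offset and do not overlap.  The function and
      its parameters are hypothetical; it only shows why the largest-extent
      tracker should restart at the new entry's size rather than at 0.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical free space entry: a run of `bytes` free bytes at `offset`. */
struct demo_entry { unsigned long offset, bytes; };

/*
 * Walk the entries and build a window holding at least min_bytes of free
 * space with no gap larger than max_gap.  Returns the largest single entry
 * in the final window, or 0 if no window was found.
 */
unsigned long demo_cluster_max(const struct demo_entry *e, size_t n,
                               unsigned long min_bytes, unsigned long max_gap)
{
    unsigned long window_free, max_extent, last_end;
    size_t i;

    assert(e != NULL || n == 0);
    if (n == 0)
        return 0;

    window_free = e[0].bytes;
    max_extent = e[0].bytes;
    last_end = e[0].offset + e[0].bytes;

    for (i = 1; i < n && window_free < min_bytes; i++) {
        if (e[i].offset - last_end > max_gap) {
            /* Gap too big: restart the window at this entry. */
            window_free = e[i].bytes;
            /* The fix: start from this entry's size, not from 0. */
            max_extent = e[i].bytes;
        } else {
            window_free += e[i].bytes;
            if (e[i].bytes > max_extent)
                max_extent = e[i].bytes;
        }
        last_end = e[i].offset + e[i].bytes;
    }
    return window_free >= min_bytes ? max_extent : 0;
}
```

      With a restart that reset the tracker to 0, the first case in which the
      restarted entry alone satisfies min_bytes would report a maximum of 0.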
    • Btrfs: cleanup transaction starting and fix journal_info usage · 249ac1e5
      Josef Bacik authored
      We use journal_info to tell if we're in a nested transaction to make sure we
      don't commit the transaction within a nested transaction.  We use another
      method to see if there are any outstanding ioctl trans handles, so if we're
      starting one do not set current->journal_info, since it will screw with other
      filesystems.  This patch also cleans up the starting stuff so there aren't any
      magic numbers.
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: fix data allocation hint start · 6346c939
      Josef Bacik authored
      Sometimes our start allocation hint when we cow a file can be either
      EXTENT_HOLE or some other such placeholder, which is not optimal.  So if we
      find that our em->block_start is one of these special values, check to see
      where the first block of the inode is stored, and use that as a hint.  If that
      block is also a special value, just fall back on a hint of 0 and let the
      allocator figure out a good place to put the data.
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
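      The hint selection can be sketched as a three-step fallback.  The sentinel
      constant below is an assumption made for illustration (placeholder values
      are modelled as the top few values of a 64-bit block address), and the
      demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: block_start values at or above this mark holes and similar. */
#define DEMO_SPECIAL_FIRST ((uint64_t)-4)

static int demo_is_special(uint64_t block_start)
{
    return block_start >= DEMO_SPECIAL_FIRST;
}

/*
 * Pick an allocation hint: prefer the extent map's block_start, then the
 * location of the inode's first block, then 0 to let the allocator decide.
 */
uint64_t demo_alloc_hint(uint64_t em_block_start, uint64_t first_block_start)
{
    if (!demo_is_special(em_block_start))
        return em_block_start;
    if (!demo_is_special(first_block_start))
        return first_block_start;
    return 0;
}
```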
  2. 14 Oct, 2009 5 commits
  3. 13 Oct, 2009 4 commits
    • Btrfs: fix btrfs acl #ifdef checks · 0eda294d
      Chris Mason authored
      The btrfs acl code was #ifdefing for a define
      that didn't exist.  This correctly matches it
      to the values used by the Kconfig file.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: streamline tree-log btree block writeout · 690587d1
      Chris Mason authored
      Syncing the tree log is a 3 phase operation.
      
      1) write and wait for all the tree log blocks for a given root.
      
      2) write and wait for all the tree log blocks for the
      tree of tree log roots.
      
      3) write and wait for the super blocks (barriers here)
      
      This isn't as efficient as it could be because there is
      no requirement to wait for the blocks from step one to hit the disk
      before we start writing the blocks from step two.  This commit
      changes the sequence so that we don't start waiting until
      all the tree blocks from both steps one and two have been sent
      to disk.
      
      We do this by breaking up btrfs_write_wait_marked_extents into
      two functions, which is trivial because it was already broken
      up into two parts.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
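      The change in ordering can be made concrete with a small trace.  This is
      only a model: 'A' and 'B' stand for submitting the writes of steps one and
      two, 'a' and 'b' for waiting on them, and 'S' for the super block write.
      The demo_* helpers are hypothetical and do no real I/O.

```c
#include <assert.h>
#include <string.h>

/* Records the order of operations as a short string. */
struct demo_trace { char ops[16]; size_t len; };

void demo_trace_init(struct demo_trace *t)
{
    memset(t, 0, sizeof(*t));
}

static void demo_record(struct demo_trace *t, char op)
{
    assert(t->len < sizeof(t->ops) - 1);
    t->ops[t->len++] = op;
    t->ops[t->len] = '\0';
}

/* The two halves of what used to be one write-and-wait helper. */
void demo_write_marked(struct demo_trace *t, char tree)
{
    demo_record(t, tree);                        /* 'A' or 'B' */
}

void demo_wait_marked(struct demo_trace *t, char tree)
{
    demo_record(t, (char)(tree + ('a' - 'A')));  /* 'a' or 'b' */
}

/* Old order: wait for step one before starting step two. */
void demo_sync_log_serial(struct demo_trace *t)
{
    demo_write_marked(t, 'A');
    demo_wait_marked(t, 'A');
    demo_write_marked(t, 'B');
    demo_wait_marked(t, 'B');
    demo_record(t, 'S');
}

/* New order: submit both sets of writes, then wait for both. */
void demo_sync_log_streamlined(struct demo_trace *t)
{
    demo_write_marked(t, 'A');
    demo_write_marked(t, 'B');
    demo_wait_marked(t, 'A');
    demo_wait_marked(t, 'B');
    demo_record(t, 'S');
}
```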
    • Btrfs: avoid tree log commit when there are no changes · 257c62e1
      Chris Mason authored
      rpm has a habit of running fdatasync when the file hasn't
      changed.  We already detect if a file hasn't been changed
      in the current transaction but it might have been sent to
      the tree-log in this transaction and not changed since
      the last call to fsync.
      
      In this case, we want to avoid a tree log sync, which includes
      a number of synchronous writes and barriers.  This commit
      extends the existing tracking of the last transaction to change
      a file to also track the last sub-transaction.
      
      The end result is that rpm -ivh and -Uvh are roughly twice as fast,
      and on par with ext3.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
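      One way to picture the extra tracking is a two-level check: first the
      transaction, then the sub-transaction.  The structures and field names
      below are hypothetical simplifications, not the kernel's data layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-root state: what has already reached disk. */
struct demo_root_log {
    uint64_t last_trans_committed;  /* last fully committed transaction */
    int last_log_commit;            /* last sub-transaction synced to the log */
};

/* Hypothetical per-inode state: when the file last changed. */
struct demo_inode_log {
    uint64_t last_trans;            /* last transaction that changed it */
    int last_sub_trans;             /* last sub-transaction that changed it */
};

/* Returns 1 if fsync still has to sync the tree log, 0 if it can skip it. */
int demo_needs_log_sync(const struct demo_inode_log *inode,
                        const struct demo_root_log *root)
{
    assert(inode && root);

    /* Unchanged in the current transaction: nothing to do. */
    if (inode->last_trans <= root->last_trans_committed)
        return 0;

    /* Changed in this transaction, but already logged since then. */
    if (inode->last_sub_trans <= root->last_log_commit)
        return 0;

    return 1;
}
```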
    • Btrfs: only write one super copy during fsync · 4722607d
      Chris Mason authored
      During a tree-log commit for fsync, we've been writing at least
      two copies of the super block and forcing them to disk.
      
      The other filesystems write only one, and this change brings us on
      par with them.  A full transaction commit will write all the super
      copies, so we still have redundant info written on a regular
      basis.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  4. 09 Oct, 2009 5 commits
  5. 08 Oct, 2009 5 commits
    • Btrfs: optimize fsync for the single writer case · ff782e0a
      Josef Bacik authored
      This patch optimizes the tree logging code so it doesn't always wait 1 jiffy
      for new people to join the logging transaction if there is only ever 1 writer.
      This helps a little bit with latency for something like RPM, which will
      fdatasync every file it writes, so waiting 1 jiffy for every fdatasync
      really starts to add up.
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: async delalloc flushing under space pressure · e3ccfa98
      Josef Bacik authored
      This patch moves the delalloc flushing that occurs when we are under space
      pressure off to an async thread pool.  This helps since we only free up
      metadata space when we actually insert the extent item, which means it takes
      quite a while for space to be freed up if we wait on all ordered extents.
      However, if space is freed up due to inline extents being inserted, we can
      wake up people who are waiting early, and they can finish their work.
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: release delalloc reservations on extent item insertion · 32c00aff
      Josef Bacik authored
      This patch fixes an issue with the delalloc metadata space reservation
      code.  We used to free the reservation as soon as we allocated the delalloc
      region.  The problem with this is that if we are not inserting an inline
      extent, we don't actually insert the extent item until after the ordered
      extent is written out.  This patch does 3 things:
      
      1) It moves the reservation clearing stuff into the ordered code, so when
      we remove the ordered extent we remove the reservation.
      2) It adds an EXTENT_DO_ACCOUNTING flag that gets passed when we clear
      delalloc bits in the cases where we want to clear the metadata reservation
      when we clear the delalloc extent, in the case that we do an inline extent
      or we invalidate the page.
      3) It adds another waitqueue to the space info so that when we start a fs
      wide delalloc flush, anybody else who also hits that area will simply wait
      for the flush to finish and then try to make their allocation.
      
      This has been tested thoroughly to make sure we did not regress on
      performance.
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: delay clearing EXTENT_DELALLOC for compressed extents · a3429ab7
      Chris Mason authored
      When compression is on, the cow_file_range code is farmed off to
      worker threads.  This allows us to do significant CPU work in parallel
      on SMP machines.
      
      But it is a delicate balance around when we clear flags and how.  In
      the past we cleared the delalloc flag immediately, which was safe
      because the pages stayed locked.
      
      But this is causing problems with the newest ENOSPC code, and with the
      recent extent state cleanups we can now clear the delalloc bit at the
      same time the uncompressed code does.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: cleanup extent_clear_unlock_delalloc flags · a791e35e
      Chris Mason authored
      extent_clear_unlock_delalloc has a growing set of ugly parameters
      that is very difficult to read and maintain.
      
      This switches to a flag field and well named flag defines.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
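      The general pattern of replacing a run of boolean parameters with one
      flag word looks roughly like this.  The flag names and the page structure
      are made up for illustration and are not the defines this commit
      introduces.

```c
#include <assert.h>

/* Hypothetical flag bits; each replaces one former int parameter. */
#define DEMO_CLEAR_UNLOCK_PAGE (1UL << 0)
#define DEMO_CLEAR_DELALLOC    (1UL << 1)
#define DEMO_CLEAR_DIRTY       (1UL << 2)
#define DEMO_SET_WRITEBACK     (1UL << 3)

/* Hypothetical page state, one int per bit of interest. */
struct demo_page { int locked, delalloc, dirty, writeback; };

/* One readable `op` argument instead of four positional 0/1 arguments. */
void demo_clear_unlock(struct demo_page *p, unsigned long op)
{
    assert(p);
    if (op & DEMO_CLEAR_DELALLOC)
        p->delalloc = 0;
    if (op & DEMO_CLEAR_DIRTY)
        p->dirty = 0;
    if (op & DEMO_SET_WRITEBACK)
        p->writeback = 1;
    if (op & DEMO_CLEAR_UNLOCK_PAGE)
        p->locked = 0;
}
```

      A call such as demo_clear_unlock(&page, DEMO_CLEAR_DELALLOC |
      DEMO_SET_WRITEBACK) states its intent at the call site, where a call
      like f(&page, 0, 1, 0, 1) would not.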
  6. 06 Oct, 2009 1 commit
    • Btrfs: fix possible softlockup in the allocator · 1cdda9b8
      Josef Bacik authored
      Like the cluster allocating stuff, we can lock up the box with the normal
      allocation path.  This happens when we
      
      1) Start to cache a block group that is severely fragmented, but has a decent
      amount of free space.
      2) Start to commit a transaction
      3) Have the commit try and empty out some of the delalloc inodes with extents
      that are relatively large.
      
      The inodes will not be able to make the allocations because they will ask for
      allocations larger than a contiguous area in the free space cache.  So we will
      wait for more progress to be made on the block group, but since we're in a
      commit the caching kthread won't make any more progress and it already has
      enough free space that wait_block_group_cache_progress will just return.  So,
      if we wait and fail to make the allocation the next time around, just loop and
      go to the next block group.  This keeps us from getting stuck in a softlockup.
      Thanks,
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  7. 05 Oct, 2009 1 commit
    • Btrfs: fix deadlock on async thread startup · 61d92c32
      Chris Mason authored
      The btrfs async worker threads are used for a wide variety of things,
      including processing bio end_io functions.  This means that when
      the endio threads aren't running, the rest of the FS isn't
      able to do the final processing required to clear PageWriteback.
      
      The endio threads also try to exit as they become idle and
      start more as the work piles up.  The problem is that starting more
      threads means kthreadd may need to allocate RAM, and that allocation
      may wait until the global number of writeback pages on the system is
      below a certain limit.
      
      The result of that throttling is that end IO threads wait on
      kthreadd, who is waiting on IO to end, which will never happen.
      
      This commit fixes the deadlock by handing off thread startup to a
      dedicated thread.  It also fixes a bug where the on-demand thread
      creation was creating far too many threads because it didn't take into
      account threads being started by other procs.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  8. 01 Oct, 2009 3 commits
  9. 29 Sep, 2009 5 commits
  10. 28 Sep, 2009 1 commit
    • Btrfs: proper -ENOSPC handling · 9ed74f2d
      Josef Bacik authored
      At the start of a transaction we do a btrfs_reserve_metadata_space() and
      specify how many items we plan on modifying.  Then once we've done our
      modifications and such, just call btrfs_unreserve_metadata_space() for
      the same number of items we reserved.
      
      For keeping track of metadata needed for data I've had to add an extent_io op
      for when we merge extents.  This lets us track space properly when we are doing
      sequential writes, so we don't end up reserving way more metadata space than
      what we need.
      
      The only place where the metadata space accounting is not done is in the
      relocation code.  This is because Yan is going to be reworking that code in the
      near future, so running btrfs-vol -b could still possibly result in an ENOSPC
      related panic.  This patch also turns off the metadata_ratio stuff in order to
      allow users to more efficiently use their disk space.
      
      This patch makes it so we track how much metadata we need for an inode's
      delayed allocation extents by tracking how many extents are currently
      waiting for allocation.  It introduces two new callbacks for the
      extent_io trees, merge_extent_hook and split_extent_hook.  These help
      us keep track of when we merge delalloc extents together and split them
      up.  Reservations are handled before any actual dirtying occurs,
      and then we unreserve after we dirty.
      
      btrfs_unreserve_metadata_for_delalloc() will make the appropriate
      unreservations as needed based on the number of reservations we
      currently have and the number of extents we currently have.  Doing the
      reservation outside of doing any of the actual dirtying lets us do
      things like filemap_flush() the inode to try and force delalloc to
      happen, or as a last resort actually start allocation on all delalloc
      inodes in the fs.  This has survived dbench, fs_mark and an fsx torture
      test.
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
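      The extent-count accounting described in this message can be modelled
      with two counters per inode.  This is a simplified sketch with
      hypothetical names; it ignores locking and the real reservation sizes and
      only shows how merges leave excess reservations that can be returned.

```c
#include <assert.h>

/* Hypothetical per-inode accounting for delayed allocation. */
struct demo_delalloc_acct {
    int outstanding_extents;  /* delalloc extents waiting for allocation */
    int reserved_extents;     /* extents we hold metadata reservations for */
};

/* Reserve first, then dirty: each new delalloc extent gets a reservation. */
void demo_reserve_and_dirty(struct demo_delalloc_acct *a)
{
    assert(a);
    a->reserved_extents++;
    a->outstanding_extents++;
}

/* Two adjacent delalloc extents became one (e.g. sequential writes). */
void demo_merge_extent_hook(struct demo_delalloc_acct *a)
{
    assert(a && a->outstanding_extents > 1);
    a->outstanding_extents--;
}

/* One delalloc extent was split in two. */
void demo_split_extent_hook(struct demo_delalloc_acct *a)
{
    assert(a && a->outstanding_extents > 0);
    a->outstanding_extents++;
}

/* Give back reservations beyond what the current extents need. */
int demo_unreserve_excess(struct demo_delalloc_acct *a)
{
    int excess;

    assert(a);
    excess = a->reserved_extents - a->outstanding_extents;
    if (excess <= 0)
        return 0;
    a->reserved_extents -= excess;
    return excess;
}
```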