1. 03 Apr, 2009 9 commits
    • Chris Mason's avatar
      Btrfs: rework allocation clustering · fa9c0d79
      Chris Mason authored
      Because btrfs is copy-on-write, we end up picking new locations for
      blocks very often.  This makes it fairly difficult to maintain perfect
      read patterns over time, but we can at least do some optimizations
      for writes.
      
      This is done today by remembering the last place we allocated and
      trying to find a free space hole big enough to hold more than just one
      allocation.  The end result is that we tend to write sequentially to
      the drive.
      
      This happens all the time for metadata and it happens for data
      when mounted -o ssd.  But, the way we record it is fairly racey
      and it tends to fragment the free space over time because we are trying
      to allocate fairly large areas at once.
      
      This commit gets rid of the races by adding a free space cluster object
      with dedicated locking to make sure that only one process at a time
      is out replacing the cluster.
      
      The free space fragmentation is somewhat solved by allowing a cluster
      to be comprised of smaller free space extents.  This part definitely
      adds some CPU time to the cluster allocations, but it allows the allocator
      to consume the small holes left behind by cow.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      fa9c0d79
    • Chris Mason's avatar
      Btrfs: Optimize locking in btrfs_next_leaf() · 8e73f275
      Chris Mason authored
      btrfs_next_leaf was using blocking locks when it could have been using
      faster spinning ones instead.  This adds a few extra checks around
      the pieces that block and switches over to spinning locks.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      
      8e73f275
    • Chris Mason's avatar
      Btrfs: break up btrfs_search_slot into smaller pieces · c8c42864
      Chris Mason authored
      btrfs_search_slot was doing too many things at once.  This breaks
      it up into more reasonable units.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      c8c42864
    • Josef Bacik's avatar
      Btrfs: kill the pinned_mutex · 04018de5
      Josef Bacik authored
      This patch removes the pinned_mutex.  The extent io map has an internal tree
      lock that protects the tree itself, and since we only copy the extent io map
      when we are committing the transaction we don't need it there.  We also don't
      need it when caching the block group since searching through the tree is also
      protected by the internal map spin lock.
      Signed-off-by: default avatarJosef Bacik <jbacik@redhat.com>
      04018de5
    • Josef Bacik's avatar
      Btrfs: kill the block group alloc mutex · 6226cb0a
      Josef Bacik authored
      This patch removes the block group alloc mutex used to protect the free space
      tree for allocations and replaces it with a spin lock which is used only to
      protect the free space rb tree.  This means we only take the lock when we are
      directly manipulating the tree, which makes us a touch faster with
      multi-threaded workloads.
      
      This patch also gets rid of btrfs_find_free_space and replaces it with
      btrfs_find_space_for_alloc, which takes the number of bytes you want to
      allocate, and empty_size, which is used to indicate how much free space should
      be at the end of the allocation.
      
      It will return an offset for the allocator to use.  If we don't end up using it
      we _must_ call btrfs_add_free_space to put it back.  This is the tradeoff to
      kill the alloc_mutex, since we need to make sure nobody else comes along and
      takes our space.
      Signed-off-by: default avatarJosef Bacik <jbacik@redhat.com>
      6226cb0a
    • Josef Bacik's avatar
      Btrfs: clean up find_free_extent · 2552d17e
      Josef Bacik authored
      I've replaced the strange looping constructs with a list_for_each_entry on
      space_info->block_groups.  If we have a hint we just jump into the loop with
      the block group and start looking for space.  If we don't find anything we
      start at the beginning and start looking.  We never come out of the loop with a
      ref on the block_group _unless_ we found space to use, then we drop it after we
      set the trans block_group.
      Signed-off-by: default avatarJosef Bacik <jbacik@redhat.com>
      2552d17e
    • Josef Bacik's avatar
      Btrfs: free space cache cleanups · 70cb0743
      Josef Bacik authored
      This patch cleans up the free space cache code a bit.  It better documents the
      idiosyncrasies of tree_search_offset and makes the code make a bit more sense.
      I took out the info allocation at the start of __btrfs_add_free_space and put it
      where it makes more sense.  This was left over cruft from when alloc_mutex
      existed.  Also all of the re-searches we do to make sure we inserted properly.
      Signed-off-by: default avatarJosef Bacik <jbacik@redhat.com>
      70cb0743
    • Chris Mason's avatar
      Btrfs: unplug in the async bio submission threads · bedf762b
      Chris Mason authored
      Btrfs pages being written get set to writeback, and then may go through
      a number of steps before they hit the block layer.  This includes compression,
      checksumming and async bio submission.
      
      The end result is that someone who writes a page and then does
      wait_on_page_writeback is likely to unplug the queue before the bio they
      cared about got there.
      
      We could fix this by marking bios sync, or by doing more frequent unplugs,
      but this commit just changes the async bio submission code to unplug
      after it has processed all the bios for a device.  The async bio submission
      does a fair job of collection bios, so this shouldn't be a huge problem
      for reducing merging at the elevator.
      
      For streaming O_DIRECT writes on a 5 drive array, it boosts performance
      from 386MB/s to 460MB/s.
      
      Thanks to Hisashi Hifumi for helping with this work.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      bedf762b
    • Chris Mason's avatar
      Btrfs: keep processing bios for a given bdev if our proc is batching · b765ead5
      Chris Mason authored
      Btrfs uses async helper threads to submit write bios so the checksumming
      helper threads don't block on the disk.
      
      The submit bio threads may process bios for more than one block device,
      so when they find one device congested they try to move on to other
      devices instead of blocking in get_request_wait for one device.
      
      This does a pretty good job of keeping multiple devices busy, but the
      congested flag has a number of problems.  A congested device may still
      give you a request, and other procs that aren't backing off the congested
      device may starve you out.
      
      This commit uses the io_context stored in current to decide if our process
      has been made a batching process by the block layer.  If so, it keeps
      sending IO down for at least one batch.  This helps make sure we do
      a good amount of work each time we visit a bdev, and avoids large IO
      stalls in multi-device workloads.
      
      It's also very ugly.  A better solution is in the works with Jens Axboe.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      b765ead5
  2. 31 Mar, 2009 2 commits
    • Chris Mason's avatar
      Btrfs: try to free metadata pages when we free btree blocks · d57e62b8
      Chris Mason authored
      COW means we cycle though blocks fairly quickly, and once we
      free an extent on disk, it doesn't make much sense to keep the pages around.
      
      This commit tries to immediately free the page when we free the extent,
      which lowers our memory footprint significantly.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      d57e62b8
    • Chris Mason's avatar
      Btrfs: add extra flushing for renames and truncates · 5a3f23d5
      Chris Mason authored
      Renames and truncates are both common ways to replace old data with new
      data.  The filesystem can make an effort to make sure the new data is
      on disk before actually replacing the old data.
      
      This is especially important for rename, which many application use as
      though it were atomic for both the data and the metadata involved.  The
      current btrfs code will happily replace a file that is fully on disk
      with one that was just created and still has pending IO.
      
      If we crash after transaction commit but before the IO is done, we'll end
      up replacing a good file with a zero length file.  The solution used
      here is to create a list of inodes that need special ordering and force
      them to disk before the commit is done.  This is similar to the
      ext3 style data=ordering, except it is only done on selected files.
      
      Btrfs is able to get away with this because it does not wait on commits
      very often, even for fsync (which use a sub-commit).
      
      For renames, we order the file when it wasn't already
      on disk and when it is replacing an existing file.  Larger files
      are sent to filemap_flush right away (before the transaction handle is
      opened).
      
      For truncates, we order if the file goes from non-zero size down to
      zero size.  This is a little different, because at the time of the
      truncate the file has no dirty bytes to order.  But, we flag the inode
      so that it is added to the ordered list on close (via release method).  We
      also immediately add it to the ordered list of the current transaction
      so that we can try to flush down any writes the application sneaks in
      before commit.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      5a3f23d5
  3. 25 Mar, 2009 1 commit
    • Chris Mason's avatar
      Btrfs: make sure btrfs_update_delayed_ref doesn't increase ref_mod · 1a81af4d
      Chris Mason authored
      btrfs_update_delayed_ref is optimized to add and remove different
      references in one pass through the delayed ref tree.  It is a zero
      sum on the total number of refs on a given extent.
      
      But, the code was recording an extra ref in the head node.  This
      never made it down to the disk but was used when deciding if it was
      safe to free the extent while dropping snapshots.
      
      The fix used here is to make sure the ref_mod count is unchanged
      on the head ref when btrfs_update_delayed_ref is called.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      1a81af4d
  4. 24 Mar, 2009 15 commits
    • Chris Mason's avatar
      Btrfs: optimize fsyncs on old files · af4176b4
      Chris Mason authored
      The fsync log has code to make sure all of the parents of a file are in the
      log along with the file.  It uses a minimal log of the parent directory
      inodes, just enough to get the parent directory on disk.
      
      If the transaction that originally created a file is fully on disk,
      and the file hasn't been renamed or linked into other directories, we
      can safely skip the parent directory walk.  We know the file is on disk
      somewhere and we can go ahead and just log that single file.
      
      This is more important now because unrelated unlinks in the parent directory
      might make us force a commit if we try to log the parent.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      af4176b4
    • Chris Mason's avatar
      Btrfs: tree logging unlink/rename fixes · 12fcfd22
      Chris Mason authored
      The tree logging code allows individual files or directories to be logged
      without including operations on other files and directories in the FS.
      It tries to commit the minimal set of changes to disk in order to
      fsync the single file or directory that was sent to fsync or O_SYNC.
      
      The tree logging code was allowing files and directories to be unlinked
      if they were part of a rename operation where only one directory
      in the rename was in the fsync log.  This patch adds a few new rules
      to the tree logging.
      
      1) on rename or unlink, if the inode being unlinked isn't in the fsync
      log, we must force a full commit before doing an fsync of the directory
      where the unlink was done.  The commit isn't done during the unlink,
      but it is forced the next time we try to log the parent directory.
      
      Solution: record transid of last unlink/rename per directory when the
      directory wasn't already logged.  For renames this is only done when
      renaming to a different directory.
      
      mkdir foo/some_dir
      normal commit
      rename foo/some_dir foo2/some_dir
      mkdir foo/some_dir
      fsync foo/some_dir/some_file
      
      The fsync above will unlink the original some_dir without recording
      it in its new location (foo2).  After a crash, some_dir will be gone
      unless the fsync of some_file forces a full commit
      
      2) we must log any new names for any file or dir that is in the fsync
      log.  This way we make sure not to lose files that are unlinked during
      the same transaction.
      
      2a) we must log any new names for any file or dir during rename
      when the directory they are being removed from was logged.
      
      2a is actually the more important variant.  Without the extra logging
      a crash might unlink the old name without recreating the new one
      
      3) after a crash, we must go through any directories with a link count
      of zero and redo the rm -rf
      
      mkdir f1/foo
      normal commit
      rm -rf f1/foo
      fsync(f1)
      
      The directory f1 was fully removed from the FS, but fsync was never
      called on f1, only its parent dir.  After a crash the rm -rf must
      be replayed.  This must be able to recurse down the entire
      directory tree.  The inode link count fixup code takes care of the
      ugly details.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      12fcfd22
    • Chris Mason's avatar
      Btrfs: Make sure i_nlink doesn't hit zero too soon during log replay · a74ac322
      Chris Mason authored
      During log replay, inodes are copied from the log to the main filesystem
      btrees.  Sometimes they have a zero link count in the log but they actually
      gain links during the replay or have some in the main btree.
      
      This patch updates the link count to be at least one after copying the
      inode out of the log.  This makes sure the inode is deleted during an
      iput while the rest of the replay code is still working on it.
      
      The log replay has fixup code to make sure that link counts are correct
      at the end of the replay, so we could use any non-zero number here and
      it would work fine.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      a74ac322
    • Chris Mason's avatar
      Btrfs: limit balancing work while flushing delayed refs · a4b6e07d
      Chris Mason authored
      The delayed reference mechanism is responsible for all updates to the
      extent allocation trees, including those updates created while processing
      the delayed references.
      
      This commit tries to limit the amount of work that gets created during
      the final run of delayed refs before a commit.  It avoids cowing new blocks
      unless it is required to finish the commit, and so it avoids new allocations
      that were not really required.
      
      The goal is to avoid infinite loops where we are always making more work
      on the final run of delayed refs.  Over the long term we'll make a
      special log for the last delayed ref updates as well.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      a4b6e07d
    • Chris Mason's avatar
      Btrfs: readahead checksums during btrfs_finish_ordered_io · 5d13a98f
      Chris Mason authored
      This reads in blocks in the checksum btree before starting the
      transaction in btrfs_finish_ordered_io.  It makes it much more likely
      we'll be able to do operations inside the transaction without
      needing any btree reads, which limits transaction latencies overall.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      5d13a98f
    • Chris Mason's avatar
      Btrfs: leave btree locks spinning more often · b9473439
      Chris Mason authored
      btrfs_mark_buffer dirty would set dirty bits in the extent_io tree
      for the buffers it was dirtying.  This may require a kmalloc and it
      was not atomic.  So, anyone who called btrfs_mark_buffer_dirty had to
      set any btree locks they were holding to blocking first.
      
      This commit changes dirty tracking for extent buffers to just use a flag
      in the extent buffer.  Now that we have one and only one extent buffer
      per page, this can be safely done without losing dirty bits along the way.
      
      This also introduces a path->leave_spinning flag that callers of
      btrfs_search_slot can use to indicate they will properly deal with a
      path returned where all the locks are spinning instead of blocking.
      
      Many of the btree search callers now expect spinning paths,
      resulting in better btree concurrency overall.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      b9473439
    • Chris Mason's avatar
      Btrfs: Only let very young transactions grow during commit · 89573b9c
      Chris Mason authored
      Commits are fairly expensive, and so btrfs has code to sit around for a while
      during the commit and let new writers come in.
      
      But, while we're sitting there, new delayed refs might be added, and those
      can be expensive to process as well.  Unless the transaction is very very
      young, it makes sense to go ahead and let the commit finish without hanging
      around.
      
      The commit grow loop isn't as important as it used to be, the fsync logging
      code handles most performance critical syncs now.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      89573b9c
    • Chris Mason's avatar
      Btrfs: Check for a blocking lock before taking the spin · 66d7e85e
      Chris Mason authored
      This reduces contention on the extent buffer spin locks by testing for a
      blocking lock before trying to take the spinlock.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      66d7e85e
    • Chris Mason's avatar
      Btrfs: reduce stack in cow_file_range · 7f366cfe
      Chris Mason authored
      The fs/btrfs/inode.c code to run delayed allocation during writout
      needed some stack usage optimization.  This is the first pass, it does
      the check for compression earlier on, which allows us to do the common
      (no compression) case higher up in the call chain.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      7f366cfe
    • Chris Mason's avatar
      Btrfs: reduce stalls during transaction commit · b7ec40d7
      Chris Mason authored
      To avoid deadlocks and reduce latencies during some critical operations, some
      transaction writers are allowed to jump into the running transaction and make
      it run a little longer, while others sit around and wait for the commit to
      finish.
      
      This is a bit unfair, especially when the callers that jump in do a bunch
      of IO that makes all the others procs on the box wait.  This commit
      reduces the stalls this produces by pre-reading file extent pointers
      during btrfs_finish_ordered_io before the transaction is joined.
      
      It also tunes the drop_snapshot code to politely wait for transactions
      that have started writing out their delayed refs to finish.  This avoids
      new delayed refs being flooded into the queue while we're trying to
      close off the transaction.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      b7ec40d7
    • Chris Mason's avatar
      Btrfs: process the delayed reference queue in clusters · c3e69d58
      Chris Mason authored
      The delayed reference queue maintains pending operations that need to
      be done to the extent allocation tree.  These are processed by
      finding records in the tree that are not currently being processed one at
      a time.
      
      This is slow because it uses lots of time searching through the rbtree
      and because it creates lock contention on the extent allocation tree
      when lots of different procs are running delayed refs at the same time.
      
      This commit changes things to grab a cluster of refs for processing,
      using a cursor into the rbtree as the starting point of the next search.
      This way we walk smoothly through the rbtree.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      c3e69d58
    • Chris Mason's avatar
      Btrfs: try to cleanup delayed refs while freeing extents · 1887be66
      Chris Mason authored
      When extents are freed, it is likely that we've removed the last
      delayed reference update for the extent.  This checks the delayed
      ref tree when things are freed, and if no ref updates area left it
      immediately processes the delayed ref.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      1887be66
    • Chris Mason's avatar
      Btrfs: reduce stack usage in some crucial tree balancing functions · 44871b1b
      Chris Mason authored
      Many of the tree balancing functions follow the same pattern.
      
      1) cow a block
      2) do something to the result
      
      This commit breaks them up into two functions so the variables and
      code required for part two don't suck down stack during part one.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      44871b1b
    • Chris Mason's avatar
      Btrfs: do extent allocation and reference count updates in the background · 56bec294
      Chris Mason authored
      The extent allocation tree maintains a reference count and full
      back reference information for every extent allocated in the
      filesystem.  For subvolume and snapshot trees, every time
      a block goes through COW, the new copy of the block adds a reference
      on every block it points to.
      
      If a btree node points to 150 leaves, then the COW code needs to go
      and add backrefs on 150 different extents, which might be spread all
      over the extent allocation tree.
      
      These updates currently happen during btrfs_cow_block, and most COWs
      happen during btrfs_search_slot.  btrfs_search_slot has locks held
      on both the parent and the node we are COWing, and so we really want
      to avoid IO during the COW if we can.
      
      This commit adds an rbtree of pending reference count updates and extent
      allocations.  The tree is ordered by byte number of the extent and byte number
      of the parent for the back reference.  The tree allows us to:
      
      1) Modify back references in something close to disk order, reducing seeks
      2) Significantly reduce the number of modifications made as block pointers
      are balanced around
      3) Do all of the extent insertion and back reference modifications outside
      of the performance critical btrfs_search_slot code.
      
      #3 has the added benefit of greatly reducing the btrfs stack footprint.
      The extent allocation tree modifications are done without the deep
      (and somewhat recursive) call chains used in the past.
      
      These delayed back reference updates must be done before the transaction
      commits, and so the rbtree is tied to the transaction.  Throttling is
      implemented to help keep the queue of backrefs at a reasonable size.
      
      Since there was a similar mechanism in place for the extent tree
      extents, that is removed and replaced by the delayed reference tree.
      
      Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      56bec294
    • Chris Mason's avatar
      Btrfs: don't preallocate metadata blocks during btrfs_search_slot · 9fa8cfe7
      Chris Mason authored
      In order to avoid doing expensive extent management with tree locks held,
      btrfs_search_slot will preallocate tree blocks for use by COW without
      any tree locks held.
      
      A later commit moves all of the extent allocation work for COW into
      a delayed update mechanism, and this preallocation will no longer be
      required.
      Signed-off-by: default avatarChris Mason <chris.mason@oracle.com>
      9fa8cfe7
  5. 23 Mar, 2009 11 commits
  6. 22 Mar, 2009 2 commits