1. 12 Jan, 2011 4 commits
• xfs: fix an assignment within an ASSERT() · 1884bd83
      Jesper Juhl authored
      In fs/xfs/xfs_trans.c::xfs_trans_unreserve_and_mod_sb() at the out:
      label we have this:
      	ASSERT(error = 0);
      I believe a comparison was intended, not an assignment. If I'm
      right, the patch below fixes that up.
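For reference, assuming the usual "no error" postcondition was intended, the corrected line would simply compare rather than assign:
	ASSERT(error == 0);	/* compare against zero, don't assign it */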
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Alex Elder <aelder@sgi.com>
      1884bd83
• xfs: fix error handling for synchronous writes · bfc60177
      Christoph Hellwig authored
      If we get an IO error on a synchronous superblock write, we attach an
      error release function to it so that when the last reference goes away
      the release function is called and the buffer is invalidated and
      unlocked. The buffer is left locked until the release function is
      called so that other concurrent users of the buffer will be locked out
      until the buffer error is fully processed.
      
Unfortunately, for the superblock buffer the filesystem itself holds a
      reference to the buffer which prevents the reference count from
      dropping to zero and the release function being called. As a result,
      once an IO error occurs on a sync write, the buffer will never be
      unlocked and all future attempts to lock the buffer will hang.
      
To make matters worse, this problem is not unique to such buffers;
      if there is a concurrent _xfs_buf_find() running, the lookup will grab
      a reference to the buffer and then wait on the buffer lock, preventing
      the reference count from ever falling to zero and hence unlocking the
      buffer.
      
      As such, the whole b_relse function implementation is broken because it
      cannot rely on the buffer reference count falling to zero to unlock the
      errored buffer. The synchronous write error path is the only path that
      uses this callback - it is used to ensure that the synchronous waiter
      gets the buffer error before the error state is cleared from the buffer
      by the release function.
      
Given that the only synchronous buffer writes now go through xfs_bwrite
      and the error path in question can only occur for a write of a dirty,
      logged buffer, we can move most of the b_relse processing to happen
      inline in xfs_buf_iodone_callbacks, just like a normal I/O completion.
      In addition to that we make sure the error is not cleared in
      xfs_buf_iodone_callbacks, so that xfs_bwrite can reliably check it.
Given that xfs_bwrite keeps the buffer locked until it has waited for
it and checked the error, this allows us to reliably propagate the
error to the caller and to make sure that the buffer is reliably unlocked.
      
      Given that xfs_buf_iodone_callbacks was the only instance of the
      b_relse callback we can remove it entirely.
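A rough sketch of the resulting synchronous write pattern (simplified and with hypothetical helper names; the real code lives in xfs_bwrite and xfs_buf_iodone_callbacks and operates on struct xfs_buf):

	/*
	 * Illustration only: the buffer stays locked across the whole
	 * synchronous write, the iodone callbacks leave the error on the
	 * buffer, and the caller reads it before unlocking.
	 */
	int
	sync_write_buffer(struct xfs_buf *bp)
	{
		int	error;

		submit_buffer_write(bp);	/* hypothetical: issue the I/O */
		wait_for_buffer_io(bp);		/* hypothetical: wait for completion */

		error = bp->b_error;		/* not cleared by the callbacks */
		unlock_buffer(bp);		/* hypothetical: release the lock last */
		return error;
	}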
      
      Based on earlier patches by Dave Chinner and Ajeet Yadav.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Ajeet Yadav <ajeet.yadav.77@gmail.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
      bfc60177
• xfs: add FITRIM support · a46db608
      Christoph Hellwig authored
      Allow manual discards from userspace using the FITRIM ioctl.  This is not
intended to be run during normal workloads, as the freespace btree walks
      can cause large performance degradation.
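For reference, a minimal userspace caller looks roughly like this (error handling trimmed; assumes a kernel whose <linux/fs.h> exports FITRIM and struct fstrim_range):

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>

	int main(int argc, char **argv)
	{
		struct fstrim_range range;
		int fd = open(argv[1], O_RDONLY);	/* any file or dir on the fs */

		memset(&range, 0, sizeof(range));
		range.len = (__u64)-1;			/* trim the whole filesystem */

		if (ioctl(fd, FITRIM, &range) < 0)
			perror("FITRIM");
		else
			printf("trimmed %llu bytes\n",
			       (unsigned long long)range.len);
		return 0;
	}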
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
      a46db608
• xfs: ensure log covering transactions are synchronous · c58efdb4
      Dave Chinner authored
      To ensure the log is covered and the filesystem idles correctly, we
      need to ensure that dummy transactions hit the disk and do not stay
      pinned in memory.  If the superblock is pinned in memory, it can't
      be flushed so the log covering cannot make progress. The result is
dependent on timing - more often than not we continue to issue a
      log covering transaction every 36s rather than idling after ~90s.
      
      Fix this by making the log covering transaction synchronous. To
      avoid additional log force from xfssyncd, make the log covering
      transaction take the place of the existing log force in the xfssyncd
      background sync process.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
      c58efdb4
  2. 10 Jan, 2011 4 commits
  3. 12 Jan, 2011 1 commit
• xfs: introduce xfs_rw_lock() helpers for locking the inode · 487f84f3
      Dave Chinner authored
      We need to obtain the i_mutex, i_iolock and i_ilock during the read
      and write paths. Add a set of wrapper functions to neatly
      encapsulate the lock ordering and shared/exclusive semantics to make
      the locking easier to follow and get right.
      
      Note that this changes some of the exclusive locking serialisation in
      that serialisation will occur against the i_mutex instead of the
      XFS_IOLOCK_EXCL. This does not change any behaviour, and it is
      arguably more efficient to use the mutex for such serialisation than
      the rw_sem.
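The shape of such a wrapper, as a hedged sketch (helper names and the exact lock set follow the description above rather than the final code):

	static inline void
	xfs_rw_ilock(struct xfs_inode *ip, int type)
	{
		/* exclusive access serialises on the VFS i_mutex first */
		if (type & XFS_IOLOCK_EXCL)
			mutex_lock(&VFS_I(ip)->i_mutex);
		xfs_ilock(ip, type);
	}

	static inline void
	xfs_rw_iunlock(struct xfs_inode *ip, int type)
	{
		/* unlock in the reverse order */
		xfs_iunlock(ip, type);
		if (type & XFS_IOLOCK_EXCL)
			mutex_unlock(&VFS_I(ip)->i_mutex);
	}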
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
      487f84f3
  4. 10 Jan, 2011 3 commits
  5. 21 Dec, 2010 2 commits
• xfs: convert grant head manipulations to lockless algorithm · d0eb2f38
      Dave Chinner authored
      The only thing that the grant lock remains to protect is the grant head
      manipulations when adding or removing space from the log. These calculations
      are already based on atomic variables, so we can already update them safely
without locks. However, the grant head manipulations require atomic multi-step
      calculations to be executed, which the algorithms currently don't allow.
      
      To make these multi-step calculations atomic, convert the algorithms to
      compare-and-exchange loops on the atomic variables. That is, we sample the old
      value, perform the calculation and use atomic64_cmpxchg() to attempt to update
      the head with the new value. If the head has not changed since we sampled it,
      it will succeed and we are done. Otherwise, we rerun the calculation again from
      a new sample of the head.
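In outline, each grant head update becomes a standard compare-and-exchange retry loop (a sketch; the calculation helper is hypothetical):

	int64_t	old, new;

	do {
		old = atomic64_read(&head);		/* sample the current grant head */
		new = grant_head_add(old, bytes);	/* hypothetical multi-step calculation */
	} while (atomic64_cmpxchg(&head, old, new) != old);
	/* if the head changed underneath us, resample and retry */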
      
This allows us to remove the grant lock from around all the grant head space
manipulations, which was the last thing it protected, so the grant lock can
now be removed from the log completely.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
      d0eb2f38
• xfs: introduce new locks for the log grant ticket wait queues · 3f16b985
      Dave Chinner authored
      The log grant ticket wait queues are currently protected by the log
      grant lock.  However, the queues are functionally independent from
      each other, and operations on them only require serialisation
      against other queue operations now that all of the other log
      variables they use are atomic values.
      
      Hence, we can make them independent of the grant lock by introducing
new locks just to protect the list operations. Because the lists
are independent, we can use a lock per list and ensure that reserve
      and write head queuing do not contend.
      
      To ensure forced shutdowns work correctly in conjunction with the
      new fast paths, ensure that we check whether the log has been shut
      down in the grant functions once we hold the relevant spin locks but
      before we go to sleep. This is needed to co-ordinate correctly with
      the wakeups that are issued on the ticket queues so we don't leave
      any processes sleeping on the queues during a shutdown.
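Sketched with illustrative lock, queue and helper names, the check-then-sleep ordering looks like:

	spin_lock(&log->l_grant_reserve_lock);		/* new per-queue lock */
	if (XLOG_FORCED_SHUTDOWN(log)) {
		spin_unlock(&log->l_grant_reserve_lock);
		return XFS_ERROR(EIO);			/* never sleep on a dead log */
	}
	list_add_tail(&tic->t_queue, &log->l_grant_reserve_queue);
	xlog_wait(&tic->t_wait, &log->l_grant_reserve_lock);	/* sleeps, drops the lock */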
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
      3f16b985
  6. 03 Dec, 2010 1 commit
  7. 21 Dec, 2010 1 commit
• xfs: convert l_tail_lsn to an atomic variable. · 1c3cb9ec
      Dave Chinner authored
      log->l_tail_lsn is currently protected by the log grant lock. The
      lock is only needed for serialising readers against writers, so we
      don't really need the lock if we make the l_tail_lsn variable an
      atomic. Converting the l_tail_lsn variable to an atomic64_t means we
      can start to peel back the grant lock from various operations.
      
      Also, provide functions to safely crack an atomic LSN variable into
its component pieces and to recombine the components into an
      atomic variable. Use them where appropriate.
      
      This also removes the need for explicitly holding a spinlock to read
      the l_tail_lsn on 32 bit platforms.
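A hedged sketch of such helpers, assuming the usual XFS LSN layout of a cycle number in the high 32 bits and a block number in the low 32 bits (helper names are illustrative):

	static inline void
	xlog_crack_atomic_lsn(atomic64_t *lsn, uint *cycle, uint *block)
	{
		xfs_lsn_t val = atomic64_read(lsn);	/* single atomic read, even on 32 bit */

		*cycle = CYCLE_LSN(val);		/* high 32 bits */
		*block = BLOCK_LSN(val);		/* low 32 bits */
	}

	static inline void
	xlog_assign_atomic_lsn(atomic64_t *lsn, uint cycle, uint block)
	{
		atomic64_set(lsn, ((xfs_lsn_t)cycle << 32) | block);
	}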
Signed-off-by: Dave Chinner <dchinner@redhat.com>
      
      1c3cb9ec
  8. 03 Dec, 2010 1 commit
• xfs: convert l_last_sync_lsn to an atomic variable · 84f3c683
      Dave Chinner authored
      log->l_last_sync_lsn is updated in only one critical spot - log
buffer IO completion - and is protected by the grant lock here. This
      requires the grant lock to be taken for every log buffer IO
      completion. Converting the l_last_sync_lsn variable to an atomic64_t
      means that we do not need to take the grant lock in log buffer IO
      completion to update it.
      
      This also removes the need for explicitly holding a spinlock to read
      the l_last_sync_lsn on 32 bit platforms.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
      84f3c683
  9. 21 Dec, 2010 6 commits
  10. 20 Dec, 2010 3 commits
  11. 03 Dec, 2010 1 commit
• xfs: consume iodone callback items on buffers as they are processed · c90821a2
      Dave Chinner authored
      To allow buffer iodone callbacks to consume multiple items off the
      callback list, first we need to convert the xfs_buf_do_callbacks()
      to consume items and always pull the next item from the head of the
      list.
      
This means the item list walk is never dependent on knowing the
      next item on the list and hence allows callbacks to remove items
      from the list as well. This allows callbacks to do bulk operations
      by scanning the list for identical callbacks, consuming them all
      and then processing them in bulk, negating the need for multiple
      callbacks of that type.
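The consuming walk boils down to always popping the head of the buffer's log item list, roughly (a simplified sketch of the era's buffer/log item chaining):

	/*
	 * Pull items from the head of the list one at a time, so a callback
	 * is free to remove further items (e.g. to batch all items of the
	 * same type) without breaking the walk.
	 */
	static void
	do_buffer_callbacks(struct xfs_buf *bp)
	{
		struct xfs_log_item	*lip;

		while ((lip = XFS_BUF_FSPRIVATE(bp, struct xfs_log_item *)) != NULL) {
			XFS_BUF_SET_FSPRIVATE(bp, lip->li_bio_list);	/* unlink head */
			lip->li_bio_list = NULL;
			lip->li_cb(bp, lip);				/* may consume more items */
		}
	}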
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
      c90821a2
  12. 17 Dec, 2010 1 commit
• xfs: reduce the number of AIL push wakeups · e677d0f9
      Dave Chinner authored
The xfsaild often tries to rest to wait for congestion to pass or for
IO to complete, but is regularly woken in tail-pushing situations.
In severe cases, the xfsaild is getting woken tens of thousands of
times a second. Reduce the number of needless wakeups by only waking
the xfsaild if the new target is larger than the old one. Further,
make short sleeps uninterruptible as they occur when the xfsaild has
      decided it needs to back off to allow some IO to complete and being
      woken early is counter-productive.
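The wakeup filter amounts to a single LSN comparison before kicking the thread, roughly (AIL field names are illustrative):

	/*
	 * Only wake the xfsaild when the push target actually moves forward;
	 * a target at or behind the current one changes nothing.
	 */
	if (XFS_LSN_CMP(threshold_lsn, ailp->xa_target) > 0) {
		ailp->xa_target = threshold_lsn;
		smp_wmb();			/* publish the new target before waking */
		wake_up_process(ailp->xa_task);
	}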
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
      e677d0f9
  13. 20 Dec, 2010 1 commit
• xfs: bulk AIL insertion during transaction commit · 0e57f6a3
      Dave Chinner authored
      When inserting items into the AIL from the transaction committed
      callbacks, we take the AIL lock for every single item that is to be
      inserted. For a CIL checkpoint commit, this can be tens of thousands
      of individual inserts, yet almost all of the items will be inserted
      at the same point in the AIL because they have the same index.
      
      To reduce the overhead and contention on the AIL lock for such
      operations, introduce a "bulk insert" operation which allows a list
      of log items with the same LSN to be inserted in a single operation
      via a list splice. To do this, we need to pre-sort the log items
      being committed into a temporary list for insertion.
      
      The complexity is that not every log item will end up with the same
      LSN, and not every item is actually inserted into the AIL. Items
      that don't match the commit LSN will be inserted and unpinned as per
      the current one-at-a-time method (relatively rare), while items that
      are not to be inserted will be unpinned and freed immediately. Items
that are to be inserted at the given commit LSN are placed in a
      temporary array and inserted into the AIL in bulk each time the
      array fills up.
      
      As a result of this, we trade off AIL hold time for a significant
      reduction in traffic. lock_stat output shows that the worst case
      hold time is unchanged, but contention from AIL inserts drops by an
order of magnitude and the number of lock traversals decreases
      significantly.
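In outline, the commit-time sort-and-splice it describes (the batch size and all helper names other than the bulk AIL update are illustrative):

	#define LOG_ITEM_BATCH_SIZE	32	/* illustrative batch size */

	struct xfs_log_item	*batch[LOG_ITEM_BATCH_SIZE];
	xfs_lsn_t		item_lsn;
	int			i = 0;

	for_each_committed_item(lip) {			/* illustrative iterator */
		if (!item_goes_to_ail(lip)) {		/* illustrative test */
			unpin_and_free(lip);		/* freed immediately */
			continue;
		}
		item_lsn = item_committed_lsn(lip);	/* illustrative */
		if (item_lsn != commit_lsn) {
			insert_and_unpin_one(lip);	/* rare: old one-at-a-time path */
			continue;
		}
		batch[i++] = lip;
		if (i == LOG_ITEM_BATCH_SIZE) {
			xfs_trans_ail_update_bulk(ailp, batch, i, commit_lsn);
			i = 0;
		}
	}
	if (i)
		xfs_trans_ail_update_bulk(ailp, batch, i, commit_lsn);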
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
      0e57f6a3
  14. 03 Dec, 2010 1 commit
• xfs: clean up xfs_ail_delete() · eb3efa12
      Dave Chinner authored
      xfs_ail_delete() has a needlessly complex interface. It returns the log item
      that was passed in for deletion (which the callers then assert is identical to
      the one passed in), and callers of xfs_ail_delete() still need to invalidate
      current traversal cursors.
      
      Make xfs_ail_delete() return void, move the cursor invalidation inside it, and
      clean up the callers just to use the log item pointer they passed in.
      
While cleaning up, remove the messy and unnecessary "/* ARGSUSED */" comments
      around all these functions.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
      eb3efa12
  15. 20 Dec, 2010 2 commits
• xfs: Pull EFI/EFD handling out from under the AIL lock · b199c8a4
      Dave Chinner authored
      EFI/EFD interactions are protected from races by the AIL lock. They
are the only type of log items that require the AIL lock to
      serialise internal state, so they need to be separated from the AIL
      lock before we can do bulk insert operations on the AIL.
      
To achieve this, convert the counter of the number of extents in the
      EFI to an atomic so it can be safely manipulated by EFD processing
      without locks. Also, convert the EFI state flag manipulations to use
      atomic bit operations so no locks are needed to record state
      changes. Finally, use the state bits to determine when it is safe to
      free the EFI and clean up the code to do this neatly.
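A minimal sketch of the resulting pattern on the EFD completion side (field, flag and helper names are illustrative; the real code also covers the ordering where the last EFD arrives before the EFI is marked committed):

	/*
	 * The EFI carries an atomic count of outstanding extents and atomic
	 * state bits, so EFD processing needs no AIL lock at all.
	 */
	if (atomic_dec_and_test(&efip->efi_next_extent) &&
	    test_bit(XFS_EFI_COMMITTED, &efip->efi_flags))
		xfs_efi_item_free(efip);	/* last extent done after commit */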
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
      b199c8a4
• xfs: fix EFI transaction cancellation. · 9c5f8414
      Dave Chinner authored
      XFS_EFI_CANCELED has not been set in the code base since
      xfs_efi_cancel() was removed back in 2006 by commit
      065d312e ("[XFS] Remove unused
iop_abort log item operation"), and even then xfs_efi_cancel() was
      never called. I haven't tracked it back further than that (beyond
      git history), but it indicates that the handling of EFIs in
      cancelled transactions has been broken for a long time.
      
      Basically, when we get an IOP_UNPIN(lip, 1); call from
      xfs_trans_uncommit() (i.e. remove == 1), if we don't free the log
item descriptor we leak it. Fix the behaviour to be correct and kill
      the XFS_EFI_CANCELED flag.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
      9c5f8414
  16. 02 Dec, 2010 2 commits
• xfs: connect up buffer reclaim priority hooks · 821eb21d
      Dave Chinner authored
      Now that the buffer reclaim infrastructure can handle different reclaim
      priorities for different types of buffers, reconnect the hooks in the
XFS code that have been sitting dormant since it was ported to Linux. This
should finally give us reclaim prioritisation that is on a par with the
      functionality that Irix provided XFS 15 years ago.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
      821eb21d
• xfs: add a lru to the XFS buffer cache · 430cbeb8
      Dave Chinner authored
      Introduce a per-buftarg LRU for memory reclaim to operate on. This
      is the last piece we need to put in place so that we can fully
control the buffer lifecycle. This allows XFS to be responsible for
      maintaining the working set of buffers under memory pressure instead
      of relying on the VM reclaim not to take pages we need out from
      underneath us.
      
      The implementation introduces a b_lru_ref counter into the buffer.
      This is currently set to 1 whenever the buffer is referenced and so is used to
      determine if the buffer should be added to the LRU or not when freed.
      Effectively it allows lazy LRU initialisation of the buffer so we do not need
      to touch the LRU list and locks in xfs_buf_find().
      
      Instead, when the buffer is being released and we drop the last
reference to it, we check the b_lru_ref count and if it is non-zero
we re-add the buffer reference and add the buffer to the LRU. The
b_lru_ref counter is decremented by the shrinker, and whenever the
shrinker comes across a buffer with a zero b_lru_ref counter, it
releases the LRU reference on the buffer. In the absence of a lookup
      race, this will result in the buffer being freed.
      
      This counting mechanism is used instead of a reference flag so that
      it is simple to re-introduce buffer-type specific reclaim reference
      counts to prioritise reclaim more effectively. We still have all
      those hooks in the XFS code, so this will provide the infrastructure
      to re-implement that functionality.
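A hedged sketch of the release path this describes (simplified; the real code also handles the buffer lock, the per-buftarg LRU lock and lookup races):

	/*
	 * On final release, a non-zero b_lru_ref parks the buffer on the
	 * per-buftarg LRU instead of freeing it immediately.
	 */
	if (atomic_dec_and_test(&bp->b_hold)) {
		if (atomic_read(&bp->b_lru_ref)) {
			xfs_buf_lru_add(bp);	/* LRU now owns a reference */
			atomic_inc(&bp->b_hold);
		} else {
			xfs_buf_free(bp);	/* no LRU interest left */
		}
	}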
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
      430cbeb8
  17. 30 Nov, 2010 1 commit
  18. 16 Dec, 2010 1 commit
  19. 17 Dec, 2010 1 commit
• xfs: convert inode cache lookups to use RCU locking · 1a3e8f3d
      Dave Chinner authored
      With delayed logging greatly increasing the sustained parallelism of inode
      operations, the inode cache locking is showing significant read vs write
contention when inode reclaim runs at the same time as lookups. There are
also a lot more write lock acquisitions than there are read locks (4:1 ratio)
      so the read locking is not really buying us much in the way of parallelism.
      
      To avoid the read vs write contention, change the cache to use RCU locking on
      the read side. To avoid needing to RCU free every single inode, use the built
      in slab RCU freeing mechanism. This requires us to be able to detect lookups of
freed inodes, so ensure that every freed inode has an inode number of zero and
the XFS_IRECLAIM flag set. We already check the XFS_IRECLAIM flag in the cache
hit lookup path, but also add a check for a zero inode number as well.
      
We can then convert all the read locking lookups to use RCU read side locking
and hence remove all read side locking.
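A hedged sketch of the resulting cache-hit check (the zero inode number and XFS_IRECLAIM tests come straight from the description above; the surrounding details are simplified):

	rcu_read_lock();
	ip = radix_tree_lookup(&pag->pag_ici_root, agino);
	if (ip) {
		spin_lock(&ip->i_flags_lock);
		/* a freed/recycled inode shows up as ino == 0 or XFS_IRECLAIM */
		if (ip->i_ino != ino || (ip->i_flags & XFS_IRECLAIM)) {
			spin_unlock(&ip->i_flags_lock);
			rcu_read_unlock();
			return EAGAIN;		/* treat as a miss and retry */
		}
		spin_unlock(&ip->i_flags_lock);
	}
	rcu_read_unlock();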
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
      1a3e8f3d
  20. 16 Dec, 2010 1 commit
• xfs: rcu free inodes · d95b7aaf
      Dave Chinner authored
      Introduce RCU freeing of XFS inodes so that we can convert lookup
      traversals to use rcu_read_lock() protection. This patch only
      introduces the RCU freeing to minimise the potential conflicts with
      mainline if this is merged into mainline via a VFS patchset. It
      abuses the i_dentry list for the RCU callback structure because the
VFS patches make this a union so it is safe to use like this and
simplifies any merge issues.
      
      This patch uses basic RCU freeing rather than SLAB_DESTROY_BY_RCU.
      The later lookup patches need the same "found free inode" protection
      regardless of the RCU freeing method used, so once again the RCU
freeing method can be dealt with appropriately at merge time without
      affecting any other code.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      d95b7aaf
  21. 23 Dec, 2010 1 commit
• xfs: don't truncate prealloc from frequently accessed inodes · 6e857567
      Dave Chinner authored
A long standing problem for streaming writes through the NFS server
      has been that the NFS server opens and closes file descriptors on an
      inode for every write. The result of this behaviour is that the
      ->release() function is called on every close and that results in
      XFS truncating speculative preallocation beyond the EOF.  This has
      an adverse effect on file layout when multiple files are being
      written at the same time - they interleave their extents and can
      result in severe fragmentation.
      
      To avoid this problem, keep track of ->release calls made on a dirty
      inode. For most cases, an inode is only going to be opened once for
writing and then closed again during its lifetime in cache. Hence
      if there are multiple ->release calls when the inode is dirty, there
      is a good chance that the inode is being accessed by the NFS server.
      Hence set a flag the first time ->release is called while there are
      delalloc blocks still outstanding on the inode.
      
If this flag is set when ->release is next called, then do not
      truncate away the speculative preallocation - leave it there so that
      subsequent writes do not need to reallocate the delalloc space. This
      will prevent interleaving of extents of different inodes written
      concurrently to the same AG.
      
      If we get this wrong, it is not a big deal as we truncate
      speculative allocation beyond EOF anyway in xfs_inactive() when the
      inode is thrown out of the cache.
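The heuristic reduces to a per-inode flag tested in ->release, roughly (flag and helper names are illustrative):

	/*
	 * The first ->release on a dirty inode sets the flag and trims as
	 * before; a later ->release while still dirty indicates an NFS-style
	 * open/write/close pattern, so the EOF preallocation is kept.
	 */
	if (inode_has_delalloc_blocks(ip) &&
	    test_and_set_bit(XFS_IDIRTY_RELEASE, &ip->i_release_flags))
		return 0;			/* keep the speculative prealloc */

	trim_prealloc_beyond_eof(ip);		/* unchanged behaviour otherwise */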
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
      6e857567
  22. 04 Jan, 2011 1 commit
• xfs: dynamic speculative EOF preallocation · 055388a3
      Dave Chinner authored
      Currently the size of the speculative preallocation during delayed
allocation is fixed by either the allocsize mount option or a
      default size. We are seeing a lot of cases where we need to
      recommend using the allocsize mount option to prevent fragmentation
      when buffered writes land in the same AG.
      
      Rather than using a fixed preallocation size by default (up to 64k),
      make it dynamic by basing it on the current inode size. That way the
      EOF preallocation will increase as the file size increases.  Hence
      for streaming writes we are much more likely to get large
preallocations exactly when we need them to reduce fragmentation.
      
      For default settings, the size of the initial extents is determined
      by the number of parallel writers and the amount of memory in the
      machine. For 4GB RAM and 4 concurrent 32GB file writes:
      
      EXT: FILE-OFFSET           BLOCK-RANGE          AG AG-OFFSET                 TOTAL
         0: [0..1048575]:         1048672..2097247      0 (1048672..2097247)      1048576
         1: [1048576..2097151]:   5242976..6291551      0 (5242976..6291551)      1048576
         2: [2097152..4194303]:   12583008..14680159    0 (12583008..14680159)    2097152
         3: [4194304..8388607]:   25165920..29360223    0 (25165920..29360223)    4194304
         4: [8388608..16777215]:  58720352..67108959    0 (58720352..67108959)    8388608
         5: [16777216..33554423]: 117440584..134217791  0 (117440584..134217791) 16777208
         6: [33554424..50331511]: 184549056..201326143  0 (184549056..201326143) 16777088
         7: [50331512..67108599]: 251657408..268434495  0 (251657408..268434495) 16777088
      
      and for 16 concurrent 16GB file writes:
      
       EXT: FILE-OFFSET           BLOCK-RANGE          AG AG-OFFSET                 TOTAL
         0: [0..262143]:          2490472..2752615      0 (2490472..2752615)       262144
         1: [262144..524287]:     6291560..6553703      0 (6291560..6553703)       262144
         2: [524288..1048575]:    13631592..14155879    0 (13631592..14155879)     524288
         3: [1048576..2097151]:   30408808..31457383    0 (30408808..31457383)    1048576
         4: [2097152..4194303]:   52428904..54526055    0 (52428904..54526055)    2097152
         5: [4194304..8388607]:   104857704..109052007  0 (104857704..109052007)  4194304
         6: [8388608..16777215]:  209715304..218103911  0 (209715304..218103911)  8388608
         7: [16777216..33554423]: 452984848..469762055  0 (452984848..469762055) 16777208
      
Because it is hard to take back speculative preallocation, cases
      where there are large slow growing log files on a nearly full
      filesystem may cause premature ENOSPC. Hence as the filesystem nears
full, the maximum dynamic prealloc size is reduced according to this
      table (based on 4k block size):
      
      freespace       max prealloc size
        >5%             full extent (8GB)
        4-5%             2GB (8GB >> 2)
        3-4%             1GB (8GB >> 3)
        2-3%           512MB (8GB >> 4)
        1-2%           256MB (8GB >> 5)
        <1%            128MB (8GB >> 6)
      
      This should reduce the amount of space held in speculative
      preallocation for such cases.
      
      The allocsize mount option turns off the dynamic behaviour and fixes
      the prealloc size to whatever the mount option specifies. i.e. the
      behaviour is unchanged.
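In sketch form, the default policy the table describes (a rough illustration, assuming a helper that reports free space as a whole percentage; the real code works in filesystem blocks and applies further rounding):

	/*
	 * Base the EOF preallocation on the current file size, capped at the
	 * 8GB "full extent", then scale it back as free space drops below 5%.
	 */
	xfs_fsblock_t
	prealloc_size_fsb(struct xfs_inode *ip)
	{
		struct xfs_mount *mp = ip->i_mount;
		xfs_fsblock_t	alloc_blocks;
		int		free_pct = fs_free_space_percent(mp);	/* illustrative helper */
		int		shift = 0;

		if (mp->m_flags & XFS_MOUNT_DFLT_IOSIZE)	/* allocsize given: fixed size */
			return mp->m_writeio_blocks;

		alloc_blocks = min(XFS_B_TO_FSB(mp, ip->i_size),
				   XFS_B_TO_FSB(mp, 8ULL << 30));

		if (free_pct < 5)
			shift = 6 - free_pct;	/* 4% -> >>2, 3% -> >>3, ... <1% -> >>6 */
		return alloc_blocks >> shift;
	}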
Signed-off-by: Dave Chinner <dchinner@redhat.com>
      055388a3