1. 09 Jul, 2013 2 commits
  2. 28 Jun, 2013 7 commits
  3. 27 Jun, 2013 11 commits
    • Dave Chinner's avatar
      xfs: Use inode create transaction · ddf6ad01
      Dave Chinner authored
      Replace the use of buffer based logging of inode initialisation,
      uses the new logical form to describe the range to be initialised
      in recovery. We continue to "log" the inode buffers to push them
      into the AIL and ensure that the inode create transaction is not
      removed from the log before the inode buffers are written to disk.
      
      Update the transaction identifier and reservations to match the
      changed implementation.
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      ddf6ad01
    • Dave Chinner's avatar
      xfs: Inode create item recovery · 28c8e41a
      Dave Chinner authored
      When we find a icreate transaction, we need to get and initialise
      the buffers in the range that has been passed. Extract and verify
      the information in the item record, then loop over the range
      initialising and issuing the buffer writes delayed.
      
      Support an arbitrary size range to initialise so that in
      future when we allocate inodes in much larger chunks all kernels
      that understand this transaction can still recover them.
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      28c8e41a
    • Dave Chinner's avatar
      xfs: Inode create transaction reservations · b8402b47
      Dave Chinner authored
      Define the log and space transaction sizes. Factor the current
      create log reservation macro into the two logical halves and reuse
      one half for the new icreate transactions. The icreate transaction
      is transparent to all the high level create code - the
      pre-calculated reservations will correctly set the reservations
      dependent on whether the filesystem supports the icreate
      transaction.
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      b8402b47
    • Dave Chinner's avatar
      xfs: Inode create log items · 3ebe7d2d
      Dave Chinner authored
      Introduce the inode create log item type for logical inode create logging.
      Instead of logging the changes in buffers, pass the range to be
      initialised through the log by a new transaction type.  This reduces
      the amount of log space required to record initialisation during
      allocation from about 128 bytes per inode to a small fixed amount
      per inode extent to be initialised.
      
      This requires a new log item type to track it through the log
      and the AIL. This is a relatively simple item - most callbacks are
      noops as this item has the same life cycle as the transaction.
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      3ebe7d2d
    • Dave Chinner's avatar
      xfs: Introduce an ordered buffer item · 5f6bed76
      Dave Chinner authored
      If we have a buffer that we have modified but we do not wish to
      physically log in a transaction (e.g. we've logged a logical
      change), we still need to ensure that transactional integrity is
      maintained. Hence we must not move the tail of the log past the
      transaction that the buffer is associated with before the buffer is
      written to disk.
      
      This means these special buffers still need to be included in the
      transaction and added to the AIL just like a normal buffer, but we
      do not want the modifications to the buffer written into the
      transaction. IOWs, what we want is an "ordered buffer" that
      maintains the same transactional life cycle as a physically logged
      buffer, just without the transcribing of the modifications to the
      log.
      
      Hence we need to flag the buffer as an "ordered buffer" to avoid
      including it in vector size calculations or formatting during the
      transaction. Once the transaction is committed, the buffer appears
      for all intents to be the same as a physically logged buffer as it
      transitions through the log and AIL.
      
      Relogging will also work just fine for such an ordered buffer - the
      logical transaction will be replayed before the subsequent
      modifications that relog the buffer, so everything will be
      reconstructed correctly by recovery.
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      5f6bed76
    • Dave Chinner's avatar
      xfs: Introduce ordered log vector support · fd63875c
      Dave Chinner authored
      And "ordered log vector" is a log vector that is used for
      tracking a log item through the CIL and into the AIL as part of the
      log checkpointing. These ordered log vectors are special in that
      they are not written to to journal in any way, and are not accounted
      to the checkpoint being written.
      
      The reason for this behaviour is to allow operations to attach items
      to transactions and have them follow the normal transactional
      lifecycle without actually having to write them to the journal. This
      allows logging of items that track high level logical changes and
      writing them to the log, while the physical items being modified
      pass through into the AIL and pin the tail of the log (and therefore
      the logical item in the log) until all the modified items are
      physically written to disk.
      
      IOWs, it allows us to write metadata without physically logging
      every individual change but still maintain the full transactional
      integrity guarantees we currently have w.r.t. crash recovery.
      
      This change modifies some of the CIL item insertion loops, as
      ordered log vectors introduce some new constraints as they don't
      track any data. One advantage of this change is that it combines
      two log vector chain walks into a single pass, so there is less
      overhead in the transaction commit pass as well. It also kills some
      unused code in the log vector walk loop when committing the CIL.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      fd63875c
    • Dave Chinner's avatar
      xfs: xfs_ifree doesn't need to modify the inode buffer · 1baaed8f
      Dave Chinner authored
      Long ago, bulkstat used to read inodes directly from the backing
      buffer for speed. This had the unfortunate problem of being cache
      incoherent with unlinks, and so xfs_ifree() had to mark the inode
      as free directly in the backing buffer. bulkstat was changed some
      time ago to use inode cache coherent lookups, and so will never see
      unlinked inodes in it's lookups. Hence xfs_ifree() does not need to
      touch the inode backing buffer anymore.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      1baaed8f
    • Dave Chinner's avatar
      xfs: don't do IO when creating an new inode · cca9f93a
      Dave Chinner authored
      When we are allocating a new inode, we read the inode cluster off
      disk to increment the generation number. We are already using a
      random generation number for newly allocated inodes, so if we are not
      using the ikeep mode, we can just generate a new generation number
      when we initialise the newly allocated inode.
      
      This avoids the need for reading the inode buffer during inode
      creation. This will speed up allocation of inodes in cold, partially
      allocated clusters as they will no longer need to be read from disk
      during allocation. It will also reduce the CPU overhead of inode
      allocation by not having the process the buffer read, even on cache
      hits.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      cca9f93a
    • Dave Chinner's avatar
      xfs: don't use speculative prealloc for small files · 133eeb17
      Dave Chinner authored
      Dedicated small file workloads have been seeing significant free
      space fragmentation causing premature inode allocation failure
      when large inode sizes are in use. A particular test case showed
      that a workload that runs to a real ENOSPC on 256 byte inodes would
      fail inode allocation with ENOSPC about about 80% full with 512 byte
      inodes, and at about 50% full with 1024 byte inodes.
      
      The same workload, when run with -o allocsize=4096 on 1024 byte
      inodes would run to being 100% full before giving ENOSPC. That is,
      no freespace fragmentation at all.
      
      The issue was caused by the specific IO pattern the application had
      - the framework it was using did not support direct IO, and so it
      was emulating it by using fadvise(DONT_NEED). The result was that
      the data was getting written back before the speculative prealloc
      had been trimmed from memory by the close(), and so small single
      block files were being allocated with 2 blocks, and then having one
      truncated away. The result was lots of small 4k free space extents,
      and hence each new 8k allocation would take another 8k from
      contiguous free space and turn it into 4k of allocated space and 4k
      of free space.
      
      Hence inode allocation, which requires contiguous, aligned
      allocation of 16k (256 byte inodes), 32k (512 byte inodes) or 64k
      (1024 byte inodes) can fail to find sufficiently large freespace and
      hence fail while there is still lots of free space available.
      
      There's a simple fix for this, and one that has precendence in the
      allocator code already - don't do speculative allocation unless the
      size of the file is larger than a certain size. In this case, that
      size is the minimum default preallocation size:
      mp->m_writeio_blocks. And to keep with the concept of being nice to
      people when the files are still relatively small, cap the prealloc
      to mp->m_writeio_blocks until the file goes over a stripe unit is
      size, at which point we'll fall back to the current behaviour based
      on the last extent size.
      
      This will effectively turn off speculative prealloc for very small
      files, keep preallocation low for small files, and behave as it
      currently does for any file larger than a stripe unit. This
      completely avoids the freespace fragmentation problem this
      particular IO pattern was causing.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      133eeb17
    • Dave Chinner's avatar
      xfs: plug directory buffer readahead · 34eefc06
      Dave Chinner authored
      Similar to bulkstat inode chunk readahead, we need to plug directory
      data buffer readahead during getdents to ensure that we can merge
      adjacent readahead requests and sort out of order requests optimally
      before they are dispatched. This improves the readahead efficiency
      and reduces the IO load it generates as the IO patterns are
      significantly better for both contiguous and fragmented directories.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      34eefc06
    • Dave Chinner's avatar
      xfs: add pluging for bulkstat readahead · cbb2864a
      Dave Chinner authored
      I was running some tests on bulkstat on CRC enabled filesystems when
      I noticed that all the IO being issued was 8k in size, regardless of
      the fact taht we are issuing sequential 8k buffers for inodes
      clusters. The IO size should be 16k for 256 byte inodes, and 32k for
      512 byte inodes, but this wasn't happening.
      
      blktrace showed that there was an explict plug and unplug happening
      around each readahead IO from _xfs_buf_ioapply, and the unplug was
      causing the IO to be issued immediately. Hence no opportunity was
      being given to the elevator to merge adjacent readahead requests and
      dispatch them as a single IO.
      
      Add plugging around the inode chunk readahead dispatch loop in
      bulkstat to ensure that we don't unplug the queue between adjacent
      inode buffer readahead IOs and so we get fewer, larger IO requests
      hitting the storage subsystem for bulkstat.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      cbb2864a
  4. 26 Jun, 2013 2 commits
  5. 20 Jun, 2013 1 commit
    • Eric Sandeen's avatar
      xfs: check on-disk (not incore) btree root size in dfrag.c · 427d9fe2
      Eric Sandeen authored
      xfs_swap_extents_check_format() contains checks to make sure that
      original and the temporary files during defrag are compatible;
      Gabriel VLASIU ran into a case where xfs_fsr returned EINVAL
      because the tests found the btree root to be of size 120,
      while the fork offset was only 104; IOW, they overlapped.
      
      However, this is just due to an error in the
      xfs_swap_extents_check_format() tests, because it is checking
      the in-memory btree root size against the on-disk fork offset.
      We should be checking the on-disk sizes in both cases.
      
      This patch adds a new macro to calculate this size, and uses
      it in the tests.
      
      With this change, the filesystem image provided by Gabriel
      allows for proper file degragmentation.
      Reported-by: default avatarGabriel VLASIU <gabriel@vlasiu.net>
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      427d9fe2
  6. 19 Jun, 2013 4 commits
  7. 17 Jun, 2013 4 commits
  8. 14 Jun, 2013 1 commit
    • Dave Chinner's avatar
      xfs: don't shutdown log recovery on validation errors · 9222a9cf
      Dave Chinner authored
      Unfortunately, we cannot guarantee that items logged multiple times
      and replayed by log recovery do not take objects back in time. When
      they are taken back in time, the go into an intermediate state which
      is corrupt, and hence verification that occurs on this intermediate
      state causes log recovery to abort with a corruption shutdown.
      
      Instead of causing a shutdown and unmountable filesystem, don't
      verify post-recovery items before they are written to disk. This is
      less than optimal, but there is no way to detect this issue for
      non-CRC filesystems If log recovery successfully completes, this
      will be undone and the object will be consistent by subsequent
      transactions that are replayed, so in most cases we don't need to
      take drastic action.
      
      For CRC enabled filesystems, leave the verifiers in place - we need
      to call them to recalculate the CRCs on the objects anyway. This
      recovery problem can be solved for such filesystems - we have a LSN
      stamped in all metadata at writeback time that we can to determine
      whether the item should be replayed or not. This is a separate piece
      of work, so is not addressed by this patch.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarBen Myers <bpm@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      9222a9cf
  9. 13 Jun, 2013 2 commits
    • Dave Chinner's avatar
      xfs: ensure btree root split sets blkno correctly · ade1335a
      Dave Chinner authored
      For CRC enabled filesystems, the BMBT is rooted in an inode, so it
      passes through a different code path on root splits than the
      freespace and inode btrees. This is much less traversed by xfstests
      than the other trees. When testing on a 1k block size filesystem,
      I've been seeing ASSERT failures in generic/234 like:
      
      XFS: Assertion failed: cur->bc_btnum != XFS_BTNUM_BMAP || cur->bc_private.b.allocated == 0, file: fs/xfs/xfs_btree.c, line: 317
      
      which are generally preceded by a lblock check failure. I noticed
      this in the bmbt stats:
      
      $ pminfo -f xfs.btree.block_map
      
      xfs.btree.block_map.lookup
          value 39135
      
      xfs.btree.block_map.compare
          value 268432
      
      xfs.btree.block_map.insrec
          value 15786
      
      xfs.btree.block_map.delrec
          value 13884
      
      xfs.btree.block_map.newroot
          value 2
      
      xfs.btree.block_map.killroot
          value 0
      .....
      
      Very little coverage of root splits and merges. Indeed, on a 4k
      filesystem, block_map.newroot and block_map.killroot are both zero.
      i.e. the code is not exercised at all, and it's the only generic
      btree infrastructure operation that is not exercised by a default run
      of xfstests.
      
      Turns out that on a 1k filesystem, generic/234 accounts for one of
      those two root splits, and that is somewhat of a smoking gun. In
      fact, it's the same problem we saw in the directory/attr code where
      headers are memcpy()d from one block to another without updating the
      self describing metadata.
      
      Simple fix - when copying the header out of the root block, make
      sure the block number is updated correctly.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarBen Myers <bpm@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      ade1335a
    • Dave Chinner's avatar
      xfs: fix implicit padding in directory and attr CRC formats · 8a1fd295
      Dave Chinner authored
      Michael L. Semon has been testing CRC patches on a 32 bit system and
      been seeing assert failures in the directory code from xfs/080.
      Thanks to Michael's heroic efforts with printk debugging, we found
      that the problem was that the last free space being left in the
      directory structure was too small to fit a unused tag structure and
      it was being corrupted and attempting to log a region out of bounds.
      Hence the assert failure looked something like:
      
      .....
      #5 calling xfs_dir2_data_log_unused() 36 32
      #1 4092 4095 4096
      #2 8182 8183 4096
      XFS: Assertion failed: first <= last && last < BBTOB(bp->b_length), file: fs/xfs/xfs_trans_buf.c, line: 568
      
      Where #1 showed the first region of the dup being logged (i.e. the
      last 4 bytes of a directory buffer) and #2 shows the corrupt values
      being calculated from the length of the dup entry which overflowed
      the size of the buffer.
      
      It turns out that the problem was not in the logging code, nor in
      the freespace handling code. It is an initial condition bug that
      only shows up on 32 bit systems. When a new buffer is initialised,
      where's the freespace that is set up:
      
      [  172.316249] calling xfs_dir2_leaf_addname() from xfs_dir_createname()
      [  172.316346] #9 calling xfs_dir2_data_log_unused()
      [  172.316351] #1 calling xfs_trans_log_buf() 60 63 4096
      [  172.316353] #2 calling xfs_trans_log_buf() 4094 4095 4096
      
      Note the offset of the first region being logged? It's 60 bytes into
      the buffer. Once I saw that, I pretty much knew that the bug was
      going to be caused by this.
      
      Essentially, all direct entries are rounded to 8 bytes in length,
      and all entries start with an 8 byte alignment. This means that we
      can decode inplace as variables are naturally aligned. With the
      directory data supposedly starting on a 8 byte boundary, and all
      entries padded to 8 bytes, the minimum freespace in a directory
      block is supposed to be 8 bytes, which is large enough to fit a
      unused data entry structure (6 bytes in size). The fact we only have
      4 bytes of free space indicates a directory data block alignment
      problem.
      
      And what do you know - there's an implicit hole in the directory
      data block header for the CRC format, which means the header is 60
      byte on 32 bit intel systems and 64 bytes on 64 bit systems. Needs
      padding. And while looking at the structures, I found the same
      problem in the attr leaf header. Fix them both.
      
      Note that this only affects 32 bit systems with CRCs enabled.
      Everything else is just fine. Note that CRC enabled filesystems created
      before this fix on such systems will not be readable with this fix
      applied.
      Reported-by: default avatarMichael L. Semon <mlsemon35@gmail.com>
      Debugged-by: default avatarMichael L. Semon <mlsemon35@gmail.com>
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarBen Myers <bpm@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      8a1fd295
  10. 05 Jun, 2013 4 commits
  11. 04 Jun, 2013 2 commits
    • Dave Chinner's avatar
      xfs: fix remote attribute invalidation for a leaf · 59913f14
      Dave Chinner authored
      When invalidating an attribute leaf block block, there might be
      remote attributes that it points to. With the recent rework of the
      remote attribute format, we have to make sure we calculate the
      length of the attribute correctly. We aren't doing that in
      xfs_attr3_leaf_inactive(), so fix it.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarMark Tinguely <tinuguely@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      59913f14
    • Dave Chinner's avatar
      xfs: rework dquot CRCs · 6fcdc59d
      Dave Chinner authored
      Calculating dquot CRCs when the backing buffer is written back just
      doesn't work reliably. There are several places which manipulate
      dquots directly in the buffers, and they don't calculate CRCs
      appropriately, nor do they always set the buffer up to calculate
      CRCs appropriately.
      
      Firstly, if we log a dquot buffer (e.g. during allocation) it gets
      logged without valid CRC, and so on recovery we end up with a dquot
      that is not valid.
      
      Secondly, if we recover/repair a dquot, we don't have a verifier
      attached to the buffer and hence CRCs are not calculated on the way
      down to disk.
      
      Thirdly, calculating the CRC after we've changed the contents means
      that if we re-read the dquot from the buffer, we cannot verify the
      contents of the dquot are valid, as the CRC is invalid.
      
      So, to avoid all the dquot CRC errors that are being detected by the
      read verifier, change to using the same model as for inodes. That
      is, dquot CRCs are calculated and written to the backing buffer at
      the time the dquot is flushed to the backing buffer. If we modify
      the dquot directly in the backing buffer, calculate the CRC
      immediately after the modification is complete. Hence the dquot in
      the on-disk buffer should always have a valid CRC.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarBen Myers <bpm@sgi.com>
      Signed-off-by: default avatarBen Myers <bpm@sgi.com>
      6fcdc59d