1. 16 Nov, 2012 7 commits
  2. 14 Nov, 2012 5 commits
  3. 13 Nov, 2012 7 commits
    • xfs: make growfs initialise the AGFL header · de497688
      Dave Chinner authored
      For verification purposes, AGFLs need to be initialised to a known
      set of values. For upcoming CRC changes, they are also headers that
      need to be initialised. Currently, growfs does neither for the AGFLs
      - it ignores them completely. Add initialisation of the AGFL to be
      full of invalid block numbers (NULLAGBLOCK) to put the
      infrastructure in place needed for CRC support.
      
      Includes a comment clarification from Jeff Liu.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Rich Johnston <rjohnston@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
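The initialisation this commit describes can be sketched in user-space C. This is only an illustration: `agfl_init()`, `agfl_is_empty()` and `SKETCH_NULLAGBLOCK` are hypothetical names, and the real growfs code and on-disk AGFL layout are not reproduced here. The sketch assumes the invalid block number is an all-ones 32-bit value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for NULLAGBLOCK: assumed to be an all-ones "no block" marker. */
#define SKETCH_NULLAGBLOCK ((uint32_t)-1)

/* Fill every free list slot with the invalid block number so that a
 * verifier can tell an initialised, empty slot from stale garbage. */
static void agfl_init(uint32_t *slots, size_t nslots)
{
	for (size_t i = 0; i < nslots; i++)
		slots[i] = SKETCH_NULLAGBLOCK;
}

/* Return 1 if every slot holds the invalid marker, 0 otherwise. */
static int agfl_is_empty(const uint32_t *slots, size_t nslots)
{
	for (size_t i = 0; i < nslots; i++)
		if (slots[i] != SKETCH_NULLAGBLOCK)
			return 0;
	return 1;
}
```

A caller would run `agfl_init()` on the freshly allocated header before writing it, so that a later read can verify the slots against a known value.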
    • xfs: growfs: use uncached buffers for new headers · fd23683c
      Dave Chinner authored
      When writing the new AG headers to disk, we can't attach write
      verifiers because they depend on the struct xfs_perag attached to
      the buffer being fully initialised, and growfs can't fully
      initialise it until later in the process.
      
      The simplest way to avoid this problem is to use uncached buffers
      for writing the new headers. These buffers don't have the xfs_perag
      attached to them, so it's simple to detect in the write verifier and
      be able to skip the checks that need the xfs_perag.
      
      This enables us to attach the appropriate buffer ops to the buffer
      and hence calculate CRCs on the way to disk. It also means that the
      buffer is torn down immediately, and so the first access to the AG
      headers will re-read the header from disk and perform full
      verification of the buffer. This way we also can catch corruptions
      due to problems that went undetected in growfs.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Rich Johnston <rjohnston@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
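The "skip the checks that need the per-AG structure" idea can be sketched as follows. All names here (`buf_sketch`, `perag_sketch`, `write_verify()`, the magic value) are hypothetical stand-ins rather than the kernel's types. The sketch assumes an uncached buffer is recognisable by a NULL per-AG pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAGIC 0x1234abcdu   /* hypothetical header magic */

struct perag_sketch {
	uint32_t agno;              /* AG number this structure describes */
};

struct buf_sketch {
	struct perag_sketch *pag;   /* NULL models an uncached buffer */
	uint32_t magic;
	uint32_t seqno;             /* AG number recorded in the header */
};

/* Return 0 if the header may be written, -1 if it looks corrupt. */
static int write_verify(const struct buf_sketch *bp)
{
	/* Checks that only need the buffer contents always run. */
	if (bp->magic != SKETCH_MAGIC)
		return -1;

	/* No per-AG structure attached: skip the checks that need it. */
	if (bp->pag == NULL)
		return 0;

	if (bp->seqno != bp->pag->agno)
		return -1;
	return 0;
}
```

Because the uncached buffer is torn down after the write, the first cached read of the header would go through the full set of checks.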
    • xfs: use btree block initialisation functions in growfs · b64f3a39
      Dave Chinner authored
      Factor xfs_btree_init_block() to be independent of the btree cursor,
      and use the function to initialise btree blocks in the growfs code.
      This makes adding support for different format btree blocks simple.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Rich Johnston <rjohnston@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: add more attribute tree trace points. · ee73259b
      Dave Chinner authored
      Added when debugging recent attribute tree problems to more finely
      trace code execution through the maze of twisty passages that makes
      up the attr code.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: drop buffer io reference when a bad bio is built · 37eb17e6
      Dave Chinner authored
      Error handling in xfs_buf_ioapply_map() does not handle IO reference
      counts correctly. We increment the b_io_remaining count before
      building the bio, but then fail to decrement it in the failure case.
      This leads to the buffer never running IO completion and releasing
      the reference that the IO holds, so at unmount we can leak the
      buffer. This leak is captured by this assert failure during unmount:
      
      XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/xfs_mount.c, line: 273
      
      This is not a new bug - the b_io_remaining accounting has had this
      problem for a long, long time - it's just very hard to get a
      zero length bio being built by this code...
      
      Further, the buffer IO error can be overwritten on a multi-segment
      buffer by subsequent bio completions for partial sections of the
      buffer. Hence we should only set the buffer error status if the
      buffer is not already carrying an error status. This ensures that a
      partial IO error on a multi-segment buffer will not be lost. This
      part of the problem is a regression, however.
      
      cc: <stable@vger.kernel.org>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
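Both halves of this fix can be modelled in a few lines of user-space C. The names (`buf_sketch`, `submit_segment()`, `bio_end_io()`) are hypothetical and no real block layer is involved. The sketch assumes the submitter holds one IO reference of its own while segments are being issued.

```c
#include <assert.h>
#include <errno.h>

struct buf_sketch {
	int io_remaining;   /* outstanding IO references */
	int error;          /* first error seen, 0 if none */
	int completed;      /* set when IO completion has run */
};

/* Record an error only if the buffer is not already carrying one. */
static void buf_set_error_once(struct buf_sketch *bp, int error)
{
	if (error && !bp->error)
		bp->error = error;
}

/* Drop one IO reference; the last one runs IO completion. */
static void buf_io_put(struct buf_sketch *bp)
{
	assert(bp->io_remaining > 0);
	if (--bp->io_remaining == 0)
		bp->completed = 1;
}

/* Per-segment completion: keep the first error, drop the reference. */
static void bio_end_io(struct buf_sketch *bp, int error)
{
	buf_set_error_once(bp, error);
	buf_io_put(bp);
}

/* Issue one segment; build_ok == 0 models failing to build the bio. */
static void submit_segment(struct buf_sketch *bp, int build_ok)
{
	bp->io_remaining++;             /* taken before building the bio */
	if (!build_ok) {
		buf_set_error_once(bp, -EIO);
		buf_io_put(bp);         /* the fix: drop it on failure too */
		return;
	}
	/* A real implementation would submit the bio here; its completion
	 * handler would later call bio_end_io(). */
}
```

Without the `buf_io_put()` in the failure branch the count never reaches zero, which is the leak the assertion at unmount reports.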
    • xfs: fix broken error handling in xfs_vm_writepage · 7bf7f352
      Dave Chinner authored
      When we shut down the filesystem, it might first be detected in
      writeback when we are allocating an inode size transaction. This
      happens after we have moved all the pages into the writeback state
      and unlocked them. Unfortunately, if we fail to set up the
      transaction we then abort writeback and try to invalidate the
      current page. This then triggers a BUG() in block_invalidatepage()
      because we are trying to invalidate an unlocked page.
      
      Fixing this is a bit of a chicken and egg problem - we can't
      allocate the transaction until we've clustered all the pages into
      the IO and we know the size of it (i.e. whether the last block of
      the IO is beyond the current EOF or not). However, we don't want to
      hold pages locked for long periods of time, especially while we lock
      other pages to cluster them into the write.
      
      To fix this, we need to make a clear delineation in writeback where
      errors can only be handled by IO completion processing. That is,
      once we have marked a page for writeback and unlocked it, we have to
      report errors via IO completion because we've already started the
      IO. We may not have submitted any IO, but we've changed the page
      state to indicate that it is under IO so we must now use the IO
      completion path to report errors.
      
      To do this, add an error field to xfs_submit_ioend() to pass it the
      error that occurred during the building of the ioend chain. When
      this is non-zero, mark each ioend with the error and call
      xfs_finish_ioend() directly rather than building bios. This will
      immediately push the ioends through completion processing with the
      error that has occurred.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
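The control flow described in the last paragraph of this commit can be sketched as below. `ioend_sketch`, `finish_ioend()` and `submit_ioend_chain()` are hypothetical names, and the `bios_built` field merely stands in for real bio construction and submission.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct ioend_sketch {
	struct ioend_sketch *next;
	int error;        /* error to report through IO completion */
	int finished;     /* completion processing has run */
	int bios_built;   /* stands in for building and submitting bios */
};

/* Stand-in for completion processing of one ioend. */
static void finish_ioend(struct ioend_sketch *io)
{
	io->finished = 1;
}

/* Submit a chain of ioends. A non-zero fail value means building the
 * chain went wrong after pages were already marked as under IO. */
static void submit_ioend_chain(struct ioend_sketch *head, int fail)
{
	struct ioend_sketch *next;

	for (struct ioend_sketch *io = head; io; io = next) {
		next = io->next;   /* read before completion may drop io */
		if (fail) {
			/* Report the error via the completion path. */
			io->error = fail;
			finish_ioend(io);
			continue;
		}
		io->bios_built = 1;
	}
}
```

The point of the design is that once a page is flagged as under writeback, completion is the only place an error is reported, whether or not any IO was actually issued.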
    • xfs: fix attr tree double split corruption · 07428d7f
      Dave Chinner authored
      In certain circumstances, a double split of an attribute tree is
      needed to insert or replace an attribute. In rare situations, this
      can go wrong, leaving the attribute tree corrupted. In this case,
      the attr being replaced is the last attr in a leaf node, and the
      replacement is larger so doesn't fit in the same leaf node.
      Consider the initial condition of a node format attribute
      btree with two leaves at index 1 and 2. Call them L1 and L2.  The
      leaf L1 is completely full, there is not a single byte of free space
      in it. L2 is mostly empty.  The attribute being replaced - call it X
      - is the last attribute in L1.
      
      The way an attribute replace is executed is that the replacement
      attribute - call it Y - is first inserted into the tree, but has an
      INCOMPLETE flag set on it so that list traversals ignore it. Once
      this transaction is committed, a second transaction is run to
      atomically mark Y as COMPLETE and X as INCOMPLETE, so that a
      traversal will now find Y and skip X. Once that transaction is
      committed, attribute X is then removed.
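The three-step replace just described can be sketched on a flat array instead of a btree. Everything here is hypothetical (`attr_list`, `attr_replace()`, the flag value); the comments marking transactions only indicate where the commit points would fall.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_ATTR_INCOMPLETE 0x1u   /* hypothetical flag value */
#define SKETCH_MAX_ATTRS 8

struct attr_sketch {
	const char *name;
	const char *value;
	unsigned flags;
	int in_use;
};

struct attr_list {
	struct attr_sketch e[SKETCH_MAX_ATTRS];
};

/* Lookups skip entries that are marked INCOMPLETE. */
static int attr_find(const struct attr_list *l, const char *name)
{
	for (int i = 0; i < SKETCH_MAX_ATTRS; i++)
		if (l->e[i].in_use &&
		    !(l->e[i].flags & SKETCH_ATTR_INCOMPLETE) &&
		    strcmp(l->e[i].name, name) == 0)
			return i;
	return -1;
}

static int attr_insert(struct attr_list *l, const char *name,
		       const char *value, unsigned flags)
{
	for (int i = 0; i < SKETCH_MAX_ATTRS; i++) {
		if (!l->e[i].in_use) {
			l->e[i] = (struct attr_sketch){ name, value, flags, 1 };
			return i;
		}
	}
	return -1;
}

static int attr_replace(struct attr_list *l, const char *name,
			const char *value)
{
	int oldi = attr_find(l, name);
	if (oldi < 0)
		return -1;

	/* Transaction 1: insert Y, hidden from traversals. */
	int newi = attr_insert(l, name, value, SKETCH_ATTR_INCOMPLETE);
	if (newi < 0)
		return -1;

	/* Transaction 2: flip the flags so Y is visible and X is not. */
	l->e[newi].flags &= ~SKETCH_ATTR_INCOMPLETE;
	l->e[oldi].flags |= SKETCH_ATTR_INCOMPLETE;

	/* Transaction 3: remove X. */
	l->e[oldi].in_use = 0;
	return 0;
}
```

The sketch has to remember both `oldi` and `newi` across the steps; the bug in this commit is about exactly that kind of tracking going stale when the tree is split twice.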
      
      So, the initial condition is:
      
           +--------+     +--------+
           |   L1   |     |   L2   |
           | fwd: 2 |---->| fwd: 0 |
           | bwd: 0 |<----| bwd: 1 |
           | fsp: 0 |     | fsp: N |
           |--------|     |--------|
           | attr A |     | attr 1 |
           |--------|     |--------|
           | attr B |     | attr 2 |
           |--------|     |--------|
           ..........     ..........
           |--------|     |--------|
           | attr X |     | attr n |
           +--------+     +--------+
      
      
      So now we go to replace X, and see that L1:fsp = 0 - it is full so
      we can't insert Y in the same leaf. So we record the location of
      attribute X so we can track it for later use, then we split L1 into
      L1 and L3 and rebalance across the two leaves. We end with:
      
      
           +--------+     +--------+     +--------+
           |   L1   |     |   L3   |     |   L2   |
           | fwd: 3 |---->| fwd: 2 |---->| fwd: 0 |
           | bwd: 0 |<----| bwd: 1 |<----| bwd: 3 |
           | fsp: M |     | fsp: J |     | fsp: N |
           |--------|     |--------|     |--------|
           | attr A |     | attr X |     | attr 1 |
           |--------|     +--------+     |--------|
           | attr B |                    | attr 2 |
           |--------|                    |--------|
           ..........                    ..........
           |--------|                    |--------|
           | attr W |                    | attr n |
           +--------+                    +--------+
      
      
      And we track that the original attribute is now at L3:0.
      
      We then try to insert Y into L1 again, and find that there isn't
      enough room because the new attribute is larger than the old one.
      Hence we have to split again to make room for Y. We end up with
      this:
      
      
           +--------+     +--------+     +--------+     +--------+
           |   L1   |     |   L4   |     |   L3   |     |   L2   |
           | fwd: 4 |---->| fwd: 3 |---->| fwd: 2 |---->| fwd: 0 |
           | bwd: 0 |<----| bwd: 1 |<----| bwd: 4 |<----| bwd: 3 |
           | fsp: M |     | fsp: J |     | fsp: J |     | fsp: N |
           |--------|     |--------|     |--------|     |--------|
           | attr A |     | attr Y |     | attr X |     | attr 1 |
           |--------|     + INCOMP +     +--------+     |--------|
           | attr B |     +--------+                    | attr 2 |
           |--------|                                   |--------|
           ..........                                   ..........
           |--------|                                   |--------|
           | attr W |                                   | attr n |
           +--------+                                   +--------+
      
      And now we have the new (incomplete) attribute @ L4:0, and the
      original attribute at L3:0. At this point, the first transaction is
      committed, and we move to the flipping of the flags.
      
      This is where we are supposed to end up with this:
      
           +--------+     +--------+     +--------+     +--------+
           |   L1   |     |   L4   |     |   L3   |     |   L2   |
           | fwd: 4 |---->| fwd: 3 |---->| fwd: 2 |---->| fwd: 0 |
           | bwd: 0 |<----| bwd: 1 |<----| bwd: 4 |<----| bwd: 3 |
           | fsp: M |     | fsp: J |     | fsp: J |     | fsp: N |
           |--------|     |--------|     |--------|     |--------|
           | attr A |     | attr Y |     | attr X |     | attr 1 |
           |--------|     +--------+     + INCOMP +     |--------|
           | attr B |                    +--------+     | attr 2 |
           |--------|                                   |--------|
           ..........                                   ..........
           |--------|                                   |--------|
           | attr W |                                   | attr n |
           +--------+                                   +--------+
      
      But that doesn't happen properly - the attribute tracking indexes
      are not pointing to the right locations. What we end up with is both
      the old attribute to be removed pointing at L4:0 and the new
      attribute at L4:1.  On a debug kernel, this assert fails like so:
      
      XFS: Assertion failed: args->index2 < be16_to_cpu(leaf2->hdr.count), file: fs/xfs/xfs_attr_leaf.c, line: 2725
      
      because the new attribute location does not exist. On a production
      kernel, this goes unnoticed and the code proceeds ahead merrily and
      removes L4 because it thinks that is the block that is no longer
      needed. This leaves the hash index node pointing to entries
      L1, L4 and L2, but only blocks L1, L3 and L2 exist. Further, the
      leaf level sibling list is L1 <-> L4 <-> L2, but L4 is now free
      space, and so everything is busted. This corruption is caused by the
      removal of the old attribute triggering a join - it joins everything
      correctly but then frees the wrong block.
      
      xfs_repair will report something like:
      
      bad sibling back pointer for block 4 in attribute fork for inode 131
      problem with attribute contents in inode 131
      would clear attr fork
      bad nblocks 8 for inode 131, would reset to 3
      bad anextents 4 for inode 131, would reset to 0
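The kind of sibling-pointer inconsistency reported above can be detected with a simple forward walk. This is a hedged sketch, not xfs_repair's algorithm: `leaf_sketch` and `check_sibling_chain()` are hypothetical, blocks are array indexes, and 0 means "no sibling" as in the diagrams.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct leaf_sketch {
	int allocated;   /* 0 models a block that has been freed */
	uint32_t fwd;    /* next sibling, 0 if none */
	uint32_t bwd;    /* previous sibling, 0 if none */
};

/* Walk forward from the first leaf. Return 0 if every link is
 * consistent, -1 on the first problem found. */
static int check_sibling_chain(const struct leaf_sketch *blk, size_t nblk,
			       uint32_t first)
{
	uint32_t prev = 0, cur = first;
	size_t steps = 0;

	while (cur != 0) {
		if (cur >= nblk || !blk[cur].allocated)
			return -1;   /* link points at free space */
		if (blk[cur].bwd != prev)
			return -1;   /* bad sibling back pointer */
		if (++steps > nblk)
			return -1;   /* cycle in the chain */
		prev = cur;
		cur = blk[cur].fwd;
	}
	return 0;
}
```

Run against the intended final layout (L1, L4, L3, L2) the walk succeeds; run against the corrupted layout, where L1 still points forward to the freed L4, it fails on the first hop.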
      
      The problem lies in the assignment of the old/new blocks for
      tracking purposes when the double leaf split occurs. The first split
      tries to place the new attribute inside the current leaf (i.e.
      "inleaf == true") and moves the old attribute (X) to the new block.
      This sets up the old block/index to L1:X, and newly allocated
      block to L3:0. It then moves attr X to the new block and tries to
      insert attr Y at the old index. That fails, so it splits again.
      
      With the second split, the rebalance ends up placing the new attr in
      the second new block - L4:0 - and this is where the code goes wrong.
      What it does is set both the new and old block index to the
      second new block. Hence it inserts attr Y at the right place (L4:0)
      but overwrites the current location of the attr to replace that is
      held in the new block index (currently L3:0). It overwrites it with
      L4:1 - the index we later assert fail on.
      
      Hopefully this table will show this in a format that is a bit easier
      to understand:
      
      Split           old attr index          new attr index
                      vanilla patched         vanilla patched
      before 1st      L1:26   L1:26           N/A     N/A
      after 1st       L3:0    L3:0            L1:26   L1:26
      after 2nd       L4:0    L3:0            L4:1    L4:0
                      ^^^^                    ^^^^
                      wrong                   wrong
      
      The fix is surprisingly simple, for all this analysis - just stop
      the rebalance on the out-of-leaf case from overwriting the new attr
      index - it's already correct for the double split case.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
  4. 08 Nov, 2012 10 commits
  5. 07 Nov, 2012 5 commits
    • xfs: report projid32bit feature in geometry call · 69a58a43
      Eric Sandeen authored
      When xfs gained the projid32bit feature, it was never added to
      the FSGEOMETRY ioctl feature flags, so it's not queryable without
      this patch.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: fix reading of wrapped log data · 009507b0
      Dave Chinner authored
      Commit 44396476 ("xfs: reset buffer pointers before freeing them") in
      3.0-rc1 introduced a regression when recovering log buffers that
      wrapped around the end of log. The second part of the log buffer at
      the start of the physical log was being read into the header buffer
      rather than the data buffer, and hence recovery was seeing garbage
      in the data buffer when it got to the region of the log buffer that
      was incorrectly read.
      
      Cc: <stable@vger.kernel.org> # 3.0.x, 3.2.x, 3.4.x 3.6.x
      Reported-by: Torsten Kaiser <just.for.lkml@googlemail.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
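Reading a region that wraps around the end of a circular log takes two copies, and both must land in the same destination buffer. The sketch below shows that generic technique on a byte array. `read_wrapped()` and its parameters are hypothetical and are not the recovery code's interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy len bytes starting at offset start out of a circular log of
 * ring_size bytes into dst, wrapping to offset 0 if needed. */
static void read_wrapped(const unsigned char *ring, size_t ring_size,
			 size_t start, size_t len, unsigned char *dst)
{
	assert(start < ring_size && len <= ring_size);

	/* First part: from start up to the physical end of the log. */
	size_t first = ring_size - start;
	if (first > len)
		first = len;
	memcpy(dst, ring + start, first);

	/* Second part: from the physical start of the log. It must be
	 * placed in the same data buffer, directly after the first part. */
	memcpy(dst + first, ring, len - first);
}
```

The regression described in this commit corresponds to the second copy targeting the wrong buffer, which leaves the tail of `dst` holding garbage.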
    • xfs: fix buffer shudown reference count mismatch · 137fff09
      Dave Chinner authored
      When we shut down the filesystem, we have to unpin and free all the
      buffers currently active in the CIL. To do this we unpin and remove
      them in one operation as a result of a failed iclogbuf write. For
      buffers, we do this removal via a simulated IO completion after
      marking the buffer stale.
      
      At the time we do this, we have two references to the buffer - the
      active LRU reference and the buf log item.  The LRU reference is
      removed by marking the buffer stale, and the active CIL reference is
      removed by the xfs_buf_iodone() callback that is run by
      xfs_buf_do_callbacks() during ioend processing (via the bp->b_iodone
      callback).
      
      However, ioend processing requires one more reference - that of the
      IO that it is completing. We don't have this reference, so we free
      the buffer prematurely and use it after it is freed. For buffers
      marked with XBF_ASYNC, this leads to assert failures in
      xfs_buf_rele() on debug kernels because the b_hold count is zero.
      
      Fix this by making sure we take the necessary IO reference before
      starting IO completion processing on the stale buffer, and set the
      XBF_ASYNC flag to ensure that IO completion processing removes all
      the active references from the buffer to ensure it is fully torn
      down.
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
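The reference accounting in this commit can be followed with a small counter model. The names (`buf_sketch`, `buf_rele()`, `shutdown_unpin()`) are hypothetical, and the three releases simply mirror the three reference drops the message describes: the LRU reference, the log item reference, and the IO reference.

```c
#include <assert.h>

struct buf_sketch {
	int hold;    /* reference count */
	int stale;
	int freed;   /* set when the last reference is dropped */
};

/* Drop one reference. Return -1 if the count is already zero, which
 * models touching the buffer after it has been freed. */
static int buf_rele(struct buf_sketch *bp)
{
	if (bp->hold == 0)
		return -1;
	if (--bp->hold == 0)
		bp->freed = 1;
	return 0;
}

/* Tear down a buffer at shutdown via a simulated IO completion.
 * take_io_ref == 1 models the fixed behaviour. */
static int shutdown_unpin(struct buf_sketch *bp, int take_io_ref)
{
	int err = 0;

	if (take_io_ref)
		bp->hold++;        /* reference owned by the simulated IO */

	bp->stale = 1;
	err |= buf_rele(bp);       /* marking stale drops the LRU reference */
	err |= buf_rele(bp);       /* iodone callback drops the log item's */
	err |= buf_rele(bp);       /* async IO completion drops the IO's */
	return err ? -1 : 0;
}
```

Starting from two references, the fixed path ends with the count at zero and the buffer freed exactly once, while the unfixed path attempts a third release on an already freed buffer.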
    • xfs: don't vmap inode cluster buffers during free · b6aff29f
      Dave Chinner authored
      Inode buffers do not need to be mapped as inodes are read or written
      directly from/to the pages underlying the buffer. This fixes a
      regression introduced by commit 611c9946 ("xfs: make XBF_MAPPED the
      default behaviour").
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: invalidate allocbt blocks moved to the free list · 4c05f9ad
      Dave Chinner authored
      When we free a block from the alloc btree, we move it to the
      freelist held in the AGFL and mark it busy in the busy extent tree.
      This typically happens when we merge btree blocks.
      
      Once the transaction is committed and checkpointed, the block can
      remain on the free list for an indefinite amount of time.  Now, this
      isn't the end of the world at this point - if the free list is
      shortened, the buffer is invalidated in the transaction that moves
      it back to free space. If the buffer is allocated as metadata from
      the free list, then all the modifications get logged, and we have
      no issues, either. And if it gets allocated as userdata direct from
      the freelist, it gets invalidated and so will never get written.
      
      However, during the time it sits on the free list, pressure on the
      log can cause the AIL to be pushed and the buffer that covers the
      block gets pushed for write. IOWs, we end up writing a freed
      metadata block to disk. Again, this isn't the end of the world
      because we know from the above we are only writing to free space.
      
      The problem, however, is for validation callbacks. If the block was
      an old btree root block, then the level of the block is going to be
      higher than the current tree root, and so will fail validation.
      There may be other inconsistencies in the block as well, and
      currently we don't care because the block is in free space. Shutting
      down the filesystem because a freed block doesn't pass write
      validation, OTOH, is rather unfriendly.
      
      So, make sure we always invalidate buffers as they move from the
      free space trees to the free list so that we guarantee they never
      get written to disk while on the free list.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Phil White <pwhite@sgi.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
  6. 02 Nov, 2012 4 commits
  7. 18 Oct, 2012 2 commits
    • xfs: move allocation stack switch up to xfs_bmapi_allocate · e04426b9
      Dave Chinner authored
      Switching stacks at xfs_alloc_vextent() can cause deadlocks when we
      run out of worker threads on the allocation workqueue. This can
      occur because xfs_bmap_btalloc can make multiple calls to
      xfs_alloc_vextent() and even if xfs_alloc_vextent() fails it can
      return with the AGF locked in the current allocation transaction.
      
      If we then need to make another allocation, and all the allocation
      worker contexts are exhausted because they are blocked waiting for
      the AGF lock, the holder of the AGF cannot get its xfs_alloc_vextent()
      work completed to release the AGF.  Hence allocation effectively
      deadlocks.
      
      To avoid this, move the stack switch one layer up to
      xfs_bmapi_allocate() so that all of the allocation attempts in a
      single switched stack transaction occur in a single worker context.
      This avoids the problem of an allocation being blocked waiting for
      a worker thread whilst holding the AGF.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: introduce XFS_BMAPI_STACK_SWITCH · 2455881c
      Dave Chinner authored
      Certain allocation paths through xfs_bmapi_write() are in situations
      where we have limited stack available. These are almost always in
      the buffered IO writeback path when converting delayed allocation
      extents to real extents.
      
      The current stack switch occurs for userdata allocations, which
      means we also do stack switches for preallocation, direct IO and
      unwritten extent conversion, even though these call chains have never
      been implicated in a stack overrun.
      
      Hence, let's target just the single stack overrun offender for stack
      switches. To do that, introduce a XFS_BMAPI_STACK_SWITCH flag that
      the caller can pass xfs_bmapi_write() to indicate it should switch
      stacks if it needs to do allocation.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>