1. 12 Feb, 2023 33 commits
  2. 10 Feb, 2023 7 commits
    • Dave Chinner's avatar
      xfs: don't assert fail on transaction cancel with deferred ops · 55d5c3a3
      Dave Chinner authored
      We can error out of an allocation transaction when updating BMBT
      blocks when things go wrong. This can be a btree corruption, and
      unexpected ENOSPC, etc. In these cases, we already have deferred ops
      queued for the first allocation that has been done, and we just want
      to cancel out the transaction and shut down the filesystem on error.
      
      In fact, we do just that for production systems - the assert that we
      can't have a transaction with defer ops attached unless we are
      already shut down is bogus and gets in the way of debugging
      whatever issue is actually causing the transaction to be cancelled.
      
      Remove the assert because it is causing spurious test failures to
      hang test machines.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      55d5c3a3
    • Dave Chinner's avatar
      xfs: t_firstblock is tracking AGs not blocks · 692b6cdd
      Dave Chinner authored
      The tp->t_firstblock field is now raelly tracking the highest AG we
      have locked, not the block number of the highest allocation we've
      made. It's purpose is to prevent AGF locking deadlocks, so rename it
      to "highest AG" and simplify the implementation to just track the
      agno rather than a fsbno.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      692b6cdd
    • Dave Chinner's avatar
      xfs: drop firstblock constraints from allocation setup · 36b6ad2d
      Dave Chinner authored
      Now that xfs_alloc_vextent() does all the AGF deadlock prevention
      filtering for multiple allocations in a single transaction, we no
      longer need the allocation setup code to care about what AGs we
      might already have locked.
      
      Hence we can remove all the "nullfb" conditional logic in places
      like xfs_bmap_btalloc() and instead have them focus simply on
      setting up locality constraints. If the allocation fails due to
      AGF lock filtering in xfs_alloc_vextent, then we just fall back as
      we normally do to more relaxed allocation constraints.
      
      As a result, any allocation that allows AG scanning (i.e. not
      confined to a single AG) and does not force a worst case full
      filesystem scan will now be able to attempt allocation from AGs
      lower than that defined by tp->t_firstblock. This is because
      xfs_alloc_vextent() allows try-locking of the AGFs and hence enables
      low space algorithms to at least -try- to get space from AGs lower
      than the one that we have currently locked and allocated from. This
      is a significant improvement in the low space allocation algorithm.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      36b6ad2d
    • Dave Chinner's avatar
      xfs: block reservation too large for minleft allocation · d5753847
      Dave Chinner authored
      When we enter xfs_bmbt_alloc_block() without having first allocated
      a data extent (i.e. tp->t_firstblock == NULLFSBLOCK) because we
      are doing something like unwritten extent conversion, the transaction
      block reservation is used as the minleft value.
      
      This works for operations like unwritten extent conversion, but it
      assumes that the block reservation is only for a BMBT split. THis is
      not always true, and sometimes results in larger than necessary
      minleft values being set. We only actually need enough space for a
      btree split, something we already handle correctly in
      xfs_bmapi_write() via the xfs_bmapi_minleft() calculation.
      
      We should use xfs_bmapi_minleft() in xfs_bmbt_alloc_block() to
      calculate the number of blocks a BMBT split on this inode is going to
      require, not use the transaction block reservation that contains the
      maximum number of blocks this transaction may consume in it...
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      d5753847
    • Dave Chinner's avatar
      xfs: prefer free inodes at ENOSPC over chunk allocation · f08f984c
      Dave Chinner authored
      When an XFS filesystem has free inodes in chunks already allocated
      on disk, it will still allocate new inode chunks if the target AG
      has no free inodes in it. Normally, this is a good idea as it
      preserves locality of all the inodes in a given directory.
      
      However, at ENOSPC this can lead to using the last few remaining
      free filesystem blocks to allocate a new chunk when there are many,
      many free inodes that could be allocated without consuming free
      space. This results in speeding up the consumption of the last few
      blocks and inode create operations then returning ENOSPC when there
      free inodes available because we don't have enough block left in the
      filesystem for directory creation reservations to proceed.
      
      Hence when we are near ENOSPC, we should be attempting to preserve
      the remaining blocks for directory block allocation rather than
      using them for unnecessary inode chunk creation.
      
      This particular behaviour is exposed by xfs/294, when it drives to
      ENOSPC on empty file creation whilst there are still thousands of
      free inodes available for allocation in other AGs in the filesystem.
      
      Hence, when we are within 1% of ENOSPC, change the inode allocation
      behaviour to prefer to use existing free inodes over allocating new
      inode chunks, even though it results is poorer locality of the data
      set. It is more important for the allocations to be space efficient
      near ENOSPC than to have optimal locality for performance, so lets
      modify the inode AG selection code to reflect that fact.
      
      This allows generic/294 to not only pass with this allocator rework
      patchset, but to increase the number of post-ENOSPC empty inode
      allocations to from ~600 to ~9080 before we hit ENOSPC on the
      directory create transaction reservation.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      f08f984c
    • Dave Chinner's avatar
      xfs: fix low space alloc deadlock · 1dd0510f
      Dave Chinner authored
      I've recently encountered an ABBA deadlock with g/476. The upcoming
      changes seem to make this much easier to hit, but the underlying
      problem is a pre-existing one.
      
      Essentially, if we select an AG for allocation, then lock the AGF
      and then fail to allocate for some reason (e.g. minimum length
      requirements cannot be satisfied), then we drop out of the
      allocation with the AGF still locked.
      
      The caller then modifies the allocation constraints - usually
      loosening them up - and tries again. This can result in trying to
      access AGFs that are lower than the AGF we already have locked from
      the failed attempt. e.g. the failed attempt skipped several AGs
      before failing, so we have locks an AG higher than the start AG.
      Retrying the allocation from the start AG then causes us to violate
      AGF lock ordering and this can lead to deadlocks.
      
      The deadlock exists even if allocation succeeds - we can do a
      followup allocations in the same transaction for BMBT blocks that
      aren't guaranteed to be in the same AG as the original, and can move
      into higher AGs. Hence we really need to move the tp->t_firstblock
      tracking down into xfs_alloc_vextent() where it can be set when we
      exit with a locked AG.
      
      xfs_alloc_vextent() can also check there if the requested
      allocation falls within the allow range of AGs set by
      tp->t_firstblock. If we can't allocate within the range set, we have
      to fail the allocation. If we are allowed to to non-blocking AGF
      locking, we can ignore the AG locking order limitations as we can
      use try-locks for the first iteration over requested AG range.
      
      This invalidates a set of post allocation asserts that check that
      the allocation is always above tp->t_firstblock if it is set.
      Because we can use try-locks to avoid the deadlock in some
      circumstances, having a pre-existing locked AGF doesn't always
      prevent allocation from lower order AGFs. Hence those ASSERTs need
      to be removed.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      1dd0510f
    • Darrick J. Wong's avatar
      xfs: revert commit 8954c44f · dd07bb8b
      Darrick J. Wong authored
      The name passed into __xfs_xattr_put_listent is exactly namelen bytes
      long and not null-terminated.  Passing namelen+1 to the strscpy function
      
          strscpy(offset, (char *)name, namelen + 1);
      
      is therefore wrong.  Go back to the old code, which works fine because
      strncpy won't find a null in @name and stops after namelen bytes.  It
      really could be a memcpy call, but it worked for years.
      
      Reported-by: syzbot+898115bc6d7140437215@syzkaller.appspotmail.com
      Fixes: 8954c44f ("xfs: use strscpy() to instead of strncpy()")
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      dd07bb8b