1. 14 Jul, 2022 19 commits
    • xfs: fix use-after-free in xattr node block inactivation · 95ff0363
      Darrick J. Wong authored
      The kernel build robot reported a UAF error while running xfs/433
      (edited somewhat for brevity):
      
       BUG: KASAN: use-after-free in xfs_attr3_node_inactive (fs/xfs/xfs_attr_inactive.c:214) xfs
       Read of size 4 at addr ffff88820ac2bd44 by task kworker/0:2/139
      
       CPU: 0 PID: 139 Comm: kworker/0:2 Tainted: G S                5.19.0-rc2-00004-g7cf2b0f9 #1
       Hardware name: Hewlett-Packard p6-1451cx/2ADA, BIOS 8.15 02/05/2013
       Workqueue: xfs-inodegc/sdb4 xfs_inodegc_worker [xfs]
       Call Trace:
        <TASK>
       dump_stack_lvl (lib/dump_stack.c:107 (discriminator 1))
       print_address_description+0x1f/0x200
       print_report.cold (mm/kasan/report.c:430)
       kasan_report (mm/kasan/report.c:162 mm/kasan/report.c:493)
       xfs_attr3_node_inactive (fs/xfs/xfs_attr_inactive.c:214) xfs
       xfs_attr3_root_inactive (fs/xfs/xfs_attr_inactive.c:296) xfs
       xfs_attr_inactive (fs/xfs/xfs_attr_inactive.c:371) xfs
       xfs_inactive (fs/xfs/xfs_inode.c:1781) xfs
       xfs_inodegc_worker (fs/xfs/xfs_icache.c:1837 fs/xfs/xfs_icache.c:1860) xfs
       process_one_work
       worker_thread
       kthread
       ret_from_fork
        </TASK>
      
       Allocated by task 139:
       kasan_save_stack (mm/kasan/common.c:39)
       __kasan_slab_alloc (mm/kasan/common.c:45 mm/kasan/common.c:436 mm/kasan/common.c:469)
       kmem_cache_alloc (mm/slab.h:750 mm/slub.c:3214 mm/slub.c:3222 mm/slub.c:3229 mm/slub.c:3239)
       _xfs_buf_alloc (include/linux/instrumented.h:86 include/linux/atomic/atomic-instrumented.h:41 fs/xfs/xfs_buf.c:232) xfs
       xfs_buf_get_map (fs/xfs/xfs_buf.c:660) xfs
       xfs_buf_read_map (fs/xfs/xfs_buf.c:777) xfs
       xfs_trans_read_buf_map (fs/xfs/xfs_trans_buf.c:289) xfs
       xfs_da_read_buf (fs/xfs/libxfs/xfs_da_btree.c:2652) xfs
       xfs_da3_node_read (fs/xfs/libxfs/xfs_da_btree.c:392) xfs
       xfs_attr3_root_inactive (fs/xfs/xfs_attr_inactive.c:272) xfs
       xfs_attr_inactive (fs/xfs/xfs_attr_inactive.c:371) xfs
       xfs_inactive (fs/xfs/xfs_inode.c:1781) xfs
       xfs_inodegc_worker (fs/xfs/xfs_icache.c:1837 fs/xfs/xfs_icache.c:1860) xfs
       process_one_work
       worker_thread
       kthread
       ret_from_fork
      
       Freed by task 139:
       kasan_save_stack (mm/kasan/common.c:39)
       kasan_set_track (mm/kasan/common.c:45)
       kasan_set_free_info (mm/kasan/generic.c:372)
       __kasan_slab_free (mm/kasan/common.c:368 mm/kasan/common.c:328 mm/kasan/common.c:374)
       kmem_cache_free (mm/slub.c:1753 mm/slub.c:3507 mm/slub.c:3524)
       xfs_buf_rele (fs/xfs/xfs_buf.c:1040) xfs
       xfs_attr3_node_inactive (fs/xfs/xfs_attr_inactive.c:210) xfs
       xfs_attr3_root_inactive (fs/xfs/xfs_attr_inactive.c:296) xfs
       xfs_attr_inactive (fs/xfs/xfs_attr_inactive.c:371) xfs
       xfs_inactive (fs/xfs/xfs_inode.c:1781) xfs
       xfs_inodegc_worker (fs/xfs/xfs_icache.c:1837 fs/xfs/xfs_icache.c:1860) xfs
       process_one_work
       worker_thread
       kthread
       ret_from_fork
      
      I reproduced this for my own satisfaction, and got the same report,
      along with an extra morsel:
      
       The buggy address belongs to the object at ffff88802103a800
        which belongs to the cache xfs_buf of size 432
       The buggy address is located 396 bytes inside of
        432-byte region [ffff88802103a800, ffff88802103a9b0)
      
      I tracked this code down to:
      
      	error = xfs_trans_get_buf(*trans, mp->m_ddev_targp,
      			child_blkno,
      			XFS_FSB_TO_BB(mp, mp->m_attr_geo->fsbcount), 0,
      			&child_bp);
      	if (error)
      		return error;
      	error = bp->b_error;
      
      That doesn't look right -- I think this should be dereferencing
      child_bp, not bp.  Looking through the codebase history, I think this
      was added by commit 2911edb6 ("xfs: remove the mappedbno argument to
      xfs_da_get_buf"), which replaced a call to xfs_da_get_buf with the
      current call to xfs_trans_get_buf.  Not sure why we trans_brelse'd @bp
      earlier in the function, but I'm guessing it's to avoid pinning too many
      buffers in memory while we inactivate the bottom of the attr tree.
      Hence we now have to get the buffer back.
      
      I /think/ this was supposed to check child_bp->b_error and fail the rest
      of the invalidation if child_bp had experienced any kind of IO or
      corruption error.  I bet the xfs_da3_node_read earlier in the loop will
      catch most cases of incoming on-disk corruption which makes this check
      mostly moot unless someone corrupts the buffer and the AIL pushes it out
      to disk while the buffer's unlocked.
      
      In the first case we'll never get to the bad check, and in the second
      case the AIL will shut down the log, at which point there's no reason to
      check b_error.  Remove the check, and null out @bp to avoid this problem
      in the future.
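
The "null out @bp" idea can be illustrated outside the kernel. The following is a userspace sketch under stated assumptions, not the XFS code: `struct buf`, `buf_get()` and `buf_rele()` are hypothetical stand-ins for a reference-counted buffer. It shows how clearing the caller's pointer at release time turns a later stale access into an obvious NULL dereference instead of a read of freed memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a reference-counted buffer; not struct xfs_buf. */
struct buf {
	int	refcount;
	int	error;
};

/* Allocate a buffer holding one reference; returns NULL on allocation failure. */
static struct buf *buf_get(void)
{
	struct buf *bp = calloc(1, sizeof(*bp));

	if (bp)
		bp->refcount = 1;
	return bp;
}

/*
 * Drop the caller's reference and clear the caller's pointer, so that any
 * later use is a NULL dereference rather than a use-after-free.
 */
static void buf_rele(struct buf **bpp)
{
	struct buf *bp = *bpp;

	*bpp = NULL;
	if (bp && --bp->refcount == 0)
		free(bp);
}

/* Returns 1 if the pointer was cleared by the release, 0 otherwise. */
static int release_clears_pointer(void)
{
	struct buf *bp = buf_get();

	if (!bp)
		return 0;
	buf_rele(&bp);
	return bp == NULL;
}
```

Passing the pointer by address is a design choice of this sketch only; it simply makes the "release, then forget" step impossible to skip.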
      
      Cc: hch@lst.de
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Fixes: 2911edb6 ("xfs: remove the mappedbno argument to xfs_da_get_buf")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      95ff0363
    • Merge tag 'make-attr-fork-permanent-5.20_2022-07-14' of... · 6d200bdc
      Darrick J. Wong authored
      Merge tag 'make-attr-fork-permanent-5.20_2022-07-14' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-5.20-mergeB
      
      xfs: make attr forks permanent
      
      This series fixes a use-after-free bug that syzbot uncovered.  The UAF
      itself is a result of a race condition between getxattr and removexattr
      because callers to getxattr do not necessarily take any sort of locks
      before calling into the filesystem.
      
      Although the race condition itself can be fixed through clever use of a
      memory barrier, further consideration of the use cases of extended
      attributes shows that most files always have at least one attribute, so
      we might as well make them permanent.
      
      v2: Minor tweaks suggested by Dave, and convert some more macros to
      helper functions.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      
      * tag 'make-attr-fork-permanent-5.20_2022-07-14' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
        xfs: replace inode fork size macros with functions
        xfs: replace XFS_IFORK_Q with a proper predicate function
        xfs: use XFS_IFORK_Q to determine the presence of an xattr fork
        xfs: make inode attribute forks a permanent part of struct xfs_inode
        xfs: convert XFS_IFORK_PTR to a static inline helper
      6d200bdc
    • Merge tag 'xfs-buf-lockless-lookup-5.20' of... · 35c5a09f
      Darrick J. Wong authored
      Merge tag 'xfs-buf-lockless-lookup-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-5.20-mergeB
      
      xfs: lockless buffer cache lookups
      
      Current work to merge the XFS inode life cycle with the VFS inode
      life cycle is finding some interesting issues. If we have a path
      that hits buffer trylocks fairly hard (e.g. a non-blocking
      background inode freeing function), we end up hitting massive
      contention on the buffer cache hash locks:
      
      -   92.71%     0.05%  [kernel]                  [k] xfs_inodegc_worker
         - 92.67% xfs_inodegc_worker
            - 92.13% xfs_inode_unlink
               - 91.52% xfs_inactive_ifree
                  - 85.63% xfs_read_agi
                     - 85.61% xfs_trans_read_buf_map
                        - 85.59% xfs_buf_read_map
                           - xfs_buf_get_map
                              - 85.55% xfs_buf_find
                                 - 72.87% _raw_spin_lock
                                    - do_raw_spin_lock
                                         71.86% __pv_queued_spin_lock_slowpath
                                 - 8.74% xfs_buf_rele
                                    - 7.88% _raw_spin_lock
                                       - 7.88% do_raw_spin_lock
                                            7.63% __pv_queued_spin_lock_slowpath
                                 - 1.70% xfs_buf_trylock
                                    - 1.68% down_trylock
                                       - 1.41% _raw_spin_lock_irqsave
                                          - 1.39% do_raw_spin_lock
                                               __pv_queued_spin_lock_slowpath
                                 - 0.76% _raw_spin_unlock
                                      0.75% do_raw_spin_unlock
      
      This is basically hammering the pag->pag_buf_lock from lots of CPUs
      doing trylocks at the same time. Most of the buffer trylock
      operations ultimately fail after we've done the lookup, so we're
      really hammering the buf hash lock whilst making no progress.
      
      We can also see significant spinlock traffic on the same lock just
      under normal operation when lots of tasks are accessing metadata
      from the same AG, so let's avoid all this by creating a lookup fast
      path which leverages the rhashtable's ability to do RCU protected
      lookups.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      
      * tag 'xfs-buf-lockless-lookup-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
        xfs: lockless buffer lookup
        xfs: remove a superflous hash lookup when inserting new buffers
        xfs: reduce the number of atomic when locking a buffer after lookup
        xfs: merge xfs_buf_find() and xfs_buf_get_map()
        xfs: break up xfs_buf_find() into individual pieces
        xfs: rework xfs_buf_incore() API
      35c5a09f
    • Merge tag 'xfs-iunlink-item-5.20' of... · 4613b17c
      Darrick J. Wong authored
      Merge tag 'xfs-iunlink-item-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-5.20-mergeB
      
      xfs: introduce in-memory inode unlink log items
      
      To facilitate future improvements in inode logging and improving
      inode cluster buffer locking order consistency, we need a new
      mechanism for deferring inode cluster buffer modifications during
      unlinked list modifications.
      
      The unlinked inode list buffer locking is complex. The unlinked
      list is unordered - we add to the tail, remove from wherever the
      inode is in the list. Hence we might need to lock two inode buffers
      here (previous inode in list and the one being removed). While we
      can order the locking of these buffers correctly within the confines
      of the unlinked list, there may be other inodes that need buffer
      locking in the same transaction. e.g. O_TMPFILE being linked into a
      directory also modifies the directory inode.
      
      Hence we need a mechanism for deferring unlinked inode list updates
      until a point where we know that all modifications have been made
      and all that remains is to lock and modify the cluster buffers.
      
      We can do this by first observing that we serialise unlinked list
      modifications by holding the AGI buffer lock. IOWs, the AGI is going
      to be locked until the transaction commits any time we modify the
      unlinked list. Hence it doesn't matter at what point in the unlink
      transaction we actually load, lock and modify the inode
      cluster buffer.
      
      We add an in-memory unlinked inode log item to defer the inode
      cluster buffer update to transaction commit time where it can be
      ordered with all the other inode cluster operations that need to be
      done. Essentially all we need to do is record the inodes that need
      to have their unlinked list pointer updated in a new log item that
      we attach to the transaction.
      
      This log item exists purely for the purpose of delaying the update
      of the unlinked list pointer until the inode cluster buffer can be
      locked in the correct order around the other inode cluster buffers.
      It plays no part in the actual commit, and there's no change to
      anything that is written to the log. i.e. the inode cluster buffers
      still have to be fully logged here (not just ordered) as log
      recovery depends on this to replay mods to the unlinked inode
      list.
      
      Hence if we add a "precommit" hook into xfs_trans_commit()
      to run a "precommit" operation on these iunlink log items, we can
      delay the locking, modification and logging of the inode cluster
      buffer until after all other modifications have been made. The
      precommit hook requires us to sort the items that are going to be run
      so that we can lock precommit items in the correct order as we
      perform the modifications they describe.
      
      To make this unlinked inode list processing simpler and easier to
      implement as a log item, we need to change the way we track the
      unlinked list in memory. Starting from the observation that an inode
      on the unlinked list is pinned in memory by the VFS, we can use the
      xfs_inode itself to track the unlinked list. To do this efficiently,
      we want the unlinked list to be a double linked list. The problem
      here is that we need a list per AGI unlinked list, and there are 64
      of these per AGI. The approach taken in this patchset is to shadow
      the AGI unlinked list heads in the perag, and link inodes by agino,
      hence requiring only 8 extra bytes per inode to track this state.
      
      We can then use the agino pointers for lockless inode cache lookups
      to retrieve the inode. The aginos in the inode are modified only
      under the AGI lock, just like the cluster buffer pointers, so we
      don't need any extra locking here.  The i_next_unlinked field tracks
      the on-disk value of the unlinked list, and the i_prev_unlinked is a
      purely in-memory pointer that enables us to efficiently remove
      inodes from the middle of the list.
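
The value of the back pointer can be shown with a small userspace sketch. Everything here is an assumption made for illustration: `struct ino`, `struct unlinked_list`, `NULL_INO` and the fixed-size `cache` array (standing in for an inode cache lookup by agino) are hypothetical and are not the XFS structures. The point is only that with both a next and a previous index, an entry can be unlinked from the middle of the list without walking it, at a cost of two 32-bit fields per inode.

```c
#include <assert.h>
#include <stdint.h>

#define NULL_INO	UINT32_MAX	/* hypothetical "no inode" marker */
#define NR_INODES	8

/* Hypothetical in-memory inode, linked by index rather than by pointer. */
struct ino {
	uint32_t	next_unlinked;	/* mirrors the on-disk forward link */
	uint32_t	prev_unlinked;	/* in-memory only back link */
};

struct unlinked_list {
	uint32_t	head;			/* shadow of the list head */
	struct ino	cache[NR_INODES];	/* stand-in for an inode cache lookup */
};

/* Unlink 'ino' from anywhere in the list without traversing the list. */
static void list_remove(struct unlinked_list *ul, uint32_t ino)
{
	struct ino *ip = &ul->cache[ino];
	uint32_t next = ip->next_unlinked;
	uint32_t prev = ip->prev_unlinked;

	/* Point the predecessor (or the head) past the removed entry. */
	if (prev == NULL_INO)
		ul->head = next;
	else
		ul->cache[prev].next_unlinked = next;

	/* Point the successor, if any, back at the predecessor. */
	if (next != NULL_INO)
		ul->cache[next].prev_unlinked = prev;

	ip->next_unlinked = NULL_INO;
	ip->prev_unlinked = NULL_INO;
}
```

In this sketch the caller is assumed to hold whatever lock serialises list updates, matching the role the AGI lock plays in the description above.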
      
      This results in moving a lot of the unlink modification work into
      the precommit operations on the unlink log item. Tracking all the
      unlinked inodes in the inodes themselves also gets rid of the
      unlinked list reference hash table that is used to track this back
      pointer relationship. This greatly simplifies the unlinked list
      modification code, and removes memory allocations in this hot path
      to track back pointers. This, overall, slightly reduces the CPU
      overhead of the unlink path.
      
      The result of this log item means that we move all the actual
      manipulation of objects to be logged out of the iunlink path and
      into the iunlink item. This allows for future optimisation of this
      mechanism without needing changes to high level unlink path, as
      well as making the unlink lock ordering predictable and synchronised
      with other operations that may require inode cluster locking.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      
      * tag 'xfs-iunlink-item-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
        xfs: add in-memory iunlink log item
        xfs: add log item precommit operation
        xfs: combine iunlink inode update functions
        xfs: clean up xfs_iunlink_update_inode()
        xfs: double link the unlinked inode list
        xfs: introduce xfs_iunlink_lookup
        xfs: refactor xlog_recover_process_iunlinks()
        xfs: track the iunlink list pointer in the xfs_inode
        xfs: factor the xfs_iunlink functions
        xfs: flush inode gc workqueue before clearing agi bucket
      4613b17c
    • xfs: lockless buffer lookup · 298f3422
      Dave Chinner authored
      Now that we have a standalone fast path for buffer lookup, we can
      easily convert it to use rcu lookups. When we continually hammer the
      buffer cache with trylock lookups, we end up with a huge amount of
      lock contention on the per-ag buffer hash locks:
      
      -   92.71%     0.05%  [kernel]                  [k] xfs_inodegc_worker
         - 92.67% xfs_inodegc_worker
            - 92.13% xfs_inode_unlink
               - 91.52% xfs_inactive_ifree
                  - 85.63% xfs_read_agi
                     - 85.61% xfs_trans_read_buf_map
                        - 85.59% xfs_buf_read_map
                           - xfs_buf_get_map
                              - 85.55% xfs_buf_find
                                 - 72.87% _raw_spin_lock
                                    - do_raw_spin_lock
                                         71.86% __pv_queued_spin_lock_slowpath
                                 - 8.74% xfs_buf_rele
                                    - 7.88% _raw_spin_lock
                                       - 7.88% do_raw_spin_lock
                                            7.63% __pv_queued_spin_lock_slowpath
                                 - 1.70% xfs_buf_trylock
                                    - 1.68% down_trylock
                                       - 1.41% _raw_spin_lock_irqsave
                                          - 1.39% do_raw_spin_lock
                                               __pv_queued_spin_lock_slowpath
                                 - 0.76% _raw_spin_unlock
                                      0.75% do_raw_spin_unlock
      
      This is basically hammering the pag->pag_buf_lock from lots of CPUs
      doing trylocks at the same time. Most of the buffer trylock
      operations ultimately fail after we've done the lookup, so we're
      really hammering the buf hash lock whilst making no progress.
      
      We can also see significant spinlock traffic on the same lock just
      under normal operation when lots of tasks are accessing metadata
      from the same AG, so let's avoid all this by converting the lookup
      fast path to leverage the rhashtable's ability to do rcu protected
      lookups.
      
      We avoid races with the buffer release path by using
      atomic_inc_not_zero() on the buffer hold count. Any buffer that is
      in the LRU will have a non-zero count, thereby allowing the lockless
      fast path to be taken in most cache hit situations. If the buffer
      hold count is zero, then it is likely going through the release path
      so in that case we fall back to the existing lookup miss slow path.
      
      The slow path will then do an atomic lookup and insert under the
      buffer hash lock and hence serialise correctly against buffer
      release freeing the buffer.
      
      The use of rcu protected lookups means that buffer handles now need
      to be freed by RCU callbacks (same as inodes). We still free the
      buffer pages before the RCU callback - we won't be trying to access
      them at all on a buffer that has zero references - but we need the
      buffer handle itself to be present for the entire rcu protected read
      side to detect a zero hold count correctly.
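
The "take a reference only if the count is non-zero" step can be sketched in userspace C11. This is an illustration under stated assumptions, not the kernel implementation: `struct obj` and `hold_inc_not_zero()` are hypothetical names, and C11 `<stdatomic.h>` stands in for the kernel's atomic primitives. It shows the compare-and-swap loop that refuses to resurrect an object whose hold count has already dropped to zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical object with a hold count; not struct xfs_buf. */
struct obj {
	atomic_int	hold;
};

/*
 * Take a reference only if the count is currently non-zero.
 * A zero count means the object may be on its way to being freed, so the
 * caller must fall back to a slow path that takes the lock.
 */
static bool hold_inc_not_zero(struct obj *o)
{
	int old = atomic_load(&o->hold);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&o->hold, &old, old + 1))
			return true;
	}
	return false;
}
```

For this check to be safe, the memory holding the count must stay valid while a lockless reader inspects it, which is why the text above defers freeing the handle to an RCU callback.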
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      298f3422
    • xfs: remove a superflous hash lookup when inserting new buffers · 32dd4f9c
      Dave Chinner authored
      Currently on the slow path insert we repeat the initial hash table
      lookup before we attempt the insert, resulting in two traversals
      of the hash table to ensure the insert is valid. The rhashtable API
      provides a method for an atomic lookup and insert operation, so we
      can avoid one of the hash table traversals by using this method.
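
The saving from a combined lookup-and-insert can be sketched with a toy chained hash table. This is a single-threaded illustration only: `struct node`, `struct table` and `lookup_or_insert()` are hypothetical, and the sketch does not use or model the rhashtable API or its locking. It shows one walk of the bucket chain either finding an existing entry or appending the new one, rather than a lookup walk followed by a separate insert walk.

```c
#include <assert.h>
#include <stddef.h>

#define NR_BUCKETS	16

/* Hypothetical hash table entry keyed by an integer. */
struct node {
	unsigned int	key;
	struct node	*next;
};

struct table {
	struct node	*bucket[NR_BUCKETS];
};

/*
 * Walk the chain once. If an entry with the same key exists, return it and
 * leave the table unchanged. Otherwise append 'new' and return NULL.
 */
static struct node *lookup_or_insert(struct table *t, struct node *new)
{
	struct node **pp = &t->bucket[new->key % NR_BUCKETS];

	for (; *pp; pp = &(*pp)->next) {
		if ((*pp)->key == new->key)
			return *pp;
	}
	new->next = NULL;
	*pp = new;
	return NULL;
}
```

A caller that gets a non-NULL result would discard its own new entry and use the returned one instead.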
      
      Adapted from a large patch containing this optimisation by Christoph
      Hellwig.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      32dd4f9c
    • xfs: reduce the number of atomic when locking a buffer after lookup · d8d9bbb0
      Dave Chinner authored
      Avoid an extra atomic operation in the non-trylock case by only
      doing a trylock if the XBF_TRYLOCK flag is set. This follows the
      pattern in the IO path with NOWAIT semantics where the
      "trylock-fail-lock" path showed 5-10% reduced throughput compared to
      just using a single lock call when not under NOWAIT conditions. So
      make that same change here, too.
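
The control flow being described can be sketched in userspace as follows. All names are assumptions for illustration: `LOCK_TRYLOCK` is a hypothetical flag modelled on XBF_TRYLOCK, and `struct toy_lock` is a toy spinlock built on C11 `atomic_flag`, not the buffer lock itself. The sketch shows only that callers who did not ask for non-blocking behaviour go straight to the blocking lock rather than trying, failing and then locking.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define LOCK_TRYLOCK	(1u << 0)	/* hypothetical flag, modelled on XBF_TRYLOCK */

/* Toy spinlock standing in for the buffer lock. */
struct toy_lock {
	atomic_flag	held;
};

/* Returns true if the lock was acquired without waiting. */
static bool toy_trylock(struct toy_lock *l)
{
	return !atomic_flag_test_and_set(&l->held);
}

static void toy_lock_acquire(struct toy_lock *l)
{
	while (atomic_flag_test_and_set(&l->held))
		;	/* spin until the holder releases the lock */
}

static void toy_unlock(struct toy_lock *l)
{
	atomic_flag_clear(&l->held);
}

/*
 * Lock after a successful lookup. Only trylock when the caller asked for
 * it; otherwise issue a single blocking lock call.
 * Returns false only when a requested trylock found the lock held.
 */
static bool lock_after_lookup(struct toy_lock *l, unsigned int flags)
{
	if (flags & LOCK_TRYLOCK)
		return toy_trylock(l);
	toy_lock_acquire(l);
	return true;
}
```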
      
      See commit 942491c9 ("xfs: fix AIM7 regression") for details.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      [hch: split from a larger patch]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      d8d9bbb0
    • xfs: merge xfs_buf_find() and xfs_buf_get_map() · 34800080
      Dave Chinner authored
      Now that we have factored xfs_buf_find(), we can start separating into
      distinct fast and slow paths from xfs_buf_get_map(). We start by
      moving the lookup map and perag setup to _get_map(), and then move
      all the specifics of the fast path lookup into xfs_buf_lookup()
      and call it directly from _get_map(). We then move all the slow path
      code to xfs_buf_find_insert(), which is now also called directly
      from _get_map(). As such, xfs_buf_find() now goes away.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      34800080
    • xfs: break up xfs_buf_find() into individual pieces · de67dc57
      Dave Chinner authored
      xfs_buf_find() is made up of three main parts: lookup, insert and
      locking. The interactions with xfs_buf_get_map() require it to be
      called twice - once for a pure lookup, and again on lookup failure
      so the insert path can be run. We want to simplify this down a lot,
      so split it into a fast path lookup, a slow path insert and a "lock
      the found buffer" helper. This will then let us integrate these
      operations more effectively into xfs_buf_get_map() in future
      patches.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      de67dc57
    • xfs: add in-memory iunlink log item · 784eb7d8
      Dave Chinner authored
      Now that we have a clean operation to update the di_next_unlinked
      field of inode cluster buffers, we can easily defer this operation
      to transaction commit time so we can order the inode cluster buffer
      locking consistently.
      
      To do this, we introduce a new in-memory log item to track the
      unlinked list item modification that we are going to make. This
      follows the same observations as the in-memory double linked list
      used to track unlinked inodes in that the inodes on the list are
      pinned in memory and cannot go away, and hence we can simply
      reference them for the duration of the transaction without needing
      to take active references or pin them or look them up.
      
      This allows us to pass the xfs_inode to the transaction commit code
      along with the modification to be made, and then order the logged
      modifications via the ->iop_sort and ->iop_precommit operations
      for the new log item type. As this is an in-memory log item, it
      doesn't have formatting, CIL or AIL operational hooks - it exists
      purely to run the inode unlink modifications and is then removed
      from the transaction item list and freed once the precommit
      operation has run.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      784eb7d8
    • xfs: add log item precommit operation · fad743d7
      Dave Chinner authored
      For inodes that are dirty, we have an attached cluster buffer that
      we want to use to track the dirty inode through the AIL.
      Unfortunately, locking the cluster buffer and adding it to the
      transaction when the inode is first logged in a transaction leads to
      buffer lock ordering inversions.
      
      The specific problem is ordering against the AGI buffer. When
      modifying unlinked lists, the buffer lock order is AGI -> inode
      cluster buffer as the AGI buffer lock serialises all access to the
      unlinked lists. Unfortunately, functionality like xfs_droplink()
      logs the inode before calling xfs_iunlink(), as do various directory
      manipulation functions. The inode can be logged way down in the
      stack as far as the bmapi routines and hence, without a major
      rewrite of lots of APIs there's no way we can avoid the inode being
      logged by something until after the AGI has been logged.
      
      As we are going to be using ordered buffers for inode AIL tracking,
      there isn't a need to actually lock that buffer against modification
      as all the modifications are captured by logging the inode item
      itself. Hence we don't actually need to join the cluster buffer into
      the transaction until just before it is committed. This means we do
      not perturb any of the existing buffer lock orders in transactions,
      and the inode cluster buffer is always locked last in a transaction
      that doesn't otherwise touch inode cluster buffers.
      
      We do this by introducing a precommit log item method.  This commit
      just introduces the mechanism; the inode item implementation is in
      followup commits.
      
      The precommit items need to be sorted into consistent order as we
      may be locking multiple items here. Hence if we have two dirty
      inodes in cluster buffers A and B, and some other transaction has
      two separate dirty inodes in the same cluster buffers, locking them
      in different orders opens us up to ABBA deadlocks. Hence we sort the
      items on the transaction based on the presence of a sort log item
      method.
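
The ABBA argument can be made concrete with a short userspace sketch. The names are assumptions: `struct item`, its `sort_key` field and `precommit_lock_all()` are hypothetical, with `sort_key` playing the role of whatever value a sort method would return, and "locking" is modelled by recording the order in which items are visited. The sketch shows that sorting before locking gives every transaction the same lock order regardless of the order in which items were added.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical precommit item; 'sort_key' orders the locking. */
struct item {
	uint64_t	sort_key;
	int		locked_at;	/* position in the lock order, -1 if unlocked */
};

/* qsort() comparator over an array of item pointers, ascending by key. */
static int item_cmp(const void *a, const void *b)
{
	const struct item *ia = *(const struct item * const *)a;
	const struct item *ib = *(const struct item * const *)b;

	if (ia->sort_key < ib->sort_key)
		return -1;
	return ia->sort_key > ib->sort_key;
}

/*
 * Sort the items by key, then "lock" them in that order. Two callers
 * holding the same items therefore always lock them in the same order.
 */
static void precommit_lock_all(struct item **items, size_t nr)
{
	size_t i;

	qsort(items, nr, sizeof(*items), item_cmp);
	for (i = 0; i < nr; i++)
		items[i]->locked_at = (int)i;
}
```

Two transactions that gather items A and B in opposite orders both end up locking the lower key first, which is the property that rules out the ABBA deadlock.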
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      fad743d7
    • xfs: combine iunlink inode update functions · 062efdb0
      Dave Chinner authored
      Combine the logging of the inode unlink list update into the
      calling function that looks up the buffer we end up logging. These
      do not need to be separate functions as they are both short, simple
      operations and there's only a single call path through them. This
      new function will end up being the core of the iunlink log item
      processing...
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      062efdb0
    • xfs: clean up xfs_iunlink_update_inode() · 5301f870
      Dave Chinner authored
      We no longer need to have this function return the previous next
      agino value from the on-disk inode as we have it in the in-core
      inode now.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      5301f870
    • xfs: double link the unlinked inode list · 2fd26cc0
      Dave Chinner authored
      Now we have forwards traversal via the incore inode in place, we now
      need to add back pointers to the incore inode to entirely replace
      the back reference cache. We use the same lookup semantics and
      constraints as for the forwards pointer lookups during unlinks, and
      so we can look up any inode in the unlinked list directly and update
      the list pointers, forwards or backwards, at any time.
      
      The only wrinkle in converting the unlinked list manipulations to
      use in-core previous pointers is that log recovery doesn't have the
      incore inode state built up so it can't just read in an inode and
      release it to finish off the unlink. Hence we need to modify the
      traversal in recovery to read one inode ahead before we
      release the inode at the head of the list. This populates the
      next->prev relationship sufficient to be able to replay the unlinked
      list and hence greatly simplify the runtime code.
      
      This recovery algorithm also requires that we actually remove inodes
      from the unlinked list one at a time as background inode
      inactivation will result in unlinked list removal racing with the
      building of the in-memory unlinked list state. We could serialise
      this by holding the AGI buffer lock when constructing the in memory
      state, but all that does is lockstep background processing with list
      building. It is much simpler to flush the inodegc immediately after
      releasing the inode so that it is unlinked immediately and there are
      no races at all.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      2fd26cc0
    • xfs: introduce xfs_iunlink_lookup · a83d5a8b
      Dave Chinner authored
      When an inode is on an unlinked list during normal operation, it is
      guaranteed to be pinned in memory as it is either referenced by the
      current unlink operation or it has an open file descriptor that
      references it and has it pinned in memory. Hence to look up an inode
      on the unlinked list, we can do a direct inode cache lookup and
      always expect the lookup to succeed.
      
      Add a function to do this lookup based on the agino that we use to
      link the chain of unlinked inodes together so we can begin the
      conversion of the unlinked list manipulations to use in-memory inodes
      rather than inode cluster buffers and remove the backref cache.
      
      Use this lookup function to replace the on-disk inode buffer walk
      when removing inodes from the unlinked list with an in-core inode
      unlinked list walk.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      a83d5a8b
    • xfs: refactor xlog_recover_process_iunlinks() · 04755d2e
      Dave Chinner authored
      For upcoming changes to the way inode unlinked list processing is
      done, the structure of recovery needs to change slightly. We also
      really need to untangle the messy error handling in list recovery
      so that actions like emptying the bucket on inode lookup failure
      are associated with the bucket list walk failing, not failing
      to look up the inode.
      
      Refactor the recovery code now to keep the re-organisation separate
      from the algorithm changes.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      04755d2e
    • Dave Chinner's avatar
      xfs: track the iunlink list pointer in the xfs_inode · 4fcc94d6
      Dave Chinner authored
      Having direct access to the i_next_unlinked pointer in unlinked
      inodes greatly simplifies the processing of inodes on the unlinked
      list. We no longer need to look up the inode buffer just to find
      the next inode in the list if the xfs_inode is in memory. These
      improvements will be realised over upcoming patches as other
      dependencies on the inode buffer for unlinked list processing are
      removed.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      4fcc94d6
    • Dave Chinner's avatar
      xfs: factor the xfs_iunlink functions · a4454cd6
      Dave Chinner authored
      Prep work that separates the locking that protects the unlinked list
      from the actual operations being performed. This also helps document
      the fact they are performing list insert and remove operations. No
      functional code change.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      a4454cd6
    • Zhang Yi's avatar
      xfs: flush inode gc workqueue before clearing agi bucket · 04a98a03
      Zhang Yi authored
      While recovering the AGI unlinked lists, if something bad happens to
      one of the unlinked inodes in a bucket list, we call
      xlog_recover_clear_agi_bucket() to clear the whole unlinked bucket list,
      not just the unlinked inodes after the bad one. If we have already added
      some inodes to the gc workqueue before the bad inode in the list, we can
      get the error below when freeing those inodes, and finally fail to
      complete the log recovery procedure.
      
       XFS (ram0): Internal error xfs_iunlink_remove at line 2456 of file
       fs/xfs/xfs_inode.c.  Caller xfs_ifree+0xb0/0x360 [xfs]
      
      The problem is that xlog_recover_clear_agi_bucket() clears the bucket
      list, so the gc worker fails the agino check in xfs_verify_agino(). Fix
      this by flushing the workqueue before clearing the bucket.
      
      Fixes: ab23a776 ("xfs: per-cpu deferred inode inactivation queues")
      Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      04a98a03
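The ordering fix in this commit can be pictured with a heavily simplified model. The sketch below is an assumption-laden illustration, not the kernel code: `toy_recovery`, `toy_gc_flush`, `toy_clear_bucket` and `toy_handle_bad_inode` are hypothetical names, and the "worker" merely counts an error whenever it runs after the bucket has been cleared.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_PENDING 8  /* hypothetical queue capacity */

struct toy_recovery {
	int    bucket_valid;              /* nonzero while the bucket list is intact */
	int    pending[TOY_MAX_PENDING];  /* inodes queued for deferred inactivation */
	size_t npending;                  /* number of queued inodes */
	int    errors;                    /* deferred removals that failed */
};

/*
 * Run all deferred work now. In this model, removing an inode from the
 * unlinked list only succeeds while the bucket list is still intact.
 */
static void toy_gc_flush(struct toy_recovery *r)
{
	assert(r->npending <= TOY_MAX_PENDING);
	for (size_t i = 0; i < r->npending; i++) {
		if (!r->bucket_valid)
			r->errors++;
	}
	r->npending = 0;
}

/* Throw away the whole bucket list. */
static void toy_clear_bucket(struct toy_recovery *r)
{
	r->bucket_valid = 0;
}

/* Ordering described by the commit: flush first, then clear. */
static void toy_handle_bad_inode(struct toy_recovery *r)
{
	toy_gc_flush(r);
	toy_clear_bucket(r);
}
```

Calling `toy_clear_bucket()` before `toy_gc_flush()` in this model makes every queued inode fail, which is the shape of the failure the commit message describes.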
  2. 12 Jul, 2022 2 commits
  3. 09 Jul, 2022 7 commits
    • Darrick J. Wong's avatar
      xfs: use XFS_IFORK_Q to determine the presence of an xattr fork · e45d7cb2
      Darrick J. Wong authored
      Modify xfs_ifork_ptr to return a NULL pointer if the caller asks for the
      attribute fork but i_forkoff is zero.  This eliminates the ambiguity
      between i_forkoff and i_af.if_present, which should make it easier to
      understand the lifetime of attr forks.
      
      While we're at it, remove the if_present checks around calls to
      xfs_idestroy_fork and xfs_ifork_zap_attr since they can both handle attr
      forks that have already been torn down.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      e45d7cb2
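The behaviour described here, returning NULL for the attribute fork when the fork offset is zero, can be sketched in userspace as follows. This is a toy model and not the kernel helper: the `toy_` types and names are invented for illustration, and only a data fork and an attr fork are modelled.

```c
#include <assert.h>
#include <stddef.h>

enum toy_whichfork { TOY_DATA_FORK, TOY_ATTR_FORK };

struct toy_ifork {
	int nextents;  /* placeholder payload */
};

struct toy_inode {
	struct toy_ifork df;       /* data fork, always present */
	struct toy_ifork af;       /* attr fork storage, embedded in the inode */
	unsigned char    forkoff;  /* zero means "no attr fork" */
};

/*
 * Return the requested fork, or NULL if the caller asks for the attr
 * fork while forkoff is zero. forkoff is the single source of truth,
 * so no separate "present" flag is needed.
 */
static inline struct toy_ifork *
toy_ifork_ptr(struct toy_inode *ip, enum toy_whichfork whichfork)
{
	switch (whichfork) {
	case TOY_DATA_FORK:
		return &ip->df;
	case TOY_ATTR_FORK:
		if (!ip->forkoff)
			return NULL;
		return &ip->af;
	}
	assert(0);  /* unknown fork type */
	return NULL;
}
```

Callers in this model test the returned pointer rather than a second flag, which is the ambiguity the commit message says it removes.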
    • Darrick J. Wong's avatar
      xfs: make inode attribute forks a permanent part of struct xfs_inode · 2ed5b09b
      Darrick J. Wong authored
      Syzkaller reported a UAF bug a while back:
      
      ==================================================================
      BUG: KASAN: use-after-free in xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127
      Read of size 4 at addr ffff88802cec919c by task syz-executor262/2958
      
      CPU: 2 PID: 2958 Comm: syz-executor262 Not tainted
      5.15.0-0.30.3-20220406_1406 #3
      Hardware name: Red Hat KVM, BIOS 1.13.0-2.module+el8.3.0+7860+a7792d29
      04/01/2014
      Call Trace:
       <TASK>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0x82/0xa9 lib/dump_stack.c:106
       print_address_description.constprop.9+0x21/0x2d5 mm/kasan/report.c:256
       __kasan_report mm/kasan/report.c:442 [inline]
       kasan_report.cold.14+0x7f/0x11b mm/kasan/report.c:459
       xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127
       xfs_attr_get+0x378/0x4c2 fs/xfs/libxfs/xfs_attr.c:159
       xfs_xattr_get+0xe3/0x150 fs/xfs/xfs_xattr.c:36
       __vfs_getxattr+0xdf/0x13d fs/xattr.c:399
       cap_inode_need_killpriv+0x41/0x5d security/commoncap.c:300
       security_inode_need_killpriv+0x4c/0x97 security/security.c:1408
       dentry_needs_remove_privs.part.28+0x21/0x63 fs/inode.c:1912
       dentry_needs_remove_privs+0x80/0x9e fs/inode.c:1908
       do_truncate+0xc3/0x1e0 fs/open.c:56
       handle_truncate fs/namei.c:3084 [inline]
       do_open fs/namei.c:3432 [inline]
       path_openat+0x30ab/0x396d fs/namei.c:3561
       do_filp_open+0x1c4/0x290 fs/namei.c:3588
       do_sys_openat2+0x60d/0x98c fs/open.c:1212
       do_sys_open+0xcf/0x13c fs/open.c:1228
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0x0
      RIP: 0033:0x7f7ef4bb753d
      Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48
      89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73
      01 c3 48 8b 0d 1b 79 2c 00 f7 d8 64 89 01 48
      RSP: 002b:00007f7ef52c2ed8 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
      RAX: ffffffffffffffda RBX: 0000000000404148 RCX: 00007f7ef4bb753d
      RDX: 00007f7ef4bb753d RSI: 0000000000000000 RDI: 0000000020004fc0
      RBP: 0000000000404140 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 0030656c69662f2e
      R13: 00007ffd794db37f R14: 00007ffd794db470 R15: 00007f7ef52c2fc0
       </TASK>
      
      Allocated by task 2953:
       kasan_save_stack+0x19/0x38 mm/kasan/common.c:38
       kasan_set_track mm/kasan/common.c:46 [inline]
       set_alloc_info mm/kasan/common.c:434 [inline]
       __kasan_slab_alloc+0x68/0x7c mm/kasan/common.c:467
       kasan_slab_alloc include/linux/kasan.h:254 [inline]
       slab_post_alloc_hook mm/slab.h:519 [inline]
       slab_alloc_node mm/slub.c:3213 [inline]
       slab_alloc mm/slub.c:3221 [inline]
       kmem_cache_alloc+0x11b/0x3eb mm/slub.c:3226
       kmem_cache_zalloc include/linux/slab.h:711 [inline]
       xfs_ifork_alloc+0x25/0xa2 fs/xfs/libxfs/xfs_inode_fork.c:287
       xfs_bmap_add_attrfork+0x3f2/0x9b1 fs/xfs/libxfs/xfs_bmap.c:1098
       xfs_attr_set+0xe38/0x12a7 fs/xfs/libxfs/xfs_attr.c:746
       xfs_xattr_set+0xeb/0x1a9 fs/xfs/xfs_xattr.c:59
       __vfs_setxattr+0x11b/0x177 fs/xattr.c:180
       __vfs_setxattr_noperm+0x128/0x5e0 fs/xattr.c:214
       __vfs_setxattr_locked+0x1d4/0x258 fs/xattr.c:275
       vfs_setxattr+0x154/0x33d fs/xattr.c:301
       setxattr+0x216/0x29f fs/xattr.c:575
       __do_sys_fsetxattr fs/xattr.c:632 [inline]
       __se_sys_fsetxattr fs/xattr.c:621 [inline]
       __x64_sys_fsetxattr+0x243/0x2fe fs/xattr.c:621
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0x0
      
      Freed by task 2949:
       kasan_save_stack+0x19/0x38 mm/kasan/common.c:38
       kasan_set_track+0x1c/0x21 mm/kasan/common.c:46
       kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:360
       ____kasan_slab_free mm/kasan/common.c:366 [inline]
       ____kasan_slab_free mm/kasan/common.c:328 [inline]
       __kasan_slab_free+0xe2/0x10e mm/kasan/common.c:374
       kasan_slab_free include/linux/kasan.h:230 [inline]
       slab_free_hook mm/slub.c:1700 [inline]
       slab_free_freelist_hook mm/slub.c:1726 [inline]
       slab_free mm/slub.c:3492 [inline]
       kmem_cache_free+0xdc/0x3ce mm/slub.c:3508
       xfs_attr_fork_remove+0x8d/0x132 fs/xfs/libxfs/xfs_attr_leaf.c:773
       xfs_attr_sf_removename+0x5dd/0x6cb fs/xfs/libxfs/xfs_attr_leaf.c:822
       xfs_attr_remove_iter+0x68c/0x805 fs/xfs/libxfs/xfs_attr.c:1413
       xfs_attr_remove_args+0xb1/0x10d fs/xfs/libxfs/xfs_attr.c:684
       xfs_attr_set+0xf1e/0x12a7 fs/xfs/libxfs/xfs_attr.c:802
       xfs_xattr_set+0xeb/0x1a9 fs/xfs/xfs_xattr.c:59
       __vfs_removexattr+0x106/0x16a fs/xattr.c:468
       cap_inode_killpriv+0x24/0x47 security/commoncap.c:324
       security_inode_killpriv+0x54/0xa1 security/security.c:1414
       setattr_prepare+0x1a6/0x897 fs/attr.c:146
       xfs_vn_change_ok+0x111/0x15e fs/xfs/xfs_iops.c:682
       xfs_vn_setattr_size+0x5f/0x15a fs/xfs/xfs_iops.c:1065
       xfs_vn_setattr+0x125/0x2ad fs/xfs/xfs_iops.c:1093
       notify_change+0xae5/0x10a1 fs/attr.c:410
       do_truncate+0x134/0x1e0 fs/open.c:64
       handle_truncate fs/namei.c:3084 [inline]
       do_open fs/namei.c:3432 [inline]
       path_openat+0x30ab/0x396d fs/namei.c:3561
       do_filp_open+0x1c4/0x290 fs/namei.c:3588
       do_sys_openat2+0x60d/0x98c fs/open.c:1212
       do_sys_open+0xcf/0x13c fs/open.c:1228
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0x0
      
      The buggy address belongs to the object at ffff88802cec9188
       which belongs to the cache xfs_ifork of size 40
      The buggy address is located 20 bytes inside of
       40-byte region [ffff88802cec9188, ffff88802cec91b0)
      The buggy address belongs to the page:
      page:00000000c3af36a1 refcount:1 mapcount:0 mapping:0000000000000000
      index:0x0 pfn:0x2cec9
      flags: 0xfffffc0000200(slab|node=0|zone=1|lastcpupid=0x1fffff)
      raw: 000fffffc0000200 ffffea00009d2580 0000000600000006 ffff88801a9ffc80
      raw: 0000000000000000 0000000080490049 00000001ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff88802cec9080: fb fb fb fc fc fa fb fb fb fb fc fc fb fb fb fb
       ffff88802cec9100: fb fc fc fb fb fb fb fb fc fc fb fb fb fb fb fc
      >ffff88802cec9180: fc fa fb fb fb fb fc fc fa fb fb fb fb fc fc fb
                                  ^
       ffff88802cec9200: fb fb fb fb fc fc fb fb fb fb fb fc fc fb fb fb
       ffff88802cec9280: fb fb fc fc fa fb fb fb fb fc fc fa fb fb fb fb
      ==================================================================
      
      The root cause of this bug is the unlocked access to xfs_inode.i_afp
      from the getxattr code paths while trying to determine which ILOCK mode
      to use to stabilize the xattr data.  Unfortunately, the VFS does not
      acquire i_rwsem when vfs_getxattr (or listxattr) calls into the
      filesystem, which means that getxattr can race with a removexattr that's
      tearing down the attr fork and crash:
      
      xfs_attr_set:                          xfs_attr_get:
      xfs_attr_fork_remove:                  xfs_ilock_attr_map_shared:
      
      xfs_idestroy_fork(ip->i_afp);
      kmem_cache_free(xfs_ifork_cache, ip->i_afp);
      
                                             if (ip->i_afp &&
      
      ip->i_afp = NULL;
      
                                                 xfs_need_iread_extents(ip->i_afp))
                                             <KABOOM>
      
      ip->i_forkoff = 0;
      
      Regrettably, the VFS is much more lax about i_rwsem and getxattr than
      is immediately obvious -- not only does it not guarantee that we hold
      i_rwsem, it actually doesn't guarantee that we *don't* hold it either.
      The getxattr system call won't acquire the lock before calling XFS, but
      the file capabilities code calls getxattr with and without i_rwsem held
      to determine if the "security.capabilities" xattr is set on the file.
      
      Fixing the VFS locking requires a treewide investigation into every code
      path that could touch an xattr and what i_rwsem state it expects or sets
      up.  That could take years or even prove impossible; fortunately, we
      can fix this UAF problem inside XFS.
      
      An earlier version of this patch used smp_wmb in xfs_attr_fork_remove to
      ensure that i_forkoff is always zeroed before i_afp is set to null and
      changed the read paths to use smp_rmb before accessing i_forkoff and
      i_afp, which avoided these UAF problems.  However, the patch author was
      too busy dealing with other problems in the meantime, and by the time he
      came back to this issue, the situation had changed a bit.
      
      On a modern system with selinux, each inode will always have at least
      one xattr for the selinux label, so it doesn't make much sense to keep
      incurring the extra pointer dereference.  Furthermore, Allison's
      upcoming parent pointer patchset will also cause nearly every inode in
      the filesystem to have extended attributes.  Therefore, make the inode
      attribute fork structure part of struct xfs_inode, at a cost of 40 more
      bytes.
      
      This patch adds a clunky if_present field where necessary to maintain
      the existing logic of xattr fork null pointer testing in the existing
      codebase.  The next patch switches the logic over to XFS_IFORK_Q and it
      all goes away.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      2ed5b09b
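The structural idea behind this fix, embedding the attr fork in the inode instead of pointing at a separately freed object, can be sketched as below. Everything here is a hypothetical userspace model (`toy_inode`, `toy_attr_fork_add`, `toy_attr_fork_remove` and `toy_attr_nextents` are invented names), and it deliberately ignores locking; it only shows that tearing down the fork no longer frees memory a racing reader might touch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_ifork {
	int format;    /* placeholder fields */
	int nextents;
};

struct toy_inode {
	struct toy_ifork af;       /* embedded: lives exactly as long as the inode */
	unsigned char    forkoff;  /* zero means "no attr fork" */
};

/* Set up the attr fork in place; nothing is allocated. */
static void toy_attr_fork_add(struct toy_inode *ip, unsigned char forkoff)
{
	assert(ip != NULL && forkoff != 0);
	memset(&ip->af, 0, sizeof(ip->af));
	ip->forkoff = forkoff;
}

/* Tear down the attr fork in place; nothing is freed. */
static void toy_attr_fork_remove(struct toy_inode *ip)
{
	assert(ip != NULL);
	memset(&ip->af, 0, sizeof(ip->af));
	ip->forkoff = 0;
}

/*
 * A reader that peeks at the fork before taking any lock. With the
 * embedded layout this may observe zeroed or stale fields, but it never
 * dereferences freed memory.
 */
static int toy_attr_nextents(const struct toy_inode *ip)
{
	return ip->af.nextents;
}
```

The trade-off stated in the commit message still applies to the real structure: every inode carries the fork's storage whether or not it has extended attributes.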
    • Darrick J. Wong's avatar
      xfs: convert XFS_IFORK_PTR to a static inline helper · 732436ef
      Darrick J. Wong authored
      We're about to make this logic do a bit more, so convert the macro to a
      static inline function for better typechecking and fewer shouty macros.
      No functional changes here.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      732436ef
    • Andrey Strachuk's avatar
      xfs: removed useless condition in function xfs_attr_node_get · 0f38063d
      Andrey Strachuk authored
      At line 1561, variable "state" is being compared
      with NULL every loop iteration.
      
      -------------------------------------------------------------------
      1561	for (i = 0; state != NULL && i < state->path.active; i++) {
      1562		xfs_trans_brelse(args->trans, state->path.blk[i].bp);
      1563		state->path.blk[i].bp = NULL;
      1564	}
      -------------------------------------------------------------------
      
      However, it cannot be NULL.
      
      ----------------------------------------
      1546	state = xfs_da_state_alloc(args);
      ----------------------------------------
      
      xfs_da_state_alloc calls kmem_cache_zalloc. kmem_cache_zalloc is
      called with __GFP_NOFAIL flag and, therefore, it cannot return NULL.
      
      --------------------------------------------------------------------------
      	struct xfs_da_state *
      	xfs_da_state_alloc(
      	struct xfs_da_args	*args)
      	{
      		struct xfs_da_state	*state;
      
      		state = kmem_cache_zalloc(xfs_da_state_cache, GFP_NOFS | __GFP_NOFAIL);
      		state->args = args;
      		state->mp = args->dp->i_mount;
      		return state;
      	}
      --------------------------------------------------------------------------
      
      Found by Linux Verification Center (linuxtesting.org) with SVACE.
      Signed-off-by: Andrey Strachuk <strochuk@ispras.ru>
      
      Fixes: 4d0cdd2b ("xfs: clean up xfs_attr_node_hasname")
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      0f38063d
    • Eric Sandeen's avatar
      xfs: add selinux labels to whiteout inodes · 70b589a3
      Eric Sandeen authored
      We got a report that "renameat2() with flags=RENAME_WHITEOUT doesn't
      apply an SELinux label on xfs" as it does on other filesystems
      (for example, ext4 and tmpfs.)  While I'm not quite sure how labels
      may interact w/ whiteout files, leaving them as unlabeled seems
      inconsistent at best. Now that xfs_init_security is not static,
      rename it to xfs_inode_init_security per dchinner's suggestion.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      70b589a3
    • Darrick J. Wong's avatar
      Merge tag 'xfs-perag-conv-5.20' of... · fddb564f
      Darrick J. Wong authored
      Merge tag 'xfs-perag-conv-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-5.20-mergeA
      
      xfs: per-ag conversions for 5.20
      
      This series drives the perag down into the AGI, AGF and AGFL access
      routines and unifies the perag structure initialisation with the
      high level AG header read functions. This largely replaces the
      xfs_mount/agno pair that is passed to all these functions with a
      perag, and in most places we already have a perag ready to pass in.
      There are a few places where perags need to be grabbed before
      reading the AG header buffers - some of these will need to be driven
      to higher layers to ensure we can run operations on AGs without
      getting stuck part way through waiting on a perag reference.
      
      The latter section of this patchset moves some of the AG geometry
      information from the xfs_mount to the xfs_perag, and starts
      converting code that requires geometry validation to use a perag
      instead of a mount and having to extract the AGNO from the object
      location. This also allows us to store the AG size in the perag and
      then we can stop having to compare the agno against sb_agcount to
      determine if the AG is the last AG and so has a runt size.  This
      greatly simplifies some of the type validity checking we do and
      substantially reduces the CPU overhead of type validity checking. It
      also cuts over 1.2kB out of the binary size.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      
      * tag 'xfs-perag-conv-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
        xfs: make is_log_ag() a first class helper
        xfs: replace xfs_ag_block_count() with perag accesses
        xfs: Pre-calculate per-AG agino geometry
        xfs: Pre-calculate per-AG agbno geometry
        xfs: pass perag to xfs_alloc_read_agfl
        xfs: pass perag to xfs_alloc_put_freelist
        xfs: pass perag to xfs_alloc_get_freelist
        xfs: pass perag to xfs_read_agf
        xfs: pass perag to xfs_read_agi
        xfs: pass perag to xfs_alloc_read_agf()
        xfs: kill xfs_alloc_pagf_init()
        xfs: pass perag to xfs_ialloc_read_agi()
        xfs: kill xfs_ialloc_pagi_init()
        xfs: make last AG grow/shrink perag centric
      fddb564f
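The geometry change described in this merge, storing the AG size in the perag so validation no longer compares agno against the AG count, can be sketched as follows. This is a simplified userspace model with hypothetical names (`toy_sb`, `toy_perag`, `toy_perag_init`, `toy_verify_agbno`); it ignores the AG header blocks and every other check a real verifier would make.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy superblock geometry. */
struct toy_sb {
	uint32_t agcount;   /* number of allocation groups */
	uint32_t agblocks;  /* blocks in a full-sized AG */
	uint64_t dblocks;   /* total data blocks in the filesystem */
};

/* Toy per-AG structure with its size computed once. */
struct toy_perag {
	uint32_t agno;
	uint32_t block_count;  /* cached size of this AG */
};

/* The last AG may be a "runt" that is shorter than agblocks. */
static uint32_t toy_ag_block_count(const struct toy_sb *sb, uint32_t agno)
{
	assert(sb->agcount > 0 && agno < sb->agcount);
	if (agno < sb->agcount - 1)
		return sb->agblocks;
	return (uint32_t)(sb->dblocks - (uint64_t)agno * sb->agblocks);
}

/* Compute the geometry once, when the perag is set up. */
static void toy_perag_init(struct toy_perag *pag, const struct toy_sb *sb,
			   uint32_t agno)
{
	pag->agno = agno;
	pag->block_count = toy_ag_block_count(sb, agno);
}

/* Validation becomes a single comparison against the cached size. */
static bool toy_verify_agbno(const struct toy_perag *pag, uint32_t agbno)
{
	return agbno < pag->block_count;
}
```

With a hypothetical geometry of four AGs of 1000 blocks and 3500 data blocks in total, the last AG caches a size of 500, so block 499 validates and block 500 does not.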
    • Darrick J. Wong's avatar
      Merge tag 'xfs-cil-scale-5.20' of... · dd81dc05
      Darrick J. Wong authored
      Merge tag 'xfs-cil-scale-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-5.20-mergeA
      
      xfs: improve CIL scalability
      
      This series aims to improve the scalability of XFS transaction
      commits on large CPU count machines. My 32p machine hits contention
      limits in xlog_cil_commit() at about 700,000 transaction commits a
      second. It hits this at 16 thread workloads, and 32 thread
      workloads go no faster and just burn CPU on the CIL spinlocks.
      
      This patchset gets rid of spinlocks and global serialisation points
      in the xlog_cil_commit() path. It does this by moving to a
      combination of per-cpu counters, unordered per-cpu lists and
      post-ordered per-cpu lists.
      
      This results in transaction commit rates exceeding 1.4 million
      commits/s under certain unlink workloads, and while the log lock
      contention is largely gone there is still significant lock
      contention in the VFS (dentry cache, inode cache and security layers)
      at >600,000 transactions/s that still limits scalability.
      
      The changes to the CIL accounting and behaviour, combined with the
      structural changes to xlog_write() in prior patchsets make the
      per-cpu restructuring possible and sane. This allows us to move to
      precalculated reservation requirements that allow for reservation
      stealing to be accounted across multiple CPUs accurately.
      
      That is, instead of trying to account for continuation log opheaders
      on a "growth" basis, we pre-calculate how many iclogs we'll need to
      write out a maximally sized CIL checkpoint and steal that reserved
      space one commit at a time until the CIL has a full
      reservation. If we ever run a commit when we are already at the hard
      limit (because post-throttling) we simply take an extra reservation
      from each commit that is run when over the limit. Hence we don't
      need to do space usage math in the fast path and so never need to
      sum the per-cpu counters in this fast path.
      
      Similarly, per-cpu lists have the problem of ordering - we can't
      remove an item from a per-cpu list if we want to move it forward in
      the CIL. We solve this problem by using an atomic counter to give
      every commit a sequence number that is copied into the log items in
      that transaction. Hence relogging items just overwrites the sequence
      number in the log item, and does not move it in the per-cpu lists.
      Once we reaggregate the per-cpu lists back into a single list in the
      CIL push work, we can run it through list-sort() and reorder it back
      into a globally ordered list. This costs a bit of CPU time, but now
      that the CIL can run multiple works and pipelines properly, this is
      not a limiting factor for performance. It does increase fsync
      latency when the CIL is full, but workloads issuing large numbers of
      fsync()s or sync transactions end up with very small CILs and so the
      latency impact of sorting is not measurable for such workloads.
      
      Overall, this pushes the transaction commit bottleneck out to the
      lockless reservation grant head updates. These atomic updates don't
      start to be a limiting factor until > 1.5 million transactions/s are
      being run, at which point the accounting functions start to show up
      in profiles as the highest CPU users. Still, this series doubles
      transaction throughput without increasing CPU usage before we get
      to that cacheline contention breakdown point...
      
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      
      * tag 'xfs-cil-scale-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
        xfs: expanding delayed logging design with background material
        xfs: xlog_sync() manually adjusts grant head space
        xfs: avoid cil push lock if possible
        xfs: move CIL ordering to the logvec chain
        xfs: convert log vector chain to use list heads
        xfs: convert CIL to unordered per cpu lists
        xfs: Add order IDs to log items in CIL
        xfs: convert CIL busy extents to per-cpu
        xfs: track CIL ticket reservation in percpu structure
        xfs: implement percpu cil space used calculation
        xfs: introduce per-cpu CIL tracking structure
        xfs: rework per-iclog header CIL reservation
        xfs: lift init CIL reservation out of xc_cil_lock
        xfs: use the CIL space used counter for emptiness checks
      dd81dc05
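The ordering scheme described in this merge (stamp items with a commit sequence number, leave them where they are on relog, sort once at push time) can be sketched in userspace as below. This is an illustrative model only: `toy_item`, `toy_commit` and `toy_push_sort` are hypothetical names, the per-cpu lists are replaced by a single array gathered in arbitrary order, and the standard `qsort()` stands in for the kernel's list sort.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct toy_item {
	int      id;        /* identifies the log item */
	uint64_t order_id;  /* sequence number of the last commit that logged it */
};

/* Global commit sequence counter; starts at zero. */
static atomic_uint_fast64_t toy_commit_seq;

/*
 * Commit a transaction: take one new sequence number and copy it into
 * every item in the transaction. Relogging an item just overwrites its
 * order_id; the item is not moved on any list.
 */
static void toy_commit(struct toy_item **items, size_t n)
{
	uint64_t seq = atomic_fetch_add(&toy_commit_seq, 1) + 1;

	assert(items != NULL);
	for (size_t i = 0; i < n; i++)
		items[i]->order_id = seq;
}

/* Compare two array slots (pointers to items) by order_id. */
static int toy_item_cmp(const void *a, const void *b)
{
	const struct toy_item *x = *(const struct toy_item *const *)a;
	const struct toy_item *y = *(const struct toy_item *const *)b;

	if (x->order_id < y->order_id)
		return -1;
	if (x->order_id > y->order_id)
		return 1;
	return 0;
}

/* At push time, restore global order with one sort over all items. */
static void toy_push_sort(struct toy_item **items, size_t n)
{
	qsort(items, n, sizeof(*items), toy_item_cmp);
}
```

Items that share a sequence number compare equal here, so their relative order after the sort is unspecified in this model.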
  4. 07 Jul, 2022 12 commits