1. 09 Jul, 2022 5 commits
    • xfs: convert XFS_IFORK_PTR to a static inline helper · 732436ef
      Darrick J. Wong authored
      We're about to make this logic do a bit more, so convert the macro to a
      static inline function for better typechecking and fewer shouty macros.
      No functional changes here.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      732436ef
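The macro-to-inline conversion described in this commit can be illustrated with a minimal sketch. Every name below (`struct inode`, `struct fork`, `IFORK_PTR`, `ifork_ptr`) is a hypothetical stand-in chosen for illustration, not the real XFS definition; the point is only that the inline form lets the compiler check argument and return types.

```c
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real XFS types. */
enum fork_kind { DATA_FORK, ATTR_FORK };

struct fork {
	int nextents;
};

struct inode {
	struct fork data;	/* always present */
	struct fork *attr;	/* NULL when there is no attr fork */
};

/* Macro form: arguments are substituted textually, with no type checks. */
#define IFORK_PTR(ip, w) \
	((w) == DATA_FORK ? &(ip)->data : (ip)->attr)

/* Inline form: the compiler checks both arguments and the return type. */
static inline struct fork *
ifork_ptr(struct inode *ip, enum fork_kind whichfork)
{
	switch (whichfork) {
	case DATA_FORK:
		return &ip->data;
	case ATTR_FORK:
		return ip->attr;
	}
	return NULL;
}
```

Both forms return the same pointer for the same inputs, which matches the "no functional changes" claim; the function form is also an easier place to add extra logic later.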
    • xfs: removed useless condition in function xfs_attr_node_get · 0f38063d
      Andrey Strachuk authored
      At line 1561, variable "state" is being compared
      with NULL every loop iteration.
      
      -------------------------------------------------------------------
      1561	for (i = 0; state != NULL && i < state->path.active; i++) {
      1562		xfs_trans_brelse(args->trans, state->path.blk[i].bp);
      1563		state->path.blk[i].bp = NULL;
      1564	}
      -------------------------------------------------------------------
      
      However, it cannot be NULL.
      
      ----------------------------------------
      1546	state = xfs_da_state_alloc(args);
      ----------------------------------------
      
      xfs_da_state_alloc calls kmem_cache_zalloc. kmem_cache_zalloc is
      called with __GFP_NOFAIL flag and, therefore, it cannot return NULL.
      
      --------------------------------------------------------------------------
      	struct xfs_da_state *
      	xfs_da_state_alloc(
      	struct xfs_da_args	*args)
      	{
      		struct xfs_da_state	*state;
      
      		state = kmem_cache_zalloc(xfs_da_state_cache, GFP_NOFS | __GFP_NOFAIL);
      		state->args = args;
      		state->mp = args->dp->i_mount;
      		return state;
      	}
      --------------------------------------------------------------------------
      
      Found by Linux Verification Center (linuxtesting.org) with SVACE.
      Signed-off-by: Andrey Strachuk <strochuk@ispras.ru>

      Fixes: 4d0cdd2b ("xfs: clean up xfs_attr_node_hasname")
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      0f38063d
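The reasoning in this commit (an allocator that cannot return NULL makes a per-iteration NULL test redundant) can be sketched in userspace C. This is only an analogy under stated assumptions: `state_alloc()` and `release_path()` are hypothetical names, and `abort()` on failure stands in for the kernel's `__GFP_NOFAIL` behaviour, which retries rather than aborting.

```c
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the state/path structures. */
struct blk {
	void *bp;
};

struct state {
	int active;
	struct blk blk[4];
};

/*
 * Userspace analogue of a "cannot fail" allocation: this helper never
 * returns NULL (it aborts instead), so callers need no NULL checks.
 */
static struct state *state_alloc(void)
{
	struct state *s = calloc(1, sizeof(*s));

	if (!s)
		abort();
	return s;
}

/*
 * Release every buffer on the path. There is no "s != NULL" test in the
 * loop condition because state_alloc() guarantees a valid pointer.
 * Returns the number of buffers that were released.
 */
static int release_path(struct state *s)
{
	int released = 0;

	for (int i = 0; i < s->active; i++) {
		if (s->blk[i].bp)
			released++;
		s->blk[i].bp = NULL;
	}
	return released;
}
```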
    • xfs: add selinux labels to whiteout inodes · 70b589a3
      Eric Sandeen authored
      We got a report that "renameat2() with flags=RENAME_WHITEOUT doesn't
      apply an SELinux label on xfs" as it does on other filesystems
      (for example, ext4 and tmpfs.)  While I'm not quite sure how labels
      may interact w/ whiteout files, leaving them as unlabeled seems
      inconsistent at best. Now that xfs_init_security is not static,
      rename it to xfs_inode_init_security per dchinner's suggestion.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • Merge tag 'xfs-perag-conv-5.20' of... · fddb564f
      Darrick J. Wong authored
      Merge tag 'xfs-perag-conv-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-5.20-mergeA
      
      xfs: per-ag conversions for 5.20
      
      This series drives the perag down into the AGI, AGF and AGFL access
      routines and unifies the perag structure initialisation with the
      high level AG header read functions. This largely replaces the
      xfs_mount/agno pair that is passed to all these functions with a
      perag, and in most places we already have a perag ready to pass in.
      There are a few places where perags need to be grabbed before
      reading the AG header buffers - some of these will need to be driven
      to higher layers to ensure we can run operations on AGs without
      getting stuck part way through waiting on a perag reference.
      
      The latter section of this patchset moves some of the AG geometry
      information from the xfs_mount to the xfs_perag, and starts
      converting code that requires geometry validation to use a perag
      instead of a mount and having to extract the AGNO from the object
      location. This also allows us to store the AG size in the perag and
      then we can stop having to compare the agno against sb_agcount to
      determine if the AG is the last AG and so has a runt size.  This
      greatly simplifies some of the type validity checking we do and
      substantially reduces the CPU overhead of type validity checking. It
      also cuts over 1.2kB out of the binary size.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      
      * tag 'xfs-perag-conv-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
        xfs: make is_log_ag() a first class helper
        xfs: replace xfs_ag_block_count() with perag accesses
        xfs: Pre-calculate per-AG agino geometry
        xfs: Pre-calculate per-AG agbno geometry
        xfs: pass perag to xfs_alloc_read_agfl
        xfs: pass perag to xfs_alloc_put_freelist
        xfs: pass perag to xfs_alloc_get_freelist
        xfs: pass perag to xfs_read_agf
        xfs: pass perag to xfs_read_agi
        xfs: pass perag to xfs_alloc_read_agf()
        xfs: kill xfs_alloc_pagf_init()
        xfs: pass perag to xfs_ialloc_read_agi()
        xfs: kill xfs_ialloc_pagi_init()
        xfs: make last AG grow/shrink perag centric
    • Merge tag 'xfs-cil-scale-5.20' of... · dd81dc05
      Darrick J. Wong authored
      Merge tag 'xfs-cil-scale-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-5.20-mergeA
      
      xfs: improve CIL scalability
      
      This series aims to improve the scalability of XFS transaction
      commits on large CPU count machines. My 32p machine hits contention
      limits in xlog_cil_commit() at about 700,000 transaction commits a
      second. It hits this at 16 thread workloads, and 32 thread
      workloads go no faster and just burn CPU on the CIL spinlocks.
      
      This patchset gets rid of spinlocks and global serialisation points
      in the xlog_cil_commit() path. It does this by moving to a
      combination of per-cpu counters, unordered per-cpu lists and
      post-ordered per-cpu lists.
      
      This results in transaction commit rates exceeding 1.4 million
      commits/s under certain unlink workloads, and while the log lock
      contention is largely gone there is still significant lock
      contention in the VFS (dentry cache, inode cache and security layers)
      at >600,000 transactions/s that still limit scalability.
      
      The changes to the CIL accounting and behaviour, combined with the
      structural changes to xlog_write() in prior patchsets make the
      per-cpu restructuring possible and sane. This allows us to move to
      precalculated reservation requirements that allow for reservation
      stealing to be accounted across multiple CPUs accurately.
      
      That is, instead of trying to account for continuation log opheaders
      on a "growth" basis, we pre-calculate how many iclogs we'll need to
      write out a maximally sized CIL checkpoint and steal that reserved
      space one commit at a time until the CIL has a full
      reservation. If we ever run a commit when we are already at the hard
      limit (because post-throttling) we simply take an extra reservation
      from each commit that is run when over the limit. Hence we don't
      need to do space usage math in the fast path and so never need to
      sum the per-cpu counters in this fast path.
      
      Similarly, per-cpu lists have the problem of ordering - we can't
      remove an item from a per-cpu list if we want to move it forward in
      the CIL. We solve this problem by using an atomic counter to give
      every commit a sequence number that is copied into the log items in
      that transaction. Hence relogging items just overwrites the sequence
      number in the log item, and does not move it in the per-cpu lists.
      Once we reaggregate the per-cpu lists back into a single list in the
      CIL push work, we can run it through list-sort() and reorder it back
      into a globally ordered list. This costs a bit of CPU time, but now
      that the CIL can run multiple works and pipelines properly, this is
      not a limiting factor for performance. It does increase fsync
      latency when the CIL is full, but workloads issuing large numbers of
      fsync()s or sync transactions end up with very small CILs and so the
      latency impact of sorting is not measurable for such workloads.
      
      Overall, this pushes the transaction commit bottleneck out to the
      lockless reservation grant head updates. These atomic updates don't
      start to be a limiting factor until > 1.5 million transactions/s are
      being run, at which point the accounting functions start to show up
      in profiles as the highest CPU users. Still, this series doubles
      transaction throughput without increasing CPU usage before we get
      to that cacheline contention breakdown point...
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      
      * tag 'xfs-cil-scale-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
        xfs: expanding delayed logging design with background material
        xfs: xlog_sync() manually adjusts grant head space
        xfs: avoid cil push lock if possible
        xfs: move CIL ordering to the logvec chain
        xfs: convert log vector chain to use list heads
        xfs: convert CIL to unordered per cpu lists
        xfs: Add order IDs to log items in CIL
        xfs: convert CIL busy extents to per-cpu
        xfs: track CIL ticket reservation in percpu structure
        xfs: implement percpu cil space used calculation
        xfs: introduce per-cpu CIL tracking structure
        xfs: rework per-iclog header CIL reservation
        xfs: lift init CIL reservation out of xc_cil_lock
        xfs: use the CIL space used counter for emptiness checks
  2. 07 Jul, 2022 24 commits
    • xfs: make is_log_ag() a first class helper · 36029dee
      Dave Chinner authored
      We check if an ag contains the log in many places, so make this
      a first class XFS helper by lifting it to fs/xfs/libxfs/xfs_ag.h and
      renaming it xfs_ag_contains_log(). Then convert all the places that
      check if the AG contains the log to use this helper.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: replace xfs_ag_block_count() with perag accesses · 3829c9a1
      Dave Chinner authored
      Many of the places that call xfs_ag_block_count() have a perag
      available. These places can just read pag->block_count directly
      instead of calculating the AG block count from first principles.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: Pre-calculate per-AG agino geometry · 2d6ca832
      Dave Chinner authored
      There is a lot of overhead in functions like xfs_verify_agino() that
      repeatedly calculate the geometry limits of an AG. These can be
      pre-calculated as they are static and the verification context has
      a per-ag context it can quickly reference.
      
      In the case of xfs_verify_agino(), we now always have a perag
      context handy, so we can store the minimum and maximum agino values
      in the AG in the perag. This means we don't have to calculate
      it on every call and it can be inlined in callers if we move it
      to xfs_ag.h.
      
      xfs_verify_agino_or_null() gets the same perag treatment.
      
      xfs_agino_range() is moved to xfs_ag.c as it's not really a type
      function, and its use is largely restricted as the first and last
      aginos can be grabbed straight from the perag in most cases.
      
      Note that we leave the original xfs_verify_agino in place in
      xfs_types.c as a static function as other callers in that file do
      not have per-ag contexts so still need to go the long way. It's been
      renamed to xfs_verify_agno_agino() to indicate it takes both an agno
      and an agino to differentiate it from the new function.
      
      $ size --totals fs/xfs/built-in.a
      	   text    data     bss     dec     hex filename
      before	1482185	 329588	    572	1812345	 1ba779	(TOTALS)
      after	1481937	 329588	    572	1812097	 1ba681	(TOTALS)
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: Pre-calculate per-AG agbno geometry · 0800169e
      Dave Chinner authored
      There is a lot of overhead in functions like xfs_verify_agbno() that
      repeatedly calculate the geometry limits of an AG. These can be
      pre-calculated as they are static and the verification context has
      a per-ag context it can quickly reference.
      
      In the case of xfs_verify_agbno(), we now always have a perag
      context handy, so we can store the AG length and the minimum valid
      block in the AG in the perag. This means we don't have to calculate
      it on every call and it can be inlined in callers if we move it
      to xfs_ag.h.
      
      Move xfs_ag_block_count() to xfs_ag.c because it's really a
      per-ag function and not an XFS type function. We need a little
      bit of rework that is specific to xfs_initialise_perag() to allow
      growfs to calculate the new perag sizes before we've updated the
      primary superblock during the grow (chicken/egg situation).
      
      Note that we leave the original xfs_verify_agbno in place in
      xfs_types.c as a static function as other callers in that file do
      not have per-ag contexts so still need to go the long way. It's been
      renamed to xfs_verify_agno_agbno() to indicate it takes both an agno
      and an agbno to differentiate it from the new function.
      
      Future commits will make similar changes for other per-ag geometry
      validation functions.
      
      Further:
      
      $ size --totals fs/xfs/built-in.a
      	   text    data     bss     dec     hex filename
      before	1483006	 329588	    572	1813166	 1baaae	(TOTALS)
      after	1482185	 329588	    572	1812345	 1ba779	(TOTALS)
      
      This rework reduces the binary size by ~820 bytes, indicating
      that much less work is being done to bounds check the agbno values
      against per-ag geometry information.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      0800169e
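The idea in this commit (compute the AG length once, store it in the per-AG structure, then bounds-check against the stored values) can be sketched as follows. This is a simplified illustration, not the kernel code: `struct geom`, `struct perag`, and all function names are hypothetical, and `min_block` is treated here as the first usable block, which is an assumption made for the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical filesystem-wide geometry (stand-in for mount/superblock). */
struct geom {
	uint32_t agcount;	/* number of allocation groups */
	uint32_t agblocks;	/* blocks in every AG except possibly the last */
	uint64_t dblocks;	/* total data blocks in the filesystem */
	uint32_t min_block;	/* first usable block in an AG (assumed) */
};

/* Hypothetical per-AG structure caching the precomputed limits. */
struct perag {
	uint32_t agno;
	uint32_t block_count;	/* length of this AG, computed once */
	uint32_t min_block;	/* first usable block, copied once */
};

/* "Long way": derive the AG length from first principles on each call. */
static uint32_t ag_block_count(const struct geom *g, uint32_t agno)
{
	if (agno < g->agcount - 1)
		return g->agblocks;
	/* The last AG may be a runt. */
	return (uint32_t)(g->dblocks - (uint64_t)agno * g->agblocks);
}

/* Check that needs both an agno and an agbno and recomputes the limit. */
static bool verify_agno_agbno(const struct geom *g, uint32_t agno,
			      uint32_t agbno)
{
	return agbno >= g->min_block && agbno < ag_block_count(g, agno);
}

/* Done once per AG at initialisation time. */
static void perag_init(struct perag *pag, const struct geom *g,
		       uint32_t agno)
{
	pag->agno = agno;
	pag->block_count = ag_block_count(g, agno);
	pag->min_block = g->min_block;
}

/* Fast path: two comparisons against cached values, easily inlined. */
static inline bool verify_agbno(const struct perag *pag, uint32_t agbno)
{
	return agbno >= pag->min_block && agbno < pag->block_count;
}
```

With the limits cached, the fast-path check no longer has to compare the AG number against the AG count to find out whether it is looking at the runt AG.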
    • xfs: pass perag to xfs_alloc_read_agfl · cec7bb7d
      Dave Chinner authored
      We have the perag in most places we call xfs_alloc_read_agfl, so
      pass the perag instead of a mount/agno pair.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: pass perag to xfs_alloc_put_freelist · 8c392eb2
      Dave Chinner authored
      It's available in all callers, so pass it in so that the perag can
      be passed further down the stack.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: pass perag to xfs_alloc_get_freelist · 49f0d84e
      Dave Chinner authored
      It's available in all callers, so pass it in so that the perag can
      be passed further down the stack.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: pass perag to xfs_read_agf · fa044ae7
      Dave Chinner authored
      We have the perag in most places we call xfs_read_agf, so pass the
      perag instead of a mount/agno pair.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: pass perag to xfs_read_agi · 61021deb
      Dave Chinner authored
      We have the perag in most places we call xfs_read_agi, so pass the
      perag instead of a mount/agno pair.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: pass perag to xfs_alloc_read_agf() · 08d3e84f
      Dave Chinner authored
      xfs_alloc_read_agf() initialises the perag if it hasn't been done
      yet, so it makes sense to pass it the perag rather than pull a
      reference from the buffer. This allows callers to be per-ag centric
      rather than passing mount/agno pairs everywhere.
      
      Whilst modifying the xfs_reflink_find_shared() function definition,
      declare it static and remove the extern declaration as it is an
      internal function only these days.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: kill xfs_alloc_pagf_init() · 76b47e52
      Dave Chinner authored
      Trivial wrapper around xfs_alloc_read_agf(), can be easily replaced
      by passing a NULL agfbp to xfs_alloc_read_agf().
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: pass perag to xfs_ialloc_read_agi() · 99b13c7f
      Dave Chinner authored
      xfs_ialloc_read_agi() initialises the perag if it hasn't been done
      yet, so it makes sense to pass it the perag rather than pull a
      reference from the buffer. This allows callers to be per-ag centric
      rather than passing mount/agno pairs everywhere.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: kill xfs_ialloc_pagi_init() · a95fee40
      Dave Chinner authored
      This is just a basic wrapper around xfs_ialloc_read_agi(), which can
      be entirely handled by xfs_ialloc_read_agi() by passing a NULL
      agibpp....
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: make last AG grow/shrink perag centric · c6aee248
      Dave Chinner authored
      Because the perag must exist for these operations, look it up as
      part of the common shrink operations and pass it instead of the
      mount/agno pair.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: expanding delayed logging design with background material · 51a117ed
      Dave Chinner authored
      I wrote up a description of how transactions, space reservations and
      relogging work together in response to a question for background
      material on the delayed logging design. Add this to the existing
      document for ease of future reference.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: xlog_sync() manually adjusts grant head space · d9f68777
      Dave Chinner authored
      When xlog_sync() rounds off the tail of the iclog that is being
      flushed, it manually subtracts that space from the grant heads. This
      space is actually reserved by the transaction ticket that covers
      the xlog_sync() call from xlog_write(), but we don't plumb the
      ticket down far enough for it to account for the space consumed in
      the current log ticket.
      
      The grant heads are hot, so we really should be accounting this to
      the ticket if we can, rather than adding thousands of extra grant
      head updates every CIL commit.
      
      Interestingly, this actually indicates a potential log space overrun
      can occur when we force the log. By the time that xfs_log_force()
      pushes out an active iclog and consumes the roundoff space, the
      reservation for that roundoff space has been returned to the grant
      heads and is no longer covered by a reservation. In theory the
      roundoff added to log force on an already full log could push the
      write head past the tail. In practice, the CIL commit that writes to
      the log and needs the iclog pushed will have reserved space for
      roundoff, so when it releases the ticket there will still be
      physical space for the roundoff to be committed to the log, even
      though it is no longer reserved. This roundoff won't be enough space
      to allow a transaction to be woken if the log is full, so overruns
      should not actually occur in practice.
      
      That said, it indicates that we should not release the CIL context
      log ticket until after we've released the commit iclog. It also
      means that xlog_sync() still needs the direct grant head
      manipulation if we don't provide it with a ticket. Log forces are
      rare when we are in fast paths running 1.5 million transactions/s
      that make the grant heads hot, so let's optimise the hot case and
      pass CIL log tickets down to the xlog_sync() code.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: avoid cil push lock if possible · 1ccb0745
      Dave Chinner authored
      Because now it hurts when the CIL fills up.
      
        - 37.20% __xfs_trans_commit
            - 35.84% xfs_log_commit_cil
               - 19.34% _raw_spin_lock
                  - do_raw_spin_lock
                       19.01% __pv_queued_spin_lock_slowpath
               - 4.20% xfs_log_ticket_ungrant
                    0.90% xfs_log_space_wake
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: move CIL ordering to the logvec chain · 4eb56069
      Dave Chinner authored
      Adding a list_sort() call to the CIL push work while the xc_ctx_lock
      is held exclusively has resulted in fairly long lock hold times and
      that stops all front end transaction commits from making progress.
      
      We can move the sorting out of the xc_ctx_lock if we can transfer
      the ordering information to the log vectors as they are detached
      from the log items and then we can sort the log vectors.  With these
      changes, we can move the list_sort() call to just before we call
      xlog_write() when we aren't holding any locks at all.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: convert log vector chain to use list heads · 16924853
      Dave Chinner authored
      Because the next change is going to require sorting log vectors, and
      that requires arbitrary rearrangement of the list which cannot be
      done easily with a single linked list.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: convert CIL to unordered per cpu lists · c0fb4765
      Dave Chinner authored
      So that we can remove the cil_lock which is a global serialisation
      point. We've already got ordering sorted, so all we need to do is
      treat the CIL list like the busy extent list and reconstruct it
      before the push starts.
      
      This is what we're trying to avoid:
      
       -   75.35%     1.83%  [kernel]            [k] xfs_log_commit_cil
          - 46.35% xfs_log_commit_cil
             - 41.54% _raw_spin_lock
                - 67.30% do_raw_spin_lock
                     66.96% __pv_queued_spin_lock_slowpath
      
      Which happens on a 32p system when running a 32-way 'rm -rf'
      workload. After this patch:
      
      -   20.90%     3.23%  [kernel]               [k] xfs_log_commit_cil
         - 17.67% xfs_log_commit_cil
            - 6.51% xfs_log_ticket_ungrant
                 1.40% xfs_log_space_wake
              2.32% memcpy_erms
            - 2.18% xfs_buf_item_committing
               - 2.12% xfs_buf_item_release
                  - 1.03% xfs_buf_unlock
                       0.96% up
                    0.72% xfs_buf_rele
              1.33% xfs_inode_item_format
              1.19% down_read
              0.91% up_read
              0.76% xfs_buf_item_format
            - 0.68% kmem_alloc_large
               - 0.67% kmem_alloc
                    0.64% __kmalloc
              0.50% xfs_buf_item_size
      
      It kinda looks like the workload is running out of log space all
      the time. But all the spinlock contention is gone and the
      transaction commit rate has gone from 800k/s to 1.3M/s so the amount
      of real work being done has gone up a *lot*.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: Add order IDs to log items in CIL · 016a2338
      Dave Chinner authored
      Before we split the ordered CIL up into per cpu lists, we need a
      mechanism to track the order of the items in the CIL. We need to do
      this because there are rules around the order in which related items
      must physically appear in the log even inside a single checkpoint
      transaction.
      
      An example of this is intents - an intent must appear in the log
      before its intent done record so that log recovery can cancel the
      intent correctly. If we have these two records misordered in the
      CIL, then they will not be recovered correctly by journal replay.
      
      We also will not be able to move items to the tail of
      the CIL list when they are relogged, hence the log items will need
      some mechanism to allow the correct log item order to be recreated
      before we write log items to the journal.
      
      Hence we need to have a mechanism for recording global order of
      transactions in the log items  so that we can recover that order
      from un-ordered per-cpu lists.
      
      Do this with a simple monotonic increasing commit counter in the CIL
      context. Each log item in the transaction gets stamped with the
      current commit order ID before it is added to the CIL. If the item
      is already in the CIL, leave it where it is instead of moving it to
      the tail of the list and instead sort the list before we start the
      push work.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      016a2338
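The ordering scheme described in this commit (stamp each item with a commit order ID, overwrite the stamp on relog instead of moving the item, then sort before the push) can be sketched in userspace C. All names here (`struct log_item`, `stamp_items()`, `sort_for_push()`) are hypothetical. The sketch uses `qsort()`, which is not a stable sort, so items sharing one order ID may come out in any relative order; that differs from a stable list sort and is a limitation of this illustration.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical log item carrying the order ID of its latest commit. */
struct log_item {
	int id;
	uint64_t order_id;
};

/* Monotonically increasing commit counter (one per CIL context). */
static atomic_uint_fast64_t commit_counter;

/*
 * Stamp every item in a committing transaction with the same order ID.
 * A relogged item is not moved anywhere; its stamp is just overwritten.
 */
static void stamp_items(struct log_item **items, size_t n)
{
	uint64_t order = atomic_fetch_add(&commit_counter, 1) + 1;

	for (size_t i = 0; i < n; i++)
		items[i]->order_id = order;
}

/* Compare two array slots (pointers to items) by order ID. */
static int cmp_order(const void *a, const void *b)
{
	const struct log_item *x = *(const struct log_item *const *)a;
	const struct log_item *y = *(const struct log_item *const *)b;

	return (x->order_id > y->order_id) - (x->order_id < y->order_id);
}

/* Rebuild global commit order from an unordered, aggregated item list. */
static void sort_for_push(struct log_item **items, size_t n)
{
	qsort(items, n, sizeof(*items), cmp_order);
}
```

For example, if items A and B commit together, then C commits, then A is relogged, sorting the aggregated list yields B, C, A regardless of the order in which the per-CPU lists were spliced together.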
    • xfs: convert CIL busy extents to per-cpu · df7a4a21
      Dave Chinner authored
      To get them out from under the CIL lock.
      
      This is an unordered list, so we can simply punt it to per-cpu lists
      during transaction commits and reaggregate it back into a single
      list during the CIL push work.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: track CIL ticket reservation in percpu structure · 1dd2a2c1
      Dave Chinner authored
      To get it out from under the cil spinlock.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: implement percpu cil space used calculation · 7c8ade21
      Dave Chinner authored
      Now that we have the CIL percpu structures in place, implement the
      space used counter as a per-cpu counter.
      
      We have to be really careful now about ensuring that the checks and
      updates run without arbitrary delays, which means they need to run
      with pre-emption disabled. We do this by careful placement of
      the get_cpu_ptr/put_cpu_ptr calls to access the per-cpu structures
      for that CPU.
      
      We need to be able to reliably detect that the CIL has reached
      the hard limit threshold so we can take extra reservations for the
      iclog headers when the space used overruns the original reservation.
      Hence we factor out xlog_cil_over_hard_limit() from
      xlog_cil_push_background().
      
      The global CIL space used is an atomic variable that is backed by
      per-cpu aggregation to minimise the number of atomic updates we do
      to the global state in the fast path. While we are under the soft
      limit, we aggregate only when the per-cpu aggregation is over the
      proportion of the soft limit assigned to that CPU. This means that
      all CPUs can use all but one byte of their aggregation threshold
      and we will not go over the soft limit.
      
      Hence once we detect that we've gone over both a per-cpu aggregation
      threshold and the soft limit, we know that we have only
      exceeded the soft limit by one per-cpu aggregation threshold. Even
      if all CPUs hit this at the same time, we can't be over the hard
      limit, so we can run an aggregation back into the atomic counter
      at this point and still be under the hard limit.
      
      At this point, we will be over the soft limit and hence we'll
      aggregate into the global atomic used space directly rather than the
      per-cpu counters, hence providing accurate detection of hard limit
      excursion for accounting and reservation purposes.
      
      Hence we get the best of both worlds - lockless, scalable per-cpu
      fast path plus accurate, atomic detection of hard limit excursion.
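The accounting scheme described above can be sketched in userspace C. This is a simplified, single-threaded illustration and not the XFS implementation: every name prefixed with toy_ or TOY_ is hypothetical, the "CPU" is just an array index, and the hard limit is assumed to be twice the soft limit. The real code uses get_cpu_ptr()/put_cpu_ptr() to keep each update on one CPU with pre-emption disabled, which this sketch does not model.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* All names and limits below are hypothetical, chosen for illustration. */
#define TOY_NR_CPUS	4
#define TOY_SOFT_LIMIT	1000
#define TOY_HARD_LIMIT	(2 * TOY_SOFT_LIMIT)	/* assumption: hard = 2 * soft */

struct toy_cil {
	atomic_int	space_used;			/* global, atomic */
	int		pcp_space_used[TOY_NR_CPUS];	/* per-"CPU", unlocked */
};

static void toy_cil_init(struct toy_cil *cil)
{
	atomic_init(&cil->space_used, 0);
	for (int i = 0; i < TOY_NR_CPUS; i++)
		cil->pcp_space_used[i] = 0;
}

/* Only the global counter is consulted for the hard limit check. */
static bool toy_cil_over_hard_limit(struct toy_cil *cil)
{
	return atomic_load(&cil->space_used) >= TOY_HARD_LIMIT;
}

static void toy_cil_account(struct toy_cil *cil, int cpu, int len)
{
	/* Each CPU is assigned an equal share of the soft limit. */
	const int threshold = TOY_SOFT_LIMIT / TOY_NR_CPUS;
	int *pcp = &cil->pcp_space_used[cpu];

	/*
	 * Over the soft limit: account directly in the global counter,
	 * folding in anything still held in this CPU's local counter.
	 */
	if (atomic_load(&cil->space_used) >= TOY_SOFT_LIMIT) {
		atomic_fetch_add(&cil->space_used, *pcp + len);
		*pcp = 0;
		return;
	}

	/* Under the soft limit: stay local until this CPU's share is used. */
	*pcp += len;
	if (*pcp >= threshold) {
		atomic_fetch_add(&cil->space_used, *pcp);
		*pcp = 0;
	}
}
```

While the global counter is under the soft limit, a call touches the atomic counter only when that CPU's local total reaches its share. Once the global counter has crossed the soft limit, every call updates it directly, so toy_cil_over_hard_limit() sees an accurate value.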
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      7c8ade21
  3. 03 Jul, 2022 4 commits
    • Linus Torvalds's avatar
      Linux 5.19-rc5 · 88084a3d
      Linus Torvalds authored
      88084a3d
    • Linus Torvalds's avatar
      lockref: remove unused 'lockref_get_or_lock()' function · b8d5109f
      Linus Torvalds authored
      Looking at the conditional lock acquire functions in the kernel due to
      the new sparse support (see commit 4a557a5d "sparse: introduce
      conditional lock acquire function attribute"), it became obvious that
      the lockref code has a couple of them, but they don't match the usual
      naming convention for the other ones, and their return value logic is
      also reversed.
      
      In the other very similar places, the naming pattern is '*_and_lock()'
      (eg 'atomic_dec_and_lock()' and 'refcount_dec_and_lock()'), and the
      function returns true when the lock is taken.
      
      The lockref code is superficially very similar to the refcount code,
      only with the special "atomic wrt the embedded lock" semantics.  But
      instead of the '*_and_lock()' naming it uses '*_or_lock()'.
      
      And instead of returning true in case it took the lock, it returns true
      if it *didn't* take the lock.
      
      Now, arguably the lockref code is quite logical: it really is an "either
      decrement _or_ lock" kind of situation - and the return value is about
      whether the operation succeeded without any special care needed.
      
      So despite the similarities, the differences do make some sense, and
      maybe it's not worth trying to unify the different conditional locking
      primitives in this area.
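The two return-value conventions can be contrasted with a small userspace sketch. It is an illustration under stated assumptions, not kernel code: struct toy_ref and both toy_* functions are hypothetical, a pthread mutex stands in for the spinlock, and both helpers always take the lock, whereas the real primitives avoid it on the fast path.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcount paired with a lock. */
struct toy_ref {
	pthread_mutex_t	lock;
	int		count;
};

/*
 * '*_and_lock()' convention: decrement, and return true WITH THE LOCK
 * HELD only if the count dropped to zero. Otherwise return false with
 * the lock released.
 */
static bool toy_dec_and_lock(struct toy_ref *r)
{
	pthread_mutex_lock(&r->lock);
	if (--r->count == 0)
		return true;		/* caller must unlock */
	pthread_mutex_unlock(&r->lock);
	return false;
}

/*
 * '*_or_lock()' convention: return true if the count was decremented
 * and the lock is NOT held. Return false with the lock held and the
 * count untouched when the count is 1 or less.
 */
static bool toy_put_or_lock(struct toy_ref *r)
{
	pthread_mutex_lock(&r->lock);
	if (r->count > 1) {
		r->count--;
		pthread_mutex_unlock(&r->lock);
		return true;
	}
	return false;			/* caller must unlock */
}
```

In both helpers the caller ends up holding the lock exactly when the last reference is at stake, but the boolean result is inverted: true means "lock held" for toy_dec_and_lock() and "lock not held" for toy_put_or_lock().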
      
      But while looking at this all, it did become obvious that the
      'lockref_get_or_lock()' function hasn't actually had any users for
      almost a decade.
      
      The only user it ever had was the short-lived 'd_rcu_to_refcount()'
      function, and it got removed and replaced with 'lockref_get_not_dead()'
      back in 2013 in commits 0d98439e ("vfs: use lockref 'dead' flag to
      mark unrecoverably dead dentries") and e5c832d5 ("vfs: fix dentry
      RCU to refcounting possibly sleeping dput()").
      
      In fact, that single use was removed less than a week after the whole
      function was introduced in commit b3abd802 ("lockref: add
      'lockref_get_or_lock()' helper") so this function has been around for a
      decade, but only had a user for six days.
      
      Let's just put this mis-designed and unused function out of its misery.
      
      We can think about the naming and semantic oddities of the remaining
      'lockref_put_or_lock()' later, but at least that function has users.
      
      And while the naming is different and the return value doesn't match,
      that function matches the whole '{atomic,refcount}_dec_and_test()'
      pattern much better (ie the magic happens when the count goes down to
      zero, not when it is incremented from zero).
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b8d5109f
    • Linus Torvalds's avatar
      sparse: introduce conditional lock acquire function attribute · 4a557a5d
      Linus Torvalds authored
      The kernel tends to try to avoid conditional locking semantics because
      it makes it harder to think about and statically check locking rules,
      but we do have a few fundamental locking primitives that take locks
      conditionally - most obviously the 'trylock' functions.
      
      That has always been a problem for 'sparse' checking for locking
      imbalance, and we've had a special '__cond_lock()' macro that we've used
      to let sparse know how the locking works:
      
          # define __cond_lock(x,c)        ((c) ? ({ __acquire(x); 1; }) : 0)
      
      so that you can then use this to tell sparse that (for example) the
      spinlock trylock macro ends up acquiring the lock when it succeeds, but
      not when it fails:
      
          #define raw_spin_trylock(lock)  __cond_lock(lock, _raw_spin_trylock(lock))
      
      and then sparse can follow along the locking rules when you have code like
      
              if (!spin_trylock(&dentry->d_lock))
                      return LRU_SKIP;
              .. sparse sees that the lock is held here..
              spin_unlock(&dentry->d_lock);
      
      and sparse ends up happy about the lock contexts.
      
      However, this '__cond_lock()' use does result in very ugly header files,
      and requires you to basically wrap the real function with that macro
      that uses '__cond_lock'.  Which has made PeterZ NAK things that try to
      fix sparse warnings over the years [1].
      
      To solve this, there is now a very experimental patch to sparse that
      basically does the exact same thing as '__cond_lock()' did, but using a
      function attribute instead.  That seems to make PeterZ happy [2].
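The difference between the two annotation styles can be sketched as follows. The macro definitions are written from memory of the kernel's compiler annotations and should be read as an assumption rather than a verbatim copy. The toy_* functions are hypothetical, and a pthread mutex stands in for a spinlock so the sketch builds in userspace. With an ordinary compiler (no __CHECKER__) both annotations vanish and the two trylock variants behave identically.

```c
#include <pthread.h>
#include <stdbool.h>

#ifdef __CHECKER__
/* Seen only by sparse; assumed to mirror the kernel's definitions. */
# define __acquire(x)		__context__(x, 1)
# define __cond_lock(x, c)	((c) ? ({ __acquire(x); 1; }) : 0)
# define __cond_acquires(x)	__attribute__((context(x, 0, -1)))
#else
/* A normal compiler sees no annotations at all. */
# define __acquire(x)		(void)0
# define __cond_lock(x, c)	(c)
# define __cond_acquires(x)
#endif

/*
 * Old style: the real function is hidden behind a wrapper macro that
 * tells sparse the lock is acquired only when the call succeeds.
 */
static bool _toy_trylock(pthread_mutex_t *lock)
{
	return pthread_mutex_trylock(lock) == 0;
}
#define toy_trylock(lock)	__cond_lock(lock, _toy_trylock(lock))

/*
 * New style: the annotation sits on the function declaration itself,
 * so no wrapper macro is needed.
 */
static bool toy_trylock_annotated(pthread_mutex_t *lock) __cond_acquires(lock);

static bool toy_trylock_annotated(pthread_mutex_t *lock)
{
	return pthread_mutex_trylock(lock) == 0;
}
```

The old style needs two names per primitive (the real function plus the macro that wraps it), which is the header clutter the commit message complains about. The attribute style keeps a single declaration.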
      
      Note that this does not replace existing use of '__cond_lock()', but
      only exposes the new proposed attribute and uses it for the previously
      unannotated 'refcount_dec_and_lock()' family of functions.
      
      For existing sparse installations, this will make no difference (a
      negative output context was ignored), but if you have the experimental
      sparse patch it will make sparse now understand code that uses those
      functions, the same way '__cond_lock()' makes sparse understand the very
      similar 'atomic_dec_and_lock()' uses that have the old '__cond_lock()'
      annotations.
      
      Note that in some cases this will silence existing context imbalance
      warnings.  But in other cases it may end up exposing new sparse warnings
      for code that sparse just didn't see the locking for at all before.
      
      This is a trial, in other words.  I'd expect that if it ends up being
      successful, and new sparse releases end up having this new attribute,
      we'll migrate the old-style '__cond_lock()' users to use the new-style
      '__cond_acquires' function attribute.
      
      The actual experimental sparse patch was posted in [3].
      
      Link: https://lore.kernel.org/all/20130930134434.GC12926@twins.programming.kicks-ass.net/ [1]
      Link: https://lore.kernel.org/all/Yr60tWxN4P568x3W@worktop.programming.kicks-ass.net/ [2]
      Link: https://lore.kernel.org/all/CAHk-=wjZfO9hGqJ2_hGQG3U_XzSh9_XaXze=HgPdvJbgrvASfA@mail.gmail.com/ [3]
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Alexander Aring <aahringo@redhat.com>
      Cc: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4a557a5d
    • Linus Torvalds's avatar
      Merge tag 'xfs-5.19-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux · 20855e4c
      Linus Torvalds authored
      Pull xfs fixes from Darrick Wong:
       "This fixes some stalling problems and corrects the last of the
        problems (I hope) observed during testing of the new atomic xattr
        update feature.
      
         - Fix statfs blocking on background inode gc workers
      
         - Fix some broken inode lock assertion code
      
         - Fix xattr leaf buffer leaks when cancelling a deferred xattr update
           operation
      
         - Clean up xattr recovery to make it easier to understand.
      
         - Fix xattr leaf block verifiers tripping over empty blocks.
      
         - Remove complicated and error-prone xattr leaf block bholding mess.
      
         - Fix a bug where an rt extent crossing EOF was treated as "posteof"
           blocks and cleaned unnecessarily.
      
         - Fix a UAF when log shutdown races with unmount"
      
      * tag 'xfs-5.19-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
        xfs: prevent a UAF when log IO errors race with unmount
        xfs: dont treat rt extents beyond EOF as eofblocks to be cleared
        xfs: don't hold xattr leaf buffers across transaction rolls
        xfs: empty xattr leaf header blocks are not corruption
        xfs: clean up the end of xfs_attri_item_recover
        xfs: always free xattri_leaf_bp when cancelling a deferred op
        xfs: use invalidate_lock to check the state of mmap_lock
        xfs: factor out the common lock flags assert
        xfs: introduce xfs_inodegc_push()
        xfs: bound maximum wait time for inodegc work
      20855e4c
  4. 02 Jul, 2022 7 commits