1. 13 Sep, 2023 6 commits
  2. 12 Sep, 2023 7 commits
    • xfs: make inode unlinked bucket recovery work with quotacheck · 49813a21
      Darrick J. Wong authored
      Teach quotacheck to reload the unlinked inode lists when walking the
      inode table.  This requires extra state handling, since it's possible
      that a reloaded inode will get inactivated before quotacheck tries to
      scan it; in this case, we need to ensure that the reloaded inode does
      not have dquots attached when it is freed.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: load uncached unlinked inodes into memory on demand · 68b957f6
      Darrick J. Wong authored
      shrikanth hegde reports that filesystems fail shortly after mount with
      the following failure:
      
      	WARNING: CPU: 56 PID: 12450 at fs/xfs/xfs_inode.c:1839 xfs_iunlink_lookup+0x58/0x80 [xfs]
      
      This of course is the WARN_ON_ONCE in xfs_iunlink_lookup:
      
      	ip = radix_tree_lookup(&pag->pag_ici_root, agino);
      	if (WARN_ON_ONCE(!ip || !ip->i_ino)) { ... }
      
      From diagnostic data collected by the bug reporters, it would appear
      that we cleanly mounted a filesystem that contained unlinked inodes.
      Unlinked inodes are only processed as a final step of log recovery,
      which means that clean mounts do not process the unlinked list at all.
      
      Prior to the introduction of the incore unlinked lists, this wasn't a
      problem because the unlink code would (very expensively) traverse the
      entire ondisk metadata iunlink chain to keep things up to date.
      However, the incore unlinked list code complains when it realizes that
      it is out of sync with the ondisk metadata and shuts down the fs, which
      is bad.
      
      Ritesh proposed to solve this problem by unconditionally parsing the
      unlinked lists at mount time, but this imposes a mount time cost for
      every filesystem to catch something that should be very infrequent.
      Instead, let's target the places where we can encounter a next_unlinked
      pointer that refers to an inode that is not in cache, and load it into
      cache.
      
      Note: This patch does not address the problem of iget loading an inode
      from the middle of the iunlink list and needing to set i_prev_unlinked
      correctly.
      Reported-by: shrikanth hegde <sshegde@linux.vnet.ibm.com>
      Triaged-by: Ritesh Harjani <ritesh.list@gmail.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: reserve less log space when recovering log intent items · 3c919b09
      Darrick J. Wong authored
      Wengang Wang reports that a customer's system was running a number of
      truncate operations on a filesystem with a very small log.  Contention
      on the reserve heads led to other threads stalling on smaller updates
      (e.g.  mtime updates) long enough to result in the node being rebooted
      on account of the lack of responsiveness.  The node failed to recover
      because log recovery of an EFI became stuck waiting for a grant of
      reserve space.  From Wengang's report:
      
      "For the file deletion, log bytes are reserved basing on
      xfs_mount->tr_itruncate which is:
      
          tr_logres = 175488,
          tr_logcount = 2,
          tr_logflags = XFS_TRANS_PERM_LOG_RES,
      
      "You see it's a permanent log reservation with two log operations (two
      transactions in rolling mode).  After calculation (xlog_calc_unit_res()
      adds space for various log headers), the final log space needed per
      transaction changes from  175488 to 180208 bytes.  So the total log
      space needed is 360416 bytes (180208 * 2).  [That quantity] of log space
      (360416 bytes) needs to be reserved for both run time inode removing
      (xfs_inactive_truncate()) and EFI recover (xfs_efi_item_recover())."
      
      In other words, runtime pre-reserves 360K of space in anticipation of
      running a chain of two transactions in which each transaction gets a
      180K reservation.
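
      The arithmetic above can be checked with a small sketch.  This is
      not kernel code: unit_res() and total_res() are hypothetical names,
      and the header overhead is simply the difference between the two
      figures quoted by Wengang rather than what xlog_calc_unit_res()
      actually computes.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: overhead implied by the quoted figures (180208 - 175488). */
#define UNIT_HEADER_OVERHEAD	(180208u - 175488u)

/* Per-transaction unit reservation: tr_logres plus log header overhead. */
static uint32_t unit_res(uint32_t tr_logres)
{
	return tr_logres + UNIT_HEADER_OVERHEAD;
}

/* Total space pre-reserved for a chain of tr_logcount transactions. */
static uint32_t total_res(uint32_t tr_logres, uint32_t tr_logcount)
{
	assert(tr_logcount > 0);
	return unit_res(tr_logres) * tr_logcount;
}
```

      With tr_logres = 175488, a count of 2 gives the 360416 bytes that
      runtime reserves, and a count of 1 gives the 180208 bytes still
      held by the second transaction in the chain.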
      
      Now that we've allocated the transaction, we delete the bmap mapping,
      log an EFI to free the space, and roll the transaction as part of
      finishing the deferops chain.  Rolling creates a new xfs_trans which
      shares its ticket with the old transaction.  Next, xfs_trans_roll calls
      __xfs_trans_commit with regrant == true, which calls xlog_cil_commit
      with the same regrant parameter.
      
      xlog_cil_commit calls xfs_log_ticket_regrant, which decrements t_cnt and
      subtracts t_curr_res from the reservation and write heads.
      
      If the filesystem is fresh and the first transaction only used (say)
      20K, then t_curr_res will be 160K, and we give that much reservation
      back to the reservation head.  Or if the file is really fragmented and
      the first transaction actually uses 170K, then t_curr_res will be 10K,
      and that's what we give back to the reservation.
      
      Having done that, we're now headed into the second transaction with an
      EFI and 180K of reservation.  Other threads apparently consumed all the
      reservation for smaller transactions, such as timestamp updates.
      
      Now let's say the first transaction gets written to disk and we crash
      without ever completing the second transaction.  Now we remount the fs,
      log recovery finds the unfinished EFI, and calls xfs_efi_recover to
      finish the EFI.  However, xfs_efi_recover starts a new tr_itruncate
      transaction, which asks for 360K log reservation.  This is a lot more
      than the 180K that we had reserved at the time of the crash.  If the
      first EFI to be recovered is also pinning the tail of the log, we will
      be unable to free any space in the log, and recovery livelocks.
      
      Wengang confirmed this:
      
      "Now we have the second transaction which has 180208 log bytes reserved
      too. The second transaction is supposed to process intents including
      extent freeing.  With my hacking patch, I blocked the extent freeing 5
      hours. So in that 5 hours, 180208 (NOT 360416) log bytes are reserved.
      
      "With my test case, other transactions (update timestamps) then happen.
      As my hacking patch pins the journal tail, those timestamp-updating
      transactions finally use up (almost) all the left available log space
      (in memory in on disk).  And finally the on disk (and in memory)
      available log space goes down near to 180208 bytes.  Those 180208 bytes
      are reserved by [the] second (extent-free) transaction [in the chain]."
      
      Wengang and I noticed that EFI recovery starts a transaction, completes
      one step of the chain, and commits the transaction without completing
      any other steps of the chain.  Those subsequent steps are completed by
      xlog_finish_defer_ops, which allocates yet another transaction to
      finish the rest of the chain.  That transaction gets the same tr_logres
      as the head transaction, but with tr_logcount = 1 to force regranting
      with every roll to avoid livelocks.
      
      In other words, we already figured this out in commit 929b92f6
      ("xfs: xfs_defer_capture should absorb remaining transaction
      reservation"), but should have applied that logic to each intent item's
      recovery function.  For Wengang's case, the xfs_trans_alloc call in the
      EFI recovery function should only be asking for a single transaction's
      worth of log reservation -- 180K, not 360K.
      
      Quoting Wengang again:
      
      "With log recovery, during EFI recovery, we use tr_itruncate again to
      reserve two transactions that needs 360416 log bytes.  Reserving 360416
      bytes fails [stalls] because we now only have about 180208 available.
      
      "Actually during the EFI recover, we only need one transaction to free
      the extents just like the 2nd transaction at RUNTIME.  So it only needs
      to reserve 180208 rather than 360416 bytes.  We have (a bit) more than
      180208 available log bytes on disk, so [if we decrease the reservation
      to 180K] the reservation goes and the recovery [finishes].  That is to
      say: we can fix the log recover part to fix the issue. We can introduce
      a new xfs_trans_res xfs_mount->tr_ext_free
      
      {
        tr_logres = 175488,
        tr_logcount = 0,
        tr_logflags = 0,
      }
      
      "and use tr_ext_free instead of tr_itruncate in EFI recover."
      
      However, I don't think it quite makes sense to create an entirely new
      transaction reservation type to handle single-stepping during log
      recovery.  Instead, we should copy the transaction reservation
      information in the xfs_mount, change tr_logcount to 1, and pass that
      into xfs_trans_alloc.  We know this won't risk changing the min log size
      computation since we always ask for a fraction of the reservation for
      all known transaction types.
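
      A minimal sketch of that idea, under stated assumptions: struct
      trans_res is a simplified stand-in for the kernel's xfs_trans_res,
      PERM_LOG_RES stands in for XFS_TRANS_PERM_LOG_RES, and
      recovery_resv() is a hypothetical helper name.  It only shows the
      copy-and-override step described above, not the actual patch.

```c
#include <assert.h>

/* Simplified stand-in for the kernel's struct xfs_trans_res. */
struct trans_res {
	unsigned int	tr_logres;
	int		tr_logcount;
	int		tr_logflags;
};

#define PERM_LOG_RES	(1 << 0)	/* stand-in for XFS_TRANS_PERM_LOG_RES */

/*
 * Hypothetical helper: derive a recovery-time reservation from the
 * runtime one by copying it and asking for a single transaction's worth.
 */
static struct trans_res recovery_resv(const struct trans_res *runtime)
{
	struct trans_res resv = *runtime;	/* copy; the original stays intact */

	assert(runtime->tr_logcount >= 1);
	resv.tr_logcount = 1;			/* regrant on every roll */
	return resv;
}
```

      The per-mount reservation is left untouched, so runtime callers
      still see tr_logcount = 2 while recovery asks for one unit.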
      
      This looks like it's been lurking in the codebase since commit
      3d3c8b52, which changed the xfs_trans_reserve call in
      xlog_recover_process_efi to use the tr_logcount in tr_itruncate.
      That changed the EFI recovery transaction from making a
      non-XFS_TRANS_PERM_LOG_RES request for one transaction's worth of log
      space to an XFS_TRANS_PERM_LOG_RES request for two transactions' worth.
      
      Fixes: 3d3c8b52 ("xfs: refactor xfs_trans_reserve() interface")
      Complements: 929b92f6 ("xfs: xfs_defer_capture should absorb remaining transaction reservation")
      Suggested-by: Wengang Wang <wen.gang.wang@oracle.com>
      Cc: Srikanth C S <srikanth.c.s@oracle.com>
      [djwong: apply the same transformation to all log intent recovery]
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: fix log recovery when unknown rocompat bits are set · 74ad4693
      Darrick J. Wong authored
      Log recovery has always run on read only mounts, even where the primary
      superblock advertises unknown rocompat bits.  Due to a misunderstanding
      between Eric and Darrick back in 2018, we accidentally changed the
      superblock write verifier to shut down the fs over that exact scenario.
      As a result, the log cleaning that occurs at the end of the mounting
      process fails if there are unknown rocompat bits set.
      
      As we now allow writing of the superblock if there are unknown rocompat
      bits set on a RO mount, we no longer want to turn off RO state to allow
      log recovery to succeed on a RO mount.  Hence we also remove all the
      (now unnecessary) RO state toggling from the log recovery path.
      
      Fixes: 9e037cb7 ("xfs: check for unknown v5 feature bits in superblock write verifier")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: reload entire unlinked bucket lists · 83771c50
      Darrick J. Wong authored
      The previous patch to reload unrecovered unlinked inodes when adding a
      newly created inode to the unlinked list is missing a key piece of
      functionality.  It doesn't handle the case that someone calls xfs_iget
      on an inode that is not the last item in the incore list.  For example,
      if at mount time the ondisk iunlink bucket looks like this:
      
      AGI -> 7 -> 22 -> 3 -> NULL
      
      None of these three inodes are cached in memory.  Now let's say that
      someone tries to open inode 3 by handle.  We need to walk the list to
      make sure that inodes 7 and 22 get loaded cold, and that the
      i_prev_unlinked of inode 3 gets set to 22.
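
      The walk can be sketched with a toy model.  Everything here is an
      assumption made for illustration: the arrays stand in for ondisk
      and incore state, reload_bucket() is a hypothetical name, and no
      locking or buffer handling is shown.

```c
#include <assert.h>

#define NULLAGINO	0xffffffffu	/* end-of-list marker */
#define MAX_INO		32

/* Toy "ondisk" state: next_unlinked[ino] is the next inode in the bucket. */
static unsigned int next_unlinked[MAX_INO];

/* Toy "incore" state: which inodes are cached, and their back pointers. */
static int cached[MAX_INO];
static unsigned int prev_unlinked[MAX_INO];

/*
 * Walk the bucket from its head, loading every inode that is not yet
 * cached and recording each inode's predecessor.  Returns the number
 * of inodes that had to be loaded cold.
 */
static int reload_bucket(unsigned int head)
{
	unsigned int prev = NULLAGINO;
	unsigned int ino;
	int loaded = 0;

	for (ino = head; ino != NULLAGINO; ino = next_unlinked[ino]) {
		assert(ino < MAX_INO);
		if (!cached[ino]) {
			cached[ino] = 1;
			loaded++;
		}
		prev_unlinked[ino] = prev;
		prev = ino;
	}
	return loaded;
}
```

      For the AGI -> 7 -> 22 -> 3 -> NULL bucket, one walk loads all
      three inodes and leaves the back pointer of inode 3 set to 22.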
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: allow inode inactivation during a ro mount log recovery · 76e58901
      Darrick J. Wong authored
      In the next patch, we're going to prohibit log recovery if the primary
      superblock contains an unrecognized rocompat feature bit even on
      readonly mounts.  This requires removing all the code in the log
      mounting process that temporarily disables the readonly state.
      
      Unfortunately, inode inactivation disables itself on readonly mounts.
      Clearing the iunlinked lists after log recovery needs inactivation to
      run to free the unreferenced inodes, which (AFAICT) is the only reason
      why log mounting plays games with the readonly state in the first place.
      
      Therefore, change the inactivation predicates to allow inactivation
      during log recovery of a readonly mount.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: use i_prev_unlinked to distinguish inodes that are not on the unlinked list · f12b9668
      Darrick J. Wong authored
      Alter the definition of i_prev_unlinked slightly to make it more obvious
      when an inode with 0 link count is not part of the iunlink bucket lists
      rooted in the AGI.  This distinction is necessary because it is not
      sufficient to check inode.i_nlink to decide if an inode is on the
      unlinked list.  Updates to i_nlink can happen while holding only
      ILOCK_EXCL, but updates to an inode's position in the AGI unlinked list
      (which happen after the nlink update) require both ILOCK_EXCL and the
      AGI buffer lock.
      
      The next few patches will make it possible to reload an entire unlinked
      bucket list when we're walking the inode table or performing handle
      operations and need more than the ability to iget the last inode in the
      chain.
      
      The upcoming directory repair code also needs to be able to make this
      distinction to decide if a zero link count directory should be moved to
      the orphanage or allowed to inactivate.  An upcoming enhancement to the
      online AGI fsck code will need this distinction to check and rebuild the
      AGI unlinked buckets.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
  3. 11 Sep, 2023 6 commits
    • xfs: remove CPU hotplug infrastructure · ef7d9593
      Darrick J. Wong authored
      There are no users of the cpu hotplug hooks in xfs now, so remove them.
      This reverts f1653c2e ("xfs: introduce CPU hotplug
      infrastructure").
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: remove the all-mounts list · f5bfa695
      Darrick J. Wong authored
      Revert commit 0ed17f01 ("xfs: introduce all-mounts list for cpu
      hotplug notifications") because the cpu hotplug hooks are now pointless,
      so we don't need this list anymore.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: use per-mount cpumask to track nonempty percpu inodegc lists · 62334fab
      Darrick J. Wong authored
      Directly track which CPUs have contributed to the inodegc percpu lists
      instead of trusting the cpu online mask.  This eliminates a theoretical
      problem where the inodegc flush functions might fail to flush a CPU's
      inodes if that CPU happened to be dying at exactly the same time.  Most
      likely nobody's noticed this because the CPU dead hook moves the percpu
      inodegc list to another CPU and schedules that worker immediately.  But
      it's quite possible that this is a subtle race leading to UAF if the
      inodegc flush were part of an unmount.
      
      Further benefits: This reduces the overhead of the inodegc flush code
      slightly by allowing us to ignore CPUs that have empty lists.  Better
      yet, it reduces our dependence on the cpu online masks, which have been
      the cause of confusion and drama lately.
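
      The tracking scheme can be illustrated with a single-threaded toy.
      This is only a sketch under loud assumptions: struct gc_state,
      gc_queue() and gc_flush() are hypothetical names, a 64-bit word
      stands in for a real cpumask, and none of the atomics or per-cpu
      workers that the kernel needs are modelled.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS	64

/* Toy per-mount state: a private mask plus per-cpu queue lengths. */
struct gc_state {
	uint64_t	cpumask;		/* bit N set: CPU N queued something */
	int		queued[NR_CPUS];	/* toy percpu list lengths */
};

/* Queue one inode on a CPU's list and remember that the CPU contributed. */
static void gc_queue(struct gc_state *gc, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	gc->queued[cpu]++;
	gc->cpumask |= UINT64_C(1) << cpu;
}

/*
 * Flush every CPU recorded in the private mask, whether or not that CPU
 * is still online.  Returns the number of inodes flushed.
 */
static int gc_flush(struct gc_state *gc)
{
	int flushed = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!(gc->cpumask & (UINT64_C(1) << cpu)))
			continue;	/* this CPU never queued anything */
		flushed += gc->queued[cpu];
		gc->queued[cpu] = 0;
		gc->cpumask &= ~(UINT64_C(1) << cpu);
	}
	return flushed;
}
```

      Because the flush consults only the private mask, a CPU that has
      since gone offline is still visited, and CPUs with empty lists
      are skipped.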
      
      Fixes: ab23a776 ("xfs: per-cpu deferred inode inactivation queues")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: fix an agbno overflow in __xfs_getfsmap_datadev · cfa2df68
      Darrick J. Wong authored
      Dave Chinner reported that xfs/273 fails if the AG size happens to be an
      exact power of two.  I traced this to an agbno integer overflow when the
      current GETFSMAP call is a continuation of a previous GETFSMAP call, and
      the last record returned was non-shareable space at the end of an AG.
      
      __xfs_getfsmap_datadev sets up a data device query by converting the
      incoming fmr_physical into an xfs_fsblock_t and cracking it into an agno
      and agbno pair.  In the (failing) case where fmr_blockcount of the
      low key is nonzero and the record was for a non-shareable extent, it
      will add fmr_blockcount to start_fsb and info->low.rm_startblock.
      
      If the low key was actually the last record for that AG, then this
      addition causes info->low.rm_startblock to point beyond EOAG.  When the
      rmapbt range query starts, it'll return an empty set, and fsmap moves on
      to the next AG.
      
      Or so I thought.  Remember how we added to start_fsb?
      
      If agsize < 1<<agblklog, start_fsb points to the same AG as the original
      fmr_physical from the low key.  We run the rmapbt query, which returns
      nothing, so getfsmap zeroes info->low and moves on to the next AG.
      
      If agsize == 1<<agblklog, start_fsb now points to the next AG.  We run
      the rmapbt query on the next AG with the excessively large
      rm_startblock.  If this next AG is actually the last AG, we'll set
      info->high to EOFS (which now has a lower rm_startblock than
      info->low), and the ranged btree query code will return -EINVAL.  If
      it's not the last AG, we ignore all records for the intermediate AGs.
      
      Oops.
      
      Fix this by decoding start_fsb into agno and agbno only after making
      adjustments to start_fsb.  This means that info->low.rm_startblock will
      always be set to a valid agbno, and we always start the rmapbt iteration
      in the correct AG.
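
      The ordering problem can be shown with a toy model of the
      power-of-two case.  This is a sketch, not the kernel code: struct
      ag_addr, decode_fsb(), low_key_old() and low_key_fixed() are
      hypothetical names, and the model assumes an fsblock is simply
      (agno << agblklog) | agbno.

```c
#include <assert.h>
#include <stdint.h>

struct ag_addr {
	uint32_t	agno;
	uint32_t	agbno;
};

/* Crack an fsblock into its AG number and AG-relative block number. */
static struct ag_addr decode_fsb(uint64_t fsb, unsigned int agblklog)
{
	struct ag_addr a;

	assert(agblklog > 0 && agblklog < 32);
	a.agno = (uint32_t)(fsb >> agblklog);
	a.agbno = (uint32_t)(fsb & ((UINT64_C(1) << agblklog) - 1));
	return a;
}

/* Old order: decode first, then bump both start_fsb and the agbno. */
static struct ag_addr low_key_old(uint64_t start_fsb, uint32_t blockcount,
				  unsigned int agblklog)
{
	struct ag_addr a = decode_fsb(start_fsb, agblklog);

	start_fsb += blockcount;
	a.agbno += blockcount;				/* can run past EOAG */
	a.agno = (uint32_t)(start_fsb >> agblklog);	/* AG that gets queried */
	return a;
}

/* Fixed order: adjust start_fsb, then decode it once. */
static struct ag_addr low_key_fixed(uint64_t start_fsb, uint32_t blockcount,
				    unsigned int agblklog)
{
	return decode_fsb(start_fsb + blockcount, agblklog);
}
```

      With agblklog = 4 (a 16-block AG) and a last record covering
      blocks 12-15 of AG 1, the old order queries AG 2 starting at
      agbno 16, which lies beyond the end of the AG; the fixed order
      queries AG 2 starting at agbno 0.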
      
      While we're at it, fix the predicate for determining if an fsmap record
      represents non-shareable space to include file data on pre-reflink
      filesystems.
      Reported-by: Dave Chinner <david@fromorbit.com>
      Fixes: 63ef7a35 ("xfs: fix interval filtering in multi-step fsmap queries")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: fix per-cpu CIL structure aggregation racing with dying cpus · ecd49f7a
      Darrick J. Wong authored
      In commit 7c8ade21 ("xfs: implement percpu cil space used
      calculation"), the XFS committed (log) item list code was converted to
      use per-cpu lists and space tracking to reduce cpu contention when
      multiple threads are modifying different parts of the filesystem and
      hence end up contending on the log structures during transaction commit.
      Each CPU tracks its own commit items and space usage, and these do not
      have to be merged into the main CIL until either someone wants to push
      the CIL items, or we run over a soft threshold and switch to slower (but
      more accurate) accounting with atomics.
      
      Unfortunately, the for_each_cpu iteration suffers from the same race
      with cpu dying problem that was identified in commit 8b57b11c
      ("pcpcntrs: fix dying cpu summation race") -- CPUs are removed from
      cpu_online_mask before the CPUHP_XFS_DEAD callback gets called.  As a
      result, both CIL percpu structure aggregation functions fail to collect
      the items and accounted space usage at the correct point in time.
      
      If we're lucky, the items that are collected from the online cpus exceed
      the space given to those cpus, and the log immediately shuts down in
      xlog_cil_insert_items due to the (apparent) log reservation overrun.
      This happens periodically with generic/650, which exercises cpu hotplug
      vs. the filesystem code:
      
      smpboot: CPU 3 is now offline
      XFS (sda3): ctx ticket reservation ran out. Need to up reservation
      XFS (sda3): ticket reservation summary:
      XFS (sda3):   unit res    = 9268 bytes
      XFS (sda3):   current res = -40 bytes
      XFS (sda3):   original count  = 1
      XFS (sda3):   remaining count = 1
      XFS (sda3): Filesystem has been shut down due to log error (0x2).
      
      Applying the same sort of fix from 8b57b11c to the CIL code seems
      to make the generic/650 problem go away, but I've been told that tglx
      was not happy when he saw:
      
      "...the only thing we actually need to care about is that
      percpu_counter_sum() iterates dying CPUs. That's trivial to do, and when
      there are no CPUs dying, it has no addition overhead except for a
      cpumask_or() operation."
      
      The CPU hotplug code is rather complex and difficult to understand and I
      don't want to try to understand the cpu hotplug locking well enough to
      use cpu_dying mask.  Furthermore, there's a performance improvement that
      could be had here.  Attach a private cpu mask to the CIL structure so
      that we can track exactly which cpus have accessed the percpu data at
      all.  It doesn't matter if the cpu has since gone offline; log item
      aggregation will still find the items.  Better yet, we skip cpus that
      have not recently logged anything.
      
      Worse yet, Ritesh Harjani and Eric Sandeen both reported today that CPU
      hot remove racing with an xfs mount can crash if the cpu_dead notifier
      tries to access the log but the mount hasn't yet set up the log.
      
      Link: https://lore.kernel.org/linux-xfs/ZOLzgBOuyWHapOyZ@dread.disaster.area/T/
      Link: https://lore.kernel.org/lkml/877cuj1mt1.ffs@tglx/
      Link: https://lore.kernel.org/lkml/20230414162755.281993820@linutronix.de/
      Link: https://lore.kernel.org/linux-xfs/ZOVkjxWZq0YmjrJu@dread.disaster.area/T/
      Cc: tglx@linutronix.de
      Cc: peterz@infradead.org
      Reported-by: ritesh.list@gmail.com
      Reported-by: sandeen@sandeen.net
      Fixes: af1c2146 ("xfs: introduce per-cpu CIL tracking structure")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: fix select in config XFS_ONLINE_SCRUB_STATS · 57c0f4a8
      Lukas Bulwahn authored
      Commit d7a74cad ("xfs: track usage statistics of online fsck")
      introduces config XFS_ONLINE_SCRUB_STATS, which selects the non-existing
      config FS_DEBUG. It is probably intended to select the existing config
      XFS_DEBUG.
      
      Fix the select in config XFS_ONLINE_SCRUB_STATS.
      
      Fixes: d7a74cad ("xfs: track usage statistics of online fsck")
      Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
  4. 10 Sep, 2023 6 commits
    • Linux 6.6-rc1 · 0bb80ecc
      Linus Torvalds authored
    • Merge tag 'topic/drm-ci-2023-08-31-1' of git://anongit.freedesktop.org/drm/drm · 1548b060
      Linus Torvalds authored
      Pull drm ci scripts from Dave Airlie:
       "This is a bunch of ci integration for the freedesktop gitlab instance
        where we currently do upstream userspace testing on diverse sets of
        GPU hardware. From my perspective I think it's an experiment worth
        going with and seeing how the benefits/noise playout keeping these
        files useful.
      
        Ideally I'd like to get this so we can do pre-merge testing on PRs
        eventually.
      
        Below is some info from danvet on why we've ended up making the
        decision and how we can roll it back if we decide it was a bad plan.
      
        Why in upstream?
      
         - like documentation, testcases, tools CI integration is one of these
           things where you can waste endless amounts of time if you
           accidentally have a version that doesn't match your source code
      
         - but also like the above, there's a balance, this is the initial cut
           of what we think makes sense to keep in sync vs out-of-tree,
           probably needs adjustment
      
         - gitlab supports out-of-repo gitlab integration and that's what's
           been used for the kernel in drm, but it results in per-driver
           fragmentation and lots of duplicated effort. the simple act of
           smashing an arbitrary winner into a topic branch already started
           surfacing patches on dri-devel and sparking good cross driver team
           discussions
      
        Why gitlab?
      
         - it's not any more shit than any of the other CI
      
         - drm userspace uses it extensively for everything in userspace, we
           have a lot of people and experience with this, including
           integration of hw testing labs
      
         - media userspace like gstreamer is also on gitlab.fd.o, and there's
           discussion to extend this to the media subsystem in some fashion
      
        Can this be shared?
      
         - there's definitely a pile of code that could move to scripts/ if
           other subsystem adopt ci integration in upstream kernel git. other
           bits are more drm/gpu specific like the igt-gpu-tests/tools
           integration
      
         - docker images can be run locally or in other CI runners
      
        Will we regret this?
      
         - it's all in one directory, intentionally, for easy deletion
      
         - probably 1-2 years in upstream to see whether this is worth it or a
           Big Mistake. that's roughly what it took to _really_ roll out solid
           CI in the bigger userspace projects we have on gitlab.fd.o like
           mesa3d"
      
      * tag 'topic/drm-ci-2023-08-31-1' of git://anongit.freedesktop.org/drm/drm:
        drm: ci: docs: fix build warning - add missing escape
        drm: Add initial ci/ subdirectory
    • Merge tag 'x86-urgent-2023-09-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · e56b2b60
      Linus Torvalds authored
      Pull x86 fixes from Ingo Molnar:
       "Fix preemption delays in the SGX code, remove unnecessarily
        UAPI-exported code, fix a ld.lld linker (in)compatibility quirk and
        make the x86 SMP init code a bit more conservative to fix kexec()
        lockups"
      
      * tag 'x86-urgent-2023-09-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/sgx: Break up long non-preemptible delays in sgx_vepc_release()
        x86: Remove the arch_calc_vm_prot_bits() macro from the UAPI
        x86/build: Fix linker fill bytes quirk/incompatibility for ld.lld
        x86/smp: Don't send INIT to non-present and non-booted CPUs
    • Merge tag 'perf-urgent-2023-09-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · e79dbf03
      Linus Torvalds authored
      Pull x86 perf event fix from Ingo Molnar:
       "Work around a firmware bug in the uncore PMU driver, affecting certain
        Intel systems"
      
      * tag 'perf-urgent-2023-09-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf/x86/uncore: Correct the number of CHAs on EMR
    • Merge tag 'perf-tools-for-v6.6-1-2023-09-05' of... · 535a265d
      Linus Torvalds authored
      Merge tag 'perf-tools-for-v6.6-1-2023-09-05' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools
      
      Pull perf tools updates from Arnaldo Carvalho de Melo:
       "perf tools maintainership:
      
         - Add git information for perf-tools and perf-tools-next trees and
           branches to the MAINTAINERS file. That is where development now
           takes place and myself and Namhyung Kim have write access, more
           people to come as we emulate other maintainer groups.
      
        perf record:
      
         - Record kernel data maps when 'perf record --data' is used, so that
           global variables can be resolved and used in tools that do data
           profiling.
      
        perf trace:
      
         - Remove the old, experimental support for BPF events in which a .c
           file was passed as an event: "perf trace -e hello.c" to then get
           compiled and loaded.
      
           The only known usage for that, that shipped with the kernel as an
           example for such events, augmented the raw_syscalls tracepoints and
           was converted to a libbpf skeleton, reusing all the user space
           components and the BPF code connected to the syscalls.
      
           In the end just the way to glue the BPF part and the user space
           type beautifiers changed, now being performed by libbpf skeletons.
      
           The next step is to use BTF to do pretty printing of all syscall
           types, as discussed with Alan Maguire and others.
      
           Now, on a perf built with BUILD_BPF_SKEL=1 we get most if not all
           path/filenames/strings, some of the networking data structures,
           perf_event_attr, etc, i.e. systemwide tracing of nanosleep calls
           and perf_event_open syscalls while 'perf stat' runs 'sleep' for 5
           seconds:
      
            # perf trace -a -e *nanosleep,perf* perf stat -e cycles,instructions sleep 5
               0.000 (   9.034 ms): perf/327641 perf_event_open(attr_uptr: { type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0 (PERF_COUNT_HW_CPU_CYCLES), sample_type: IDENTIFIER, read_format: TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING, disabled: 1, inherit: 1, enable_on_exec: 1, exclude_guest: 1 }, pid: 327642 (perf), cpu: -1, group_fd: -1, flags: FD_CLOEXEC) = 3
               9.039 (   0.006 ms): perf/327641 perf_event_open(attr_uptr: { type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0x1 (PERF_COUNT_HW_INSTRUCTIONS), sample_type: IDENTIFIER, read_format: TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING, disabled: 1, inherit: 1, enable_on_exec: 1, exclude_guest: 1 }, pid: 327642 (perf-exec), cpu: -1, group_fd: -1, flags: FD_CLOEXEC) = 4
                   ? (           ): gpm/991  ... [continued]: clock_nanosleep())               = 0
              10.133 (           ): sleep/327642 clock_nanosleep(rqtp: { .tv_sec: 5, .tv_nsec: 0 }, rmtp: 0x7ffd36f83ed0) ...
                   ? (           ): pool-gsd-smart/3051  ... [continued]: clock_nanosleep())   = 0
              30.276 (           ): gpm/991 clock_nanosleep(rqtp: { .tv_sec: 2, .tv_nsec: 0 }, rmtp: 0x7ffcc6f73710) ...
             223.215 (1000.430 ms): pool-gsd-smart/3051 clock_nanosleep(rqtp: { .tv_sec: 1, .tv_nsec: 0 }, rmtp: 0x7f6e7fffec90) = 0
              30.276 (2000.394 ms): gpm/991  ... [continued]: clock_nanosleep())               = 0
            1230.814 (           ): pool-gsd-smart/3051 clock_nanosleep(rqtp: { .tv_sec: 1, .tv_nsec: 0 }, rmtp: 0x7f6e7fffec90) ...
            1230.814 (1000.404 ms): pool-gsd-smart/3051  ... [continued]: clock_nanosleep())   = 0
            2030.886 (           ): gpm/991 clock_nanosleep(rqtp: { .tv_sec: 2, .tv_nsec: 0 }, rmtp: 0x7ffcc6f73710) ...
            2237.709 (1000.153 ms): pool-gsd-smart/3051 clock_nanosleep(rqtp: { .tv_sec: 1, .tv_nsec: 0 }, rmtp: 0x7f6e7fffec90) = 0
                   ? (           ): crond/1172  ... [continued]: clock_nanosleep())            = 0
            3242.699 (           ): pool-gsd-smart/3051 clock_nanosleep(rqtp: { .tv_sec: 1, .tv_nsec: 0 }, rmtp: 0x7f6e7fffec90) ...
            2030.886 (2000.385 ms): gpm/991  ... [continued]: clock_nanosleep())               = 0
            3728.078 (           ): crond/1172 clock_nanosleep(rqtp: { .tv_sec: 60, .tv_nsec: 0 }, rmtp: 0x7ffe0971dcf0) ...
            3242.699 (1000.158 ms): pool-gsd-smart/3051  ... [continued]: clock_nanosleep())   = 0
            4031.409 (           ): gpm/991 clock_nanosleep(rqtp: { .tv_sec: 2, .tv_nsec: 0 }, rmtp: 0x7ffcc6f73710) ...
              10.133 (5000.375 ms): sleep/327642  ... [continued]: clock_nanosleep())          = 0
      
            Performance counter stats for 'sleep 5':
      
                   2,617,347      cycles
                   1,855,997      instructions                     #    0.71  insn per cycle
      
                 5.002282128 seconds time elapsed
      
                 0.000855000 seconds user
                 0.000852000 seconds sys
      
        perf annotate:
      
         - Building with binutils' libopcodes is now opt-in
           (BUILD_NONDISTRO=1) for licensing reasons, and we missed a build
           test for it in the tools/perf/tests makefile.
      
           Since we now default to NDEBUG=1, we ended up segfaulting when
           building with BUILD_NONDISTRO=1 because a needed initialization
           routine was being "error checked" via an assert.
      
           Fix it by explicitly checking the result and aborting instead if it
           fails.
      
           We should propagate the error back to the caller instead, but at
           least 'perf annotate' on samples collected for a BPF program works
           again when perf is built with BUILD_NONDISTRO=1.
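
           A minimal sketch of the failure mode described above, not the
           actual perf code: init_disassembler() and the setup_*() helpers
           are hypothetical names, and release_assert() is a local macro
           that mimics how assert() expands when NDEBUG is defined, so the
           effect shows up no matter how the example is compiled:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Mimics assert() under NDEBUG: the expression is never evaluated. */
#define release_assert(expr) ((void)0)

/* Hypothetical state that the initialization routine is meant to set. */
static bool disasm_ready;

/* Hypothetical stand-in for the needed initialization; 0 means success. */
int init_disassembler(void)
{
	disasm_ready = true;
	return 0;
}

/* Buggy pattern: the call sits inside the assert and vanishes with it. */
bool setup_buggy(void)
{
	disasm_ready = false;
	release_assert(init_disassembler() == 0);
	return disasm_ready;	/* still false: init never ran */
}

/* Fixed pattern: always make the call, check the result, abort on failure. */
bool setup_fixed(void)
{
	disasm_ready = false;
	if (init_disassembler() != 0) {
		fprintf(stderr, "disassembler initialization failed\n");
		abort();
	}
	assert(disasm_ready);	/* side-effect-free checks are fine in an assert */
	return disasm_ready;
}
```

           In this sketch setup_buggy() leaves the state uninitialized,
           which is the kind of condition that can later show up as a
           segfault, while setup_fixed() always runs the initialization.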
      
        perf report/top:
      
         - Add back TUI hierarchy mode header, that is seen when using 'perf
           report/top --hierarchy'.
      
         - Fix the number of entries for 'e' key in the TUI that was
           preventing navigation of lines when expanding an entry.
      
        perf report/script:
      
         - Support cross platform register handling, allowing a perf.data file
           collected on one architecture to have registers sampled correctly
           displayed when analysis tools such as 'perf report' and 'perf
           script' are used on a different architecture.
      
         - Fix handling of event attributes in pipe mode, i.e. when one uses:
      
        	perf record -o - | perf report -i -
      
           When no perf.data files are used.
      
         - Handle data generated via pipe mode by one version of perf and
           then read, also via pipe mode, by a different version of perf,
           where the event attr record may have changed: use the record size
           field to properly support this version mismatch.
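
           The size-field technique mentioned above can be sketched as
           follows. This is only an illustration under assumed layouts:
           struct attr_v1, struct attr_v2 and read_attr() are hypothetical
           and are not perf's real record structures or API. The reader
           copies at most as many bytes as it understands and zero-fills
           whatever an older writer did not provide:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record as written by an older tool. */
struct attr_v1 {
	uint32_t size;		/* total record size in bytes */
	uint32_t type;
};

/* Hypothetical record as understood by the current reader. */
struct attr_v2 {
	uint32_t size;
	uint32_t type;
	uint64_t config;	/* field added in the newer version */
};

/*
 * Copy a record of any version into the reader's struct.
 * Returns 0 on success, -1 if the size field is missing or inconsistent
 * with the number of bytes available.
 */
int read_attr(const void *buf, size_t buf_len, struct attr_v2 *out)
{
	uint32_t rec_size;
	size_t n;

	if (buf_len < sizeof(rec_size))
		return -1;
	memcpy(&rec_size, buf, sizeof(rec_size));
	if (rec_size < sizeof(rec_size) || rec_size > buf_len)
		return -1;

	/* Fields the writer did not know about read as zero. */
	memset(out, 0, sizeof(*out));
	/* Trailing fields the reader does not know about are ignored. */
	n = rec_size < sizeof(*out) ? rec_size : sizeof(*out);
	memcpy(out, buf, n);
	return 0;
}

/* Usage: a record written by a tool that only knew 'size' and 'type'. */
int demo_old_record(void)
{
	struct attr_v1 old = { .size = sizeof(struct attr_v1), .type = 7 };
	struct attr_v2 cur;

	if (read_attr(&old, sizeof(old), &cur) != 0)
		return -1;
	assert(cur.type == 7 && cur.config == 0);
	return 0;
}
```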
      
        perf probe:
      
         - Accessing global variables from uprobes isn't supported, make the
           error message state that instead of stating that some minimal
           kernel version is needed to have that feature. This seems just a
           tool limitation, the kernel probably has all that is needed.
      
        perf tests:
      
         - Fix a reference count related leak in the dlfilter v0 API where the
           result of a thread__find_symbol_fb() is not matched with an
           addr_location__exit() to drop the reference counts of the resolved
           components (machine, thread, map, symbol, etc). Add a dlfilter test
           to make sure this doesn't regress.
      
         - Lots of fixes for the 'perf test' written in shell script related
           to problems found with the shellcheck utility.
      
         - Fixes for 'perf test' shell scripts testing features enabled when
           perf is built with BUILD_BPF_SKEL=1, such as 'perf stat' bpf
           counters.
      
         - Add perf record sample filtering test, things like the following
           example, that gets implemented as a BPF filter attached to the
           event:
      
             # perf record -e task-clock -c 10000 --filter 'ip < 0xffffffff00000000'
      
         - Improve the way the task_analyzer test checks if libtraceevent is
           linked, using 'perf version --build-options' instead of the more
           expensive 'perf record -e "sched:sched_switch"'.
      
         - Add support for riscv in the mmap-basic test. (This went as well
           via the RiscV tree, same contents).
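
           The reference count leak fixed in the dlfilter v0 API test (first
           item of this list) follows a common pattern, roughly sketched
           here. All names are hypothetical, heavily simplified stand-ins
           and not perf's real structures or functions: a lookup takes a
           reference, and only the matching exit call drops it again.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified refcounted object. */
struct map {
	int refcnt;
};

/* Hypothetical result of resolving an address. */
struct addr_location {
	struct map *map;
};

/* Resolving an address takes a reference on what it found. */
void addr_location_resolve(struct addr_location *al, struct map *m)
{
	m->refcnt++;
	al->map = m;
}

/* The matching exit drops that reference; skipping it leaks the map. */
void addr_location_exit(struct addr_location *al)
{
	if (al->map) {
		assert(al->map->refcnt > 0);
		al->map->refcnt--;
		al->map = NULL;
	}
}

/* Returns the map's refcount after one resolve, with or without the exit. */
int refcnt_after_lookup(int call_exit)
{
	struct map m = { .refcnt = 1 };	/* one reference held by the owner */
	struct addr_location al = { .map = NULL };

	addr_location_resolve(&al, &m);
	if (call_exit)
		addr_location_exit(&al);
	return m.refcnt;
}
```

           With the exit call the count returns to the owner's single
           reference; without it one reference stays behind, which is the
           leak a regression test can check for.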
      
        libperf:
      
         - Implement riscv mmap support (This went as well via the RiscV tree,
           same contents).
      
        perf script:
      
         - New tool that converts perf.data files to the firefox profiler
           format so that one can use the visualizer at
           https://profiler.firefox.com/. Done by Anup Sharma as part of this
           year's Google Summer of Code.
      
           One can generate the output and upload it to the web interface but
           Anup also automated everything:
      
             perf script gecko -F 99 -a sleep 60
      
         - Support syscall name parsing on arm64.
      
         - Print "cgroup" field on the same line as "comm".
      
        perf bench:
      
         - Add new 'uprobe' benchmark to measure the overhead of uprobes
           with/without BPF programs attached to it.
      
         - breakpoints are not available on power9, skip that test.
      
        perf stat:
      
         - Add #num_cpus_online literal to be used in 'perf stat' metrics, and
           add this extra 'perf test' check that exemplifies its purpose:
      
        	TEST_ASSERT_VAL("#num_cpus_online",
                               expr__parse(&num_cpus_online, ctx, "#num_cpus_online") == 0);
        	TEST_ASSERT_VAL("#num_cpus", expr__parse(&num_cpus, ctx, "#num_cpus") == 0);
        	TEST_ASSERT_VAL("#num_cpus >= #num_cpus_online", num_cpus >= num_cpus_online);
      
        Miscellaneous:
      
         - Improve tool startup time by lazily reading PMU, JSON, sysfs data.
      
         - Improve error reporting in the parsing of events, passing YYLTYPE
           to error routines, so that the output can show where the parsing
           error was found.
      
         - Add 'perf test' entries to check the parsing of events
           improvements.
      
         - Fix various leaks detected by -fsanitize=address, mostly of
           things that would be freed at tool exit anyway, including:
      
             - Free evsel->filter on the destructor.
      
             - Allow tools to register a thread->priv destructor and use it in
               'perf trace'.
      
             - Free evsel->priv in 'perf trace'.
      
             - Free string returned by synthesize_perf_probe_point() when the
               caller fails to do all it needs.
      
         - Adjust various compiler options so that some warnings are not
           treated as errors when building with broken headers found in
           things like python, flex and bison, as we otherwise build with
           -Werror. Some are for gcc, some for clang, some for specific
           versions of those, some for specific versions of flex or bison,
           or some specific combination of these components, bah.
      
         - Allow customization of clang options for the BPF target. This
           helps building on gentoo where, among other oddities, BPF targets
           get passed some compiler options intended for the native build,
           so building with WERROR=0 helps while these oddities are fixed.
      
         - Don't pass ERR_PTR() values to perf_session__delete() in 'perf top'
           and 'perf lock', fixing some segfaults when handling some odd
           failures.
      
         - Add LTO build option.
      
         - Fix format of unordered lists in the perf docs
           (tools/perf/Documentation)
      
         - Overhaul the bison files, using constructs such as YYNOMEM.
      
         - Remove unused tokens from the bison .y files.
      
         - Add more comments to various structs.
      
         - A few LoongArch enablement patches.
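
        The ERR_PTR() item in the list above concerns a pattern that can be
        sketched like this. The three helpers are simplified
        re-implementations of the kernel-style error pointer helpers for
        illustration, and struct session, session_new(), session_delete()
        and run_tool() are hypothetical names, not perf's real API:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Simplified error pointer helpers: encode a negative errno in a pointer. */
static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical session object. */
struct session {
	int fd;
};

/* Returns a valid session or an encoded error, never NULL. */
struct session *session_new(int fail)
{
	struct session *s;

	if (fail)
		return ERR_PTR(-ENOMEM);
	s = calloc(1, sizeof(*s));
	return s ? s : ERR_PTR(-ENOMEM);
}

/* Destructor: handing it an encoded error would free a bogus address. */
void session_delete(struct session *s)
{
	assert(!IS_ERR(s));
	free(s);
}

/* Check for the encoded error before reaching the cleanup path. */
int run_tool(int fail)
{
	struct session *s = session_new(fail);

	if (IS_ERR(s))
		return (int)PTR_ERR(s);
	session_delete(s);
	return 0;
}
```

        Here run_tool() returns the negative errno on failure and never
        hands the encoded error to the destructor.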
      
        Vendor events (JSON):
      
         - Add JSON metrics for Yitian 710 DDR (aarch64). Things like:
      
        	EventName, BriefDescription
        	visible_window_limit_reached_rd, "At least one entry in read queue reaches the visible window limit.",
        	visible_window_limit_reached_wr, "At least one entry in write queue reaches the visible window limit.",
        	op_is_dqsosc_mpc	       , "A DQS Oscillator MPC command to DRAM.",
        	op_is_dqsosc_mrr	       , "A DQS Oscillator MRR command to DRAM.",
        	op_is_tcr_mrr		       , "A Temperature Compensated Refresh(TCR) MRR command to DRAM.",
      
         - Add AmpereOne metrics (aarch64).
      
         - Update N2 and V2 metrics (aarch64) and events using Arm telemetry
           repo.
      
         - Update scale units and descriptions of common topdown metrics on
           aarch64. Things like:
             - "MetricExpr": "stall_slot_frontend / (#slots * cpu_cycles)",
             - "BriefDescription": "Frontend bound L1 topdown metric",
             + "MetricExpr": "100 * (stall_slot_frontend / (#slots * cpu_cycles))",
             + "BriefDescription": "This metric is the percentage of total slots that were stalled due to resource constraints in the frontend of the processor.",
      
         - Update events for intel: meteorlake to 1.04, sapphirerapids to
           1.15, Icelake+ metric constraints.
      
         - Update files for the power10 platform"
      
      * tag 'perf-tools-for-v6.6-1-2023-09-05' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: (217 commits)
        perf parse-events: Fix driver config term
        perf parse-events: Fixes relating to no_value terms
        perf parse-events: Fix propagation of term's no_value when cloning
        perf parse-events: Name the two term enums
        perf list: Don't print Unit for "default_core"
        perf vendor events intel: Fix modifier in tma_info_system_mem_parallel_reads for skylake
        perf dlfilter: Avoid leak in v0 API test use of resolve_address()
        perf metric: Add #num_cpus_online literal
        perf pmu: Remove str from perf_pmu_alias
        perf parse-events: Make common term list to strbuf helper
        perf parse-events: Minor help message improvements
        perf pmu: Avoid uninitialized use of alias->str
        perf jevents: Use "default_core" for events with no Unit
        perf test stat_bpf_counters_cgrp: Enhance perf stat cgroup BPF counter test
        perf test shell stat_bpf_counters: Fix test on Intel
        perf test shell record_bpf_filter: Skip 6.2 kernel
        libperf: Get rid of attr.id field
        perf tools: Convert to perf_record_header_attr_id()
        libperf: Add perf_record_header_attr_id()
        perf tools: Handle old data in PERF_RECORD_ATTR
        ...
      535a265d
    • Linus Torvalds's avatar
      Merge tag '6.6-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6 · fd3a5940
      Linus Torvalds authored
      Pull smb client fixes from Steve French:
      
       - six smb3 client fixes including ones to allow controlling smb3
         directory caching timeout and limits, and one debugging improvement
      
       - one fix for nls Kconfig (don't need to expose NLS_UCS2_UTILS option)
      
       - one minor spnego registry update
      
      * tag '6.6-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
        spnego: add missing OID to oid registry
        smb3: fix minor typo in SMB2_GLOBAL_CAP_LARGE_MTU
        cifs: update internal module version number for cifs.ko
        smb3: allow controlling maximum number of cached directories
        smb3: add trace point for queryfs (statfs)
        nls: Hide new NLS_UCS2_UTILS
        smb3: allow controlling length of time directory entries are cached with dir leases
        smb: propagate error code of extract_sharename()
      fd3a5940
  5. 09 Sep, 2023 15 commits