1. 22 Feb, 2024 22 commits
    • xfs: teach repair to fix file nlinks · 6b631c60
      Darrick J. Wong authored
      Fix the file link counts since we just computed the correct ones.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: track directory entry updates during live nlinks fsck · 86a1746e
      Darrick J. Wong authored
      Create the necessary hooks in the directory operations
      (create/link/unlink/rename) code so that our live nlink scrub code can
      stay up to date with link count updates in the rest of the filesystem.
      This will be the means to keep our shadow link count information up to
      date while the scan runs in real time.
      
      In online fsck part 2, we'll use these same hooks to handle repairs
      to directories and parent pointer information.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: teach scrub to check file nlinks · f1184081
      Darrick J. Wong authored
      Create the necessary scrub code to walk the filesystem's directory tree
      so that we can compute file link counts.  Similar to quotacheck, we
      create an incore shadow array of link count information and then we walk
      the filesystem a second time to compare the link counts.  We need live
      updates to keep the information up to date during the lengthy scan, so
      this scrubber remains disabled until the next patch.
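
      As a rough userspace illustration of the two-pass idea (not the kernel
      code), the sketch below counts the dirents pointing at each inode into
      a shadow array and then compares that array against the recorded link
      counts.  The names dirent_rec, count_links, and compare_links are
      hypothetical, and the sketch ignores details such as dot and dotdot
      entries and live updates.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical directory entry: only the target inode number matters here. */
struct dirent_rec {
	uint64_t target_ino;
};

/*
 * Pass 1: count the dirents pointing at each inode.  The caller zeroes
 * shadow[] first; out-of-range targets are ignored in this sketch.
 */
static void count_links(const struct dirent_rec *ents, size_t nr_ents,
			uint32_t *shadow, size_t nr_inodes)
{
	for (size_t i = 0; i < nr_ents; i++) {
		if (ents[i].target_ino < nr_inodes)
			shadow[ents[i].target_ino]++;
	}
}

/* Pass 2: return how many inodes have a recorded nlink unlike the shadow. */
static size_t compare_links(const uint32_t *recorded, const uint32_t *shadow,
			    size_t nr_inodes)
{
	size_t bad = 0;

	for (size_t i = 0; i < nr_inodes; i++) {
		if (recorded[i] != shadow[i])
			bad++;
	}
	return bad;
}
```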
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: report health of inode link counts · 93687ee2
      Darrick J. Wong authored
      Report on the health of the inode link counts.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: repair dquots based on live quotacheck results · 96ed2ae4
      Darrick J. Wong authored
      Use the shadow quota counters that live quotacheck creates to reset the
      incore dquot counters.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: repair cannot update the summary counters when logging quota flags · 7038c6e5
      Darrick J. Wong authored
      While running xfs/804 (quota repairs racing with fsstress), I observed a
      filesystem shutdown in the primary sb write verifier:
      
      run fstests xfs/804 at 2022-05-23 18:43:48
      XFS (sda4): Mounting V5 Filesystem
      XFS (sda4): Ending clean mount
      XFS (sda4): Quotacheck needed: Please wait.
      XFS (sda4): Quotacheck: Done.
      XFS (sda4): EXPERIMENTAL online scrub feature in use. Use at your own risk!
      XFS (sda4): SB ifree sanity check failed 0xb5 > 0x80
      XFS (sda4): Metadata corruption detected at xfs_sb_write_verify+0x5e/0x100 [xfs], xfs_sb block 0x0
      XFS (sda4): Unmount and run xfs_repair
      
      The "SB ifree sanity check failed" message was a debugging printk that I
      added to the kernel; observe that 0xb5 - 0x80 = 53, which is less than
      one inode chunk.
      
      I traced this to the xfs_log_sb calls from the online quota repair code,
      which tries to clear the CHKD flags from the superblock to force a
      mount-time quotacheck if the repair fails.  On a V5 filesystem,
      xfs_log_sb updates the ondisk sb summary counters with the current
      contents of the percpu counters.  This is done without quiescing other
      writer threads, which means it could be racing with a thread that has
      updated icount and is about to update ifree.
      
      If the other write thread had incremented ifree before updating icount,
      the repair thread will write ifree > icount into the logged update.  If
      the AIL writes the logged superblock back to disk before anyone else
      fixes this situation, this will lead to a write verifier failure, which
      causes a filesystem shutdown.
      
      Resolve this problem by updating the quota flags and calling
      xfs_sb_to_disk directly, which does not touch the percpu counters.
      While we're at it, we can elide the entire update if the selected qflags
      aren't set.
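
      The check that tripped in the log above, and the arithmetic on the two
      logged values, can be sketched in userspace as follows.  The function
      names are hypothetical; the only assumption about XFS is that inodes
      are allocated in chunks of 64.

```c
#include <stdbool.h>
#include <stdint.h>

#define INODES_PER_CHUNK 64	/* assumed inode chunk size */

/* The invariant behind the log message: ifree must not exceed icount. */
static bool sb_counters_sane(uint64_t icount, uint64_t ifree)
{
	return ifree <= icount;
}

/* How far ifree overshoots icount (0 when the counters are sane). */
static uint64_t ifree_excess(uint64_t icount, uint64_t ifree)
{
	return ifree > icount ? ifree - icount : 0;
}
```

      With the logged values (icount 0x80, ifree 0xb5) the check fails and
      the excess is 53, which is smaller than one chunk, as one would expect
      from sampling the counters partway through a single chunk-sized update.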
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: track quota updates during live quotacheck · 20049187
      Darrick J. Wong authored
      Create a shadow dqtrx system in the quotacheck code that hooks the
      regular dquot counter update code.  This will be the means to keep our
      copy of the dquot counters up to date while the scan runs in real time.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: implement live quotacheck inode scan · 48dd9117
      Darrick J. Wong authored
      Create a new trio of scrub functions to check quota counters.  While the
      dquots themselves are filesystem metadata and should be checked early,
      the dquot counter values are computed from other metadata and are
      therefore summary counters.  We don't plug these into the scrub dispatch
      just yet, because we still need to be able to watch quota updates while
      doing our scan.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: create a sparse load xfarray function · 5a3ab584
      Darrick J. Wong authored
      Create a new method to load an xfarray element from the xfile, but with
      a twist.  If we've never stored to the array index, zero the caller's
      buffer.  This will facilitate RMW updates of records in a sparse array
      without fuss, since the sparse xfarray convention is that uninitialized
      array elements default to zeroes.
      
      This is a separate patch to reduce the size of the upcoming quotacheck
      patch.
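
      The sparse-load convention can be illustrated with a toy in-memory
      array; this is a sketch, not the xfarray implementation.  The
      sparse_array type and the sparse_store/sparse_load functions are
      hypothetical stand-ins, with a fixed slot count in place of an xfile.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SLOTS		8
#define MAX_OBJ_SIZE	32

/* Toy sparse array: remembers which slots have ever been stored to. */
struct sparse_array {
	size_t		obj_size;	/* must be <= MAX_OBJ_SIZE */
	bool		stored[SLOTS];
	unsigned char	data[SLOTS][MAX_OBJ_SIZE];
};

/* Copy one record into slot @idx and mark the slot as initialized. */
static int sparse_store(struct sparse_array *a, size_t idx, const void *src)
{
	if (idx >= SLOTS || a->obj_size > MAX_OBJ_SIZE)
		return -1;
	memcpy(a->data[idx], src, a->obj_size);
	a->stored[idx] = true;
	return 0;
}

/* Load slot @idx; a slot that was never stored reads back as zeroes. */
static int sparse_load(const struct sparse_array *a, size_t idx, void *dst)
{
	if (idx >= SLOTS || a->obj_size > MAX_OBJ_SIZE)
		return -1;
	if (a->stored[idx])
		memcpy(dst, a->data[idx], a->obj_size);
	else
		memset(dst, 0, a->obj_size);
	return 0;
}
```

      A read-modify-write caller can then load, adjust, and store a record
      without first checking whether the slot was ever initialized.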
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: create a helper to count per-device inode block usage · ebd610fe
      Darrick J. Wong authored
      Create a helper to compute the number of blocks that a file has
      allocated from the data and realtime volumes.  This patch was split
      out to reduce the size of the upcoming quotacheck patch.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: create a xchk_trans_alloc_empty helper for scrub · 564fee6d
      Darrick J. Wong authored
      Create a helper to initialize empty transactions on behalf of a scrub
      operation.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: report the health of quota counts · 3d8f1426
      Darrick J. Wong authored
      Report the health of quota counts.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: repair file modes by scanning for a dirent pointing to us · 5385f1a6
      Darrick J. Wong authored
      Repair might encounter an inode with a totally garbage i_mode.  To fix
      this problem, we have to figure out if the file was a regular file, a
      directory, or a special file.  One way to figure this out is to check if
      there are any directories with entries pointing down to the busted file.
      
      This patch recovers the file mode by scanning every directory entry on
      the filesystem to see if there are any that point to the busted file.
      If the ftypes of all such dirents are consistent, the mode is recovered
      from the ftype.  If no dirents are found, the file becomes a regular
      file.  In all cases, ACLs are canceled and the file is made accessible
      only by root.
      
      A previous patch attempted to guess the mode by reading the beginning of
      the file data.  This was rejected by Christoph on the grounds that we
      cannot trust user-controlled data blocks.  Users do not have direct
      control over the ondisk contents of directory entries, so this method
      should be much safer.
      
      If all the dirents have the same ftype, then we can translate that back
      into an S_IFMT flag and fix the file.  If not, reset the mode to
      S_IFREG.
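
      The decision rule can be sketched in userspace as below.  This is an
      illustration under stated assumptions, not the repair code: the ftype
      enum and the MODE_* constants are hypothetical local names (the MODE_*
      values are the Linux file type bits written out in octal), and
      recover_mode only covers the "all ftypes agree, else regular file"
      logic.

```c
#include <stddef.h>

/* Hypothetical ftype codes for this sketch. */
enum ftype {
	FT_UNKNOWN, FT_REG_FILE, FT_DIR, FT_CHRDEV,
	FT_BLKDEV, FT_FIFO, FT_SOCK, FT_SYMLINK,
};

/* File type bits as used by Linux, spelled out in octal. */
#define MODE_IFIFO	0010000u
#define MODE_IFCHR	0020000u
#define MODE_IFDIR	0040000u
#define MODE_IFBLK	0060000u
#define MODE_IFREG	0100000u
#define MODE_IFLNK	0120000u
#define MODE_IFSOCK	0140000u

/* Translate one ftype into file type bits; 0 means "cannot tell". */
static unsigned int ftype_to_mode(enum ftype ft)
{
	switch (ft) {
	case FT_REG_FILE:	return MODE_IFREG;
	case FT_DIR:		return MODE_IFDIR;
	case FT_CHRDEV:		return MODE_IFCHR;
	case FT_BLKDEV:		return MODE_IFBLK;
	case FT_FIFO:		return MODE_IFIFO;
	case FT_SOCK:		return MODE_IFSOCK;
	case FT_SYMLINK:	return MODE_IFLNK;
	default:		return 0;
	}
}

/* Use the ftype if every dirent agrees; otherwise fall back to a file. */
static unsigned int recover_mode(const enum ftype *ftypes, size_t nr)
{
	unsigned int mode;

	if (nr == 0)
		return MODE_IFREG;
	for (size_t i = 1; i < nr; i++) {
		if (ftypes[i] != ftypes[0])
			return MODE_IFREG;
	}
	mode = ftype_to_mode(ftypes[0]);
	return mode ? mode : MODE_IFREG;
}
```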
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: create a macro for decoding ftypes in tracepoints · 3c79e6a8
      Darrick J. Wong authored
      Create the XFS_DIR3_FTYPE_STR macro so that we can report ftype as
      strings instead of numbers in tracepoints.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: create a predicate to determine if two xfs_names are the same · d9c07758
      Darrick J. Wong authored
      Create a simple predicate to determine if two xfs_names are the same
      objects or have the exact same name.  The comparison is always case
      sensitive.
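
      A predicate of this shape might look like the following userspace
      sketch.  The name_sketch structure and names_same function are
      hypothetical; they only assume that a name is a byte string plus a
      length.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical name record: a byte string and its length. */
struct name_sketch {
	const unsigned char	*name;
	int			len;
};

/* True if both point to the same object or hold byte-identical names. */
static bool names_same(const struct name_sketch *a,
		       const struct name_sketch *b)
{
	if (a == b)
		return true;
	return a->len == b->len &&
	       memcmp(a->name, b->name, (size_t)a->len) == 0;
}
```

      Comparing with memcmp keeps the check case sensitive, so "foo" and
      "Foo" are different names here.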
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: create a static name for the dot entry too · e99bfc9e
      Darrick J. Wong authored
      Create an xfs_name_dot object so that upcoming scrub code can compare
      against that.  Offline repair already has such an object, so we're
      really just hoisting it to the kernel.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: iscan batching should handle unallocated inodes too · 82334a79
      Darrick J. Wong authored
      The inode scanner tries to reduce contention on the AGI header buffer
      lock by grabbing references to consecutive allocated inodes.  Batching
      stops as soon as we encounter an unallocated inode.  This is unfortunate
      because in the worst case performance collapses to the old "one at a
      time" behavior if every other inode is free.
      
      This is correct behavior, but we could do better.  Unallocated inodes by
      definition have nothing to scan, which means the iscan can ignore them
      as long as someone ensures that the scan data will reflect another
      thread allocating the inode and adding interesting metadata to that
      inode.  That mechanism is, of course, the live update hooks.
      
      Therefore, extend the batching mechanism to track unallocated inodes
      adjacent to the scan cursor.  The _want_live_update predicate can tell
      the caller's live update hook to incorporate all live updates to what
      the scanner thinks is an unallocated inode if (after dropping the AGI)
      some other thread allocates one of those inodes and begins using it.
      
      Note that we cannot just copy the ir_free bitmap into the scan cursor
      because the batching stops if iget says the inode is in an intermediate
      state (e.g. on the inactivation list) and cannot be igrabbed.
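
      A simplified model of the extended batching, assuming a 64-bit free
      mask in which a set bit means "free".  The busymask parameter is a
      hypothetical stand-in for inodes that cannot be grabbed; in this
      sketch the batch skips free inodes and stops only at a busy one.

```c
#include <stdint.h>

#define INODES_PER_CHUNK 64

/*
 * Walk the chunk from @start.  Free inodes are skipped, allocated inodes
 * are counted in @nr_grabbed, and the walk stops at the first allocated
 * inode marked busy.  Returns how many inodes the cursor moved past.
 */
static unsigned int batch_skip_free(uint64_t freemask, uint64_t busymask,
				    unsigned int start,
				    unsigned int *nr_grabbed)
{
	unsigned int i;

	*nr_grabbed = 0;
	for (i = start; i < INODES_PER_CHUNK; i++) {
		uint64_t bit = UINT64_C(1) << i;

		if (freemask & bit)
			continue;	/* free: nothing to scan */
		if (busymask & bit)
			break;		/* cannot be grabbed: stop here */
		(*nr_grabbed)++;
	}
	return i - start;
}
```

      With every other inode free, one pass in this model still covers the
      whole chunk instead of ending after a single inode.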
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: cache a bunch of inodes for repair scans · a7a686cb
      Darrick J. Wong authored
      After observing xfs_scrub taking forever to rebuild parent pointers on a
      pptrs enabled filesystem, I decided to profile what the system was
      doing.  It turns out that when there are a lot of threads trying to scan
      the filesystem, most of our time is spent contending on AGI buffer
      locks.  Given that we're walking the inobt records anyway, we can often
      tell ahead of time when there's a bunch of (up to 64) consecutive inodes
      that we could grab all at once.
      
      Do this to amortize the cost of taking the AGI lock across as many
      inodes as we possibly can.  On the author's system this seems to improve
      parallel throughput from barely one and a half cores to slightly
      sublinear scaling.  The obvious antipattern here of course is where the
      freemask has every other bit set (e.g. all 0xA's).
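
      The batch size available from one inobt record can be sketched as the
      length of the run of allocated inodes starting at the cursor.  This is
      a userspace illustration that assumes a 64-bit free mask where a set
      bit means "free"; batch_len is a hypothetical name.

```c
#include <stdint.h>

#define INODES_PER_CHUNK 64

/* Count consecutive allocated inodes starting at index @start. */
static unsigned int batch_len(uint64_t freemask, unsigned int start)
{
	unsigned int n = 0;

	while (start + n < INODES_PER_CHUNK &&
	       !(freemask & (UINT64_C(1) << (start + n))))
		n++;
	return n;
}
```

      A fully allocated chunk yields one batch of 64, while the alternating
      pattern yields batches of a single inode.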
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: stagger the starting AG of scrub iscans to reduce contention · c473a332
      Darrick J. Wong authored
      Online directory and parent repairs on parent-pointer equipped
      filesystems have shown that starting a large number of parallel iscans
      causes a lot of AGI buffer contention.  Try to reduce this by making it
      so that iscans wrap around the end of the filesystem, and using a
      rotor to stagger where each scanner begins.  Surprisingly, this boosts
      CPU utilization (on the author's test machines) from effectively
      single-threaded to 160%.  Not great, but see the next patch.
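
      The rotor idea can be sketched as follows.  The function names are
      hypothetical, and the rotor is a plain integer here; a shared rotor in
      concurrent code would need an atomic increment.

```c
#include <stdint.h>

/* Hand out a starting AG and advance the rotor for the next scanner. */
static uint32_t pick_start_ag(uint32_t *rotor, uint32_t agcount)
{
	return (*rotor)++ % agcount;
}

/* AG visited at step @step of a scan that began at @start_ag, wrapping. */
static uint32_t ag_at_step(uint32_t start_ag, uint32_t step, uint32_t agcount)
{
	return (start_ag + step) % agcount;
}
```

      Four scanners started back to back on a four-AG filesystem would begin
      at AGs 0, 1, 2, and 3, and each would still visit every AG once.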
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: allow scrub to hook metadata updates in other writers · 4e98cc90
      Darrick J. Wong authored
      Certain types of filesystem metadata can only be checked by scanning
      every file in the entire filesystem.  Specific examples of this include
      quota counts, file link counts, and reverse mappings of file extents.
      Directory and parent pointer reconstruction may also fall into this
      category.  File scanning is much trickier than scanning AG metadata
      because we have to take inode locks in the same order as the rest of
      [VX]FS, we can't be holding buffer locks when we do that, and scanning
      the whole filesystem takes time.
      
      Earlier versions of the online repair patchset relied heavily on
      fsfreeze as a means to quiesce the filesystem so that we could take
      locks in the proper order without worrying about concurrent updates from
      other writers.  Reviewers of those patches opined that freezing the
      entire fs to check and repair something was not sufficiently better than
      unmounting to run fsck offline.  I don't agree with that 100%, but the
      message was clear: find a way to repair things that minimizes the
      quiet period where nobody can write to the filesystem.
      
      Generally, building btree indexes online can be split into two phases: a
      collection phase where we compute the records that will be put into the
      new btree; and a construction phase, where we construct the physical
      btree blocks and persist them.  While it's simple to hold resource locks
      for the entirety of the two phases to ensure that the new index is
      consistent with the rest of the system, we don't need to hold resource
      locks during the collection phase if we have a means to receive live
      updates of other work going on elsewhere in the system.
      
      The goal of this patch, then, is to enable online fsck to learn about
      metadata updates going on in other threads while it constructs a shadow
      copy of the metadata records to verify or correct the real metadata.  To
      minimize the overhead when online fsck isn't running, we use srcu
      notifiers because they prioritize fast access to the notifier call chain
      (particularly when the chain is empty) at a cost to configuring
      notifiers.  Online fsck should be relatively infrequent, so this is
      acceptable.
      
      The intended usage model is fairly simple.  Code that modifies a
      metadata structure of interest should declare a xfs_hook_chain structure
      in some well defined place, and call xfs_hook_call whenever an update
      happens.  Online fsck code should define a struct notifier_block and use
      xfs_hook_add to attach the block to the chain, along with a function to
      be called.  This function should synchronize with the fsck scanner to
      update whatever in-memory data the scanner is collecting.  When
      finished, xfs_hook_del removes the notifier from the list and waits for
      all in-flight hook calls to complete.
      
      Originally, I selected srcu notifiers over blocking notifiers to
      implement live hooks because they seemed to have fewer impacts to
      scalability.  The per-call cost of srcu_notifier_call_chain is higher
      (19ns) than blocking_notifier_ (4ns) in the single threaded case, but
      blocking notifiers use an rwsem to stabilize the list.  Cacheline
      bouncing for that rwsem is costly to runtime code when there are a lot
      of CPUs running regular filesystem operations.  If there are no hooks
      installed, this is a total waste of CPU time.
      
      Therefore, I stuck with srcu notifiers, despite trading off single
      threaded performance for multithreaded performance.  I also wasn't
      thrilled with the very high teardown time for srcu notifiers, since the
      caller has to wait for the next rcu grace period.  This can take a long
      time if there are a lot of CPUs.
      
      Then I discovered the jump label implementation of static keys.
      
      Jump labels use kernel code patching to replace a branch with a nop sled
      when the key is disabled.  IOWs, they can eliminate the overhead of
      _call_chain when there are no hooks enabled.  This makes blocking
      notifiers competitive again -- scrub runs faster because teardown of the
      chain is a lot cheaper, and runtime code only pays the rwsem locking
      overhead when scrub is actually running.
      
      With jump labels enabled, calls to empty notifier chains are elided from
      the call sites when there are no hooks registered, which means that the
      overhead is 0.36ns when fsck is not running.  This is perfect for most
      of the architectures that XFS is expected to run on (e.g. x86, powerpc,
      arm64, s390x, riscv).
      
      For architectures that don't support jump labels (e.g. m68k) the runtime
      overhead of checking the static key is an atomic counter read.  This
      isn't great, but it's still cheaper than taking a shared rwsem.
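
      The usage model can be illustrated with a single-threaded userspace
      sketch.  Everything here is hypothetical: the hook and hook_chain
      types and the hook_add/hook_del/hook_call functions are not the
      xfs_hook API, there is no locking, and the nr_hooks test merely stands
      in for the static key that lets callers skip an empty chain.

```c
#include <stddef.h>

typedef int (*hook_fn)(void *priv, unsigned long action, void *data);

struct hook {
	hook_fn		fn;
	void		*priv;
	struct hook	*next;
};

struct hook_chain {
	int		nr_hooks;	/* stand-in for the static key */
	struct hook	*head;
};

/* Attach a hook to the chain. */
static void hook_add(struct hook_chain *chain, struct hook *hook)
{
	hook->next = chain->head;
	chain->head = hook;
	chain->nr_hooks++;
}

/* Detach a hook from the chain if it is present. */
static void hook_del(struct hook_chain *chain, struct hook *hook)
{
	for (struct hook **pp = &chain->head; *pp; pp = &(*pp)->next) {
		if (*pp == hook) {
			*pp = hook->next;
			hook->next = NULL;
			chain->nr_hooks--;
			return;
		}
	}
}

/* Called by the code that updates metadata. */
static int hook_call(struct hook_chain *chain, unsigned long action,
		     void *data)
{
	int ret = 0;

	if (chain->nr_hooks == 0)
		return 0;	/* fast path: nobody is listening */
	for (struct hook *h = chain->head; h; h = h->next) {
		ret = h->fn(h->priv, action, data);
		if (ret)
			break;
	}
	return ret;
}

/* Example callback: count the updates seen while a scan is running. */
static int count_updates(void *priv, unsigned long action, void *data)
{
	(void)action;
	(void)data;
	(*(int *)priv)++;
	return 0;
}
```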
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: implement live inode scan for scrub · 8660c7b7
      Darrick J. Wong authored
      This patch implements a live file scanner for online fsck functions that
      require the ability to walk a filesystem to gather metadata records and
      stay informed about metadata changes to files that have already been
      visited.
      
      The iscan structure consists of two inode number cursors: one to track
      which inode we want to visit next, and a second one to track which
      inodes have already been visited.  This second cursor is key to
      capturing live updates to files previously scanned while the main thread
      continues scanning -- any inode greater than this value hasn't been
      scanned and can go on its way; any other update must be incorporated
      into the collected data.  It is critical for the scanning thread to hold
      exclusive access on the inode until after marking the inode visited.
      
      This new code is a separate patch from the patchsets adding callers for
      the sake of enabling the author to move patches around his tree with
      ease.  The intended usage model for this code is roughly:
      
      	xchk_iscan_start(iscan, 0, 0);
      	while ((error = xchk_iscan_iter(sc, iscan, &ip)) == 1) {
      		xfs_ilock(ip, ...);
      		/* capture inode metadata */
      		xchk_iscan_mark_visited(iscan, ip);
      		xfs_iunlock(ip, ...);
      
      		xfs_irele(ip);
      	}
      	xchk_iscan_stop(iscan);
      	if (error)
      		return error;
      
      Hook functions for live updates can then do:
      
      	if (xchk_iscan_want_live_update(...))
      		/* update the captured inode metadata */
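
      How the second cursor filters live updates can be modelled in
      userspace as below.  This is a deliberately simplified sketch: the
      iscan_sketch structure and its functions are hypothetical, the scan is
      strictly linear from inode 0, and there is no locking.

```c
#include <stdbool.h>
#include <stdint.h>

struct iscan_sketch {
	uint64_t	cursor_ino;	/* next inode the scanner will visit */
	uint64_t	visited_ino;	/* highest inode marked visited */
	bool		any_visited;	/* false until the first mark */
};

/* Return the next inode to scan and advance the first cursor. */
static uint64_t iscan_next(struct iscan_sketch *scan)
{
	return scan->cursor_ino++;
}

/* Record that @ino has been fully captured by the scanner. */
static void iscan_mark_visited(struct iscan_sketch *scan, uint64_t ino)
{
	scan->visited_ino = ino;
	scan->any_visited = true;
}

/* Should a live update to @ino be folded into the collected data? */
static bool iscan_want_live_update(const struct iscan_sketch *scan,
				   uint64_t ino)
{
	return scan->any_visited && ino <= scan->visited_ino;
}
```

      Updates to inodes beyond the visited cursor are ignored because the
      scanner will capture their current state when it reaches them.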
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: speed up xfs_iwalk_adjust_start a little bit · ae05eb11
      Darrick J. Wong authored
      Replace the open-coded loop that recomputes freecount with a single call
      to a bit weight function.
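
      The equivalence between the two approaches can be shown in userspace.
      This sketch uses the GCC/Clang builtin __builtin_popcountll as a
      stand-in for a kernel bit weight helper; both function names are
      hypothetical.

```c
#include <stdint.h>

/* Open-coded version: test each of the 64 bits in turn. */
static unsigned int freecount_loop(uint64_t freemask)
{
	unsigned int count = 0;

	for (unsigned int i = 0; i < 64; i++) {
		if (freemask & (UINT64_C(1) << i))
			count++;
	}
	return count;
}

/* Bit weight version: one population count of the whole mask. */
static unsigned int freecount_weight(uint64_t freemask)
{
	return (unsigned int)__builtin_popcountll(freemask);
}
```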
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
  2. 21 Feb, 2024 18 commits