21 Dec, 2021 (4 commits)
    • xfs: only run COW extent recovery when there are no live extents · 7993f1a4
      Darrick J. Wong authored
      As part of multiple customer escalations due to file data corruption
      after copy on write operations, I wrote some fstests that use fsstress
      to hammer on COW to shake things loose.  Regrettably, I caught some
      filesystem shutdowns due to incorrect rmap operations with the following
      loop:
      
      mount <filesystem>				# (0)
      fsstress <run only readonly ops> &		# (1)
      while true; do
      	fsstress <run all ops>
      	mount -o remount,ro <filesystem>	# (2)
      	fsstress <run only readonly ops>
      	mount -o remount,rw <filesystem>	# (3)
      done
      
      When (2) happens, notice that (1) is still running.  xfs_remount_ro will
      call xfs_blockgc_stop to walk the inode cache to free all the COW
      extents, but the blockgc mechanism races with (1)'s reader threads to
      take IOLOCKs and loses, which means that it doesn't clean them all out.
      Call such a file (A).
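      
      The losing side of that race follows from how the scan takes locks:
      the blockgc walk can only trylock each inode, so any file whose
      IOLOCK is held by a reader thread gets skipped and keeps its incore
      COW fork.  A minimal sketch of that pattern (the surrounding helper
      is assumed, not quoted from the scan code):
      
      	/* assumed shape of the per-inode blockgc scan */
      	if (!xfs_ilock_nowait(ip, XFS_IOLOCK_EXCL))
      		return false;	/* a reader holds the IOLOCK; skip (A) */
      	error = xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF, false);
      	xfs_iunlock(ip, XFS_IOLOCK_EXCL);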
      
      When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
      walks the ondisk refcount btree and frees any COW extent that it finds.
      This function does not check the inode cache, which means that the
      incore COW fork of inode (A) is now inconsistent with the ondisk
      metadata.  If one of those former COW extents is allocated and mapped
      into another file (B) and someone triggers a COW to the stale
      reservation in (A), (A)'s dirty data will be written into (B), and
      once that's done, those blocks will be transferred to (A)'s data fork
      without bumping the refcount.
      
      The results are catastrophic -- file (B) and the refcount btree are now
      corrupt.  In the first patch, we fixed the race condition in (2) so that
      (A) will always flush the COW fork.  In this second patch, we move the
      _recover_cow call to the initial mount call in (0) for safety.
      
      As mentioned previously, xfs_reflink_recover_cow walks the refcount
      btree looking for COW staging extents, and frees them.  This was
      intended to be run at mount time (when we know there are no live inodes)
      to clean up any leftover staging events that may have been left behind
      during an unclean shutdown.  As a time "optimization" for readonly
      mounts, we deferred this to the ro->rw transition, not realizing that
      any failure to clean all COW forks during a rw->ro transition would
      result in catastrophic corruption.
      
      Therefore, remove this optimization and only run the recovery routine
      when we're guaranteed not to have any COW staging extents anywhere,
      which means we always run this at mount time.  While we're at it, move
      the callsite to xfs_log_mount_finish because any refcount btree
      expansion (however unlikely, given that we're removing records from
      the right side of the index) must be fed by a per-AG reservation,
      which doesn't exist at the current callsite.
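      
      A sketch of what the relocated call might look like at the end of
      xfs_log_mount_finish, once log recovery is done and the per-AG
      reservations exist (the error handling here is an assumption, not
      the verbatim patch):
      
      	/*
      	 * Recover leftover CoW staging extents now, while we know there
      	 * are no live inodes and the per-AG reservations can feed any
      	 * refcount btree expansion.
      	 */
      	error = xfs_reflink_recover_cow(mp);
      	if (error) {
      		xfs_alert(mp,
      			"Failed to recover leftover CoW staging extents, err %d.",
      			error);
      		/* assumed: shut down so stale extents cannot be reused */
      		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
      	}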
      
      Fixes: 174edb0e ("xfs: store in-progress CoW allocations in the refcount btree")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: don't expose internal symlink metadata buffers to the vfs · 7b7820b8
      Darrick J. Wong authored
      Ian Kent reported that for inline symlinks, it's possible for
      vfs_readlink to hang on to the target buffer returned by
      _vn_get_link_inline long after it's been freed by xfs inode reclaim.
      This is a layering violation -- we should never expose XFS internals to
      the VFS.
      
      When the symlink has a remote target, we allocate a separate buffer,
      copy the internal information, and let the VFS manage the new buffer's
      lifetime.  Let's adapt the inline code paths to do this too.  It's
      less efficient, but fixes the layering violation and avoids the need to
      adapt the if_data lifetime to rcu rules.  Clearly I don't care about
      readlink benchmarks.
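      
      A sketch of the adapted path, assuming ->get_link simply allocates a
      buffer, fills it through the regular readlink path, and hands the
      buffer's lifetime to the VFS (the exact function body is an
      approximation, not the verbatim patch):
      
      	static const char *
      	xfs_vn_get_link(
      		struct dentry		*dentry,
      		struct inode		*inode,
      		struct delayed_call	*done)
      	{
      		char			*link;
      		int			error = -ENOMEM;
      
      		if (!dentry)
      			return ERR_PTR(-ECHILD);
      
      		link = kmalloc(XFS_SYMLINK_MAXLEN + 1, GFP_KERNEL);
      		if (!link)
      			goto out_err;
      
      		/* copies the target whether it is inline or remote */
      		error = xfs_readlink(XFS_I(d_inode(dentry)), link);
      		if (error)
      			goto out_kfree;
      
      		/* the VFS owns the buffer now and frees it via kfree_link */
      		set_delayed_call(done, kfree_link, link);
      		return link;
      
      	out_kfree:
      		kfree(link);
      	out_err:
      		return ERR_PTR(error);
      	}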
      
      As a side note, this fixes the minor locking violation where we can
      access the inode data fork without taking any locks; proper locking
      (and eliminating the possibility of having to switch inode_operations
      on a live inode) is essential for online repair to coordinate repairs
      correctly.
      Reported-by: Ian Kent <raven@themaw.net>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: fix quotaoff mutex usage now that we don't support disabling it · 59d7fab2
      Darrick J. Wong authored
      Prior to commit 40b52225 ("xfs: remove support for disabling quota
      accounting on a mounted file system"), we used the quotaoff mutex to
      protect dquot operations against quotaoff trying to pull down dquots as
      part of disabling quota.
      
      Now that we only support turning off quota enforcement, the quotaoff
      mutex only protects changes in m_qflags/sb_qflags.  We don't need it to
      protect dquots, which means we can remove it from setqlimits and the
      dquot scrub code.  While we're at it, fix the function that forces
      quotacheck, since it should have been taking the quotaoff mutex.
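      
      A sketch of the quotacheck-forcing fix under that new rule, assuming
      the mutex now only serializes m_qflags/sb_qflags updates (the exact
      hunk is an approximation):
      
      	struct xfs_quotainfo	*q = mp->m_quotainfo;
      
      	/* the quotaoff mutex now only guards quota flag changes */
      	mutex_lock(&q->qi_quotaofflock);
      	mp->m_qflags &= ~XFS_ALL_QUOTA_CHKD;
      	spin_lock(&mp->m_sb_lock);
      	mp->m_sb.sb_qflags = mp->m_qflags & XFS_MOUNT_QUOTA_ALL;
      	spin_unlock(&mp->m_sb_lock);
      	mutex_unlock(&q->qi_quotaofflock);
      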
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: shut down filesystem if we xfs_trans_cancel with deferred work items · 47a6df7c
      Darrick J. Wong authored
      While debugging some very strange rmap corruption reports in connection
      with the online directory repair code, I root-caused the error to the
      following incorrect sequence:
      
      <start repair transaction>
      <expand directory, causing a deferred rmap to be queued>
      <roll transaction>
      <cancel transaction>
      
      Obviously, we should have committed the transaction instead of
      cancelling it.  Thinking more broadly, however, xfs_trans_cancel
      should have warned us that we were throwing away work items that we
      had already committed to performing.  Cancelling a transaction in
      that state is never correct, and we need to shut down the filesystem
      when it happens.
      
      Change xfs_trans_cancel to complain in the loudest manner if we're
      cancelling any transaction with deferred work items attached.
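      
      A sketch of that complaint in xfs_trans_cancel, assuming the deferred
      items hang off the transaction's t_dfops list (the assertion and
      shutdown details are an approximation of the described behavior):
      
      	/*
      	 * Cancelling a transaction with deferred ops attached is never
      	 * valid, because the transaction is effectively dirty.  Complain
      	 * loudly and shut down before freeing the defer items.
      	 */
      	if (!list_empty(&tp->t_dfops)) {
      		ASSERT(xfs_is_shutdown(mp));
      		xfs_defer_cancel(tp);
      		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
      	}
      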
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>