An error occurred fetching the project authors.
  1. 16 Aug, 2021 13 commits
    • Dave Chinner's avatar
      xfs: order CIL checkpoint start records · 68a74dca
      Dave Chinner authored
      Because log recovery depends on strictly ordered start records as
      well as strictly ordered commit records.
      
      This is a zero day bug in the way XFS writes pipelined transactions
      to the journal which is exposed by fixing the zero day bug that
      prevents the CIL from pipelining checkpoints. This re-introduces
      explicit concurrent commits back into the on-disk journal and hence
      out of order start records.
      
      The XFS journal commit code has never ordered start records and we
      have relied on strict commit record ordering for correct recovery
      ordering of concurrently written transactions. Unfortunately, root
      cause analysis uncovered the fact that log recovery uses the LSN of
      the start record for transaction commit processing. Hence, whilst
      the commits are processed in strict order by recovery, the LSNs
      associated with the commits can be out of order and so recovery may
      stamp incorrect LSNs into objects and/or misorder intents in the AIL
      for later processing. This can result in log recovery failures
      and/or on disk corruption, sometimes silent.
      
      Because this is a long standing log recovery issue, we can't just
      fix log recovery and call it good. This still leaves older kernels
      susceptible to recovery failures and corruption when replaying a log
      from a kernel that pipelines checkpoints. There is also the issue
      that in-memory ordering for AIL pushing and data integrity
      operations are based on checkpoint start LSNs, and if the start LSN
      is incorrect in the journal, it is also incorrect in memory.
      
      Hence there's really only one choice for fixing this zero-day bug:
      we need to strictly order checkpoint start records in ascending
      sequence order in the log, the same way we already strictly order
      commit records.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      68a74dca
    • Dave Chinner's avatar
      xfs: attach iclog callbacks in xlog_cil_set_ctx_write_state() · caa80090
      Dave Chinner authored
      Now that we have a mechanism to guarantee that the callbacks
      attached to an iclog are owned by the context that attaches them
      until they drop their reference to the iclog via
      xlog_state_release_iclog(), we can attach callbacks to the iclog at
      any time we have an active reference to the iclog.
      
      xlog_state_get_iclog_space() always guarantees that the commit
      record will fit in the iclog it returns, so we can move this IO
      callback setting to xlog_cil_set_ctx_write_state(), record the
      commit iclog in the context and remove the need for the commit iclog
      to be returned by xlog_write() altogether.
      
      This, in turn, allows us to move the wakeup for ordered commit
      record writes up into xlog_cil_set_ctx_write_state(), too, because
      we have been guaranteed that this commit record will be physically
      located in the iclog before any waiting commit record at a higher
      sequence number will be granted iclog space.
      
      This further cleans up the post commit record write processing in
      the CIL push code, especially as xlog_state_release_iclog() will now
      clean up the context when shutdown errors occur.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      caa80090
    • Dave Chinner's avatar
      xfs: pass a CIL context to xlog_write() · c45aba40
      Dave Chinner authored
      Pass the CIL context to xlog_write() rather than a pointer to a LSN
      variable. Only the CIL checkpoint calls to xlog_write() need to know
      about the start LSN of the writes, so rework xlog_write to directly
      write the LSNs into the CIL context structure.
      
      This removes the commit_lsn variable from xlog_cil_push_work(), so
      now we only have to issue the commit record ordering wakeup from
      there.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      c45aba40
    • Dave Chinner's avatar
      xfs: move xlog_commit_record to xfs_log_cil.c · 2ce82b72
      Dave Chinner authored
      It is only used by the CIL checkpoints, and is the counterpart to
      start record formatting and writing that is already local to
      xfs_log_cil.c.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      2ce82b72
    • Dave Chinner's avatar
      xfs: log head and tail aren't reliable during shutdown · 2562c322
      Dave Chinner authored
      I'm seeing assert failures from xlog_space_left() after a shutdown
      has begun that look like:
      
      XFS (dm-0): log I/O error -5
      XFS (dm-0): xfs_do_force_shutdown(0x2) called from line 1338 of file fs/xfs/xfs_log.c. Return address = xlog_ioend_work+0x64/0xc0
      XFS (dm-0): Log I/O Error Detected.
      XFS (dm-0): Shutting down filesystem. Please unmount the filesystem and rectify the problem(s)
      XFS (dm-0): xlog_space_left: head behind tail
      XFS (dm-0):   tail_cycle = 6, tail_bytes = 2706944
      XFS (dm-0):   GH   cycle = 6, GH   bytes = 1633867
      XFS: Assertion failed: 0, file: fs/xfs/xfs_log.c, line: 1310
      ------------[ cut here ]------------
      Call Trace:
       xlog_space_left+0xc3/0x110
       xlog_grant_push_threshold+0x3f/0xf0
       xlog_grant_push_ail+0x12/0x40
       xfs_log_reserve+0xd2/0x270
       ? __might_sleep+0x4b/0x80
       xfs_trans_reserve+0x18b/0x260
      .....
      
      There are two things here. Firstly, after a shutdown, the log head
      and tail can be out of whack as things abort and release (or don't
      release) resources, so checking them for sanity doesn't make much
      sense. Secondly, xfs_log_reserve() can race with shutdown and so it
      can still fail like this even though it has already checked for a
      log shutdown before calling xlog_grant_push_ail().
      
      So, before ASSERT failing in xlog_space_left(), make sure we haven't
      already shut down....
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      2562c322
    • Dave Chinner's avatar
      xfs: don't run shutdown callbacks on active iclogs · 502a01fa
      Dave Chinner authored
      When the log is shutdown, it currently walks all the iclogs and runs
      callbacks that are attached to the iclogs, regardless of whether the
      iclog is queued for IO completion or not. This creates a problem for
      contexts attaching callbacks to iclogs in that a racing shutdown can
      run the callbacks even before the attaching context has finished
      processing the iclog and releasing it for IO submission.
      
      If the callback processing of the iclog frees the structure that is
      attached to the iclog, then this leads to an UAF scenario that can
      only be protected against by holding the icloglock from the point
      callbacks are attached through to the release of the iclog. While we
      currently do this, it is not practical or sustainable.
      
      Hence we need to make shutdown processing the responsibility of the
      context that holds active references to the iclog. We know that the
      contexts attaching callbacks to the iclog must have active
      references to the iclog, and that means they must be in either
      ACTIVE or WANT_SYNC states. xlog_state_do_callback() will skip over
      iclogs in these states -except- when the log is shut down.
      
      xlog_state_do_callback() checks the state of the iclogs while
      holding the icloglock, therefore the reference count/state change
      that occurs in xlog_state_release_iclog() after the callbacks are
      atomic w.r.t. shutdown processing.
      
      We can't push the responsibility of callback cleanup onto the CIL
      context because we can have ACTIVE iclogs that have callbacks
      attached that have already been released. Hence we really need to
      internalise the cleanup of callbacks into xlog_state_release_iclog()
      processing.
      
      Indeed, we already have that internalisation via:
      
      xlog_state_release_iclog
        drop last reference
          ->SYNCING
        xlog_sync
          xlog_write_iclog
            if (log_is_shutdown)
              xlog_state_done_syncing()
      	  xlog_state_do_callback()
      	    <process shutdown on iclog that is now in SYNCING state>
      
      The problem is that xlog_state_release_iclog() aborts before doing
      anything if the log is already shut down. It assumes that the
      callbacks have already been cleaned up, and it doesn't need to do
      any cleanup.
      
      Hence the fix is to remove the xlog_is_shutdown() check from
      xlog_state_release_iclog() so that reference counts are correctly
      released from the iclogs, and when the reference count is zero we
      always transition to SYNCING if the log is shut down. Hence we'll
      always enter the xlog_sync() path in a shutdown and eventually end
      up erroring out the iclog IO and running xlog_state_do_callback() to
      process the callbacks attached to the iclog.
      
      This allows us to stop processing referenced ACTIVE/WANT_SYNC iclogs
      directly in the shutdown code, and in doing so gets rid of the UAF
      vector that currently exists. This then decouples the adding of
      callbacks to the iclogs from xlog_state_release_iclog() as we
      guarantee that xlog_state_release_iclog() will process the callbacks
      if the log has been shut down before xlog_state_release_iclog() has
      been called.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      502a01fa
    • Dave Chinner's avatar
      xfs: separate out log shutdown callback processing · aad7272a
      Dave Chinner authored
      The iclog callback processing done during a forced log shutdown has
      different logic to normal runtime IO completion callback processing.
      Separate out the shutdown callbacks into their own function and call
      that from the shutdown code instead.
      
      We don't need this shutdown specific logic in the normal runtime
      completion code - we'll always run the shutdown version on shutdown,
      and it will do what shutdown needs regardless of whether there are
      racing IO completion callbacks scheduled or in progress. Hence we
      can also simplify the normal IO completion callpath and only abort
      if shutdown occurred while we actively were processing callbacks.
      
      Further, separating out the IO completion logic from the shutdown
      logic avoids callback race conditions from being triggered by log IO
      completion after a shutdown. IO completion will now only run
      callbacks on iclogs that are in the correct state for a callback to
      be run, avoiding the possibility of running callbacks on a
      referenced iclog that hasn't yet been submitted for IO.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      aad7272a
    • Dave Chinner's avatar
      xfs: rework xlog_state_do_callback() · 8bb92005
      Dave Chinner authored
      Clean it up a bit by factoring and rearranging some of the code.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      8bb92005
    • Dave Chinner's avatar
      xfs: make forced shutdown processing atomic · b36d4651
      Dave Chinner authored
      The running of a forced shutdown is a bit of a mess. It does racy
      checks for XFS_MOUNT_SHUTDOWN in xfs_do_force_shutdown(), then
      does more racy checks in xfs_log_force_unmount() before finally
      setting XFS_MOUNT_SHUTDOWN and XLOG_IO_ERROR under the
      log->icloglock.
      
      Move the checking and setting of XFS_MOUNT_SHUTDOWN into
      xfs_do_force_shutdown() so we only process a shutdown once and once
      only. Serialise this with the mp->m_sb_lock spinlock so that the
      state change is atomic and won't race. Move all the mount specific
      shutdown state changes from xfs_log_force_unmount() to
      xfs_do_force_shutdown() so they are done atomically with setting
      XFS_MOUNT_SHUTDOWN.
      
      Then get rid of the racy xlog_is_shutdown() check from
      xlog_force_shutdown(), and gate the log shutdown on the
      test_and_set_bit(XLOG_IO_ERROR) test under the icloglock. This
      means that the log is shutdown once and once only, and code that
      needs to prevent races with shutdown can do so by holding the
      icloglock and checking the return value of xlog_is_shutdown().
      
      This results in a predictable shutdown execution process - we set the
      shutdown flags once and process the shutdown once rather than the
      current "as many concurrent shutdowns as can race to the flag
      setting" situation we have now.
      
      Also, now that shutdown is atomic, alway emit a stack trace when the
      error level for the filesystem is high enough. This means that we
      always get a stack trace when trying to diagnose the cause of
      shutdowns in the field, rather than just for SHUTDOWN_CORRUPT_INCORE
      cases.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      b36d4651
    • Dave Chinner's avatar
      xfs: convert log flags to an operational state field · e1d06e5f
      Dave Chinner authored
      log->l_flags doesn't actually contain "flags" as such, it contains
      operational state information that can change at runtime. For the
      shutdown state, this at least should be an atomic bit because
      it is read without holding locks in many places and so using atomic
      bitops for the state field modifications makes sense.
      
      This allows us to use things like test_and_set_bit() on state
      changes (e.g. setting XLOG_TAIL_WARN) to avoid races in setting the
      state when we aren't holding locks.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      e1d06e5f
    • Dave Chinner's avatar
      xfs: move recovery needed state updates to xfs_log_mount_finish · fd67d8a0
      Dave Chinner authored
      xfs_log_mount_finish() needs to know if recovery is needed or not to
      make decisions on whether to flush the log and AIL.  Move the
      handling of the NEED_RECOVERY state out to this function rather than
      needing a temporary variable to store this state over the call to
      xlog_recover_finish().
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      fd67d8a0
    • Dave Chinner's avatar
      xfs: XLOG_STATE_IOERROR must die · 5112e206
      Dave Chinner authored
      We don't need an iclog state field to tell us the log has been shut
      down. We can just check the xlog_is_shutdown() instead. The avoids
      the need to have shutdown overwrite the current iclog state while
      being active used by the log code and so having to ensure that every
      iclog state check handles XLOG_STATE_IOERROR appropriately.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      5112e206
    • Dave Chinner's avatar
      xfs: convert XLOG_FORCED_SHUTDOWN() to xlog_is_shutdown() · 2039a272
      Dave Chinner authored
      Make it less shouty and a static inline before adding more calls
      through the log code.
      
      Also convert internal log code that uses XFS_FORCED_SHUTDOWN(mount)
      to use xlog_is_shutdown(log) as well.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      2039a272
  2. 09 Aug, 2021 4 commits
  3. 29 Jul, 2021 8 commits
    • Dave Chinner's avatar
      xfs: limit iclog tail updates · 9d110014
      Dave Chinner authored
      From the department of "generic/482 keeps on giving", we bring you
      another tail update race condition:
      
      iclog:
      	S1			C1
      	+-----------------------+-----------------------+
      				 S2			EOIC
      
      Two checkpoints in a single iclog. One is complete, the other just
      contains the start record and overruns into a new iclog.
      
      Timeline:
      
      Before S1:	Cache flush, log tail = X
      At S1:		Metadata stable, write start record and checkpoint
      At C1:		Write commit record, set NEED_FUA
      		Single iclog checkpoint, so no need for NEED_FLUSH
      		Log tail still = X, so no need for NEED_FLUSH
      
      After C1,
      Before S2:	Cache flush, log tail = X
      At S2:		Metadata stable, write start record and checkpoint
      After S2:	Log tail moves to X+1
      At EOIC:	End of iclog, more journal data to write
      		Releases iclog
      		Not a commit iclog, so no need for NEED_FLUSH
      		Writes log tail X+1 into iclog.
      
      At this point, the iclog has tail X+1 and NEED_FUA set. There has
      been no cache flush for the metadata between X and X+1, and the
      iclog writes the new tail permanently to the log. THis is sufficient
      to violate on disk metadata/journal ordering.
      
      We have two options here. The first is to detect this case in some
      manner and ensure that the partial checkpoint write sets NEED_FLUSH
      when the iclog is already marked NEED_FUA and the log tail changes.
      This seems somewhat fragile and quite complex to get right, and it
      doesn't actually make it obvious what underlying problem it is
      actually addressing from reading the code.
      
      The second option seems much cleaner to me, because it is derived
      directly from the requirements of the C1 commit record in the iclog.
      That is, when we write this commit record to the iclog, we've
      guaranteed that the metadata/data ordering is correct for tail
      update purposes. Hence if we only write the log tail into the iclog
      for the *first* commit record rather than the log tail at the last
      release, we guarantee that the log tail does not move past where the
      the first commit record in the log expects it to be.
      
      IOWs, taking the first option means that replay of C1 becomes
      dependent on future operations doing the right thing, not just the
      C1 checkpoint itself doing the right thing. This makes log recovery
      almost impossible to reason about because now we have to take into
      account what might or might not have happened in the future when
      looking at checkpoints in the log rather than just having to
      reconstruct the past...
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      9d110014
    • Dave Chinner's avatar
      xfs: avoid unnecessary waits in xfs_log_force_lsn() · 8191d822
      Dave Chinner authored
      Before waiting on a iclog in xfs_log_force_lsn(), we don't check to
      see if the iclog has already been completed and the contents on
      stable storage. We check for completed iclogs in xfs_log_force(), so
      we should do the same thing for xfs_log_force_lsn().
      
      This fixed some random up-to-30s pauses seen in unmounting
      filesystems in some tests. A log force ends up waiting on completed
      iclog, and that doesn't then get flushed (and hence the log force
      get completed) until the background log worker issues a log force
      that flushes the iclog in question. Then the unmount unblocks and
      continues.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      8191d822
    • Dave Chinner's avatar
      xfs: log forces imply data device cache flushes · 2bf1ec0f
      Dave Chinner authored
      After fixing the tail_lsn vs cache flush race, generic/482 continued
      to fail in a similar way where cache flushes were missing before
      iclog FUA writes. Tracing of iclog state changes during the fsstress
      workload portion of the test (via xlog_iclog* events) indicated that
      iclog writes were coming from two sources - CIL pushes and log
      forces (due to fsync/O_SYNC operations). All of the cases where a
      recovery problem was triggered indicated that the log force was the
      source of the iclog write that was not preceeded by a cache flush.
      
      This was an oversight in the modifications made in commit
      eef983ff ("xfs: journal IO cache flush reductions"). Log forces
      for fsync imply a data device cache flush has been issued if an
      iclog was flushed to disk and is indicated to the caller via the
      log_flushed parameter so they can elide the device cache flush if
      the journal issued one.
      
      The change in eef983ff results in iclogs only issuing a cache
      flush if XLOG_ICL_NEED_FLUSH is set on the iclog, but this was not
      added to the iclogs that the log force code flushes to disk. Hence
      log forces are no longer guaranteeing that a cache flush is issued,
      hence opening up a potential on-disk ordering failure.
      
      Log forces should also set XLOG_ICL_NEED_FUA as well to ensure that
      the actual iclogs it forces to the journal are also on stable
      storage before it returns to the caller.
      
      This patch introduces the xlog_force_iclog() helper function to
      encapsulate the process of taking a reference to an iclog, switching
      its state if WANT_SYNC and flushing it to stable storage correctly.
      
      Both xfs_log_force() and xfs_log_force_lsn() are converted to use
      it, as is xlog_unmount_write() which has an elaborate method of
      doing exactly the same "write this iclog to stable storage"
      operation.
      
      Further, if the log force code needs to wait on a iclog in the
      WANT_SYNC state, it needs to ensure that iclog also results in a
      cache flush being issued. This covers the case where the iclog
      contains the commit record of the CIL flush that the log force
      triggered, but it hasn't been written yet because there is still an
      active reference to the iclog.
      
      Note: this whole cache flush whack-a-mole patch is a result of log
      forces still being iclog state centric rather than being CIL
      sequence centric. Most of this nasty code will go away in future
      when log forces are converted to wait on CIL sequence push
      completion rather than iclog completion. With the CIL push algorithm
      guaranteeing that the CIL checkpoint is fully on stable storage when
      it completes, we no longer need to iterate iclogs and push them to
      ensure a CIL sequence push has completed and so all this nasty iclog
      iteration and flushing code will go away.
      
      Fixes: eef983ff ("xfs: journal IO cache flush reductions")
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      2bf1ec0f
    • Dave Chinner's avatar
      xfs: factor out forced iclog flushes · 45eddb41
      Dave Chinner authored
      We force iclogs in several places - we need them all to have the
      same cache flush semantics, so start by factoring out the iclog
      force into a common helper.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      45eddb41
    • Dave Chinner's avatar
      xfs: fix ordering violation between cache flushes and tail updates · 0dc8f7f1
      Dave Chinner authored
      There is a race between the new CIL async data device metadata IO
      completion cache flush and the log tail in the iclog the flush
      covers being updated. This can be seen by repeating generic/482 in a
      loop and eventually log recovery fails with a failures such as this:
      
      XFS (dm-3): Starting recovery (logdev: internal)
      XFS (dm-3): bad inode magic/vsn daddr 228352 #0 (magic=0)
      XFS (dm-3): Metadata corruption detected at xfs_inode_buf_verify+0x180/0x190, xfs_inode block 0x37c00 xfs_inode_buf_verify
      XFS (dm-3): Unmount and run xfs_repair
      XFS (dm-3): First 128 bytes of corrupted metadata buffer:
      00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      XFS (dm-3): metadata I/O error in "xlog_recover_items_pass2+0x55/0xc0" at daddr 0x37c00 len 32 error 117
      
      Analysis of the logwrite replay shows that there were no writes to
      the data device between the FUA @ write 124 and the FUA at write @
      125, but log recovery @ 125 failed. The difference was the one log
      write @ 125 moved the tail of the log forwards from (1,8) to (1,32)
      and so the inode create intent in (1,8) was not replayed and so the
      inode cluster was zero on disk when replay of the first inode item
      in (1,32) was attempted.
      
      What this meant was that the journal write that occurred at @ 125
      did not ensure that metadata completed before the iclog was written
      was correctly on stable storage. The tail of the log moved forward,
      so IO must have been completed between the two iclog writes. This
      means that there is a race condition between the unconditional async
      cache flush in the CIL push work and the tail LSN that is written to
      the iclog. This happens like so:
      
      CIL push work				AIL push work
      -------------				-------------
      Add to committing list
      start async data dev cache flush
      .....
      <flush completes>
      <all writes to old tail lsn are stable>
      xlog_write
        ....					push inode create buffer
      					<start IO>
      					.....
      xlog_write(commit record)
        ....					<IO completes>
        					log tail moves
        					  xlog_assign_tail_lsn()
      start_lsn == commit_lsn
        <no iclog preflush!>
      xlog_state_release_iclog
        __xlog_state_release_iclog()
          <writes *new* tail_lsn into iclog>
        xlog_sync()
          ....
          submit_bio()
      <tail in log moves forward without flushing written metadata>
      
      Essentially, this can only occur if the commit iclog is issued
      without a cache flush. If the iclog bio is submitted with
      REQ_PREFLUSH, then it will guarantee that all the completed IO is
      one stable storage before the iclog bio with the new tail LSN in it
      is written to the log.
      
      IOWs, the tail lsn that is written to the iclog needs to be sampled
      *before* we issue the cache flush that guarantees all IO up to that
      LSN has been completed.
      
      To fix this without giving up the performance advantage of the
      flush/FUA optimisations (e.g. g/482 runtime halves with 5.14-rc1
      compared to 5.13), we need to ensure that we always issue a cache
      flush if the tail LSN changes between the initial async flush and
      the commit record being written. THis requires sampling the tail_lsn
      before we start the flush, and then passing the sampled tail LSN to
      xlog_state_release_iclog() so it can determine if the the tail LSN
      has changed while writing the checkpoint. If the tail LSN has
      changed, then it needs to set the NEED_FLUSH flag on the iclog and
      we'll issue another cache flush before writing the iclog.
      
      Fixes: eef983ff ("xfs: journal IO cache flush reductions")
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      0dc8f7f1
    • Dave Chinner's avatar
      xfs: fold __xlog_state_release_iclog into xlog_state_release_iclog · 9d392064
      Dave Chinner authored
      Fold __xlog_state_release_iclog into its only caller to prepare
      make an upcoming fix easier.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      [hch: split from a larger patch]
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      9d392064
    • Dave Chinner's avatar
      xfs: external logs need to flush data device · b5d721ea
      Dave Chinner authored
      The recent journal flush/FUA changes replaced the flushing of the
      data device on every iclog write with an up-front async data device
      cache flush. Unfortunately, the assumption of which this was based
      on has been proven incorrect by the flush vs log tail update
      ordering issue. As the fix for that issue uses the
      XLOG_ICL_NEED_FLUSH flag to indicate that data device needs a cache
      flush, we now need to (once again) ensure that an iclog write to
      external logs that need a cache flush to be issued actually issue a
      cache flush to the data device as well as the log device.
      
      Fixes: eef983ff ("xfs: journal IO cache flush reductions")
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      b5d721ea
    • Dave Chinner's avatar
      xfs: flush data dev on external log write · b1e27239
      Dave Chinner authored
      We incorrectly flush the log device instead of the data device when
      trying to ensure metadata is correctly on disk before writing the
      unmount record.
      
      Fixes: eef983ff ("xfs: journal IO cache flush reductions")
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      b1e27239
  4. 25 Jun, 2021 3 commits
    • Dave Chinner's avatar
      xfs: Fix a CIL UAF by getting get rid of the iclog callback lock · a1bb8505
      Dave Chinner authored
      The iclog callback chain has it's own lock. That was added way back
      in 2008 by myself to alleviate severe lock contention on the
      icloglock in commit 114d23aa ("[XFS] Per iclog callback chain
      lock"). This was long before delayed logging took the icloglock out
      of the hot transaction commit path and removed all contention on it.
      Hence the separate ic_callback_lock doesn't serve any scalability
      purpose anymore, and hasn't for close on a decade.
      
      Further, we only attach callbacks to iclogs in one place where we
      are already taking the icloglock soon after attaching the callbacks.
      We also have to drop the icloglock to run callbacks and grab it
      immediately afterwards again. So given that the icloglock is no
      longer hot, making it cover callbacks again doesn't really change
      the locking patterns very much at all.
      
      We also need to extend the icloglock to cover callback addition to
      fix a zero-day UAF in the CIL push code. This occurs when shutdown
      races with xlog_cil_push_work() and the shutdown runs the callbacks
      before the push releases the iclog. This results in the CIL context
      structure attached to the iclog being freed by the callback before
      the CIL push has finished referencing it, leading to UAF bugs.
      
      Hence, to avoid this UAF, we need the callback attachment to be
      atomic with post processing of the commit iclog and references to
      the structures being attached to the iclog. This requires holding
      the icloglock as that's the only way to serialise iclog state
      against a shutdown in progress.
      
      The result is we need to be using the icloglock to protect the
      callback list addition and removal and serialise them with shutdown.
      That makes the ic_callback_lock redundant and so it can be removed.
      
      Fixes: 71e330b5 ("xfs: Introduce delayed logging core code")
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      a1bb8505
    • Dave Chinner's avatar
      xfs: remove callback dequeue loop from xlog_state_do_iclog_callbacks · b6903358
      Dave Chinner authored
      If we are processing callbacks on an iclog, nothing can be
      concurrently adding callbacks to the loop. We only add callbacks to
      the iclog when they are in ACTIVE or WANT_SYNC state, and we
      explicitly do not add callbacks if the iclog is already in IOERROR
      state.
      
      The only way to have a dequeue racing with an enqueue is to be
      processing a shutdown without a direct reference to an iclog in
      ACTIVE or WANT_SYNC state. As the enqueue avoids this race
      condition, we only ever need a single dequeue operation in
      xlog_state_do_iclog_callbacks(). Hence we can remove the loop.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      b6903358
    • Dave Chinner's avatar
      xfs: don't nest icloglock inside ic_callback_lock · 6be00102
      Dave Chinner authored
      It's completely unnecessary because callbacks are added to iclogs
      without holding the icloglock, hence no amount of ordering between
      the icloglock and ic_callback_lock will order the removal of
      callbacks from the iclog.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      6be00102
  5. 21 Jun, 2021 6 commits
    • Darrick J. Wong's avatar
      xfs: force the log offline when log intent item recovery fails · 4e6b8270
      Darrick J. Wong authored
      If any part of log intent item recovery fails, we should shut down the
      log immediately to stop the log from writing a clean unmount record to
      disk, because the metadata is not consistent.  The inability to cancel a
      dirty transaction catches most of these cases, but there are a few
      things that have slipped through the cracks, such as ENOSPC from a
      transaction allocation, or runtime errors that result in cancellation of
      a non-dirty transaction.
      
      This solves some weird behaviors reported by customers where a system
      goes down, the first mount fails, the second succeeds, but then the fs
      goes down later because of inconsistent metadata.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      4e6b8270
    • Dave Chinner's avatar
      xfs: add iclog state trace events · 956f6daa
      Dave Chinner authored
      For the DEBUGS!
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      956f6daa
    • Dave Chinner's avatar
      xfs: xfs_log_force_lsn isn't passed a LSN · 5f9b4b0d
      Dave Chinner authored
      In doing an investigation into AIL push stalls, I was looking at the
      log force code to see if an async CIL push could be done instead.
      This lead me to xfs_log_force_lsn() and looking at how it works.
      
      xfs_log_force_lsn() is only called from inode synchronisation
      contexts such as fsync(), and it takes the ip->i_itemp->ili_last_lsn
      value as the LSN to sync the log to. This gets passed to
      xlog_cil_force_lsn() via xfs_log_force_lsn() to flush the CIL to the
      journal, and then used by xfs_log_force_lsn() to flush the iclogs to
      the journal.
      
      The problem is that ip->i_itemp->ili_last_lsn does not store a
      log sequence number. What it stores is passed to it from the
      ->iop_committing method, which is called by xfs_log_commit_cil().
      The value this passes to the iop_committing method is the CIL
      context sequence number that the item was committed to.
      
      As it turns out, xlog_cil_force_lsn() converts the sequence to an
      actual commit LSN for the related context and returns that to
      xfs_log_force_lsn(). xfs_log_force_lsn() overwrites it's "lsn"
      variable that contained a sequence with an actual LSN and then uses
      that to sync the iclogs.
      
      This caused me some confusion for a while, even though I originally
      wrote all this code a decade ago. ->iop_committing is only used by
      a couple of log item types, and only inode items use the sequence
      number it is passed.
      
      Let's clean up the API, CIL structures and inode log item to call it
      a sequence number, and make it clear that the high level code is
      using CIL sequence numbers and not on-disk LSNs for integrity
      synchronisation purposes.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      5f9b4b0d
    • Dave Chinner's avatar
      xfs: journal IO cache flush reductions · eef983ff
      Dave Chinner authored
      Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to
      guarantee the ordering requirements the journal has w.r.t. metadata
      writeback. THe two ordering constraints are:
      
      1. we cannot overwrite metadata in the journal until we guarantee
      that the dirty metadata has been written back in place and is
      stable.
      
      2. we cannot write back dirty metadata until it has been written to
      the journal and guaranteed to be stable (and hence recoverable) in
      the journal.
      
      The ordering guarantees of #1 are provided by REQ_PREFLUSH. This
      causes the journal IO to issue a cache flush and wait for it to
      complete before issuing the write IO to the journal. Hence all
      completed metadata IO is guaranteed to be stable before the journal
      overwrites the old metadata.
      
      The ordering guarantees of #2 are provided by the REQ_FUA, which
      ensures the journal writes do not complete until they are on stable
      storage. Hence by the time the last journal IO in a checkpoint
      completes, we know that the entire checkpoint is on stable storage
      and we can unpin the dirty metadata and allow it to be written back.
      
      This is the mechanism by which ordering was first implemented in XFS
      way back in 2002 by commit 95d97c36e5155075ba2eb22b17562cfcc53fcf96
      ("Add support for drive write cache flushing") in the xfs-archive
      tree.
      
      A lot has changed since then, most notably we now use delayed
      logging to checkpoint the filesystem to the journal rather than
      write each individual transaction to the journal. Cache flushes on
      journal IO are necessary when individual transactions are wholly
      contained within a single iclog. However, CIL checkpoints are single
      transactions that typically span hundreds to thousands of individual
      journal writes, and so the requirements for device cache flushing
      have changed.
      
      That is, the ordering rules I state above apply to ordering of
      atomic transactions recorded in the journal, not to the journal IO
      itself. Hence we need to ensure metadata is stable before we start
      writing a new transaction to the journal (guarantee #1), and we need
      to ensure the entire transaction is stable in the journal before we
      start metadata writeback (guarantee #2).
      
      Hence we only need a REQ_PREFLUSH on the journal IO that starts a
      new journal transaction to provide #1, and it is not on any other
      journal IO done within the context of that journal transaction.
      
      The CIL checkpoint already issues a cache flush before it starts
      writing to the log, so we no longer need the iclog IO to issue a
      REQ_REFLUSH for us. Hence if XLOG_START_TRANS is passed
      to xlog_write(), we no longer need to mark the first iclog in
      the log write with REQ_PREFLUSH for this case. As an added bonus,
      this ordering mechanism works for both internal and external logs,
      meaning we can remove the explicit data device cache flushes from
      the iclog write code when using external logs.
      
      Given the new ordering semantics of commit records for the CIL, we
      need iclogs containing commit records to issue a REQ_PREFLUSH. We
      also require unmount records to do this. Hence for both
      XLOG_COMMIT_TRANS and XLOG_UNMOUNT_TRANS xlog_write() calls we need
      to mark the first iclog being written with REQ_PREFLUSH.
      
      For both commit records and unmount records, we also want them
      immediately on stable storage, so we want to also mark the iclogs
      that contain these records to be marked REQ_FUA. That means if a
      record is split across multiple iclogs, they are all marked REQ_FUA
      and not just the last one so that when the transaction is completed
      all the parts of the record are on stable storage.
      
      And for external logs, unmount records need a pre-write data device
      cache flush similar to the CIL checkpoint cache pre-flush as the
      internal iclog write code does not do this implicitly anymore.
      
      As an optimisation, when the commit record lands in the same iclog
      as the journal transaction starts, we don't need to wait for
      anything and can simply use REQ_FUA to provide guarantee #2.  This
      means that for fsync() heavy workloads, the cache flush behaviour is
      completely unchanged and there is no degradation in performance as a
      result of optimise the multi-IO transaction case.
      
      The most notable sign that there is less IO latency on my test
      machine (nvme SSDs) is that the "noiclogs" rate has dropped
      substantially. This metric indicates that the CIL push is blocking
      in xlog_get_iclog_space() waiting for iclog IO completion to occur.
      With 8 iclogs of 256kB, the rate is appoximately 1 noiclog event to
      every 4 iclog writes. IOWs, every 4th call to xlog_get_iclog_space()
      is blocking waiting for log IO. With the changes in this patch, this
      drops to 1 noiclog event for every 100 iclog writes. Hence it is
      clear that log IO is completing much faster than it was previously,
      but it is also clear that for large iclog sizes, this isn't the
      performance limiting factor on this hardware.
      
      With smaller iclogs (32kB), however, there is a substantial
      difference. With the cache flush modifications, the journal is now
      running at over 4000 write IOPS, and the journal throughput is
      largely identical to the 256kB iclogs and the noiclog event rate
      stays low at about 1:50 iclog writes. The existing code tops out at
      about 2500 IOPS as the number of cache flushes dominate performance
      and latency. The noiclog event rate is about 1:4, and the
      performance variance is quite large as the journal throughput can
      fall to less than half the peak sustained rate when the cache flush
      rate prevents metadata writeback from keeping up and the log runs
      out of space and throttles reservations.
      
      As a result:
      
      	logbsize	fsmark create rate	rm -rf
      before	32kb		152851+/-5.3e+04	5m28s
      patched	32kb		221533+/-1.1e+04	5m24s
      
      before	256kb		220239+/-6.2e+03	4m58s
      patched	256kb		228286+/-9.2e+03	5m06s
      
      The rm -rf times are included because I ran them, but the
      differences are largely noise. This workload is largely metadata
      read IO latency bound and the changes to the journal cache flushing
      doesn't really make any noticable difference to behaviour apart from
      a reduction in noiclog events from background CIL pushing.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      eef983ff
    • Dave Chinner's avatar
      xfs: remove need_start_rec parameter from xlog_write() · 3468bb1c
      Dave Chinner authored
      The CIL push is the only call to xlog_write that sets this variable
      to true. The other callers don't need a start rec, and they tell
      xlog_write what to do by passing the type of ophdr they need written
      in the flags field. The need_start_rec parameter essentially tells
      xlog_write to to write an extra ophdr with a XLOG_START_TRANS type,
      so get rid of the variable to do this and pass XLOG_START_TRANS as
      the flag value into xlog_write() from the CIL push.
      
      $ size fs/xfs/xfs_log.o*
        text	   data	    bss	    dec	    hex	filename
       27595	    560	      8	  28163	   6e03	fs/xfs/xfs_log.o.orig
       27454	    560	      8	  28022	   6d76	fs/xfs/xfs_log.o.patched
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      3468bb1c
    • Dave Chinner's avatar
      xfs: remove xfs_blkdev_issue_flush · b5071ada
      Dave Chinner authored
      It's a one line wrapper around blkdev_issue_flush(). Just replace it
      with direct calls to blkdev_issue_flush().
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      b5071ada
  6. 18 Jun, 2021 3 commits
  7. 26 May, 2021 1 commit
    • Gustavo A. R. Silva's avatar
      xfs: Fix fall-through warnings for Clang · 53004ee7
      Gustavo A. R. Silva authored
      In preparation to enable -Wimplicit-fallthrough for Clang, fix
      the following warnings by replacing /* fall through */ comments,
      and its variants, with the new pseudo-keyword macro fallthrough:
      
      fs/xfs/libxfs/xfs_alloc.c:3167:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/libxfs/xfs_da_btree.c:286:3: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/libxfs/xfs_ag_resv.c:346:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/libxfs/xfs_ag_resv.c:388:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_bmap_util.c:246:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_export.c:88:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_export.c:96:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_file.c:867:3: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_ioctl.c:562:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_ioctl.c:1548:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_iomap.c:1040:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_inode.c:852:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_log.c:2627:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_trans_buf.c:298:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/scrub/bmap.c:275:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/scrub/btree.c:48:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/scrub/common.c:85:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/scrub/common.c:138:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/scrub/common.c:698:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/scrub/dabtree.c:51:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/scrub/repair.c:951:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/scrub/agheader.c:89:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      
      Notice that Clang doesn't recognize /* fall through */ comments as
      implicit fall-through markings, so in order to globally enable
      -Wimplicit-fallthrough for Clang, these comments need to be
      replaced with fallthrough; in the whole codebase.
      
      Link: https://github.com/KSPP/linux/issues/115Signed-off-by: default avatarGustavo A. R. Silva <gustavoars@kernel.org>
      53004ee7
  8. 04 May, 2021 1 commit
    • Darrick J. Wong's avatar
      xfs: don't allow log writes if the data device is readonly · 8e9800f9
      Darrick J. Wong authored
      While running generic/050 with an external log, I observed this warning
      in dmesg:
      
      Trying to write to read-only block-device sda4 (partno 4)
      WARNING: CPU: 2 PID: 215677 at block/blk-core.c:704 submit_bio_checks+0x256/0x510
      Call Trace:
       submit_bio_noacct+0x2c/0x430
       _xfs_buf_ioapply+0x283/0x3c0 [xfs]
       __xfs_buf_submit+0x6a/0x210 [xfs]
       xfs_buf_delwri_submit_buffers+0xf8/0x270 [xfs]
       xfsaild+0x2db/0xc50 [xfs]
       kthread+0x14b/0x170
      
      I think this happened because we tried to cover the log after a readonly
      mount, and the AIL tried to write the primary superblock to the data
      device.  The test marks the data device readonly, but it doesn't do the
      same to the external log device.  Therefore, XFS thinks that the log is
      writable, even though AIL writes whine to dmesg because the data device
      is read only.
      
      Fix this by amending xfs_log_writable to prevent writes when the AIL
      can't possible write anything into the filesystem.
      
      Note: As for the external log or the rt devices being readonly--
      xfs_blkdev_get will complain about that if we aren't doing a norecovery
      mount.
      Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      8e9800f9
  9. 11 Feb, 2021 1 commit