1. 13 Oct, 2020 1 commit
    • Darrick J. Wong's avatar
      xfs: fix realtime bitmap/summary file truncation when growing rt volume · f4c32e87
      Darrick J. Wong authored
      The realtime bitmap and summary files are regular files that are hidden
      away from the directory tree.  Since they're regular files, inode
      inactivation will try to purge what it thinks are speculative
      preallocations beyond the incore size of the file.  Unfortunately,
      xfs_growfs_rt forgets to update the incore size when it resizes the
      inodes, with the result that inactivating the rt inodes at unmount time
      will cause their contents to be truncated.
      
      Fix this by updating the incore size when we change the ondisk size as
      part of updating the superblock.  Note that we don't do this when we're
      allocating blocks to the rt inodes because we actually want those blocks
      to get purged if the growfs fails.
      
      This fixes corruption complaints from the online rtsummary checker when
      running xfs/233.  Since that test requires rmap, one can also trigger
      this by growing an rt volume, cycling the mount, and creating rt files.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChandan Babu R <chandanrlinux@gmail.com>
      f4c32e87
  2. 07 Oct, 2020 16 commits
    • Kaixu Xia's avatar
      xfs: fix the indent in xfs_trans_mod_dquot · e5b23740
      Kaixu Xia authored
      The formatting is strange in xfs_trans_mod_dquot, so do a reindent.
      Signed-off-by: default avatarKaixu Xia <kaixuxia@tencent.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      e5b23740
    • Kaixu Xia's avatar
      xfs: do the ASSERT for the arguments O_{u,g,p}dqpp · 97611f93
      Kaixu Xia authored
      If we pass in XFS_QMOPT_{U,G,P}QUOTA flags and different uid/gid/prid
      than them currently associated with the inode, the arguments
      O_{u,g,p}dqpp shouldn't be NULL, so add the ASSERT for them.
      Signed-off-by: default avatarKaixu Xia <kaixuxia@tencent.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      97611f93
    • Darrick J. Wong's avatar
      xfs: fix deadlock and streamline xfs_getfsmap performance · 8ffa90e1
      Darrick J. Wong authored
      Refactor xfs_getfsmap to improve its performance: instead of indirectly
      calling a function that copies one record to userspace at a time, create
      a shadow buffer in the kernel and copy the whole array once at the end.
      On the author's computer, this reduces the runtime on his /home by ~20%.
      
      This also eliminates a deadlock when running GETFSMAP against the
      realtime device.  The current code locks the rtbitmap to create
      fsmappings and copies them into userspace, having not released the
      rtbitmap lock.  If the userspace buffer is an mmap of a sparse file that
      itself resides on the realtime device, the write page fault will recurse
      into the fs for allocation, which will deadlock on the rtbitmap lock.
      
      Fixes: 4c934c7d ("xfs: report realtime space information via the rtbitmap")
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarChandan Babu R <chandanrlinux@gmail.com>
      8ffa90e1
    • Darrick J. Wong's avatar
      xfs: limit entries returned when counting fsmap records · acd1ac3a
      Darrick J. Wong authored
      If userspace asked fsmap to count the number of entries, we cannot
      return more than UINT_MAX entries because fmh_entries is u32.
      Therefore, stop counting if we hit this limit or else we will waste time
      to return truncated results.
      
      Fixes: e89c0413 ("xfs: implement the GETFSMAP ioctl")
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarChandan Babu R <chandanrlinux@gmail.com>
      acd1ac3a
    • Darrick J. Wong's avatar
      xfs: only relog deferred intent items if free space in the log gets low · 74f4d6a1
      Darrick J. Wong authored
      Now that we have the ability to ask the log how far the tail needs to be
      pushed to maintain its free space targets, augment the decision to relog
      an intent item so that we only do it if the log has hit the 75% full
      threshold.  There's no point in relogging an intent into the same
      checkpoint, and there's no need to relog if there's plenty of free space
      in the log.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      74f4d6a1
    • Darrick J. Wong's avatar
      xfs: expose the log push threshold · ed1575da
      Darrick J. Wong authored
      Separate the computation of the log push threshold and the push logic in
      xlog_grant_push_ail.  This enables higher level code to determine (for
      example) that it is holding on to a logged intent item and the log is so
      busy that it is more than 75% full.  In that case, it would be desirable
      to move the log item towards the head to release the tail, which we will
      cover in the next patch.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      ed1575da
    • Darrick J. Wong's avatar
      xfs: periodically relog deferred intent items · 4e919af7
      Darrick J. Wong authored
      There's a subtle design flaw in the deferred log item code that can lead
      to pinning the log tail.  Taking up the defer ops chain examples from
      the previous commit, we can get trapped in sequences like this:
      
      Caller hands us a transaction t0 with D0-D3 attached.  The defer ops
      chain will look like the following if the transaction rolls succeed:
      
      t1: D0(t0), D1(t0), D2(t0), D3(t0)
      t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0)
      t3: d5(t1), D1(t0), D2(t0), D3(t0)
      ...
      t9: d9(t7), D3(t0)
      t10: D3(t0)
      t11: d10(t10), d11(t10)
      t12: d11(t10)
      
      In transaction 9, we finish d9 and try to roll to t10 while holding onto
      an intent item for D3 that we logged in t0.
      
      The previous commit changed the order in which we place new defer ops in
      the defer ops processing chain to reduce the maximum chain length.  Now
      make xfs_defer_finish_noroll capable of relogging the entire chain
      periodically so that we can always move the log tail forward.  Most
      chains will never get relogged, except for operations that generate very
      long chains (large extents containing many blocks with different sharing
      levels) or are on filesystems with small logs and a lot of ongoing
      metadata updates.
      
      Callers are now required to ensure that the transaction reservation is
      large enough to handle logging done items and new intent items for the
      maximum possible chain length.  Most callers are careful to keep the
      chain lengths low, so the overhead should be minimal.
      
      The decision to relog an intent item is made based on whether the intent
      was logged in a previous checkpoint, since there's no point in relogging
      an intent into the same checkpoint.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      4e919af7
    • Darrick J. Wong's avatar
      xfs: change the order in which child and parent defer ops are finished · 27dada07
      Darrick J. Wong authored
      The defer ops code has been finishing items in the wrong order -- if a
      top level defer op creates items A and B, and finishing item A creates
      more defer ops A1 and A2, we'll put the new items on the end of the
      chain and process them in the order A B A1 A2.  This is kind of weird,
      since it's convenient for programmers to be able to think of A and B as
      an ordered sequence where all the sub-tasks for A must finish before we
      move on to B, e.g. A A1 A2 D.
      
      Right now, our log intent items are not so complex that this matters,
      but this will become important for the atomic extent swapping patchset.
      In order to maintain correct reference counting of extents, we have to
      unmap and remap extents in that order, and we want to complete that work
      before moving on to the next range that the user wants to swap.  This
      patch fixes defer ops to satsify that requirement.
      
      The primary symptom of the incorrect order was noticed in an early
      performance analysis of the atomic extent swap code.  An astonishingly
      large number of deferred work items accumulated when userspace requested
      an atomic update of two very fragmented files.  The cause of this was
      traced to the same ordering bug in the inner loop of
      xfs_defer_finish_noroll.
      
      If the ->finish_item method of a deferred operation queues new deferred
      operations, those new deferred ops are appended to the tail of the
      pending work list.  To illustrate, say that a caller creates a
      transaction t0 with four deferred operations D0-D3.  The first thing
      defer ops does is roll the transaction to t1, leaving us with:
      
      t1: D0(t0), D1(t0), D2(t0), D3(t0)
      
      Let's say that finishing each of D0-D3 will create two new deferred ops.
      After finish D0 and roll, we'll have the following chain:
      
      t2: D1(t0), D2(t0), D3(t0), d4(t1), d5(t1)
      
      d4 and d5 were logged to t1.  Notice that while we're about to start
      work on D1, we haven't actually completed all the work implied by D0
      being finished.  So far we've been careful (or lucky) to structure the
      dfops callers such that D1 doesn't depend on d4 or d5 being finished,
      but this is a potential logic bomb.
      
      There's a second problem lurking.  Let's see what happens as we finish
      D1-D3:
      
      t3: D2(t0), D3(t0), d4(t1), d5(t1), d6(t2), d7(t2)
      t4: D3(t0), d4(t1), d5(t1), d6(t2), d7(t2), d8(t3), d9(t3)
      t5: d4(t1), d5(t1), d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)
      
      Let's say that d4-d11 are simple work items that don't queue any other
      operations, which means that we can complete each d4 and roll to t6:
      
      t6: d5(t1), d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)
      t7: d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)
      ...
      t11: d10(t4), d11(t4)
      t12: d11(t4)
      <done>
      
      When we try to roll to transaction #12, we're holding defer op d11,
      which we logged way back in t4.  This means that the tail of the log is
      pinned at t4.  If the log is very small or there are a lot of other
      threads updating metadata, this means that we might have wrapped the log
      and cannot get roll to t11 because there isn't enough space left before
      we'd run into t4.
      
      Let's shift back to the original failure.  I mentioned before that I
      discovered this flaw while developing the atomic file update code.  In
      that scenario, we have a defer op (D0) that finds a range of file blocks
      to remap, creates a handful of new defer ops to do that, and then asks
      to be continued with however much work remains.
      
      So, D0 is the original swapext deferred op.  The first thing defer ops
      does is rolls to t1:
      
      t1: D0(t0)
      
      We try to finish D0, logging d1 and d2 in the process, but can't get all
      the work done.  We log a done item and a new intent item for the work
      that D0 still has to do, and roll to t2:
      
      t2: D0'(t1), d1(t1), d2(t1)
      
      We roll and try to finish D0', but still can't get all the work done, so
      we log a done item and a new intent item for it, requeue D0 a second
      time, and roll to t3:
      
      t3: D0''(t2), d1(t1), d2(t1), d3(t2), d4(t2)
      
      If it takes 48 more rolls to complete D0, then we'll finally dispense
      with D0 in t50:
      
      t50: D<fifty primes>(t49), d1(t1), ..., d102(t50)
      
      We then try to roll again to get a chain like this:
      
      t51: d1(t1), d2(t1), ..., d101(t50), d102(t50)
      ...
      t152: d102(t50)
      <done>
      
      Notice that in rolling to transaction #51, we're holding on to a log
      intent item for d1 that was logged in transaction #1.  This means that
      the tail of the log is pinned at t1.  If the log is very small or there
      are a lot of other threads updating metadata, this means that we might
      have wrapped the log and cannot roll to t51 because there isn't enough
      space left before we'd run into t1.  This is of course problem #2 again.
      
      But notice the third problem with this scenario: we have 102 defer ops
      tied to this transaction!  Each of these items are backed by pinned
      kernel memory, which means that we risk OOM if the chains get too long.
      
      Yikes.  Problem #1 is a subtle logic bomb that could hit someone in the
      future; problem #2 applies (rarely) to the current upstream, and problem
      #3 applies to work under development.
      
      This is not how incremental deferred operations were supposed to work.
      The dfops design of logging in the same transaction an intent-done item
      and a new intent item for the work remaining was to make it so that we
      only have to juggle enough deferred work items to finish that one small
      piece of work.  Deferred log item recovery will find that first
      unfinished work item and restart it, no matter how many other intent
      items might follow it in the log.  Therefore, it's ok to put the new
      intents at the start of the dfops chain.
      
      For the first example, the chains look like this:
      
      t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0)
      t3: d5(t1), D1(t0), D2(t0), D3(t0)
      ...
      t9: d9(t7), D3(t0)
      t10: D3(t0)
      t11: d10(t10), d11(t10)
      t12: d11(t10)
      
      For the second example, the chains look like this:
      
      t1: D0(t0)
      t2: d1(t1), d2(t1), D0'(t1)
      t3: d2(t1), D0'(t1)
      t4: D0'(t1)
      t5: d1(t4), d2(t4), D0''(t4)
      ...
      t148: D0<50 primes>(t147)
      t149: d101(t148), d102(t148)
      t150: d102(t148)
      <done>
      
      This actually sucks more for pinning the log tail (we try to roll to t10
      while holding an intent item that was logged in t1) but we've solved
      problem #1.  We've also reduced the maximum chain length from:
      
          sum(all the new items) + nr_original_items
      
      to:
      
          max(new items that each original item creates) + nr_original_items
      
      This solves problem #3 by sharply reducing the number of defer ops that
      can be attached to a transaction at any given time.  The change makes
      the problem of log tail pinning worse, but is improvement we need to
      solve problem #2.  Actually solving #2, however, is left to the next
      patch.
      
      Note that a subsequent analysis of some hard-to-trigger reflink and COW
      livelocks on extremely fragmented filesystems (or systems running a lot
      of IO threads) showed the same symptoms -- uncomfortably large numbers
      of incore deferred work items and occasional stalls in the transaction
      grant code while waiting for log reservations.  I think this patch and
      the next one will also solve these problems.
      
      As originally written, the code used list_splice_tail_init instead of
      list_splice_init, so change that, and leave a short comment explaining
      our actions.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      27dada07
    • Darrick J. Wong's avatar
      xfs: fix an incore inode UAF in xfs_bui_recover · ff4ab5e0
      Darrick J. Wong authored
      In xfs_bui_item_recover, there exists a use-after-free bug with regards
      to the inode that is involved in the bmap replay operation.  If the
      mapping operation does not complete, we call xfs_bmap_unmap_extent to
      create a deferred op to finish the unmapping work, and we retain a
      pointer to the incore inode.
      
      Unfortunately, the very next thing we do is commit the transaction and
      drop the inode.  If reclaim tears down the inode before we try to finish
      the defer ops, we dereference garbage and blow up.  Therefore, create a
      way to join inodes to the defer ops freezer so that we can maintain the
      xfs_inode reference until we're done with the inode.
      
      Note: This imposes the requirement that there be enough memory to keep
      every incore inode in memory throughout recovery.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      ff4ab5e0
    • Darrick J. Wong's avatar
      xfs: clean up xfs_bui_item_recover iget/trans_alloc/ilock ordering · 64a3f331
      Darrick J. Wong authored
      In most places in XFS, we have a specific order in which we gather
      resources: grab the inode, allocate a transaction, then lock the inode.
      xfs_bui_item_recover doesn't do it in that order, so fix it to be more
      consistent.  This also makes the error bailout code a bit less weird.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      64a3f331
    • Darrick J. Wong's avatar
      xfs: clean up bmap intent item recovery checking · 919522e8
      Darrick J. Wong authored
      The bmap intent item checking code in xfs_bui_item_recover is spread all
      over the function.  We should check the recovered log item at the top
      before we allocate any resources or do anything else, so do that.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      919522e8
    • Darrick J. Wong's avatar
      xfs: xfs_defer_capture should absorb remaining transaction reservation · 929b92f6
      Darrick J. Wong authored
      When xfs_defer_capture extracts the deferred ops and transaction state
      from a transaction, it should record the transaction reservation type
      from the old transaction so that when we continue the dfops chain, we
      still use the same reservation parameters.
      
      Doing this means that the log item recovery functions get to determine
      the transaction reservation instead of abusing tr_itruncate in yet
      another part of xfs.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      929b92f6
    • Darrick J. Wong's avatar
      xfs: xfs_defer_capture should absorb remaining block reservations · 4f9a60c4
      Darrick J. Wong authored
      When xfs_defer_capture extracts the deferred ops and transaction state
      from a transaction, it should record the remaining block reservations so
      that when we continue the dfops chain, we can reserve the same number of
      blocks to use.  We capture the reservations for both data and realtime
      volumes.
      
      This adds the requirement that every log intent item recovery function
      must be careful to reserve enough blocks to handle both itself and all
      defer ops that it can queue.  On the other hand, this enables us to do
      away with the handwaving block estimation nonsense that was going on in
      xlog_finish_defer_ops.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      4f9a60c4
    • Darrick J. Wong's avatar
      xfs: proper replay of deferred ops queued during log recovery · e6fff81e
      Darrick J. Wong authored
      When we replay unfinished intent items that have been recovered from the
      log, it's possible that the replay will cause the creation of more
      deferred work items.  As outlined in commit 50995582 ("xfs: log
      recovery should replay deferred ops in order"), later work items have an
      implicit ordering dependency on earlier work items.  Therefore, recovery
      must replay the items (both recovered and created) in the same order
      that they would have been during normal operation.
      
      For log recovery, we enforce this ordering by using an empty transaction
      to collect deferred ops that get created in the process of recovering a
      log intent item to prevent them from being committed before the rest of
      the recovered intent items.  After we finish committing all the
      recovered log items, we allocate a transaction with an enormous block
      reservation, splice our huge list of created deferred ops into that
      transaction, and commit it, thereby finishing all those ops.
      
      This is /really/ hokey -- it's the one place in XFS where we allow
      nested transactions; the splicing of the defer ops list is is inelegant
      and has to be done twice per recovery function; and the broken way we
      handle inode pointers and block reservations cause subtle use-after-free
      and allocator problems that will be fixed by this patch and the two
      patches after it.
      
      Therefore, replace the hokey empty transaction with a structure designed
      to capture each chain of deferred ops that are created as part of
      recovering a single unfinished log intent.  Finally, refactor the loop
      that replays those chains to do so using one transaction per chain.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      e6fff81e
    • Darrick J. Wong's avatar
      xfs: remove XFS_LI_RECOVERED · 901219bb
      Darrick J. Wong authored
      The ->iop_recover method of a log intent item removes the recovered
      intent item from the AIL by logging an intent done item and committing
      the transaction, so it's superfluous to have this flag check.  Nothing
      else uses it, so get rid of the flag entirely.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      901219bb
    • Darrick J. Wong's avatar
      xfs: remove xfs_defer_reset · b80b29d6
      Darrick J. Wong authored
      Remove this one-line helper since the assert is trivially true in one
      call site and the rest obscures a bitmask operation.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      b80b29d6
  3. 30 Sep, 2020 1 commit
  4. 25 Sep, 2020 13 commits
  5. 23 Sep, 2020 8 commits
    • Gao Xiang's avatar
      xfs: clean up calculation of LR header blocks · 0c771b99
      Gao Xiang authored
      Let's use DIV_ROUND_UP() to calculate log record header
      blocks as what did in xlog_get_iclog_buffer_size() and
      wrap up a common helper for log recovery.
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarGao Xiang <hsiangkao@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      0c771b99
    • Gao Xiang's avatar
      xfs: avoid LR buffer overrun due to crafted h_len · f692d09e
      Gao Xiang authored
      Currently, crafted h_len has been blocked for the log
      header of the tail block in commit a70f9fe5 ("xfs:
      detect and handle invalid iclog size set by mkfs").
      
      However, each log record could still have crafted h_len
      and cause log record buffer overrun. So let's check
      h_len vs buffer size for each log record as well.
      Signed-off-by: default avatarGao Xiang <hsiangkao@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      f692d09e
    • Darrick J. Wong's avatar
      xfs: don't release log intent items when recovery fails · 384ff09b
      Darrick J. Wong authored
      Nowadays, log recovery will call ->release on the recovered intent items
      if recovery fails.  Therefore, it's redundant to release them from
      inside the ->recover functions when they're about to return an error.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      384ff09b
    • Darrick J. Wong's avatar
      xfs: attach inode to dquot in xfs_bui_item_recover · 2dbf872c
      Darrick J. Wong authored
      In the bmap intent item recovery code, we must be careful to attach the
      inode to its dquots (if quotas are enabled) so that a change in the
      shape of the bmap btree doesn't cause the quota counters to be
      incorrect.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      2dbf872c
    • Darrick J. Wong's avatar
      xfs: log new intent items created as part of finishing recovered intent items · 93293bcb
      Darrick J. Wong authored
      During a code inspection, I found a serious bug in the log intent item
      recovery code when an intent item cannot complete all the work and
      decides to requeue itself to get that done.  When this happens, the
      item recovery creates a new incore deferred op representing the
      remaining work and attaches it to the transaction that it allocated.  At
      the end of _item_recover, it moves the entire chain of deferred ops to
      the dummy parent_tp that xlog_recover_process_intents passed to it, but
      fail to log a new intent item for the remaining work before committing
      the transaction for the single unit of work.
      
      xlog_finish_defer_ops logs those new intent items once recovery has
      finished dealing with the intent items that it recovered, but this isn't
      sufficient.  If the log is forced to disk after a recovered log item
      decides to requeue itself and the system goes down before we call
      xlog_finish_defer_ops, the second log recovery will never see the new
      intent item and therefore has no idea that there was more work to do.
      It will finish recovery leaving the filesystem in a corrupted state.
      
      The same logic applies to /any/ deferred ops added during intent item
      recovery, not just the one handling the remaining work.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      93293bcb
    • Darrick J. Wong's avatar
      xfs: check dabtree node hash values when loading child blocks · e581c939
      Darrick J. Wong authored
      When xchk_da_btree_block is loading a non-root dabtree block, we know
      that the parent block had to have a (hashval, address) pointer to the
      block that we just loaded.  Check that the hashval in the parent matches
      the block we just loaded.
      
      This was found by fuzzing nbtree[3].hashval = ones in xfs/394.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      e581c939
    • Darrick J. Wong's avatar
      xfs: don't free rt blocks when we're doing a REMAP bunmapi call · 8df0fa39
      Darrick J. Wong authored
      When callers pass XFS_BMAPI_REMAP into xfs_bunmapi, they want the extent
      to be unmapped from the given file fork without the extent being freed.
      We do this for non-rt files, but we forgot to do this for realtime
      files.  So far this isn't a big deal since nobody makes a bunmapi call
      to a rt file with the REMAP flag set, but don't leave a logic bomb.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      8df0fa39
    • Chandan Babu R's avatar
      xfs: Set xfs_buf's b_ops member when zeroing bitmap/summary files · c54e14d1
      Chandan Babu R authored
      In xfs_growfs_rt(), we enlarge bitmap and summary files by allocating
      new blocks for both files. For each of the new blocks allocated, we
      allocate an xfs_buf, zero the payload, log the contents and commit the
      transaction. Hence these buffers will eventually find themselves
      appended to list at xfs_ail->ail_buf_list.
      
      Later, xfs_growfs_rt() loops across all of the new blocks belonging to
      the bitmap inode to set the bitmap values to 1. In doing so, it
      allocates a new transaction and invokes the following sequence of
      functions,
        - xfs_rtfree_range()
          - xfs_rtmodify_range()
            - xfs_rtbuf_get()
              We pass '&xfs_rtbuf_ops' as the ops pointer to xfs_trans_read_buf().
              - xfs_trans_read_buf()
      	  We find the xfs_buf of interest in per-ag hash table, invoke
      	  xfs_buf_reverify() which ends up assigning '&xfs_rtbuf_ops' to
      	  xfs_buf->b_ops.
      
      On the other hand, if xfs_growfs_rt_alloc() had allocated a few blocks
      for the bitmap inode and returned with an error, all the xfs_bufs
      corresponding to the new bitmap blocks that have been allocated would
      continue to be on xfs_ail->ail_buf_list list without ever having a
      non-NULL value assigned to their b_ops members. An AIL flush operation
      would then trigger the following warning message to be printed on the
      console,
      
        XFS (loop0): _xfs_buf_ioapply: no buf ops on daddr 0x58 len 8
        00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        CPU: 3 PID: 449 Comm: xfsaild/loop0 Not tainted 5.8.0-rc4-chandan-00038-g4d8c2b9de9ab-dirty #37
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
        Call Trace:
         dump_stack+0x57/0x70
         _xfs_buf_ioapply+0x37c/0x3b0
         ? xfs_rw_bdev+0x1e0/0x1e0
         ? xfs_buf_delwri_submit_buffers+0xd4/0x210
         __xfs_buf_submit+0x6d/0x1f0
         xfs_buf_delwri_submit_buffers+0xd4/0x210
         xfsaild+0x2c8/0x9e0
         ? __switch_to_asm+0x42/0x70
         ? xfs_trans_ail_cursor_first+0x80/0x80
         kthread+0xfe/0x140
         ? kthread_park+0x90/0x90
         ret_from_fork+0x22/0x30
      
      This message indicates that the xfs_buf had its b_ops member set to
      NULL.
      
      This commit fixes the issue by assigning "&xfs_rtbuf_ops" to b_ops
      member of each of the xfs_bufs logged by xfs_growfs_rt_alloc().
      Signed-off-by: default avatarChandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      c54e14d1
  6. 21 Sep, 2020 1 commit
    • Chandan Babu R's avatar
      xfs: Set xfs_buf type flag when growing summary/bitmap files · 72cc9513
      Chandan Babu R authored
      The following sequence of commands,
      
        mkfs.xfs -f -m reflink=0 -r rtdev=/dev/loop1,size=10M /dev/loop0
        mount -o rtdev=/dev/loop1 /dev/loop0 /mnt
        xfs_growfs  /mnt
      
      ... causes the following call trace to be printed on the console,
      
      XFS: Assertion failed: (bip->bli_flags & XFS_BLI_STALE) || (xfs_blft_from_flags(&bip->__bli_format) > XFS_BLFT_UNKNOWN_BUF && xfs_blft_from_flags(&bip->__bli_format) < XFS_BLFT_MAX_BUF), file: fs/xfs/xfs_buf_item.c, line: 331
      Call Trace:
       xfs_buf_item_format+0x632/0x680
       ? kmem_alloc_large+0x29/0x90
       ? kmem_alloc+0x70/0x120
       ? xfs_log_commit_cil+0x132/0x940
       xfs_log_commit_cil+0x26f/0x940
       ? xfs_buf_item_init+0x1ad/0x240
       ? xfs_growfs_rt_alloc+0x1fc/0x280
       __xfs_trans_commit+0xac/0x370
       xfs_growfs_rt_alloc+0x1fc/0x280
       xfs_growfs_rt+0x1a0/0x5e0
       xfs_file_ioctl+0x3fd/0xc70
       ? selinux_file_ioctl+0x174/0x220
       ksys_ioctl+0x87/0xc0
       __x64_sys_ioctl+0x16/0x20
       do_syscall_64+0x3e/0x70
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This occurs because the buffer being formatted has the value of
      XFS_BLFT_UNKNOWN_BUF assigned to the 'type' subfield of
      bip->bli_formats->blf_flags.
      
      This commit fixes the issue by assigning one of XFS_BLFT_RTSUMMARY_BUF
      and XFS_BLFT_RTBITMAP_BUF to the 'type' subfield of
      bip->bli_formats->blf_flags before committing the corresponding
      transaction.
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarChandan Babu R <chandanrlinux@gmail.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      72cc9513