1. 01 Sep, 2017 15 commits
    • Christoph Hellwig's avatar
      iomap: return VM_FAULT_* codes from iomap_page_mkwrite · e7647fb4
      Christoph Hellwig authored
      All callers will need the VM_FAULT_* flags, so convert in the helper.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      e7647fb4
    • Brian Foster's avatar
      xfs: relog dirty buffers during swapext bmbt owner change · 2dd3d709
      Brian Foster authored
      The owner change bmbt scan that occurs during extent swap operations
      does not handle ordered buffer failures. Buffers that cannot be
      marked ordered must be physically logged so previously dirty ranges
      of the buffer can be relogged in the transaction.
      
      Since the bmbt scan may need to process and potentially log a large
      number of blocks, we can't expect to complete this operation in a
      single transaction. Update extent swap to use a permanent
      transaction with enough log reservation to physically log a buffer.
      Update the bmbt scan to physically log any buffers that cannot be
      ordered and to terminate the scan with -EAGAIN. On -EAGAIN, the
      caller rolls the transaction and restarts the scan. Finally, update
      the bmbt scan helper function to skip bmbt blocks that already match
      the expected owner so they are not reprocessed after scan restarts.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      [darrick: fix the xfs_trans_roll call]
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      2dd3d709
    • Brian Foster's avatar
      xfs: disallow marking previously dirty buffers as ordered · a5814bce
      Brian Foster authored
      Ordered buffers are used in situations where the buffer is not
      physically logged but must pass through the transaction/logging
      pipeline for a particular transaction. As a result, ordered buffers
      are not unpinned and written back until the transaction commits to
      the log. Ordered buffers have a strict requirement that the target
      buffer must not be currently dirty and resident in the log pipeline
      at the time it is marked ordered. If a dirty+ordered buffer is
      committed, the buffer is reinserted to the AIL but not physically
      relogged at the LSN of the associated checkpoint. The buffer log
      item is assigned the LSN of the latest checkpoint and the AIL
      effectively releases the previously logged buffer content from the
      active log before the buffer has been written back. If the tail
      pushes forward and a filesystem crash occurs while in this state, an
      inconsistent filesystem could result.
      
      It is currently the caller responsibility to ensure an ordered
      buffer is not already dirty from a previous modification. This is
      unclear and error prone when not used in situations where it is
      guaranteed a buffer has not been previously modified (such as new
      metadata allocations).
      
      To facilitate general purpose use of ordered buffers, update
      xfs_trans_ordered_buf() to conditionally order the buffer based on
      state of the log item and return the status of the result. If the
      bli is dirty, do not order the buffer and return false. The caller
      must either physically log the buffer (having acquired the
      appropriate log reservation) or push it from the AIL to clean it
      before it can be marked ordered in the current transaction.
      
      Note that ordered buffers are currently only used in two situations:
      1.) inode chunk allocation where previously logged buffers are not
      possible and 2.) extent swap which will be updated to handle ordered
      buffer failures in a separate patch.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      a5814bce
    • Brian Foster's avatar
      xfs: move bmbt owner change to last step of extent swap · 6fb10d6d
      Brian Foster authored
      The extent swap operation currently resets bmbt block owners before
      the inode forks are swapped. The bmbt buffers are marked as ordered
      so they do not have to be physically logged in the transaction.
      
      This use of ordered buffers is not safe as bmbt buffers may have
      been previously physically logged. The bmbt owner change algorithm
      needs to be updated to physically log buffers that are already dirty
      when/if they are encountered. This means that an extent swap will
      eventually require multiple rolling transactions to handle large
      btrees. In addition, all inode related changes must be logged before
      the bmbt owner change scan begins and can roll the transaction for
      the first time to preserve fs consistency via log recovery.
      
      In preparation for such fixes to the bmbt owner change algorithm,
      refactor the bmbt scan out of the extent fork swap code to the last
      operation before the transaction is committed. Update
      xfs_swap_extent_forks() to only set the inode log flags when an
      owner change scan is necessary. Update xfs_swap_extents() to trigger
      the owner change based on the inode log flags. Note that since the
      owner change now occurs after the extent fork swap, the inode btrees
      must be fixed up with the inode number of the current inode (similar
      to log recovery).
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      6fb10d6d
    • Brian Foster's avatar
      xfs: skip bmbt block ino validation during owner change · 99c794c6
      Brian Foster authored
      Extent swap uses xfs_btree_visit_blocks() to fix up bmbt block
      owners on v5 (!rmapbt) filesystems. The bmbt scan uses
      xfs_btree_lookup_get_block() to read bmbt blocks which verifies the
      current owner of the block against the parent inode of the bmbt.
      This works during extent swap because the bmbt owners are updated to
      the opposite inode number before the inode extent forks are swapped.
      
      The modified bmbt blocks are marked as ordered buffers which allows
      everything to commit in a single transaction. If the transaction
      commits to the log and the system crashes such that recovery of the
      extent swap is required, log recovery restarts the bmbt scan to fix
      up any bmbt blocks that may have not been written back before the
      crash. The log recovery bmbt scan occurs after the inode forks have
      been swapped, however. This causes the bmbt block owner verification
      to fail, leads to log recovery failure and requires xfs_repair to
      zap the log to recover.
      
      Define a new invalid inode owner flag to inform the btree block
      lookup mechanism that the current inode may be invalid with respect
      to the current owner of the bmbt block. Set this flag on the cursor
      used for change owner scans to allow this operation to work at
      runtime and during log recovery.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Fixes: bb3be7e7 ("xfs: check for bogus values in btree block headers")
      Cc: stable@vger.kernel.org
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      99c794c6
    • Brian Foster's avatar
      xfs: don't log dirty ranges for ordered buffers · 8dc518df
      Brian Foster authored
      Ordered buffers are attached to transactions and pushed through the
      logging infrastructure just like normal buffers with the exception
      that they are not actually written to the log. Therefore, we don't
      need to log dirty ranges of ordered buffers. xfs_trans_log_buf() is
      called on ordered buffers to set up all of the dirty state on the
      transaction, buffer and log item and prepare the buffer for I/O.
      
      Now that xfs_trans_dirty_buf() is available, call it from
      xfs_trans_ordered_buf() so the latter is now mutually exclusive with
      xfs_trans_log_buf(). This reflects the implementation of ordered
      buffers and helps eliminate confusion over the need to log ranges of
      ordered buffers just to set up internal log state.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      8dc518df
    • Brian Foster's avatar
      xfs: refactor buffer logging into buffer dirtying helper · 9684010d
      Brian Foster authored
      xfs_trans_log_buf() is responsible for logging the dirty segments of
      a buffer along with setting all of the necessary state on the
      transaction, buffer, bli, etc., to ensure that the associated items
      are marked as dirty and prepared for I/O. We have a couple use cases
      that need to to dirty a buffer in a transaction without actually
      logging dirty ranges of the buffer.  One existing use case is
      ordered buffers, which are currently logged with arbitrary ranges to
      accomplish this even though the content of ordered buffers is never
      written to the log. Another pending use case is to relog an already
      dirty buffer across rolled transactions within the deferred
      operations infrastructure. This is required to prevent a held
      (XFS_BLI_HOLD) buffer from pinning the tail of the log.
      
      Refactor xfs_trans_log_buf() into a new function that contains all
      of the logic responsible to dirty the transaction, lidp, buffer and
      bli. This new function can be used in the future for the use cases
      outlined above. This patch does not introduce functional changes.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      9684010d
    • Brian Foster's avatar
      xfs: ordered buffer log items are never formatted · e9385cc6
      Brian Foster authored
      Ordered buffers pass through the logging infrastructure without ever
      being written to the log. The way this works is that the ordered
      buffer status is transferred to the log vector at commit time via
      the ->iop_size() callback. In xlog_cil_insert_format_items(),
      ordered log vectors bypass ->iop_format() processing altogether.
      
      Therefore it is unnecessary for xfs_buf_item_format() to handle
      ordered buffers. Remove the unnecessary logic and assert that an
      ordered buffer never reaches this point.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      e9385cc6
    • Brian Foster's avatar
      xfs: remove unnecessary dirty bli format check for ordered bufs · 6453c65d
      Brian Foster authored
      xfs_buf_item_unlock() historically checked the dirty state of the
      buffer by manually checking the buffer log formats for dirty
      segments. The introduction of ordered buffers invalidated this check
      because ordered buffers have dirty bli's but no dirty (logged)
      segments. The check was updated to accommodate ordered buffers by
      looking at the bli state first and considering the blf only if the
      bli is clean.
      
      This logic is safe but unnecessary. There is no valid case where the
      bli is clean yet the blf has dirty segments. The bli is set dirty
      whenever the blf is logged (via xfs_trans_log_buf()) and the blf is
      cleared in the only place BLI_DIRTY is cleared (xfs_trans_binval()).
      
      Remove the conditional blf dirty checks and replace with an assert
      that should catch any discrepencies between bli and blf dirty
      states. Refactor the old blf dirty check into a helper function to
      be used by the assert.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      6453c65d
    • Brian Foster's avatar
      xfs: open-code xfs_buf_item_dirty() · a4f6cf6b
      Brian Foster authored
      It checks a single flag and has one caller. It probably isn't worth
      its own function.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      a4f6cf6b
    • Christoph Hellwig's avatar
      xfs: remove the ip argument to xfs_defer_finish · 8ad7c629
      Christoph Hellwig authored
      And instead require callers to explicitly join the inode using
      xfs_defer_ijoin.  Also consolidate the defer error handling in
      a few places using a goto label.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      8ad7c629
    • Christoph Hellwig's avatar
    • Christoph Hellwig's avatar
      xfs: refactor xfs_trans_roll · 411350df
      Christoph Hellwig authored
      Split xfs_trans_roll into a low-level helper that just rolls the
      actual transaction and a new higher level xfs_trans_roll_inode
      that takes care of logging and rejoining the inode.  This gets
      rid of the NULL inode case, and allows to simplify the special
      cases in the deferred operation code.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      411350df
    • Omar Sandoval's avatar
      xfs: check for race with xfs_reclaim_inode() in xfs_ifree_cluster() · f2e9ad21
      Omar Sandoval authored
      After xfs_ifree_cluster() finds an inode in the radix tree and verifies
      that the inode number is what it expected, xfs_reclaim_inode() can swoop
      in and free it. xfs_ifree_cluster() will then happily continue working
      on the freed inode. Most importantly, it will mark the inode stale,
      which will probably be overwritten when the inode slab object is
      reallocated, but if it has already been reallocated then we can end up
      with an inode spuriously marked stale.
      
      In 8a17d7dd ("xfs: mark reclaimed inodes invalid earlier") we added
      a second check to xfs_iflush_cluster() to detect this race, but the
      similar RCU lookup in xfs_ifree_cluster() needs the same treatment.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      f2e9ad21
    • Darrick J. Wong's avatar
      xfs: evict all inodes involved with log redo item · 799ea9e9
      Darrick J. Wong authored
      When we introduced the bmap redo log items, we set MS_ACTIVE on the
      mountpoint and XFS_IRECOVERY on the inode to prevent unlinked inodes
      from being truncated prematurely during log recovery.  This also had the
      effect of putting linked inodes on the lru instead of evicting them.
      
      Unfortunately, we neglected to find all those unreferenced lru inodes
      and evict them after finishing log recovery, which means that we leak
      them if anything goes wrong in the rest of xfs_mountfs, because the lru
      is only cleaned out on unmount.
      
      Therefore, evict unreferenced inodes in the lru list immediately
      after clearing MS_ACTIVE.
      
      Fixes: 17c12bcd ("xfs: when replaying bmap operations, don't let unlinked inodes get reaped")
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Cc: viro@ZenIV.linux.org.uk
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      799ea9e9
  2. 22 Aug, 2017 11 commits
    • Carlos Maiolino's avatar
      xfs: stop searching for free slots in an inode chunk when there are none · 2d32311c
      Carlos Maiolino authored
      In a filesystem without finobt, the Space manager selects an AG to alloc a new
      inode, where xfs_dialloc_ag_inobt() will search the AG for the free slot chunk.
      
      When the new inode is in the same AG as its parent, the btree will be searched
      starting on the parent's record, and then retried from the top if no slot is
      available beyond the parent's record.
      
      To exit this loop though, xfs_dialloc_ag_inobt() relies on the fact that the
      btree must have a free slot available, once its callers relied on the
      agi->freecount when deciding how/where to allocate this new inode.
      
      In the case when the agi->freecount is corrupted, showing available inodes in an
      AG, when in fact there is none, this becomes an infinite loop.
      
      Add a way to stop the loop when a free slot is not found in the btree, making
      the function to fall into the whole AG scan which will then, be able to detect
      the corruption and shut the filesystem down.
      
      As pointed by Brian, this might impact performance, giving the fact we
      don't reset the search distance anymore when we reach the end of the
      tree, giving it fewer tries before falling back to the whole AG search, but
      it will only affect searches that start within 10 records to the end of the tree.
      Signed-off-by: default avatarCarlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      2d32311c
    • Brian Foster's avatar
      xfs: add log recovery tracepoint for head/tail · e67d3d42
      Brian Foster authored
      Torn write detection and tail overwrite detection can shift the log
      head and tail respectively in the event of CRC mismatch or
      corruption errors. Add a high-level log recovery tracepoint to dump
      the final log head/tail and make those values easily attainable in
      debug/diagnostic situations.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      e67d3d42
    • Brian Foster's avatar
      xfs: handle -EFSCORRUPTED during head/tail verification · a4c9b34d
      Brian Foster authored
      Torn write and tail overwrite detection both trigger only on
      -EFSBADCRC errors. While this is the most likely failure scenario
      for each condition, -EFSCORRUPTED is still possible in certain cases
      depending on what ends up on disk when a torn write or partial tail
      overwrite occurs. For example, an invalid log record h_len can lead
      to an -EFSCORRUPTED error when running the log recovery CRC pass.
      
      Therefore, update log head and tail verification to trigger the
      associated head/tail fixups in the event of -EFSCORRUPTED errors
      along with -EFSBADCRC. Also, -EFSCORRUPTED can currently be returned
      from xlog_do_recovery_pass() before rhead_blk is initialized if the
      first record encountered happens to be corrupted. This leads to an
      incorrect 'first_bad' return value. Initialize rhead_blk earlier in
      the function to address that problem as well.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      a4c9b34d
    • Brian Foster's avatar
      xfs: add log item pinning error injection tag · 7f4d01f3
      Brian Foster authored
      Add an error injection tag to force log items in the AIL to the
      pinned state. This option can be used by test infrastructure to
      induce head behind tail conditions. Specifically, this is intended
      to be used by xfstests to reproduce log recovery problems after
      failed/corrupted log writes overwrite the last good tail LSN in the
      log.
      
      When enabled, AIL push attempts see log items in the AIL in the
      pinned state. This stalls metadata writeback and thus prevents the
      current tail of the log from moving forward. When disabled,
      subsequent AIL pushes observe the log items in their appropriate
      state and filesystem operation continues as normal.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      7f4d01f3
    • Brian Foster's avatar
      xfs: fix log recovery corruption error due to tail overwrite · 4a4f66ea
      Brian Foster authored
      If we consider the case where the tail (T) of the log is pinned long
      enough for the head (H) to push and block behind the tail, we can
      end up blocked in the following state without enough free space (f)
      in the log to satisfy a transaction reservation:
      
      	0	phys. log	N
      	[-------HffT---H'--T'---]
      
      The last good record in the log (before H) refers to T. The tail
      eventually pushes forward (T') leaving more free space in the log
      for writes to H. At this point, suppose space frees up in the log
      for the maximum of 8 in-core log buffers to start flushing out to
      the log. If this pushes the head from H to H', these next writes
      overwrite the previous tail T. This is safe because the items logged
      from T to T' have been written back and removed from the AIL.
      
      If the next log writes (H -> H') happen to fail and result in
      partial records in the log, the filesystem shuts down having
      overwritten T with invalid data. Log recovery correctly locates H on
      the subsequent mount, but H still refers to the now corrupted tail
      T. This results in log corruption errors and recovery failure.
      
      Since the tail overwrite results from otherwise correct runtime
      behavior, it is up to log recovery to try and deal with this
      situation. Update log recovery tail verification to run a CRC pass
      from the first record past the tail to the head. This facilitates
      error detection at T and moves the recovery tail to the first good
      record past H' (similar to truncating the head on torn write
      detection). If corruption is detected beyond the range possibly
      affected by the max number of iclogs, the log is legitimately
      corrupted and log recovery failure is expected.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      4a4f66ea
    • Brian Foster's avatar
      xfs: always verify the log tail during recovery · 5297ac1f
      Brian Foster authored
      Log tail verification currently only occurs when torn writes are
      detected at the head of the log. This was introduced because a
      change in the head block due to torn writes can lead to a change in
      the tail block (each log record header references the current tail)
      and the tail block should be verified before log recovery proceeds.
      
      Tail corruption is possible outside of torn write scenarios,
      however. For example, partial log writes can be detected and cleared
      during the initial head/tail block discovery process. If the partial
      write coincides with a tail overwrite, the log tail is corrupted and
      recovery fails.
      
      To facilitate correct handling of log tail overwites, update log
      recovery to always perform tail verification. This is necessary to
      detect potential tail overwrite conditions when torn writes may not
      have occurred. This changes normal (i.e., no torn writes) recovery
      behavior slightly to detect and return CRC related errors near the
      tail before actual recovery starts.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      5297ac1f
    • Brian Foster's avatar
      xfs: fix recovery failure when log record header wraps log end · 284f1c2c
      Brian Foster authored
      The high-level log recovery algorithm consists of two loops that
      walk the physical log and process log records from the tail to the
      head. The first loop handles the case where the tail is beyond the
      head and processes records up to the end of the physical log. The
      subsequent loop processes records from the beginning of the physical
      log to the head.
      
      Because log records can wrap around the end of the physical log, the
      first loop mentioned above must handle this case appropriately.
      Records are processed from in-core buffers, which means that this
      algorithm must split the reads of such records into two partial
      I/Os: 1.) from the beginning of the record to the end of the log and
      2.) from the beginning of the log to the end of the record. This is
      further complicated by the fact that the log record header and log
      record data are read into independent buffers.
      
      The current handling of each buffer correctly splits the reads when
      either the header or data starts before the end of the log and wraps
      around the end. The data read does not correctly handle the case
      where the prior header read wrapped or ends on the physical log end
      boundary. blk_no is incremented to or beyond the log end after the
      header read to point to the record data, but the split data read
      logic triggers, attempts to read from an invalid log block and
      ultimately causes log recovery to fail. This can be reproduced
      fairly reliably via xfstests tests generic/047 and generic/388 with
      large iclog sizes (256k) and small (10M) logs.
      
      If the record header read has pushed beyond the end of the physical
      log, the subsequent data read is actually contiguous. Update the
      data read logic to detect the case where blk_no has wrapped, mod it
      against the log size to read from the correct address and issue one
      contiguous read for the log data buffer. The log record is processed
      as normal from the buffer(s), the loop exits after the current
      iteration and the subsequent loop picks up with the first new record
      after the start of the log.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      284f1c2c
    • Carlos Maiolino's avatar
      xfs: Properly retry failed inode items in case of error during buffer writeback · d3a304b6
      Carlos Maiolino authored
      When a buffer has been failed during writeback, the inode items into it
      are kept flush locked, and are never resubmitted due the flush lock, so,
      if any buffer fails to be written, the items in AIL are never written to
      disk and never unlocked.
      
      This causes unmount operation to hang due these items flush locked in AIL,
      but this also causes the items in AIL to never be written back, even when
      the IO device comes back to normal.
      
      I've been testing this patch with a DM-thin device, creating a
      filesystem larger than the real device.
      
      When writing enough data to fill the DM-thin device, XFS receives ENOSPC
      errors from the device, and keep spinning on xfsaild (when 'retry
      forever' configuration is set).
      
      At this point, the filesystem can not be unmounted because of the flush locked
      items in AIL, but worse, the items in AIL are never retried at all
      (once xfs_inode_item_push() will skip the items that are flush locked),
      even if the underlying DM-thin device is expanded to the proper size.
      
      This patch fixes both cases, retrying any item that has been failed
      previously, using the infra-structure provided by the previous patch.
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarCarlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      d3a304b6
    • Carlos Maiolino's avatar
      xfs: Add infrastructure needed for error propagation during buffer IO failure · 0b80ae6e
      Carlos Maiolino authored
      With the current code, XFS never re-submit a failed buffer for IO,
      because the failed item in the buffer is kept in the flush locked state
      forever.
      
      To be able to resubmit an log item for IO, we need a way to mark an item
      as failed, if, for any reason the buffer which the item belonged to
      failed during writeback.
      
      Add a new log item callback to be used after an IO completion failure
      and make the needed clean ups.
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarCarlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      0b80ae6e
    • Eric Sandeen's avatar
      xfs: toggle readonly state around xfs_log_mount_finish · 6f4a1eef
      Eric Sandeen authored
      When we do log recovery on a readonly mount, unlinked inode
      processing does not happen due to the readonly checks in
      xfs_inactive(), which are trying to prevent any I/O on a
      readonly mount.
      
      This is misguided - we do I/O on readonly mounts all the time,
      for consistency; for example, log recovery.  So do the same
      RDONLY flag twiddling around xfs_log_mount_finish() as we
      do around xfs_log_mount(), for the same reason.
      
      This all cries out for a big rework but for now this is a
      simple fix to an obvious problem.
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      6f4a1eef
    • Eric Sandeen's avatar
      xfs: write unmount record for ro mounts · 757a69ef
      Eric Sandeen authored
      There are dueling comments in the xfs code about intent
      for log writes when unmounting a readonly filesystem.
      
      In xfs_mountfs, we see the intent:
      
      /*
       * Now the log is fully replayed, we can transition to full read-only
       * mode for read-only mounts. This will sync all the metadata and clean
       * the log so that the recovery we just performed does not have to be
       * replayed again on the next mount.
       */
      
      and it calls xfs_quiesce_attr(), but by the time we get to
      xfs_log_unmount_write(), it returns early for a RDONLY mount:
      
       * Don't write out unmount record on read-only mounts.
      
      Because of this, sequential ro mounts of a filesystem with
      a dirty log will replay the log each time, which seems odd.
      
      Fix this by writing an unmount record even for RO mounts, as long
      as norecovery wasn't specified (don't write a clean log record
      if a dirty log may still be there!) and the log device is
      writable.
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      757a69ef
  3. 21 Aug, 2017 12 commits
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc · 6470812e
      Linus Torvalds authored
      Pull sparc fixes from David Miller:
       "Just a couple small fixes, two of which have to do with gcc-7:
      
         1) Don't clobber kernel fixed registers in __multi4 libgcc helper.
      
         2) Fix a new uninitialized variable warning on sparc32 with gcc-7,
            from Thomas Petazzoni.
      
         3) Adjust pmd_t initializer on sparc32 to make gcc happy.
      
         4) If ATU isn't available, don't bark in the logs. From Tushar Dave"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
        sparc: kernel/pcic: silence gcc 7.x warning in pcibios_fixup_bus()
        sparc64: remove unnecessary log message
        sparc64: Don't clibber fixed registers in __multi4.
        mm: add pmd_t initializer __pmd() to work around a GCC bug.
      6470812e
    • Thomas Petazzoni's avatar
      sparc: kernel/pcic: silence gcc 7.x warning in pcibios_fixup_bus() · 2dc77533
      Thomas Petazzoni authored
      When building the kernel for Sparc using gcc 7.x, the build fails
      with:
      
      arch/sparc/kernel/pcic.c: In function ‘pcibios_fixup_bus’:
      arch/sparc/kernel/pcic.c:647:8: error: ‘cmd’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
          cmd |= PCI_COMMAND_IO;
              ^~
      
      The simplified code looks like this:
      
      unsigned int cmd;
      [...]
      pcic_read_config(dev->bus, dev->devfn, PCI_COMMAND, 2, &cmd);
      [...]
      cmd |= PCI_COMMAND_IO;
      
      I.e, the code assumes that pcic_read_config() will always initialize
      cmd. But it's not the case. Looking at pcic_read_config(), if
      bus->number is != 0 or if the size is not one of 1, 2 or 4, *val will
      not be initialized.
      
      As a simple fix, we initialize cmd to zero at the beginning of
      pcibios_fixup_bus.
      Signed-off-by: default avatarThomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2dc77533
    • Linus Torvalds's avatar
      Merge tag 'arc-4.13-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc · 05ab303b
      Linus Torvalds authored
      Pull ARC fixes from Vineet Gupta:
      
       - PAE40 related updates
      
       - SLC errata for region ops
      
       - intc line masking by default
      
      * tag 'arc-4.13-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
        arc: Mask individual IRQ lines during core INTC init
        ARCv2: PAE40: set MSB even if !CONFIG_ARC_HAS_PAE40 but PAE exists in SoC
        ARCv2: PAE40: Explicitly set MSB counterpart of SLC region ops addresses
        ARC: dma: implement dma_unmap_page and sg variant
        ARCv2: SLC: Make sure busy bit is set properly for region ops
        ARC: [plat-sim] Include this platform unconditionally
        ARC: [plat-axs10x]: prepare dts files for enabling PAE40 on axs103
        ARC: defconfig: Cleanup from old Kconfig options
      05ab303b
    • Linus Torvalds's avatar
      Merge tag 'rtc-4.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux · 0b3baec8
      Linus Torvalds authored
      Pull RTC fix from Alexandre Belloni:
       "Fix regmap configuration for ds1307"
      
      * tag 'rtc-4.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux:
        rtc: ds1307: fix regmap config
      0b3baec8
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · e3181f2c
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Fix IGMP handling wrt VRF, from David Ahern.
      
       2) Fix timer access to freed object in dccp, from Eric Dumazet.
      
       3) Use kmalloc_array() in ptr_ring to avoid overflow cases which are
          triggerable by userspace. Also from Eric Dumazet.
      
       4) Fix infinite loop in unmapping cleanup of nfp driver, from Colin Ian
          King.
      
       5) Correct datagram peek handling of empty SKBs, from Matthew Dawson.
      
       6) Fix use after free in TIPC, from Eric Dumazet.
      
       7) When replacing a route in ipv6 we need to reset the round robin
          pointer, from Wei Wang.
      
       8) Fix bug in pci_find_pcie_root_port() which was unearthed by the
          relaxed ordering changes, from Thierry Redding. I made sure to get
          an explicit ACK from Bjorn this time around :-)
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (27 commits)
        ipv6: repair fib6 tree in failure case
        net_sched: fix order of queue length updates in qdisc_replace()
        tools lib bpf: improve warning
        switchdev: documentation: minor typo fixes
        bpf, doc: also add s390x as arch to sysctl description
        net: sched: fix NULL pointer dereference when action calls some targets
        rxrpc: Fix oops when discarding a preallocated service call
        irda: do not leak initialized list.dev to userspace
        net/mlx4_core: Enable 4K UAR if SRIOV module parameter is not enabled
        PCI: Allow PCI express root ports to find themselves
        tcp: when rearming RTO, if RTO time is in past then fire RTO ASAP
        net: check and errout if res->fi is NULL when RTM_F_FIB_MATCH is set
        ipv6: reset fn->rr_ptr when replacing route
        sctp: fully initialize the IPv6 address in sctp_v6_to_addr()
        tipc: fix use-after-free
        tun: handle register_netdevice() failures properly
        datagram: When peeking datagrams with offset < 0 don't skip empty skbs
        bpf, doc: improve sysctl knob description
        netxen: fix incorrect loop counter decrement
        nfp: fix infinite loop on umapping cleanup
        ...
      e3181f2c
    • Oleg Nesterov's avatar
      pids: make task_tgid_nr_ns() safe · dd1c1f2f
      Oleg Nesterov authored
      This was reported many times, and this was even mentioned in commit
      52ee2dfd ("pids: refactor vnr/nr_ns helpers to make them safe") but
      somehow nobody bothered to fix the obvious problem: task_tgid_nr_ns() is
      not safe because task->group_leader points to nowhere after the exiting
      task passes exit_notify(), rcu_read_lock() can not help.
      
      We really need to change __unhash_process() to nullify group_leader,
      parent, and real_parent, but this needs some cleanups.  Until then we
      can turn task_tgid_nr_ns() into another user of __task_pid_nr_ns() and
      fix the problem.
      Reported-by: default avatarTroy Kensinger <tkensinger@google.com>
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dd1c1f2f
    • Heiner Kallweit's avatar
      rtc: ds1307: fix regmap config · 03619844
      Heiner Kallweit authored
      Current max_register setting breaks reading nvram on certain chips and
      also reading the standard registers on RX8130 where register map starts
      at 0x10.
      Signed-off-by: default avatarHeiner Kallweit <hkallweit1@gmail.com>
      Fixes: 11e5890b "rtc: ds1307: convert driver to regmap"
      Signed-off-by: default avatarAlexandre Belloni <alexandre.belloni@free-electrons.com>
      03619844
    • Wei Wang's avatar
      ipv6: repair fib6 tree in failure case · 348a4002
      Wei Wang authored
      In fib6_add(), it is possible that fib6_add_1() picks an intermediate
      node and sets the node's fn->leaf to NULL in order to add this new
      route. However, if fib6_add_rt2node() fails to add the new
      route for some reason, fn->leaf will be left as NULL and could
      potentially cause crash when fn->leaf is accessed in fib6_locate().
      This patch makes sure fib6_repair_tree() is called to properly repair
      fn->leaf in the above failure case.
      
      Here is the syzkaller reported general protection fault in fib6_locate:
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] SMP KASAN
      Modules linked in:
      CPU: 0 PID: 40937 Comm: syz-executor3 Not tainted
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      task: ffff8801d7d64100 ti: ffff8801d01a0000 task.ti: ffff8801d01a0000
      RIP: 0010:[<ffffffff82a3e0e1>]  [<ffffffff82a3e0e1>] __ipv6_prefix_equal64_half include/net/ipv6.h:475 [inline]
      RIP: 0010:[<ffffffff82a3e0e1>]  [<ffffffff82a3e0e1>] ipv6_prefix_equal include/net/ipv6.h:492 [inline]
      RIP: 0010:[<ffffffff82a3e0e1>]  [<ffffffff82a3e0e1>] fib6_locate_1 net/ipv6/ip6_fib.c:1210 [inline]
      RIP: 0010:[<ffffffff82a3e0e1>]  [<ffffffff82a3e0e1>] fib6_locate+0x281/0x3c0 net/ipv6/ip6_fib.c:1233
      RSP: 0018:ffff8801d01a36a8  EFLAGS: 00010202
      RAX: 0000000000000020 RBX: ffff8801bc790e00 RCX: ffffc90002983000
      RDX: 0000000000001219 RSI: ffff8801d01a37a0 RDI: 0000000000000100
      RBP: ffff8801d01a36f0 R08: 00000000000000ff R09: 0000000000000000
      R10: 0000000000000003 R11: 0000000000000000 R12: 0000000000000001
      R13: dffffc0000000000 R14: ffff8801d01a37a0 R15: 0000000000000000
      FS:  00007f6afd68c700(0000) GS:ffff8801db400000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000004c6340 CR3: 00000000ba41f000 CR4: 00000000001426f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Stack:
       ffff8801d01a37a8 ffff8801d01a3780 ffffed003a0346f5 0000000c82a23ea0
       ffff8800b7bd7700 ffff8801d01a3780 ffff8800b6a1c940 ffffffff82a23ea0
       ffff8801d01a3920 ffff8801d01a3748 ffffffff82a223d6 ffff8801d7d64988
      Call Trace:
       [<ffffffff82a223d6>] ip6_route_del+0x106/0x570 net/ipv6/route.c:2109
       [<ffffffff82a23f9d>] inet6_rtm_delroute+0xfd/0x100 net/ipv6/route.c:3075
       [<ffffffff82621359>] rtnetlink_rcv_msg+0x549/0x7a0 net/core/rtnetlink.c:3450
       [<ffffffff8274c1d1>] netlink_rcv_skb+0x141/0x370 net/netlink/af_netlink.c:2281
       [<ffffffff82613ddf>] rtnetlink_rcv+0x2f/0x40 net/core/rtnetlink.c:3456
       [<ffffffff8274ad38>] netlink_unicast_kernel net/netlink/af_netlink.c:1206 [inline]
       [<ffffffff8274ad38>] netlink_unicast+0x518/0x750 net/netlink/af_netlink.c:1232
       [<ffffffff8274b83e>] netlink_sendmsg+0x8ce/0xc30 net/netlink/af_netlink.c:1778
       [<ffffffff82564aff>] sock_sendmsg_nosec net/socket.c:609 [inline]
       [<ffffffff82564aff>] sock_sendmsg+0xcf/0x110 net/socket.c:619
       [<ffffffff82564d62>] sock_write_iter+0x222/0x3a0 net/socket.c:834
       [<ffffffff8178523d>] new_sync_write+0x1dd/0x2b0 fs/read_write.c:478
       [<ffffffff817853f4>] __vfs_write+0xe4/0x110 fs/read_write.c:491
       [<ffffffff81786c38>] vfs_write+0x178/0x4b0 fs/read_write.c:538
       [<ffffffff817892a9>] SYSC_write fs/read_write.c:585 [inline]
       [<ffffffff817892a9>] SyS_write+0xd9/0x1b0 fs/read_write.c:577
       [<ffffffff82c71e32>] entry_SYSCALL_64_fastpath+0x12/0x17
      
      Note: there is no "Fixes" tag as this seems to be a bug introduced
      very early.
      Signed-off-by: default avatarWei Wang <weiwan@google.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      348a4002
    • Konstantin Khlebnikov's avatar
      net_sched: fix order of queue length updates in qdisc_replace() · 68a66d14
      Konstantin Khlebnikov authored
      This important to call qdisc_tree_reduce_backlog() after changing queue
      length. Parent qdisc should deactivate class in ->qlen_notify() called from
      qdisc_tree_reduce_backlog() but this happens only if qdisc->q.qlen in zero.
      
      Missed class deactivations leads to crashes/warnings at picking packets
      from empty qdisc and corrupting state at reactivating this class in future.
      Signed-off-by: default avatarKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Fixes: 86a7996c ("net_sched: introduce qdisc_replace() helper")
      Acked-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      68a66d14
    • Eric Leblond's avatar
    • Chris Packham's avatar
      switchdev: documentation: minor typo fixes · 5a784498
      Chris Packham authored
      Two typos in switchdev.txt
      Signed-off-by: default avatarChris Packham <chris.packham@alliedtelesis.co.nz>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5a784498
    • Daniel Borkmann's avatar
      bpf, doc: also add s390x as arch to sysctl description · d4dd2d75
      Daniel Borkmann authored
      Looks like this was accidentally missed, so still add s390x
      as supported eBPF JIT arch to bpf_jit_enable.
      
      Fixes: 014cd0a3 ("bpf: Update sysctl documentation to list all supported architectures")
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d4dd2d75
  4. 20 Aug, 2017 2 commits
    • Linus Torvalds's avatar
      Linux 4.13-rc6 · 14ccee78
      Linus Torvalds authored
      14ccee78
    • Linus Torvalds's avatar
      Sanitize 'move_pages()' permission checks · 197e7e52
      Linus Torvalds authored
      The 'move_paghes()' system call was introduced long long ago with the
      same permission checks as for sending a signal (except using
      CAP_SYS_NICE instead of CAP_SYS_KILL for the overriding capability).
      
      That turns out to not be a great choice - while the system call really
      only moves physical page allocations around (and you need other
      capabilities to do a lot of it), you can check the return value to map
      out some the virtual address choices and defeat ASLR of a binary that
      still shares your uid.
      
      So change the access checks to the more common 'ptrace_may_access()'
      model instead.
      
      This tightens the access checks for the uid, and also effectively
      changes the CAP_SYS_NICE check to CAP_SYS_PTRACE, but it's unlikely that
      anybody really _uses_ this legacy system call any more (we hav ebetter
      NUMA placement models these days), so I expect nobody to notice.
      
      Famous last words.
      Reported-by: default avatarOtto Ebeling <otto.ebeling@iki.fi>
      Acked-by: default avatarEric W. Biederman <ebiederm@xmission.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: stable@kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      197e7e52