1. 24 Nov, 2016 9 commits
  2. 09 Nov, 2016 1 commit
    • Brian Foster's avatar
      xfs: fix unbalanced inode reclaim flush locking · 98efe8af
      Brian Foster authored
      Filesystem shutdown testing on an older distro kernel has uncovered an
      imbalanced locking pattern for the inode flush lock in
      xfs_reclaim_inode(). Specifically, there is a double unlock sequence
      between the call to xfs_iflush_abort() and xfs_reclaim_inode() at the
      "reclaim:" label.
      
      This actually does not cause obvious problems on current kernels due to
      the current flush lock implementation. Older kernels use a counting
      based flush lock mechanism, however, which effectively breaks the lock
      indefinitely when an already unlocked flush lock is repeatedly unlocked.
      Though this only currently occurs on filesystem shutdown, it has
      reproduced the effect of elevating an fs shutdown to a system-wide crash
      or hang.
      
      As it turns out, the flush lock is not actually required for the reclaim
      logic in xfs_reclaim_inode() because by that time we have already cycled
      the flush lock once while holding ILOCK_EXCL. Therefore, remove the
      additional flush lock/unlock cycle around the 'reclaim:' label and
      update branches into this label to release the flush lock where
      appropriate. Add an assert to xfs_ifunlock() to help prevent future
      occurences of the same problem.
      Reported-by: default avatarZorro Lang <zlang@redhat.com>
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      
      98efe8af
  3. 08 Nov, 2016 4 commits
    • Eric Sandeen's avatar
      xfs: provide helper for counting extents from if_bytes · 5d829300
      Eric Sandeen authored
      The open-coded pattern:
      
      ifp->if_bytes / (uint)sizeof(xfs_bmbt_rec_t)
      
      is all over the xfs code; provide a new helper
      xfs_iext_count(ifp) to count the number of inline extents
      in an inode fork.
      
      [dchinner: pick up several missed conversions]
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      5d829300
    • Eric Sandeen's avatar
      xfs: fix up xfs_swap_extent_forks inline extent handling · 4dfce57d
      Eric Sandeen authored
      There have been several reports over the years of NULL pointer
      dereferences in xfs_trans_log_inode during xfs_fsr processes,
      when the process is doing an fput and tearing down extents
      on the temporary inode, something like:
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
      PID: 29439  TASK: ffff880550584fa0  CPU: 6   COMMAND: "xfs_fsr"
          [exception RIP: xfs_trans_log_inode+0x10]
       #9 [ffff8800a57bbbe0] xfs_bunmapi at ffffffffa037398e [xfs]
      #10 [ffff8800a57bbce8] xfs_itruncate_extents at ffffffffa0391b29 [xfs]
      #11 [ffff8800a57bbd88] xfs_inactive_truncate at ffffffffa0391d0c [xfs]
      #12 [ffff8800a57bbdb8] xfs_inactive at ffffffffa0392508 [xfs]
      #13 [ffff8800a57bbdd8] xfs_fs_evict_inode at ffffffffa035907e [xfs]
      #14 [ffff8800a57bbe00] evict at ffffffff811e1b67
      #15 [ffff8800a57bbe28] iput at ffffffff811e23a5
      #16 [ffff8800a57bbe58] dentry_kill at ffffffff811dcfc8
      #17 [ffff8800a57bbe88] dput at ffffffff811dd06c
      #18 [ffff8800a57bbea8] __fput at ffffffff811c823b
      #19 [ffff8800a57bbef0] ____fput at ffffffff811c846e
      #20 [ffff8800a57bbf00] task_work_run at ffffffff81093b27
      #21 [ffff8800a57bbf30] do_notify_resume at ffffffff81013b0c
      #22 [ffff8800a57bbf50] int_signal at ffffffff8161405d
      
      As it turns out, this is because the i_itemp pointer, along
      with the d_ops pointer, has been overwritten with zeros
      when we tear down the extents during truncate.  When the in-core
      inode fork on the temporary inode used by xfs_fsr was originally
      set up during the extent swap, we mistakenly looked at di_nextents
      to determine whether all extents fit inline, but this misses extents
      generated by speculative preallocation; we should be using if_bytes
      instead.
      
      This mistake corrupts the in-memory inode, and code in
      xfs_iext_remove_inline eventually gets bad inputs, causing
      it to memmove and memset incorrect ranges; this became apparent
      because the two values in ifp->if_u2.if_inline_ext[1] contained
      what should have been in d_ops and i_itemp; they were memmoved due
      to incorrect array indexing and then the original locations
      were zeroed with memset, again due to an array overrun.
      
      Fix this by properly using i_df.if_bytes to determine the number
      of extents, not di_nextents.
      
      Thanks to dchinner for looking at this with me and spotting the
      root cause.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      4dfce57d
    • Brian Foster's avatar
      xfs: don't BUG() on mixed direct and mapped I/O · 04197b34
      Brian Foster authored
      We've had reports of generic/095 causing XFS to BUG() in
      __xfs_get_blocks() due to the existence of delalloc blocks on a
      direct I/O read. generic/095 issues a mix of various types of I/O,
      including direct and memory mapped I/O to a single file. This is
      clearly not supported behavior and is known to lead to such
      problems. E.g., the lack of exclusion between the direct I/O and
      write fault paths means that a write fault can allocate delalloc
      blocks in a region of a file that was previously a hole after the
      direct read has attempted to flush/inval the file range, but before
      it actually reads the block mapping. In turn, the direct read
      discovers a delalloc extent and cannot proceed.
      
      While the appropriate solution here is to not mix direct and memory
      mapped I/O to the same regions of the same file, the current
      BUG_ON() behavior is probably overkill as it can crash the entire
      system.  Instead, localize the failure to the I/O in question by
      returning an error for a direct I/O that cannot be handled safely
      due to delalloc blocks. Be careful to allow the case of a direct
      write to post-eof delalloc blocks. This can occur due to speculative
      preallocation and is safe as post-eof blocks are not accompanied by
      dirty pages in pagecache (conversely, preallocation within eof must
      have been zeroed, and thus dirtied, before the inode size could have
      been increased beyond said blocks).
      
      Finally, provide an additional warning if a direct I/O write occurs
      while the file is memory mapped. This may not catch all problematic
      scenarios, but provides a hint that some known-to-be-problematic I/O
      methods are in use.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      04197b34
    • Brian Foster's avatar
      xfs: don't skip cow forks w/ delalloc blocks in cowblocks scan · 39937234
      Brian Foster authored
      The cowblocks background scanner currently clears the cowblocks tag
      for inodes without any real allocations in the cow fork. This
      excludes inodes with only delalloc blocks in the cow fork. While we
      might never expect to clear delalloc blocks from the cow fork in the
      background scanner, it is not necessarily correct to clear the
      cowblocks tag from such inodes.
      
      For example, if the background scanner happens to process an inode
      between a buffered write and writeback, the scanner catches the
      inode in a state after delalloc blocks have been allocated to the
      cow fork but before the delalloc blocks have been converted to real
      blocks by writeback. The background scanner then incorrectly clears
      the cowblocks tag, even if part of the aforementioned delalloc
      reservation will not be remapped to the data fork (i.e., extra
      blocks due to the cowextsize hint). This means that any such
      additional blocks in the cow fork might never be reclaimed by the
      background scanner and could persist until the inode itself is
      reclaimed.
      
      To address this problem, only skip and clear inodes without any cow
      fork allocations whatsoever from the background scanner. While we
      generally do not want to cancel delalloc reservations from the
      background scanner, the pagecache dirty check following the
      cowblocks check should prevent that situation. If we do end up with
      delalloc cow fork blocks without a dirty address space mapping, this
      is probably an indication that something has gone wrong and the
      blocks should be reclaimed, as they may never be converted to a real
      allocation.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      39937234
  4. 24 Oct, 2016 4 commits
  5. 20 Oct, 2016 22 commits