1. 09 Dec, 2016 2 commits
  2. 30 Nov, 2016 7 commits
  3. 28 Nov, 2016 8 commits
    • Brian Foster's avatar
      xfs: pass post-eof speculative prealloc blocks to bmapi · f782088c
      Brian Foster authored
      xfs_file_iomap_begin_delay() implements post-eof speculative
      preallocation by extending the block count of the requested delayed
      allocation. Now that xfs_bmapi_reserve_delalloc() has been updated to
      handle prealloc blocks separately and tag the inode, update
      xfs_file_iomap_begin_delay() to use the new parameter and rely on the
      former to tag the inode.
      
      Note that this patch does not change behavior.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      
      f782088c
    • Brian Foster's avatar
      xfs: clean up cow fork reservation and tag inodes correctly · 0260d8ff
      Brian Foster authored
      COW fork reservation is implemented via delayed allocation. The code is
      modeled after the traditional delalloc allocation code, but is slightly
      different in terms of how preallocation occurs. Rather than post-eof
      speculative preallocation, COW fork preallocation is implemented via a
      COW extent size hint that is designed to minimize fragmentation as a
      reflinked file is split over time.
      
      xfs_reflink_reserve_cow() still uses logic that is oriented towards
      dealing with post-eof speculative preallocation, however, and is stale
      or not necessarily correct. First, the EOF alignment to the COW extent
      size hint is implemented in xfs_bmapi_reserve_delalloc() (which does so
      correctly by aligning the start and end offsets) and so is not necessary
      in xfs_reflink_reserve_cow(). The backoff and retry logic on ENOSPC is
      also ineffective for the same reason, as xfs_bmapi_reserve_delalloc()
      will simply perform the same allocation request on the retry. Finally,
      since the COW extent size hint aligns the start and end offset of the
      range to allocate, the end_fsb != orig_end_fsb logic is not sufficient.
      Indeed, if a write request happens to end on an aligned offset, it is
      possible that we do not tag the inode for COW preallocation even though
      xfs_bmapi_reserve_delalloc() may have preallocated at the start offset.
      
      Kill the unnecessary, duplicate code in xfs_reflink_reserve_cow().
      Remove the inode tag logic as well since xfs_bmapi_reserve_delalloc()
      has been updated to tag the inode correctly.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      
      0260d8ff
    • Brian Foster's avatar
      xfs: track preallocation separately in xfs_bmapi_reserve_delalloc() · 974ae922
      Brian Foster authored
      Speculative preallocation is currently processed entirely by the callers
      of xfs_bmapi_reserve_delalloc(). The caller determines how much
      preallocation to include, adjusts the extent length and passes down the
      resulting request.
      
      While this works fine for post-eof speculative preallocation, it is not
      as reliable for COW fork preallocation. COW fork preallocation is
      implemented via the cowextszhint, which aligns the start offset as well
      as the length of the extent. Further, it is difficult for the caller to
      accurately identify when preallocation occurs because the returned
      extent could have been merged with neighboring extents in the fork.
      
      To simplify this situation and facilitate further COW fork preallocation
      enhancements, update xfs_bmapi_reserve_delalloc() to take a separate
      preallocation parameter to incorporate into the allocation request. The
      preallocation blocks value is tacked onto the end of the request and
      adjusted to accommodate neighboring extents and extent size limits.
      Since xfs_bmapi_reserve_delalloc() now knows precisely how much
      preallocation was included in the allocation, it can also tag the inodes
      appropriately to support preallocation reclaim.
      
      Note that xfs_bmapi_reserve_delalloc() callers are not yet updated to
      use the preallocation mechanism. This patch should not change behavior
      outside of correctly tagging reflink inodes when start offset
      preallocation occurs (which the caller does not handle correctly).
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      
      974ae922
    • Darrick J. Wong's avatar
      xfs: always succeed when deduping zero bytes · fba3e594
      Darrick J. Wong authored
      It turns out that btrfs and xfs had differing interpretations of what
      to do when the dedupe length is zero.  Change xfs to follow btrfs'
      semantics so that the userland interface is consistent.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      
      fba3e594
    • Bhumika Goyal's avatar
      fs: xfs: libxfs: constify xfs_nameops structures · cf7841c1
      Bhumika Goyal authored
      Declare the structure xfs_nameops as const as it is only stored in the
      m_dirnameops field of a xfs_mount structure. This field is of type
      const struct xfs_nameops *, so xfs_nameops structures having this
      property can be declared as const.
      Done using Coccinelle:
      @r1 disable optional_qualifier @
      identifier i;
      position p;
      @@
      static struct xfs_nameops i@p = {...};
      
      @ok1@
      identifier r1.i;
      position p;
      struct xfs_mount mp;
      @@
      mp.m_dirnameops=&i@p
      
      @bad@
      position p!={r1.p,ok1.p};
      identifier r1.i;
      @@
      i@p
      
      @depends on !bad disable optional_qualifier@
      identifier r1.i;
      @@
      static
      +const
      struct xfs_nameops i={...};
      
      @depends on !bad disable optional_qualifier@
      identifier r1.i;
      @@
      +const
      struct xfs_nameops i;
      
      File size before:
         text	   data	    bss	    dec	    hex	filename
         5302	     85	      0	   5387	   150b	fs/xfs/libxfs/xfs_dir2.o
      
      File size after:
         text	   data	    bss	    dec	    hex	filename
         5318	     69	      0	   5387	   150b	fs/xfs/libxfs/xfs_dir2.o
      Signed-off-by: default avatarBhumika Goyal <bhumirks@gmail.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      
      cf7841c1
    • Bhumika Goyal's avatar
      fs: xfs: xfs_icreate_item: constify xfs_item_ops structure · bb6e0ebe
      Bhumika Goyal authored
      Declare the structure xfs_item_ops as const as it is only passed as an
      argument to the function xfs_log_item_init. As this argument is of type
      const struct xfs_item_ops *, so xfs_item_ops structures having this
      property can be declared as const.
      Done using Coccinelle:
      
      @r1 disable optional_qualifier @
      identifier i;
      position p;
      @@
      static struct xfs_item_ops i@p = {...};
      
      @ok1@
      identifier r1.i;
      position p;
      expression e1,e2,e3;
      @@
      xfs_log_item_init(e1,e2,e3,&i@p)
      
      @bad@
      position p!={r1.p,ok1.p};
      identifier r1.i;
      @@
      i@p
      
      @depends on !bad disable optional_qualifier@
      identifier r1.i;
      @@
      static
      +const
      struct xfs_item_ops i={...};
      
      @depends on !bad disable optional_qualifier@
      identifier r1.i;
      @@
      +const
      struct xfs_item_ops i;
      
      File size before:
         text	   data	    bss	    dec	    hex	filename
          737	     64	      8	    809	    329	fs/xfs/xfs_icreate_item.o
      
      File size after:
         text	   data	    bss	    dec	    hex	filename
          801	      0	      8	    809	    329	fs/xfs/xfs_icreate_item.o
      Signed-off-by: default avatarBhumika Goyal <bhumirks@gmail.com>
      Reviewed-by: default avatarEric Sandeen <sandeen@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      
      bb6e0ebe
    • Darrick J. Wong's avatar
      xfs: factor rmap btree size into the indlen calculations · fd26a880
      Darrick J. Wong authored
      When we're estimating the amount of space it's going to take to satisfy
      a delalloc reservation, we need to include the space that we might need
      to grow the rmapbt.  This helps us to avoid running out of space later
      when _iomap_write_allocate needs more space than we reserved.  Eryu Guan
      observed this happening on generic/224 when sunit/swidth were set.
      Reported-by: default avatarEryu Guan <eguan@redhat.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      
      fd26a880
    • Eric Sandeen's avatar
      xfs: add XBF_XBF_NO_IOACCT to buf trace output · 1247ec4c
      Eric Sandeen authored
      When XBF_NO_IOACCT got added, it missed the translation
      in XFS_BUF_FLAGS, so we see "0x8" in trace output rather
      than the flag name.  Fix it.
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      
      1247ec4c
  4. 24 Nov, 2016 14 commits
  5. 09 Nov, 2016 2 commits
    • Jan Kara's avatar
      dax: Introduce IOMAP_FAULT flag · 9484ab1b
      Jan Kara authored
      Introduce a flag telling iomap operations whether they are handling a
      fault or other IO. That may influence behavior wrt inode size and
      similar things.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      9484ab1b
    • Brian Foster's avatar
      xfs: fix unbalanced inode reclaim flush locking · 98efe8af
      Brian Foster authored
      Filesystem shutdown testing on an older distro kernel has uncovered an
      imbalanced locking pattern for the inode flush lock in
      xfs_reclaim_inode(). Specifically, there is a double unlock sequence
      between the call to xfs_iflush_abort() and xfs_reclaim_inode() at the
      "reclaim:" label.
      
      This actually does not cause obvious problems on current kernels due to
      the current flush lock implementation. Older kernels use a counting
      based flush lock mechanism, however, which effectively breaks the lock
      indefinitely when an already unlocked flush lock is repeatedly unlocked.
      Though this only currently occurs on filesystem shutdown, it has
      reproduced the effect of elevating an fs shutdown to a system-wide crash
      or hang.
      
      As it turns out, the flush lock is not actually required for the reclaim
      logic in xfs_reclaim_inode() because by that time we have already cycled
      the flush lock once while holding ILOCK_EXCL. Therefore, remove the
      additional flush lock/unlock cycle around the 'reclaim:' label and
      update branches into this label to release the flush lock where
      appropriate. Add an assert to xfs_ifunlock() to help prevent future
      occurences of the same problem.
      Reported-by: default avatarZorro Lang <zlang@redhat.com>
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      
      98efe8af
  6. 08 Nov, 2016 7 commits
    • Eric Sandeen's avatar
      xfs: provide helper for counting extents from if_bytes · 5d829300
      Eric Sandeen authored
      The open-coded pattern:
      
      ifp->if_bytes / (uint)sizeof(xfs_bmbt_rec_t)
      
      is all over the xfs code; provide a new helper
      xfs_iext_count(ifp) to count the number of inline extents
      in an inode fork.
      
      [dchinner: pick up several missed conversions]
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      5d829300
    • Eric Sandeen's avatar
      xfs: fix up xfs_swap_extent_forks inline extent handling · 4dfce57d
      Eric Sandeen authored
      There have been several reports over the years of NULL pointer
      dereferences in xfs_trans_log_inode during xfs_fsr processes,
      when the process is doing an fput and tearing down extents
      on the temporary inode, something like:
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
      PID: 29439  TASK: ffff880550584fa0  CPU: 6   COMMAND: "xfs_fsr"
          [exception RIP: xfs_trans_log_inode+0x10]
       #9 [ffff8800a57bbbe0] xfs_bunmapi at ffffffffa037398e [xfs]
      #10 [ffff8800a57bbce8] xfs_itruncate_extents at ffffffffa0391b29 [xfs]
      #11 [ffff8800a57bbd88] xfs_inactive_truncate at ffffffffa0391d0c [xfs]
      #12 [ffff8800a57bbdb8] xfs_inactive at ffffffffa0392508 [xfs]
      #13 [ffff8800a57bbdd8] xfs_fs_evict_inode at ffffffffa035907e [xfs]
      #14 [ffff8800a57bbe00] evict at ffffffff811e1b67
      #15 [ffff8800a57bbe28] iput at ffffffff811e23a5
      #16 [ffff8800a57bbe58] dentry_kill at ffffffff811dcfc8
      #17 [ffff8800a57bbe88] dput at ffffffff811dd06c
      #18 [ffff8800a57bbea8] __fput at ffffffff811c823b
      #19 [ffff8800a57bbef0] ____fput at ffffffff811c846e
      #20 [ffff8800a57bbf00] task_work_run at ffffffff81093b27
      #21 [ffff8800a57bbf30] do_notify_resume at ffffffff81013b0c
      #22 [ffff8800a57bbf50] int_signal at ffffffff8161405d
      
      As it turns out, this is because the i_itemp pointer, along
      with the d_ops pointer, has been overwritten with zeros
      when we tear down the extents during truncate.  When the in-core
      inode fork on the temporary inode used by xfs_fsr was originally
      set up during the extent swap, we mistakenly looked at di_nextents
      to determine whether all extents fit inline, but this misses extents
      generated by speculative preallocation; we should be using if_bytes
      instead.
      
      This mistake corrupts the in-memory inode, and code in
      xfs_iext_remove_inline eventually gets bad inputs, causing
      it to memmove and memset incorrect ranges; this became apparent
      because the two values in ifp->if_u2.if_inline_ext[1] contained
      what should have been in d_ops and i_itemp; they were memmoved due
      to incorrect array indexing and then the original locations
      were zeroed with memset, again due to an array overrun.
      
      Fix this by properly using i_df.if_bytes to determine the number
      of extents, not di_nextents.
      
      Thanks to dchinner for looking at this with me and spotting the
      root cause.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      4dfce57d
    • Brian Foster's avatar
      xfs: don't BUG() on mixed direct and mapped I/O · 04197b34
      Brian Foster authored
      We've had reports of generic/095 causing XFS to BUG() in
      __xfs_get_blocks() due to the existence of delalloc blocks on a
      direct I/O read. generic/095 issues a mix of various types of I/O,
      including direct and memory mapped I/O to a single file. This is
      clearly not supported behavior and is known to lead to such
      problems. E.g., the lack of exclusion between the direct I/O and
      write fault paths means that a write fault can allocate delalloc
      blocks in a region of a file that was previously a hole after the
      direct read has attempted to flush/inval the file range, but before
      it actually reads the block mapping. In turn, the direct read
      discovers a delalloc extent and cannot proceed.
      
      While the appropriate solution here is to not mix direct and memory
      mapped I/O to the same regions of the same file, the current
      BUG_ON() behavior is probably overkill as it can crash the entire
      system.  Instead, localize the failure to the I/O in question by
      returning an error for a direct I/O that cannot be handled safely
      due to delalloc blocks. Be careful to allow the case of a direct
      write to post-eof delalloc blocks. This can occur due to speculative
      preallocation and is safe as post-eof blocks are not accompanied by
      dirty pages in pagecache (conversely, preallocation within eof must
      have been zeroed, and thus dirtied, before the inode size could have
      been increased beyond said blocks).
      
      Finally, provide an additional warning if a direct I/O write occurs
      while the file is memory mapped. This may not catch all problematic
      scenarios, but provides a hint that some known-to-be-problematic I/O
      methods are in use.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      04197b34
    • Brian Foster's avatar
      xfs: don't skip cow forks w/ delalloc blocks in cowblocks scan · 39937234
      Brian Foster authored
      The cowblocks background scanner currently clears the cowblocks tag
      for inodes without any real allocations in the cow fork. This
      excludes inodes with only delalloc blocks in the cow fork. While we
      might never expect to clear delalloc blocks from the cow fork in the
      background scanner, it is not necessarily correct to clear the
      cowblocks tag from such inodes.
      
      For example, if the background scanner happens to process an inode
      between a buffered write and writeback, the scanner catches the
      inode in a state after delalloc blocks have been allocated to the
      cow fork but before the delalloc blocks have been converted to real
      blocks by writeback. The background scanner then incorrectly clears
      the cowblocks tag, even if part of the aforementioned delalloc
      reservation will not be remapped to the data fork (i.e., extra
      blocks due to the cowextsize hint). This means that any such
      additional blocks in the cow fork might never be reclaimed by the
      background scanner and could persist until the inode itself is
      reclaimed.
      
      To address this problem, only skip and clear inodes without any cow
      fork allocations whatsoever from the background scanner. While we
      generally do not want to cancel delalloc reservations from the
      background scanner, the pagecache dirty check following the
      cowblocks check should prevent that situation. If we do end up with
      delalloc cow fork blocks without a dirty address space mapping, this
      is probably an indication that something has gone wrong and the
      blocks should be reclaimed, as they may never be converted to a real
      allocation.
      Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      39937234
    • Ross Zwisler's avatar
      dax: remove "depends on BROKEN" from FS_DAX_PMD · 190b5caa
      Ross Zwisler authored
      Now that DAX PMD faults are once again working and are now participating in
      DAX's radix tree locking scheme, allow their config option to be enabled.
      Signed-off-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      190b5caa
    • Ross Zwisler's avatar
      xfs: use struct iomap based DAX PMD fault path · 862f1b9d
      Ross Zwisler authored
      Switch xfs_filemap_pmd_fault() from using dax_pmd_fault() to the new and
      improved dax_iomap_pmd_fault().  Also, now that it has no more users,
      remove xfs_get_blocks_dax_fault().
      Signed-off-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      862f1b9d
    • Ross Zwisler's avatar
      dax: add struct iomap based DAX PMD support · 642261ac
      Ross Zwisler authored
      DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
      locking.  This patch allows DAX PMDs to participate in the DAX radix tree
      based locking scheme so that they can be re-enabled using the new struct
      iomap based fault handlers.
      
      There are currently three types of DAX 4k entries: 4k zero pages, 4k DAX
      mappings that have an associated block allocation, and 4k DAX empty
      entries.  The empty entries exist to provide locking for the duration of a
      given page fault.
      
      This patch adds three equivalent 2MiB DAX entries: Huge Zero Page (HZP)
      entries, PMD DAX entries that have associated block allocations, and 2 MiB
      DAX empty entries.
      
      Unlike the 4k case where we insert a struct page* into the radix tree for
      4k zero pages, for HZP we insert a DAX exceptional entry with the new
      RADIX_DAX_HZP flag set.  This is because we use a single 2 MiB zero page in
      every 2MiB hole mapping, and it doesn't make sense to have that same struct
      page* with multiple entries in multiple trees.  This would cause contention
      on the single page lock for the one Huge Zero Page, and it would break the
      page->index and page->mapping associations that are assumed to be valid in
      many other places in the kernel.
      
      One difficult use case is when one thread is trying to use 4k entries in
      radix tree for a given offset, and another thread is using 2 MiB entries
      for that same offset.  The current code handles this by making the 2 MiB
      user fall back to 4k entries for most cases.  This was done because it is
      the simplest solution, and because the use of 2MiB pages is already
      opportunistic.
      
      If we were to try to upgrade from 4k pages to 2MiB pages for a given range,
      we run into the problem of how we lock out 4k page faults for the entire
      2MiB range while we clean out the radix tree so we can insert the 2MiB
      entry.  We can solve this problem if we need to, but I think that the cases
      where both 2MiB entries and 4K entries are being used for the same range
      will be rare enough and the gain small enough that it probably won't be
      worth the complexity.
      Signed-off-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      642261ac