1. 24 Aug, 2010 10 commits
    • Christoph Hellwig's avatar
      xfs: do not discard page cache data on EAGAIN · b5420f23
      Christoph Hellwig authored
      If xfs_map_blocks returns EAGAIN because of lock contention we must redirty the
      page and not disard the pagecache content and return an error from writepage.
      We used to do this correctly, but the logic got lost during the recent
      reshuffle of the writepage code.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reported-by: default avatarMike Gao <ygao.linux@gmail.com>
      Tested-by: default avatarMike Gao <ygao.linux@gmail.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      b5420f23
    • Dave Chinner's avatar
      xfs: don't do memory allocation under the CIL context lock · 3b93c7aa
      Dave Chinner authored
      Formatting items requires memory allocation when using delayed
      logging. Currently that memory allocation is done while holding the
      CIL context lock in read mode. This means that if memory allocation
      takes some time (e.g. enters reclaim), we cannot push on the CIL
      until the allocation(s) required by formatting complete. This can
      stall CIL pushes for some time, and once a push is stalled so are
      all new transaction commits.
      
      Fix this splitting the item formatting into two steps. The first
      step which does the allocation and memcpy() into the allocated
      buffer is now done outside the CIL context lock, and only the CIL
      insert is done inside the CIL context lock. This avoids the stall
      issue.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      3b93c7aa
    • Dave Chinner's avatar
      xfs: Reduce log force overhead for delayed logging · a44f13ed
      Dave Chinner authored
      Delayed logging adds some serialisation to the log force process to
      ensure that it does not deference a bad commit context structure
      when determining if a CIL push is necessary or not. It does this by
      grabing the CIL context lock exclusively, then dropping it before
      pushing the CIL if necessary. This causes serialisation of all log
      forces and pushes regardless of whether a force is necessary or not.
      As a result fsync heavy workloads (like dbench) can be significantly
      slower with delayed logging than without.
      
      To avoid this penalty, copy the current sequence from the context to
      the CIL structure when they are swapped. This allows us to do
      unlocked checks on the current sequence without having to worry
      about dereferencing context structures that may have already been
      freed. Hence we can remove the CIL context locking in the forcing
      code and only call into the push code if the current context matches
      the sequence we need to force.
      
      By passing the sequence into the push code, we can check the
      sequence again once we have the CIL lock held exclusive and abort if
      the sequence has already been pushed. This avoids a lock round-trip
      and unnecessary CIL pushes when we have racing push calls.
      
      The result is that the regression in dbench performance goes away -
      this change improves dbench performance on a ramdisk from ~2100MB/s
      to ~2500MB/s. This compares favourably to not using delayed logging
      which retuns ~2500MB/s for the same workload.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      a44f13ed
    • Dave Chinner's avatar
      xfs: dummy transactions should not dirty VFS state · 1a387d3b
      Dave Chinner authored
      When we  need to cover the log, we issue dummy transactions to ensure
      the current log tail is on disk. Unfortunately we currently use the
      root inode in the dummy transaction, and the act of committing the
      transaction dirties the inode at the VFS level.
      
      As a result, the VFS writeback of the dirty inode will prevent the
      filesystem from idling long enough for the log covering state
      machine to complete. The state machine gets stuck in a loop issuing
      new dummy transactions to cover the log and never makes progress.
      
      To avoid this problem, the dummy transactions should not cause
      externally visible state changes. To ensure this occurs, make sure
      that dummy transactions log an unchanging field in the superblock as
      it's state is never propagated outside the filesystem. This allows
      the log covering state machine to complete successfully and the
      filesystem now correctly enters a fully idle state about 90s after
      the last modification was made.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      1a387d3b
    • Stuart Brodsky's avatar
      xfs: ensure f_ffree returned by statfs() is non-negative · 2fe33661
      Stuart Brodsky authored
      Because of delayed updates to sb_icount field in the super block, it
      is possible to allocate over maxicount number of inodes.  This
      causes the arithmetic to calculate a negative number of free inodes
      in user commands like df or stat -f.
      
      Since maxicount is a somewhat arbitrary number, a slight over
      allocation is not critical but user commands should be displayed as
      0 or greater and never go negative.  To do this the value in the
      stats buffer f_ffree is capped to never go negative.
      
      [ Modified to use max_t as per Christoph's comment. ]
      Signed-off-by: default avatarStu Brodsky <sbrodsky@sgi.com>
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      2fe33661
    • Dave Chinner's avatar
      xfs: handle negative wbc->nr_to_write during sync writeback · efceab1d
      Dave Chinner authored
      During data integrity (WB_SYNC_ALL) writeback, wbc->nr_to_write will
      go negative on inodes with more than 1024 dirty pages due to
      implementation details of write_cache_pages(). Currently XFS will
      abort page clustering in writeback once nr_to_write drops below
      zero, and so for data integrity writeback we will do very
      inefficient page at a time allocation and IO submission for inodes
      with large numbers of dirty pages.
      
      Fix this by only aborting the page clustering code when
      wbc->nr_to_write is negative and the sync mode is WB_SYNC_NONE.
      
      Cc: <stable@kernel.org>
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      efceab1d
    • Dave Chinner's avatar
      writeback: write_cache_pages doesn't terminate at nr_to_write <= 0 · 546a1924
      Dave Chinner authored
      I noticed XFS writeback in 2.6.36-rc1 was much slower than it should have
      been. Enabling writeback tracing showed:
      
          flush-253:16-8516  [007] 1342952.351608: wbc_writepage: bdi 253:16: towrt=1024 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
          flush-253:16-8516  [007] 1342952.351654: wbc_writepage: bdi 253:16: towrt=1023 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
          flush-253:16-8516  [000] 1342952.369520: wbc_writepage: bdi 253:16: towrt=0 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
          flush-253:16-8516  [000] 1342952.369542: wbc_writepage: bdi 253:16: towrt=-1 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
          flush-253:16-8516  [000] 1342952.369549: wbc_writepage: bdi 253:16: towrt=-2 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
      
      Writeback is not terminating in background writeback if ->writepage is
      returning with wbc->nr_to_write == 0, resulting in sub-optimal single page
      writeback on XFS.
      
      Fix the write_cache_pages loop to terminate correctly when this situation
      occurs and so prevent this sub-optimal background writeback pattern. This
      improves sustained sequential buffered write performance from around
      250MB/s to 750MB/s for a 100GB file on an XFS filesystem on my 8p test VM.
      
      Cc:<stable@kernel.org>
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      546a1924
    • Dave Chinner's avatar
      xfs: fix untrusted inode number lookup · 4536f2ad
      Dave Chinner authored
      Commit 7124fe0a ("xfs: validate untrusted inode
      numbers during lookup") changes the inode lookup code to do btree lookups for
      untrusted inode numbers. This change made an invalid assumption about the
      alignment of inodes and hence incorrectly calculated the first inode in the
      cluster. As a result, some inode numbers were being incorrectly considered
      invalid when they were actually valid.
      
      The issue was not picked up by the xfstests suite because it always runs fsr
      and dump (the two utilities that utilise the bulkstat interface) on cache hot
      inodes and hence the lookup code in the cold cache path was not sufficiently
      exercised to uncover this intermittent problem.
      
      Fix the issue by relaxing the btree lookup criteria and then checking if the
      record returned contains the inode number we are lookup for. If it we get an
      incorrect record, then the inode number is invalid.
      
      Cc: <stable@kernel.org>
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      4536f2ad
    • Dave Chinner's avatar
      xfs: ensure we mark all inodes in a freed cluster XFS_ISTALE · 5b3eed75
      Dave Chinner authored
      Under heavy load parallel metadata loads (e.g. dbench), we can fail
      to mark all the inodes in a cluster being freed as XFS_ISTALE as we
      skip inodes we cannot get the XFS_ILOCK_EXCL or the flush lock on.
      When this happens and the inode cluster buffer has already been
      marked stale and freed, inode reclaim can try to write the inode out
      as it is dirty and not marked stale. This can result in writing th
      metadata to an freed extent, or in the case it has already
      been overwritten trigger a magic number check failure and return an
      EUCLEAN error such as:
      
      Filesystem "ram0": inode 0x442ba1 background reclaim flush failed with 117
      
      Fix this by ensuring that we hoover up all in memory inodes in the
      cluster and mark them XFS_ISTALE when freeing the cluster.
      
      Cc: <stable@kernel.org>
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      5b3eed75
    • Dave Chinner's avatar
      xfs: unlock items before allowing the CIL to commit · d17c701c
      Dave Chinner authored
      When we commit a transaction using delayed logging, we need to
      unlock the items in the transaciton before we unlock the CIL context
      and allow it to be checkpointed. If we unlock them after we release
      the CIl context lock, the CIL can checkpoint and complete before
      we free the log items. This breaks stale buffer item unlock and
      unpin processing as there is an implicit assumption that the unlock
      will occur before the unpin.
      
      Also, some log items need to store the LSN of the transaction commit
      in the item (inodes and EFIs) and so can race with other transaction
      completions if we don't prevent the CIL from checkpointing before
      the unlock occurs.
      
      Cc: <stable@kernel.org>
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      d17c701c
  2. 23 Aug, 2010 1 commit
  3. 22 Aug, 2010 12 commits
  4. 21 Aug, 2010 6 commits
  5. 20 Aug, 2010 11 commits