1. 04 Jun, 2013 19 commits
    • Jan Kara's avatar
      ext4: defer clearing of PageWriteback after extent conversion · b0857d30
      Jan Kara authored
      Currently PageWriteback bit gets cleared from put_io_page() called
      from ext4_end_bio().  This is somewhat inconvenient as extent tree is
      not fully updated at that time (unwritten extents are not marked as
      written) so we cannot read the data back yet.  This design was
      dictated by lock ordering as we cannot start a transaction while
      PageWriteback bit is set (we could easily deadlock with
      ext4_da_writepages()).  But now that we use transaction reservation
      for extent conversion, locking issues are solved and we can move
      PageWriteback bit clearing after extent conversion is done.  As a
      result we can remove wait for unwritten extent conversion from
      ext4_sync_file() because it already implicitely happens through
      wait_on_page_writeback().
      
      We implement deferring of PageWriteback clearing by queueing completed
      bios to appropriate io_end and processing all the pages when io_end is
      going to be freed instead of at the moment ext4_io_end() is called.
      Reviewed-by: default avatarZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      b0857d30
    • Jan Kara's avatar
      ext4: split extent conversion lists to reserved & unreserved parts · 2e8fa54e
      Jan Kara authored
      Now that we have extent conversions with reserved transaction, we have
      to prevent extent conversions without reserved transaction (from DIO
      code) to block these (as that would effectively void any transaction
      reservation we did).  So split lists, work items, and work queues to
      reserved and unreserved parts.
      Reviewed-by: default avatarZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      2e8fa54e
    • Jan Kara's avatar
      ext4: use transaction reservation for extent conversion in ext4_end_io · 6b523df4
      Jan Kara authored
      Later we would like to clear PageWriteback bit only after extent
      conversion from unwritten to written extents is performed.  However it
      is not possible to start a transaction after PageWriteback is set
      because that violates lock ordering (and is easy to deadlock).  So we
      have to reserve a transaction before locking pages and sending them
      for IO and later we use the transaction for extent conversion from
      ext4_end_io().
      Reviewed-by: default avatarZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      6b523df4
    • Jan Kara's avatar
      ext4: remove buffer_uninit handling · 3613d228
      Jan Kara authored
      There isn't any need for setting BH_Uninit on buffers anymore.  It was
      only used to signal we need to mark io_end as needing extent
      conversion in add_bh_to_extent() but now we can mark the io_end
      directly when mapping extent.
      Reviewed-by: default avatarZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      3613d228
    • Jan Kara's avatar
      ext4: restructure writeback path · 4e7ea81d
      Jan Kara authored
      There are two issues with current writeback path in ext4.  For one we
      don't necessarily map complete pages when blocksize < pagesize and
      thus needn't do any writeback in one iteration.  We always map some
      blocks though so we will eventually finish mapping the page.  Just if
      writeback races with other operations on the file, forward progress is
      not really guaranteed. The second problem is that current code
      structure makes it hard to associate all the bios to some range of
      pages with one io_end structure so that unwritten extents can be
      converted after all the bios are finished.  This will be especially
      difficult later when io_end will be associated with reserved
      transaction handle.
      
      We restructure the writeback path to a relatively simple loop which
      first prepares extent of pages, then maps one or more extents so that
      no page is partially mapped, and once page is fully mapped it is
      submitted for IO. We keep all the mapping and IO submission
      information in mpage_da_data structure to somewhat reduce stack usage.
      Resulting code is somewhat shorter than the old one and hopefully also
      easier to read.
      Reviewed-by: default avatarZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      4e7ea81d
    • Jan Kara's avatar
      ext4: better estimate credits needed for ext4_da_writepages() · fffb2739
      Jan Kara authored
      We limit the number of blocks written in a single loop of
      ext4_da_writepages() to 64 when inode uses indirect blocks.  That is
      unnecessary as credit estimates for mapping logically continguous run
      of blocks is rather low even for inode with indirect blocks.  So just
      lift this limitation and properly calculate the number of necessary
      credits.
      
      This better credit estimate will also later allow us to always write
      at least a single page in one iteration.
      Reviewed-by: default avatarZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      fffb2739
    • Jan Kara's avatar
      ext4: improve writepage credit estimate for files with indirect blocks · fa55a0ed
      Jan Kara authored
      ext4_ind_trans_blocks() wrongly used 'chunk' argument to decide whether
      blocks mapped are logically contiguous. That is wrong since the argument
      informs whether the blocks are physically contiguous. As the blocks
      mapped are always logically contiguous and that's all
      ext4_ind_trans_blocks() cares about, just remove the 'chunk' argument.
      Reviewed-by: default avatarZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      fa55a0ed
    • Jan Kara's avatar
      ext4: deprecate max_writeback_mb_bump sysfs attribute · f2d50a65
      Jan Kara authored
      This attribute is now unused so deprecate it.  We still show the old
      default value to keep some compatibility but we don't allow writing to
      that attribute anymore.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      f2d50a65
    • Jan Kara's avatar
      ext4: stop messing with nr_to_write in ext4_da_writepages() · 39bba40b
      Jan Kara authored
      Writeback code got better in how it submits IO and now the number of
      pages requested to be written is usually higher than original 1024.
      The number is now dynamically computed based on observed throughput
      and is set to be about 0.5 s worth of writeback.  E.g. on ordinary
      SATA drive this ends up somewhere around 10000 as my testing shows.
      So remove the unnecessary smarts from ext4_da_writepages().
      Reviewed-by: default avatarZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      39bba40b
    • Jan Kara's avatar
      5fe2fe89
    • Jan Kara's avatar
      jbd2: transaction reservation support · 8f7d89f3
      Jan Kara authored
      In some cases we cannot start a transaction because of locking
      constraints and passing started transaction into those places is not
      handy either because we could block transaction commit for too long.
      Transaction reservation is designed to solve these issues.  It
      reserves a handle with given number of credits in the journal and the
      handle can be later attached to the running transaction without
      blocking on commit or checkpointing.  Reserved handles do not block
      transaction commit in any way, they only reduce maximum size of the
      running transaction (because we have to always be prepared to
      accomodate request for attaching reserved handle).
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      8f7d89f3
    • Jan Kara's avatar
      jbd2: remove unused waitqueues · f29fad72
      Jan Kara authored
      j_wait_logspace and j_wait_checkpoint are unused.  Remove them.
      Reviewed-by: default avatarZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      f29fad72
    • Jan Kara's avatar
      jbd2: fix race in t_outstanding_credits update in jbd2_journal_extend() · fe1e8db5
      Jan Kara authored
      jbd2_journal_extend() first checked whether transaction can accept
      extending handle with more credits and then added credits to
      t_outstanding_credits.  This can race with start_this_handle() adding
      another handle to a transaction and thus overbooking a transaction.
      Make jbd2_journal_extend() use atomic_add_return() to close the race.
      Reviewed-by: default avatarZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      fe1e8db5
    • Jan Kara's avatar
      jbd2: cleanup needed free block estimates when starting a transaction · 76c39904
      Jan Kara authored
      __jbd2_log_space_left() and jbd_space_needed() were kind of odd.
      jbd_space_needed() accounted also credits needed for currently
      committing transaction while it didn't account for credits needed for
      control blocks.  __jbd2_log_space_left() then accounted for control
      blocks as a fraction of free space.  Since results of these two
      functions are always only compared against each other, this works
      correct but is somewhat strange.  Move the estimates so that
      jbd_space_needed() returns number of blocks needed for a transaction
      including control blocks and __jbd2_log_space_left() returns free
      space in the journal (with the committing transaction already
      subtracted).  Rename functions to jbd2_log_space_left() and
      jbd2_space_needed() while we are changing them.
      Reviewed-by: default avatarZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      76c39904
    • Jan Kara's avatar
      jbd2: remove outdated comment · 2f387f84
      Jan Kara authored
      The comment about credit estimates isn't true anymore. We do what the
      comment describes now.
      Reviewed-by: default avatarZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      2f387f84
    • Jan Kara's avatar
      jbd2: refine waiting for shadow buffers · b34090e5
      Jan Kara authored
      Currently when we add a buffer to a transaction, we wait until the
      buffer is removed from BJ_Shadow list (so that we prevent any changes
      to the buffer that is just written to the journal).  This can take
      unnecessarily long as a lot happens between the time the buffer is
      submitted to the journal and the time when we remove the buffer from
      BJ_Shadow list.  (e.g.  We wait for all data buffers in the
      transaction, we issue a cache flush, etc.)  Also this creates a
      dependency of do_get_write_access() on transaction commit (namely
      waiting for data IO to complete) which we want to avoid when
      implementing transaction reservation.
      
      So we modify commit code to set new BH_Shadow flag when temporary
      shadowing buffer is created and we clear that flag once IO on that
      buffer is complete.  This allows do_get_write_access() to wait only
      for BH_Shadow bit and thus removes the dependency on data IO
      completion.
      Reviewed-by: default avatarZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      b34090e5
    • Jan Kara's avatar
      jbd2: remove journal_head from descriptor buffers · e5a120ae
      Jan Kara authored
      Similarly as for metadata buffers, also log descriptor buffers don't
      really need the journal head. So strip it and remove BJ_LogCtl list.
      Reviewed-by: default avatarZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      e5a120ae
    • Jan Kara's avatar
      jbd2: don't create journal_head for temporary journal buffers · f5113eff
      Jan Kara authored
      When writing metadata to the journal, we create temporary buffer heads
      for that task.  We also attach journal heads to these buffer heads but
      the only purpose of the journal heads is to keep buffers linked in
      transaction's BJ_IO list.  We remove the need for journal heads by
      reusing buffer_head's b_assoc_buffers list for that purpose.  Also
      since BJ_IO list is just a temporary list for transaction commit, we
      use a private list in jbd2_journal_commit_transaction() for that thus
      removing BJ_IO list from transaction completely.
      Reviewed-by: default avatarZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      f5113eff
    • Jan Kara's avatar
      ext4: use io_end for multiple bios · 97a851ed
      Jan Kara authored
      Change writeback path to create just one io_end structure for the
      extent to which we submit IO and share it among bios writing that
      extent. This prevents needless splitting and joining of unwritten
      extents when they cannot be submitted as a single bio.
      
      Bugs in ENOMEM handling found by Linux File System Verification project
      (linuxtesting.org) and fixed by Alexey Khoroshilov
      <khoroshilov@ispras.ru>.
      
      CC: Alexey Khoroshilov <khoroshilov@ispras.ru>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      97a851ed
  2. 31 May, 2013 4 commits
  3. 28 May, 2013 13 commits
    • Paul Taysom's avatar
      ext4: suppress ext4 orphan messages on mount · 566370a2
      Paul Taysom authored
      Suppress the messages releating to processing the ext4 orphan list
      ("truncating inode" and "deleting unreferenced inode") unless the
      debug option is on, since otherwise they end up taking up space in the
      log that could be used for more useful information.
      
      Tested by opening several files, unlinking them, then
      crashing the system, rebooting the system and examining
      /var/log/messages.
      
      Addresses the problem described in http://crbug.com/220976Signed-off-by: default avatarPaul Taysom <taysom@chromium.org>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      566370a2
    • Darrick J. Wong's avatar
      jbd2: fix block tag checksum verification brokenness · eee06c56
      Darrick J. Wong authored
      Al Viro complained of a ton of bogosity with regards to the jbd2 block
      tag header checksum.  This one checksum is 16 bits, so cut off the
      upper 16 bits and treat it as a 16-bit value and don't mess around
      with be32* conversions.  Fortunately metadata checksumming is still
      "experimental" and not in a shipping e2fsprogs, so there should be few
      users affected by this.
      Reported-by: default avatarAl Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      eee06c56
    • Zheng Liu's avatar
      jbd2: use kmem_cache_zalloc for allocating journal head · 5d9cf9c6
      Zheng Liu authored
      This commit tries to use kmem_cache_zalloc instead of kmem_cache_alloc/
      memset when a new journal head is alloctated.
      Signed-off-by: default avatarZheng Liu <wenqing.lz@taobao.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      5d9cf9c6
    • Lukas Czerner's avatar
      ext4: make punch hole code path work with bigalloc · d23142c6
      Lukas Czerner authored
      Currently punch hole is disabled in file systems with bigalloc
      feature enabled. However the recent changes in punch hole patch should
      make it easier to support punching holes on bigalloc enabled file
      systems.
      
      This commit changes partial_cluster handling in ext4_remove_blocks(),
      ext4_ext_rm_leaf() and ext4_ext_remove_space(). Currently
      partial_cluster is unsigned long long type and it makes sure that we
      will free the partial cluster if all extents has been released from that
      cluster. However it has been specifically designed only for truncate.
      
      With punch hole we can be freeing just some extents in the cluster
      leaving the rest untouched. So we have to make sure that we will notice
      cluster which still has some extents. To do this I've changed
      partial_cluster to be signed long long type. The only scenario where
      this could be a problem is when cluster_size == block size, however in
      that case there would not be any partial clusters so we're safe. For
      bigger clusters the signed type is enough. Now we use the negative value
      in partial_cluster to mark such cluster used, hence we know that we must
      not free it even if all other extents has been freed from such cluster.
      
      This scenario can be described in simple diagram:
      
      |FFF...FF..FF.UUU|
       ^----------^
        punch hole
      
      . - free space
      | - cluster boundary
      F - freed extent
      U - used extent
      
      Also update respective tracepoints to use signed long long type for
      partial_cluster.
      Signed-off-by: default avatarLukas Czerner <lczerner@redhat.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      d23142c6
    • Lukas Czerner's avatar
      ext4: update ext4_ext_remove_space trace point · 61801325
      Lukas Czerner authored
      Add "end" variable.
      Signed-off-by: default avatarLukas Czerner <lczerner@redhat.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      61801325
    • Lukas Czerner's avatar
      ext4: remove unused code from ext4_remove_blocks() · 78fb9cdf
      Lukas Czerner authored
      The "head removal" branch in the condition is never used in any code
      path in ext4 since the function only caller ext4_ext_rm_leaf() will make
      sure that the extent is properly split before removing blocks. Note that
      there is a bug in this branch anyway.
      
      This commit removes the unused code completely and makes use of
      ext4_error() instead of printk if dubious range is provided.
      Signed-off-by: default avatarLukas Czerner <lczerner@redhat.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      78fb9cdf
    • Lukas Czerner's avatar
      ext4: remove unused discard_partial_page_buffers · c121ffd0
      Lukas Czerner authored
      The discard_partial_page_buffers is no longer used anywhere so we can
      simply remove it including the *_no_lock variant and
      EXT4_DISCARD_PARTIAL_PG_ZERO_UNMAPPED define.
      Signed-off-by: default avatarLukas Czerner <lczerner@redhat.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      c121ffd0
    • Lukas Czerner's avatar
      ext4: use ext4_zero_partial_blocks in punch_hole · a87dd18c
      Lukas Czerner authored
      We're doing to get rid of ext4_discard_partial_page_buffers() since it is
      duplicating some code and also partially duplicating work of
      truncate_pagecache_range(), moreover the old implementation was much
      clearer.
      
      Now when the truncate_inode_pages_range() can handle truncating non page
      aligned regions we can use this to invalidate and zero out block aligned
      region of the punched out range and then use ext4_block_truncate_page()
      to zero the unaligned blocks on the start and end of the range. This
      will greatly simplify the punch hole code. Moreover after this commit we
      can get rid of the ext4_discard_partial_page_buffers() completely.
      
      We also introduce function ext4_prepare_punch_hole() to do come common
      operations before we attempt to do the actual punch hole on
      indirect or extent file which saves us some code duplication.
      
      This has been tested on ppc64 with 1k block size with fsx and xfstests
      without any problems.
      Signed-off-by: default avatarLukas Czerner <lczerner@redhat.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      a87dd18c
    • Lukas Czerner's avatar
      ext4: truncate_inode_pages() in orphan cleanup path · 55f252c9
      Lukas Czerner authored
      Currently we do not tell mm to zero out tail of the page before truncate
      in orphan_cleanup(). This is ok, because the page should not be
      uptodate, however this may eventually change and I might cause problems.
      
      Call truncate_inode_pages() as precautionary measure. Thanks Jan Kara
      for pointing this out.
      Signed-off-by: default avatarLukas Czerner <lczerner@redhat.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      55f252c9
    • Lukas Czerner's avatar
      Revert "ext4: fix fsx truncate failure" · eb3544c6
      Lukas Czerner authored
      This reverts commit 189e868f.
      
      This commit reintroduces the use of ext4_block_truncate_page() in ext4
      truncate operation instead of ext4_discard_partial_page_buffers().
      
      The statement in the commit description that the truncate operation only
      zero block unaligned portion of the last page is not exactly right,
      since truncate_pagecache_range() also zeroes and invalidate the unaligned
      portion of the page. Then there is no need to zero and unmap it once more
      and ext4_block_truncate_page() was doing the right job, although we
      still need to update the buffer head containing the last block, which is
      exactly what ext4_block_truncate_page() is doing.
      
      Moreover the problem described in the commit is fixed more properly with
      commit
      
      15291164
      	jbd2: clear BH_Delay & BH_Unwritten in journal_unmap_buffer
      
      This was tested on ppc64 machine with block size of 1024 bytes without
      any problems.
      Signed-off-by: default avatarLukas Czerner <lczerner@redhat.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      eb3544c6
    • Lukas Czerner's avatar
      ext4: Call ext4_jbd2_file_inode() after zeroing block · 0713ed0c
      Lukas Czerner authored
      In data=ordered mode we should call ext4_jbd2_file_inode() so that crash
      after the truncate transaction has committed does not expose stall data
      in the tail of the block.
      
      Thanks Jan Kara for pointing that out.
      Signed-off-by: default avatarLukas Czerner <lczerner@redhat.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      0713ed0c
    • Lukas Czerner's avatar
      Revert "ext4: remove no longer used functions in inode.c" · d863dc36
      Lukas Czerner authored
      This reverts commit ccb4d7af.
      
      This commit reintroduces functions ext4_block_truncate_page() and
      ext4_block_zero_page_range() which has been previously removed in favour
      of ext4_discard_partial_page_buffers().
      
      In future commits we want to reintroduce those function and remove
      ext4_discard_partial_page_buffers() since it is duplicating some code
      and also partially duplicating work of truncate_pagecache_range(),
      moreover the old implementation was much clearer.
      Signed-off-by: default avatarLukas Czerner <lczerner@redhat.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      d863dc36
    • Lukas Czerner's avatar
      mm: teach truncate_inode_pages_range() to handle non page aligned ranges · 5a720394
      Lukas Czerner authored
      This commit changes truncate_inode_pages_range() so it can handle non
      page aligned regions of the truncate. Currently we can hit BUG_ON when
      the end of the range is not page aligned, but we can handle unaligned
      start of the range.
      
      Being able to handle non page aligned regions of the page can help file
      system punch_hole implementations and save some work, because once we're
      holding the page we might as well deal with it right away.
      
      In previous commits we've changed ->invalidatepage() prototype to accept
      'length' argument to be able to specify range to invalidate. No we can
      use that new ability in truncate_inode_pages_range().
      Signed-off-by: default avatarLukas Czerner <lczerner@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      5a720394
  4. 22 May, 2013 4 commits