1. 02 Mar, 2013 2 commits
    • Dmitry Monakhov's avatar
      jbd2: fix ERR_PTR dereference in jbd2__journal_start · df05c1b8
      Dmitry Monakhov authored
      If start_this_handle() failed handle will be initialized
      to ERR_PTR() and can not be dereferenced.
      
      paging request at fffffffffffffff6
      IP: [<ffffffff813c073f>] jbd2__journal_start+0x18f/0x290
      PGD 200e067 PUD 200f067 PMD 0
      Oops: 0000 [#1] SMP
      Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode sg xhci_hcd button sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_mod
      CPU 0 journal commit I/O error
      
      Pid: 2694, comm: fio Not tainted 3.8.0-rc3+ #79                  /DQ67SW
      RIP: 0010:[<ffffffff813c073f>]  [<ffffffff813c073f>] jbd2__journal_start+0x18f/0x290
      RSP: 0018:ffff880233b8ba58  EFLAGS: 00010292
      RAX: 00000000ffffffe2 RBX: ffffffffffffffe2 RCX: 0000000000000006
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff82128f48
      RBP: ffff880233b8ba98 R08: 0000000000000000 R09: ffff88021440a6e0
      Signed-off-by: default avatarDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      df05c1b8
    • Theodore Ts'o's avatar
      ext4: use percpu counter for extent cache count · 1ac6466f
      Theodore Ts'o authored
      Use a percpu counter rather than atomic types for shrinker accounting.
      There's no need for ultimate accuracy in the shrinker, so this
      should come a little more cheaply.  The percpu struct is somewhat
      large, but there was a big gap before the cache-aligned
      s_es_lru_lock anyway, and it fits nicely in there.
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      1ac6466f
  2. 01 Mar, 2013 1 commit
    • Theodore Ts'o's avatar
      ext4: optimize ext4_es_shrink() · 24630774
      Theodore Ts'o authored
      When the system is under memory pressure, ext4_es_srhink() will get
      called very often.  So optimize returning the number of items in the
      file system's extent status cache by keeping a per-filesystem count,
      instead of calculating it each time by scanning all of the inodes in
      the extent status cache.
      
      Also rename the slab used for the extent status cache to be
      "ext4_extent_status" so it's obviousl the slab in question is created
      by ext4.
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Cc: Zheng Liu <gnehzuil.liu@gmail.com>
      24630774
  3. 27 Feb, 2013 1 commit
  4. 22 Feb, 2013 2 commits
    • Lukas Czerner's avatar
      ext4: fix free clusters calculation in bigalloc filesystem · 304e220f
      Lukas Czerner authored
      ext4_has_free_clusters() should tell us whether there is enough free
      clusters to allocate, however number of free clusters in the file system
      is converted to blocks using EXT4_C2B() which is not only wrong use of
      the macro (we should have used EXT4_NUM_B2C) but it's also completely
      wrong concept since everything else is in cluster units.
      
      Moreover when calculating number of root clusters we should be using
      macro EXT4_NUM_B2C() instead of EXT4_B2C() otherwise the result might be
      off by one. However r_blocks_count should always be a multiple of the
      cluster ratio so doing a plain bit shift should be enough here. We
      avoid using EXT4_B2C() because it's confusing.
      
      As a result of the first problem number of free clusters is much bigger
      than it should have been and ext4_has_free_clusters() would return 1 even
      if there is really not enough free clusters available.
      
      Fix this by removing the EXT4_C2B() conversion of free clusters and
      using bit shift when calculating number of root clusters. This bug
      affects number of xfstests tests covering file system ENOSPC situation
      handling. With this patch most of the ENOSPC problems with bigalloc file
      system disappear, especially the errors caused by delayed allocation not
      having enough space when the actual allocation is finally requested.
      Signed-off-by: default avatarLukas Czerner <lczerner@redhat.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      304e220f
    • Eryu Guan's avatar
      ext4: no need to remove extent if len is 0 in ext4_es_remove_extent() · d4381472
      Eryu Guan authored
      len is 0 means no extent needs to be removed, so return immediately.
      Otherwise it could trigger the following BUG_ON() in
      ext4_es_remove_extent()
      
      	end = lblk + len - 1;
      	BUG_ON(end < lblk);
      
      This could be reproduced by a simple truncate(1) command by an
      unprivileged user
      
      	truncate -s $(($((2**32 - 1)) * 4096)) /mnt/ext4/testfile
      
      The same is true for __es_insert_extent().
      
      Patched kernel passed xfstests regression test.
      Signed-off-by: default avatarEryu Guan <guaneryu@gmail.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: default avatarZheng Liu <wenqing.lz@taobao.com>
      d4381472
  5. 18 Feb, 2013 10 commits
    • Lukas Czerner's avatar
      ext4: fix xattr block allocation/release with bigalloc · 1231b3a1
      Lukas Czerner authored
      Currently when new xattr block is created or released we we would call
      dquot_free_block() or dquot_alloc_block() respectively, among the else
      decrementing or incrementing the number of blocks assigned to the
      inode by one block.
      
      This however does not work for bigalloc file system because we always
      allocate/free the whole cluster so we have to count with that in
      dquot_free_block() and dquot_alloc_block() as well.
      
      Use the clusters-to-blocks conversion EXT4_C2B() when passing number of
      blocks to the dquot_alloc/free functions to fix the problem.
      
      The problem has been revealed by xfstests #117 (and possibly others).
      Signed-off-by: default avatarLukas Czerner <lczerner@redhat.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: default avatarEric Sandeen <sandeen@redhat.com>
      Cc: stable@vger.kernel.org
      1231b3a1
    • Zheng Liu's avatar
      ext4: reclaim extents from extent status tree · 74cd15cd
      Zheng Liu authored
      Although extent status is loaded on-demand, we also need to reclaim
      extent from the tree when we are under a heavy memory pressure because
      in some cases fragmented extent tree causes status tree costs too much
      memory.
      
      Here we maintain a lru list in super_block.  When the extent status of
      an inode is accessed and changed, this inode will be move to the tail
      of the list.  The inode will be dropped from this list when it is
      cleared.  In the inode, a counter is added to count the number of
      cached objects in extent status tree.  Here only written/unwritten/hole
      extent is counted because delayed extent doesn't be reclaimed due to
      fiemap, bigalloc and seek_data/hole need it.  The counter will be
      increased as a new extent is allocated, and it will be decreased as a
      extent is freed.
      
      In this commit we use normal shrinker framework to reclaim memory from
      the status tree.  ext4_es_reclaim_extents_count() traverses the lru list
      to count the number of reclaimable extents.  ext4_es_shrink() tries to
      reclaim written/unwritten/hole extents from extent status tree.  The
      inode that has been shrunk is moved to the tail of lru list.
      Signed-off-by: default avatarZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan kara <jack@suse.cz>
      74cd15cd
    • Zheng Liu's avatar
      ext4: adjust some functions for reclaiming extents from extent status tree · bdedbb7b
      Zheng Liu authored
      This commit changes some interfaces in extent status tree because we
      need to use inode to count the cached objects in a extent status tree.
      Signed-off-by: default avatarZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan kara <jack@suse.cz>
      bdedbb7b
    • Zheng Liu's avatar
      ext4: remove single extent cache · 69eb33dc
      Zheng Liu authored
      Single extent cache could be removed because we have extent status tree
      as a extent cache, and it would be better.
      Signed-off-by: default avatarZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan kara <jack@suse.cz>
      69eb33dc
    • Zheng Liu's avatar
      ext4: lookup block mapping in extent status tree · d100eef2
      Zheng Liu authored
      After tracking all extent status, we already have a extent cache in
      memory.  Every time we want to lookup a block mapping, we can first
      try to lookup it in extent status tree to avoid a potential disk I/O.
      
      A new function called ext4_es_lookup_extent is defined to finish this
      work.  When we try to lookup a block mapping, we always call
      ext4_map_blocks and/or ext4_da_map_blocks.  So in these functions we
      first try to lookup a block mapping in extent status tree.
      
      A new flag EXT4_GET_BLOCKS_NO_PUT_HOLE is used in ext4_da_map_blocks
      in order not to put a hole into extent status tree because this hole
      will be converted to delayed extent in the tree immediately.
      Signed-off-by: default avatarZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan kara <jack@suse.cz>
      d100eef2
    • Zheng Liu's avatar
      ext4: track all extent status in extent status tree · f7fec032
      Zheng Liu authored
      By recording the phycisal block and status, extent status tree is able
      to track the status of every extents.  When we call _map_blocks
      functions to lookup an extent or create a new written/unwritten/delayed
      extent, this extent will be inserted into extent status tree.
      
      We don't load all extents from disk in alloc_inode() because it costs
      too much memory, and if a file is opened and closed frequently it will
      takes too much time to load all extent information.  So currently when
      we create/lookup an extent, this extent will be inserted into extent
      status tree.  Hence, the extent status tree may not comprehensively
      contain all of the extents found in the file.
      
      Here a condition we need to take care is that an extent might contains
      unwritten and delayed status simultaneously because an extent is delayed
      allocated and could be allocated by fallocate.  At this time we need to
      keep delayed status because later we need to update delayed reservation
      space using it.
      Signed-off-by: default avatarZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan kara <jack@suse.cz>
      f7fec032
    • Zheng Liu's avatar
      ext4: let ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag · a25a4e1a
      Zheng Liu authored
      This commit lets ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag
      because in later commit ext4_map_blocks needs to use this flag to
      determine the extent status.
      Signed-off-by: default avatarZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      a25a4e1a
    • Zheng Liu's avatar
      ext4: rename and improbe ext4_es_find_extent() · be401363
      Zheng Liu authored
      This commit renames ext4_es_find_extent with ext4_es_find_delayed_extent
      and improve this function.  First, we split input and output parameter.
      Second, this function never return the first block of the next delayed
      extent after 'es'.
      Signed-off-by: default avatarZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan kara <jack@suse.cz>
      be401363
    • Zheng Liu's avatar
      ext4: add physical block and status member into extent status tree · fdc0212e
      Zheng Liu authored
      This commit adds two members in extent_status structure to let it record
      physical block and extent status.  Here es_pblk is used to record both
      of them because physical block only has 48 bits.  So extent status could
      be stashed into it so that we can save some memory.  Now written,
      unwritten, delayed and hole are defined as status.
      
      Due to new member is added into extent status tree, all interfaces need
      to be adjusted.
      Signed-off-by: default avatarZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      fdc0212e
    • Zheng Liu's avatar
      ext4: refine extent status tree · 06b0c886
      Zheng Liu authored
      This commit refines the extent status tree code.
      
      1) A prefix 'es_' is added to to the extent status tree structure
      members.
      
      2) Refactored es_remove_extent() so that __es_remove_extent() can be
      used by es_insert_extent() to remove the old extent entry(-ies) before
      inserting a new one.
      
      3) Rename extent_status_end() to ext4_es_end()
      
      4) ext4_es_can_be_merged() is define to check whether two extents can
      be merged or not.
      
      5) Update and clarified comments.
      Signed-off-by: default avatarZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      06b0c886
  6. 15 Feb, 2013 2 commits
    • Theodore Ts'o's avatar
      ext4: use ERR_PTR() abstraction for ext4_append() · 0f70b406
      Theodore Ts'o authored
      Use ERR_PTR()/IS_ERR() abstraction instead of passing in a separate
      pointer to an integer for the error code, as a code cleanup.
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      0f70b406
    • Theodore Ts'o's avatar
      ext4: refactor code to read directory blocks into ext4_read_dirblock() · dc6982ff
      Theodore Ts'o authored
      The code to read in directory blocks and verify their metadata
      checksums was replicated in ten different places across
      fs/ext4/namei.c, and the code was buggy in subtle ways in a number of
      those replicated sites.  In some cases, ext4_error() was called with a
      training newline.  In others, in particularly in empty_dir(), it was
      possible to call ext4_dirent_csum_verify() on an index block, which
      would trigger false warnings requesting the system adminsitrator to
      run e2fsck.
      
      By refactoring the code, we make the code more readable, as well as
      shrinking the compiled object file by over 700 bytes and 50 lines of
      code.
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      dc6982ff
  7. 14 Feb, 2013 2 commits
  8. 09 Feb, 2013 10 commits
    • Theodore Ts'o's avatar
      jbd2: use module parameters instead of debugfs for jbd_debug · b6e96d00
      Theodore Ts'o authored
      There are multiple reasons to move away from debugfs.  First of all,
      we are only using it for a single parameter, and it is much more
      complicated to set up (some 30 lines of code compared to 3), and one
      more thing that might fail while loading the jbd2 module.
      
      Secondly, as a module paramter it can be specified as a boot option if
      jbd2 is built into the kernel, or as a parameter when the module is
      loaded, and it can also be manipulated dynamically under
      /sys/module/jbd2/parameters/jbd2_debug.  So it is more flexible.
      
      Ultimately we want to move away from using jbd_debug() towards
      tracepoints, but for now this is still a useful simplification of the
      code base.
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      b6e96d00
    • Theodore Ts'o's avatar
      ext4: use module parameters instead of debugfs for mballoc_debug · a0b30c12
      Theodore Ts'o authored
      There are multiple reasons to move away from debugfs.  First of all,
      we are only using it for a single parameter, and it is much more
      complicated to set up (some 30 lines of code compared to 3), and one
      more thing that might fail while loading the ext4 module.
      
      Secondly, as a module paramter it can be specified as a boot option if
      ext4 is built into the kernel, or as a parameter when the module is
      loaded, and it can also be manipulated dynamically under
      /sys/module/ext4/parameters/mballoc_debug.  So it is more flexible.
      
      Ultimately we want to move away from using mb_debug() towards
      tracepoints, but for now this is still a useful simplification of the
      code base.
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      a0b30c12
    • Theodore Ts'o's avatar
      ext4: start handle at the last possible moment when creating inodes · 1139575a
      Theodore Ts'o authored
      In ext4_{create,mknod,mkdir,symlink}(), don't start the journal handle
      until the inode has been succesfully allocated.  In order to do this,
      we need to start the handle in the ext4_new_inode().  So create a new
      variant of this function, ext4_new_inode_start_handle(), so the handle
      can be created at the last possible minute, before we need to modify
      the inode allocation bitmap block.
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      1139575a
    • Theodore Ts'o's avatar
      ext4: fix the number of credits needed for acl ops with inline data · 95eaefbd
      Theodore Ts'o authored
      Operations which modify extended attributes may need extra journal
      credits if inline data is used, since there is a chance that some
      extended attributes may need to get pushed to an external attribute
      block.
      
      Changes to reflect this was made in xattr.c, but they were missed in
      fs/ext4/acl.c.  To fix this, abstract the calculation of the number of
      credits needed for xattr operations to an inline function defined in
      ext4_jbd2.h, and use it in acl.c and xattr.c.
      
      Also move the function declarations used in inline.c from xattr.h
      (where they are non-obviously hidden, and caused problems since
      ext4_jbd2.h needs to use the function ext4_has_inline_data), and move
      them to ext4.h.
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: default avatarTao Ma <boyu.mt@taobao.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      95eaefbd
    • Theodore Ts'o's avatar
      ext4: fix the number of credits needed for ext4_unlink() and ext4_rmdir() · 64044abf
      Theodore Ts'o authored
      The ext4_unlink() and ext4_rmdir() don't actually release the blocks
      associated with the file/directory.  This gets done in a separate jbd2
      handle called via ext4_evict_inode().  Thus, we don't need to reserve
      lots of journal credits for the truncate.
      
      Note that using too many journal credits is non-optimal because it can
      leading to the journal transmit getting closed too early, before it is
      strictly necessary.
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      64044abf
    • Theodore Ts'o's avatar
      ext4: fix the number of credits needed for ext4_ext_migrate() · 4b217630
      Theodore Ts'o authored
      The migration ioctl creates a temporary inode.  Since this inode is
      never linked to a directory, we don't need to reserve journal credits
      required for modifying the directory.
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      4b217630
    • Theodore Ts'o's avatar
      ext4: start handle at the last possible moment in ext4_rmdir() · 8dcfaad2
      Theodore Ts'o authored
      Don't start the jbd2 transaction handle until after the directory
      entry has been found, to minimize the amount of time that a handle is
      held active.
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      8dcfaad2
    • Theodore Ts'o's avatar
      ext4: start handle at the last possible moment in ext4_unlink() · 931b6864
      Theodore Ts'o authored
      Don't start the jbd2 transaction handle until after the directory
      entry has been found, to minimize the amount of time that a handle is
      held active.
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      931b6864
    • Theodore Ts'o's avatar
      ext4: grab page before starting transaction handle in write_begin() · 47564bfb
      Theodore Ts'o authored
      The grab_cache_page_write_begin() function can potentially sleep for a
      long time, since it may need to do memory allocation which can block
      if the system is under significant memory pressure, and because it may
      be blocked on page writeback.  If it does take a long time to grab the
      page, it's better that we not hold an active jbd2 handle.
      
      So grab a handle on the page first, and _then_ start the transaction
      handle.
      
      This commit fixes the following long transaction handle hold time:
      
      postmark-2917  [000] ....   196.435786: jbd2_handle_stats: dev 254,32
         tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1
         dirtied_blocks 0
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      47564bfb
    • Theodore Ts'o's avatar
      ext4: pass context information to jbd2__journal_start() · 9924a92a
      Theodore Ts'o authored
      So we can better understand what bits of ext4 are responsible for
      long-running jbd2 handles, use jbd2__journal_start() so we can pass
      context information for logging purposes.
      
      The recommended way for finding the longer-running handles is:
      
         T=/sys/kernel/debug/tracing
         EVENT=$T/events/jbd2/jbd2_handle_stats
         echo "interval > 5" > $EVENT/filter
         echo 1 > $EVENT/enable
      
         ./run-my-fs-benchmark
      
         cat $T/trace > /tmp/problem-handles
      
      This will list handles that were active for longer than 20ms.  Having
      longer-running handles is bad, because a commit started at the wrong
      time could stall for those 20+ milliseconds, which could delay an
      fsync() or an O_SYNC operation.  Here is an example line from the
      trace file describing a handle which lived on for 311 jiffies, or over
      1.2 seconds:
      
      postmark-2917  [000] ....   196.435786: jbd2_handle_stats: dev 254,32 
         tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1
         dirtied_blocks 0
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      9924a92a
  9. 08 Feb, 2013 2 commits
  10. 07 Feb, 2013 2 commits
  11. 04 Feb, 2013 1 commit
    • Theodore Ts'o's avatar
      ext4: optimize mballoc for large allocations · 40ae3487
      Theodore Ts'o authored
      The ext4 block allocator only maintains buddy bitmaps for chunks which
      are less than or equal to one quarter of a block group.  That is, for
      a file aystem with a 1k blocksize, and where the number of blocks in a
      block group is 8192 blocks, the largest chunk size tracked by buddy
      bitmaps is 2048 blocks.
      
      For a file system with a 4k blocksize, and where the number of blocks
      in a block group is 32768 blocks, the largest chunk size tracked by
      buddy bitmaps is 8192 blocks.
      
      To work around this code, mballoc.c before this commit would truncate
      allocation requests to the number of blocks in a block group minus 10.
      Why 10?  Aside from being a completely arbitrary number, it avoids
      block allocation to be a power of two larger than 25% of the block
      group.  If you try to explicitly fallocate 50% of the block group
      size, this will demonstrate the problem; the block allocation code
      will scan the all of the blocks in the file system with cr==0 (since
      the request is for a natural power of two), but then completely fail
      for all blocks groups, since the buddy bitmaps don't track chunk sizes
      of 50% of the block group.
      
      To fix this, in these we use ext4_mb_complex_scan_group() instead of
      ext4_mb_simple_scan_group().
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger@dilger.ca>
      40ae3487
  12. 03 Feb, 2013 4 commits
  13. 02 Feb, 2013 1 commit