1. 13 Mar, 2016 4 commits
    • ext4: use __GFP_NOFAIL in ext4_free_blocks() · adb7ef60
      Konstantin Khlebnikov authored
      This might be unexpected, but pages allocated for sbi->s_buddy_cache are
      charged to the current memory cgroup. So a GFP_NOFS allocation can fail
      if the current task has been killed by the OOM killer or if the current
      memory cgroup has no free memory left. The block allocator cannot handle
      such failures here yet.
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: fix compile error while opening the macro DOUBLE_CHECK · a2821e34
      Aihua Zhang authored
      The error is:
          fs/ext4/mballoc.c:475:43: error: 'struct ext4_group_info' has
          no member named 'bb_bitmap'.
      So the definition of the macro DOUBLE_CHECK should come before
      'struct ext4_group_info'. I fixed that and moved the macro
      AGGRESSIVE_CHECK along with it, because I think they should be together.
      Signed-off-by: Aihua Zhang <zhangaihua1@huawei.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: print ext4 mount option data_err=abort correctly · 7915a861
      Ales Novak authored
      If the data_err=abort option is specified for an ext3/ext4 mount,
      /proc/mounts shows it as "(null)". This is caused by token2str()
      returning NULL for Opt_data_err_abort (due to its pattern containing
      '=').
      Signed-off-by: Ales Novak <alnovak@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: fix NULL pointer dereference in ext4_mark_inode_dirty() · 5e1021f2
      Eryu Guan authored
      ext4_reserve_inode_write() in ext4_mark_inode_dirty() can fail with an
      error (e.g. EIO), and iloc.bh can be NULL in this case. But the error is
      ignored in the following "if" condition, so ext4_expand_extra_isize()
      might be called with iloc.bh set to NULL, which triggers a NULL pointer
      dereference.
      
      This was uncovered by commit 8b4953e1 ("ext4: reserve code points for
      the project quota feature"), which enlarges the ext4_inode size. It can
      be reproduced by running the following script on a new kernel with an
      old mke2fs:
      
        #!/bin/bash
        mnt=/mnt/ext4
        devname=ext4-error
        dev=/dev/mapper/$devname
        fsimg=/home/fs.img
      
        trap cleanup 0 1 2 3 9 15
      
        cleanup()
        {
                umount $mnt >/dev/null 2>&1
                dmsetup remove $devname
                losetup -d $backend_dev
                rm -f $fsimg
                exit 0
        }
      
        rm -f $fsimg
        fallocate -l 1g $fsimg
        backend_dev=`losetup -f --show $fsimg`
        devsize=`blockdev --getsz $backend_dev`
      
        good_tab="0 $devsize linear $backend_dev 0"
        error_tab="0 $devsize error $backend_dev 0"
      
        dmsetup create $devname --table "$good_tab"
      
        mkfs -t ext4 $dev
        mount -t ext4 -o errors=continue,strictatime $dev $mnt
      
        dmsetup load $devname --table "$error_tab" && dmsetup resume $devname
        echo 3 > /proc/sys/vm/drop_caches
        ls -l $mnt
        exit 0
      
      [ Patch changed to simplify the function a tiny bit. -- Ted ]
      Signed-off-by: Eryu Guan <guaneryu@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  2. 10 Mar, 2016 8 commits
    • ext4: drop unneeded BUFFER_TRACE in ext4_delete_inline_entry() · a8ed9b86
      Geliang Tang authored
      BUFFER_TRACE info "call ext4_handle_dirty_metadata" doesn't match the
      code, so drop it.
      Signed-off-by: Geliang Tang <geliangtang@163.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • Adam Buchbinder's avatar
      b8a07463
    • jbd2: fix FS corruption possibility in jbd2_journal_destroy() on umount path · c0a2ad9b
      OGAWA Hirofumi authored
      On the umount path, jbd2_journal_destroy() writes the latest transaction
      ID (->j_tail_sequence) to be used at the next mount.

      The bug is that ->j_tail_sequence does not hold the latest transaction
      ID in some cases. So, at the next mount, there is a chance of conflict
      with remaining (not yet overwritten) transactions.
      
      	mount (id=10)
      	write transaction (id=11)
      	write transaction (id=12)
      	umount (id=10) <= the bug doesn't write latest ID
      
      	mount (id=10)
      	write transaction (id=11)
      	crash
      
      	mount
      	[recovery process]
      		transaction (id=11)
      		transaction (id=12) <= valid transaction ID, but old commit
                                             must not replay
      
      As shown above, this bug can cause recovery failure or FS
      corruption.

      So why doesn't ->j_tail_sequence point to the latest ID?

      Because if checkpoint transactions were reclaimed under memory pressure
      (i.e. bdev_try_to_free_page()), then ->j_tail_sequence is not updated.
      (Another case is when __jbd2_journal_clean_checkpoint_list() is called
      with an empty transaction.)

      So in the above cases, ->j_tail_sequence does not point to the latest
      transaction ID on the umount path. In addition, REQ_FLUSH for the
      checkpoint is not issued either.

      So, to fix this problem with minimal changes, this patch updates
      ->j_tail_sequence and issues REQ_FLUSH.  (With more complex changes,
      some optimizations would be possible, for example to avoid unnecessary
      REQ_FLUSH.)
      
      BTW,
      
      	journal->j_tail_sequence =
      		++journal->j_transaction_sequence;
      
      Increment of ->j_transaction_sequence seems to be unnecessary, but
      ext3 does this.
      Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
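The mount/umount/crash sequence in the jbd2 commit above can be replayed in a small userspace model. This is only a sketch under simplifying assumptions: every name (toy_journal, sb_id, toy_session, toy_recover, toy_scenario) is hypothetical, each session writes from slot 0, and recovery simply replays slots while their IDs continue the sequence stored in the superblock. It is not the jbd2 recovery code.

```c
#include <assert.h>
#include <stddef.h>

#define SLOTS 8

/* Toy model only: names and layout are hypothetical, not jbd2's. */
struct toy_journal {
	unsigned int sb_id;        /* ID recorded in the superblock at umount */
	unsigned int slot[SLOTS];  /* ID of the commit in each slot, 0 = empty */
};

/* One mount session: write `count` transactions starting at slot 0.
 * Returns the ID of the last transaction written. */
static unsigned int toy_session(struct toy_journal *j, unsigned int count)
{
	unsigned int id = j->sb_id;

	assert(count <= SLOTS);
	for (unsigned int i = 0; i < count; i++)
		j->slot[i] = ++id;
	return id;
}

/* Recovery: replay slots while their IDs continue the expected sequence.
 * Returns the number of transactions replayed. */
static unsigned int toy_recover(const struct toy_journal *j)
{
	unsigned int replayed = 0;

	while (replayed < SLOTS &&
	       j->slot[replayed] == j->sb_id + 1 + replayed)
		replayed++;
	return replayed;
}

/* The scenario from the commit message. With fixed == 0 the umount leaves
 * the stale ID (10) in the superblock; with fixed != 0 it records the
 * latest ID (12). Returns how many transactions recovery replays. */
static unsigned int toy_scenario(int fixed)
{
	struct toy_journal j = { .sb_id = 10 };
	unsigned int last = toy_session(&j, 2);   /* writes IDs 11 and 12 */

	if (fixed)
		j.sb_id = last;
	toy_session(&j, 1);                       /* next mount: one write, then crash */
	return toy_recover(&j);
}
```

In this model toy_scenario(0) replays two transactions, the second being the stale ID 12 left over from the first session, while toy_scenario(1) replays only the one transaction written before the crash.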
    • ext4: more efficient SEEK_DATA implementation · 2d90c160
      Jan Kara authored
      Using SEEK_DATA in a huge sparse file can easily lead to softlockups as
      ext4_seek_data() iterates over a hole block by block. Fix the problem by
      using the hole size returned from ext4_map_blocks() and thus skipping
      the hole in one go.

      Also update the SEEK_HOLE implementation to follow the same pattern as
      SEEK_DATA to make future maintenance easier.

      Furthermore, we add cond_resched() to both ext4_seek_data() and
      ext4_seek_hole() to avoid softlockups in case a malicious user creates
      a huge fragmented file and we have to go through lots of extents.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
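The difference between walking a hole block by block and skipping it in one step can be illustrated with a userspace model. The sketch below assumes a sorted array of data extents and a hypothetical toy_map_blocks() helper that reports, for a given block, whether it is data and how long the current run is. None of these names come from ext4; the real code works on the extent tree and extent status tree.

```c
#include <assert.h>
#include <stddef.h>

/* One run of allocated blocks; the array passed around is sorted by start. */
struct toy_extent {
	unsigned long start;
	unsigned long len;
};

/* Result of a mapping query: is the block data, and how long is the run? */
struct toy_map {
	int mapped;
	unsigned long len;
};

/* Hypothetical stand-in for a block-mapping call that also sizes holes. */
static struct toy_map toy_map_blocks(const struct toy_extent *ext, size_t n,
				     unsigned long blk, unsigned long end)
{
	struct toy_map m = { 0, end - blk };   /* default: hole up to end */

	assert(blk < end);
	for (size_t i = 0; i < n; i++) {
		if (blk < ext[i].start) {
			/* Hole that ends where the next extent begins. */
			if (ext[i].start < end)
				m.len = ext[i].start - blk;
			return m;
		}
		if (blk - ext[i].start < ext[i].len) {
			m.mapped = 1;
			m.len = ext[i].len - (blk - ext[i].start);
			return m;
		}
	}
	return m;
}

/* Find the first data block at or after blk. With use_hole_len set, a hole
 * is skipped in one step; otherwise it is walked one block at a time.
 * *calls counts the mapping queries issued. Returns end if no data. */
static unsigned long toy_seek_data(const struct toy_extent *ext, size_t n,
				   unsigned long blk, unsigned long end,
				   int use_hole_len, unsigned long *calls)
{
	while (blk < end) {
		struct toy_map m = toy_map_blocks(ext, n, blk, end);

		(*calls)++;
		if (m.mapped)
			return blk;
		blk += use_hole_len ? m.len : 1;
	}
	return end;
}
```

For a file whose only data starts at block 1000000, the skipping variant needs two mapping queries in this model, while the block-by-block variant needs one query per hole block plus one for the data block.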
    • ext4: cleanup handling of bh->b_state in DAX mmap · e3fb8eb1
      Jan Kara authored
      ext4_dax_mmap_get_block() updates bh->b_state directly instead of using
      ext4_update_bh_state(). This is mostly a cosmetic issue since DAX code
      always passes an on-stack buffer_head, but clean this up to make the
      code more uniform.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: return hole from ext4_map_blocks() · facab4d9
      Jan Kara authored
      Currently, ext4_map_blocks() just returns 0 when it finds a hole and
      allocation is not requested. However we have all the information
      available to tell how large the hole actually is and there are callers
      of ext4_map_blocks() which would save some block-by-block hole iteration
      if they knew this information. So fill in struct ext4_map_blocks even
      for holes with the information we have. We keep returning 0 for holes to
      maintain backward compatibility of the function.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: factor out determining of hole size · 140a5250
      Jan Kara authored
      ext4_ext_put_gap_in_cache() determines hole size in the extent tree,
      then trims this with possible delayed allocated blocks, and inserts the
      result into the extent status tree. Factor out determination of the size
      of the hole in the extent tree as we will need this information in
      ext4_ext_map_blocks() as well.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: fix setting of referenced bit in ext4_es_lookup_extent() · 87d8a74b
      Jan Kara authored
      We were setting referenced bit on the extent structure we return from
      ext4_es_lookup_extent() which is just a private structure on stack. Thus
      setting had no effect. Set the bit in the structure in the status tree
      instead.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
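The class of bug fixed in the ext4_es_lookup_extent() commit above, setting a flag on a by-value copy instead of on the cached object, can be shown in a few lines. This is a minimal sketch with hypothetical names (toy_es, toy_find, toy_lookup_buggy, toy_lookup_fixed) and a flat array standing in for the extent status tree.

```c
#include <assert.h>
#include <stddef.h>

struct toy_es {
	unsigned long lblk;   /* first logical block */
	unsigned long len;    /* number of blocks */
	int referenced;       /* hint consulted when reclaiming entries */
};

/* Find the cached entry covering lblk, or NULL. */
static struct toy_es *toy_find(struct toy_es *tree, size_t n,
			       unsigned long lblk)
{
	for (size_t i = 0; i < n; i++)
		if (lblk >= tree[i].lblk && lblk - tree[i].lblk < tree[i].len)
			return &tree[i];
	return NULL;
}

/* Buggy pattern: the bit is set on the caller's private copy only. */
static int toy_lookup_buggy(struct toy_es *tree, size_t n,
			    unsigned long lblk, struct toy_es *out)
{
	struct toy_es *es = toy_find(tree, n, lblk);

	assert(out != NULL);
	if (!es)
		return 0;
	*out = *es;
	out->referenced = 1;   /* the cached entry never sees this */
	return 1;
}

/* Fixed pattern: mark the cached entry itself, then copy it out. */
static int toy_lookup_fixed(struct toy_es *tree, size_t n,
			    unsigned long lblk, struct toy_es *out)
{
	struct toy_es *es = toy_find(tree, n, lblk);

	assert(out != NULL);
	if (!es)
		return 0;
	es->referenced = 1;
	*out = *es;
	return 1;
}
```

After toy_lookup_buggy() the entry in the array still has referenced == 0, so a reclaim pass would treat it as unused; after toy_lookup_fixed() the bit is visible in the cache.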
  3. 09 Mar, 2016 6 commits
    • ext4: remove i_ioend_count · 600be30a
      Jan Kara authored
      Remove counter of pending io ends as it is unused.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: simplify io_end handling for AIO DIO · 109811c2
      Jan Kara authored
      When mapping blocks for direct IO, we allocate io_end structure before
      mapping blocks and store pointer to it in the inode. This creates a
      requirement that any AIO DIO using io_end must be protected by i_mutex.
      This created problems in the past with dioread_nolock mode which was
      corrupting io_end pointers. Also io_end is allocated unnecessarily in
      case where we don't need to convert any extents (which is a common case
      for example when overwriting file).
      
      We fix the problem by allocating io_end only once we return unwritten
      extent from block mapping function for AIO DIO (so we can save some
      pointless io_end allocations) and we pass pointer to it in bh->b_private
      which generic DIO code later passes to our end IO callback. That way we
      remove any need for global pointer to io_end structure and thus fix the
      races.
      
      The downside of this change is that the check for unwritten IO in
      flight in ext4_extents_can_be_merged() is more racy since we now
      increment i_unwritten / set EXT4_STATE_DIO_UNWRITTEN only after dropping
      i_data_sem. However, the check was already racy before because
      ext4_writepages() already increments i_unwritten after dropping
      i_data_sem, and reserved blocks save us from hitting ENOSPC in the worst
      case.
      Signed-off-by: Jan Kara <jack@suse.cz>
    • ext4: move trans handling and completion deferal out of _ext4_get_block · efe70c29
      Jan Kara authored
      There is no need to handle starting of a transaction and deferral of DIO
      completion in the _ext4_get_block() function. We can move this out to
      the get block functions for direct IO that need it. That way we can add
      stricter checks verifying that things work as we expect.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: rename and split get blocks functions · 705965bd
      Jan Kara authored
      Rename ext4_get_blocks_write() to ext4_get_blocks_unwritten() to better
      describe what it does. Also split out get blocks functions for direct
      IO. Later we move functionality from _ext4_get_blocks() there. There's no
      functional change in this patch.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: use i_mutex to serialize unaligned AIO DIO · e142d052
      Jan Kara authored
      Currently we use a hashed aio_mutex to serialize unaligned AIO DIO.
      However, the code cleanups that have happened since 2011, when the lock
      was introduced, mean aio_mutex is acquired at almost the same places
      where we already have exclusion using i_mutex. So just use i_mutex for
      the exclusion of unaligned AIO DIO.

      The change moves waiting for pending unwritten extent conversion under
      i_mutex. That makes special handling of O_APPEND writes unnecessary and
      also avoids possible livelocking of unaligned AIO DIO with aligned ones
      (nothing prevented a continuous stream of aligned AIO DIOs from making
      an unaligned AIO DIO wait forever).
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: pack ioend structure better · 3bd6ad7b
      Jan Kara authored
      On 64-bit architectures we have two 4-byte holes in struct ext4_io_end.
      Order entries better to avoid this and thus make the structure occupy
      64 instead of 72 bytes for 64-bit architectures.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
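How member order creates or removes padding holes can be seen with two hypothetical structures. The fields below are invented for illustration and are not the real layout of struct ext4_io_end; the sizes in the comments assume a typical 64-bit ABI where pointers and long long are 8 bytes with 8-byte alignment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fields, not the real struct ext4_io_end layout. */
struct toy_io_end_holey {
	void *inode;         /* 8 bytes on a typical 64-bit ABI */
	unsigned int flag;   /* 4 bytes, followed by a 4-byte hole */
	void *bio;           /* has to start on an 8-byte boundary */
	int count;           /* 4 bytes, followed by a 4-byte hole */
	long long offset;
};

/* Same members, reordered so the two 4-byte fields sit next to each other. */
struct toy_io_end_packed {
	void *inode;
	void *bio;
	long long offset;
	unsigned int flag;
	int count;
};

/* Number of bytes saved by the reordering on the current ABI. */
static size_t toy_bytes_saved(void)
{
	assert(sizeof(struct toy_io_end_packed) <=
	       sizeof(struct toy_io_end_holey));
	return sizeof(struct toy_io_end_holey) -
	       sizeof(struct toy_io_end_packed);
}
```

On a typical 64-bit ABI the two holes add up to 8 bytes, matching the size of the saving described in the commit message. On the kernel side, a tool such as pahole can report such holes for real structures.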
  4. 23 Feb, 2016 10 commits
  5. 22 Feb, 2016 6 commits
    • mbcache2: Use referenced bit instead of LRU · f0c8b462
      Jan Kara authored
      Currently we maintain perfect LRU list by moving entry to the tail of
      the list when it gets used. However these operations on cache-global
      list are relatively expensive.
      
      In this patch we switch to lazy updates of LRU list. Whenever entry gets
      used, we set a referenced bit in it. When reclaiming entries, we give
      referenced entries another round in the LRU. Since the list is not a
      real LRU anymore, rename it to just 'list'.
      
      In my testing this logic gives about 30% boost to workloads with mostly
      unique xattr blocks (e.g. xattr-bench with 10 files and 10000 unique
      xattr values).
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
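The lazy scheme described in the commit above (set a bit on use, give referenced entries another round at reclaim time) is sketched below. This is a simplified model under stated assumptions: a small fixed array stands in for the cache list, and all names (toy_cache, toy_insert, toy_touch, toy_reclaim_one) are hypothetical rather than taken from mbcache2.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX 8

struct toy_entry {
	int key;
	int referenced;   /* set on use, cleared when given another round */
};

struct toy_cache {
	struct toy_entry list[TOY_MAX];   /* index 0 is the next reclaim candidate */
	size_t count;
};

/* Add a new entry at the tail. Returns 0 on success, -1 if full. */
static int toy_insert(struct toy_cache *c, int key)
{
	assert(c->count <= TOY_MAX);
	if (c->count == TOY_MAX)
		return -1;
	c->list[c->count++] = (struct toy_entry){ .key = key, .referenced = 0 };
	return 0;
}

/* A cache hit only sets a bit; the list is not reordered. */
static void toy_touch(struct toy_cache *c, int key)
{
	for (size_t i = 0; i < c->count; i++)
		if (c->list[i].key == key)
			c->list[i].referenced = 1;
}

/* Reclaim one entry. A referenced entry at the head has its bit cleared
 * and moves to the tail; the first unreferenced head entry is evicted.
 * Returns the evicted key, or -1 if the cache is empty. */
static int toy_reclaim_one(struct toy_cache *c)
{
	while (c->count > 0) {
		struct toy_entry e = c->list[0];

		for (size_t i = 1; i < c->count; i++)
			c->list[i - 1] = c->list[i];
		if (!e.referenced) {
			c->count--;
			return e.key;
		}
		e.referenced = 0;
		c->list[c->count - 1] = e;
	}
	return -1;
}
```

The design trade-off is the one the commit message names: a hit no longer touches the shared list at all, at the cost of the list order being only an approximation of recency.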
    • mbcache2: limit cache size · c2f3140f
      Jan Kara authored
      So far number of entries in mbcache is limited only by the pressure from
      the shrinker. Since too many entries degrade the hash table and
      generally we expect that caching more entries has diminishing returns,
      limit number of entries the same way as in the old mbcache to 16 * hash
      table size.
      
      Once we exceed the desired maximum number of entries, we schedule
      background work to reclaim entries. If the background work cannot keep
      up and the number of entries exceeds two times the desired maximum, we
      reclaim some entries directly when allocating a new entry.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
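The two thresholds described in the commit above can be written as a small decision function. This is a sketch of the policy as the message states it, with hypothetical names (toy_action, toy_max_entries, toy_on_create). The exact comparisons used by the kernel code (strict or non-strict) are an assumption here, and the sketch reports only the strongest action that applies.

```c
#include <assert.h>

enum toy_action {
	TOY_NONE,                 /* below the limit, nothing to do */
	TOY_SCHEDULE_BACKGROUND,  /* over the limit: queue background reclaim */
	TOY_RECLAIM_DIRECTLY      /* over twice the limit: reclaim right now */
};

/* Desired maximum: 16 entries per hash bucket, as in the commit message. */
static unsigned long toy_max_entries(unsigned long hash_buckets)
{
	return 16 * hash_buckets;
}

/* Decide what to do when a new entry is about to be allocated. */
static enum toy_action toy_on_create(unsigned long entries,
				     unsigned long hash_buckets)
{
	unsigned long max = toy_max_entries(hash_buckets);

	assert(hash_buckets > 0);
	if (entries > 2 * max)
		return TOY_RECLAIM_DIRECTLY;
	if (entries > max)
		return TOY_SCHEDULE_BACKGROUND;
	return TOY_NONE;
}
```

With 1024 hash buckets the desired maximum in this model is 16384 entries, so background reclaim starts above 16384 and direct reclaim above 32768.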
    • mbcache: remove mbcache · ecd1e644
      Jan Kara authored
      Both ext2 and ext4 are now converted to mbcache2. Remove the old mbcache
      code.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext2: convert to mbcache2 · be0726d3
      Jan Kara authored
      The conversion is generally straightforward. We convert filesystem from
      a global cache to per-fs one. Similarly to ext4 the tricky part is that
      xattr block corresponding to found mbcache entry can get freed before we
      get buffer lock for that block. So we have to check whether the entry is
      still valid after getting the buffer lock.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: convert to mbcache2 · 82939d79
      Jan Kara authored
      The conversion is generally straightforward. The only tricky part is
      that xattr block corresponding to found mbcache entry can get freed
      before we get buffer lock for that block. So we have to check whether
      the entry is still valid after getting buffer lock.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • mbcache2: reimplement mbcache · f9a61eb4
      Jan Kara authored
      Original mbcache was designed to have more features than what ext?
      filesystems ended up using. It supported entry being in more hashes, it
      had a home-grown rwlocking of each entry, and one cache could cache
      entries from multiple filesystems. This genericity also resulted in more
      complex locking, larger cache entries, and generally more code
      complexity.
      
      This is a reimplementation of the mbcache functionality to exactly fit
      the purpose ext? filesystems use it for. Cache entries are now
      considerably smaller (7 instead of 13 longs), the code is considerably
      smaller as well (414 vs 913 lines of code), and IMO also simpler. The
      new code is also much more lightweight.

      I have measured the speed using an artificial xattr-bench benchmark,
      which spawns P processes, each process sets xattr for F different files,
      and the value of xattr is randomly chosen from a pool of V values.
      Averages of runtimes for 5 runs for various combinations of parameters
      are below. The first value in each cell is the old mbcache, the second
      value is the new mbcache.
      
      V=10
      F\P	1		2		4		8		16		32		64
      10	0.158,0.157	0.208,0.196	0.500,0.277	0.798,0.400	3.258,0.584	13.807,1.047	61.339,2.803
      100	0.172,0.167	0.279,0.222	0.520,0.275	0.825,0.341	2.981,0.505	12.022,1.202	44.641,2.943
      1000	0.185,0.174	0.297,0.239	0.445,0.283	0.767,0.340	2.329,0.480	6.342,1.198	16.440,3.888
      
      V=100
      F\P	1		2		4		8		16		32		64
      10	0.162,0.153	0.200,0.186	0.362,0.257	0.671,0.496	1.433,0.943	3.801,1.345	7.938,2.501
      100	0.153,0.160	0.221,0.199	0.404,0.264	0.945,0.379	1.556,0.485	3.761,1.156	7.901,2.484
      1000	0.215,0.191	0.303,0.246	0.471,0.288	0.960,0.347	1.647,0.479	3.916,1.176	8.058,3.160
      
      V=1000
      F\P	1		2		4		8		16		32		64
      10	0.151,0.129	0.210,0.163	0.326,0.245	0.685,0.521	1.284,0.859	3.087,2.251	6.451,4.801
      100	0.154,0.153	0.211,0.191	0.276,0.282	0.687,0.506	1.202,0.877	3.259,1.954	8.738,2.887
      1000	0.145,0.179	0.202,0.222	0.449,0.319	0.899,0.333	1.577,0.524	4.221,1.240	9.782,3.579
      
      V=10000
      F\P	1		2		4		8		16		32		64
      10	0.161,0.154	0.198,0.190	0.296,0.256	0.662,0.480	1.192,0.818	2.989,2.200	6.362,4.746
      100	0.176,0.174	0.236,0.203	0.326,0.255	0.696,0.511	1.183,0.855	4.205,3.444	19.510,17.760
      1000	0.199,0.183	0.240,0.227	1.159,1.014	2.286,2.154	6.023,6.039	---,10.933	---,36.620
      
      V=100000
      F\P	1		2		4		8		16		32		64
      10	0.171,0.162	0.204,0.198	0.285,0.230	0.692,0.500	1.225,0.881	2.990,2.243	6.379,4.771
      100	0.151,0.171	0.220,0.210	0.295,0.255	0.720,0.518	1.226,0.844	3.423,2.831	19.234,17.544
      1000	0.192,0.189	0.249,0.225	1.162,1.043	2.257,2.093	5.853,4.997	---,10.399	---,32.198
      
      We see that the new code is faster in pretty much all the cases, and
      starting from 4 processes there are significant gains with the new code,
      resulting in up to 20 times shorter runtimes. Also, for large numbers of
      cached entries, not all values for the old code could be measured
      (marked ---), as the kernel started hitting softlockups and died before
      the test completed.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  6. 21 Feb, 2016 2 commits
  7. 19 Feb, 2016 2 commits
    • ext4: fix crashes in dioread_nolock mode · 74dae427
      Jan Kara authored
      Competing overwrite DIO in dioread_nolock mode will just overwrite
      pointer to io_end in the inode. This may result in data corruption or
      extent conversion happening from IO completion interrupt because we
      don't properly set buffer_defer_completion() when unlocked DIO races
      with locked DIO to unwritten extent.
      
      Since unlocked DIO doesn't need io_end for anything, just avoid
      allocating it and corrupting pointer from inode for locked DIO.
      A cleaner fix would be to avoid these games with io_end pointer from the
      inode but that requires more intrusive changes so we leave that for
      later.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: fix bh->b_state corruption · ed8ad838
      Jan Kara authored
      ext4 can update bh->b_state non-atomically in _ext4_get_block() and
      ext4_da_get_block_prep(). Usually this is fine since bh is just a
      temporary storage for mapping information on stack but in some cases it
      can be fully living bh attached to a page. In such case non-atomic
      update of bh->b_state can race with an atomic update which then gets
      lost. Usually when we are mapping bh and thus updating bh->b_state
      non-atomically, nobody else touches the bh and so things work out fine
      but there is one case to especially worry about: ext4_finish_bio() uses
      BH_Uptodate_Lock on the first bh in the page to synchronize handling of
      PageWriteback state. So when blocksize < pagesize, we can be atomically
      modifying bh->b_state of a buffer that actually isn't under IO and thus
      can race e.g. with delalloc trying to map that buffer. The result is
      that we can mistakenly set / clear BH_Uptodate_Lock bit resulting in the
      corruption of PageWriteback state or missed unlock of BH_Uptodate_Lock.
      
      Fix the problem by always updating bh->b_state bits atomically.
      
      CC: stable@vger.kernel.org
      Reported-by: Nikolay Borisov <kernel@kyup.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
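The idea of updating only a subset of state bits atomically, so that a concurrent atomic change to another bit is never lost, can be sketched in userspace with C11 atomics. The structure, mask values, and function name below are hypothetical; the kernel uses its own cmpxchg() primitive and flag definitions rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical bit assignments for illustration only. */
#define TOY_MAP_FLAGS      0x0fUL   /* bits owned by the mapping code */
#define TOY_UPTODATE_LOCK  0x10UL   /* bit owned by I/O completion */

struct toy_bh {
	_Atomic unsigned long state;
};

/* Replace only the mapping bits. If another thread changes the word
 * between our read and our write, the compare-exchange fails, reloads
 * the current value into old_state, and we retry with fresh bits. */
static void toy_update_state(struct toy_bh *bh, unsigned long flags)
{
	unsigned long old_state, new_state;

	assert(bh != NULL);
	flags &= TOY_MAP_FLAGS;
	old_state = atomic_load(&bh->state);
	do {
		new_state = (old_state & ~TOY_MAP_FLAGS) | flags;
	} while (!atomic_compare_exchange_weak(&bh->state, &old_state,
					       new_state));
}
```

A plain read-modify-write (state = (state & ~mask) | flags) would write back a stale copy of the lock bit if it changed in between, which is the lost-update pattern the commit message describes.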
  8. 16 Feb, 2016 1 commit
  9. 12 Feb, 2016 1 commit