1. 29 Sep, 2012 6 commits
    • ext4: serialize unlocked dio reads with truncate · 1c9114f9
      Dmitry Monakhov authored
      The current serialization works only for DIO that holds i_mutex;
      with nonlocked DIO the following race is possible:
      
      dio_nolock_read_task            truncate_task
      				->ext4_setattr()
      				 ->inode_dio_wait()
      ->ext4_ext_direct_IO
        ->ext4_ind_direct_IO
          ->__blockdev_direct_IO
            ->ext4_get_block
      				 ->truncate_setsize()
      				 ->ext4_truncate()
      				 #alloc truncated blocks
      				 #to other inode
            ->submit_io()
           #INFORMATION LEAK
      
      In order to serialize with unlocked DIO reads we have to
      rearrange the wait sequence:
      1) update i_size first
      2) if i_size is about to be reduced, wait for outstanding DIO requests
      3) and only after that truncate the inode's blocks
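
      The reordered sequence can be sketched in user-space C. This is a
      simplified model, not the kernel code: struct model_inode,
      dio_read_begin(), dio_read_end() and model_truncate() are hypothetical
      names, one block per byte is assumed, and the blocking inode_dio_wait()
      is stood in for by returning -1 so that the caller retries. That the
      reader takes its reference before checking the size is also an
      assumption of the model.

```c
#include <assert.h>

/* Hypothetical user-space model of an inode; not the kernel structures. */
struct model_inode {
    long size;      /* models i_size */
    int dio_count;  /* models i_dio_count: in-flight direct I/O requests */
    long nblocks;   /* blocks still mapped (model: one block per byte) */
};

/* Nonlocked reader: take a DIO reference, then check the size.
 * Returns 1 if the read may proceed (pair it with dio_read_end()),
 * 0 if the offset lies beyond the current size. */
int dio_read_begin(struct model_inode *ino, long offset)
{
    ino->dio_count++;
    if (offset >= ino->size) {
        ino->dio_count--;
        return 0;
    }
    return 1;
}

void dio_read_end(struct model_inode *ino)
{
    assert(ino->dio_count > 0);
    ino->dio_count--;
}

/* Returns 0 when done, -1 where the kernel would block in inode_dio_wait(). */
int model_truncate(struct model_inode *ino, long new_size)
{
    ino->size = new_size;           /* 1) update the size first */
    if (ino->nblocks > new_size) {  /* the file is shrinking */
        if (ino->dio_count > 0)
            return -1;              /* 2) outstanding DIO: must wait */
        ino->nblocks = new_size;    /* 3) only now release the blocks */
    }
    return 0;
}
```

      Because the size is reduced first, a reader that starts after step 1
      is rejected, and a reader that started earlier keeps the blocks alive
      until it drops its reference.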
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: serialize dio nonlocked reads with defrag workers · 17335dcc
      Dmitry Monakhov authored
      Inode block defrag and ext4_change_inode_journal_flag() may
      affect the result of nonlocked DIO reads, so proper synchronization
      is required.

      - Add missing inode_dio_wait() calls where appropriate
      - Check inode state under an extra i_dio_count reference.
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: completed_io locking cleanup · 28a535f9
      Dmitry Monakhov authored
      The current unwritten extent conversion state machine is very fuzzy.
      - For an unknown reason it performs conversion under i_mutex. What for?
        My diagnosis:
        We already protect the extent tree with i_data_sem, and truncate and
        punch_hole should wait for DIO, so the only data we have to protect is
        the end_io->flags modification. Only flush_completed_IO and end_io_work
        modify these flags, and we can serialize them via i_completed_io_lock.

        Currently all these games with mutex_trylock result in the following deadlock:
         truncate:                          kworker:
          ext4_setattr                       ext4_end_io_work
          mutex_lock(i_mutex)
          inode_dio_wait(inode)  ->BLOCK
                                   DEADLOCK<- mutex_trylock()
                                              inode_dio_done()
        #TEST_CASE1_BEGIN
        MNT=/mnt_scrach
        unlink $MNT/file
        fallocate -l $((1024*1024*1024)) $MNT/file
        aio-stress -I 100000 -O -s 100m -n -t 1 -c 10 -o 2 -o 3 $MNT/file
        sleep 2
        truncate -s 0 $MNT/file
        #TEST_CASE1_END
      
      Or use xfstests test 286: https://github.com/dmonakhov/xfstests/blob/devel/286

      This patch makes the state machine simple and clean:

      (1) xxx_end_io schedules the final extent conversion simply by calling
          ext4_add_complete_io(), which appends it to ei->i_completed_io_list
          NOTE1: because of (2A), work should be queued only if
          ->i_completed_io_list was empty; otherwise the work is already scheduled.

      (2) ext4_flush_completed_IO is responsible for handling all pending
          end_io from ei->i_completed_io_list
          The flushing sequence consists of the following stages:
          A) LOCKED: Atomically drain completed_io_list to local_list
          B) Perform extents conversion
          C) LOCKED: move converted io's to to_free list for final deletion
             This logic depends on the context we were called from.
          D) Final end_io context destruction
          NOTE2: i_mutex is no longer required because end_io->flags modification
          is protected by ei->i_completed_io_lock
      Full list of changes:
      - Move all end_io completion routines to page-io.c in order to improve
        logic locality
      - Move open-coded logic from various xx_end_xx routines to ext4_add_complete_io()
      - Remove EXT4_IO_END_FSYNC
      - Improve SMP scalability by removing the useless i_mutex, which no longer
        protects io->flags.
      - Reduce lock contention on i_completed_io_lock by optimizing the list walk.
      - Rename ext4_end_io_nolock to ext4_end_io and make it static
      - Check flush completion status in ext4_ext_punch_hole(), because it is
        not a good idea to punch blocks from a corrupted inode.

      Changes since V3 (in response to Jan's comments):
        Fall back to the active flush_completed_IO() approach in order to prevent
        performance issues with nonlocked DIO reads.
      Changes since V2:
        Fix a use-after-free caused by a race between truncate and end_io_work
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: fix unwritten counter leakage · 82e54229
      Dmitry Monakhov authored
      ext4_set_io_unwritten_flag() increments the i_unwritten counter, so
      once we mark an end_io with EXT4_END_IO_UNWRITTEN we have to revert it
      on the error path.

       - add missing error checks to prevent counter leakage
       - ext4_end_io_nolock() will clear the EXT4_END_IO_UNWRITTEN flag to signal
         that conversion finished.
       - add a BUG_ON to ext4_free_end_io() to prevent similar leakage in the future.

      The visible effect of this bug is that unaligned aio-stress may deadlock.
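
      The pairing between the flag and the counter can be sketched in
      user-space C. The model_* names and MODEL_END_IO_UNWRITTEN are
      hypothetical stand-ins, the `fail` parameter only simulates an error
      after the flag was set, and assert() plays the role of the BUG_ON.

```c
#include <assert.h>
#include <errno.h>

#define MODEL_END_IO_UNWRITTEN 0x1u  /* stands in for EXT4_END_IO_UNWRITTEN */

struct model_inode { int unwritten; };  /* models the i_unwritten counter */
struct model_io_end { struct model_inode *inode; unsigned flags; };

/* Setting the flag and bumping the counter always go together. */
void model_set_unwritten(struct model_io_end *io)
{
    if (!(io->flags & MODEL_END_IO_UNWRITTEN)) {
        io->flags |= MODEL_END_IO_UNWRITTEN;
        io->inode->unwritten++;
    }
}

/* Clearing the flag is the only place the counter is dropped. */
void model_clear_unwritten(struct model_io_end *io)
{
    if (io->flags & MODEL_END_IO_UNWRITTEN) {
        io->flags &= ~MODEL_END_IO_UNWRITTEN;
        io->inode->unwritten--;
    }
}

/* fail != 0 simulates an error that happens after the flag was set. */
int model_submit(struct model_io_end *io, int fail)
{
    model_set_unwritten(io);
    if (fail) {
        model_clear_unwritten(io);  /* error path: undo the increment */
        return -EIO;
    }
    return 0;
}

/* Freeing an io that still carries the flag would leak the counter. */
void model_free_io_end(struct model_io_end *io)
{
    assert(!(io->flags & MODEL_END_IO_UNWRITTEN));
}
```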
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: give i_aiodio_unwritten a more appropriate name · e27f41e1
      Dmitry Monakhov authored
      The AIO/DIO prefix is wrong because the counter accounts for unwritten
      extents, which may also be scheduled from buffered write endio.
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: ext4_inode_info diet · f45ee3a1
      Dmitry Monakhov authored
      The generic inode has an unused i_private pointer which may be used as
      cur_aio_dio storage.

      TODO: If cur_aio_dio were passed as an argument to get_block_t, this would
            allow concurrent AIO_DIO requests.
      Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  2. 27 Sep, 2012 12 commits
  3. 26 Sep, 2012 6 commits
    • ext4: reimplement uninit extent optimization for move_extent_per_page() · 8c854473
      Dmitry Monakhov authored
      An uninitialized extent may become initialized (by a parallel writeback
      task) at any moment after we drop i_data_sem, so we have to recheck the
      extent's state after we take the page's lock and i_data_sem.

      If we are about to change a page's mapping we must hold the page's lock in
      order to serialize with other users.
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: clean up online defrag bugs in move_extent_per_page() · bb557488
      Dmitry Monakhov authored
      Incomplete list of bugs:
      1) The uninitialized extent optimization does not hold the page's lock
         and simply replaces branches; after that the writeback code goes
         crazy because the block mapping changed under its feet:
         kernel BUG at fs/ext4/inode.c:1434!  (xfstests test 288)

      2) An uninitialized extent may become initialized right after we
         drop i_data_sem, so the extent state must be rechecked

      3) Locked pages become uptodate via the following sequence:
         ->readpage(page); lock_page(page); use_that_page(page)
         But after readpage() the page may be invalidated because it is
         uptodate and unlocked (the reclaimer does that).
         As a result: kernel BUG at include/linux/buffer_head.h:133!

      4) We call write_begin() with an already opened transaction, which
         results in the following deadlock:
      ->move_extent_per_page()
        ->ext4_journal_start()-> hold journal transaction
        ->write_begin()
          ->ext4_da_write_begin()
            ->ext4_nonda_switch()
              ->writeback_inodes_sb_if_idle()  --> will wait for journal_stop()
      
      5) try_to_release_page() may fail, and it does fail if one of the page's
         bh is pinned by the journal

      6) If we are about to change a page's mapping we MUST hold its lock during
         the entire remapping procedure; this is true for both pages (the
         original and the donor one)

      Fixes:

      - Avoid (1) and (2) simply by temporarily dropping the uninitialized extent
        handling optimization; this will be reimplemented later.

      - Fix (3) by manually forcing the page to uptodate state w/o dropping its lock

      - Fix (4) by rearranging existing locking:
        from: journal_start(); ->write_begin
        to: write_begin(); journal_extend()
      - Fix (5) simply by checking the return value
      - Fix (6) by locking both pages (the original and the donor one) during
        extent swap with the help of mext_page_double_lock()
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: online defrag is not supported for journaled files · f066055a
      Dmitry Monakhov authored
      A proper block swap for inodes with full journaling enabled is a
      truly non-obvious task. In order to be on the safe side let's
      explicitly disable it for now.
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
    • ext4: move_extent code cleanup · 03bd8b9b
      Dmitry Monakhov authored
      - Remove useless checks, because it is too late to check that inode != NULL
        once it has already been dereferenced several times.
      - The double lock routines look very ugly, and the locking order relies on
        the order of i_ino, but other kernel code relies on the order of pointers.
        Let's make them simple and clean.
      - Check that the inodes belong to the same SB as early as possible.
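
      Ordering a double lock by pointer value can be sketched in
      user-space C as follows. struct model_inode, model_double_lock() and
      model_double_unlock() are hypothetical names, and a pthread mutex
      stands in for the per-inode lock. The function returns the object
      that was locked first so the order can be observed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

struct model_inode { pthread_mutex_t mutex; };

/* Lock both objects, always taking the lower address first, so that two
 * tasks locking the same pair in opposite argument order cannot deadlock. */
struct model_inode *model_double_lock(struct model_inode *a,
                                      struct model_inode *b)
{
    struct model_inode *first = a, *second = b;

    assert(a != b);
    if ((uintptr_t)a > (uintptr_t)b) {
        first = b;
        second = a;
    }
    pthread_mutex_lock(&first->mutex);
    pthread_mutex_lock(&second->mutex);
    return first;
}

/* The unlock order does not matter for deadlock avoidance. */
void model_double_unlock(struct model_inode *a, struct model_inode *b)
{
    pthread_mutex_unlock(&a->mutex);
    pthread_mutex_unlock(&b->mutex);
}
```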
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
    • ext4: don't call update_backups() multiple times for the same bg · 0acdb887
      Tao Ma authored
      When performing an online resize, we add a bunch of groups at one time
      in ext4_flex_group_add, so in most cases a lot of group descriptors
      will be in the same group block. But at the end of this function,
      update_backups will be called for every group descriptor and the same
      block will be copied and journalled again and again.  It is really a
      waste.
      
      Fix things so we only update a particular bg descriptor block once and
      skip subsequent updates of the same block.
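
      The skip logic can be illustrated with a small user-space C sketch
      that only counts how many backup updates would be issued.
      count_backup_updates() and its parameters are hypothetical;
      descs_per_block models the number of group descriptors that fit in
      one descriptor block.

```c
#include <assert.h>

/* Count the descriptor-block backups needed for groups
 * [first_group, first_group + ngroups), updating each block only once. */
unsigned count_backup_updates(unsigned first_group, unsigned ngroups,
                              unsigned descs_per_block)
{
    unsigned updates = 0, last_block = 0, g;
    int have_last = 0;

    assert(descs_per_block > 0);
    for (g = first_group; g < first_group + ngroups; g++) {
        unsigned block = g / descs_per_block;

        if (have_last && block == last_block)
            continue;           /* this block was already backed up */
        last_block = block;
        have_last = 1;
        updates++;              /* the real code would back up here */
    }
    return updates;
}
```

      With 32 descriptors per block, adding 64 groups starting at group 0
      touches two descriptor blocks, so two updates are issued instead of 64.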
      Signed-off-by: Tao Ma <boyu.mt@taobao.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: fix double unlock buffer mess during fs-resize · 7f1468d1
      Dmitry Monakhov authored
      bh_submit_read() is responsible for unlocking the bh on endio.  In addition,
      we need to use bh_uptodate_or_lock() to avoid races.
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  4. 24 Sep, 2012 3 commits
  5. 20 Sep, 2012 2 commits
    • ext4: remove erroneous ext4_superblock_csum_set() in update_backups() · bef53b01
      Tao Ma authored
      The update_backups() function is used to back up all the metadata
      blocks, so we should not take it for granted that 'data' points to
      a super block and use ext4_superblock_csum_set to calculate the
      checksum there.  In the case where the data is a group descriptor block,
      it will corrupt the last group descriptor, and then e2fsck will
      complain about it.

      As all the metadata checksums should already be OK when we do the
      backup, remove the wrong ext4_superblock_csum_set and it should be
      just fine.
      Reported-by: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Tao Ma <boyu.mt@taobao.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
    • ext4: fix potential deadlock in ext4_nonda_switch() · 00d4e736
      Theodore Ts'o authored
      In ext4_nonda_switch(), if the file system is getting full we used to
      call writeback_inodes_sb_if_idle().  The problem is that we can be
      holding i_mutex already, and this causes a potential deadlock when
      writeback_inodes_sb_if_idle() tries to take s_umount.  (See
      lockdep output below).

      As it turns out we don't need to hold s_umount; the fact that we
      are in the middle of the write(2) system call will keep the superblock
      pinned.  Unfortunately writeback_inodes_sb() checks to make sure
      s_umount is taken, and the VFS uses a different mechanism for making
      sure the file system doesn't get unmounted out from under us.  The
      simplest way of dealing with this is to just simply grab s_umount
      using a trylock, and skip kicking the writeback flusher thread in the
      very unlikely case that we can't take a read lock on s_umount without
      blocking.
      
      Also, we now check the criteria for kicking the writeback thread
      before we decide whether to fall back to non-delayed writeback, so
      if there are any outstanding delayed allocation writes, we try to get
      them resolved as soon as possible.
      
         [ INFO: possible circular locking dependency detected ]
         3.6.0-rc1-00042-gce894ca #367 Not tainted
         -------------------------------------------------------
         dd/8298 is trying to acquire lock:
          (&type->s_umount_key#18){++++..}, at: [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46
      
         but task is already holding lock:
          (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3
      
         which lock already depends on the new lock.
      
         2 locks held by dd/8298:
          #0:  (sb_writers#2){.+.+.+}, at: [<c01ddcc5>] generic_file_aio_write+0x56/0xd3
          #1:  (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3
      
         stack backtrace:
         Pid: 8298, comm: dd Not tainted 3.6.0-rc1-00042-gce894ca #367
         Call Trace:
          [<c015b79c>] ? console_unlock+0x345/0x372
          [<c06d62a1>] print_circular_bug+0x190/0x19d
          [<c019906c>] __lock_acquire+0x86d/0xb6c
          [<c01999db>] ? mark_held_locks+0x5c/0x7b
          [<c0199724>] lock_acquire+0x66/0xb9
          [<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46
          [<c06db935>] down_read+0x28/0x58
          [<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46
          [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46
          [<c026f3b2>] ext4_nonda_switch+0xe1/0xf4
          [<c0271ece>] ext4_da_write_begin+0x27/0x193
          [<c01dcdb0>] generic_file_buffered_write+0xc8/0x1bb
          [<c01ddc47>] __generic_file_aio_write+0x1dd/0x205
          [<c01ddce7>] generic_file_aio_write+0x78/0xd3
          [<c026d336>] ext4_file_write+0x480/0x4a6
          [<c0198c1d>] ? __lock_acquire+0x41e/0xb6c
          [<c0180944>] ? sched_clock_cpu+0x11a/0x13e
          [<c01967e9>] ? trace_hardirqs_off+0xb/0xd
          [<c018099f>] ? local_clock+0x37/0x4e
          [<c0209f2c>] do_sync_write+0x67/0x9d
          [<c0209ec5>] ? wait_on_retry_sync_kiocb+0x44/0x44
          [<c020a7b9>] vfs_write+0x7b/0xe6
          [<c020a9a6>] sys_write+0x3b/0x64
          [<c06dd4bd>] syscall_call+0x7/0xb
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
  6. 19 Sep, 2012 2 commits
    • ext4: speed up truncate/unlink by not using bforget() unless needed · 18888cf0
      Andrey Sidorov authored
      Do not iterate over data blocks scanning for bh's to forget, as they
      never exist. This improves the time taken by the unlink / truncate syscalls.
      Tested by continuously truncating a file that is being written by dd.
      Another test is rm -rf of a linux tree while tar unpacks it. With
      ordered data mode the condition unlikely(!tbh) was always met in
      ext4_free_blocks. With journal data mode tbh was found only a few times,
      so the optimisation is also possible.
      
      Unlinking fallocated 60G file after doing sync && echo 3 >
      /proc/sys/vm/drop_caches && time rm --help
      
      X86 before (linux 3.6-rc4):
      # time rm -f test1
      real    0m2.710s
      user    0m0.000s
      sys     0m1.530s
      
      X86 after:
      # time rm -f test1
      real    0m0.644s
      user    0m0.003s
      sys     0m0.060s
      
      MIPS before (linux 2.6.37):
      # time rm -f test1
      real    0m 4.93s
      user    0m 0.00s
      sys     0m 4.61s
      
      MIPS after:
      # time rm -f test1
      real    0m 0.16s
      user    0m 0.00s
      sys     0m 0.06s
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrey Sidorov <qrxd43@motorola.com>
    • ext4: fix online resizing when the # of block groups is constant · 59e31c15
      Theodore Ts'o authored
      Commit 1c6bd717 introduced a regression where an online resize
      operation which did not change the number of block groups would fail,
      i.e:
      
      	mke2fs -t /dev/vdc 60000
      	mount /dev/vdc
      	resize2fs /dev/vdc 60001
      
      This was due to a bug in the logic regarding when to try converting
      the filesystem to use meta_bg.
      
      Also fix up a number of other minor issues with the online resizing
      code: (a) Fix a sparse warning; (b) only check to make sure the device
      is large enough once, instead of multiple times through the resize
      loop.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  7. 18 Sep, 2012 4 commits
    • ext4: make orphan functions be no-op in no-journal mode · c9b92530
      Anatol Pomozov authored
      Instead of checking whether the handle is valid, we check if journal
      is enabled. This avoids taking the s_orphan_lock mutex in all cases
      when there is no journal in use, including the error paths where
      ext4_orphan_del() is called with a handle set to NULL.
      Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: re-enable -o discard functionality in no-journal mode · b5e2368b
      Theodore Ts'o authored
      This is a revert of commit b56ff9d3, which removed the call to
      ext4_issue_discard() to fix a BUG reported because
      ext4_issue_discard() was being called from inside a block group
      spinlock.  As it turns out this bug had already been fixed by Lukas
      Czerner in commit 53fdcf99 by the simple expedient of moving the
      call to ext4_issue_discard() outside the spinlock.

      So it should be safe to re-enable this functionality, which I tested
      by putting a BUG_ON(in_atomic) just after the restored callsite to
      ext4_issue_discard().
      
      Addresses-Google-Bug: #6750518
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Anatol Pomozov <anatol.pomozov@gmail.com>
    • ext4: fix possible non-initialized variable in htree_dirblock_to_tree() · 90b0a973
      Carlos Maiolino authored
      htree_dirblock_to_tree() declares an uninitialized 'err' variable,
      which is passed by reference to other functions that are expected to
      set it to their error codes.

      It's passed to ext4_bread(), which then passes it to ext4_getblk(). If
      ext4_map_blocks() returns 0 due to a lookup failure, leaving the
      ext4_getblk() buffer_head uninitialized, ext4_getblk() returns to
      ext4_bread() without initializing the 'err' variable, and
      ext4_bread() returns to htree_dirblock_to_tree() with this variable
      still uninitialized.  htree_dirblock_to_tree() then passes this
      garbage back to ext4_htree_fill_tree(), which expects the number of
      directory entries added to the rb-tree.  ext4_htree_fill_tree() may in
      turn return a fake non-zero value because of the garbage left in 'err',
      leading the kernel to an Oops in ext4_dx_readdir(): it expects a
      filled rb-tree node but gets a NULL one, causing an invalid page
      request when trying to get a fname struct from this NULL
      rb-tree node in this line:
      
      fname = rb_entry(info->curr_node, struct fname, rb_hash);
      
      The patch itself initializes the err variable in
      htree_dirblock_to_tree() to avoid usage mistakes by the called
      functions, and also fixes ext4_getblk() to return an initialized 'err'
      variable when ext4_map_blocks() fails a lookup.
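
      Both halves of the fix can be illustrated with a user-space C sketch.
      model_getblk() and model_dirblock_to_tree() are hypothetical
      stand-ins, and the convention that block 0 is a hole and negative
      block numbers fail is invented for the model.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Callee side of the fix: *err is written on every path, including the
 * "nothing mapped, but no error" path that used to leave it untouched. */
const char *model_getblk(int block, int *err)
{
    static const char data[] = "dirblock";

    assert(err != NULL);
    *err = 0;
    if (block < 0) {
        *err = -EIO;        /* real failure */
        return NULL;
    }
    if (block == 0)
        return NULL;        /* hole: no buffer, and no error either */
    return data;
}

/* Caller side of the fix: 'err' starts out as 0, so garbage can never be
 * returned even if a callee forgets to set it.
 * Returns the number of entries added (1 in this model) or an error. */
int model_dirblock_to_tree(int block)
{
    int err = 0;
    const char *bh = model_getblk(block, &err);

    if (bh == NULL)
        return err;         /* 0 for a hole, negative errno on failure */
    return 1;
}
```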
      Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • Theodore Ts'o's avatar
      bc0b75f7
  8. 13 Sep, 2012 3 commits
  9. 12 Sep, 2012 1 commit
  10. 05 Sep, 2012 1 commit