  1. 22 Mar, 2011 3 commits
  2. 21 Mar, 2011 6 commits
  3. 16 Mar, 2011 1 commit
    • ext4: Initialize fsync transaction ids in ext4_new_inode() · 688f869c
      Theodore Ts'o authored
      When allocating a new inode, we need to make sure i_sync_tid and
      i_datasync_tid are initialized.  Otherwise, one or both of these two
      values could be left initialized to zero, which could potentially
      result in BUG_ON in jbd2_journal_commit_transaction.
      
      (This could happen if journal->commit_request is set to zero, which
      could wake up the kjournald process even though there is no running
      transaction, which then causes a BUG_ON via the
      J_ASSERT(j_running_transaction != NULL) statement.)
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  4. 05 Mar, 2011 1 commit
  5. 28 Feb, 2011 4 commits
  6. 27 Feb, 2011 3 commits
    • ext4: make FIEMAP and delayed allocation play well together · 6d9c85eb
      Yongqiang Yang authored
      Fix the FIEMAP ioctl so that it returns all of the page ranges which
      are still subject to delayed allocation.  We were missing some cases
      if the file was sparse.
      
      Reported by Chris Mason <chris.mason@oracle.com>:
      >We've had reports on btrfs that cp is giving us files full of zeros
      >instead of actually copying them.  It was tracked down to a bug with
      >the btrfs fiemap implementation where it was returning holes for
      >delalloc ranges.
      >
      >Newer versions of cp are trusting fiemap to tell it where the holes
      >are, which does seem like a pretty neat trick.
      >
      >I decided to give xfs and ext4 a shot with a few test cases too, xfs
      >passed with all the ones btrfs was getting wrong, and ext4 got the basic
      >delalloc case right.
      >$ mkfs.ext4 /dev/xxx
      >$ mount /dev/xxx /mnt
      >$ dd if=/dev/zero of=/mnt/foo bs=1M count=1
      >$ fiemap-test foo
      >ext:   0 logical: [       0..     255] phys:        0..     255
      >flags: 0x007 tot: 256
      >
      >Hooray!  But once we throw a hole in, things go bad:
      >$ mkfs.ext4 /dev/xxx
      >$ mount /dev/xxx /mnt
      >$ dd if=/dev/zero of=/mnt/foo bs=1M count=1 seek=1
      >$ fiemap-test foo
      >< no output >
      >
      >We've got a delalloc extent after the hole and ext4 fiemap didn't find
      >it.  If I run sync to kick the delalloc out:
      >$ sync
      >$ fiemap-test foo
      >ext:   0 logical: [     256..     511] phys:    34048..   34303
      >flags: 0x001 tot: 256
      >
      >fiemap-test is sitting in my /usr/local/bin, and I have no idea how it
      >got there.  It's full of pretty comments so I know it isn't mine, but
      >you can grab it here:
      >
      >http://oss.oracle.com/~mason/fiemap-test.c
      >
      >xfsqa has a fiemap program too.
      
      After the fix, test results are as follows:
      ext:   0 logical: [     256..     511] phys:        0..     255
      flags: 0x007 tot: 256
      ext:   0 logical: [     256..     511] phys:    33280..   33535
      flags: 0x001 tot: 256
      
      $ mkfs.ext4 /dev/xxx
      $ mount /dev/xxx /mnt
      $ dd if=/dev/zero of=/mnt/foo bs=1M count=1 seek=1
      $ sync
      $ dd if=/dev/zero of=/mnt/foo bs=1M count=1 seek=3
      $ dd if=/dev/zero of=/mnt/foo bs=1M count=1 seek=5
      $ fiemap-test foo
      ext:   0 logical: [     256..     511] phys:    33280..   33535
      flags: 0x000 tot: 256
      ext:   1 logical: [     768..    1023] phys:        0..     255
      flags: 0x006 tot: 256
      ext:   2 logical: [    1280..    1535] phys:        0..     255
      flags: 0x007 tot: 256
      Tested-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
      Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
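The `flags` values in the test output above are bitmasks of FIEMAP extent flags. As a reading aid, here is a minimal sketch that decodes the three low bits seen in these results. The constants mirror the values of FIEMAP_EXTENT_LAST, FIEMAP_EXTENT_UNKNOWN and FIEMAP_EXTENT_DELALLOC from <linux/fiemap.h>, redefined locally with a DEMO_ prefix so the snippet builds anywhere. The helper name describe_fiemap_flags() is hypothetical and is not part of fiemap-test.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Local copies of the low FIEMAP extent flag bits (see <linux/fiemap.h>). */
#define DEMO_FIEMAP_EXTENT_LAST     0x00000001u /* last extent in the file */
#define DEMO_FIEMAP_EXTENT_UNKNOWN  0x00000002u /* data location unknown */
#define DEMO_FIEMAP_EXTENT_DELALLOC 0x00000004u /* location pending (delayed allocation) */

/*
 * Hypothetical helper: write a '|'-separated list of the known flag
 * names set in `flags` into `buf`, or "none" if no known bit is set.
 * Returns `buf` for convenience.
 */
static const char *describe_fiemap_flags(uint32_t flags, char *buf, size_t len)
{
    static const struct {
        uint32_t bit;
        const char *name;
    } names[] = {
        { DEMO_FIEMAP_EXTENT_LAST,     "LAST" },
        { DEMO_FIEMAP_EXTENT_UNKNOWN,  "UNKNOWN" },
        { DEMO_FIEMAP_EXTENT_DELALLOC, "DELALLOC" },
    };

    assert(buf != NULL && len > 0);
    buf[0] = '\0';

    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++) {
        if (!(flags & names[i].bit))
            continue;
        if (buf[0] != '\0')
            strncat(buf, "|", len - strlen(buf) - 1);
        strncat(buf, names[i].name, len - strlen(buf) - 1);
    }
    if (buf[0] == '\0')
        strncat(buf, "none", len - 1);
    return buf;
}
```

Read this way, the 0x007 reported for the not-yet-written extents is LAST, UNKNOWN and DELALLOC together, while the 0x001 reported after `sync` is LAST alone.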
    • ext4: suppress verbose debugging information if malloc-debug is off · 4dd89fc6
      Theodore Ts'o authored
      If CONFIG_EXT4_DEBUG is enabled, then if a block allocation fails due
      to disk being full, a verbose debugging message is printed, even if
      the malloc-debug switch has not been enabled.  Suppress the debugging
      message so that nothing is printed unless malloc-debug has been turned
      on.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: don't leave PageWriteback set after memory failure · a54aa761
      Theodore Ts'o authored
      In ext4_bio_write_page(), if the memory allocation for the struct
      ext4_io_page fails, it returns with the page's PageWriteback flag set.
      This will end up causing the page not to skip writeback in
      WB_SYNC_NONE mode, and in WB_SYNC_ALL mode (i.e., on a sync, fsync, or
      umount) the writeback daemon will get stuck forever on the
      wait_on_page_writeback() function in write_cache_pages_da().
      
      Or, if journalling is enabled and the file gets deleted, the journal
      thread can get stuck in the journal_finish_inode_data_buffers() call
      to filemap_fdatawait().
      
      Another place where things can get hung up is in
      truncate_inode_pages(), called out of ext4_evict_inode().
      
      Fix this by not setting PageWriteback until after we have successfully
      allocated the struct ext4_io_page.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
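The ordering described in this fix can be illustrated outside the kernel. The following is a user-space sketch only: struct demo_page, struct demo_io_page, demo_alloc_io_page() and demo_bio_write_page() are hypothetical stand-ins, not the ext4 code. It shows the writeback flag being set only once the allocation has succeeded, so a failed allocation leaves no flag behind for a later waiter to block on.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the page and the per-page I/O tracking struct. */
struct demo_page {
    bool writeback;             /* models the PageWriteback flag */
};

struct demo_io_page {
    struct demo_page *page;
};

/* Test hook: while true, allocations fail, simulating memory pressure. */
static bool demo_fail_alloc;

static struct demo_io_page *demo_alloc_io_page(void)
{
    if (demo_fail_alloc)
        return NULL;
    return calloc(1, sizeof(struct demo_io_page));
}

/*
 * Allocate first, mark the page as under writeback second.  On
 * allocation failure the flag is never touched, so nothing can end up
 * waiting for a writeback that was never started.
 */
static int demo_bio_write_page(struct demo_page *page,
                               struct demo_io_page **out)
{
    struct demo_io_page *io_page;

    assert(page != NULL && out != NULL);

    io_page = demo_alloc_io_page();
    if (io_page == NULL)
        return -ENOMEM;         /* page->writeback is still clear */

    io_page->page = page;
    page->writeback = true;     /* safe: the I/O can now be tracked */
    *out = io_page;
    return 0;
}
```

The caller owns the returned struct demo_io_page and releases it with free() once the simulated I/O is done.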
  7. 26 Feb, 2011 9 commits
    • ext4: move setup of the mpd structure to write_cache_pages_da() · 168fc022
      Theodore Ts'o authored
      Move the initialization of all of the fields of the mpd structure to
      write_cache_pages_da().  This simplifies the code considerably.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: don't lock the next page in write_cache_pages if not needed · 78aaced3
      Theodore Ts'o authored
      If we have accumulated a contiguous region of memory to be written
      out, and the next page can be added to this region, don't bother
      locking (and then unlocking) the page before writing out the memory.
      In the unlikely event that the next page was being written back by
      some other CPU, we can also skip waiting for that page to finish
      writeback.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: remove page_skipped hackery in ext4_da_writepages() · ee6ecbcc
      Theodore Ts'o authored
      Because the ext4 page writeback codepath had been prematurely calling
      clear_page_dirty_for_io(), if it turned out that a particular page
      couldn't be written out during a particular pass of
      write_cache_pages_da(), the page would have to get redirtied by
      calling redirty_page_for_writepage().  Not only was this wasted work,
      but redirty_page_for_writepage() would increment wbc->pages_skipped to
      signal to writeback_sb_inodes() that buffers were locked, and that it
      should skip this inode until later.
      
      Since this signal was incorrect in ext4's case --- which was caused by
      ext4's historically incorrect use of write_cache_pages() ---
      ext4_da_writepages() saved and restored wbc->pages_skipped to avoid
      confusing writeback_sb_inodes().
      
      Now that we've fixed ext4 to call clear_page_dirty_for_io() right
      before initiating the page I/O, we can nuke the page_skipped
      save/restore hackery, and breathe a sigh of relief.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: clear the dirty bit for a page in writeback at the last minute · 97498956
      Theodore Ts'o authored
      Move when we call clear_page_dirty_for_io() to just before we actually
      write the page.  This simplifies the code somewhat, and avoids marking
      pages as clean and then needing to remark them as dirty later.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: simple cleanups to write_cache_pages_da() · 4f01b02c
      Theodore Ts'o authored
      Eliminate duplicate code, unneeded variables, etc., to make it easier
      to understand the code.  No behavioral changes were made in this patch.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: fold __mpage_da_writepage() into write_cache_pages_da() · 8eb9e5ce
      Theodore Ts'o authored
      Fold the __mpage_da_writepage() function into write_cache_pages_da().
      This will give us opportunities to clean up and simplify the resulting
      code.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: enable mblk_io_submit by default · 6fd7a467
      Theodore Ts'o authored
      Now that we've fixed the file corruption bug in commit d50bdd5a,
      it's time to enable mblk_io_submit by default.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: fix ext4_da_block_invalidatepages() to handle page range properly · c7f5938a
      Curt Wohlgemuth authored
      If ext4_da_block_invalidatepages() is called because of a
      failure from ext4_map_blocks() in mpage_da_map_and_submit(),
      it's supposed to clean up -- including unlock -- all the
      pages in the mpd structure.  But these values may not match
      up, even on a system in which block size == page size:
      
         mpd->b_blocknr != mpd->first_page
         mpd->b_size != (mpd->next_page - mpd->first_page)
      
      ext4_da_block_invalidatepages() has been using b_blocknr and
      b_size; this patch changes it to use first_page and
      next_page.
      
      Tested:  I injected a small number (5%) of failures in
      ext4_map_blocks() in the case that the flags contain
      EXT4_GET_BLOCKS_DELALLOC_RESERVE, and ran fsstress on this
      kernel.  Without this patch, I got hung tasks every time.
      With this patch, I see no hangs in many runs of fsstress.
      Signed-off-by: Curt Wohlgemuth <curtw@google.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: mark multi-page IO complete on mapping failure · e0fd9b90
      Curt Wohlgemuth authored
      In mpage_da_map_and_submit(), if we have a delayed block
      allocation failure from ext4_map_blocks(), we need to mark
      the IO as complete, by setting
      
            mpd->io_done = 1;
      
      Otherwise, we could end up submitting the pages in an outer
      loop; since they are unlocked on mapping failure in
      ext4_da_block_invalidatepages(), this will cause a bug check
      in mpage_da_submit_io().
      
      I tested this by injecting failures into ext4_map_blocks().
      Without this patch, a simple fsstress run will bug check;
      with the patch, it works fine.
      Signed-off-by: Curt Wohlgemuth <curtw@google.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  8. 24 Feb, 2011 5 commits
    • ext4: mballoc: don't replace the current preallocation group unnecessarily · 5a54b2f1
      Coly Li authored
      In ext4_mb_check_group_pa(), the current preallocation space is
      replaced with a new preallocation space when the two have the same
      distance from the goal block.
      
      This doesn't actually gain us anything, so change things so that the
      function only switches to the new preallocation group if its distance
      from the goal block is strictly smaller than the current preallocation
      group's distance from the goal block.
      Signed-off-by: Coly Li <bosong.ly@taobao.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
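The selection rule this commit describes comes down to using a strict comparison rather than a non-strict one. Below is a minimal user-space sketch of that rule. struct demo_pa, demo_distance() and demo_pick_pa() are hypothetical names for illustration and do not reproduce ext4_mb_check_group_pa() itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a preallocation space: only its start block. */
struct demo_pa {
    uint64_t pstart;
};

/* Absolute distance between two block numbers, without signed overflow. */
static uint64_t demo_distance(uint64_t a, uint64_t b)
{
    return a > b ? a - b : b - a;
}

/*
 * Keep the current preallocation unless the candidate is strictly
 * closer to the goal block.  On a tie the current one is kept, since
 * switching would gain nothing.
 */
static const struct demo_pa *demo_pick_pa(uint64_t goal,
                                          const struct demo_pa *cur,
                                          const struct demo_pa *cand)
{
    assert(cand != NULL);

    if (cur == NULL)
        return cand;            /* nothing chosen yet */
    if (demo_distance(goal, cand->pstart) < demo_distance(goal, cur->pstart))
        return cand;
    return cur;
}
```

With a goal block of 100, a current preallocation starting at 90 and a candidate starting at 110 are equally distant, so the current one is kept. A candidate starting at 105 would replace it.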
    • ext4: clarify description of ac_g_ex in struct ext4_allocation_context · 58696f3a
      Coly Li authored
      Signed-off-by: Coly Li <bosong.ly@taobao.com>
      Cc: Alex Tomas <alex@clusterfs.com>
      Cc: Theodore Tso <tytso@google.com>
    • mballoc: add comments to ext4_mb_mark_free_simple() · 7c786059
      Coly Li authored
      This patch adds comments to ext4_mb_mark_free_simple to make it more
      understandable.
      Signed-off-by: Coly Li <bosong.ly@taobao.com>
      Cc: Alex Tomas <alex@clusterfs.com>
      Cc: Theodore Tso <tytso@google.com>
    • ext4: remove unnecessary call to mb_find_buddy() in debugging code · 235772da
      Coly Li authored
      In __mb_check_buddy(), look at the code below:
        591         fstart = -1;
        592         buddy = mb_find_buddy(e4b, 0, &max);
        593         for (i = 0; i < max; i++) {
        594                 if (!mb_test_bit(i, buddy)) {
        595                         MB_CHECK_ASSERT(i >= e4b->bd_info->bb_first_free);
        596                         if (fstart == -1) {
        597                                 fragments++;
        598                                 fstart = i;
        599                         }
        600                         continue;
        601                 }
        602                 fstart = -1;
        603                 /* check used bits only */
        604                 for (j = 0; j < e4b->bd_blkbits + 1; j++) {
        605                         buddy2 = mb_find_buddy(e4b, j, &max2);
        606                         k = i >> j;
        607                         MB_CHECK_ASSERT(k < max2);
        608                         MB_CHECK_ASSERT(mb_test_bit(k, buddy2));
        609                 }
        610         }
        611         MB_CHECK_ASSERT(!EXT4_MB_GRP_NEED_INIT(e4b->bd_info));
        612         MB_CHECK_ASSERT(e4b->bd_info->bb_fragments == fragments);
        613
        614         grp = ext4_get_group_info(sb, e4b->bd_group);
        615         buddy = mb_find_buddy(e4b, 0, &max);
      
      On line 592, buddy is fetched by mb_find_buddy() with order 0.  Between
      line 593 and line 615, buddy is not changed, so there is no need to
      fetch it again from mb_find_buddy() with order 0.
      
      We can safely remove the second mb_find_buddy() call on line 615.
      Signed-off-by: Coly Li <bosong.ly@taobao.com>
      Cc: Alex Tomas <alex@clusterfs.com>
      Cc: Theodore Tso <tytso@google.com>
    • ext4: code cleanup in mb_find_buddy() · 84b775a3
      Coly Li authored
      The current code calculates max regardless of whether order is zero,
      which is unnecessary.  This cleanup patch sets max to
      "1 << (e4b->bd_blkbits + 3)" only when order == 0.
      Signed-off-by: Coly Li <bosong.ly@taobao.com>
      Cc: Alex Tomas <alex@clusterfs.com>
      Cc: Theodore Tso <tytso@google.com>
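As a worked example of the order-0 expression quoted in this commit: assuming bd_blkbits holds the base-2 logarithm of the block size in bytes, adding 3 multiplies by 8, so the result is the number of bits in one block-sized bitmap. The function name demo_order0_max() is hypothetical, and the snippet only reproduces the arithmetic, not mb_find_buddy().

```c
#include <assert.h>
#include <limits.h>

/*
 * Number of bits in a bitmap that fills one block, given the block
 * size as a power of two (blkbits = log2 of the size in bytes).
 */
static unsigned int demo_order0_max(unsigned int blkbits)
{
    /* Guard against shifting past the width of unsigned int. */
    assert(blkbits + 3 < sizeof(unsigned int) * CHAR_BIT);
    return 1u << (blkbits + 3);
}
```

For a 4096-byte block (blkbits = 12) this gives 1 << 15, which is 32768 bits. For a 1024-byte block (blkbits = 10) it gives 8192.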
  9. 23 Feb, 2011 4 commits
  10. 22 Feb, 2011 4 commits