1. 28 Oct, 2010 17 commits
    • ext4: improve llseek error handling for overly large seek offsets · e0d10bfa
      Toshiyuki Okajima authored
      The llseek system call should return EINVAL if it is passed a seek
      offset at which a write would fail.  The maximum valid offset depends
      on whether the huge_file file system feature is set and on whether
      the file is extent-based.
      
      
      If the file has no "EXT4_EXTENTS_FL" flag, the maximum size which can be
      written (write system call) is different from the maximum size which can be
      sought (lseek system call).
      
      For example, the following 2 cases demonstrate the differences
      between the maximum size which can be written and the seek offset
      allowed by the llseek system call:
      
      #1: mkfs.ext3 <dev>; mount -t ext4 <dev>
      #2: mkfs.ext3 <dev>; tune2fs -Oextent,huge_file <dev>; mount -t ext4 <dev>
      
      Table: the maximum file size which can be written or sought
             for each filesystem feature setting and file flag setting
      +============+===============================+===============================+
      | \ File flag|                               |                               |
      |      \     |     !EXT4_EXTENTS_FL          |        EXT4_EXTENTS_FL        |
      |case       \|                               |                               |
      +------------+-------------------------------+-------------------------------+
      | #1         |   write:      2194719883264   | write:       --------------   |
      |            |   seek:       2199023251456   | seek:        --------------   |
      +------------+-------------------------------+-------------------------------+
      | #2         |   write:      4402345721856   | write:       17592186044415   |
      |            |   seek:      17592186044415   | seek:        17592186044415   |
      +------------+-------------------------------+-------------------------------+
      
      The differences exist because ext4 has two size limits: sb->s_maxbytes
      (the extent-mapped maximum) and EXT4_SB(sb)->s_bitmap_maxbytes (the
      block-mapped maximum).  However, generic_file_llseek(), which
      ext4_file_operations uses for llseek, checks only sb->s_maxbytes.
      
      Therefore we add an ext4-specific llseek function which uses both limits.
      
      The new function is derived from generic_file_llseek().
      If the file's "EXT4_EXTENTS_FL" flag is not set, the function uses
      EXT4_SB(inode->i_sb)->s_bitmap_maxbytes in place of inode->i_sb->s_maxbytes.
      Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      e0d10bfa
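The figures in the table above can be reproduced with a little arithmetic. The sketch below is a Python model, not the kernel code. It assumes 4 KiB blocks (consistent with the table's values, but not stated in the commit message), the function names are hypothetical, and the formulas are a simplified reading of how ext4 derives its two limits.

```python
BLOCK_BITS = 12  # assumption: 4 KiB filesystem blocks


def extent_maxbytes(huge_file, bits=BLOCK_BITS):
    """Model of sb->s_maxbytes, the limit for extent-mapped files."""
    # Extents address blocks with a 32-bit logical block number.
    limit = (1 << (32 + bits)) - 1
    if not huge_file:
        # Without huge_file, i_blocks is a 32-bit count of 512-byte sectors.
        sector_cap = (((1 << 32) - 1) >> (bits - 9)) << bits
        limit = min(limit, sector_cap)
    return limit


def bitmap_maxbytes(huge_file, bits=BLOCK_BITS):
    """Model of s_bitmap_maxbytes, the limit for block-mapped files."""
    ptrs = 1 << (bits - 2)  # 4-byte block pointers per indirect block
    # Blocks reachable through direct, single, double and triple indirection.
    addressable = 12 + ptrs + ptrs**2 + ptrs**3
    # Indirect blocks also count against the i_blocks budget.
    meta = 1 + (1 + ptrs) + (1 + ptrs + ptrs**2)
    if huge_file:
        budget = (1 << 48) - 1  # 48-bit i_blocks counted in filesystem blocks
    else:
        budget = ((1 << 32) - 1) >> (bits - 9)
    return min(addressable << bits, (budget - meta) << bits)


def seek_limit(extents_flag, huge_file):
    """Pick the limit an ext4-specific llseek would check an offset against."""
    if extents_flag:
        return extent_maxbytes(huge_file)
    return bitmap_maxbytes(huge_file)


for case, huge in (("#1", False), ("#2", True)):
    print(case, "s_bitmap_maxbytes:", bitmap_maxbytes(huge),
          "s_maxbytes:", extent_maxbytes(huge))
```

In this model the block-mapped limit matches the "write" column for files without EXT4_EXTENTS_FL, while the generic llseek accepted offsets up to the larger s_maxbytes value.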
    • ext4: don't update sb journal_devnum when RO dev · c41303ce
      Maciej Żenczykowski authored
      An ext4 filesystem on a read-only device, with an external journal
      which is at a different device number than recorded in the superblock,
      will fail to honor the read-only setting of the device and trigger
      a superblock update (write).
      
      For example:
        - ext4 on a software raid which is in read-only mode
        - external journal on a read-write device which has changed device num
        - attempt to mount with -o journal_dev=<new_number>
        - hits BUG_ON(mddev->ro == 1) in md.c
      
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Maciej Żenczykowski <zenczykowski@gmail.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      c41303ce
    • ext4: use sb_issue_zeroout in ext4_ext_zeroout · 2407518d
      Lukas Czerner authored
      Change ext4_ext_zeroout to use sb_issue_zeroout instead of its
      own approach to zero out extents.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      2407518d
    • ext4: use sb_issue_zeroout in setup_new_group_blocks · a31437b8
      Lukas Czerner authored
      Use sb_issue_zeroout to zero out inode table and descriptor table
      blocks instead of the old approach, which involves journaling.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      a31437b8
    • ext4: add interface to advertise ext4 features in sysfs · 857ac889
      Lukas Czerner authored
      User space should be able to check which features a particular copy
      of ext4 supports. This adds a simple interface by creating a new
      "features" directory in /sys/fs/ext4/. Files advertising feature
      names can be created in that directory.
      
      Add lazy_itable_init to the feature list.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      857ac889
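A user-space check against this interface might look like the sketch below. It assumes the directory is exposed as /sys/fs/ext4/features and that each advertised feature appears as one file named after the feature. The helper names are hypothetical.

```python
import os

# Assumed location of the "features" directory described above.
FEATURES_DIR = "/sys/fs/ext4/features"


def ext4_features(root=FEATURES_DIR):
    """Return the advertised feature names, or an empty list if none are exposed."""
    try:
        return sorted(os.listdir(root))
    except OSError:
        # Directory missing: ext4 not loaded, or a kernel without this interface.
        return []


def supports(feature, root=FEATURES_DIR):
    """True if the running ext4 advertises the given feature name."""
    return feature in ext4_features(root)


print("lazy_itable_init supported:", supports("lazy_itable_init"))
```

A tool such as a mkfs wrapper could use a check like this to decide whether skipping inode table zeroing is safe on the running kernel.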
    • ext4: add support for lazy inode table initialization · bfff6873
      Lukas Czerner authored
      When the lazy_itable_init extended option is passed to mke2fs, it
      considerably speeds up filesystem creation because inode tables are
      not zeroed out.  The fact that parts of the inode table are
      uninitialized is not a problem so long as the block group descriptors,
      which contain information regarding how much of the inode table has
      been initialized, have not been corrupted.  However, if the block group
      checksums are not valid, e2fsck must scan the entire inode table, and
      the old, uninitialized data could potentially cause e2fsck to
      report false problems.
      
      Hence, it is important for the inode tables to be initialized as soon
      as possible.  This commit adds this feature so that mke2fs can safely
      use the lazy inode table initialization feature to speed up formatting
      file systems.
      
      This is done via a new kernel thread called ext4lazyinit, which is
      created on demand and destroyed when it is no longer needed.  There
      is only one thread for all ext4 filesystems in the system.  When the
      first filesystem with the init_itable mount option is mounted, the
      ext4lazyinit thread is created, and the filesystem can then register
      its request in the request list.
      
      This thread then walks through the list of requests, picking up
      scheduled requests and invoking ext4_init_inode_table().  The next
      schedule time for a request is computed by multiplying the time it
      took to zero out the last inode table by a wait multiplier, which can
      be set with the (init_itable=n) mount option (default is 10).  We do
      this so that we do not take the whole I/O bandwidth.  When the thread
      is no longer necessary (the request list is empty), it frees the
      appropriate structures and exits (and can be created again later by
      another filesystem).
      
      We do not disturb regular inode allocations in any way; the allocator
      simply does not care whether the inode table is zeroed or not.  But
      when zeroing, we obviously have to skip used inodes.  We should also
      prevent new inode allocations from the group while zeroing is under
      way.  For that we take alloc_sem for writing in
      ext4_init_inode_table() and for reading in ext4_claim_inode(), so when
      we are unlucky and the allocator hits the group which is currently
      being zeroed, it just has to wait.
      
      This can be suppressed using the mount option no_init_itable.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      bfff6873
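The throttling rule described above (wait for the last zeroing time multiplied by the wait multiplier) can be sketched as follows. This is a model only: the function names are hypothetical, and it assumes the wait starts once a table has been zeroed and that every table takes about the same time.

```python
DEFAULT_MULTIPLIER = 10  # default of the init_itable=n mount option


def next_delay(zeroing_time, multiplier=DEFAULT_MULTIPLIER):
    """Time to wait before the request is picked up again (same unit as the input)."""
    return zeroing_time * multiplier


def busy_fraction(multiplier=DEFAULT_MULTIPLIER):
    """Share of wall-clock time spent zeroing under the assumptions stated above."""
    # One zeroing period followed by `multiplier` periods of waiting.
    return 1 / (1 + multiplier)


# A table that took 200 ms to zero delays the next one by 2000 ms.
print(next_delay(200))
print(round(busy_fraction(), 3))
```

Under these assumptions the default multiplier keeps the thread busy for roughly one eleventh of the time, and a larger n in init_itable=n lowers that share further.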
    • Add helper function for blkdev_issue_zeroout (sb_issue_discard) · e6fa0be6
      Lukas Czerner authored
      This is done the same way as the sb_issue_discard helper for
      blkdev_issue_discard.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      e6fa0be6
    • jbd2: Add sanity check for attempts to start handle during umount · 5c2178e7
      Theodore Ts'o authored
      An attempt to modify the file system during the call to
      jbd2_destroy_journal() can lead to a system lockup.  So add some
      checking to make it much more obvious when this happens and to
      determine where the offending code is located.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      5c2178e7
    • ext4: fix NULL pointer dereference in print_daily_error_info() · a1c6c569
      Sergey Senozhatsky authored
      Fix a NULL pointer dereference in print_daily_error_info() when it is
      called on an unmounted fs (EXT4_SB(sb) returns NULL) by removing the
      error reporting timer in ext4_put_super().
      
      Google-Bug-Id: 3017663
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      a1c6c569
    • ext4: don't hold spinlock while calling ext4_issue_discard() · 53fdcf99
      Lukas Czerner authored
      We can't hold the block group spinlock because ext4_issue_discard()
      calls wait and hence can get rescheduled.
      
      Google-Bug-Id: 3017678
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      53fdcf99
    • ext4: check for negative error code from sb_issue_discard · 58298709
      Lukas Czerner authored
      sb_issue_discard() returns a negative error code, so check for
      -EOPNOTSUPP.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      58298709
    • ext4: don't bump up LONG_MAX nr_to_write by a factor of 8 · b443e733
      Eric Sandeen authored
      I'm uneasy with lots of stuff going on in ext4_da_writepages(),
      but bumping nr_to_write from LLONG_MAX to -8 clearly isn't
      making anything better, so avoid the multiplier in that case.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      b443e733
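The wraparound mentioned above is easy to check by hand. The sketch below models signed 64-bit arithmetic in Python. It assumes a 64-bit long (so LONG_MAX and LLONG_MAX are the same value), and the helper names are hypothetical rather than taken from ext4_da_writepages().

```python
LONG_MAX = (1 << 63) - 1  # assumption: 64-bit signed long


def wrap_long(value):
    """Reduce an integer to signed 64 bits, as two's-complement wraparound would."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >= (1 << 63) else value


def bump_nr_to_write(nr_to_write, factor=8):
    """Scale nr_to_write, but leave the 'write everything' value LONG_MAX alone."""
    if nr_to_write == LONG_MAX:
        return nr_to_write
    return wrap_long(nr_to_write * factor)


print(wrap_long(LONG_MAX * 8))     # what the unguarded multiplication yields
print(bump_nr_to_write(LONG_MAX))  # guarded: the value stays at LONG_MAX
print(bump_nr_to_write(1024))      # ordinary values are still scaled
```

The first line prints -8, which is the value the commit message refers to.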
    • ext4: stop looping in ext4_num_dirty_pages when max_pages reached · 659c6009
      Eric Sandeen authored
      Today we simply break out of the inner loop when we have accumulated
      max_pages; the while (!done) loop then keeps scanning forward and
      calling pagevec_lookup_tag(), which is potentially a lot of work
      with no net effect.
      
      When we have accumulated max_pages, just clean up and return.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      659c6009
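The difference between breaking out of the inner loop and returning can be illustrated with a toy model. This is not the kernel function: each inner list stands in for one batch returned by a lookup, and the names and batch sizes are made up for illustration.

```python
def count_pages(batches, max_pages, stop_early):
    """Return (pages_counted, lookups_done) for a nested scan over page batches."""
    counted = 0
    lookups = 0
    for batch in batches:  # outer loop: one lookup per batch
        lookups += 1
        for _page in batch:  # inner loop: pages within the batch
            if counted < max_pages:
                counted += 1
            if counted >= max_pages:
                if stop_early:
                    return counted, lookups  # new behaviour: finish immediately
                break  # old behaviour: only the inner loop ends
    return counted, lookups


batches = [[0] * 14 for _ in range(5)]  # five hypothetical batches of 14 pages
print(count_pages(batches, 20, stop_early=False))
print(count_pages(batches, 20, stop_early=True))
```

Both calls count the same number of pages, but the early return skips the lookups that could no longer change the result.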
    • ext4: use dedicated slab caches for group_info structures · fb1813f4
      Curt Wohlgemuth authored
      ext4_group_info structures are currently allocated with kmalloc().
      With a typical 4K block size, these are 136 bytes each -- meaning
      they'll each consume a 256-byte slab object.  On a system with many
      large ext4 partitions, that's a lot of wasted kernel slab space.
      (E.g., a single 1TB partition will have about 8000 block groups, using
      about 2MB of slab, of which nearly 1MB is wasted.)
      
      This patch creates an array of slab pointers created as needed --
      depending on the superblock block size -- and uses these slabs to
      allocate the group info objects.
      
      Google-Bug-Id: 2980809
      Signed-off-by: Curt Wohlgemuth <curtw@google.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      fb1813f4
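The estimate in the message above can be checked with a short calculation. The sketch takes the 136-byte structure size and the 256-byte slab object size from the commit message. It assumes that a block group spans 8 * block_size blocks and that "1TB" means 2^40 bytes. The function name is hypothetical.

```python
def group_info_slab_usage(partition_bytes, block_size=4096,
                          struct_size=136, slab_object_size=256):
    """Return (block_groups, slab_bytes_used, slab_bytes_wasted)."""
    # Assumption: one bitmap block tracks 8 * block_size blocks per group.
    blocks_per_group = 8 * block_size
    group_bytes = blocks_per_group * block_size
    groups = partition_bytes // group_bytes
    used = groups * slab_object_size
    wasted = groups * (slab_object_size - struct_size)
    return groups, used, wasted


groups, used, wasted = group_info_slab_usage(1 << 40)
print(groups, "groups,", used / 2**20, "MiB of slab,",
      round(wasted / 2**20, 2), "MiB wasted")
```

With these inputs the model gives 8192 groups, 2 MiB of slab and a little under 1 MiB of waste, in line with the figures quoted in the message.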
    • ext4: avoid null dereference in trace_ext4_mballoc_discard · b853fd36
      Wen Congyang authored
      ac->inode is set to NULL in ext4_mb_release_group_pa(), so when
      trace_ext4_mballoc_discard(ac) is then called, the kernel
      panics.
      
      BUG: unable to handle kernel NULL pointer dereference at 000000a4
      IP: [<f87e1714>] ftrace_raw_event_ext4__mballoc+0x54/0xc0 [ext4]
      *pdpt = 0000000000abd001 *pde = 0000000000000000
      Oops: 0000 [#1] SMP
      
      Pid: 550, comm: flush-8:16 Not tainted 2.6.36-rc1 #1 SE7320EP2/Altos G530
      EIP: 0060:[<f87e1714>] EFLAGS: 00010206 CPU: 1
      EIP is at ftrace_raw_event_ext4__mballoc+0x54/0xc0 [ext4]
      EAX: f32ac840 EBX: f3f1cf88 ECX: f32ac840 EDX: 00000000
      ESI: f32ac83c EDI: f880b9d8 EBP: 00000000 ESP: f4b77ae4
       DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
      Process flush-8:16 (pid: 550, ti=f4b76000 task=f613e540 task.ti=f4b76000)
      Call Trace:
       [<f87f5ac1>] ? ext4_mb_release_group_pa+0x121/0x150 [ext4]
       [<f87f8356>] ? ext4_mb_discard_group_preallocations+0x336/0x400 [ext4]
       [<f87fb7f1>] ? ext4_mb_new_blocks+0x3d1/0x4f0 [ext4]
       [<c05a6c5b>] ? __make_request+0x10b/0x440
       [<f87f1fb4>] ? ext4_ext_map_blocks+0x1334/0x1980 [ext4]
       [<c04ac78a>] ? rb_reserve_next_event+0xaa/0x3b0
       [<f87d18d6>] ? ext4_map_blocks+0xd6/0x1d0 [ext4]
       [<f87d2da7>] ? mpage_da_map_blocks+0xc7/0x8a0 [ext4]
       [<c04c8a68>] ? find_get_pages_tag+0x38/0x110
       [<c04d23a5>] ? __pagevec_release+0x15/0x20
       [<f87d3ca5>] ? ext4_da_writepages+0x2b5/0x5d0 [ext4]
       [<c04cfbe0>] ? __writepage+0x0/0x30
       [<c04d0e34>] ? do_writepages+0x14/0x30
       [<c0526600>] ? writeback_single_inode+0xa0/0x240
       [<c0526971>] ? writeback_sb_inodes+0xc1/0x180
       [<c0526ab8>] ? writeback_inodes_wb+0x88/0x140
       [<c0526d7b>] ? wb_writeback+0x20b/0x320
       [<c045aca7>] ? lock_timer_base+0x27/0x50
       [<c0526fe0>] ? wb_do_writeback+0x150/0x190
       [<c05270a8>] ? bdi_writeback_thread+0x88/0x1f0
       [<c043b680>] ? complete+0x40/0x60
       [<c0527020>] ? bdi_writeback_thread+0x0/0x1f0
       [<c0469474>] ? kthread+0x74/0x80
       [<c0469400>] ? kthread+0x0/0x80
       [<c040a23e>] ? kernel_thread_helper+0x6/0x10
      Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      b853fd36
    • jbd2: Fix I/O hang in jbd2_journal_release_jbd_inode · 39e3ac25
      Brian King authored
      This fixes a hang seen in jbd2_journal_release_jbd_inode
      on a lot of Power 6 systems running with ext4. When we get
      in the hung state, all I/O to the disk in question gets blocked
      where we stay indefinitely. Looking at the task list, I can see
      we are stuck in jbd2_journal_release_jbd_inode waiting on a
      wake up. I added some debug code to detect this scenario and
      dump additional data if we were stuck in jbd2_journal_release_jbd_inode
      for longer than 30 minutes. When it hit, I was able to see that
      i_flags was 0, suggesting we missed the wake up.
      
      This patch changes i_flags to be an unsigned long, uses bit operators
      to access it, and adds barriers around the accesses. Prior to applying
      this patch, we were regularly hitting this hang on numerous systems
      in our test environment. After applying the patch, the hangs no longer
      occur.
      Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      39e3ac25
    • ext4: fix EOFBLOCKS_FL handling · 58590b06
      Theodore Ts'o authored
      It turns out we have several problems with how EOFBLOCKS_FL is
      handled.  First of all, there was a fencepost error where we were not
      clearing EOFBLOCKS_FL when filling in the last uninitialized block,
      but rather when we allocated the next block _after_ the uninitialized
      block.  Secondly, we were not testing to see if we needed to clear
      EOFBLOCKS_FL when writing to the file with O_DIRECT or when we were
      converting an uninitialized block (which is the most common case).
      
      Google-Bug-Id: 2928259
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      58590b06
  2. 29 Sep, 2010 2 commits
  3. 28 Sep, 2010 9 commits
  4. 27 Sep, 2010 12 commits