1. 29 Oct, 2002 18 commits
      [PATCH] bad scsi merge · ad3fb438
      Jens Axboe authored
      When someone deleted scsi_merge, they also killed the fixes I sent to
      you earlier...
      [PATCH] much miscellany · 2e7f5efb
      Andrew Morton authored
      - add locking comments to do_mmap_pgoff(), filemap.c
      
      - use unsigned long for cpu flags in aio.c (Andi)
      
      - An x86-64 typo fix from Andi.
      
      - Fix a typo
      
      - Fix an unused var warning in the stack overflow check code
      
      - mptlan compile fix (Rasmus Andersen)
      
      - Update misleading comment in ia32 highmem.c
      
      - "attempting to mount an ext3 fs on a stopped md/raid1 array caused a
         divide by 0 error in ext3_fill_super.  Fix duplicates check already
         in ext2." - Angus Sawyer <angus.sawyer@dsl.pipex.com>
      
      - Someone changed the return type of inl() again! Fix up compiler
        warnings in 3c59x.c again.
      [PATCH] don't invalidate pagecache after direct-IO reads · c4c95471
      Andrew Morton authored
      There's no need to take down pagecache after performing direct-IO reads
      from a file or a blockdevice.
      
      And when using direct access to a blockdev which has a filesystem
      mounted it creates unnecessary disturbance of filesystem activity.
      [PATCH] thread-aware oom-killer · f7844601
      Andrew Morton authored
      From Ingo
      
      - performance optimization: do not kill threads in the same thread group
        as the OOM-ing thread. (it's still necessary to scan over every thread
        though, as it's possible to have CLONE_VM threads in a different thread
        group - we do not want those to escape the OOM-kill.)
      
      - do not let newly created child threads slip out of the group-kill. Note
        that the 2.4 kernel's OOM handler has the same problem, and it could be
        the reason why forkbombs occasionally slip out of the OOM kill.
      [PATCH] shrink_slab arith overflow fix · d08b03c5
      Andrew Morton authored
      shrink_slab() wants to calculate
      
      	nr_scanned_pages * seeks_per_object * entries_in_slab /
      		nr_lru_pages
      
      entries_in_slab and nr_lru_pages can vary a lot.  There is a potential
      for 32-bit overflows.
      
      I spent ages trying to avoid corner cases which cause a significant
      lack of precision while preserving some clarity.  Gave up and used
      do_div().  The code is called rarely - at most once per 128 kbytes of
      reclaim.
      
      The patch adds a tweak to balance_pgdat() to reduce the call rate to
      shrink_slab() in the case where the zone is just a little bit below
      pages_high.
      
      Also increase SHRINK_BATCH.  The things we're shrinking are typically a
      few hundred bytes, and a batchcount of 128 gives us a minimum of ten
      pages or so per shrinking callout.
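The overflow the patch avoids can be sketched in userspace C (hypothetical names; the kernel version uses a 64-bit intermediate plus do_div(), here rendered as plain unsigned long long arithmetic):

```c
#include <assert.h>

/* Hypothetical userspace analog of the shrink_slab() calculation.
 * A 32-bit product pages * seeks * entries can overflow, so the fix
 * widens the intermediate to 64 bits before dividing by lru_pages. */
static unsigned long slab_scan_target(unsigned long pages,
                                      unsigned long seeks,
                                      unsigned long entries,
                                      unsigned long lru_pages)
{
    unsigned long long delta =
        (unsigned long long)pages * seeks * entries;
    return (unsigned long)(delta / lru_pages);
}
```

With the values in this sketch, a 32-bit intermediate would wrap to zero, while the widened arithmetic yields the intended scan target.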
      [PATCH] uninline the ia32 copy_*_user functions · 0a7bf9c8
      Andrew Morton authored
      There's more work to do on these, for well-aligned copies.
      Arjan has some stuff for that.   First step on that path is
      to clean the code up, get it uninlined and have a framework for
      making per-CPU-type decisions.
      [PATCH] faster copy_*_user for bad alignments on intel ia32 · a792a27c
      Andrew Morton authored
      This patch speeds up copy_*_user for some Intel ia32 processors.  It is
      based on work by Mala Anand.
      
      It is a good win.  Around 30% for all src/dest alignments except 32/32.
      
      In this test a fully-cached one gigabyte file was read into an
      8192-byte userspace buffer using read(fd, buf, 8192).  The alignment of
      the user-side buffer was altered between runs.  This is a PIII.  Times
      are in seconds.
      
      User buffer	2.5.41		2.5.41+patch
      
      0x804c000	4.373		4.343
      0x804c001	10.024		6.401
      0x804c002	10.002		6.347
      0x804c003	10.013		6.328
      0x804c004	10.105		6.273
      0x804c005	10.184		6.323
      0x804c006	10.179		6.322
      0x804c007	10.185		6.319
      0x804c008	9.725		6.347
      0x804c009	9.780		6.275
      0x804c00a	9.779		6.355
      0x804c00b	9.778		6.350
      0x804c00c	9.723		6.351
      0x804c00d	9.790		6.307
      0x804c00e	9.790		6.289
      0x804c00f	9.785		6.294
      0x804c010	9.727		6.277
      0x804c011	9.779		6.251
      0x804c012	9.783		6.246
      0x804c013	9.786		6.245
      0x804c014	9.772		6.063
      0x804c015	9.919		6.237
      0x804c016	9.920		6.234
      0x804c017	9.918		6.237
      0x804c018	9.846		6.372
      0x804c019	10.060		6.294
      0x804c01a	10.049		6.328
      0x804c01b	10.041		6.337
      0x804c01c	9.931		6.347
      0x804c01d	10.013		6.273
      0x804c01e	10.020		6.346
      0x804c01f	10.016		6.356
      0x804c020	4.442		4.366
      
      So `rep;movsl' is slower at all non-cache-aligned offsets.
      
      PII is using the PIII alignment.  I don't have a PII any more, but I do
      recall that it demonstrated the same behaviour as the PIII.
      
      The patch contains an enhancement (based on careful testing) from
      Hirokazu Takahashi <taka@valinux.co.jp>.  In cases where source and
      dest have the same alignment, but that alignment is poor, we do a short
      copy of a few bytes to bring the two pointers onto a favourable
      boundary and then do the big copy.
      
      And also a bugfix from Hirokazu Takahashi.
      
      As an added bonus, this patch decreases the kernel text by 28 kbytes.
      22k of this is in .text and the rest in __ex_table.  I'm not really
      sure why .text shrunk so much.
      
      These copy routines have no special-case for constant-sized copies.  So
      a lot of uaccess.h becomes dead code with this patch.  The next patch
      which uninlines the copy_*_user functions cleans all that up and saves
      an additional 5k.
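The pre-alignment enhancement described above can be illustrated with a hypothetical userspace sketch (names invented; the kernel code is hand-written assembly): copy a few head bytes so both pointers reach a word boundary, then do the bulk copy word-at-a-time, then mop up the tail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical illustration of the trick: when src and dst share the
 * same (poor) alignment, a short byte copy brings both onto a 4-byte
 * boundary, and the bulk copy then runs fully aligned. */
static void copy_aligned_fast(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Head: advance to a 4-byte boundary (valid only because src and
     * dst share the same low address bits). */
    while (len && ((uintptr_t)d & 3)) {
        *d++ = *s++;
        len--;
    }
    /* Bulk: word-at-a-time copy, standing in for `rep;movsl'. */
    uint32_t *dw = (uint32_t *)d;
    const uint32_t *sw = (const uint32_t *)s;
    while (len >= 4) {
        *dw++ = *sw++;
        len -= 4;
    }
    /* Tail: any remaining bytes. */
    d = (unsigned char *)dw;
    s = (const unsigned char *)sw;
    while (len--)
        *d++ = *s++;
}
```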
      [PATCH] export nr_running and nr_iowait tasks in /proc · 43c8cc21
      Andrew Morton authored
      From Rik.
      
      "this trivial patch, against 2.5-current, exports nr_running and
       nr_iowait_tasks in /proc/stat.  With this patch in vmstat will no
       longer need to walk all the processes in the system just to determine
       the number of running and blocked processes."
      [PATCH] radix_tree_gang_lookup fix · 7d196748
      Andrew Morton authored
      When performing lookups against very sparse trees,
      radix_tree_gang_lookup fails to find nodes "far" to the right of the
      start point, because it understands sparseness only in the leaf
      nodes, not in the intermediate nodes.
      
      Nobody noticed this because all callers are incrementing the start
      index as they walk the tree.
      
      Change it to terminate the search when it really has inspected the last
      possible node for the current tree's height.
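The termination bound can be sketched as a small helper (hypothetical; the fanout constant is an assumption for illustration): a tree of a given height can index at most so many slots, and the gang lookup may stop only once the search index passes that point.

```c
#include <assert.h>

#define RADIX_TREE_MAP_SHIFT 6  /* assumed fanout of 64 slots per node */

/* Hypothetical sketch: the highest index representable by a radix
 * tree of the given height.  A gang lookup should terminate at this
 * bound rather than at the first sparse intermediate node. */
static unsigned long radix_tree_max_index(unsigned int height)
{
    if (height * RADIX_TREE_MAP_SHIFT >= 8 * sizeof(unsigned long))
        return ~0UL;
    return (1UL << (height * RADIX_TREE_MAP_SHIFT)) - 1;
}
```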
      [PATCH] less buslocked operations in the page allocator · f5046231
      Andrew Morton authored
      Sort-of-but-not-really from Hugh Dickins.
      
      We're doing a lot of buslocked operations in the page allocator just
      for debug.  Plus when they _do_ trigger, there are so many BUG_ONs in
      there that it's rather hard to work out from user reports which one
      actually triggered.
      
      So redo all that and also print out some more useful info about the
      page state before taking the machine out.
      
      (And yes, we need to take the machine out.  Incorrect page handling
      in there can cause file corruption.)
      [PATCH] add a file_ra_state init function · 6b390b3b
      Andrew Morton authored
      Provide a function in core kernel to initialise a file_ra_state structure.
      
      Previously this was all taken care of by the fact that new struct
      files are all zeroed out.  But now a file_ra_state may be
      independently allocated, and we don't want users of it to have to know
      how to initialise it.
      [PATCH] permit direct IO with finer-than-fs-blocksize alignments · 4a4c6811
      Andrew Morton authored
      Mainly from Badari Pulavarty
      
      Traditionally we have only supported O_DIRECT I/O at an alignment and
      granularity which matches the underlying filesystem.  That typically
      means that all IO must be 4k-aligned and a multiple of 4k in size.
      
      Here, we relax that so that direct I/O happens with (typically)
      512-byte alignment and multiple-of-512-byte size.
      
      The tricky part is when a write starts and/or ends partway through a
      filesystem block which has just been added.  We need to zero out the
      parts of that block which lie outside the written region.
      
      We handle that by putting appropriately-sized parts of the ZERO_PAGE
      into separate BIOs.
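The geometry of that zeroing can be sketched as a hypothetical helper (names invented): given a write of `len` bytes at offset `pos` into freshly allocated blocks of size `blksz`, it computes how many bytes of the first and last blocks fall outside the written region and must be zero-filled.

```c
#include <assert.h>

/* Hypothetical sketch: bytes of the first block before the write
 * (head) and of the last block after it (tail) that must be zeroed,
 * in the patch by pointing BIO segments at the ZERO_PAGE. */
static void dio_zero_spans(unsigned long long pos, unsigned long len,
                           unsigned int blksz,
                           unsigned int *head, unsigned int *tail)
{
    unsigned long long end = pos + len;
    *head = (unsigned int)(pos % blksz);
    *tail = (unsigned int)((blksz - end % blksz) % blksz);
}
```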
      
      The generic_direct_IO() function has been changed so that the
      filesystem must pass in the address of the block_device against which
      the IO is to be performed.  I'd have preferred to not do this, but we
      do need that info at that time so that alignment checks can be
      performed.
      
      If the filesystem passes in a NULL block_device pointer then we fall
      back to the old behaviour - must align with the fs blocksize.
      
      There is no trivial way for userspace to know what the minimum
      alignment is - it depends on what bdev_hardsect_size() says about the
      device.  It is _usually_ 512 bytes, but not always.  This introduces
      the risk that someone will develop and test applications which work
      fine on their hardware, but will fail on someone else's hardware.
      
      It is possible to query the hardsect size using the BLKSSZGET ioctl
      against the backing block device.  This can be performed at runtime or
      at application installation time.
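A minimal sketch of that query, assuming a Linux system with the BLKSSZGET ioctl available (the helper name is invented): it returns the hardware sector size, or -1 when the descriptor is not a block device, in which case the caller must fall back to the filesystem-blocksize rule.

```c
#include <assert.h>
#include <fcntl.h>
#include <linux/fs.h>   /* BLKSSZGET */
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical helper: query the minimum O_DIRECT alignment of the
 * backing block device via BLKSSZGET.  Fails (returns -1) for a
 * regular file, where the fs-blocksize rule applies instead. */
static int direct_io_alignment(int fd)
{
    int ssz;
    if (ioctl(fd, BLKSSZGET, &ssz) < 0)
        return -1;
    return ssz;
}
```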
      [PATCH] restructure direct-io to suit bio_add_page · a9577554
      Andrew Morton authored
      The direct IO code was initially designed to allocate a known-sized
      BIO, to fill it with pages and to then send it off.
      
      Then along came bio_add_page().  Really, it broke direct-io.c - it
      meant that the direct-IO BIO assembly code no longer had a-priori
      knowledge of whether a page would fit into the current BIO.
      
      Our attempts to rework the initial design to play well with
      bio_add_page() really weren't adequate.  The code was getting more and
      more twisty and we kept finding corner-cases which failed.
      
      So this patch redesigns the BIO assembly and submission path of the
      direct-IO code so that it better suits the bio_add_page() semantics.
      
      It introduces another layer in the assembly phase: the 'cur_page' which
      is cached in the dio structure.
      
      The function which walks the file mapping, do_direct_IO(), simply
      emits a sequence of (page,offset,len,sector) quads into the next
      layer down, submit_page_section().
      
      submit_page_section() is responsible for looking for a merge of the new
      quad against the previous page section (same page).  If no merge is
      possible it passes the currently-cached page down to the next level,
      dio_send_cur_page().
      
      dio_send_cur_page() will try to add the current page to the current
      BIO.  If that fails, the current BIO is submitted for IO and we open a
      new one.
      
      So it's all nicely layered.  The assembly of sections-of-page into the
      current page closely mirrors the assembly of sections-of-BIO into the
      current BIO.
      
      At both of these levels everything is done in a "deferred" manner: try
      to merge a new request onto the currently-cached one.  If that fails
      then send the currently-cached request and then cache this one instead.
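The deferred-merge pattern described above can be reduced to a hypothetical miniature (all names invented; a flush stands in for submitting a BIO): try to merge each new quad into the cached extent, and flush only when the new one is not contiguous.

```c
#include <assert.h>

struct extent { unsigned long sector; unsigned long len; int valid; };

static int flushes;  /* stands in for BIO submissions */

/* Send the cached extent, if any, and invalidate the cache. */
static void flush_extent(struct extent *cur)
{
    if (cur->valid)
        flushes++;
    cur->valid = 0;
}

/* Hypothetical analog of submit_page_section(): merge the new
 * (sector, len) quad into the cached extent when contiguous,
 * otherwise send the cached one and cache this one instead. */
static void submit_section(struct extent *cur,
                           unsigned long sector, unsigned long len)
{
    if (cur->valid && cur->sector + cur->len == sector) {
        cur->len += len;   /* merge with the cached extent */
        return;
    }
    flush_extent(cur);
    cur->sector = sector;
    cur->len = len;
    cur->valid = 1;
}
```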
      
      Some variables have been renamed to more closely represent their usage.
      
      Some thought has been put into ownership of the various state variables
      within `struct dio'.  We were updating and inspecting these in various
      places in a rather hard-to-follow manner.  So things have been reworked
      so that particular functions "own" particular parts of the dio
      structure.  Violators have been exterminated and commentary has been
      added to describe this ownership.
      
      The handling of file holes has been simplified.
      
      As a consequence of all this, the code is clearer and simpler than it
      used to be, and it now passes the modified-for-O_DIRECT fsx-linux
      testing again.
      [PATCH] invalidate_inode_pages fixes · caa2f807
      Andrew Morton authored
      Two fixes here.
      
      First:
      
      Fixes a BUG() which occurs if you try to perform O_DIRECT IO against a
      blockdev which has an fs mounted on it.  (We should be able to do
      that).
      
      What happens is that do_invalidatepage() ends up calling
      discard_buffer() on buffers which it couldn't strip.  That clears
      buffer_mapped() against useful things like the superblock buffer_head.
      The next submit_bh() goes BUG over the write of an unmapped buffer.
      
      So just run try_to_release_page() (aka try_to_free_buffers()) on the
      invalidate path.
      
      
      Second:
      
      The invalidate_inode_pages() functions are best-effort pagecache
      shrinkers.  They are used against pages inside i_size and are not
      supposed to throw away dirty data.
      
      However it is possible for another CPU to run set_page_dirty() against
      one of these pages after invalidate_inode_pages() has decided that it
      is clean.  This could happen if someone was performing O_DIRECT IO
      against a file which was also mapped with MAP_SHARED.
      
      So recheck the dirty state of the page inside the mapping->page_lock
      and back out if the page has just been marked dirty.
      
      This will also prevent the remove_from_page_cache() BUG which will occur
      if someone marks the page dirty between the clear_page_dirty() and
      remove_from_page_cache() calls in truncate_complete_page().
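The second fix follows a check/lock/re-check pattern, sketched here as a hypothetical single-threaded miniature (names invented; the flag merely marks where mapping->page_lock would be held): decide the page is clean, then re-check its dirty bit under the lock and back out if it was dirtied in the window.

```c
#include <assert.h>

struct page { int dirty; int present; };

static int page_lock_held;  /* stands in for mapping->page_lock */

/* Hypothetical analog of the invalidate path: returns 1 if the page
 * was removed, 0 if it was kept because it is (or just became) dirty. */
static int try_invalidate(struct page *pg)
{
    if (pg->dirty)
        return 0;              /* unlocked fast-path check */
    page_lock_held = 1;        /* take mapping->page_lock */
    if (pg->dirty) {           /* re-check: raced with set_page_dirty()? */
        page_lock_held = 0;
        return 0;              /* back out, keep the page */
    }
    pg->present = 0;           /* remove_from_page_cache() */
    page_lock_held = 0;
    return 1;
}
```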
      [PATCH] libfs a_ops correctness · 303c9cf6
      Andrew Morton authored
      simple_prepare_write() currently memsets the entire page.  It only
      needs to clear the parts which are outside the to-be-written region.
      This change makes no difference to performance - that memset was just a
      cache preload for the copy_from_user() in generic_file_write().  But
      it's more correct.
      
      Also, mark the page dirty in simple_commit_write(), not in
      simple_prepare_write().  Because the page's contents are changed after
      prepare_write().  This doesn't matter in practice, but it is setting a
      bad example.
      
      Also, add a flush_dcache_page() to simple_prepare_write().  Again, not
      really needed because the page cannot be mapped into pagetables if it
      is not uptodate.  But it is example code and should not be missing such
      things.
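The partial zeroing can be sketched in a few lines (hypothetical helper name, and PAGE_SIZE is assumed to be 4096 for the sketch): only the regions outside the [from, to) span about to be overwritten are cleared.

```c
#include <assert.h>
#include <string.h>

#define PAGE_SIZE 4096  /* assumed for this sketch */

/* Hypothetical analog of the corrected simple_prepare_write():
 * clear only the bytes outside the to-be-written [from, to) span,
 * rather than memset()ing the whole page. */
static void prepare_write_zero(unsigned char *page,
                               unsigned int from, unsigned int to)
{
    memset(page, 0, from);
    memset(page + to, 0, PAGE_SIZE - to);
}
```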
      [PATCH] move ramfs a_ops into libfs · 3ee477f0
      Andrew Morton authored
      From Bill Irwin.
      
      Abstract out ramfs readpage(), prepare_write(), and commit_write()
      operations.
      
      Ram-backed filesystems are going to be doing a lot of zero-filled read
      and write operations.  So in this patch, ramfs' implementations are
      moved to libfs in anticipation of other callers.
      [PATCH] blkdev_get_block fix · f596aeef
      Andrew Morton authored
      Patch from Hugh Dickins <hugh@veritas.com>
      
      Fix premature -EIO from blkdev_get_block: have bdget() initialize
      bd_block_size consistently with bd_inode->i_blkbits (assigned by
      new_inode).  Otherwise a subsequent set_blocksize() can find that
      bd_block_size doesn't need updating and skip updating i_blkbits,
      leaving the two inconsistent.
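The invariant the fix restores can be stated as a one-line check (hypothetical helper): the cached blocksize and the inode's blkbits must describe the same size, i.e. block_size == 1 << i_blkbits.

```c
#include <assert.h>

/* Hypothetical sketch of the invariant between bd_block_size and
 * bd_inode->i_blkbits that the fix keeps consistent. */
static int blkdev_size_consistent(unsigned int block_size,
                                  unsigned int i_blkbits)
{
    return block_size == (1U << i_blkbits);
}
```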
      [PATCH] fix dmi compile warning · ba3d6419
      Andrew Morton authored
      Local variable `data' is only used for debugging.
  2. 28 Oct, 2002 22 commits