1. 12 Apr, 2004 40 commits
    • [PATCH] writeback efficiency and QoS improvements · 9672a337
      Andrew Morton authored
      The radix-tree walk for writeback has a couple of problems:
      
      a) It always scans a file from its first dirty page, so if someone
         is repeatedly dirtying the front part of a file, pages near the end
         may be starved of writeout.  (Well, not completely: the `kupdate'
         function will write an entire file once the file's dirty timestamp
         has expired).  
      
      b) When the disk queues are huge (10000 requests), there can be a
         very large number of locked pages.  Scanning past these in writeback
         consumes quite some CPU time.
      
      So in each address_space we record the index at which the last batch of
      writeout terminated and start the next batch of writeback from that
      point.
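
      The resume-from-last-index idea can be modelled in userspace roughly as
      follows.  This is only a sketch, not the kernel code: the Mapping class,
      the writeback_index field name and the wrap-around to the front of the
      file are assumptions made for illustration.

```python
from bisect import bisect_left


class Mapping:
    """Toy stand-in for an address_space: just a set of dirty page indices."""

    def __init__(self, dirty):
        self.dirty = set(dirty)
        self.writeback_index = 0  # hypothetical name: where the last batch stopped


def write_batch(mapping, nr_to_write):
    """Write up to nr_to_write dirty pages, resuming where the last batch ended."""
    ordered = sorted(mapping.dirty)
    start = bisect_left(ordered, mapping.writeback_index)
    # Scan from the saved index to the end of the file, then wrap to the front
    # (the wrap-around is an assumption of this sketch).
    candidates = ordered[start:] + ordered[:start]
    written = candidates[:nr_to_write]
    for index in written:
        mapping.dirty.discard(index)
    if written:
        mapping.writeback_index = written[-1] + 1
    return written
```

      With this shape, a task which keeps redirtying pages 0 and 1 cannot
      starve pages near the end of the file: the next batch starts after the
      last page written rather than at the first dirty page.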
    • [PATCH] don't allow background writes to hide dirty buffers · bd134f27
      Andrew Morton authored
      If pdflush hits a locked-and-clean buffer in __block_write_full_page() it
      will just pass over the buffer.  Typically the buffer is an ext3 data=ordered
      buffer which is being written by kjournald, but a similar thing can happen
      with blockdev buffers and ll_rw_block().
      
      This is bad because the buffer is still under I/O and a subsequent fsync's
      fdatawait() needs to know about it.
      
      It is not practical to tag the page for writeback - only the submitter of the
      I/O can do that, because the submitter has control of the end_io handler.
      
      So instead, redirty the page so a subsequent fsync's fdatawrite() will wait on
      the underway I/O.
      
      There is a risk that pdflush::background_writeout() will lock up, repeatedly
      trying and failing to write the same page.  This is prevented by ensuring
      that background_writeout() always throttles when it made no progress.
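
      A much-simplified model of "throttle when no progress was made" is
      sketched below.  The callback names, the 0.1 second back-off and the
      batch size are assumptions for illustration; the real loop also deals
      with congestion and skipped pages.

```python
import time


def background_writeout(write_some, over_threshold, sleep=time.sleep, batch=1024):
    """Keep writing while over the dirty threshold; back off after an empty pass.

    write_some(batch) returns the number of pages written in one pass and
    over_threshold() reports whether background writeback should continue.
    Both are hypothetical callbacks supplied by the caller.
    """
    throttles = 0
    while over_threshold():
        written = write_some(batch)
        if written == 0:
            # Every page was redirtied or skipped: sleep rather than spin
            # on the same pages.
            sleep(0.1)
            throttles += 1
    return throttles
```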
    • [PATCH] fdatasync integrity fix · d3eb546e
      Andrew Morton authored
      fdatasync can fail to wait on some pages due to a race.
      
      If some task (eg pdflush) is flushing the same mapping it can remove a page's
      dirty tag but not then mark that page as being under writeback, because
      pdflush hit a locked buffer in __block_write_full_page().  This will happen
      because kjournald is writing the buffer.  In this situation
      __block_write_full_page() will redirty the page so that fsync notices it, but
      there is a window where the page eludes the radix tree dirty page walk.
      
      Consequently a concurrent fsync will fail to notice the page when walking the
      radix tree's dirty pages.
      
      The approach taken by this patch is to leave the page marked as dirty in the
      radix tree while ->writepage is working out what to do with it.  This ensures
      that a concurrent write-for-sync will successfully locate the page and will
      then block in lock_page() until the non-write-for-sync code has finished
      altering the page state.
    • [PATCH] remove page.list · be5ceb40
      Andrew Morton authored
      Remove the now-unneeded page.list field.
    • [PATCH] switch the m68k pointer-table code over to page->lru · 67817afb
      Andrew Morton authored
      Switch the m68k pointer-table code over to page->lru.
    • [PATCH] arm: stop using page->list · de894013
      Andrew Morton authored
      Switch the ARM `small_page' code over to page->lru.
    • [PATCH] stop using page->lru in compound pages · 0fcb51fd
      Andrew Morton authored
      The compound page logic is using page->lru, and these will get scribbled on
      in various places, so switch the compound page logic over to using ->mapping
      and ->private.
    • [PATCH] stop using page.list in readahead · bd64f049
      Andrew Morton authored
      The address_space.readpages() function currently takes a list of pages,
      strung together via page->list.  Switch it to using page->lru.
      
      This changes the API into filesystems.
    • [PATCH] stop using page.list in pageattr.c · 90687aa1
      Andrew Morton authored
      Switch it to ->lru
    • [PATCH] stop using page->list in the hugetlbpage implementations · c41bb9c4
      Andrew Morton authored
      Switch them over to page.lru
    • [PATCH] stop using page.list in the page allocator · 62e52945
      Andrew Morton authored
      Switch the page allocator over to using page.lru for the buddy lists.
    • [PATCH] slab: stop using page.list · 02979dcb
      Andrew Morton authored
      slab.c is using page->list.  Switch it over to using page->lru so we can
      remove page.list.
    • [PATCH] revert the slabification of i386 pgd's and pmd's · c33c9e78
      Andrew Morton authored
      This code is playing with page->lru from pages which came from slab.  But to
      remove page->list we need to convert slab over to using page->lru.  So we
      cannot allow the i386 pagetable code to go scribbling on the ->lru field of
      active slab pages.
      
      This optimisation was pretty thin, and it is more important to shrink the
      pageframe (on all architectures).
    • [PATCH] stop using address_space.clean_pages · d672c382
      Andrew Morton authored
      Remove remaining references to address_space.clean_pages.
    • [PATCH] Stop using address_space.locked_pages · a1513309
      Andrew Morton authored
      Instead, use a radix-tree walk of the pages which are tagged as being under
      writeback.
      
      The new function wait_on_page_writeback_range() was generalised out of
      filemap_fdatawait().  We can later use this to provide concurrent fsync of
      just a section of a file.
    • [PATCH] remove address_space.io_pages · 3c1ed9b2
      Andrew Morton authored
      Now remove address_space.io_pages.
    • [PATCH] fix the kupdate function · b79a8408
      Andrew Morton authored
      Juggle dirty pages and dirty inodes and dirty superblocks and various
      different writeback modes and livelock avoidance and fairness to recover from
      the loss of mapping->io_pages.
    • [PATCH] stop using the address_space dirty_pages list · 1d7d3304
      Andrew Morton authored
      Move everything over to walking the radix tree via the PAGECACHE_TAG_DIRTY
      tag.  Remove address_space.dirty_pages.
    • [PATCH] tag writeback pages as such in their radix tree · 40c8348e
      Andrew Morton authored
      Arrange for under-writeback pages to be marked thus in their pagecache radix
      tree.
    • [PATCH] tag dirty pages as such in the radix tree · 8ece6262
      Andrew Morton authored
      Arrange for all dirty pagecache pages to be tagged as dirty within their
      radix tree.
    • [PATCH] make the pagecache lock irq-safe. · 89261aab
      Andrew Morton authored
      Intro to these patches:
      
      - Major surgery against the pagecache, radix-tree and writeback code.  This
        work is to address the O_DIRECT-vs-buffered data exposure horrors which
        we've been struggling with for months.
      
        As a side-effect, 32 bytes are saved from struct inode and eight bytes
        are removed from struct page.  At a cost of approximately 2.5 bits per page
        in the radix tree nodes on 4k pagesize, assuming the pagecache is densely
        populated.  Not all pages are pagecache; other pages gain the full 8 byte
        saving.
      
        This change will break any arch code which is using page->list and will
        also break any arch code which is using page->lru of memory which was
        obtained from slab.
      
        The basic problem which we (mainly Daniel McNeil) have been struggling
        with is in getting a really reliable fsync() across the page lists while
        other processes are performing writeback against the same file.  It's like
        juggling four bars of wet soap with your eyes shut while someone is
        whacking you with a baseball bat.  Daniel pretty much has the problem
        plugged but I suspect that's just because we don't have testcases to
        trigger the remaining problems.  The complexity and additional locking
        which those patches add is worrisome.
      
        So the approach taken here is to remove the page lists altogether and
        replace the list-based writeback and wait operations with in-order
        radix-tree walks.
      
        The radix-tree code has been enhanced to support "tagging" of pages, for
        later searches for pages which have a particular tag set.  This means that
        we can ask the radix tree code "find me the next 16 dirty pages starting at
        pagecache index N" and it will do that in O(log64(N)) time.
      
        This affects I/O scheduling potentially quite significantly.  It is no
        longer the case that the kernel will submit pages for I/O in the order in
        which the application dirtied them.  We instead submit them in file-offset
        order all the time.
      
        This is likely to be advantageous when applications are seeking all over
        a large file randomly writing small amounts of data.  I haven't performed
        much benchmarking, but tiobench random write throughput seems to be
        increased by 30%.  Other tests appear to be unaltered.  dbench may have got
        10-20% quicker, but it's variable.
      
        There is one large file which everyone seeks all over randomly writing
        small amounts of data: the blockdev mapping which caches filesystem
        metadata.  The kernel's IO submission patterns for this are now ideal.
      
      
        Because writeback and wait-for-writeback use a tree walk instead of a
        list walk they are no longer livelockable.  This probably means that we no
        longer need to hold i_sem across O_SYNC writes and perhaps fsync() and
        fdatasync().  This may be beneficial for databases: multiple processes
        writing and syncing different parts of the same file at the same time can
        now all submit and wait upon writes to just their own little bit of the
        file, so we can get a lot more data into the queues.
      
        It is trivial to implement a part-file-fdatasync() as well, so
        applications can say "sync the file from byte N to byte M", and multiple
        applications can do this concurrently.  This is easy for ext2 filesystems,
        but probably needs lots of work for data-journalled filesystems and XFS and
        it probably doesn't offer much benefit over an i_semless O_SYNC write.
      
      
        These patches can end up making ext3 (even) slower:
      
      	for i in 1 2 3 4
      	do
      		dd if=/dev/zero of=$i bs=1M count=2000 &
      	done          
      
        runs awfully slow on SMP.  This is, yet again, because all the file
        blocks are jumbled up and the per-file linear writeout causes tons of
        seeking.  The above test runs sweetly on UP because on UP we don't
        allocate blocks to different files in parallel.
      
        Mingming and Badari are working on getting block reservation working for
        ext3 (preallocation on steroids).  That should fix ext3 up.
      
      
      This patch:
      
      - Later, we'll need to access the radix trees from inside disk I/O
        completion handlers.  So make mapping->page_lock irq-safe.  And rename it
        to tree_lock to reliably break any missed conversions.
    • [PATCH] radix-tree tags for selective lookup · 8691fb83
      Andrew Morton authored
      Add radix-tree tagging so we can look up dirty or writeback pages in
      O(log64(n)) time.
      
      Each radix-tree node gains two bits for each slot: one for page dirtiness and
      one for page writebackness.
      
      If a tag bit is set on a leaf node, it indicates that item at the
      corresponding slot is tagged (say, a dirty page).
      
      If a tag bit is set in a non-leaf node it indicates that the same tag bit is
      set in the subtree which lies under the corresponding slot.  ie: "there is a
      dirty page under here somewhere, but you need to search down further to find
      it".
      
      A gang lookup function is provided which can walk the radix tree in
      logarithmic time looking for items which are tagged, starting from a
      specified offset.  We use this for in-order searches for dirty or writeback
      pages.
      
      There is a userspace test harness for this code at
      
      http://www.zip.com.au/~akpm/linux/patches/stuff/rtth.tar.gz
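
      The tagging scheme described above can be modelled in a few dozen lines
      of userspace code.  The following is a sketch only: the class and method
      names are invented for illustration, the tree has a fixed height, and
      none of it is taken from the kernel implementation or from the test
      harness.

```python
MAP_SHIFT = 6                  # 64 slots per node, hence log64
MAP_SIZE = 1 << MAP_SHIFT
DIRTY, WRITEBACK = 0, 1        # the two tags described above


class Node:
    def __init__(self):
        self.slots = [None] * MAP_SIZE
        self.tags = [0, 0]     # one bitmap per tag, one bit per slot


class TaggedRadixTree:
    def __init__(self, height=3):
        self.height = height   # fixed height: indices must be < 64**height
        self.root = Node()

    def _path(self, index, create=False):
        """Return [(node, slot_offset), ...] from root to leaf, or None."""
        node, path = self.root, []
        for level in range(self.height - 1, -1, -1):
            offset = (index >> (level * MAP_SHIFT)) & (MAP_SIZE - 1)
            path.append((node, offset))
            if level == 0:
                break
            child = node.slots[offset]
            if child is None:
                if not create:
                    return None
                child = node.slots[offset] = Node()
            node = child
        return path

    def insert(self, index, item):
        node, offset = self._path(index, create=True)[-1]
        node.slots[offset] = item

    def tag_set(self, index, tag):
        # The item must already be inserted.  Set the bit in the leaf and in
        # every ancestor, so a walk knows there is a tagged item below.
        for node, offset in self._path(index):
            node.tags[tag] |= 1 << offset

    def tag_clear(self, index, tag):
        # Clear bottom-up; stop once a node still has other tagged slots.
        for node, offset in reversed(self._path(index)):
            node.tags[tag] &= ~(1 << offset)
            if node.tags[tag]:
                break

    def gang_lookup_tag(self, start, max_items, tag):
        """Return up to max_items tagged (index, item) pairs with index >= start."""
        out = []
        self._walk(self.root, self.height - 1, 0, start, max_items, tag, out)
        return out

    def _walk(self, node, level, base, start, max_items, tag, out):
        span = 1 << (level * MAP_SHIFT)   # indices covered by one slot
        for offset in range(MAP_SIZE):
            if len(out) >= max_items:
                return
            lo = base + offset * span
            if lo + span <= start or not (node.tags[tag] >> offset) & 1:
                continue                  # before start, or nothing tagged below
            if level == 0:
                out.append((lo, node.slots[offset]))
            else:
                self._walk(node.slots[offset], level - 1, lo, start,
                           max_items, tag, out)
```

      A lookup such as tree.gang_lookup_tag(4, 16, DIRTY) then returns the
      next (up to) 16 dirty items at or after index 4, in index order,
      skipping every subtree whose tag bit is clear.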
    • [PATCH] rw_swap_page_sync(): place the pages in swapcache · e279bfef
      Andrew Morton authored
      This function is setting page->mapping = swapper_space, but isn't actually
      adding the page to swapcache.  This triggers soon-to-be-added BUGs in the
      radix tree code.
      
      So temporarily add these pages to swapcache for real.
      
      Also, make rw_swap_page_sync() go away if it has no callers.
    • [PATCH] AIO+DIO bio_count race fix · c58d3aeb
      Andrew Morton authored
      From: Suparna Bhattacharya <suparna@in.ibm.com>,
            Daniel McNeil <daniel@osdl.org>
      
      This patch ensures that when the DIO code falls back to buffered i/o after
      having submitted part of the i/o, then buffered i/o is issued only for the
      remaining part of the request (i.e.  the part not already covered by DIO),
      rather than redo the entire i/o.  Now, instead of returning written ==
      -ENOTBLK, generic_file_direct_IO returns the number of bytes already handled
      by DIO, so that the caller knows how much of the I/O is left to be handled
      via fallback to buffered write.
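
      The partial-fallback behaviour can be illustrated with a toy model.
      This is a sketch under stated assumptions: the file is a list of blocks
      in which None marks a hole, lengths are counted in blocks rather than
      bytes, and the function names are invented for the example.

```python
def direct_write(blocks, pos, data):
    """Write directly until the first hole; return how many blocks were handled."""
    done = 0
    for offset, value in enumerate(data):
        if blocks[pos + offset] is None:  # hole: direct I/O will not allocate it
            break
        blocks[pos + offset] = value
        done += 1
    return done


def buffered_write(blocks, pos, data):
    """Stand-in for the buffered path, which may fill holes."""
    for offset, value in enumerate(data):
        blocks[pos + offset] = value
    return len(data)


def file_write(blocks, pos, data):
    """Try direct I/O first, then fall back for only the unhandled remainder."""
    written = direct_write(blocks, pos, data)
    if written < len(data):
        written += buffered_write(blocks, pos + written, data[written:])
    return written
```

      The point is that the caller resumes at pos + written instead of
      redoing the whole request.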
      
      We need to be careful not to access dio fields if it's possible that the dio
      could already have been freed asynchronously during i/o completion.  A tricky
      part of this involves plugging the window between the decrement of bio_count
      and accessing dio->waiter during i/o completion where the dio could get freed
      by the submission path.  This potential "bio_count race" was tackled (by
      Daniel) by changing bio_list_lock into bio_lock and using that for all the
      bio fields.  Now bio_count and bios_in_flight have been converted from
      atomics into int and are both protected by the bio_lock.  The race in
      finished_one_bio() could thus be fixed by leaving the bio_count at 1 until
      after the dio_complete() and then doing the bio_count decrement and wakeup
      holding the bio_lock.  It appears that shifting to the spin_lock instead of
      atomic_inc/decs is ok performance wise as well.
      
      Update:
      
      An AIO O_DIRECT request was extending the file so it was done
      synchronously.  However, the request got an EFAULT and direct_io_worker()
      was calling aio_complete() on the iocb and returning the EFAULT.  When
      io_submit_one() got the EFAULT return, it assumed it had to call
      aio_complete() since the i/o never got queued.
      
      The fix is for direct_io_worker() to only call aio_complete() when the
      upper layer is going to return -EIOCBQUEUED and not when getting errors
      that are being returned to the submit path.
    • [PATCH] direct-io AIO fixes · 332c8cf1
      Andrew Morton authored
      From: Suparna Bhattacharya <suparna@in.ibm.com>
      
      Fixes the following remaining issues with the DIO code:
      
      1. During DIO file extends, intermediate writes could extend i_size
         exposing unwritten blocks to intermediate reads (Soln: Don't drop i_sem
         for file extends)
      
      2. AIO-DIO file extends may update i_size before I/O completes,
         exposing unwritten blocks to intermediate reads.  (Soln: Force AIO-DIO
         file extends to be synchronous)
      
      3. AIO-DIO writes to holes call aio_complete() before falling back to
         buffered I/O !  (Soln: Avoid calling aio_complete() if -ENOTBLK)
      
      4. AIO-DIO writes to an allocated region followed by a hole, falls back
         to buffered i/o without waiting for already submitted i/o to complete;
         might return to user-space, which could overwrite the buffer contents
         while they are still being written out by the kernel (Soln: Always wait
         for submitted i/o to complete before falling back to buffered i/o)
    • [PATCH] blockdev direct-io speedups · aa34baa2
      Andrew Morton authored
      From: Badari Pulavarty <pbadari@us.ibm.com>
      
      1) blkdev_direct_IO() calls blockdev_direct_IO() instead of
         blockdev_direct_IO_no_locking().
      
      2) writev entry point is generic_file_writev() which grabs i_sem.  It
         should use generic_file_write_nolock() instead.
    • [PATCH] Fix race between ll_rw_block() and block_write_full_page() · c2179a48
      Andrew Morton authored
      Fix a race which was identified by Daniel McNeil <daniel@osdl.org>
      
      If a buffer_head is under I/O due to JBD's ordered data writeout (which uses
      ll_rw_block()) then either filemap_fdatawrite() or filemap_fdatawait() need
      to wait on the buffer's existing I/O.
      
      Presently neither will do so, because __block_write_full_page() will not
      actually submit any I/O and will hence not mark the page as being under
      writeback.
      
      The best-performing fix would be to somehow mark the page as being under
      writeback and defer waiting for the ll_rw_block-initiated I/O until
      filemap_fdatawait()-time.  But this is hard, because in
      __block_write_full_page() we do not have control of the buffer_head's end_io
      handler.  Possibly we could make JBD call into end_buffer_async_write(), but
      that gets nasty.
      
      This patch makes __block_write_full_page() wait for any buffer_head I/O to
      complete before inspecting the buffer_head state.  It only does this in the
      case where __block_write_full_page() was called for a "data-integrity" write:
      (wbc->sync_mode != WB_SYNC_NONE).
      
      Probably it doesn't matter, because kjournald is currently submitting (or has
      already submitted) all dirty buffers anyway.
    • [PATCH] O_DIRECT data exposure fixes · bc0e2bbf
      Andrew Morton authored
      From: Badari Pulavarty, Suparna Bhattacharya, Andrew Morton
      
      Forward port of Stephen Tweedie's DIO fixes from 2.4, to fix various DIO vs
      buffered IO exposures involving races causing:
      
      (a) stale data from uninstantiated blocks to be read, e.g.
      
          - O_DIRECT reads against buffered writes to a sparse region
      
          - O_DIRECT writes to a sparse region against buffered reads
      
      (b) potential data corruption with
      
          - O_DIRECT IOs against truncate
      
          due to writes to truncated blocks (which may have been reallocated to
          another file).
      
      Summary of fixes:
      
      1) All the changes affect only regular files.  RAW/O_DIRECT on block
         devices are unaffected.
      
      2) The DIO code will not fill in sparse regions on a write.  Instead
         -ENOTBLK is returned and the generic file write code falls through to
         buffered IO in this case, followed by writing the pages through to disk
         using filemap_fdatawrite/wait.
      
      3) i_sem is held during both DIO reads and writes.  For reads, and writes
         to already allocated blocks, it is released right after IO is issued,
         while for writes to newly allocated blocks (e.g file extending writes and
         hole overwrites) it is held all the way through until IO completes (and
         data is committed to disk).
      
      4) filemap_fdatawrite/wait are called under i_sem to synchronize buffered
         pages to disk blocks before issuing DIO.
      
      5) A new rwsem (i_alloc_sem) is held in shared mode all the while a DIO
         (read or write) is in progress, and in exclusive mode by truncate to guard
         against deallocation of data blocks during DIO. 
      
      6) All this new locking has been pushed down into blockdev_direct_IO to
         avoid interfering with NFS direct IO.  The locks are taken in the order
         i_sem followed by i_alloc_sem.  While i_sem may be released after IO
         submission in some cases, i_alloc_sem is held through until dio_complete
         (in the case of AIO-DIO this happens through the IO completion callback).
      
      7) i_sem and i_alloc_sem are not held for the _nolock versions of write
         routines, as used by blockdev and XFS.  Filesystems can specify the
         needs_special_locking parameter to __blockdev_direct_IO from their direct
         IO address space op accordingly.
      
      Note from Badari:
      Here is the locking (when needs_special_locking is true):
      
      (1) generic_file_*_write() holds i_sem (as before) and calls
          ->direct_IO().  blockdev_direct_IO() gets i_alloc_sem and calls
          direct_io_worker().
      
      (2) generic_file_*_read() does not hold any locks.  blockdev_direct_IO()
          gets i_sem and then i_alloc_sem and calls direct_io_worker() to do the
          work
      
      (3) direct_io_worker() does the work and drops i_sem after submitting IOs
          if appropriate and drops i_alloc_sem after completing IOs.
    • [PATCH] enable suspend-on-halt for NS Geode · 62a36b1f
      Andrew Morton authored
      From: Matt Mackall <mpm@selenic.com>
      
      From: Zwane Mwaikambo <zwane@arm.linux.org.uk>
      
      This enables deep powersaving mode on Geode boxes.
    • [PATCH] shrink inode when quota is disabled · 87217f47
      Andrew Morton authored
      From: Matt Mackall <mpm@selenic.com>
      
      drop quota array in inode struct if no quota support
    • [PATCH] eliminate nswap and cnswap · 8398bcc6
      Andrew Morton authored
      From: Matt Mackall <mpm@selenic.com>
      
      The nswap and cnswap counters have never been incremented as
      Linux doesn't do task swapping.
    • [PATCH] improve CONFIG_EMBEDDED help text · b931abdb
      Andrew Morton authored
      From: Matt Mackall <mpm@selenic.com>
      
      Make CONFIG_EMBEDDED description more accurate
    • [PATCH] remove bogus MOD_{INC,DEC}_USE_COUNT from hysdn · cc66b6fc
      Andrew Morton authored
      From: Christoph Hellwig <hch@lst.de>
      
      The maintainer doesn't respond, unfortunately, but removing these from
      net_devices unconditionally is the 2.6 way to go: there's no more module
      refcounting on net devices.
    • [PATCH] oss/wavfront.c warning fix. · 36bf1087
      Andrew Morton authored
      From: "Luiz Fernando N. Capitulino" <lcapitulino@prefeitura.sp.gov.br>
      
      sound/oss/wavfront.c: At top level:
      sound/oss/wavfront.c:2498: warning: `errno' defined but not used
    • [PATCH] kill spurious MAKDEV scripts · ffe52a4a
      Andrew Morton authored
      From: Christoph Hellwig <hch@lst.de>
      
      Kill magic ide/sound makedev scripts in scripts/.  The userland MAKEDEV is
      the proper place and already has support for them.
    • [PATCH] missing NULL pointer check in pte_alloc_one. · 7653e3ac
      Andrew Morton authored
      From: Martin Schwidefsky <schwidefsky@de.ibm.com>
      
      Just found a small bug in pgalloc for s390*.  Comparing notes with other
      architectures I found that pte_alloc_one is sick for alpha and sparc64 as
      well.
    • [PATCH] selinux: fix struct type · d15128eb
      Andrew Morton authored
      From: Stephen Smalley <sds@epoch.ncsc.mil>
      
      This patch fixes the type of the ssec pointer in the sk_free_security
      function.  This has no current impact as the magic element is the top of each
      structure.  Thanks to Chad Hanson of TCS for discovering the bug and
      submitting the patch.
    • [PATCH] stv0299.c unused variable · 25c1c70b
      Andrew Morton authored
      From: "Luiz Fernando N. Capitulino" <lcapitulino@prefeitura.sp.gov.br>
      
      drivers/media/dvb/frontends/stv0299.c:356: warning: unused variable `i'
    • [PATCH] ia64 MSI support · 9938e2c2
      Andrew Morton authored
      From: "Nguyen, Tom L" <tom.l.nguyen@intel.com>
      
      Adds MSI support for ia64.
      
      - Modified existing code in drivers/pci/msi.c and drivers/pci/msi.h to
        include MSI support on IA64 platform.
      
      - Based on the comments received from Zwane Mwaikambo and David Mosberger,
        this patch consolidates the vector allocators as
        assign_irq_vector(AUTO_ASSIGN) has the same semantics as
        ia64_alloc_vector() by converting the existing uses of ia64_alloc_vector()
        to assign_irq_vector(AUTO_ASSIGN).
      
      - Based on the comments received from Zwane Mwaikambo, this patch
        consolidates the semantics of vector allocator assign_irq_vector() in
        drivers/pci/msi.c into the relevant architecture's vector allocator
        assign_irq_vector() in arch/i386/kernel/io_apic.c.
      
      - Regarding vector allocation, this patch modifies the existing function
        assign_irq_vector() to maximize the number of allocated vectors to 188
        before going -ENOSPC.
      
      - Based on your comments, this patch creates <asm-i386/msi.h>,
        <asm-ia64/msi.h> and <asm-x86_64/msi.h>, includes <asm/msi.h> from within
        drivers/pci/msi.h and then places all the code which is currently under
        ifdef in msi.h into the relevant architecture's <asm/msi.h> file.
      
      - Based on your comments, this patch places pci_vector_resources() in
        existing drivers/pci/msi.c in the relevant architecture implementations
        such as into arch/.../pci/irq.c.
    • [PATCH] summmit: increase MAX_MP_BUSSES · 27b5c750
      Andrew Morton authored
      From: James Cleverdon <jamesclv@us.ibm.com>
      
      Bump up MAX_MP_BUSSES for summit/generic subarch to cope with big IBM x440
      systems.