  1. 29 Jun, 2004 2 commits
  2. 22 May, 2004 2 commits
    • [PATCH] slab: consolidate panic code · b33a7bad
      Andrew Morton authored
      Many places do:
      
      	if (kmem_cache_create(...) == NULL)
      		panic(...);
      
      We can consolidate all that by passing another flag to kmem_cache_create()
      which says "panic if it doesn't work".
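
      With the new flag (SLAB_PANIC in later mainline; the exact name is an
      assumption here, and "foo_cache"/struct foo are illustrative), a call
      site collapses to something like:

      	kmem_cache_t *cachep;

      	/* the allocator panics on failure, so callers drop the NULL check */
      	cachep = kmem_cache_create("foo_cache", sizeof(struct foo),
      				   0, SLAB_PANIC, NULL, NULL);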
      b33a7bad
    • [PATCH] revert recent swapcache handling changes · e74193ad
      Andrew Morton authored
      Go back to the 2.6.5 concepts, with rmap additions.  In particular:
      
      - Implement Andrea's flavour of page_mapping().  This function opaquely does
        the right thing for pagecache pages, anon pages and for swapcache pages.
      
        The critical thing here is that page_mapping() returns &swapper_space for
        swapcache pages without actually requiring the storage at page->mapping. 
        This frees page->mapping for the anonmm/anonvma metadata.
      
      - Andrea and Hugh placed the pagecache index of swapcache pages into
        page->private rather than page->index.  So add new page_index() function
        which hides this.
      
      - Make swapper_space.set_page_dirty() again point at
        __set_page_dirty_buffers().  If we don't do that, a bare set_page_dirty()
        will fall through to __set_page_dirty_buffers(), which is silly.
      
        This way, __set_page_dirty_buffers() can continue to use page->mapping.
        It should never go near anon or swapcache pages.
      
      - Give swapper_space a ->set_page_dirty address_space_operation method, so
        that set_page_dirty() will not fall through to __set_page_dirty_buffers()
        for swapcache pages.  That function is not set up to handle them.
      
      
      The main effect of these changes is that swapcache pages are treated more
      similarly to pagecache pages.  And we are again tagging swapcache pages as
      dirty in their radix tree, which is a requirement if we later wish to
      implement swapcache writearound based on tagged radix-tree walks.
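
      A hedged sketch of the two helpers described above (flag and field names
      follow this text, not necessarily the exact patch):

      	static inline struct address_space *page_mapping(struct page *page)
      	{
      		if (PageSwapCache(page))
      			return &swapper_space;	/* no storage needed in page->mapping */
      		if (PageAnon(page))
      			return NULL;		/* ->mapping holds anon rmap metadata */
      		return page->mapping;
      	}

      	static inline pgoff_t page_index(struct page *page)
      	{
      		if (PageSwapCache(page))
      			return page->private;	/* pagecache index kept here */
      		return page->index;
      	}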
      e74193ad
  3. 21 May, 2004 1 commit
    • [PATCH] getblk() BUG removal · bd105032
      Andrew Morton authored
      We keep on getting BUG()s from isofs_read_super() because it passes an insane
      blocksize to bread().  See http://bugme.osdl.org/show_bug.cgi?id=2735 for
      example.
      
      I don't know what's up with isofs, but going BUG in there seems a bit rude.
      Change it to emit a bunch of diagnostics and a backtrace, then return a NULL
      bh.
      
      Most callers of getblk() don't expect it to fail, so they'll oops anyway.  But
      isofs does actually check for a NULL return.  This way, the machine stays up
      and we get better debug diagnostics.
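
      Roughly, the BUG() in the getblk() path turns into diagnostics plus a NULL
      return (a sketch, not the exact hunk; the size check is approximate):

      	if (size & (bdev_hardsect_size(bdev) - 1) || size < 512) {
      		printk(KERN_ERR "getblk(): bogus block size %d requested\n", size);
      		dump_stack();
      		return NULL;		/* was: BUG() */
      	}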
      bd105032
  4. 19 May, 2004 2 commits
    • [PATCH] Fix overzealous use of online cpu iterators · 2cb2f31f
      Andrew Morton authored
      From: Rusty Russell <rusty@rustcorp.com.au>
      
      The IA64 hotplug CPU merge seems to have included some core changes: in
      particular the recalc_bh_state() needs to sum for all (including offline)
      cpus, since we don't empty the counters on CPU down.  The totals printed by
      /proc/stat (the first loop) should include offline cpus, too (apparently
      printing out the per-cpu lines for offline cpus confuses top).
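
      A sketch of the summing side of the fix (structure and variable names
      approximate those in fs/buffer.c):

      	static void recalc_bh_state(void)
      	{
      		int i, tot = 0;

      		/* include offline CPUs: their counters are not drained on CPU-down */
      		for_each_cpu(i)
      			tot += per_cpu(bh_accounting, i).nr;
      		buffer_heads_over_limit = (tot > max_buffer_heads);
      	}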
      2cb2f31f
    • [PATCH] blk_run_page() race fix · 66a759eb
      Andrew Morton authored
      blk_run_page() is incorrectly using page->mapping, which makes it racy against
      removal from swapcache.
      
      Make block_sync_page() use page_mapping(), and remove blk_run_page(), which
      only had one caller.
      66a759eb
  5. 14 May, 2004 4 commits
    • [PATCH] Revisited: ia64-cpu-hotplug-cpu_present.patch · fda94eff
      Andrew Morton authored
      From: Paul Jackson <pj@sgi.com>
      
      With a hotplug capable kernel, there is a requirement to distinguish a
      possible CPU from one actually present.  The set of possible CPU numbers
      doesn't change during a single system boot, but the set of present CPUs
      changes as CPUs are physically inserted into or removed from a system.  The
      cpu_possible_map does not change once initialized at boot, but the
      cpu_present_map changes dynamically as CPUs are inserted or removed.
      
      
      Paul Jackson <pj@sgi.com> provided an expanded explanation:
      
      
      Ashok's cpu hot plug patch adds a cpu_present_map, resulting in the following
      cpu maps being available.  All the following maps are fixed size bitmaps of
      size NR_CPUS.
      
      #ifdef CONFIG_HOTPLUG_CPU
      	cpu_possible_map - map with all NR_CPUS bits set
      	cpu_present_map - map with bit 'cpu' set iff cpu is populated
      	cpu_online_map - map with bit 'cpu' set iff cpu available to scheduler
      #else
      	cpu_possible_map - map with bit 'cpu' set iff cpu is populated
      	cpu_present_map - copy of cpu_possible_map
      	cpu_online_map - map with bit 'cpu' set iff cpu available to scheduler
      #endif
      
      In either case, NR_CPUS is fixed at compile time, as the static size of these
      bitmaps.  The cpu_possible_map is fixed at boot time, as the set of CPU ids
      that might ever be plugged in at any time during the life of that system
      boot.  The cpu_present_map is dynamic(*), representing which CPUs
      are currently plugged in.  And cpu_online_map is the dynamic subset of
      cpu_present_map, indicating those CPUs available for scheduling.
      
      If HOTPLUG is enabled, then cpu_possible_map is forced to have all NR_CPUS
      bits set, otherwise it is just the set of CPUs that ACPI reports present at
      boot.
      
      If HOTPLUG is enabled, then cpu_present_map varies dynamically, depending on
      what ACPI reports as currently plugged in, otherwise cpu_present_map is just a
      copy of cpu_possible_map.
      
      (*) Well, cpu_present_map is dynamic in the hotplug case.  If not hotplug,
          it's the same as cpu_possible_map, hence fixed at boot.
      fda94eff
    • [PATCH] blk_run_page(): we don't trust bh->b_page · 4e36c118
      Andrew Morton authored
      We don't trust bh->b_page to point to the right thing across all filesystems,
      so revert this bit.
      4e36c118
    • [PATCH] Add blk_run_page() · e059d5da
      Andrew Morton authored
      From: Andrea Arcangeli <andrea@suse.de>
      
      From: Jens Axboe
      
      Add blk_run_page() API.  This is so that we can pass the target page all the
      way down to (for example) the swap unplug function.  So swap can work out
      which blockdevs back this particular page.
      e059d5da
    • [PATCH] filtered wakeups: apply to buffer_head functions · 70d1f017
      Andrew Morton authored
      From: William Lee Irwin III <wli@holomorphy.com>
      
      This patch implements wake-one semantics for buffer_head wakeups in a single
      step.  The buffer_head being waited on is passed to the waiter's wakeup
      function by the waker, and the wakeup function compares it to a pointer
      stored in its on-stack structure, also checking the readiness of the bit
      there.  Wake-one semantics are achieved by using WQ_FLAG_EXCLUSIVE in the
      codepaths waiting to acquire the bit for mutual exclusion.
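
      A hedged sketch of the mechanism (identifiers are illustrative, not the
      patch's): the waker passes the buffer_head as the wake key, and each
      waiter's wakeup function only fires when the key matches its own bh and the
      bit it is waiting for has cleared:

      	struct bh_wait {
      		struct buffer_head	*bh;
      		wait_queue_t		wait;	/* wait.func = bh_wake_function */
      	};

      	static int bh_wake_function(wait_queue_t *wait, unsigned mode,
      				    int sync, void *key)
      	{
      		struct bh_wait *w = container_of(wait, struct bh_wait, wait);

      		if (w->bh != key || buffer_locked(w->bh))
      			return 0;	/* not our buffer, or bit still set: keep sleeping */
      		return autoremove_wake_function(wait, mode, sync, key);
      	}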
      70d1f017
  6. 22 Apr, 2004 1 commit
    • [PATCH] writeback livelock fix · 1ed73535
      Andrew Morton authored
      If a filesystem's ->writepage implementation repeatedly refuses to write the
      page (it keeps on redirtying it instead) (reiserfs seems to do this) then the
      writeback logic can get stuck repeately trying to write the same page.
      
      Fix that up by correctly setting wbc->pages_skipped, to tell the writeback
      logic that things aren't working out.
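
      For example, a ->writepage that decides not to write should account for
      that (a sketch; fs_wants_to_defer() is a hypothetical predicate):

      	if (fs_wants_to_defer(page)) {
      		wbc->pages_skipped++;	/* tell writeback we made no progress here */
      		set_page_dirty(page);
      		unlock_page(page);
      		return 0;
      	}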
      1ed73535
  7. 21 Apr, 2004 1 commit
    • [PATCH] lockfs - vfs bits · 137718ec
      Andrew Morton authored
      From: Christoph Hellwig <hch@lst.de>
      
      These are the generic lockfs bits.  Basically it takes the XFS freezing
      statemachine into the VFS.  It's all behind the kernel-doc documented
      freeze_bdev and thaw_bdev interfaces.
      
      Based on an older patch from Chris Mason.
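
      A hedged usage sketch of the interfaces (prototypes approximated from the
      description):

      	struct super_block *sb;

      	sb = freeze_bdev(bdev);		/* sync and freeze the fs on this bdev */
      	/* ... take the snapshot / do the offline work ... */
      	thaw_bdev(bdev, sb);		/* unfreeze, let writes resume */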
      137718ec
  8. 17 Apr, 2004 2 commits
    • [PATCH] remove buffer_error() · 4f990f49
      Andrew Morton authored
      From: Jeff Garzik <jgarzik@pobox.com>
      
      It was debug code, no longer required.
      4f990f49
    • [PATCH] kill submit_{bh,bio} return value · 01d86f02
      Andrew Morton authored
      From: Jeff Garzik <jgarzik@pobox.com>
      
      Nobody ever checks the return value of submit_bh(), and submit_bh() is the
      only caller that checks the submit_bio() return value.
      
      This changes the kernel I/O submission path -- a fast path -- so this
      cleanup is also a microoptimization.
      01d86f02
  9. 12 Apr, 2004 10 commits
    • [PATCH] rmap 2 anon and swapcache · 4875a601
      Andrew Morton authored
      From: Hugh Dickins <hugh@veritas.com>
      
      Tracking anonymous pages by anon_vma,pgoff or mm,address needs a
      pointer,offset pair in struct page: mapping,index the natural choice.  But
      swapcache uses those for &swapper_space,swp_entry_t.
      
      It's trivial to separate swapcache from pagecache with radix tree; most of
      swapper_space is actually unused, just a fiction to pretend swap like file;
      and page->private is a good place to keep swp_entry_t, now that swap never
      uses bufferheads.
      
      Define PG_anon bit, page_add_rmap SetPageAnon and put an oopsable address in
      page->mapping to test that we're not confused by it.  Define
      page_mapping(page) macro to give NULL when PageAnon, whatever may be in
      page->mapping.  Define PG_swapcache bit, deduce swapper_space from that in
      the few places we need it.
      
      add_to_swap_cache now distinct from add_to_page_cache.  Separating the caches
      somewhat simplifies the tmpfs swizzling in swap_state.c, now the page can
      briefly be in both caches.
      
      The rmap method remains pte chains, no change to that yet.  But one small
      functional difference: the use of PageAnon implies that a page truncated
      while still mapped will no longer be found and freed (swapped out) by
      try_to_unmap, will only be freed by exit or munmap.  But normally pages are
      unmapped by vmtruncate: this should only affect nonlinear mappings, and a
      later patch not in this batch will fix that.
      4875a601
    • [PATCH] per-backing dev unplugging · 6d27f67b
      Andrew Morton authored
      From: Jens Axboe <axboe@suse.de>,
            Chris Mason,
            me, others.
      
      The global unplug list causes horrid spinlock contention on many-disk
      many-CPU setups - throughput is worse than halved.
      
      The other problem with the global unplugging is of course that it will cause
      the unplugging of queues which are unrelated to the I/O upon which the caller
      is about to wait.
      
      So what we do to solve these problems is to remove the global unplug and set
      up the infrastructure under which the VFS can tell the block layer to unplug
      only those queues which are relevant to the page or buffer_head which is
      about to be waited upon.
      
      We do this via the very appropriate address_space->backing_dev_info structure.
      
      Most of the complexity is in devicemapper, MD and swapper_space, because for
      these backing devices, multiple queues may need to be unplugged to complete a
      page/buffer I/O.  In each case we ensure that data structures are in place to
      permit us to identify all the lower-level queues which contribute to the
      higher-level backing_dev_info.  Each contributing queue is told to unplug in
      response to a higher-level unplug.
      
      To simplify things in various places we also introduce the concept of a
      "synchronous BIO": it is tagged with BIO_RW_SYNC.  The block layer will
      perform an immediate unplug when it sees one of these go past.
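
      For example, a caller that is about to wait on the I/O can mark the bio
      synchronous so the relevant queue is unplugged immediately (a sketch):

      	bio->bi_rw |= (1 << BIO_RW_SYNC);	/* block layer unplugs at once */
      	submit_bio(rw, bio);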
      6d27f67b
    • [PATCH] reiserfs: data=ordered support · bb0d9672
      Andrew Morton authored
      From: Chris Mason <mason@suse.com>
      
      reiserfs data=ordered support.
      bb0d9672
    • [PATCH] laptop mode · 93d33a48
      Andrew Morton authored
      From: Bart Samwel <bart@samwel.tk>
      
      Adds /proc/sys/vm/laptop-mode: a special knob which says "this is a laptop".
      In this mode the kernel will attempt to avoid spinning disks up.
      
      Algorithm: the idea is to hold dirty data in memory for a long time, but to
      flush everything which has been accumulated if the disk happens to spin up
      for other reasons.
      
      - Whenever a disk request completes (read or write), schedule a timer a few
        seconds hence.  If the timer was already pending, reset it to a few seconds
        hence.
      
      - When the timer expires, write back the whole world.  We use
        sync_filesystems() for this because it will force ext3 journal commits as
        well.
      
      - In balance_dirty_pages(), kick off background writeback when we hit the
        high threshold (dirty_ratio), not when we hit the low threshold.  This has
        the effect of causing "lumpy" writeback which is something I spent a year
        fixing, but in laptop mode, it is desirable.
      
      - In try_to_free_pages(), only kick pdflush if the VM is getting into
        distress: we want to keep scanning for clean pages, deferring writeback.
      
      - In page reclaim, avoid writing back the odd random dirty page off the
        LRU: only start I/O if the scanning is working harder.
      
      The effect is to perform a sync() a few seconds after all I/O has ceased.
      
      The value which was written into /proc/sys/vm/laptop-mode determines, in
      seconds, the delay between the final I/O and the flush.
      
      Additionally, the patch adds tools which help answer the question "why the
      heck does my disk spin up all the time?".  The user may set
      /proc/sys/vm/block_dump to a non-zero value and the kernel will print out
      information which will identify the process which is performing disk reads or
      which is dirtying pagecache.
      
      The user should probably disable syslogd before setting block_dump.
      93d33a48
    • [PATCH] don't allow background writes to hide dirty buffers · bd134f27
      Andrew Morton authored
      If pdflush hits a locked-and-clean buffer in __block_write_full_page() it
      will just pass over the buffer.  Typically the buffer is an ext3 data=ordered
      buffer which is being written by kjournald, but a similar thing can happen
      with blockdev buffers and ll_rw_block().
      
      This is bad because the buffer is still under I/O and a subsequent fsync's
      fdatawait() needs to know about it.
      
      It is not practical to tag the page for writeback - only the submitter of the
      I/O can do that, because the submitter has control of the end_io handler.
      
      So instead, redirty the page so a subsequent fsync's fdatawrite() will wait on
      the underway I/O.
      
      There is a risk that pdflush::background_writeout() will lock up, repeatedly
      trying and failing to write the same page.  This is prevented by ensuring
      that background_writeout() always throttles when it made no progress.
      bd134f27
    • [PATCH] stop using the address_space dirty_pages list · 1d7d3304
      Andrew Morton authored
      Move everything over to walking the radix tree via the PAGECACHE_TAG_DIRTY
      tag.  Remove address_space.dirty_pages.
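
      Writeback then finds dirty pages by asking the tree directly, along these
      lines (a sketch; the pagevec setup is elided):

      	nr = radix_tree_gang_lookup_tag(&mapping->page_tree, (void **)pages,
      					index, PAGEVEC_SIZE, PAGECACHE_TAG_DIRTY);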
      1d7d3304
    • [PATCH] tag writeback pages as such in their radix tree · 40c8348e
      Andrew Morton authored
      Arrange for under-writeback pages to be marked thus in their pagecache radix
      tree.
      40c8348e
    • [PATCH] tag dirty pages as such in the radix tree · 8ece6262
      Andrew Morton authored
      Arrange for all dirty pagecache pages to be tagged as dirty within their
      radix tree.
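
      Dirtying a pagecache page now also tags its slot in the tree, roughly
      (tree locking elided):

      	if (!TestSetPageDirty(page))	/* newly dirtied: tag its tree slot too */
      		radix_tree_tag_set(&mapping->page_tree, page->index,
      				   PAGECACHE_TAG_DIRTY);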
      8ece6262
    • [PATCH] make the pagecache lock irq-safe. · 89261aab
      Andrew Morton authored
      Intro to these patches:
      
      - Major surgery against the pagecache, radix-tree and writeback code.  This
        work is to address the O_DIRECT-vs-buffered data exposure horrors which
        we've been struggling with for months.
      
        As a side-effect, 32 bytes are saved from struct inode and eight bytes
        are removed from struct page.  At a cost of approximately 2.5 bits per page
        in the radix tree nodes on 4k pagesize, assuming the pagecache is densely
        populated.  Not all pages are pagecache; other pages gain the full 8 byte
        saving.
      
        This change will break any arch code which is using page->list and will
        also break any arch code which is using page->lru of memory which was
        obtained from slab.
      
        The basic problem which we (mainly Daniel McNeil) have been struggling
        with is in getting a really reliable fsync() across the page lists while
        other processes are performing writeback against the same file.  It's like
        juggling four bars of wet soap with your eyes shut while someone is
        whacking you with a baseball bat.  Daniel pretty much has the problem
        plugged but I suspect that's just because we don't have testcases to
        trigger the remaining problems.  The complexity and additional locking
        which those patches add is worrisome.
      
        So the approach taken here is to remove the page lists altogether and
        replace the list-based writeback and wait operations with in-order
        radix-tree walks.
      
        The radix-tree code has been enhanced to support "tagging" of pages, for
        later searches for pages which have a particular tag set.  This means that
        we can ask the radix tree code "find me the next 16 dirty pages starting at
        pagecache index N" and it will do that in O(log64(N)) time.
      
        This affects I/O scheduling potentially quite significantly.  It is no
        longer the case that the kernel will submit pages for I/O in the order in
        which the application dirtied them.  We instead submit them in file-offset
        order all the time.
      
        This is likely to be advantageous when applications are seeking all over
        a large file randomly writing small amounts of data.  I haven't performed
        much benchmarking, but tiobench random write throughput seems to be
        increased by 30%.  Other tests appear to be unaltered.  dbench may have got
        10-20% quicker, but it's variable.
      
        There is one large file which everyone seeks all over randomly writing
        small amounts of data: the blockdev mapping which caches filesystem
        metadata.  The kernel's IO submission patterns for this are now ideal.
      
      
        Because writeback and wait-for-writeback use a tree walk instead of a
        list walk they are no longer livelockable.  This probably means that we no
        longer need to hold i_sem across O_SYNC writes and perhaps fsync() and
        fdatasync().  This may be beneficial for databases: multiple processes
        writing and syncing different parts of the same file at the same time can
        now all submit and wait upon writes to just their own little bit of the
        file, so we can get a lot more data into the queues.
      
        It is trivial to implement a part-file-fdatasync() as well, so
        applications can say "sync the file from byte N to byte M", and multiple
        applications can do this concurrently.  This is easy for ext2 filesystems,
        but probably needs lots of work for data-journalled filesystems and XFS and
        it probably doesn't offer much benefit over an i_semless O_SYNC write.
      
      
        These patches can end up making ext3 (even) slower:
      
      	for i in 1 2 3 4
      	do
      		dd if=/dev/zero of=$i bs=1M count=2000 &
      	done          
      
        runs awfully slow on SMP.  This is, yet again, because all the file
        blocks are jumbled up and the per-file linear writeout causes tons of
        seeking.  The above test runs sweetly on UP because on UP we don't
        allocate blocks to different files in parallel.
      
        Mingming and Badari are working on getting block reservation working for
        ext3 (preallocation on steroids).  That should fix ext3 up.
      
      
      This patch:
      
      - Later, we'll need to access the radix trees from inside disk I/O
        completion handlers.  So make mapping->page_lock irq-safe.  And rename it
        to tree_lock to reliably break any missed conversions.
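
      A hedged sketch of what the conversion looks like at call sites:

      	/* process context: was spin_lock(&mapping->page_lock) */
      	spin_lock_irq(&mapping->tree_lock);
      	radix_tree_insert(&mapping->page_tree, offset, page);
      	spin_unlock_irq(&mapping->tree_lock);

      	/* and, later in the series, from I/O completion context: */
      	spin_lock_irqsave(&mapping->tree_lock, flags);
      	radix_tree_tag_clear(&mapping->page_tree, page->index,
      			     PAGECACHE_TAG_WRITEBACK);
      	spin_unlock_irqrestore(&mapping->tree_lock, flags);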
      89261aab
    • [PATCH] Fix race between ll_rw_block() and block_write_full_page() · c2179a48
      Andrew Morton authored
      Fix a race which was identified by Daniel McNeil <daniel@osdl.org>
      
      If a buffer_head is under I/O due to JBD's ordered data writeout (which uses
      ll_rw_block()) then either filemap_fdatawrite() or filemap_fdatawait() need
      to wait on the buffer's existing I/O.
      
      Presently neither will do so, because __block_write_full_page() will not
      actually submit any I/O and will hence not mark the page as being under
      writeback.
      
      The best-performing fix would be to somehow mark the page as being under
      writeback and defer waiting for the ll_rw_block-initiated I/O until
      filemap_fdatawait()-time.  But this is hard, because in
      __block_write_full_page() we do not have control of the buffer_head's end_io
      handler.  Possibly we could make JBD call into end_buffer_async_write(), but
      that gets nasty.
      
      This patch makes __block_write_full_page() wait for any buffer_head I/O to
      complete before inspecting the buffer_head state.  It only does this in the
      case where __block_write_full_page() was called for a "data-integrity" write:
      (wbc->sync_mode != WB_SYNC_NONE).
      
      Probably it doesn't matter, because kjournald is currently submitting (or has
      already submitted) all dirty buffers anyway.
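
      The core of the change in __block_write_full_page(), roughly (the
      else-branch bookkeeping is illustrative):

      	if (wbc->sync_mode != WB_SYNC_NONE) {
      		/* data-integrity write: wait for I/O that ll_rw_block() started */
      		lock_buffer(bh);
      	} else if (test_set_buffer_locked(bh)) {
      		/* best-effort writeback: don't wait, leave this buffer for later */
      		redirty_page = 1;
      	}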
      c2179a48
  10. 19 Mar, 2004 1 commit
    • [PATCH] Hotplug CPUs: Other CPU_DEAD Notifiers · 279ce7b2
      Rusty Russell authored
      Various files keep per-cpu caches which need to be freed/moved when a
      CPU goes down.  All under CONFIG_HOTPLUG_CPU ifdefs.
      
      scsi.c: drain dead cpu's scsi_done_q onto this cpu.
      
      buffer.c: brelse the bh_lrus queue for dead cpu.
      
      timer.c: migrate timers from dead cpu, being careful of lock order vs
      	__mod_timer.
      
      radix_tree.c: free dead cpu's radix_tree_preloads
      
      page_alloc.c: empty dead cpu's nr_pagecache_local into nr_pagecache, and
      	free pages on cpu's local cache.
      
      slab.c: stop reap_timer for dead cpu, adjust each cache's free limit, and
      	free each slab cache's per-cpu block.
      
      swap.c: drain dead cpu's lru_add_pvecs into ours, and empty its committed_space
      	counter into global counter.
      
      dev.c: drain device queues from dead cpu into this one.
      
      flow.c: drain dead cpu's flow cache.
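
      Each of these follows the same CPU notifier pattern; a hedged sketch for
      the buffer.c case (handler names approximate):

      	static int buffer_cpu_notify(struct notifier_block *self,
      				     unsigned long action, void *hcpu)
      	{
      		if (action == CPU_DEAD)		/* CONFIG_HOTPLUG_CPU only */
      			buffer_exit_cpu((long)hcpu);	/* brelse the dead CPU's bh_lrus */
      		return NOTIFY_OK;
      	}

      	static struct notifier_block buffer_nb = {
      		.notifier_call = buffer_cpu_notify,
      	};
      	/* registered at init time with register_cpu_notifier(&buffer_nb) */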
      279ce7b2
  11. 06 Mar, 2004 3 commits
    • [PATCH] Fix nobh_prepare_write() race · b12088bf
      Andrew Morton authored
      Dave Kleikamp <shaggy@austin.ibm.com> points out a race between
      nobh_prepare_write() and end_buffer_read_sync().  end_buffer_read_sync()
      calls unlock_buffer(), waking the nobh_prepare_write() thread, which
      immediately frees the buffer_head.  end_buffer_read_sync() then calls
      put_bh() which decrements b_count for the already freed structure.  The
      SLAB_DEBUG code detects the slab corruption.
      
      We fix this by giving nobh_prepare_write() a private buffer_head end_io
      handler which doesn't touch the buffer's contents after unlocking it.
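
      A hedged sketch of such a handler (name assumed): once the buffer is
      unlocked the waiter may free it, so nothing may touch it afterwards:

      	static void nobh_end_buffer_read(struct buffer_head *bh, int uptodate)
      	{
      		if (uptodate)
      			set_buffer_uptodate(bh);
      		else
      			clear_buffer_uptodate(bh);
      		unlock_buffer(bh);	/* bh may be freed from this point on */
      		/* note: no put_bh(bh) here, unlike end_buffer_read_sync() */
      	}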
      b12088bf
    • [PATCH] CONFIG_LBD fixes · d67c0fd5
      Andrew Morton authored
      From: Eric Sandeen <sandeen@sgi.com>
      
      Several functions in buffer.c are using unsigned long where they should be
      using sector_t.
      
      Also, use pgoff_t in several places so it is easier to tell what is being used
      as a pagecache index, what is being used as a disk index and what is being
      used as an offset-into-page.
      d67c0fd5
    • [PATCH] fastcall / regparm fixes · 20e39386
      Andrew Morton authored
      From: Gerd Knorr <kraxel@suse.de>
      
      Current gcc versions error out if a function's declaration and definition
      disagree about the register passing convention.
      
      The patch adds a new `fastcall' declaration primitive, and uses that in all
      the FASTCALL functions which we could find.  A number of inconsistencies were
      fixed up along the way.
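
      A sketch of the idea (the macro expansion shown is an i386 assumption, and
      my_helper is hypothetical): both the declaration and the definition carry
      the annotation, so gcc sees the same calling convention in both places:

      	/* roughly: #define fastcall __attribute__((regparm(3))) on i386 */

      	fastcall void my_helper(unsigned long arg);	/* declaration */

      	fastcall void my_helper(unsigned long arg)	/* definition agrees */
      	{
      		/* ... */
      	}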
      20e39386
  12. 18 Feb, 2004 1 commit
    • [PATCH] Remove More Unneccessary CPU Notifiers · 79caa7d5
      Andrew Morton authored
      From: Rusty Russell <rusty@rustcorp.com.au>
      
      Three more removed CPU notifiers extracted from the hotplug CPU patch.
      
      kernel/softirq.c: the tasklet cpu preparation callback is useless:
      the vectors are already initialized to NULL.  Even with the hotplug
      CPU patches, they're of little or no use.
      
      fs/buffer.c: once again, they are already initialized to zero.
      
      mm/page_alloc.c: once again, already initialized to zero.
      79caa7d5
  13. 20 Jan, 2004 1 commit
  14. 19 Jan, 2004 3 commits
    • [PATCH] Use for_each_cpu() Where It's Meant To Be · 012061cc
      Andrew Morton authored
      From: Rusty Russell <rusty@rustcorp.com.au>
      
      Some places use cpu_online() where they should be using cpu_possible, most
      commonly for tallying statistics.  This makes no difference without hotplug
      CPU.
      
      Use the for_each_cpu() macro in those places, providing good examples (and
      making the external hotplug CPU patch smaller).
      012061cc
    • [PATCH] make try_to_free_pages walk zonelist · d5d4042d
      Andrew Morton authored
      From: Rik van Riel <riel@surriel.com>
      
      In 2.6.0 both __alloc_pages() and the corresponding wakeup_kswapd()s walk
      all zones in the zone list, possibly spanning multiple nodes in a low numa
      factor system like AMD64.
      
      Also, if lower_zone_protection is set in /proc, then it may be possible
      that kswapd never cleans out data in zones further down the zonelist and
      try_to_free_pages needs to do that.
      
      However, in 2.6.0 try_to_free_pages() only frees pages in the pgdat the
      first zone in the zonelist belongs to.
      
      This is probably the wrong behaviour, since both the page allocator and the
      kswapd wakeup free things from all zones on the zonelist.  The following
      patch makes try_to_free_pages() consistent with the allocator, by passing
      the zonelist as an argument and freeing pages from all zones in the list.
      
      I do not have any numa systems myself, so I have only tested it on my own
      little smp box.  Testing on NUMA systems may be useful, though the patch
      really only should have an impact in those rare cases where kswapd can't
      keep up with allocations...
      
      As a side effect, the patch shrinks the kernel by 2 lines and replaces some
      subtle magic by a simpler array walk.
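
      The interface change, roughly (prototype and caller approximated):

      	/* int try_to_free_pages(struct zone **zones, unsigned int gfp_mask,
      	 *                       unsigned int order);                        */

      	/* caller side, e.g. in __alloc_pages(): hand over the whole zonelist */
      	try_to_free_pages(zonelist->zones, gfp_mask, order);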
      d5d4042d
    • [PATCH] bdev: use correct mapping's i_sem · 54df7662
      Andrew Morton authored
      From: viro@parcelfarce.linux.theplanet.co.uk <viro@parcelfarce.linux.theplanet.co.uk>
      
      In a bunch of places we used file->f_dentry->d_inode->i_sem to protect
      fdatasync et al.  Replaced with the correct file->f_mapping->host->i_sem - the
      object we are protecting is address_space, so we want an exclusion that would
      work for redirected ->i_mapping.  For normal files (not coda, not bdev) it's
      all the same, of course - there we have
      
       	file->f_mapping->host == file->f_dentry->d_inode
      
      and the change above is an equivalent transformation.
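
      In code terms the swap is (sketch):

      	/* before */
      	down(&file->f_dentry->d_inode->i_sem);

      	/* after: lock the inode that owns the (possibly redirected) mapping */
      	down(&file->f_mapping->host->i_sem);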
      54df7662
  15. 30 Dec, 2003 1 commit
  16. 29 Sep, 2003 1 commit
  17. 19 Aug, 2003 3 commits
    • [PATCH] async write errors: fix spurious fs truncate errors · e89061de
      Andrew Morton authored
      From: Oliver Xymoron <oxymoron@waste.org>
      
      Currently, a writepage() which detects that it is writing outside i_size (due
      to concurrent truncate) will abandon the write, returning -EIO.
      
      The return value will bogusly cause an error to be recorded in the
      address_space.  So convert all those writepage() instances to return zero in
      this case.
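
      So each ->writepage instance ends up with something along these lines
      (end_index is derived from i_size in the surrounding code):

      	if (page->index >= end_index) {		/* concurrent truncate won the race */
      		unlock_page(page);
      		return 0;			/* was: return -EIO */
      	}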
      e89061de
    • [PATCH] async write errors: use flags in address space · fcad2b42
      Andrew Morton authored
      From: Oliver Xymoron <oxymoron@waste.org>
      
      This patch just saves a few bytes in the inode by turning mapping->gfp_mask
      into an unsigned long mapping->flags.
      
      The mapping's gfp mask is placed in the 16 high bits of mapping->flags and
      two of the remaining 16 bits are used for tracking EIO and ENOSPC errors.
      
      This leaves 14 bits in the mapping for future use.  They should be accessed
      with the atomic bitops.
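
      A hedged sketch of the encoding (bit positions and helper names are
      assumptions, not the patch's):

      	#define AS_EIO		0	/* async EIO seen on this mapping */
      	#define AS_ENOSPC	1	/* async ENOSPC seen on this mapping */

      	static inline int mapping_gfp_mask(struct address_space *mapping)
      	{
      		return (int)(mapping->flags >> 16);	/* gfp mask in the high 16 bits */
      	}

      	static inline void mapping_set_error(struct address_space *mapping, int error)
      	{
      		if (error == -ENOSPC)
      			set_bit(AS_ENOSPC, &mapping->flags);
      		else if (error)
      			set_bit(AS_EIO, &mapping->flags);
      	}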
      fcad2b42
    • [PATCH] async write errors: report truncate and io errors on · fe7e689f
      Andrew Morton authored
      From: Oliver Xymoron <oxymoron@waste.org>
      
      These patches add the infrastructure for reporting asynchronous write errors
      to block devices to userspace.  Errors which are detected due to pdflush or VM
      writeout are reported at the next fsync, fdatasync, or msync on the given
      file, and on close if the error occurs in time.
      
      We do this by propagating any errors into page->mapping->error when they are
      detected.  In fsync(), msync(), fdatasync() and close() we return that error
      and zero it out.
      
      
      The Open Group say close() _may_ fail if an I/O error occurred while reading
      from or writing to the file system.  Well, in this implementation close() can
      return -EIO or -ENOSPC.  And in that case it will succeed, not fail - perhaps
      that is what they meant.
      
      
      There are three patches in this series and testing has only been performed
      with all three applied.
      fe7e689f
  18. 06 Aug, 2003 1 commit
    • [PATCH] remove PF_READAHEAD · 7ec6fb01
      Andrew Morton authored
      The problem with PF_READAHEAD is that if someone does a non-GFP_ATOMIC memory
      allocation we can enter page reclaim and then call writepage, while
      PF_READAHEAD is set.  The block layer then drops writes or the wrong reads on
      the floor.  It can cause data loss.
      
      A fix is complex (well, intrusive).  Given that the readahead code is now
      skipping the entire readahead attempt if the queue is congested, the setting
      of PF_READAHEAD probably is not doing anything useful anyway, so simply
      remove it.
      7ec6fb01