  1. 19 Sep, 2002 1 commit
     • [PATCH] fix suppression of page allocation failure warnings · d51832f3
      Andrew Morton authored
      Somebody somewhere is stomping on PF_NOWARN, and page allocation
      failure warnings are coming out of the wrong places.
      
      So change the handling of current->flags to be:
      
      int pf_flags = current->flags;
      
      current->flags |= PF_NOWARN;
      ...
      current->flags = pf_flags;
      
      which is a generally more robust approach.
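
       For illustration, a minimal sketch of the pattern wrapped in a hypothetical
       helper (quiet_alloc_page() is not from the patch; alloc_page() just stands
       in for whatever allocation the caller wants to keep quiet):

       	static struct page *quiet_alloc_page(unsigned int gfp_mask)
       	{
       		int pf_flags = current->flags;	/* remember the caller's PF_NOWARN state */
       		struct page *page;

       		current->flags |= PF_NOWARN;
       		page = alloc_page(gfp_mask);	/* a failure here should stay quiet */
       		current->flags = pf_flags;	/* restore rather than blindly clear */
       		return page;
       	}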
  2. 10 Sep, 2002 1 commit
     • [PATCH] exact dirty state accounting · 1f90eedd
      Andrew Morton authored
      Some adjustments to global dirty page accounting.
      
      Previously, dirty page accounting counted all dirty pages.  Even dirty
      anonymous pages.  This has potential to upset the throttling logic in
      balance_dirty_pages().  Particularly as I suspect we should decrease
      the dirty memory writeback thresholds by a lot.
      
      So this patch changes it so that we only account for dirty pagecache
      pages which have backing store.  Not anonymous pages, not swapcache,
      not in-memory filesystem pages.
      
      To support this, the `memory_backed' boolean has been added to struct
      backing_dev_info.  When an address space's backing device is marked as
      memory-backed, the core kernel knows to not include that mapping's
      pages in the dirty memory accounting.
      
      For memory-backed mappings, dirtiness is a way of pinning the page, and
       there's nothing the kernel can do to clean the page to make it freeable.
      
       driverfs, tmpfs, and ramfs have been converted to mark their mappings as
      memory-backed.
      
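       As a rough sketch of how a memory-backed filesystem might use this (the
       field comes from the description above; the exact hookup shown here is
       illustrative):

       	static struct backing_dev_info ramfs_backing_dev_info = {
       		.memory_backed = 1,	/* dirty pages here are pinned, not writable back */
       	};

       	/* at inode creation time: */
       	inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
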
      The ramdisk driver hasn't been converted.  I have a separate patch for
      ramdisk, which fails to fix the longstanding problems in there :(
      
      With this patch, /bin/sync now sends /proc/meminfo:Dirty to zero, which
      is rather comforting.
  3. 30 Aug, 2002 1 commit
     • [PATCH] batched freeing of anon pages · 8fd3d458
      Andrew Morton authored
      A reworked version of the batched page freeing and lock amortisation
      for VMA teardown.
      
      It walks the existing 507-page list in the mmu_gather_t in 16-page
      chunks, drops their refcounts in 16-page chunks, and de-LRUs and
      frees any resulting zero-count pages in up-to-16 page chunks.
  4. 15 Aug, 2002 2 commits
     • [PATCH] batched addition of pages to the LRU · 9eb76ee2
      Andrew Morton authored
      The patch goes through the various places which were calling
      lru_cache_add() against bulk pages and batches them up.
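
       As a sketch of the idea (the batch helper and list-insertion names here
       are assumptions, not the patch's):

       	/* collect pages first, then take pagemap_lru_lock once per batch */
       	void lru_cache_add_batch(struct page **pages, int nr)
       	{
       		int i;

       		spin_lock(&pagemap_lru_lock);
       		for (i = 0; i < nr; i++)
       			add_page_to_inactive_list(pages[i]);
       		spin_unlock(&pagemap_lru_lock);
       	}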
      
      Also.  This whole patch series improves the behaviour of the system
      under heavy writeback load.  There is a reduction in page allocation
      failures, some reduction in loss of interactivity due to page
      allocators getting stuck on writeback from the VM.  (This is still bad
      though).
      
      I think it's due to the change here in mpage_writepages().  That
      function was originally unconditionally refiling written-back pages to
      the head of the inactive list.  The theory being that they should be
      moved out of the way of page allocators, who would end up waiting on
      them.
      
      It appears that this simply had the effect of pushing dirty, unwritten
      data closer to the tail of the inactive list, making things worse.
      
      So instead, if the caller is (typically) balance_dirty_pages() then
      leave the pages where they are on the LRU.
      
      If the caller is PF_MEMALLOC then the pages *have* to be refiled.  This
      is because VM writeback is clustered along mapping->dirty_pages, and
      it's almost certain that the pages which are being written are near the
      tail of the LRU.  If they were left there, page allocators would block
      on them too soon.  It would effectively become a synchronous write.
     • [PATCH] multithread page reclaim · 3aa1dc77
      Andrew Morton authored
      This patch multithreads the main page reclaim function, shrink_cache().
      
      This function used to run under pagemap_lru_lock.  Instead, we grab
      that lock, put 32 pages from the LRU into a private list, drop the
      pagemap_lru_lock and then proceed to attempt to free those pages.
      
       Any pages which were successfully reclaimed are batch-freed.  Pages
      which were not reclaimed are re-added to the LRU.
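
       The locking pattern, roughly (a simplified sketch; list and field names
       are approximate):

       	LIST_HEAD(page_list);
       	int nr_taken = 0;

       	spin_lock(&pagemap_lru_lock);
       	while (nr_taken < 32 && !list_empty(&inactive_list)) {
       		struct page *page = list_entry(inactive_list.prev,
       					       struct page, lru);
       		list_move(&page->lru, &page_list);	/* detach onto the private list */
       		page_cache_get(page);			/* pin it while we drop the lock */
       		nr_taken++;
       	}
       	spin_unlock(&pagemap_lru_lock);

       	/* now try to reclaim everything on page_list with no lock held;
       	 * freed pages are batch-freed, the rest are put back on the LRU */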
      
      This patch reduces pagemap_lru_lock contention on the 4-way by a factor
      of thirty.
      
      The shrink_cache() code has been simplified somewhat.
      
      refill_inactive() was being called too often - often just to process
      two or three pages.  Fiddled with that so it processes pages at the
      same rate, but works on 32 pages at a time.
      
      Added a couple of mark_page_accessed() calls into mm/memory.c from 2.4.
      They seem appropriate.
      
      Change the shrink_caches() logic so that it will still trickle through
      the active list (via refill_inactive) even if the inactive list is much
      larger than the active list.
  5. 10 Aug, 2002 1 commit
     • [PATCH] misc pagecache cleanups / tweaks · 8b1763fb
      Christoph Hellwig authored
      - inline grab_cache_page() in pagemap.h, it's just a simple wrapper
        around find_or_create_page()
      - rename (__)remove_inode_page to (__)remove_from_page_cache and
        move them from mm.h and swap.h to pagemap.h because they reverse
        add_to_page_cache and that's where they belong.
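
       The wrapper from the first item is roughly this (assuming the mapping's
       gfp_mask is what gets used for new page allocation):

       	static inline struct page *grab_cache_page(struct address_space *mapping,
       						   unsigned long index)
       	{
       		return find_or_create_page(mapping, index, mapping->gfp_mask);
       	}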
  6. 01 Aug, 2002 1 commit
  7. 29 Jul, 2002 1 commit
  8. 28 Jul, 2002 1 commit
     • [PATCH] show_free_areas() cleanup · c1ab3459
      Andrew Morton authored
      Cleanup to show_free_areas() from Bill Irwin:
      
       show_free_areas() and show_free_areas_core() are a mess.
      (1) it uses a bizarre and ugly form of list iteration to walk buddy lists
              use standard list functions instead
      (2) it prints the same information repeatedly once per-node
              rationalize the braindamaged iteration logic
      (3) show_free_areas_node() is useless and not called anywhere
              remove it entirely
      (4) show_free_areas() itself just calls show_free_areas_core()
              remove show_free_areas_core() and do the stuff directly
      (5) SWAP_CACHE_INFO is always #defined, remove it
      (6) INC_CACHE_INFO() doesn't use the do { } while (0) construct
      
      This patch also includes Matthew Dobson's patch which removes
      mm/numa.c:node_lock.  The consensus is that it doesn't do anything now
      that show_free_areas_node() isn't there.
  9. 19 Jul, 2002 2 commits
     • [PATCH] remove add_to_page_cache_unique() · cad46d66
      Andrew Morton authored
       A tasty patch from Hugh Dickins.  radix_tree_insert() fails if something
      was already present at the target index, so that error can be
      propagated back through add_to_page_cache().  Hence
      add_to_page_cache_unique() is obsolete.
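
       A sketch of the resulting shape of add_to_page_cache(), with locking and
       list bookkeeping omitted (illustrative only):

       	int add_to_page_cache(struct page *page, struct address_space *mapping,
       			      unsigned long offset)
       	{
       		int error = radix_tree_insert(&mapping->page_tree, offset, page);

       		if (!error) {
       			page_cache_get(page);
       			page->mapping = mapping;
       			page->index = offset;
       		}
       		return error;	/* non-zero if the index was already occupied */
       	}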
      
      Hugh's patch removes add_to_page_cache_unique() and cleans up a bunch of
      stuff.
     • [PATCH] minimal rmap · c48c43e6
      Andrew Morton authored
      This is the "minimal rmap" patch, writen by Rik, ported to 2.5 by Craig
      Kulsea.
      
      Basically,
      
       before: When the page reclaim code decides that it has scanned too many
      unreclaimable pages on the LRU it does a scan of process virtual
      address spaces for pages to add to swapcache.  ptes pointing at the
      page are unmapped as the scan proceeds.  When all ptes referring to a
      page have been unmapped and it has been written to swap the page is
      reclaimable.
      
      after: When an anonymous page is encountered on the tail of the LRU we
      use the rmap to see if it hasn't been referenced lately.  If so then
      add it to swapcache.  When the page is again encountered on the LRU, if
      it is still unreferenced then try to unmap all ptes which refer to it
      in one hit, and if it is clean (ie: on swap) then free it.
      
       The rest of the VM - list management, the classzone concept, etc. -
      remains unchanged.
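
       For concreteness, the per-page reverse-mapping data is roughly of this
       shape (field names are illustrative, not necessarily the patch's):

       	struct pte_chain {
       		struct pte_chain *next;
       		pte_t *ptep;		/* one pte that maps the page */
       	};

       	/* ...and struct page grows one pointer to the head of that chain */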
      
      There are a number of things which the per-page pte chain could be
      used for.  Bill Irwin has identified the following.
      
      
      (1)  page replacement no longer goes around randomly unmapping things
      
      (2)  referenced bits are more accurate because there aren't several ms
               or even seconds between finding the multiple ptes mapping a page
      
      (3)  reduces page replacement from O(total virtually mapped) to O(physical)
      
      (4)  enables defragmentation of physical memory
      
      (5)  enables cooperative offlining of memory for friendly guest instance
              behavior in UML and/or LPAR settings
      
      (6)  demonstrable benefit in performance of swapping which is common in
              end-user interactive workstation workloads (I don't like the word
              "desktop"). c.f. Craig Kulesa's post wrt. swapping performance
      
      (7)  evidence from 2.4-based rmap trees indicates approximate parity
              with mainline in kernel compiles with appropriate locking bits
      
      (8)  partitioning of physical memory can reduce the complexity of page
              replacement searches by scanning only the "interesting" zones
              implemented and merged in 2.4-based rmap
      
      (9)  partitioning of physical memory can increase the parallelism of page
              replacement searches by independently processing different zones
              implemented, but not merged in 2.4-based rmap
      
      (10) the reverse mappings may be used for efficiently keeping pte cache
              attributes coherent
      
      (11) they may be used for virtual cache invalidation (with changes)
      
      (12) the reverse mappings enable proper RSS limit enforcement
              implemented and merged in 2.4-based rmap
      
      
      
      The code adds a pointer to struct page, consumes additional storage for
      the pte chains and adds computational expense to the page reclaim code
      (I measured it at 3% additional load during streaming I/O).  The
      benefits which we get back for all this are, I must say, theoretical
      and unproven.  If it has real advantages (or, indeed, disadvantages)
      then why has nobody demonstrated them?
      
      
      
      There are a number of things remaining to be done:
      
      1: Demonstrate the above advantages.
      
      2: Make it work with pte-highmem  (Bill Irwin is signed up for this)
      
      3: Don't add pte_chains to non-shared pages optimisation (Dave McCracken's
         patch does this)
      
      4: Move the pte_chains into highmem too (Bill, I guess)
      
      5: per-cpu pte_chain freelists (Rik?)
      
      6: maybe GC the pte_chain backing pages. (Seems unavoidable.  Rik?)
      
      7: multithread the page reclaim code.  (I have patches).
      
      8: clustered add-to-swap.  Not sure if I buy this.  anon pages are
         often well-ordered-by-virtual-address on the LRU, so it "just
         works" for benchmarky loads.  But there may be some other loads...
      
      9: Fix bad IO latency in page reclaim (I have lame patches)
      
      10: Develop tuning tools, use them.
      
      11: The nightly updatedb run is still evicting everything.
  10. 04 Jul, 2002 2 commits
     • [PATCH] always update page->flags atomically · a2b41d23
      Andrew Morton authored
      move_from_swap_cache() and move_to_swap_cache() are playing with
      page->flags nonatomically.  The page is on the LRU at the time and
      another CPU could be altering page->flags concurrently.
      
      The patch converts those functions to use atomic operations.
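
       The difference in miniature (illustrative; the real sites touch several
       flags at once):

       	/* racy: a plain read-modify-write can lose another CPU's concurrent
       	 * update to page->flags */
       	page->flags &= ~(1 << PG_dirty);

       	/* safe: atomic bit operations on the same word */
       	ClearPageDirty(page);
       	SetPageUptodate(page);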
      
      It also rationalises the number of bits which are cleared.  It's not
      really clear to me what page flags we really want to set to a known
      state in there.
      
      It had no right to go clearing PG_arch_1.  I'm now clearing PG_arch_1
       inside rmqueue(), which is still a bit presumptuous.
      
      btw: shmem uses PAGE_CACHE_SIZE and swapper_space uses PAGE_SIZE.  I've
      been carefully maintaining the distinction, but it looks like shmem
      will break if we ever do make these values different.
      
      
      Also, __add_to_page_cache() was performing a non-atomic RMW against
      page->flags, under the assumption that it was a newly allocated page
      which no other CPU would look at.  Not true - this function is used for
      moving anon pages into swapcache.  Those anon pages are on the LRU -
      other CPUs can be performing operations against page->flags while
      __add_to_swap_cache is stomping on them.  This had me running around in
      circles for two days.
      
      So let's move the initialisation of the page state into rmqueue(),
      where the page really is new (could do it in page_cache_alloc,
      perhaps).
      
      The SetPageLocked() in __add_to_page_cache() is also rather curious.
      Seems OK for both pagecache and swapcache so I covered that with a
      comment.
      
      
      2.4 has the same problem.  Basically, add_to_swap_cache() can stomp on
      another CPU's manipulation of page->flags.  After a quick review of the
       code there, it is barely conceivable that a concurrent refill_inactive()
      could get its PG_referenced and PG_active bits scribbled on.  Rather
      unlikely because swap_out() will probably see PageActive() and bale
      out.
      
      Also, mark_dirty_kiobuf() could have its PG_dirty bit accidentally
      cleared (but try_to_swap_out() sets it again later).
      
      But there may be other code paths.  Really, I think this needs fixing
      in 2.4 - it's horrid.
     • [PATCH] misc cleanups and fixes · 06be3a5e
      Andrew Morton authored
      - Comment and documentation fixlets
      
      - Remove some unneeded fields from swapper_inode (these are a
        leftover from when I had swap using the filesystem IO functions).
      
      - fix a printk bug in pci/pool.c: when dma_addr_t is 64 bit it
        generates a compile warning, and will print out garbage.  Cast it to
        unsigned long long.
      
      - Convert some writeback #defines into enums (Steven Augart)
  11. 18 Jun, 2002 2 commits
     • [PATCH] direct-to-BIO I/O for swapcache pages · 88c4650a
      Andrew Morton authored
      This patch changes the swap I/O handling.  The objectives are:
      
      - Remove swap special-casing
      - Stop using buffer_heads -> direct-to-BIO
      - Make S_ISREG swapfiles more robust.
      
      I've spent quite some time with swap.  The first patches converted swap to
      use block_read/write_full_page().  These were discarded because they are
      still using buffer_heads, and a reasonable amount of otherwise unnecessary
      infrastructure had to be added to the swap code just to make it look like a
      regular fs.  So this code just has a custom direct-to-BIO path for swap,
      which seems to be the most comfortable approach.
      
      A significant thing here is the introduction of "swap extents".  A swap
      extent is a simple data structure which maps a range of swap pages onto a
      range of disk sectors.  It is simply:
      
      	struct swap_extent {
      		struct list_head list;
      		pgoff_t start_page;
      		pgoff_t nr_pages;
      		sector_t start_block;
      	};
      
      At swapon time (for an S_ISREG swapfile), each block in the file is bmapped()
      and the block numbers are parsed to generate the device's swap extent list.
      This extent list is quite compact - a 512 megabyte swapfile generates about
      130 nodes in the list.  That's about 4 kbytes of storage.  The conversion
      from filesystem blocksize blocks into PAGE_SIZE blocks is performed at swapon
      time.
      
      At swapon time (for an S_ISBLK swapfile), we install a single swap extent
      which describes the entire device.
      
      The advantages of the swap extents are:
      
      1: We never have to run bmap() (ie: read from disk) at swapout time.  So
         S_ISREG swapfiles are now just as robust as S_ISBLK swapfiles.
      
      2: All the differences between S_ISBLK swapfiles and S_ISREG swapfiles are
         handled at swapon time.  During normal operation, we just don't care.
         Both types of swapfiles are handled the same way.
      
      3: The extent lists always operate in PAGE_SIZE units.  So the problems of
         going from fs blocksize to PAGE_SIZE are handled at swapon time and normal
         operating code doesn't need to care.
      
      4: Because we don't have to fiddle with different blocksizes, we can go
         direct-to-BIO for swap_readpage() and swap_writepage().  This introduces
         the kernel-wide invariant "anonymous pages never have buffers attached",
         which cleans some things up nicely.  All those block_flushpage() calls in
         the swap code simply go away.
      
      5: The kernel no longer has to allocate both buffer_heads and BIOs to
         perform swapout.  Just a BIO.
      
      6: It permits us to perform swapcache writeout and throttling for
         GFP_NOFS allocations (a later patch).
      
      (Well, there is one sort of anon page which can have buffers: the pages which
      are cast adrift in truncate_complete_page() because do_invalidatepage()
      failed.  But these pages are never added to swapcache, and nobody except the
      VM LRU has to deal with them).
      
      The swapfile parser in setup_swap_extents() will attempt to extract the
      largest possible number of PAGE_SIZE-sized and PAGE_SIZE-aligned chunks of
      disk from the S_ISREG swapfile.  Any stray blocks (due to file
      discontiguities) are simply discarded - we never swap to those.
      
      If an S_ISREG swapfile is found to have any unmapped blocks (file holes) then
      the swapon attempt will fail.
      
      The extent list can be quite large (hundreds of nodes for a gigabyte S_ISREG
      swapfile).  It needs to be consulted once for each page within
      swap_readpage() and swap_writepage().  Hence there is a risk that we could
      blow significant amounts of CPU walking that list.  However I have
      implemented a "where we found the last block" cache, which is used as the
      starting point for the next search.  Empirical testing indicates that this is
      wildly effective - the average length of the list walk in map_swap_page() is
      0.3 iterations per page, with a 130-element list.
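
       The shape of that lookup, ignoring the last-hit cache (a sketch; the name
       of the extent-list head is an assumption):

       	sector_t map_swap_page(struct swap_info_struct *sis, pgoff_t offset)
       	{
       		struct swap_extent *se;

       		list_for_each_entry(se, &sis->extent_list, list) {
       			if (offset >= se->start_page &&
       			    offset < se->start_page + se->nr_pages)
       				return se->start_block + (offset - se->start_page);
       		}
       		BUG();	/* every swap page must fall inside some extent */
       		return 0;
       	}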
      
      It _could_ be that some workloads do start suffering long walks in that code,
      and perhaps a tree would be needed there.  But I doubt that, and if this is
      happening then it means that we're seeking all over the disk for swap I/O,
      and the list walk is the least of our problems.
      
      rw_swap_page_nolock() now takes a page*, not a kernel virtual address.  It
      has been renamed to rw_swap_page_sync() and it takes care of locking and
      unlocking the page itself.  Which is all a much better interface.
      
       Support for type 0 swap has been removed.  Current versions of mkswap(8) seem
      to never produce v0 swap unless you explicitly ask for it, so I doubt if this
      will affect anyone.  If you _do_ have a type 0 swapfile, swapon will fail and
      the message
      
      	version 0 swap is no longer supported. Use mkswap -v1 /dev/sdb3
      
      is printed.  We can remove that code for real later on.  Really, all that
      swapfile header parsing should be pushed out to userspace.
      
      This code always uses single-page BIOs for swapin and swapout.  I have an
      additional patch which converts swap to use mpage_writepages(), so we swap
      out in 16-page BIOs.  It works fine, but I don't intend to submit that.
      There just doesn't seem to be any significant advantage to it.
      
      I can't see anything in sys_swapon()/sys_swapoff() which needs the
      lock_kernel() calls, so I deleted them.
      
      If you ftruncate an S_ISREG swapfile to a shorter size while it is in use,
      subsequent swapout will destroy the filesystem.  It was always thus, but it
      is much, much easier to do now.  Not really a kernel problem, but swapon(8)
      should not be allowing the kernel to use swapfiles which are modifiable by
      unprivileged users.
     • [PATCH] leave swapcache pages unlocked during writeout · 3ab86fb0
      Andrew Morton authored
      Convert swap pages so that they are PageWriteback and !PageLocked while
      under writeout, like all other block-backed pages.  (Network
      filesystems aren't doing this yet - their pages are still locked while
      under writeout)
  12. 02 Jun, 2002 2 commits
     • [PATCH] swapcache bugfixes · 91cb02b7
      Andrew Morton authored
      Fixes a few lock ranking bugs (and deadlocks) related to
      swap_list_lock(), swap_device_lock(), mapping->page_lock and
      mapping->private_lock.
      
      - Cannot call block_flushpage->try_to_free_buffers() inside
        mapping->page_lock.  Because __set_page_dirty_buffers() takes
         ->page_lock inside ->private_lock.
      
      - Cannot call swap_free->swap_list_lock/swap_device_lock inside
        mapping->page_lock because exclusive_swap_page() takes ->page_lock
        inside swap_info_get().
      
      
      The patch also removes all the block_flushpage() calls from the swap
      code in favour of a direct call to try_to_free_buffers().
      
      The theory is that the page is locked, there is no I/O underway, nobody
      else has access to the buffers so they MUST be freeable.  A bunch of
      BUG() checks have been added, and unless someone manages to trigger
      one, the "block_flushpage() inside spinlock" problem is fixed.
     • [PATCH] give swapper_space a set_page_dirty a_op · 3aeb30b0
      Andrew Morton authored
      Give swapper_space a ->set_page_dirty() address_space_operation.
      
      So swapcache pages do not need special-casing in
      set_page_dirty_buffers().
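
       A minimal sketch of what that can look like (assuming the generic
       no-buffers dirtier is suitable for swapcache):

       	static struct address_space_operations swap_aops = {
       		.writepage	= swap_writepage,
       		.set_page_dirty	= __set_page_dirty_nobuffers,
       	};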
  13. 27 May, 2002 2 commits
     • [PATCH] rename writeback_mapping to writepages · 7d608fac
      Andrew Morton authored
      Spot the difference:
      
      aops.readpage
      aops.readpages
      aops.writepage
      aops.writeback_mapping
      
      The patch renames `writeback_mapping' to `writepages'
     • [PATCH] mark swapout pages PageWriteback() · 357f5a5e
      Andrew Morton authored
      Pages which are under writeout to swap are locked, and not
      PageWriteback().  So page allocators do not throttle against them in
      shrink_caches().
      
      This causes enormous list scans and general coma under really heavy
      swapout loads.
      
      One fix would be to teach shrink_cache() to wait on PG_locked for swap
      pages.  The other approach is to set both PG_locked and PG_writeback
      for swap pages so they can be handled in the same manner as file-backed
      pages in shrink_cache().
      
      This patch takes the latter approach.
  14. 23 May, 2002 1 commit
  15. 19 May, 2002 3 commits
     • [PATCH] writeback tuning · acb5f6f9
      Andrew Morton authored
      Tune up the VM-based writeback a bit.
      
      - Always use the multipage clustered-writeback function from within
        shrink_cache(), even if the page's mapping has a NULL ->vm_writeback().  So
        clustered writeback is turned on for all address_spaces, not just ext2.
      
        Subtle effect of this change: it is now the case that *all* writeback
        proceeds along the mapping->dirty_pages list.  The orderedness of the page
        LRUs no longer has an impact on disk scheduling.  So we only have one list
        to keep well-sorted rather than two, and churning pages around on the LRU
        will no longer damage write bandwidth - it's all up to the filesystem.
      
      - Decrease the clustered writeback from 1024 pages(!) to 32 pages.
      
        (1024 was a leftover from when this code was always dispatching writeback
        to a pdflush thread).
      
      - Fix wakeup_bdflush() so that it actually does write something (duh).
      
        do_wp_page() needs to call balance_dirty_pages_ratelimited(), so we
        throttle mmap page-dirtiers in the same way as write(2) page-dirtiers.
        This may make wakeup_bdflush() obsolete, but it doesn't hurt.
      
      - Converts generic_vm_writeback() to directly call ->writeback_mapping(),
         rather than going through writeback_single_inode().  This prevents memory
        allocators from blocking on the inode's I_LOCK.  But it does mean that two
        processes can be writing pages from the same mapping at the same time.  If
        filesystems care about this (for layout reasons) then they should serialise
        in their ->writeback_mapping a_op.
      
        This means that memory-allocators will writeback only pages, not pages
        and inodes.  There are no locks in that writeback path (except for request
        queue exhaustion).  Reduces memory allocation latency.
      
      - Implement new background_writeback function, which when kicked off
        will perform writeback until dirty memory falls below the background
        threshold.
      
      - Put written-back pages onto the remote end of the page LRU.  It
        does this in the slow-and-stupid way at present.  pagemap_lru_lock
        stress-relief is planned...
      
      - Remove the funny writeback_unused_inodes() stuff from prune_icache().
        Writeback from wakeup_bdflush() and the `kupdate' function now just
        naturally cleanses the oldest inodes so we don't need to do anything
        there.
      
      - Dirty memory balancing is still using magic numbers: "after you
        dirtied your 1,000th page, go write 1,500".  Obviously, this needs
        more work.
     • [PATCH] fix dirty page management · 0f9268b8
      Andrew Morton authored
      This fixes a bug in ext3 - when ext3 decides that it wants to fail its
      writepage(), it is running SetPageDirty().  But ->writepage has just put
      the page on ->clean_pages().  The page ends up dirty, on ->clean_pages
      and the normal writeback paths don't know about it any more.
      
      So run set_page_dirty() instead, to place the page back on the dirty
      list.
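
       Illustrative shape of the failure path (the predicate is hypothetical; the
       point is set_page_dirty() versus SetPageDirty()):

       	static int example_writepage(struct page *page)
       	{
       		if (!can_start_journal_handle()) {	/* hypothetical failure test */
       			/* set_page_dirty() also moves the page back onto
       			 * mapping->dirty_pages; SetPageDirty() alone would leave
       			 * it stranded on ->clean_pages */
       			set_page_dirty(page);
       			unlock_page(page);
       			return 0;
       		}
       		/* ... normal writeout ... */
       		return 0;
       	}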
      
      And in move_from_swap_cache(), shuffle the page across to ->dirty_pages
      so that it's eligible for writeout.  ___add_to_page_cache() forgets to
      look at the page state when deciding which list to attach it to.
      
      All SetPageDirty() callers otherwise look OK.
     • [PATCH] i_dirty_buffers locking fix · 43152186
      Andrew Morton authored
      This fixes a race between try_to_free_buffers' call to
      __remove_inode_queue() and other users of b_inode_buffers
      (fsync_inode_buffers and mark_buffer_dirty_inode()).  They are
      presently taking different locks.
      
      The patch relocates and redefines and clarifies(?) the role of
      inode.i_dirty_buffers.
      
      The 2.4 definition of i_dirty_buffers is "a list of random buffers
      which is protected by a kernel-wide lock".  This definition needs to be
      narrowed in the 2.5 context.  It is now
      
      "a list of buffers from a different mapping, protected by a lock within
      that mapping".  This list of buffers is specifically for fsync().
      
      As this is a "data plane" operation, all the structures have been moved
      out of the inode and into the address_space.  So address_space now has:
      
      list_head private_list;
      
           A list, available to the address_space for any purpose.  If
           that address_space chooses to use the helper functions
           mark_buffer_dirty_inode and sync_mapping_buffers() then this list
           will contain buffer_heads, attached via
           buffer_head.b_assoc_buffers.
      
           If the address_space does not call those helper functions
           then the list is free for other usage.  The only requirement is
           that the list be list_empty() at destroy_inode() time.
      
           At least, this is the objective.  At present,
           generic_file_write() will call generic_osync_inode(), which
           expects that list to contain buffer_heads.  So private_list isn't
           useful for anything else yet.
      
      spinlock_t private_lock;
      
           A spinlock, available to the address_space.
      
           If the address_space is using try_to_free_buffers(),
           mark_inode_dirty_buffers() and fsync_inode_buffers() then this
           lock is used to protect the private_list of *other* mappings which
           have listed buffers from *this* mapping onto themselves.
      
           That is: for buffer_heads, mapping_A->private_lock does not
           protect mapping_A->private_list!  It protects the b_assoc_buffers
           list from buffers which are backed by mapping_A and it protects
           mapping_B->private_list, mapping_C->private_list, ...
      
           So what we have here is a cross-mapping association.  S_ISREG
           mappings maintain a list of buffers from the blockdev's
           address_space which they need to know about for a successful
            fsync().  The locking follows the buffers: the lock is in the
           blockdev's mapping, not in the S_ISREG file's mapping.
      
           For address_spaces which use try_to_free_buffers,
           private_lock is also (and quite unrelatedly) used for protection
           of the buffer ring at page->private.  Exclusion between
           try_to_free_buffers(), __get_hash_table() and
           __set_page_dirty_buffers().  This is in fact its major use.
      
      address_space *assoc_mapping
      
          Sigh.  This is the address of the mapping which backs the
          buffers which are attached to private_list.  It's here so that
          generic_osync_inode() can locate the lock which protects this
          mapping's private_list.  Will probably go away.
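
       Putting the three additions together, roughly (comments condensed from the
       descriptions above):

       	struct address_space {
       		...
       		struct list_head	private_list;	/* fsync buffers (b_assoc_buffers) */
       		spinlock_t		private_lock;	/* protects other mappings' private_lists
       							   whose buffers this mapping backs, and
       							   the buffer ring at page->private */
       		struct address_space	*assoc_mapping;	/* mapping backing private_list's buffers */
       		...
       	};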
      
      
      A consequence of all the above is that:
      
          a) All the buffers at a mapping_A's ->private_list must come
             from the same mapping, mapping_B.  There is no requirement that
             mapping_B be a blockdev mapping, but that's how it's used.
      
             There is a BUG() check in mark_buffer_dirty_inode() for this.
      
          b) blockdev mappings never have any buffers on ->private_list.
             It just never happens, and doesn't make a lot of sense.
      
      reiserfs is using b_inode_buffers for attaching dependent buffers to its
      journal and that caused a few problems.  Fixed in reiserfs_releasepage.patch
  16. 30 Apr, 2002 3 commits
     • [PATCH] cleanup page flags · aa78091f
      Andrew Morton authored
      page->flags cleanup.
      
      Moves the definitions of the page->flags bits and all the PageFoo
      macros into linux/page-flags.h.  That file is currently included from
      mm.h, but the stage is set to remove that and include page-flags.h
      direct in all .c files which require that.  (120 of them).
      
      The patch also makes all the page flag macros and functions consistent:
      
      For PG_foo, the following functions are defined:
      
      	SetPageFoo
      	ClearPageFoo
      	TestSetPageFoo
      	TestClearPageFoo
      	PageFoo
      
      and that's it.
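
       The pattern, for one example flag (a sketch of what the definitions
       expand to):

       	#define PageDirty(page)		test_bit(PG_dirty, &(page)->flags)
       	#define SetPageDirty(page)	set_bit(PG_dirty, &(page)->flags)
       	#define ClearPageDirty(page)	clear_bit(PG_dirty, &(page)->flags)
       	#define TestSetPageDirty(page)	test_and_set_bit(PG_dirty, &(page)->flags)
       	#define TestClearPageDirty(page) test_and_clear_bit(PG_dirty, &(page)->flags)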
      
      - Page_Uptodate is renamed to PageUptodate
      
      - LockPage is removed.  All users updated to use SetPageLocked
      
      - UnlockPage is removed.  All callers updated to use unlock_page().
        it's a real function - there's no need to hide that fact.
      
      - PageTestandClearReferenced renamed to TestClearPageReferenced
      
      - PageSetSlab renamed to SetPageSlab
      
      - __SetPageReserved is removed.  It's an infinitesimally small
         microoptimisation, and is inconsistent.
      
      - TryLockPage is renamed to TestSetPageLocked
      
      - PageSwapCache() is renamed to page_swap_cache(), so it doesn't
        pretend to be a page->flags bit test.
     • [PATCH] writeback from address spaces · 090da372
      Andrew Morton authored
      [ I reversed the order in which writeback walks the superblock's
        dirty inodes.  It sped up dbench's unlink phase greatly.  I'm
        such a sleaze ]
      
      The core writeback patch.  Switches file writeback from the dirty
      buffer LRU over to address_space.dirty_pages.
      
      - The buffer LRU is removed
      
      - The buffer hash is removed (uses blockdev pagecache lookups)
      
      - The bdflush and kupdate functions are implemented against
        address_spaces, via pdflush.
      
      - The relationship between pages and buffers is changed.
      
        - If a page has dirty buffers, it is marked dirty
        - If a page is marked dirty, it *may* have dirty buffers.
        - A dirty page may be "partially dirty".  block_write_full_page
          discovers this.
      
      - A bunch of consistency checks of the form
      
      	if (!something_which_should_be_true())
      		buffer_error();
      
        have been introduced.  These fog the code up but are important for
        ensuring that the new buffer/page code is working correctly.
      
      - New locking (inode.i_bufferlist_lock) is introduced for exclusion
        from try_to_free_buffers().  This is needed because set_page_dirty
        is called under spinlock, so it cannot lock the page.  But it
        needs access to page->buffers to set them all dirty.
      
        i_bufferlist_lock is also used to protect inode.i_dirty_buffers.
      
      - fs/inode.c has been split: all the code related to file data writeback
        has been moved into fs/fs-writeback.c
      
      - Code related to file data writeback at the address_space level is in
        the new mm/page-writeback.c
      
      - try_to_free_buffers() is now non-blocking
      
      - Switches vmscan.c over to understand that all pages with dirty data
        are now marked dirty.
      
      - Introduces a new a_op for VM writeback:
      
      	->vm_writeback(struct page *page, int *nr_to_write)
      
        this is a bit half-baked at present.  The intent is that the address_space
        is given the opportunity to perform clustered writeback.  To allow it to
        opportunistically write out disk-contiguous dirty data which may be in other zones.
        To allow delayed-allocate filesystems to get good disk layout.
      
      - Added address_space.io_pages.  Pages which are being prepared for
        writeback.  This is here for two reasons:
      
        1: It will be needed later, when BIOs are assembled direct
           against pagecache, bypassing the buffer layer.  It avoids a
           deadlock which would occur if someone moved the page back onto the
           dirty_pages list after it was added to the BIO, but before it was
           submitted.  (hmm.  This may not be a problem with PG_writeback logic).
      
        2: Avoids a livelock which would occur if some other thread is continually
           redirtying pages.
      
      - There are two known performance problems in this code:
      
        1: Pages which are locked for writeback cause undesirable
           blocking when they are being overwritten.  A patch which leaves
           pages unlocked during writeback comes later in the series.
      
        2: While inodes are under writeback, they are locked.  This
           causes namespace lookups against the file to get unnecessarily
           blocked in wait_on_inode().  This is a fairly minor problem.
      
           I don't have a fix for this at present - I'll fix this when I
           attach dirty address_spaces direct to super_blocks.
      
      - The patch vastly increases the amount of dirty data which the
        kernel permits highmem machines to maintain.  This is because the
        balancing decisions are made against the amount of memory in the
        machine, not against the amount of buffercache-allocatable memory.
      
        This may be very wrong, although it works fine for me (2.5 gigs).
      
        We can trivially go back to the old-style throttling with
        s/nr_free_pagecache_pages/nr_free_buffer_pages/ in
        balance_dirty_pages().  But better would be to allow blockdev
        mappings to use highmem (I'm thinking about this one, slowly).  And
        to move writer-throttling and writeback decisions into the VM (modulo
        the file-overwriting problem).
      
      - Drops 24 bytes from struct buffer_head.  More to come.
      
      - There's some gunk like super_block.flags:MS_FLUSHING which needs to
        be killed.  Need a better way of providing collision avoidance
        between pdflush threads, to prevent more than one pdflush thread
        working a disk at the same time.
      
        The correct way to do that is to put a flag in the request queue to
        say "there's a pdlfush thread working this disk".  This is easy to
        do: just generalise the "ra_pages" pointer to point at a struct which
        includes ra_pages and the new collision-avoidance flag.
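
       That is, something along these lines (entirely illustrative; the names are
       not from the patch):

       	struct dev_writeback_info {
       		unsigned long ra_pages;		/* the existing readahead window */
       		unsigned long flags;		/* e.g. "a pdflush thread is working
       						   this disk" collision-avoidance bit */
       	};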
     • [PATCH] page accounting · d878155c
      Andrew Morton authored
      This patch provides global accounting of locked and dirty pages.  It
      does this via lightweight per-CPU data structures.  The page_cache_size
      accounting has been changed to use this facility as well.
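
       A sketch of the sort of per-CPU structure involved (field and helper names
       are assumptions):

       	struct page_state {
       		unsigned long nr_dirty;
       		unsigned long nr_locked;
       		unsigned long nr_pagecache;
       	} ____cacheline_aligned;

       	static struct page_state page_states[NR_CPUS];

       	static inline void inc_nr_dirty(void)
       	{
       		page_states[smp_processor_id()].nr_dirty++;
       	}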
      
      Locked and dirty page accounting is needed for making writeback and
      throttling decisions.
      
      The patch also starts to move code which is related to page->flags
      out of linux/mm.h and into linux/page-flags.h
  17. 10 Apr, 2002 1 commit
     • [PATCH] Velikov/Hellwig radix-tree pagecache · 3d30a6cc
      Andrew Morton authored
      Before the mempool was added, the VM was getting many, many
      0-order allocation failures due to the atomic ratnode
      allocations inside swap_out.  That monster mempool is
      doing its job - drove a 256meg machine a gigabyte into
      swap with no ratnode allocation failures at all.
      
      So we do need to trim that pool a bit, and also handle
      the case where swap_out fails, and not just keep
      pointlessly calling it.
  18. 05 Feb, 2002 13 commits
     • v2.4.13.6 -> v2.4.13.7 · 595cf06f
      Linus Torvalds authored
        - me: reinstate "delete swap cache on low swap" code
        - David Miller: ksoftirqd startup race fix
        - Hugh Dickins: make tmpfs free swap cache entries proactively
     • v2.4.13.4 -> v2.4.13.5 · 22a160fb
      Linus Torvalds authored
        - Andrew Morton: remove stale UnlockPage
        - me: swap cache page locking update
     • v2.4.13.3 -> v2.4.13.4 · f97f22cb
      Linus Torvalds authored
        - Mikael Pettersson: fix P4 boot with APIC enabled
        - me: fix device queuing thinko, clean up VM locking
     • v2.4.13 -> v2.4.13.1 · 980adcb2
      Linus Torvalds authored
        - Michael Warfield: computone serial driver update
        - Alexander Viro: cdrom module race fixes
        - David Miller: Acenic driver fix
        - Andrew Grover: ACPI update
        - Kai Germaschewski: ISDN update
        - Tim Waugh: parport update
        - David Woodhouse: JFFS garbage collect sleep
     • v2.4.10.5 -> v2.4.10.6 · 0a528ace
      Linus Torvalds authored
        - various: fix some module exports uncovered by stricter error checking
        - Urban Widmark: make smbfs use same error define names as samba and win32
        - Greg KH: USB update
        - Tom Rini: MPC8xx ppc update
        - Matthew Wilcox: rd.c page cache flushing fix
        - Richard Gooch: devfs race fix: rwsem for symlinks
        - Björn Wesen: Cris arch update
        - Nikita Danilov: reiserfs cleanup
        - Tim Waugh: parport update
        - Peter Rival: update alpha SMP bootup to match wait_init_idle fixes
        - Trond Myklebust: lockd/grace period fix
     • v2.4.10.3 -> v2.4.10.4 · 1d23a518
      Linus Torvalds authored
        - Al Viro: separate out superblocks and FS namespaces: fs/super.c fathers
        fs/namespace.c
        - David Woodhouse: large MTD and JFFS[2] update
        - Marcelo Tosatti: resurrect oom handling
        - Hugh Dickins: add_to_swap_cache racefix cleanup
        - Jean Tourrilhes: IrDA update
        - Martin Bligh: support clustered logical APIC for >8 CPU x86 boxes
        - Richard Henderson: alpha update
     • v2.4.9.14 -> v2.4.9.15 · e2f6721a
      Linus Torvalds authored
        - Jan Harkes: make Coda work with arbitrary host filesystems, not
        just filesystems that use generic_file_read/write
        - Al Viro: block device cleanups
        - Hugh Dickins: swap device lock fixes - fix swap readahead race
        - me, Andrea: more reference bit cleanups
     • v2.4.9.12 -> v2.4.9.13 · a27c6530
      Linus Torvalds authored
        - Manfred Spraul: /proc/pid/maps cleanup (and bugfix for non-x86)
        - Al Viro: "block device fs" - cleanup of page cache handling
        - Hugh Dickins: VM/shmem cleanups and swap search speedup
        - David Miller: sparc updates, soc driver typo fix, net updates
        - Jeff Garzik: network driver updates (dl2k, yellowfin and tulip)
         - Neil Brown: knfsd cleanups and fixes
        - Ben LaHaise: zap_page_range merge from -ac
     • v2.4.9.10 -> v2.4.9.11 · a880f45a
      Linus Torvalds authored
        - Neil Brown: md cleanups/fixes
        - Andrew Morton: console locking merge
         - Andrea Arcangeli: major VM merge
     • v2.4.9.3 -> v2.4.9.4 · 991b3ae8
      Linus Torvalds authored
        - Hugh Dickins: swapoff cleanups and speedups
        - Matthew Dharm: USB storage update
        - Keith Owens: Makefile fixes
        - Tom Rini: MPC8xx build fix
        - Nikita Danilov: reiserfs update
        - Jakub Jelinek: ELF loader fix for ET_DYN
        - Andrew Morton: reparent_to_init() for kernel threads
        - Christoph Hellwig: VxFS and SysV updates, vfs_permission fix
     • v2.4.6.7 -> v2.4.6.8 · fff10634
      Linus Torvalds authored
        - Chris Mason: reiserfs update
        - Paul Mackerras: PPC updates (softirq)
        - Kai Germaschewski: ISDN updates
        - various: workaround for cpuid inline asm problem with egcs-2.91.66
     • v2.4.6.3 -> v2.4.6.4 · ccb6dd87
      Linus Torvalds authored
        - David Miller: sparc and networking updates
        - Al Viro: SysV FS add_link off-by-two bogosity.
        - Jeff Garzik: merge D-Link DL2k GigE driver, other network driver cleanups
        - Kai Germaschewski: ISDN update
        - Alan Cox: more merging (MPT fusion core)
        - Johannes Erdfelt: USB updates
        - Stas Sergeev: make sure we return out of vm86 mode when interrupts
         get re-enabled
        - Rusty Russell: netfilter fixes for ipt_unclean and ip_queue
        - me: initialize page->age when adding it to the swap cache
        - Paul Mackerras: PPC updates
        - some subtle fs/buffer.c race conditions (Andrew Morton, me)
     • v2.4.5.1 -> v2.4.5.2 · 4fdbe71c
      Linus Torvalds authored
        - Takanori Kawano: brlock indexing bugfix
        - Ingo Molnar, Jeff Garzik: softirq updates and fixes
        - Al Viro: rampage of superblock cleanups.
        - Jean Tourrilhes: Orinoco driver update v6, IrNET update
        - Trond Myklebust: NFS brown-paper-bag thing
        - Tim Waugh: parport update
        - David Miller: networking and sparc updates
        - Jes Sorensen: m68k update.
        - Ben Fennema: UDF update
        - Geert Uytterhoeven: fbdev logo updates
        - Willem Riede: osst driver updates
        - Paul Mackerras: PPC update
        - Marcelo Tosatti: unlazy swap cache
        - Mikulas Patocka: hpfs update