1. 30 Oct, 2005 40 commits
    • [PATCH] mm: ptd_alloc take ptlock · c74df32c
      Hugh Dickins authored
      Second step in pushing down the page_table_lock.  Remove the temporary
      bridging hack from __pud_alloc, __pmd_alloc, __pte_alloc: expect callers not
      to hold page_table_lock, whether it's on init_mm or a user mm; take
      page_table_lock internally to check if a racing task already allocated.
      
      Convert their callers from common code.  But avoid coming back to change them
      again later: instead of moving the spin_lock(&mm->page_table_lock) down,
      switch over to new macros pte_alloc_map_lock and pte_unmap_unlock, which
      encapsulate the mapping+locking and unlocking+unmapping together, and in the
      end may use alternatives to the mm page_table_lock itself.
      
      These callers all hold mmap_sem (some exclusively, some not), so at no level
      can a page table be whipped away from beneath them; and pte_alloc uses the
      "atomic" pmd_present to test whether it needs to allocate.  It appears that on
      all arches we can safely descend without page_table_lock.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
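The locking pattern described in this commit (callers no longer hold the lock; the allocator takes it internally and checks whether a racing task already allocated) can be sketched in userspace C. This is only an illustration under stated assumptions: `toy_mm`, `toy_lock`, `toy_unlock` and `toy_table_alloc` are hypothetical names, and the "lock" is a plain flag rather than a real spinlock.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins: a one-entry "page directory" guarded by a toy lock. */
struct toy_mm {
	int lock_held;    /* models mm->page_table_lock (a flag, not a real lock) */
	int *table;       /* models a directory entry; NULL means "not present" */
	int allocations;  /* how many tables were actually installed */
};

static void toy_lock(struct toy_mm *mm)
{
	assert(!mm->lock_held);
	mm->lock_held = 1;
}

static void toy_unlock(struct toy_mm *mm)
{
	assert(mm->lock_held);
	mm->lock_held = 0;
}

/*
 * Called WITHOUT the lock held. Allocates first, then takes the lock only
 * to install the table, discarding ours if somebody else got there first.
 * Returns 0 on success, -1 if the allocation failed.
 */
static int toy_table_alloc(struct toy_mm *mm)
{
	int *fresh = calloc(16, sizeof(*fresh));

	if (!fresh)
		return -1;

	toy_lock(mm);
	if (mm->table) {
		free(fresh);          /* a racing task already populated it */
	} else {
		mm->table = fresh;
		mm->allocations++;
	}
	toy_unlock(mm);
	return 0;
}
```

Calling `toy_table_alloc` twice on the same `toy_mm` leaves the first table in place and frees the second allocation, which is the behaviour the commit text asks of the out-of-line allocators.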
    • [PATCH] mm: ptd_alloc inline and out · 1bb3630e
      Hugh Dickins authored
      It seems odd to me that, whereas pud_alloc and pmd_alloc test inline, only
      calling out-of-line __pud_alloc __pmd_alloc if allocation needed,
      pte_alloc_map and pte_alloc_kernel are entirely out-of-line.  Though it does
      add a little to kernel size, change them to macros testing inline, calling
      __pte_alloc or __pte_alloc_kernel to allocate out-of-line.  Mark none of them
      as fastcalls, leave that to CONFIG_REGPARM or not.
      
      It also seems more natural for the out-of-line functions to leave the offset
      calculation and map to the inline, which has to do it anyway for the common
      case.  At least mremap move wants __pte_alloc without _map.
      
      Macros rather than inline functions, certainly to avoid the header file issues
      which arise from CONFIG_HIGHPTE needing kmap_types.h, but also in case any
      architectures I haven't built would have other such problems.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
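A minimal sketch of the "test inline, allocate out-of-line" split described above, again in userspace C. The names `toy_dir`, `toy_alloc_slow` and `toy_alloc_map` are hypothetical; the macro only mirrors the general shape (presence test and offset calculation inline, allocation in a separate function) and is not the kernel's definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_dir {
	int *table;      /* NULL until a table has been allocated */
	int slow_calls;  /* counts trips through the out-of-line path */
};

/* Out-of-line slow path: allocation only. Returns nonzero on failure. */
static int toy_alloc_slow(struct toy_dir *d)
{
	assert(d != NULL);
	d->slow_calls++;
	if (!d->table)
		d->table = calloc(16, sizeof(int));
	return d->table == NULL;
}

/*
 * Inline fast path: test for presence, call out of line only when the
 * table is missing, then do the offset calculation here, since the
 * common (already present) case needs it anyway.
 */
#define toy_alloc_map(d, idx) \
	((!(d)->table && toy_alloc_slow(d)) ? NULL : &(d)->table[(idx)])
```

After the first call populates the table, later calls never reach `toy_alloc_slow`; a caller that wants allocation without the mapping step (as the commit says mremap move does) could call the slow function directly.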
    • [PATCH] mm: init_mm without ptlock · 872fec16
      Hugh Dickins authored
      First step in pushing down the page_table_lock.  init_mm.page_table_lock has
      been used throughout the architectures (usually for ioremap): not to serialize
      kernel address space allocation (that's usually vmlist_lock), but because
      pud_alloc,pmd_alloc,pte_alloc_kernel expect caller holds it.
      
      Reverse that: don't lock or unlock init_mm.page_table_lock in any of the
      architectures; instead rely on pud_alloc,pmd_alloc,pte_alloc_kernel to take
      and drop it when allocating a new one, to check lest a racing task already
      did.  Similarly no page_table_lock in vmalloc's map_vm_area.
      
      Some temporary ugliness in __pud_alloc and __pmd_alloc: since they also handle
      user mms, which are converted only by a later patch, for now they have to lock
      differently according to whether or not it's init_mm.
      
      If sources get muddled, there's a danger that an arch source taking
      init_mm.page_table_lock will be mixed with common source also taking it (or
      neither take it).  So break the rules and make another change, which should
      break the build for such a mismatch: remove the redundant mm arg from
      pte_alloc_kernel (ppc64 scrapped its distinct ioremap_mm in 2.6.13).
      
      Exceptions: arm26 used pte_alloc_kernel on user mm, now pte_alloc_map; ia64
      used pte_alloc_map on init_mm, now pte_alloc_kernel; parisc had bad args to
      pmd_alloc and pte_alloc_kernel in unused USE_HPPA_IOREMAP code; ppc64
      map_io_page forgot to unlock on failure; ppc mmu_mapin_ram and ppc64 im_free
      took page_table_lock for no good reason.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: ia64 use expand_upwards · 46dea3d0
      Hugh Dickins authored
      ia64 has expand_backing_store function for growing its Register Backing Store
      vma upwards.  But more complete code for this purpose is found in the
      CONFIG_STACK_GROWSUP part of mm/mmap.c.  Uglify its #ifdefs further to provide
      expand_upwards for ia64 as well as expand_stack for parisc.
      
      The Register Backing Store vma should be marked VM_ACCOUNT.  Implement the
      intention of growing it only a page at a time, instead of passing an address
      outside of the vma to handle_mm_fault, with unknown consequences.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: mm_struct hiwaters moved · f449952b
      Hugh Dickins authored
      Slight and timid rearrangement of mm_struct: hiwater_rss and hiwater_vm were
      tacked on the end, but it seems better to keep them near _file_rss, _anon_rss
      and total_vm, in the same cacheline on those arches verified.
      
      There are likely to be more profitable rearrangements, but less obvious (is it
      good or bad that saved_auxv[AT_VECTOR_SIZE] isolates cpu_vm_mask and context
      from many others?), needing serious instrumentation.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: update_hiwaters just in time · 365e9c87
      Hugh Dickins authored
      update_mem_hiwater has attracted various criticisms, in particular from those
      concerned with mm scalability.  Originally it was called whenever rss or
      total_vm got raised.  Then many of those callsites were replaced by a timer
      tick call from account_system_time.  Now Frank van Maarseveen reports that to
      be found inadequate.  How about this?  Works for Frank.
      
      Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros
      update_hiwater_rss and update_hiwater_vm.  Don't attempt to keep
      mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually
      by 1): those are hot paths.  Do the opposite, update only when about to lower
      rss (usually by many), or just before final accounting in do_exit.  Handle
      mm->hiwater_vm in the same way, though it's much less of an issue.  Demand
      that whoever collects these hiwater statistics do the work of taking the
      maximum with rss or total_vm.
      
      And there has been no collector of these hiwater statistics in the tree.  The
      new convention needs an example, so match Frank's usage by adding a VmPeak
      line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS
      (High-Water-Mark or High-Water-Memory).
      
      There was a particular anomaly during mremap move, that hiwater_vm might be
      captured too high.  A fleeting such anomaly remains, but it's quickly
      corrected now, whereas before it would stick.
      
      What locking?  None: if the app is racy then these statistics will be racy,
      it's not worth any overhead to make them exact.  But whenever it suits,
      hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under
      page_table_lock (for now) or with preemption disabled (later on): without
      going to any trouble, minimize the time between reading current values and
      updating, to minimize those occasions when a racing thread bumps a count up
      and back down in between.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
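The "just in time" convention above can be illustrated with a small userspace sketch: the hot path that raises rss never touches the high-water mark, the mark is captured only just before rss is lowered, and whoever reports the statistic takes the maximum with the current value. All `toy_` names are hypothetical and the struct is a simplified stand-in, not the kernel's mm_struct.

```c
#include <assert.h>

struct toy_mm {
	unsigned long rss;          /* current resident pages */
	unsigned long hiwater_rss;  /* peak captured so far (may lag behind rss) */
};

/* Capture the peak; meant to be called only when rss is about to drop. */
#define toy_update_hiwater_rss(mm) do {             \
	if ((mm)->hiwater_rss < (mm)->rss)          \
		(mm)->hiwater_rss = (mm)->rss;      \
} while (0)

/* Hot path: raise rss by one page, no high-water bookkeeping at all. */
static void toy_fault_in(struct toy_mm *mm)
{
	mm->rss++;
}

/* Cold path: record the peak just before lowering rss by n pages. */
static void toy_zap(struct toy_mm *mm, unsigned long n)
{
	assert(n <= mm->rss);
	toy_update_hiwater_rss(mm);
	mm->rss -= n;
}

/* The collector does the work of taking the maximum with current rss. */
static unsigned long toy_vm_hwm(const struct toy_mm *mm)
{
	return mm->hiwater_rss > mm->rss ? mm->hiwater_rss : mm->rss;
}
```

Note that `hiwater_rss` on its own can be stale (even zero) while rss is still climbing, which is why the reporting side has to take the maximum.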
    • [PATCH] mm: zap_pte out of line · 861f2fb8
      Hugh Dickins authored
      There used to be just one call to zap_pte, but it shouldn't be inline now
      there are two.  Check for the common case pte_none before calling, and move
      its rss accounting up into install_page or install_file_pte - which helps the
      next patch.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: do_mremap current mm · d0de32d9
      Hugh Dickins authored
      Cleanup: relieve do_mremap from its surfeit of current->mms.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: do_swap_page race major · 9e9bef07
      Hugh Dickins authored
      Small adjustment: do_swap_page should report its !pte_same race as a major
      fault if it had to read into swap cache, because whatever raced with it will
      have found page already in cache and reported minor fault.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: zap_pte_range dec rss · 86d912f4
      Hugh Dickins authored
      Small adjustment: zap_pte_range decrement its rss counts from 0 then finally
      add, avoiding negations - we don't have or need a sub_mm_rss.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
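As a rough userspace illustration of "decrement from 0 then finally add": local deltas start at zero and only go down, and each is applied with a single addition at the end, so no subtract helper and no negation is needed. The `toy_` names and the three-way page classification are assumptions made for this sketch.

```c
#include <assert.h>

struct toy_mm {
	long file_rss;
	long anon_rss;
};

enum toy_kind { TOY_NONE, TOY_FILE, TOY_ANON };

/* Tear down n entries, accounting for them with negative local deltas. */
static void toy_zap_range(struct toy_mm *mm, const enum toy_kind *ptes, int n)
{
	long file_rss = 0;   /* local deltas: start at 0, only ever decrease */
	long anon_rss = 0;

	assert(n >= 0);
	for (int i = 0; i < n; i++) {
		if (ptes[i] == TOY_FILE)
			file_rss--;
		else if (ptes[i] == TOY_ANON)
			anon_rss--;
	}

	/* One add of a (non-positive) delta per counter. */
	mm->file_rss += file_rss;
	mm->anon_rss += anon_rss;
}
```

Keeping the deltas local also means the shared counters are touched once per range rather than once per page.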
    • [PATCH] mm: copy_one_pte inc rss · 8c103762
      Hugh Dickins authored
      Small adjustment, following Nick's suggestion: it's more straightforward for
      copy_pte_range to let copy_one_pte do the rss incrementation, than use an
      index it passed back.  Saves a #define, and 16 bytes of .text.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] core remove PageReserved · b5810039
      Nick Piggin authored
      Remove PageReserved() calls from core code by tightening VM_RESERVED
      handling in mm/ to cover PageReserved functionality.
      
      PageReserved special casing is removed from get_page and put_page.
      
      All setting and clearing of PageReserved is retained, and it is now flagged
      in the page_alloc checks to help ensure we don't introduce any refcount
      based freeing of Reserved pages.
      
      MAP_PRIVATE, PROT_WRITE of VM_RESERVED regions is tentatively being
      deprecated.  We never completely handled it correctly anyway, and it can be
      reintroduced in future if required (Hugh has a proof of concept).
      
      Once PageReserved() calls are removed from kernel/power/swsusp.c, and all
      arch/ and driver code, the Set and Clear calls, and the PG_reserved bit can
      be trivially removed.
      
      Last real user of PageReserved is swsusp, which uses PageReserved to
      determine whether a struct page points to valid memory or not.  This still
      needs to be addressed (a generic page_is_ram() should work).
      
      A last caveat: the ZERO_PAGE is now refcounted and managed with rmap (and
      thus mapcounted and count towards shared rss).  These writes to the struct
      page could cause excessive cacheline bouncing on big systems.  There are a
      number of ways this could be addressed if it is an issue.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      
      Refcount bug fix for filemap_xip.c
      Signed-off-by: Carsten Otte <cotte@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: m68k kill stram swap · f9c98d02
      Hugh Dickins authored
      Please, please now delete the Atari CONFIG_STRAM_SWAP code.  It may be
      excellent and ingenious code, but its reference to swap_vfsmnt betrays that it
      hasn't been built since 2.5.1 (four years old come December), it's delving
      deep into matters which are the preserve of core mm code, its only purpose is
      to give the more conscientious mm guys an anxiety attack from time to time;
      yet we keep on breaking it more and more.
      
      If you want to use RAM for swap, then if the MTD driver does not already
      provide just what you need, I'm sure David could be persuaded to add the
      extra.  But you'd also like to be able to allocate extents of that swap for
      other use: we can give you a core interface for that if you need.  But unbuilt
      for four years suggests to me that there's no need at all.
      
      I cannot swear the patch below won't break your build, but believe so.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: sh64 hugetlbpage.c · 147efea8
      Hugh Dickins authored
      The sh64 hugetlbpage.c seems to be erroneous, left over from a bygone age,
      clashing with the common hugetlb.c.  Replace it by a copy of the sh
      hugetlbpage.c.  Except, delete that mk_pte_huge macro neither uses.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Acked-by: Paul Mundt <lethal@linux-sh.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: dup_mmap down new mmap_sem · 7ee78232
      Hugh Dickins authored
      One anomaly remains from when Andrea rationalized the responsibilities of
      mmap_sem and page_table_lock: in dup_mmap we add vmas to the child holding its
      page_table_lock, but not the mmap_sem which normally guards the vma list and
      rbtree.  Which could be an issue for unuse_mm: though since it just walks down
      the list (today with page_table_lock, tomorrow not), it's probably okay.  Will
      need a memory barrier?  Oh, keep it simple, Nick and I agreed, no harm in
      taking child's mmap_sem here.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: dup_mmap use oldmm more · fd3e42fc
      Hugh Dickins authored
      Use the parent's oldmm throughout dup_mmap, instead of perversely going back
      to current->mm.  (Can you hear the sigh of relief from those mpnts?  Usually I
      squash them, but not today.)
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: batch updating mm_counters · ae859762
      Hugh Dickins authored
      tlb_finish_mmu used to batch zap_pte_range's update of mm rss, which may be
      worthwhile if the mm is contended, and would reduce atomic operations if the
      counts were atomic.  Let zap_pte_range now batch its updates to file_rss and
      anon_rss, per page-table in case we drop the lock outside; and copy_pte_range
      batch them too.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: rss = file_rss + anon_rss · 4294621f
      Hugh Dickins authored
      I was lazy when we added anon_rss, and chose to change as few places as
      possible.  So currently each anonymous page has to be counted twice, in rss
      and in anon_rss.  Which won't be so good if those are atomic counts in some
      configurations.
      
      Change that around: keep file_rss and anon_rss separately, and add them
      together (with get_mm_rss macro) when the total is needed - reading two
      atomics is much cheaper than updating two atomics.  And update anon_rss
      upfront, typically in memory.c, not tucked away in page_add_anon_rmap.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
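The accounting change can be pictured with a tiny sketch: each page is counted in exactly one of two counters, and the total is derived on demand. The names below are hypothetical (`toy_get_mm_rss` merely echoes the get_mm_rss macro the commit mentions), and plain integers stand in for whatever counter type a configuration uses.

```c
#include <assert.h>

struct toy_mm {
	unsigned long file_rss;  /* resident file-backed pages */
	unsigned long anon_rss;  /* resident anonymous pages */
};

/* The total is computed when needed: two reads instead of two updates. */
#define toy_get_mm_rss(mm) ((mm)->file_rss + (mm)->anon_rss)

/* Mapping an anonymous page now updates one counter, not two. */
static void toy_map_anon(struct toy_mm *mm)
{
	assert(mm != 0);
	mm->anon_rss++;
}

/* Mapping a file page likewise touches only its own counter. */
static void toy_map_file(struct toy_mm *mm)
{
	assert(mm != 0);
	mm->file_rss++;
}
```

If the counters were atomic, every update would be an atomic read-modify-write, so halving the updates on the fault path is where the saving would come from.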
    • [PATCH] mm: mm_init set_mm_counters · 404351e6
      Hugh Dickins authored
      How is anon_rss initialized?  In dup_mmap, and by mm_alloc's memset; but
      that's not so good if an mm_counter_t is a special type.  And how is rss
      initialized?  By set_mm_counter, all over the place.  Come on, we just need to
      initialize them both at once by set_mm_counter in mm_init (which follows the
      memcpy when forking).
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: tlb_finish_mmu forget rss · fc2acab3
      Hugh Dickins authored
      zap_pte_range has been counting the pages it frees in tlb->freed, then
      tlb_finish_mmu has used that to update the mm's rss.  That got stranger when I
      added anon_rss, yet updated it by a different route; and stranger when rss and
      anon_rss became mm_counters with special access macros.  And it would no
      longer be viable if we're relying on page_table_lock to stabilize the
      mm_counter, but calling tlb_finish_mmu outside that lock.
      
      Remove the mmu_gather's freed field, let tlb_finish_mmu stick to its own
      business, just decrement the rss mm_counter in zap_pte_range (yes, there was
      some point to batching the update, and a subsequent patch restores that).  And
      forget the anal paranoia of first reading the counter to avoid going negative
      - if rss does go negative, just fix that bug.
      
      Remove the mmu_gather's flushes and avoided_flushes from arm and arm26: no use
      was being made of them.  But arm26 alone was actually using the freed, in the
      way some others use need_flush: give it a need_flush.  arm26 seems to prefer
      spaces to tabs here: respect that.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: tlb_is_full_mm was obscure · 4d6ddfa9
      Hugh Dickins authored
      tlb_is_full_mm?  What does that mean?  The TLB is full?  No, it means that the
      mm's last user has gone and the whole mm is being torn down.  And it's an
      inline function because sparc64 uses a different (slightly better)
      "tlb_frozen" name for the flag others call "fullmm".
      
      And now the ptep_get_and_clear_full macro used in zap_pte_range refers
      directly to tlb->fullmm, which would be wrong for sparc64.  Rather than
      correct that, I'd prefer to scrap tlb_is_full_mm altogether, and change
      sparc64 to just use the same poor name as everyone else - is that okay?
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: tlb_gather_mmu get_cpu_var · 15a23ffa
      Hugh Dickins authored
      tlb_gather_mmu dates from before kernel preemption was allowed, and uses
      smp_processor_id or __get_cpu_var to find its per-cpu mmu_gather.  That works
      because it's currently only called after getting page_table_lock, which is not
      dropped until after the matching tlb_finish_mmu.  But don't rely on that, it
      will soon change: now disable preemption internally by proper get_cpu_var in
      tlb_gather_mmu, put_cpu_var in tlb_finish_mmu.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: move_page_tables by extents · 7be7a546
      Hugh Dickins authored
      Speeding up mremap's moving of ptes has never been a priority, but the locking
      will get more complicated shortly, and is already too baroque.
      
      Scrap the current one-by-one moving, do an extent at a time: curtailed by end
      of src and dst pmds (have to use PMD_SIZE: the way pmd_addr_end gets elided
      doesn't match this usage), and by latency considerations.
      
      One nice property of the old method is lost: it never allocated a page table
      unless absolutely necessary, so you could free empty page tables by mremapping
      to and fro.  Whereas this way, it allocates a dst table wherever there was a
      src table.  I keep diving in to reinstate the old behaviour, then come out
      preferring not to clutter how it now is.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
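The extent calculation described above (curtailed by the end of the source pmd, the end of the destination pmd, and a latency limit) might look roughly like this in isolation. The constants are assumptions chosen for the sketch (4 KiB pages, 2 MiB pmds, a 64-page latency cap), and `toy_next_extent` is a hypothetical helper, not a function from the patch.

```c
#include <assert.h>

#define TOY_PAGE_SIZE     4096UL
#define TOY_PMD_SIZE      (512UL * TOY_PAGE_SIZE)  /* assumed: 2 MiB per pmd */
#define TOY_PMD_MASK      (~(TOY_PMD_SIZE - 1))
#define TOY_LATENCY_LIMIT (64UL * TOY_PAGE_SIZE)   /* assumed cap per step */

/*
 * How many bytes to move in one step, starting at old_addr (source) and
 * new_addr (destination), with the source range ending at old_end.
 */
static unsigned long toy_next_extent(unsigned long old_addr,
				     unsigned long old_end,
				     unsigned long new_addr)
{
	unsigned long next, extent;

	assert(old_addr < old_end);

	/* Stop at the end of the source pmd or of the range, whichever is first. */
	next = (old_addr + TOY_PMD_SIZE) & TOY_PMD_MASK;
	if (next == 0 || next > old_end)   /* next == 0: wrapped at top of memory */
		next = old_end;
	extent = next - old_addr;

	/* Stop at the end of the destination pmd as well. */
	next = (new_addr + TOY_PMD_SIZE) & TOY_PMD_MASK;
	if (extent > next - new_addr)
		extent = next - new_addr;

	/* Bound the work done per step for latency's sake. */
	if (extent > TOY_LATENCY_LIMIT)
		extent = TOY_LATENCY_LIMIT;

	return extent;
}
```

A caller would loop, advancing both addresses by the returned extent until the source range is exhausted, so each step stays within a single source page table and a single destination page table.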
    • [PATCH] mm: page fault handlers tidyup · 65500d23
      Hugh Dickins authored
      Impose a little more consistency on the page fault handlers do_wp_page,
      do_swap_page, do_anonymous_page, do_no_page, do_file_page: why not pass their
      arguments in the same order, called the same names?
      
      break_cow is all very well, but what it did was inlined elsewhere: easier to
      compare if it's brought back into do_wp_page.
      
      do_file_page's fallback to do_no_page dates from a time when we were testing
      pte_file by using it wherever possible: currently it's peculiar to nonlinear
      vmas, so just check that.  BUG_ON if not?  Better not, it's probably page
      table corruption, so just show the pte: hmm, there's a pte_ERROR macro, let's
      use that for do_wp_page's invalid pfn too.
      
      Hah!  Someone in the ppc64 world noticed pte_ERROR was unused so removed it:
      restored (and say "pud" not "pmd" in its pud_ERROR).
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: exit_mmap need not reset · 7c1fd6b9
      Hugh Dickins authored
      exit_mmap resets various mm_struct fields, but the mm is well on its way out,
      and none of those fields matter by this point.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: unlink_file_vma, remove_vma · a8fb5618
      Hugh Dickins authored
      Divide remove_vm_struct into two parts: first anon_vma_unlink plus
      unlink_file_vma, to unlink the vma from the list and tree by which rmap or
      vmtruncate might find it; then remove_vma to close, fput and free.
      
      The intention here is to do the anon_vma_unlink and unlink_file_vma earlier,
      in free_pgtables before freeing any page tables: so we can be sure that any
      page tables traversed by rmap and vmtruncate are stable (and other, ordinary
      cases are stabilized by holding mmap_sem).
      
      This will be crucial to traversing pgd,pud,pmd without page_table_lock.  But
      testing the split-out patch showed that lifting the page_table_lock is
      symbiotically necessary to make this change - the lock ordering is wrong to
      move those unlinks into free_pgtables while it's under ptlock.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: remove_vma_list consolidation · 2c0b3814
      Hugh Dickins authored
      unmap_vma doesn't amount to much, let's put it inside unmap_vma_list.  Except
      it doesn't unmap anything, unmap_region just did the unmapping: rename it to
      remove_vma_list.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: vm_stat_account unshackled · ab50b8ed
      Hugh Dickins authored
      The original vm_stat_account has fallen into disuse, with only one user, and
      only one user of vm_stat_unaccount.  It's easier to keep track if we convert
      them all to __vm_stat_account, then free it from its __shackles.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: anon is already wrprotected · 72866f6f
      Hugh Dickins authored
      do_anonymous_page's pte_wrprotect causes some confusion: in such a case,
      vm_page_prot must already be forcing COW, so must omit write permission, and
      so the pte_wrprotect is redundant.  Replace it by a comment to that effect,
      and reword the comment on unuse_pte which also caused confusion.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: zap_pte_range dont dirty anon · 6237bcd9
      Hugh Dickins authored
      zap_pte_range already avoids wasting time to mark_page_accessed on anon pages:
      it can also skip anon set_page_dirty - the page only needs to be marked dirty
      if shared with another mm, but that will say pte_dirty too.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: msync_pte_range progress · 0c942a45
      Hugh Dickins authored
      Use latency breaking in msync_pte_range like that in copy_pte_range, instead
      of the ugly CONFIG_PREEMPT filemap_msync alternatives.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] mm: copy_pte_range progress fix · e040f218
      Hugh Dickins authored
      My latency breaking in copy_pte_range didn't work as intended: instead of
      checking at regularish intervals, after the first interval it checked every
      time around the loop, too impatient to be preempted.  Fix that.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
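The intended behaviour (check at regular intervals rather than on every iteration once the first interval has passed) can be sketched with a progress counter that is reset each time the check fires. Resetting the counter is one plausible way to get that behaviour; whether it matches the actual patch is an assumption, as are the interval of 32 and the `toy_` names.

```c
#include <assert.h>

#define TOY_CHECK_INTERVAL 32   /* assumed interval between checks */

/*
 * Walk nptes entries and return how many times the latency check ran.
 * In a real loop the check would test whether to drop the lock and
 * reschedule; here it is only counted.
 */
static int toy_copy_range(int nptes)
{
	int progress = 0;
	int checks = 0;

	assert(nptes >= 0);
	for (int i = 0; i < nptes; i++) {
		if (progress >= TOY_CHECK_INTERVAL) {
			progress = 0;  /* without this, every later pass would check */
			checks++;
		}
		progress++;            /* stands in for work done on one entry */
	}
	return checks;
}
```

With the reset in place, 100 entries trigger three checks; without it, every iteration after the 32nd would pay for one.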
    • [PATCH] slab: add additional debugging to detect slabs from the wrong node · 09ad4bbc
      Christoph Lameter authored
      This patch adds some stack dumps if the slab logic is processing slab
      blocks from the wrong node.  This is necessary in order to detect
      situations as encountered by Petr.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] shrink_list(): skip anon pages if not may_swap · c340010e
      Lee Schermerhorn authored
      Martin Hicks' page cache reclaim patch added the 'may_swap' flag to the
      scan_control struct; and modified shrink_list() not to add anon pages to
      the swap cache if may_swap is not asserted.
      
      Ref:  http://marc.theaimsgroup.com/?l=linux-mm&m=111461480725322&w=4
      
      However, further down, if the page is mapped, shrink_list() calls
      try_to_unmap() which will call try_to_unmap_one() via try_to_unmap_anon().
      try_to_unmap_one() will BUG_ON() an anon page that is NOT in the swap
      cache.  Martin says he never encountered this path in his testing, but
      agrees that it might happen.
      
      This patch modifies shrink_list() to skip anon pages that are not already
      in the swap cache when !may_swap, rather than just not adding them to the
      cache.
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
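The decision this patch describes can be reduced to a small predicate: an anonymous page that is not yet in the swap cache is skipped outright when swapping is not allowed, instead of merely not being added to the cache and then falling through to the unmap step. The types and names below are hypothetical and model only that one decision, not the real shrink_list().

```c
#include <assert.h>
#include <stdbool.h>

struct toy_page {
	bool anon;           /* anonymous (not file-backed) page */
	bool in_swap_cache;  /* already has a swap cache entry */
	bool mapped;         /* still mapped into some address space */
};

enum toy_action {
	TOY_KEEP,         /* leave the page alone this time */
	TOY_ADD_TO_SWAP,  /* allocate swap and add to the swap cache first */
	TOY_PROCEED       /* carry on to the unmap / writeback steps */
};

/* Decide what reclaim should do with a page before trying to unmap it. */
static enum toy_action toy_anon_step(const struct toy_page *page, bool may_swap)
{
	assert(page != 0);
	if (page->anon && !page->in_swap_cache) {
		if (!may_swap)
			return TOY_KEEP;  /* never reaches unmap outside the swap cache */
		return TOY_ADD_TO_SWAP;
	}
	return TOY_PROCEED;
}
```

Under this rule an anonymous page can only reach the unmap step once it is in the swap cache, which is the condition the BUG_ON mentioned above depends on.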
    • [PATCH] mm/msync.c cleanup · b57b98d1
      OGAWA Hirofumi authored
      This is not actually a problem, but the name sync_page_range() is already
      used for a function exported to filesystems.
      
      The msync_xxx naming is more readable, at least to me.
      Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Acked-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Remove near all BUGs in mm/mempolicy.c · 662f3a0b
      Andi Kleen authored
      Most of them can never be triggered and were only for development.
      Signed-off-by: "Andi Kleen" <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Convert mempolicies to nodemask_t · dfcd3c0d
      Andi Kleen authored
      The NUMA policy code predated nodemask_t so it used open coded bitmaps.
      Convert everything to nodemask_t.  Big patch, but shouldn't have any actual
      behaviour changes (except I removed one unnecessary check against
      node_online_map and one unnecessary BUG_ON)
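      
      To illustrate the difference between an open-coded bitmap and a typed
      node mask, here is a minimal userspace sketch.  It is not the kernel's
      nodemask_t implementation: nodemask_model_t, MAX_NODES_MODEL and the
      model_* helpers are hypothetical names made up for this example.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical upper bound on node count for this model. */
#define MAX_NODES_MODEL 64U
#define BITS_PER_ULONG ((unsigned int)(sizeof(unsigned long) * CHAR_BIT))
#define NODE_WORDS ((MAX_NODES_MODEL + BITS_PER_ULONG - 1) / BITS_PER_ULONG)

/* Wrapping the words in a struct gives the mask a distinct type. */
typedef struct {
    unsigned long bits[NODE_WORDS];
} nodemask_model_t;

/* Mark one node as a member of the mask. */
static inline void model_node_set(unsigned int node, nodemask_model_t *mask)
{
    mask->bits[node / BITS_PER_ULONG] |= 1UL << (node % BITS_PER_ULONG);
}

/* Test whether a node is a member of the mask. */
static inline bool model_node_isset(unsigned int node,
                                    const nodemask_model_t *mask)
{
    return (mask->bits[node / BITS_PER_ULONG] >> (node % BITS_PER_ULONG)) & 1UL;
}

/* Count how many nodes are set in the mask. */
static inline unsigned int model_nodes_weight(const nodemask_model_t *mask)
{
    unsigned int count = 0;

    for (unsigned int node = 0; node < MAX_NODES_MODEL; node++)
        if (model_node_isset(node, mask))
            count++;
    return count;
}
```

      Callers then pass a nodemask_model_t around instead of a bare array of
      unsigned long, so the word and bit arithmetic lives in one place.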
      Signed-off-by: "Andi Kleen" <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      dfcd3c0d
    • Seth, Rohit's avatar
      [PATCH] mm: set per-cpu-pages lower threshold to zero · e46a5e28
      Seth, Rohit authored
      Set the low water mark for hot pages in pcp to zero.
      
      (akpm: for the life of me I cannot remember why we created pcp->low.  Neither
      can Martin and the changelog is silent.  Maybe it was just a brainfart, but I
      have this feeling that there was a reason.  If not, we should remove the
      fields completely.  We'll see.)
      Signed-off-by: Rohit Seth <rohit.seth@intel.com>
      Cc: <linux-mm@kvack.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      e46a5e28
    • Seth, Rohit's avatar
      [PATCH] mm: page_alloc: increase size of per-cpu-pages · ba56e91c
      Seth, Rohit authored
      Increase the page allocator's per-cpu magazines from 1/4MB to 1/2MB.
      
      Over 100+ runs of a workload, the difference in the mean is about 2%.  The
      best results for both sizes are almost the same.  However, the maximum
      variation in results with 1/2MB is only 2.2%, whereas with 1/4MB it is 12%.
      Signed-off-by: Rohit Seth <rohit.seth@intel.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      ba56e91c
    • Rik Van Riel's avatar
      [PATCH] swaptoken tuning · fcdae29a
      Rik Van Riel authored
      It turns out that the original swap token implementation, by Song Jiang, only
      enforced the swap token while the task holding the token is handling a page
      fault.  This patch approximates that, without adding an additional flag to the
      mm_struct, by checking whether the mm->mmap_sem is held for reading, like the
      page fault code does.
      
      This patch has the effect of automatically, and gradually, disabling the
      enforcement of the swap token when there is little or no paging going on, and
      "turning up" the intensity of the swap token code the more the task holding
      the token is thrashing.
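      
      The check described above can be modelled in userspace roughly as
      follows.  This is a sketch under stated assumptions, not the patch
      itself: struct mm_model, swap_token_holder and token_enforced() are
      hypothetical names, and a plain flag stands in for "mmap_sem is held
      for reading".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an mm; only the state this check needs. */
struct mm_model {
    bool mmap_sem_held; /* models mmap_sem being held, as during a fault */
};

/* Hypothetical global: the mm that currently holds the swap token. */
static const struct mm_model *swap_token_holder = NULL;

/*
 * The token protects an mm only while that mm holds the token AND appears
 * to be handling a page fault.  A holder that rarely faults is therefore
 * rarely protected, while a thrashing holder is protected most of the time.
 */
bool token_enforced(const struct mm_model *mm)
{
    return mm == swap_token_holder && mm->mmap_sem_held;
}
```

      Because the flag is only set while the holder is faulting, enforcement
      in this model scales with how much paging the holder is doing, without
      any extra field recording "in page fault".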
      
      Thanks to Song Jiang for pointing out this aspect of the token based thrashing
      control concept.
      
      The new code shows a slight degradation over the old swap token code, but
      still a big win over running without the swap token.
      
      2.6.12+ swap token disabled
      
      $ for i in `seq 10` ; do /usr/bin/time ./qsbench -n 30000000 -p 3 ; done
      101.74user 23.13system 8:26.91elapsed 24%CPU (0avgtext+0avgdata 0maxresident)k
      0inputs+0outputs (38597major+430315minor)pagefaults 0swaps
      101.98user 24.91system 8:03.06elapsed 26%CPU (0avgtext+0avgdata 0maxresident)k
      0inputs+0outputs (33939major+430457minor)pagefaults 0swaps
      101.93user 22.12system 7:34.90elapsed 27%CPU (0avgtext+0avgdata 0maxresident)k
      0inputs+0outputs (33166major+421267minor)pagefaults 0swaps
      101.82user 22.38system 8:31.40elapsed 24%CPU (0avgtext+0avgdata 0maxresident)k
      0inputs+0outputs (39338major+433262minor)pagefaults 0swaps
      
      2.6.12+ swap token enabled, timeout 300 seconds
      
      $ for i in `seq 4` ; do /usr/bin/time ./qsbench -n 30000000 -p 3 ; done
      102.58user 16.08system 3:41.44elapsed 53%CPU (0avgtext+0avgdata 0maxresident)k
      0inputs+0outputs (19707major+285786minor)pagefaults 0swaps
      102.07user 19.56system 4:00.64elapsed 50%CPU (0avgtext+0avgdata 0maxresident)k
      0inputs+0outputs (19012major+299259minor)pagefaults 0swaps
      102.64user 18.25system 4:07.31elapsed 48%CPU (0avgtext+0avgdata 0maxresident)k
      0inputs+0outputs (21990major+304831minor)pagefaults 0swaps
      101.39user 19.41system 5:15.81elapsed 38%CPU (0avgtext+0avgdata 0maxresident)k
      0inputs+0outputs (24850major+323321minor)pagefaults 0swaps
      
      2.6.12+ with new swap token code, timeout 300 seconds
      
      $ for i in `seq 4` ; do /usr/bin/time ./qsbench -n 30000000 -p 3 ; done
      101.87user 24.66system 5:53.20elapsed 35%CPU (0avgtext+0avgdata 0maxresident)k
      0inputs+0outputs (26848major+363497minor)pagefaults 0swaps
      102.83user 19.95system 4:17.25elapsed 47%CPU (0avgtext+0avgdata 0maxresident)k
      0inputs+0outputs (19946major+305722minor)pagefaults 0swaps
      102.09user 19.46system 5:12.57elapsed 38%CPU (0avgtext+0avgdata 0maxresident)k
      0inputs+0outputs (25461major+334994minor)pagefaults 0swaps
      101.67user 20.61system 4:52.97elapsed 41%CPU (0avgtext+0avgdata 0maxresident)k
      0inputs+0outputs (22190major+329508minor)pagefaults 0swaps
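      
      To compare the three configurations, the elapsed times above can be
      averaged.  The helper below is a small sketch for doing so;
      elapsed_to_seconds() and mean_elapsed() are hypothetical names, and the
      arrays simply copy the elapsed fields from the runs listed above.

```c
#include <stddef.h>
#include <stdio.h>

/* Convert a "M:SS.ss" elapsed field to seconds; -1.0 if malformed. */
double elapsed_to_seconds(const char *text)
{
    int minutes;
    double seconds;

    if (sscanf(text, "%d:%lf", &minutes, &seconds) != 2)
        return -1.0;
    return minutes * 60.0 + seconds;
}

/* Arithmetic mean of the elapsed times, in seconds. */
double mean_elapsed(const char *const runs[], size_t count)
{
    double total = 0.0;

    for (size_t i = 0; i < count; i++)
        total += elapsed_to_seconds(runs[i]);
    return count ? total / (double)count : 0.0;
}

/* Elapsed fields copied from the three sets of runs above. */
static const char *const token_disabled[] = {
    "8:26.91", "8:03.06", "7:34.90", "8:31.40"
};
static const char *const token_enabled[] = {
    "3:41.44", "4:00.64", "4:07.31", "5:15.81"
};
static const char *const token_new_code[] = {
    "5:53.20", "4:17.25", "5:12.57", "4:52.97"
};
```

      Calling mean_elapsed() on each array gives means of roughly 489, 256
      and 304 seconds respectively, which matches the description: slower
      than the old swap token code, but well ahead of running without it.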
      Signed-off-by: Rik Van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      fcdae29a