1. 22 Sep, 2009 2 commits
    • Hugh Dickins's avatar
      ksm: add some documentation · 7701c9c0
      Hugh Dickins authored
      
      Add Documentation/vm/ksm.txt: how to use the Kernel Samepage Merging feature
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Acked-by: default avatarIzik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7701c9c0
    • Hugh Dickins's avatar
      ksm: the mm interface to ksm · f8af4da3
      Hugh Dickins authored
      
      This patch presents the mm interface to a dummy version of ksm.c, for
      better scrutiny of that interface: the real ksm.c follows later.
      
      When CONFIG_KSM is not set, madvise(2) reject MADV_MERGEABLE and
      MADV_UNMERGEABLE with EINVAL, since that seems more helpful than
      pretending that they can be serviced.  But when CONFIG_KSM=y, accept them
      even if KSM is not currently running, and even on areas which KSM will not
      touch (e.g.  hugetlb or shared file or special driver mappings).
      
      Like other madvices, report ENOMEM despite success if any area in the
      range is unmapped, and use EAGAIN to report out of memory.
      
      Define vma flag VM_MERGEABLE to identify an area on which KSM may try
      merging pages: leave it to ksm_madvise() to decide whether to set it.
      Define mm flag MMF_VM_MERGEABLE to identify an mm which might contain
      VM_MERGEABLE areas, to minimize callouts when forking or exiting.
      
      Based upon earlier patches by Chris Wright and Izik Eidus.
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: default avatarChris Wright <chrisw@redhat.com>
      Signed-off-by: default avatarIzik Eidus <ieidus@redhat.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f8af4da3
  2. 16 Sep, 2009 2 commits
    • Andi Kleen's avatar
      HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs · cae681fc
      Andi Kleen authored
      
      Useful for some testing scenarios, although specific testing is often
      done better through MADV_POISON
      
      This can be done with the x86 level MCE injector too, but this interface
      allows it to do independently from low level x86 changes.
      
      v2: Add module license (Haicheng Li)
      Signed-off-by: default avatarAndi Kleen <ak@linux.intel.com>
      cae681fc
    • Andi Kleen's avatar
      HWPOISON: The high level memory error handler in the VM v7 · 6a46079c
      Andi Kleen authored
      
      Add the high level memory handler that poisons pages
      that got corrupted by hardware (typically by a two bit flip in a DIMM
      or a cache) on the Linux level. The goal is to prevent everyone
      from accessing these pages in the future.
      
      This done at the VM level by marking a page hwpoisoned
      and doing the appropriate action based on the type of page
      it is.
      
      The code that does this is portable and lives in mm/memory-failure.c
      
      To quote the overview comment:
      
      High level machine check handler. Handles pages reported by the
      hardware as being corrupted usually due to a 2bit ECC memory or cache
      failure.
      
      This focuses on pages detected as corrupted in the background.
      When the current CPU tries to consume corruption the currently
      running process can just be killed directly instead. This implies
      that if the error cannot be handled for some reason it's safe to
      just ignore it because no corruption has been consumed yet. Instead
      when that happens another machine check will happen.
      
      Handles page cache pages in various states. The tricky part
      here is that we can access any page asynchronous to other VM
      users, because memory failures could happen anytime and anywhere,
      possibly violating some of their assumptions. This is why this code
      has to be extremely careful. Generally it tries to use normal locking
      rules, as in get the standard locks, even if that means the
      error handling takes potentially a long time.
      
      Some of the operations here are somewhat inefficient and have non
      linear algorithmic complexity, because the data structures have not
      been optimized for this case. This is in particular the case
      for the mapping from a vma to a process. Since this case is expected
      to be rare we hope we can get away with this.
      
      There are in principle two strategies to kill processes on poison:
      - just unmap the data and wait for an actual reference before
      killing
      - kill as soon as corruption is detected.
      Both have advantages and disadvantages and should be used
      in different situations. Right now both are implemented and can
      be switched with a new sysctl vm.memory_failure_early_kill
      The default is early kill.
      
      The patch does some rmap data structure walking on its own to collect
      processes to kill. This is unusual because normally all rmap data structure
      knowledge is in rmap.c only. I put it here for now to keep
      everything together and rmap knowledge has been seeping out anyways
      
      Includes contributions from Johannes Weiner, Chris Mason, Fengguang Wu,
      Nick Piggin (who did a lot of great work) and others.
      
      Cc: npiggin@suse.de
      Cc: riel@redhat.com
      Signed-off-by: default avatarAndi Kleen <ak@linux.intel.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarHidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      6a46079c
  3. 31 Aug, 2009 1 commit
    • H. Peter Anvin's avatar
      mm: remove !NUMA condition from PAGEFLAGS_EXTENDED condition set · a269cca9
      H. Peter Anvin authored
      
      CONFIG_PAGEFLAGS_EXTENDED disables a trick to conserve pageflags.
      This trick is indended to be enabled when the pressure on page flags
      is very high.
      
      The previous condition was:
      
      -       depends on 64BIT || SPARSEMEM_VMEMMAP || !NUMA || !SPARSEMEM
      
      ... however, the sparsemem code already has a way to crowd out the
      node number from the pageflags, which means that !NUMA actually
      doesn't contribute to hard pageflags exhaustion.
      
      This is required for the new PG_uncached flag to not cause pageflags
      exhaustion on x86_32 + PAE + SPARSEMEM + !NUMA.
      Signed-off-by: default avatarH. Peter Anvin <hpa@zytor.com>
      LKML-Reference: <4A9828F4.4040905@zytor.com>
      Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
      Cc: Suresh Siddha <suresh.siddha@intel.com>
      a269cca9
  4. 17 Aug, 2009 1 commit
    • Eric Paris's avatar
      Security/SELinux: seperate lsm specific mmap_min_addr · 788084ab
      Eric Paris authored
      
      Currently SELinux enforcement of controls on the ability to map low memory
      is determined by the mmap_min_addr tunable.  This patch causes SELinux to
      ignore the tunable and instead use a seperate Kconfig option specific to how
      much space the LSM should protect.
      
      The tunable will now only control the need for CAP_SYS_RAWIO and SELinux
      permissions will always protect the amount of low memory designated by
      CONFIG_LSM_MMAP_MIN_ADDR.
      
      This allows users who need to disable the mmap_min_addr controls (usual reason
      being they run WINE as a non-root user) to do so and still have SELinux
      controls preventing confined domains (like a web server) from being able to
      map some area of low memory.
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      Signed-off-by: default avatarJames Morris <jmorris@namei.org>
      788084ab
  5. 05 Aug, 2009 1 commit
    • Eric Paris's avatar
      Security/SELinux: seperate lsm specific mmap_min_addr · a2551df7
      Eric Paris authored
      
      Currently SELinux enforcement of controls on the ability to map low memory
      is determined by the mmap_min_addr tunable.  This patch causes SELinux to
      ignore the tunable and instead use a seperate Kconfig option specific to how
      much space the LSM should protect.
      
      The tunable will now only control the need for CAP_SYS_RAWIO and SELinux
      permissions will always protect the amount of low memory designated by
      CONFIG_LSM_MMAP_MIN_ADDR.
      
      This allows users who need to disable the mmap_min_addr controls (usual reason
      being they run WINE as a non-root user) to do so and still have SELinux
      controls preventing confined domains (like a web server) from being able to
      map some area of low memory.
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      Signed-off-by: default avatarJames Morris <jmorris@namei.org>
      a2551df7
  6. 17 Jun, 2009 1 commit
  7. 16 Jun, 2009 1 commit
  8. 04 Jun, 2009 1 commit
  9. 06 May, 2009 1 commit
    • David Howells's avatar
      nommu: make the initial mmap allocation excess behaviour Kconfig configurable · fc4d5c29
      David Howells authored
      
      NOMMU mmap() has an option controlled by a sysctl variable that determines
      whether the allocations made by do_mmap_private() should have the excess
      space trimmed off and returned to the allocator.  Make the initial setting
      of this variable a Kconfig configuration option.
      
      The reason there can be excess space is that the allocator only allocates
      in power-of-2 size chunks, but mmap()'s can be made in sizes that aren't a
      power of 2.
      
      There are two alternatives:
      
       (1) Keep the excess as dead space.  The dead space then remains unused for the
           lifetime of the mapping.  Mappings of shared objects such as libc, ld.so
           or busybox's text segment may retain their dead space forever.
      
       (2) Return the excess to the allocator.  This means that the dead space is
           limited to less than a page per mapping, but it means that for a transient
           process, there's more chance of fragmentation as the excess space may be
           reused fairly quickly.
      
      During the boot process, a lot of transient processes are created, and
      this can cause a lot of fragmentation as the pagecache and various slabs
      grow greatly during this time.
      
      By turning off the trimming of excess space during boot and disabling
      batching of frees, Coldfire can manage to boot.
      
      A better way of doing things might be to have /sbin/init turn this option
      off.  By that point libc, ld.so and init - which are all long-duration
      processes - have all been loaded and trimmed.
      Reported-by: default avatarLanttor Guo <lanttor.guo@freescale.com>
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Tested-by: default avatarLanttor Guo <lanttor.guo@freescale.com>
      Cc: Greg Ungerer <gerg@snapgear.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fc4d5c29
  10. 13 Apr, 2009 1 commit
  11. 01 Apr, 2009 2 commits
  12. 06 Jan, 2009 1 commit
  13. 20 Oct, 2008 1 commit
    • Lee Schermerhorn's avatar
      Unevictable LRU Infrastructure · 894bc310
      Lee Schermerhorn authored
      
      When the system contains lots of mlocked or otherwise unevictable pages,
      the pageout code (kswapd) can spend lots of time scanning over these
      pages.  Worse still, the presence of lots of unevictable pages can confuse
      kswapd into thinking that more aggressive pageout modes are required,
      resulting in all kinds of bad behaviour.
      
      Infrastructure to manage pages excluded from reclaim--i.e., hidden from
      vmscan.  Based on a patch by Larry Woodman of Red Hat.  Reworked to
      maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
      them from vmscan.
      
      Kosaki Motohiro added the support for the memory controller unevictable
      lru list.
      
      Pages on the unevictable list have both PG_unevictable and PG_lru set.
      Thus, PG_unevictable is analogous to and mutually exclusive with
      PG_active--it specifies which LRU list the page is on.
      
      The unevictable infrastructure is enabled by a new mm Kconfig option
      [CONFIG_]UNEVICTABLE_LRU.
      
      A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
      not a page may be evictable.  Subsequent patches will add the various
      !evictable tests.  We'll want to keep these tests light-weight for use in
      shrink_active_list() and, possibly, the fault path.
      
      To avoid races between tasks putting pages [back] onto an LRU list and
      tasks that might be moving the page from non-evictable to evictable state,
      the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
      -- tests the "evictability" of a page after placing it on the LRU, before
      dropping the reference.  If the page has become unevictable,
      putback_lru_page() will redo the 'putback', thus moving the page to the
      unevictable list.  This way, we avoid "stranding" evictable pages on the
      unevictable list.
      
      [akpm@linux-foundation.org: fix fallout from out-of-order merge]
      [riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
      [nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
      [kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
      [kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
      [kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
      [kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
      [kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Debugged-by: default avatarBenjamin Kidwell <benjkidwell@yahoo.com>
      Signed-off-by: default avatarDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      894bc310
  14. 16 Oct, 2008 1 commit
  15. 14 Sep, 2008 1 commit
  16. 12 Aug, 2008 1 commit
  17. 28 Jul, 2008 1 commit
    • Andrea Arcangeli's avatar
      mmu-notifiers: core · cddb8a5c
      Andrea Arcangeli authored
      
      With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to pages.
       There are secondary MMUs (with secondary sptes and secondary tlbs) too.
      sptes in the kvm case are shadow pagetables, but when I say spte in
      mmu-notifier context, I mean "secondary pte".  In GRU case there's no
      actual secondary pte and there's only a secondary tlb because the GRU
      secondary MMU has no knowledge about sptes and every secondary tlb miss
      event in the MMU always generates a page fault that has to be resolved by
      the CPU (this is not the case of KVM where the a secondary tlb miss will
      walk sptes in hardware and it will refill the secondary tlb transparently
      to software if the corresponding spte is present).  The same way
      zap_page_range has to invalidate the pte before freeing the page, the spte
      (and secondary tlb) must also be invalidated before any page is freed and
      reused.
      
      Currently we take a page_count pin on every page mapped by sptes, but that
      means the pages can't be swapped whenever they're mapped by any spte
      because they're part of the guest working set.  Furthermore a spte unmap
      event can immediately lead to a page to be freed when the pin is released
      (so requiring the same complex and relatively slow tlb_gather smp safe
      logic we have in zap_page_range and that can be avoided completely if the
      spte unmap event doesn't require an unpin of the page previously mapped in
      the secondary MMU).
      
      The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
      when the VM is swapping or freeing or doing anything on the primary MMU so
      that the secondary MMU code can drop sptes before the pages are freed,
      avoiding all page pinning and allowing 100% reliable swapping of guest
      physical address space.  Furthermore it avoids the code that teardown the
      mappings of the secondary MMU, to implement a logic like tlb_gather in
      zap_page_range that would require many IPI to flush other cpu tlbs, for
      each fixed number of spte unmapped.
      
      To make an example: if what happens on the primary MMU is a protection
      downgrade (from writeable to wrprotect) the secondary MMU mappings will be
      invalidated, and the next secondary-mmu-page-fault will call
      get_user_pages and trigger a do_wp_page through get_user_pages if it
      called get_user_pages with write=1, and it'll re-establishing an updated
      spte or secondary-tlb-mapping on the copied page.  Or it will setup a
      readonly spte or readonly tlb mapping if it's a guest-read, if it calls
      get_user_pages with write=0.  This is just an example.
      
      This allows to map any page pointed by any pte (and in turn visible in the
      primary CPU MMU), into a secondary MMU (be it a pure tlb like GRU, or an
      full MMU with both sptes and secondary-tlb like the shadow-pagetable layer
      with kvm), or a remote DMA in software like XPMEM (hence needing of
      schedule in XPMEM code to send the invalidate to the remote node, while no
      need to schedule in kvm/gru as it's an immediate event like invalidating
      primary-mmu pte).
      
      At least for KVM without this patch it's impossible to swap guests
      reliably.  And having this feature and removing the page pin allows
      several other optimizations that simplify life considerably.
      
      Dependencies:
      
      1) mm_take_all_locks() to register the mmu notifier when the whole VM
         isn't doing anything with "mm".  This allows mmu notifier users to keep
         track if the VM is in the middle of the invalidate_range_begin/end
         critical section with an atomic counter incraese in range_begin and
         decreased in range_end.  No secondary MMU page fault is allowed to map
         any spte or secondary tlb reference, while the VM is in the middle of
         range_begin/end as any page returned by get_user_pages in that critical
         section could later immediately be freed without any further
         ->invalidate_page notification (invalidate_range_begin/end works on
         ranges and ->invalidate_page isn't called immediately before freeing
         the page).  To stop all page freeing and pagetable overwrites the
         mmap_sem must be taken in write mode and all other anon_vma/i_mmap
         locks must be taken too.
      
      2) It'd be a waste to add branches in the VM if nobody could possibly
         run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only enabled if
         CONFIG_KVM=m/y.  In the current kernel kvm won't yet take advantage of
         mmu notifiers, but this already allows to compile a KVM external module
         against a kernel with mmu notifiers enabled and from the next pull from
         kvm.git we'll start using them.  And GRU/XPMEM will also be able to
         continue the development by enabling KVM=m in their config, until they
         submit all GRU/XPMEM GPLv2 code to the mainline kernel.  Then they can
         also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n).
         This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM
         are all =n.
      
      The mmu_notifier_register call can fail because mm_take_all_locks may be
      interrupted by a signal and return -EINTR.  Because mmu_notifier_reigster
      is used when a driver startup, a failure can be gracefully handled.  Here
      an example of the change applied to kvm to register the mmu notifiers.
      Usually when a driver startups other allocations are required anyway and
      -ENOMEM failure paths exists already.
      
       struct  kvm *kvm_arch_create_vm(void)
       {
              struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
      +       int err;
      
              if (!kvm)
                      return ERR_PTR(-ENOMEM);
      
              INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
      
      +       kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
      +       err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
      +       if (err) {
      +               kfree(kvm);
      +               return ERR_PTR(err);
      +       }
      +
              return kvm;
       }
      
      mmu_notifier_unregister returns void and it's reliable.
      
      The patch also adds a few needed but missing includes that would prevent
      kernel to compile after these changes on non-x86 archs (x86 didn't need
      them by luck).
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix mm/filemap_xip.c build]
      [akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
      Signed-off-by: default avatarAndrea Arcangeli <andrea@qumranet.com>
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Cc: Jack Steiner <steiner@sgi.com>
      Cc: Robin Holt <holt@sgi.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Kanoj Sarcar <kanojsarcar@yahoo.com>
      Cc: Roland Dreier <rdreier@cisco.com>
      Cc: Steve Wise <swise@opengridcomputing.com>
      Cc: Avi Kivity <avi@qumranet.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Anthony Liguori <aliguori@us.ibm.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Cc: Marcelo Tosatti <marcelo@kvack.org>
      Cc: Eric Dumazet <dada1@cosmosbay.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Cc: Izik Eidus <izike@qumranet.com>
      Cc: Anthony Liguori <aliguori@us.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cddb8a5c
  18. 26 Jul, 2008 1 commit
    • Nick Piggin's avatar
      x86: lockless get_user_pages_fast() · 8174c430
      Nick Piggin authored
      
      Implement get_user_pages_fast without locking in the fastpath on x86.
      
      Do an optimistic lockless pagetable walk, without taking mmap_sem or any
      page table locks or even mmap_sem.  Page table existence is guaranteed by
      turning interrupts off (combined with the fact that we're always looking
      up the current mm, means we can do the lockless page table walk within the
      constraints of the TLB shootdown design).  Basically we can do this
      lockless pagetable walk in a similar manner to the way the CPU's pagetable
      walker does not have to take any locks to find present ptes.
      
      This patch (combined with the subsequent ones to convert direct IO to use
      it) was found to give about 10% performance improvement on a 2 socket 8
      core Intel Xeon system running an OLTP workload on DB2 v9.5
      
       "To test the effects of the patch, an OLTP workload was run on an IBM
        x3850 M2 server with 2 processors (quad-core Intel Xeon processors at
        2.93 GHz) using IBM DB2 v9.5 running Linux 2.6.24rc7 kernel.  Comparing
        runs with and without the patch resulted in an overall performance
        benefit of ~9.8%.  Correspondingly, oprofiles showed that samples from
        __up_read and __down_read routines that is seen during thread contention
        for system resources was reduced from 2.8% down to .05%.  Monitoring the
        /proc/vmstat output from the patched run showed that the counter for
        fast_gup contained a very high number while the fast_gup_slow value was
        zero."
      
      (fast_gup is the old name for get_user_pages_fast, fast_gup_slow is a
      counter we had for the number of times the slowpath was invoked).
      
      The main reason for the improvement is that DB2 has multiple threads each
      issuing direct-IO.  Direct-IO uses get_user_pages, and thus the threads
      contend the mmap_sem cacheline, and can also contend on page table locks.
      
      I would anticipate larger performance gains on larger systems, however I
      think DB2 uses an adaptive mix of threads and processes, so it could be
      that thread contention remains pretty constant as machine size increases.
      In which case, we stuck with "only" a 10% gain.
      
      The downside of using get_user_pages_fast is that if there is not a pte
      with the correct permissions for the access, we end up falling back to
      get_user_pages and so the get_user_pages_fast is a bit of extra work.
      However this should not be the common case in most performance critical
      code.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: Kconfig fix]
      [akpm@linux-foundation.org: Makefile fix/cleanup]
      [akpm@linux-foundation.org: warning fix]
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Cc: Dave Kleikamp <shaggy@austin.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Dave Kleikamp <shaggy@austin.ibm.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Reviewed-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8174c430
  19. 24 Jul, 2008 1 commit
  20. 14 Jul, 2008 1 commit
  21. 02 Jul, 2008 1 commit
  22. 28 Apr, 2008 1 commit
    • Christoph Lameter's avatar
      PAGEFLAGS_EXTENDED and separate page flags for Head and Tail · e20b8cca
      Christoph Lameter authored
      
      Having separate page flags for the head and the tail of a compound page allows
      the compiler to use bitops instead of operations on a word to check for a tail
      page.  That is f.e.  important for virt_to_head_page() which is used in
      various critical code paths (kfree for example):
      
      Code for PageTail(page)
      
      Before:
      
       mov    (%rdi),%rdx		page->flags
       mov    %rdx,%rax		3 bytes
       and    $0x12000,%eax		5 bytes
       cmp    $0x12000,%rax		6 bytes
       je     897 <kfree+0xa7>
      
      After:
      
       mov    (%rdi),%rax
       test   $0x40,%ah			(3 bytes)
       jne    887 <kfree+0x97>
      
      So we go from 14 bytes to 3 bytes and from 3 instructions to one.  From the
      use of 2 registers we go to none.
      
      We can only use page flags for this if we have page flags available.  This
      patch introduces CONFIG_PAGEFLAGS_EXTENDED that is set if pageflags are not
      scarce due to SPARSEMEM using page flags for its sectionid on 32 bit NUMA
      platforms.
      
      Additional page flag definitions can be added to the CONFIG_PAGEFLAGS_EXTENDED
      section in page-flags.h if the functionality depends on PAGEFLAGS_EXTENDED or
      if more page flag overlapping tricks are used for the !PAGEFLAGS_EXTENDED
      fallback (the upcoming virtual compound patch may hook in here and Rik's/Lee's
      additional page flags to solve the reclaim issues could also be added there
      [hint...  hint...  where are these patchsets?]).
      
      Avoiding the overlaying of Pg_reclaim also clears the way for possible use of
      compound pages for the pagecache or on the LRU.
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e20b8cca
  23. 28 Jan, 2008 1 commit
  24. 18 Dec, 2007 1 commit
  25. 20 Oct, 2007 1 commit
  26. 16 Oct, 2007 3 commits
    • Jeremy Fitzhardinge's avatar
      xen: lock pte pages while pinning/unpinning · 74260714
      Jeremy Fitzhardinge authored
      
      When a pagetable is created, it is made globally visible in the rmap
      prio tree before it is pinned via arch_dup_mmap(), and remains in the
      rmap tree while it is unpinned with arch_exit_mmap().
      
      This means that other CPUs may race with the pinning/unpinning
      process, and see a pte between when it gets marked RO and actually
      pinned, causing any pte updates to fail with write-protect faults.
      
      As a result, all pte pages must be properly locked, and only unlocked
      once the pinning/unpinning process has finished.
      
      In order to avoid taking spinlocks for the whole pagetable - which may
      overflow the PREEMPT_BITS portion of preempt counter - it locks and pins
      each pte page individually, and then finally pins the whole pagetable.
      Signed-off-by: default avatarJeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickens <hugh@veritas.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Keir Fraser <keir@xensource.com>
      Cc: Jan Beulich <jbeulich@novell.com>
      74260714
    • KAMEZAWA Hiroyuki's avatar
      memory unplug: page offline · 0c0e6195
      KAMEZAWA Hiroyuki authored
      
      Logic.
       - set all pages in  [start,end)  as isolated migration-type.
         by this, all free pages in the range will be not-for-use.
       - Migrate all LRU pages in the range.
       - Test all pages in the range's refcnt is zero or not.
      
      Todo:
       - allocate migration destination page from better area.
       - confirm page_count(page)== 0 && PageReserved(page) page is safe to be freed..
       (I don't like this kind of page but..
       - Find out pages which cannot be migrated.
       - more running tests.
       - Use reclaim for unplugging other memory type area.
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarYasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0c0e6195
    • Andy Whitcroft's avatar
      vmemmap: generify initialisation via helpers · 29c71111
      Andy Whitcroft authored
      
      Convert the common vmemmap population into initialisation helpers for use by
      architecture vmemmap populators.  All architecture implementing the
      SPARSEMEM_VMEMMAP variant supply an architecture specific vmemmap_populate()
      initialiser, which may make use of the helpers.
      
      This allows us to clean up and remove the initialisation Kconfig entries.
      With this patch there is a single SPARSEMEM_VMEMMAP_ENABLE Kconfig option to
      indicate use of that variant.
      Signed-off-by: default avatarAndy Whitcroft <apw@shadowen.org>
      Acked-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      29c71111
  27. 06 Oct, 2007 1 commit
    • Jeremy Fitzhardinge's avatar
      xen: disable split pte locks for now · 67dd5a25
      Jeremy Fitzhardinge authored
      
      When pinning and unpinning pagetables, we must protect them against
      being used by other CPUs, lest they see the pagetable in an
      intermediate read-only-but-not-pinned state.
      
      When using split pte locks, doing this properly would require taking
      all the pte locks for the pagetable while pinning, but this may overflow
      the PREEMPT_BITS part of the preempt counter if the process has mapped
      more than about 512M of memory.
      
      However, failing to take the pte locks causes write-protect faults when
      the pageout code is trying to clear the Access bit on a pte which is part
      of a freshy created and still being pinned process after fork.
      
      This is a short-term fix until the problem is solved properly.
      Signed-off-by: default avatarJeremy Fitzhardinge <jeremy@xensource.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarHugh Dickins <hugh@veritas.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Keir Fraser <keir@xensource.com>
      Cc: Jan Beulich <jbeulich@novell.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      67dd5a25
  28. 29 Jul, 2007 1 commit
  29. 17 Jul, 2007 1 commit
  30. 16 Jul, 2007 1 commit
  31. 08 Jun, 2007 1 commit
  32. 14 May, 2007 1 commit
  33. 09 May, 2007 1 commit
  34. 07 May, 2007 1 commit
    • Christoph Lameter's avatar
      Quicklists for page table pages · 6225e937
      Christoph Lameter authored
      On x86_64 this cuts allocation overhead for page table pages down to a
      fraction (kernel compile / editing load.  TSC based measurement of times spend
      in each function):
      
      no quicklist
      
      pte_alloc               1569048 4.3s(401ns/2.7us/179.7us)
      pmd_alloc                780988 2.1s(337ns/2.7us/86.1us)
      pud_alloc                780072 2.2s(424ns/2.8us/300.6us)
      pgd_alloc                260022 1s(920ns/4us/263.1us)
      
      quicklist:
      
      pte_alloc                452436 573.4ms(8ns/1.3us/121.1us)
      pmd_alloc                196204 174.5ms(7ns/889ns/46.1us)
      pud_alloc                195688 172.4ms(7ns/881ns/151.3us)
      pgd_alloc                 65228 9.8ms(8ns/150ns/6.1us)
      
      pgd allocations are the most complex and there we see the most dramatic
      improvement (may be we can cut down the amount of pgds cached somewhat?).  But
      even the pte allocations still see a doubling of performance.
      
      1. Proven code from the IA64 arch.
      
      	The method used here has been fine tuned for years and
      	is NUMA aware. It is base...
      6225e937
  35. 11 Feb, 2007 1 commit
    • Christoph Lameter's avatar
      [PATCH] Set CONFIG_ZONE_DMA for arches with GENERIC_ISA_DMA · 5ac6da66
      Christoph Lameter authored
      
      As Andi pointed out: CONFIG_GENERIC_ISA_DMA only disables the ISA DMA
      channel management.  Other functionality may still expect GFP_DMA to
      provide memory below 16M.  So we need to make sure that CONFIG_ZONE_DMA is
      set independent of CONFIG_GENERIC_ISA_DMA.  Undo the modifications to
      mm/Kconfig where we made ZONE_DMA dependent on GENERIC_ISA_DMA and set
      theses explicitly in each arches Kconfig.
      
      Reviews must occur for each arch in order to determine if ZONE_DMA can be
      switched off.  It can only be switched off if we know that all devices
      supported by a platform are capable of performing DMA transfers to all of
      memory (Some arches already support this: uml, avr32, sh sh64, parisc and
      IA64/Altix).
      
      In order to switch ZONE_DMA off conditionally, one would have to establish
      a scheme by which one can assure that no drivers are enabled that are only
      capable of doing I/O to a part of memory, or one needs to provide an
      alternate means of performing an allocation from a specific range of memory
      (like provided by alloc_pages_range()) and insure that all drivers use that
      call.  In that case the arches alloc_dma_coherent() may need to be modified
      to call alloc_pages_range() instead of relying on GFP_DMA.
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5ac6da66