1. 11 Dec, 2012 30 commits
    • Mel Gorman's avatar
      mm: numa: Add pte updates, hinting and migration stats · 03c5a6e1
      Mel Gorman authored
      It is tricky to quantify the basic cost of automatic NUMA placement in a
      meaningful manner. This patch adds some vmstats that can be used as part
      of a basic costing model.
      
      u    = basic unit = sizeof(void *)
      Ca   = cost of struct page access = sizeof(struct page) / u
      Cpte = Cost PTE access = Ca
      Cupdate = Cost PTE update = (2 * Cpte) + (2 * Wlock)
      	where Cpte is incurred twice for a read and a write and Wlock
      	is a constant representing the cost of taking or releasing a
      	lock
      Cnumahint = Cost of a minor page fault = some high constant e.g. 1000
      Cpagerw = Cost to read or write a full page = Ca + PAGE_SIZE/u
      Ci = Cost of page isolation = Ca + Wi
      	where Wi is a constant that should reflect the approximate cost
      	of the locking operation
      Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma)
      	where Wnuma is the approximate NUMA factor. 1 is local. 1.2
      	would imply that remote accesses are 20% more expensive
      
      Balancing cost = Cpte * numa_pte_updates +
      		Cnumahint * numa_hint_faults +
      		Ci * numa_pages_migrated +
      		Cpagecopy * numa_pages_migrated
      
      Note that numa_pages_migrated is used as a measure of how many pages
      were isolated even though it would miss pages that failed to migrate. A
      vmstat counter could have been added for it but the isolation cost is
      pretty marginal in comparison to the overall cost so it seemed overkill.
      
      The ideal way to measure automatic placement benefit would be to count
      the number of remote accesses versus local accesses and do something like
      
      	benefit = (remote_accesses_before - remove_access_after) * Wnuma
      
      but the information is not readily available. As a workload converges, the
      expection would be that the number of remote numa hints would reduce to 0.
      
      	convergence = numa_hint_faults_local / numa_hint_faults
      		where this is measured for the last N number of
      		numa hints recorded. When the workload is fully
      		converged the value is 1.
      
      This can measure if the placement policy is converging and how fast it is
      doing it.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      03c5a6e1
    • Peter Zijlstra's avatar
      mm: sched: numa: Implement slow start for working set sampling · 4b96a29b
      Peter Zijlstra authored
      Add a 1 second delay before starting to scan the working set of
      a task and starting to balance it amongst nodes.
      
      [ note that before the constant per task WSS sampling rate patch
        the initial scan would happen much later still, in effect that
        patch caused this regression. ]
      
      The theory is that short-run tasks benefit very little from NUMA
      placement: they come and go, and they better stick to the node
      they were started on. As tasks mature and rebalance to other CPUs
      and nodes, so does their NUMA placement have to change and so
      does it start to matter more and more.
      
      In practice this change fixes an observable kbuild regression:
      
         # [ a perf stat --null --repeat 10 test of ten bzImage builds to /dev/shm ]
      
         !NUMA:
         45.291088843 seconds time elapsed                                          ( +-  0.40% )
         45.154231752 seconds time elapsed                                          ( +-  0.36% )
      
         +NUMA, no slow start:
         46.172308123 seconds time elapsed                                          ( +-  0.30% )
         46.343168745 seconds time elapsed                                          ( +-  0.25% )
      
         +NUMA, 1 sec slow start:
         45.224189155 seconds time elapsed                                          ( +-  0.25% )
         45.160866532 seconds time elapsed                                          ( +-  0.17% )
      
      and it also fixes an observable perf bench (hackbench) regression:
      
         # perf stat --null --repeat 10 perf bench sched messaging
      
         -NUMA:
      
         -NUMA:                  0.246225691 seconds time elapsed                   ( +-  1.31% )
         +NUMA no slow start:    0.252620063 seconds time elapsed                   ( +-  1.13% )
      
         +NUMA 1sec delay:       0.248076230 seconds time elapsed                   ( +-  1.35% )
      
      The implementation is simple and straightforward, most of the patch
      deals with adding the /proc/sys/kernel/numa_balancing_scan_delay_ms tunable
      knob.
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      [ Wrote the changelog, ran measurements, tuned the default. ]
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      4b96a29b
    • Mel Gorman's avatar
      sched, numa, mm: Count WS scanning against present PTEs, not virtual memory ranges · 9f40604c
      Mel Gorman authored
      By accounting against the present PTEs, scanning speed reflects the
      actual present (mapped) memory.
      Suggested-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      9f40604c
    • Peter Zijlstra's avatar
      mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate · 6e5fb223
      Peter Zijlstra authored
      Previously, to probe the working set of a task, we'd use
      a very simple and crude method: mark all of its address
      space PROT_NONE.
      
      That method has various (obvious) disadvantages:
      
       - it samples the working set at dissimilar rates,
         giving some tasks a sampling quality advantage
         over others.
      
       - creates performance problems for tasks with very
         large working sets
      
       - over-samples processes with large address spaces but
         which only very rarely execute
      
      Improve that method by keeping a rotating offset into the
      address space that marks the current position of the scan,
      and advance it by a constant rate (in a CPU cycles execution
      proportional manner). If the offset reaches the last mapped
      address of the mm then it then it starts over at the first
      address.
      
      The per-task nature of the working set sampling functionality in this tree
      allows such constant rate, per task, execution-weight proportional sampling
      of the working set, with an adaptive sampling interval/frequency that
      goes from once per 100ms up to just once per 8 seconds.  The current
      sampling volume is 256 MB per interval.
      
      As tasks mature and converge their working set, so does the
      sampling rate slow down to just a trickle, 256 MB per 8
      seconds of CPU time executed.
      
      This, beyond being adaptive, also rate-limits rarely
      executing systems and does not over-sample on overloaded
      systems.
      
      [ In AutoNUMA speak, this patch deals with the effective sampling
        rate of the 'hinting page fault'. AutoNUMA's scanning is
        currently rate-limited, but it is also fundamentally
        single-threaded, executing in the knuma_scand kernel thread,
        so the limit in AutoNUMA is global and does not scale up with
        the number of CPUs, nor does it scan tasks in an execution
        proportional manner.
      
        So the idea of rate-limiting the scanning was first implemented
        in the AutoNUMA tree via a global rate limit. This patch goes
        beyond that by implementing an execution rate proportional
        working set sampling rate that is not implemented via a single
        global scanning daemon. ]
      
      [ Dan Carpenter pointed out a possible NULL pointer dereference in the
        first version of this patch. ]
      Based-on-idea-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Bug-Found-By: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      [ Wrote changelog and fixed bug. ]
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      6e5fb223
    • Peter Zijlstra's avatar
      mm: numa: Add fault driven placement and migration · cbee9f88
      Peter Zijlstra authored
      NOTE: This patch is based on "sched, numa, mm: Add fault driven
      	placement and migration policy" but as it throws away all the policy
      	to just leave a basic foundation I had to drop the signed-offs-by.
      
      This patch creates a bare-bones method for setting PTEs pte_numa in the
      context of the scheduler that when faulted later will be faulted onto the
      node the CPU is running on.  In itself this does nothing useful but any
      placement policy will fundamentally depend on receiving hints on placement
      from fault context and doing something intelligent about it.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      cbee9f88
    • Mel Gorman's avatar
      mm: mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now · a720094d
      Mel Gorman authored
      The use of MPOL_NOOP and MPOL_MF_LAZY to allow an application to
      explicitly request lazy migration is a good idea but the actual
      API has not been well reviewed and once released we have to support it.
      For now this patch prevents an application using the services. This
      will need to be revisited.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      a720094d
    • Mel Gorman's avatar
      mm: mempolicy: Implement change_prot_numa() in terms of change_protection() · 4b10e7d5
      Mel Gorman authored
      This patch converts change_prot_numa() to use change_protection(). As
      pte_numa and friends check the PTE bits directly it is necessary for
      change_protection() to use pmd_mknuma(). Hence the required
      modifications to change_protection() are a little clumsy but the
      end result is that most of the numa page table helpers are just one or
      two instructions.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      4b10e7d5
    • Lee Schermerhorn's avatar
      mm: mempolicy: Add MPOL_MF_LAZY · b24f53a0
      Lee Schermerhorn authored
      NOTE: Once again there is a lot of patch stealing and the end result
      	is sufficiently different that I had to drop the signed-offs.
      	Will re-add if the original authors are ok with that.
      
      This patch adds another mbind() flag to request "lazy migration".  The
      flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected
      pages are marked PROT_NONE. The pages will be migrated in the fault
      path on "first touch", if the policy dictates at that time.
      
      "Lazy Migration" will allow testing of migrate-on-fault via mbind().
      Also allows applications to specify that only subsequently touched
      pages be migrated to obey new policy, instead of all pages in range.
      This can be useful for multi-threaded applications working on a
      large shared data area that is initialized by an initial thread
      resulting in all pages on one [or a few, if overflowed] nodes.
      After PROT_NONE, the pages in regions assigned to the worker threads
      will be automatically migrated local to the threads on 1st touch.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      b24f53a0
    • Mel Gorman's avatar
      mm: mempolicy: Use _PAGE_NUMA to migrate pages · 4daae3b4
      Mel Gorman authored
      Note: Based on "mm/mpol: Use special PROT_NONE to migrate pages" but
      	sufficiently different that the signed-off-bys were dropped
      
      Combine our previous _PAGE_NUMA, mpol_misplaced and migrate_misplaced_page()
      pieces into an effective migrate on fault scheme.
      
      Note that (on x86) we rely on PROT_NONE pages being !present and avoid
      the TLB flush from try_to_unmap(TTU_MIGRATION). This greatly improves the
      page-migration performance.
      Based-on-work-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      4daae3b4
    • Mel Gorman's avatar
      mm: migrate: Drop the misplaced pages reference count if the target node is full · 149c33e1
      Mel Gorman authored
      If we have to avoid migrating to a node that is nearly full, put page
      and return zero.
      Signed-off-by: default avatarHillf Danton <dhillf@gmail.com>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      149c33e1
    • Peter Zijlstra's avatar
      mm: migrate: Introduce migrate_misplaced_page() · 7039e1db
      Peter Zijlstra authored
      Note: This was originally based on Peter's patch "mm/migrate: Introduce
      	migrate_misplaced_page()" but borrows extremely heavily from Andrea's
      	"autonuma: memory follows CPU algorithm and task/mm_autonuma stats
      	collection". The end result is barely recognisable so signed-offs
      	had to be dropped. If original authors are ok with it, I'll
      	re-add the signed-off-bys.
      
      Add migrate_misplaced_page() which deals with migrating pages from
      faults.
      Based-on-work-by: default avatarLee Schermerhorn <Lee.Schermerhorn@hp.com>
      Based-on-work-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Based-on-work-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      7039e1db
    • Lee Schermerhorn's avatar
      mm: mempolicy: Check for misplaced page · 771fb4d8
      Lee Schermerhorn authored
      This patch provides a new function to test whether a page resides
      on a node that is appropriate for the mempolicy for the vma and
      address where the page is supposed to be mapped.  This involves
      looking up the node where the page belongs.  So, the function
      returns that node so that it may be used to allocated the page
      without consulting the policy again.
      
      A subsequent patch will call this function from the fault path.
      Because of this, I don't want to go ahead and allocate the page, e.g.,
      via alloc_page_vma() only to have to free it if it has the correct
      policy.  So, I just mimic the alloc_page_vma() node computation
      logic--sort of.
      
      Note:  we could use this function to implement a MPOL_MF_STRICT
      behavior when migrating pages to match mbind() mempolicy--e.g.,
      to ensure that pages in an interleaved range are reinterleaved
      rather than left where they are when they reside on any page in
      the interleave nodemask.
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      [ Added MPOL_F_LAZY to trigger migrate-on-fault;
        simplified code now that we don't have to bother
        with special crap for interleaved ]
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      771fb4d8
    • Lee Schermerhorn's avatar
      mm: mempolicy: Add MPOL_NOOP · d3a71033
      Lee Schermerhorn authored
      This patch augments the MPOL_MF_LAZY feature by adding a "NOOP" policy
      to mbind().  When the NOOP policy is used with the 'MOVE and 'LAZY
      flags, mbind() will map the pages PROT_NONE so that they will be
      migrated on the next touch.
      
      This allows an application to prepare for a new phase of operation
      where different regions of shared storage will be assigned to
      worker threads, w/o changing policy.  Note that we could just use
      "default" policy in this case.  However, this also allows an
      application to request that pages be migrated, only if necessary,
      to follow any arbitrary policy that might currently apply to a
      range of pages, without knowing the policy, or without specifying
      multiple mbind()s for ranges with different policies.
      
      [ Bug in early version of mpol_parse_str() reported by Fengguang Wu. ]
      Bug-Reported-by: default avatarReported-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      d3a71033
    • Peter Zijlstra's avatar
      mm: mempolicy: Make MPOL_LOCAL a real policy · 479e2802
      Peter Zijlstra authored
      Make MPOL_LOCAL a real and exposed policy such that applications that
      relied on the previous default behaviour can explicitly request it.
      Requested-by: default avatarChristoph Lameter <cl@linux.com>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      479e2802
    • Mel Gorman's avatar
      mm: numa: Create basic numa page hinting infrastructure · d10e63f2
      Mel Gorman authored
      Note: This patch started as "mm/mpol: Create special PROT_NONE
      	infrastructure" and preserves the basic idea but steals *very*
      	heavily from "autonuma: numa hinting page faults entry points" for
      	the actual fault handlers without the migration parts.	The end
      	result is barely recognisable as either patch so all Signed-off
      	and Reviewed-bys are dropped. If Peter, Ingo and Andrea are ok with
      	this version, I will re-add the signed-offs-by to reflect the history.
      
      In order to facilitate a lazy -- fault driven -- migration of pages, create
      a special transient PAGE_NUMA variant, we can then use the 'spurious'
      protection faults to drive our migrations from.
      
      The meaning of PAGE_NUMA depends on the architecture but on x86 it is
      effectively PROT_NONE. Actual PROT_NONE mappings will not generate these
      NUMA faults for the reason that the page fault code checks the permission on
      the VMA (and will throw a segmentation fault on actual PROT_NONE mappings),
      before it ever calls handle_mm_fault.
      
      [dhillf@gmail.com: Fix typo]
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      d10e63f2
    • Andrea Arcangeli's avatar
      mm: numa: split_huge_page: transfer the NUMA type from the pmd to the pte · 1ba6e0b5
      Andrea Arcangeli authored
      When we split a transparent hugepage, transfer the NUMA type from the
      pmd to the pte if needed.
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      1ba6e0b5
    • Andrea Arcangeli's avatar
      mm: numa: Support NUMA hinting page faults from gup/gup_fast · 0b9d7052
      Andrea Arcangeli authored
      Introduce FOLL_NUMA to tell follow_page to check
      pte/pmd_numa. get_user_pages must use FOLL_NUMA, and it's safe to do
      so because it always invokes handle_mm_fault and retries the
      follow_page later.
      
      KVM secondary MMU page faults will trigger the NUMA hinting page
      faults through gup_fast -> get_user_pages -> follow_page ->
      handle_mm_fault.
      
      Other follow_page callers like KSM should not use FOLL_NUMA, or they
      would fail to get the pages if they use follow_page instead of
      get_user_pages.
      
      [ This patch was picked up from the AutoNUMA tree. ]
      Originally-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      [ ported to this tree. ]
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      0b9d7052
    • Andrea Arcangeli's avatar
      mm: numa: pte_numa() and pmd_numa() · be3a7284
      Andrea Arcangeli authored
      Implement pte_numa and pmd_numa.
      
      We must atomically set the numa bit and clear the present bit to
      define a pte_numa or pmd_numa.
      
      Once a pte or pmd has been set as pte_numa or pmd_numa, the next time
      a thread touches a virtual address in the corresponding virtual range,
      a NUMA hinting page fault will trigger. The NUMA hinting page fault
      will clear the NUMA bit and set the present bit again to resolve the
      page fault.
      
      The expectation is that a NUMA hinting page fault is used as part
      of a placement policy that decides if a page should remain on the
      current node or migrated to a different node.
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      be3a7284
    • Andrea Arcangeli's avatar
      mm: numa: define _PAGE_NUMA · dbe4d203
      Andrea Arcangeli authored
      The objective of _PAGE_NUMA is to be able to trigger NUMA hinting page
      faults to identify the per NUMA node working set of the thread at
      runtime.
      
      Arming the NUMA hinting page fault mechanism works similarly to
      setting up a mprotect(PROT_NONE) virtual range: the present bit is
      cleared at the same time that _PAGE_NUMA is set, so when the fault
      triggers we can identify it as a NUMA hinting page fault.
      
      _PAGE_NUMA on x86 shares the same bit number of _PAGE_PROTNONE (but it
      could also use a different bitflag, it's up to the architecture to
      decide).
      
      It would be confusing to call the "NUMA hinting page faults" as
      "do_prot_none faults". They're different events and _PAGE_NUMA doesn't
      alter the semantics of mprotect(PROT_NONE) in any way.
      
      Sharing the same bitflag with _PAGE_PROTNONE in fact complicates
      things: it requires us to ensure the code paths executed by
      _PAGE_PROTNONE remains mutually exclusive to the code paths executed
      by _PAGE_NUMA at all times, to avoid _PAGE_NUMA and _PAGE_PROTNONE to
      step into each other toes.
      
      Because we want to be able to set this bitflag in any established pte
      or pmd (while clearing the present bit at the same time) without
      losing information, this bitflag must never be set when the pte and
      pmd are present, so the bitflag picked for _PAGE_NUMA usage, must not
      be used by the swap entry format.
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      dbe4d203
    • Mel Gorman's avatar
      mm: compaction: Add scanned and isolated counters for compaction · 397487db
      Mel Gorman authored
      Compaction already has tracepoints to count scanned and isolated pages
      but it requires that ftrace be enabled and if that information has to be
      written to disk then it can be disruptive. This patch adds vmstat counters
      for compaction called compact_migrate_scanned, compact_free_scanned and
      compact_isolated.
      
      With these counters, it is possible to define a basic cost model for
      compaction. This approximates of how much work compaction is doing and can
      be compared that with an oprofile showing TLB misses and see if the cost of
      compaction is being offset by THP for example. Minimally a compaction patch
      can be evaluated in terms of whether it increases or decreases cost. The
      basic cost model looks like this
      
      Fundamental unit u:	a word	sizeof(void *)
      
      Ca  = cost of struct page access = sizeof(struct page) / u
      
      Cmc = Cost migrate page copy = (Ca + PAGE_SIZE/u) * 2
      Cmf = Cost migrate failure   = Ca * 2
      Ci  = Cost page isolation    = (Ca + Wi)
      	where Wi is a constant that should reflect the approximate
      	cost of the locking operation.
      
      Csm = Cost migrate scanning = Ca
      Csf = Cost free    scanning = Ca
      
      Overall cost =	(Csm * compact_migrate_scanned) +
      	      	(Csf * compact_free_scanned)    +
      	      	(Ci  * compact_isolated)	+
      		(Cmc * pgmigrate_success)	+
      		(Cmf * pgmigrate_failed)
      
      Where the values are read from /proc/vmstat.
      
      This is very basic and ignores certain costs such as the allocation cost
      to do a migrate page copy but any improvement to the model would still
      use the same vmstat counters.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      397487db
    • Mel Gorman's avatar
      mm: migrate: Add a tracepoint for migrate_pages · 7b2a2d4a
      Mel Gorman authored
      The pgmigrate_success and pgmigrate_fail vmstat counters tells the user
      about migration activity but not the type or the reason. This patch adds
      a tracepoint to identify the type of page migration and why the page is
      being migrated.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      7b2a2d4a
    • Mel Gorman's avatar
      mm: compaction: Move migration fail/success stats to migrate.c · 5647bc29
      Mel Gorman authored
      The compact_pages_moved and compact_pagemigrate_failed events are
      convenient for determining if compaction is active and to what
      degree migration is succeeding but it's at the wrong level. Other
      users of migration may also want to know if migration is working
      properly and this will be particularly true for any automated
      NUMA migration. This patch moves the counters down to migration
      with the new events called pgmigrate_success and pgmigrate_fail.
      The compact_blocks_moved counter is removed because while it was
      useful for debugging initially, it's worthless now as no meaningful
      conclusions can be drawn from its value.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      5647bc29
    • Ingo Molnar's avatar
      mm: Optimize the TLB flush of sys_mprotect() and change_protection() users · 1233d588
      Ingo Molnar authored
      Reuse the NUMA code's 'modified page protections' count that
      change_protection() computes and skip the TLB flush if there's
      no changes to a range that sys_mprotect() modifies.
      
      Given that mprotect() already optimizes the same-flags case
      I expected this optimization to dominantly trigger on
      CONFIG_NUMA_BALANCING=y kernels - but even with that feature
      disabled it triggers rather often.
      
      There's two reasons for that:
      
      1)
      
      While sys_mprotect() already optimizes the same-flag case:
      
              if (newflags == oldflags) {
                      *pprev = vma;
                      return 0;
              }
      
      and this test works in many cases, but it is too sharp in some
      others, where it differentiates between protection values that the
      underlying PTE format makes no distinction about, such as
      PROT_EXEC == PROT_READ on x86.
      
      2)
      
      Even where the pte format over vma flag changes necessiates a
      modification of the pagetables, there might be no pagetables
      yet to modify: they might not be instantiated yet.
      
      During a regular desktop bootup this optimization hits a couple
      of hundred times. During a Java test I measured thousands of
      hits.
      
      So this optimization improves sys_mprotect() in general, not just
      CONFIG_NUMA_BALANCING=y kernels.
      
      [ We could further increase the efficiency of this optimization if
        change_pte_range() and change_huge_pmd() was a bit smarter about
        recognizing exact-same-value protection masks - when the hardware
        can do that safely. This would probably further speed up mprotect(). ]
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      1233d588
    • Peter Zijlstra's avatar
      mm: Count the number of pages affected in change_protection() · 7da4d641
      Peter Zijlstra authored
      This will be used for three kinds of purposes:
      
       - to optimize mprotect()
      
       - to speed up working set scanning for working set areas that
         have not been touched
      
       - to more accurately scan per real working set
      
      No change in functionality from this patch.
      Suggested-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      7da4d641
    • Mel Gorman's avatar
      mm: Check if PTE is already allocated during page fault · 4fd01770
      Mel Gorman authored
      With transparent hugepage support, handle_mm_fault() has to be careful
      that a normal PMD has been established before handling a PTE fault. To
      achieve this, it used __pte_alloc() directly instead of pte_alloc_map
      as pte_alloc_map is unsafe to run against a huge PMD. pte_offset_map()
      is called once it is known the PMD is safe.
      
      pte_alloc_map() is smart enough to check if a PTE is already present
      before calling __pte_alloc but this check was lost. As a consequence,
      PTEs may be allocated unnecessarily and the page table lock taken.
      Thi useless PTE does get cleaned up but it's a performance hit which
      is visible in page_test from aim9.
      
      This patch simply re-adds the check normally done by pte_alloc_map to
      check if the PTE needs to be allocated before taking the page table
      lock. The effect is noticable in page_test from aim9.
      
       AIM9
                       2.6.38-vanilla 2.6.38-checkptenone
       creat-clo      446.10 ( 0.00%)   424.47 (-5.10%)
       page_test       38.10 ( 0.00%)    42.04 ( 9.37%)
       brk_test        52.45 ( 0.00%)    51.57 (-1.71%)
       exec_test      382.00 ( 0.00%)   456.90 (16.39%)
       fork_test       60.11 ( 0.00%)    67.79 (11.34%)
       MMTests Statistics: duration
       Total Elapsed Time (seconds)                611.90    612.22
      
      (While this affects 2.6.38, it is a performance rather than a
      functional bug and normally outside the rules -stable. While the big
      performance differences are to a microbench, the difference in fork
      and exec performance may be significant enough that -stable wants to
      consider the patch)
      Reported-by: default avatarRaz Ben Yehuda <raziebe@gmail.com>
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Rik van Riel <riel@redhat.com>
      [ Picked this up from the AutoNUMA tree to help
        it upstream and to allow apples-to-apples
        performance comparisons. ]
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      4fd01770
    • Rik van Riel's avatar
      mm: Only flush the TLB when clearing an accessible pte · 8d1acce4
      Rik van Riel authored
      If ptep_clear_flush() is called to clear a page table entry that is
      accessible anyway by the CPU, eg. a _PAGE_PROTNONE page table entry,
      there is no need to flush the TLB on remote CPUs.
      Signed-off-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/n/tip-vm3rkzevahelwhejx5uwm8ex@git.kernel.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      8d1acce4
    • Rik van Riel's avatar
      x86/mm: Introduce pte_accessible() · 2c3cf556
      Rik van Riel authored
      We need pte_present to return true for _PAGE_PROTNONE pages, to indicate that
      the pte is associated with a page.
      
      However, for TLB flushing purposes, we would like to know whether the pte
      points to an actually accessible page.  This allows us to skip remote TLB
      flushes for pages that are not actually accessible.
      
      Fill in this method for x86 and provide a safe (but slower) method
      on other architectures.
      Signed-off-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Fixed-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/n/tip-66p11te4uj23gevgh4j987ip@git.kernel.org
      [ Added Linus's review fixes. ]
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      2c3cf556
    • Rik van Riel's avatar
      mm,generic: only flush the local TLB in ptep_set_access_flags · cef23d9d
      Rik van Riel authored
      The function ptep_set_access_flags is only ever used to upgrade
      access permissions to a page. That means the only negative side
      effect of not flushing remote TLBs is that other CPUs may incur
      spurious page faults, if they happen to access the same address,
      and still have a PTE with the old permissions cached in their
      TLB.
      
      Having another CPU maybe incur a spurious page fault is faster
      than always incurring the cost of a remote TLB flush, so replace
      the remote TLB flush with a purely local one.
      
      This should be safe on every architecture that correctly
      implements flush_tlb_fix_spurious_fault() to actually invalidate
      the local TLB entry that caused a page fault, as well as on
      architectures where the hardware invalidates TLB entries that
      cause page faults.
      
      In the unlikely event that you are hitting what appears to be
      an infinite loop of page faults, and 'git bisect' took you to
      this changeset, your architecture needs to implement
      flush_tlb_fix_spurious_fault to actually flush the TLB entry.
      Signed-off-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      cef23d9d
    • Rik van Riel's avatar
      x86: mm: drop TLB flush from ptep_set_access_flags · e4a1cc56
      Rik van Riel authored
      Intel has an architectural guarantee that the TLB entry causing
      a page fault gets invalidated automatically. This means
      we should be able to drop the local TLB invalidation.
      
      Because of the way other areas of the page fault code work,
      chances are good that all x86 CPUs do this.  However, if
      someone somewhere has an x86 CPU that does not invalidate
      the TLB entry causing a page fault, this one-liner should
      be easy to revert.
      Signed-off-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Linus Torvalds <torvalds@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      e4a1cc56
    • Rik van Riel's avatar
      x86: mm: only do a local tlb flush in ptep_set_access_flags() · 0f9a921c
      Rik van Riel authored
      The function ptep_set_access_flags() is only ever invoked to set access
      flags or add write permission on a PTE.  The write bit is only ever set
      together with the dirty bit.
      
      Because we only ever upgrade a PTE, it is safe to skip flushing entries on
      remote TLBs. The worst that can happen is a spurious page fault on other
      CPUs, which would flush that TLB entry.
      
      Lazily letting another CPU incur a spurious page fault occasionally is
      (much!) cheaper than aggressively flushing everybody else's TLB.
      Signed-off-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      0f9a921c
  2. 17 Nov, 2012 2 commits
  3. 16 Nov, 2012 8 commits
    • Linus Torvalds's avatar
      Merge branch 'akpm' (Fixes from Andrew) · 0cad3ff4
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton.
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (12 patches)
        revert "mm: fix-up zone present pages"
        tmpfs: change final i_blocks BUG to WARNING
        tmpfs: fix shmem_getpage_gfp() VM_BUG_ON
        mm: highmem: don't treat PKMAP_ADDR(LAST_PKMAP) as a highmem address
        mm: revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures"
        rapidio: fix kernel-doc warnings
        swapfile: fix name leak in swapoff
        memcg: fix hotplugged memory zone oops
        mips, arc: fix build failure
        memcg: oom: fix totalpages calculation for memory.swappiness==0
        mm: fix build warning for uninitialized value
        mm: add anon_vma_lock to validate_mm()
      0cad3ff4
    • Andrew Morton's avatar
      revert "mm: fix-up zone present pages" · 5576646f
      Andrew Morton authored
      Revert commit 7f1290f2 ("mm: fix-up zone present pages")
      
      That patch tried to fix a issue when calculating zone->present_pages,
      but it caused a regression on 32bit systems with HIGHMEM.  With that
      change, reset_zone_present_pages() resets all zone->present_pages to
      zero, and fixup_zone_present_pages() is called to recalculate
      zone->present_pages when the boot allocator frees core memory pages into
      buddy allocator.  Because highmem pages are not freed by bootmem
      allocator, all highmem zones' present_pages becomes zero.
      
      Various options for improving the situation are being discussed but for
      now, let's return to the 3.6 code.
      
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Petr Tesarik <ptesarik@suse.cz>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Tested-by: default avatarChris Clayton <chris2553@googlemail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5576646f
    • Hugh Dickins's avatar
      tmpfs: change final i_blocks BUG to WARNING · 0f3c42f5
      Hugh Dickins authored
      Under a particular load on one machine, I have hit shmem_evict_inode()'s
      BUG_ON(inode->i_blocks), enough times to narrow it down to a particular
      race between swapout and eviction.
      
      It comes from the "if (freed > 0)" asymmetry in shmem_recalc_inode(),
      and the lack of coherent locking between mapping's nrpages and shmem's
      swapped count.  There's a window in shmem_writepage(), between lowering
      nrpages in shmem_delete_from_page_cache() and then raising swapped
      count, when the freed count appears to be +1 when it should be 0, and
      then the asymmetry stops it from being corrected with -1 before hitting
      the BUG.
      
      One answer is coherent locking: using tree_lock throughout, without
      info->lock; reasonable, but the raw_spin_lock in percpu_counter_add() on
      used_blocks makes that messier than expected.  Another answer may be a
      further effort to eliminate the weird shmem_recalc_inode() altogether,
      but previous attempts at that failed.
      
      So far undecided, but for now change the BUG_ON to WARN_ON: in usual
      circumstances it remains a useful consistency check.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0f3c42f5
    • Hugh Dickins's avatar
      tmpfs: fix shmem_getpage_gfp() VM_BUG_ON · 215c02bc
      Hugh Dickins authored
      Fuzzing with trinity hit the "impossible" VM_BUG_ON(error) (which Fedora
      has converted to WARNING) in shmem_getpage_gfp():
      
        WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70()
        Pid: 29795, comm: trinity-child4 Not tainted 3.7.0-rc2+ #49
        Call Trace:
          warn_slowpath_common+0x7f/0xc0
          warn_slowpath_null+0x1a/0x20
          shmem_getpage_gfp+0xa5c/0xa70
          shmem_fault+0x4f/0xa0
          __do_fault+0x71/0x5c0
          handle_pte_fault+0x97/0xae0
          handle_mm_fault+0x289/0x350
          __do_page_fault+0x18e/0x530
          do_page_fault+0x2b/0x50
          page_fault+0x28/0x30
          tracesys+0xe1/0xe6
      
      Thanks to Johannes for pointing to truncation: free_swap_and_cache()
      only does a trylock on the page, so the page lock we've held since
      before confirming swap is not enough to protect against truncation.
      
      What cleanup is needed in this case? Just delete_from_swap_cache(),
      which takes care of the memcg uncharge.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Reported-by: default avatarDave Jones <davej@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      215c02bc
    • Will Deacon's avatar
      mm: highmem: don't treat PKMAP_ADDR(LAST_PKMAP) as a highmem address · 498c2280
      Will Deacon authored
      kmap_to_page returns the corresponding struct page for a virtual address
      of an arbitrary mapping.  This works by checking whether the address
      falls in the pkmap region and using the pkmap page tables instead of the
      linear mapping if appropriate.
      
      Unfortunately, the bounds checking means that PKMAP_ADDR(LAST_PKMAP) is
      incorrectly treated as a highmem address and we can end up walking off
      the end of pkmap_page_table and subsequently passing junk to pte_page.
      
      This patch fixes the bound check to stay within the pkmap tables.
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      498c2280
    • Mel Gorman's avatar
      mm: revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures" · 96710098
      Mel Gorman authored
      Jiri Slaby reported the following:
      
      	(It's an effective revert of "mm: vmscan: scale number of pages
      	reclaimed by reclaim/compaction based on failures".) Given kswapd
      	had hours of runtime in ps/top output yesterday in the morning
      	and after the revert it's now 2 minutes in sum for the last 24h,
      	I would say, it's gone.
      
      The intention of the patch in question was to compensate for the loss of
      lumpy reclaim.  Part of the reason lumpy reclaim worked is because it
      aggressively reclaimed pages and this patch was meant to be a sane
      compromise.
      
      When compaction fails, it gets deferred and both compaction and
      reclaim/compaction is deferred avoid excessive reclaim.  However, since
      commit c6543459 ("mm: remove __GFP_NO_KSWAPD"), kswapd is woken up
      each time and continues reclaiming which was not taken into account when
      the patch was developed.
      
      Attempts to address the problem ended up just changing the shape of the
      problem instead of fixing it.  The release window gets closer and while
      a THP allocation failing is not a major problem, kswapd chewing up a lot
      of CPU is.
      
      This patch reverts commit 83fde0f2 ("mm: vmscan: scale number of
      pages reclaimed by reclaim/compaction based on failures") and will be
      revisited in the future.
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Zdenek Kabelac <zkabelac@redhat.com>
      Tested-by: default avatarValdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Jiri Slaby <jirislaby@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      96710098
    • Randy Dunlap's avatar
      rapidio: fix kernel-doc warnings · 2ca3cb50
      Randy Dunlap authored
      Fix rapidio kernel-doc warnings:
      
        Warning(drivers/rapidio/rio.c:415): No description found for parameter 'local'
        Warning(drivers/rapidio/rio.c:415): Excess function parameter 'lstart' description in 'rio_map_inb_region'
        Warning(include/linux/rio.h:290): No description found for parameter 'switches'
        Warning(include/linux/rio.h:290): No description found for parameter 'destid_table'
      Signed-off-by: default avatarRandy Dunlap <rdunlap@infradead.org>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Acked-by: default avatarAlexandre Bounine <alexandre.bounine@idt.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2ca3cb50
    • Xiaotian Feng's avatar
      swapfile: fix name leak in swapoff · f58b59c1
      Xiaotian Feng authored
      There's a name leak introduced by commit 91a27b2a ("vfs: define
      struct filename and have getname() return it").  Add the missing
      putname.
      
      [akpm@linux-foundation.org: cleanup]
      Signed-off-by: default avatarXiaotian Feng <dannyfeng@tencent.com>
      Reviewed-by: default avatarJeff Layton <jlayton@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f58b59c1