1. 25 May, 2010 40 commits
    • Haicheng Li's avatar
      mem-hotplug: fix potential race while building zonelist for new populated zone · 4eaf3f64
      Haicheng Li authored
      Add global mutex zonelists_mutex to fix the possible race:
      
           CPU0                                  CPU1                    CPU2
      (1) zone->present_pages += online_pages;
      (2)                                       build_all_zonelists();
      (3)                                                               alloc_page();
      (4)                                                               free_page();
      (5) build_all_zonelists();
      (6)   __build_all_zonelists();
      (7)     zone->pageset = alloc_percpu();
      
      In step (3,4), zone->pageset still points to boot_pageset, so bad
      things may happen if 2+ nodes are in this state. Even if only 1 node
      is accessing the boot_pageset, (3) may still consume too much memory
      to fail the memory allocations in step (7).
      
      Besides, atomic operation ensures alloc_percpu() in step (7) will never fail
      since there is a new fresh memory block added in step(6).
      
      [haicheng.li@linux.intel.com: hold zonelists_mutex when build_all_zonelists]
      Signed-off-by: default avatarHaicheng Li <haicheng.li@linux.intel.com>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: default avatarAndi Kleen <andi.kleen@intel.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4eaf3f64
    • Haicheng Li's avatar
      mem-hotplug: avoid multiple zones sharing same boot strapping boot_pageset · 1f522509
      Haicheng Li authored
      For each new populated zone of hotadded node, need to update its pagesets
      with dynamically allocated per_cpu_pageset struct for all possible CPUs:
      
          1) Detach zone->pageset from the shared boot_pageset
             at end of __build_all_zonelists().
      
          2) Use mutex to protect zone->pageset when it's still
             shared in onlined_pages()
      
      Otherwises, multiple zones of different nodes would share same boot strapping
      boot_pageset for same CPU, which will finally cause below kernel panic:
      
        ------------[ cut here ]------------
        kernel BUG at mm/page_alloc.c:1239!
        invalid opcode: 0000 [#1] SMP
        ...
        Call Trace:
         [<ffffffff811300c1>] __alloc_pages_nodemask+0x131/0x7b0
         [<ffffffff81162e67>] alloc_pages_current+0x87/0xd0
         [<ffffffff81128407>] __page_cache_alloc+0x67/0x70
         [<ffffffff811325f0>] __do_page_cache_readahead+0x120/0x260
         [<ffffffff81132751>] ra_submit+0x21/0x30
         [<ffffffff811329c6>] ondemand_readahead+0x166/0x2c0
         [<ffffffff81132ba0>] page_cache_async_readahead+0x80/0xa0
         [<ffffffff8112a0e4>] generic_file_aio_read+0x364/0x670
         [<ffffffff81266cfa>] nfs_file_read+0xca/0x130
         [<ffffffff8117b20a>] do_sync_read+0xfa/0x140
         [<ffffffff8117bf75>] vfs_read+0xb5/0x1a0
         [<ffffffff8117c151>] sys_read+0x51/0x80
         [<ffffffff8103c032>] system_call_fastpath+0x16/0x1b
        RIP  [<ffffffff8112ff13>] get_page_from_freelist+0x883/0x900
         RSP <ffff88000d1e78a8>
        ---[ end trace 4bda28328b9990db ]
      
      [akpm@linux-foundation.org: merge fix]
      Signed-off-by: default avatarHaicheng Li <haicheng.li@linux.intel.com>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: default avatarAndi Kleen <andi.kleen@intel.com>
      Reviewed-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1f522509
    • Wu Fengguang's avatar
      mem-hotplug: separate setup_per_cpu_pageset() into separate functions · 319774e2
      Wu Fengguang authored
      No behavior change here.
      
      Move some of setup_per_cpu_pageset() code into a new function
      setup_zone_pageset() that will be useful for memory hotplug.
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarHaicheng Li <haicheng.li@linux.intel.com>
      Reviewed-by: default avatarAndi Kleen <andi.kleen@intel.com>
      Reviewed-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      319774e2
    • Marcelo Roberto Jimenez's avatar
      mm: fix NR_SECTION_ROOTS == 0 when using using sparsemem extreme. · 0faa5638
      Marcelo Roberto Jimenez authored
      Got this while compiling for ARM/SA1100:
      
      mm/sparse.c: In function '__section_nr':
      mm/sparse.c:135: warning: 'root' is used uninitialized in this function
      
      This patch follows Russell King's suggestion for a new calculation for
      NR_SECTION_ROOTS.  Thanks also to Sergei Shtylyov for pointing out the
      existence of the macro DIV_ROUND_UP.
      
      Atsushi Nemoto observed:
      : This fix doesn't just silence the warning - it fixes a real problem.
      :
      : Without this fix, mem_section[] might have 0 size so mem_section[0]
      : will share other variable area.  For example, I got:
      :
      : c030c700 b __warned.16478
      : c030c700 B mem_section
      : c030c701 b __warned.16483
      :
      : This might cause very strange behavior.  Your patch actually fixes it.
      Signed-off-by: default avatarMarcelo Roberto Jimenez <mroberto@cpti.cetuc.puc-rio.br>
      Cc: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Sergei Shtylyov <sshtylyov@mvista.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0faa5638
    • Akinobu Mita's avatar
      highmem: remove unneeded #ifdef CONFIG_TRACE_IRQFLAGS_SUPPORT for debug_kmap_atomic() · ff3d58c2
      Akinobu Mita authored
      In f4112de6 ("mm: introduce
      debug_kmap_atomic") I said that debug_kmap_atomic() needs
      CONFIG_TRACE_IRQFLAGS_SUPPORT.
      
      It was wrong.  (I thought irqs_disabled() is only available when the
      architecture has CONFIG_TRACE_IRQFLAGS_SUPPORT)
      
      Remove the #ifdef CONFIG_TRACE_IRQFLAGS_SUPPORT check to enable
      kmap_atomic() debugging for the architectures which do not have
      CONFIG_TRACE_IRQFLAGS_SUPPORT.
      Reported-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarAkinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ff3d58c2
    • matt mooney's avatar
      include/linux/gfp.h: fix coding style · fd23855e
      matt mooney authored
      Add parenthesis in a define.  This doesn't change functionality.
      
      checkpatch errors:
      1) white space fixes
      2) add spaces after comas
      Signed-off-by: default avatarmatt mooney <mfm@muteddisk.com>
      Cc: Dan Carpenter <error27@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fd23855e
    • matt mooney's avatar
      include/linux/gfp.h: spelling fixes · 263ff5d8
      matt mooney authored
      Fix minor spelling errors in a few comments; no code changes.
      Signed-off-by: default avatarmatt mooney <mfm@muteddisk.com>
      Cc: Dan Carpenter <error27@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      263ff5d8
    • minskey guo's avatar
      cpu/mem hotplug: enable CPUs online before local memory online · cf23422b
      minskey guo authored
      Enable users to online CPUs even if the CPUs belongs to a numa node which
      doesn't have onlined local memory.
      
      The zonlists(pg_data_t.node_zonelists[]) of a numa node are created either
      in system boot/init period, or at the time of local memory online.  For a
      numa node without onlined local memory, its zonelists are not initialized
      at present.  As a result, any memory allocation operations executed by
      CPUs within this node will fail.  In fact, an out-of-memory error is
      triggered when attempt to online CPUs before memory comes to online.
      
      This patch tries to create zonelists for such numa nodes, so that the
      memory allocation for this node can be fallback'ed to other nodes.
      
      [akpm@linux-foundation.org: remove unneeded export]
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: minskey guo<chaohong.guo@intel.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cf23422b
    • Johannes Weiner's avatar
      vmscan: remove isolate_pages callback scan control · 8b25c6d2
      Johannes Weiner authored
      For now, we have global isolation vs.  memory control group isolation, do
      not allow the reclaim entry function to set an arbitrary page isolation
      callback, we do not need that flexibility.
      
      And since we already pass around the group descriptor for the memory
      control group isolation case, just use it to decide which one of the two
      isolator functions to use.
      
      The decisions can be merged into nearby branches, so no extra cost there.
      In fact, we save the indirect calls.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8b25c6d2
    • Johannes Weiner's avatar
      vmscan: remove all_unreclaimable scan control · 0aeb2339
      Johannes Weiner authored
      This scan control is abused to communicate a return value from
      shrink_zones().  Write this idiomatically and remove the knob.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0aeb2339
    • Johannes Weiner's avatar
      mm: document follow_page() · 142762bd
      Johannes Weiner authored
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Dan Carpenter <error27@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      142762bd
    • Richard Kennedy's avatar
      fs-writeback: check sync bit earlier in inode_wait_for_writeback · 58a9d3d8
      Richard Kennedy authored
      When wb_writeback() hasn't written anything it will re-acquire the inode
      lock before calling inode_wait_for_writeback.
      
      This change tests the sync bit first so that is doesn't need to drop &
      re-acquire the lock if the inode became available while wb_writeback() was
      waiting to get the lock.
      Signed-off-by: default avatarRichard Kennedy <richard@rsk.demon.co.uk>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      58a9d3d8
    • KOSAKI Motohiro's avatar
      mm: introduce free_pages_prepare() · ec95f53a
      KOSAKI Motohiro authored
      free_hot_cold_page() and __free_pages_ok() have very similar freeing
      preparation.  Consolidate them.
      
      [akpm@linux-foundation.org: fix busted coding style]
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ec95f53a
    • KOSAKI Motohiro's avatar
      vmscan: page_check_references(): check low order lumpy reclaim properly · 5f53e762
      KOSAKI Motohiro authored
      If vmscan is under lumpy reclaim mode, it have to ignore referenced bit
      for making contenious free pages.  but current page_check_references()
      doesn't.
      
      Fix it.
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5f53e762
    • Huang Shijie's avatar
      readahead.c: fix comment · bf8abe8b
      Huang Shijie authored
      Fix a wrong comment over page_cache_async_readahead().
      Signed-off-by: default avatarHuang Shijie <shijie8@gmail.com>
      Acked-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bf8abe8b
    • Shaohua Li's avatar
      vmscan: prevent get_scan_ratio() rounding errors · 76a33fc3
      Shaohua Li authored
      get_scan_ratio() calculates percentage and if the percentage is < 1%, it
      will round percentage down to 0% and cause we completely ignore scanning
      anon/file pages to reclaim memory even the total anon/file pages are very
      big.
      
      To avoid underflow, we don't use percentage, instead we directly calculate
      how many pages should be scaned.  In this way, we should get several
      scanned pages for < 1% percent.
      
      This has some benefits:
      
      1. increase our calculation precision
      
      2.  making our scan more smoothly.  Without this, if percent[x] is
         underflow, shrink_zone() doesn't scan any pages and suddenly it scans
         all pages when priority is zero.  With this, even priority isn't zero,
         shrink_zone() gets chance to scan some pages.
      
      Note, this patch doesn't really change logics, but just increase
      precision.  For system with a lot of memory, this might slightly changes
      behavior.  For example, in a sequential file read workload, without the
      patch, we don't swap any anon pages.  With it, if anon memory size is
      bigger than 16G, we will see one anon page swapped.  The 16G is calculated
      as PAGE_SIZE * priority(4096) * (fp/ap).  fp/ap is assumed to be 1024
      which is common in this workload.  So the impact sounds not a big deal.
      Signed-off-by: default avatarShaohua Li <shaohua.li@intel.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      76a33fc3
    • Greg Thelen's avatar
      mm: consider the entire user address space during node migration · 6ec3a127
      Greg Thelen authored
      Use mm->task_size instead of TASK_SIZE to ensure that the entire user
      address space is migrated.  mm->task_size is independent of the calling
      task context.  TASK SIZE may be dependant on the address space size of the
      calling process.  Usage of TASK_SIZE can lead to partial address space
      migration if the calling process was 32 bit and the migrating process was
      64 bit.
      
      Here is the test script used on 64 system with a 32 bit echo process:
      
        mount -t cgroup none /cgroup -o cpuset
        cd /cgroup
      
        mkdir 0
        echo 1 > 0/cpuset.cpus
        echo 0 > 0/cpuset.mems
        echo 1 > 0/cpuset.memory_migrate
      
        mkdir 1
        echo 1 > 1/cpuset.cpus
        echo 1 > 1/cpuset.mems
        echo 1 > 1/cpuset.memory_migrate
      
        echo $$ > 0/tasks
        64_bit_process &
        pid=$!
      
        echo $pid > 1/tasks   # This does not migrate all process pages without
                              # this patch.  If 64 bit echo is used or this patch is
                              # applied, then the full address space of $pid is
                              # migrated.
      
      To check memory migration, I watched:
        grep MemUsed /sys/devices/system/node/node*/meminfo
      Signed-off-by: default avatarGreg Thelen <gthelen@google.com>
      Acked-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6ec3a127
    • Mel Gorman's avatar
      mm: compaction: defer compaction using an exponential backoff when compaction fails · 4f92e258
      Mel Gorman authored
      The fragmentation index may indicate that a failure is due to external
      fragmentation but after a compaction run completes, it is still possible
      for an allocation to fail.  There are two obvious reasons as to why
      
        o Page migration cannot move all pages so fragmentation remains
        o A suitable page may exist but watermarks are not met
      
      In the event of compaction followed by an allocation failure, this patch
      defers further compaction in the zone (1 << compact_defer_shift) times.
      If the next compaction attempt also fails, compact_defer_shift is
      increased up to a maximum of 6.  If compaction succeeds, the defer
      counters are reset again.
      
      The zone that is deferred is the first zone in the zonelist - i.e.  the
      preferred zone.  To defer compaction in the other zones, the information
      would need to be stored in the zonelist or implemented similar to the
      zonelist_cache.  This would impact the fast-paths and is not justified at
      this time.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4f92e258
    • Mel Gorman's avatar
      mm: compaction: add a tunable that decides when memory should be compacted and... · 5e771905
      Mel Gorman authored
      mm: compaction: add a tunable that decides when memory should be compacted and when it should be reclaimed
      
      The kernel applies some heuristics when deciding if memory should be
      compacted or reclaimed to satisfy a high-order allocation.  One of these
      is based on the fragmentation.  If the index is below 500, memory will not
      be compacted.  This choice is arbitrary and not based on data.  To help
      optimise the system and set a sensible default for this value, this patch
      adds a sysctl extfrag_threshold.  The kernel will only compact memory if
      the fragmentation index is above the extfrag_threshold.
      
      [randy.dunlap@oracle.com: Fix build errors when proc fs is not configured]
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarRandy Dunlap <randy.dunlap@oracle.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5e771905
    • Mel Gorman's avatar
      mm: compaction: direct compact when a high-order allocation fails · 56de7263
      Mel Gorman authored
      Ordinarily when a high-order allocation fails, direct reclaim is entered
      to free pages to satisfy the allocation.  With this patch, it is
      determined if an allocation failed due to external fragmentation instead
      of low memory and if so, the calling process will compact until a suitable
      page is freed.  Compaction by moving pages in memory is considerably
      cheaper than paging out to disk and works where there are locked pages or
      no swap.  If compaction fails to free a page of a suitable size, then
      reclaim will still occur.
      
      Direct compaction returns as soon as possible.  As each block is
      compacted, it is checked if a suitable page has been freed and if so, it
      returns.
      
      [akpm@linux-foundation.org: Fix build errors]
      [aarcange@redhat.com: fix count_vm_event preempt in memory compaction direct reclaim]
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      56de7263
    • Mel Gorman's avatar
      mm: compaction: add /sys trigger for per-node memory compaction · ed4a6d7f
      Mel Gorman authored
      Add a per-node sysfs file called compact.  When the file is written to,
      each zone in that node is compacted.  The intention that this would be
      used by something like a job scheduler in a batch system before a job
      starts so that the job can allocate the maximum number of hugepages
      without significant start-up cost.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ed4a6d7f
    • Mel Gorman's avatar
      mm: compaction: add /proc trigger for memory compaction · 76ab0f53
      Mel Gorman authored
      Add a proc file /proc/sys/vm/compact_memory.  When an arbitrary value is
      written to the file, all zones are compacted.  The expected user of such a
      trigger is a job scheduler that prepares the system before the target
      application runs.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      76ab0f53
    • Mel Gorman's avatar
      mm: compaction: memory compaction core · 748446bb
      Mel Gorman authored
      This patch is the core of a mechanism which compacts memory in a zone by
      relocating movable pages towards the end of the zone.
      
      A single compaction run involves a migration scanner and a free scanner.
      Both scanners operate on pageblock-sized areas in the zone.  The migration
      scanner starts at the bottom of the zone and searches for all movable
      pages within each area, isolating them onto a private list called
      migratelist.  The free scanner starts at the top of the zone and searches
      for suitable areas and consumes the free pages within making them
      available for the migration scanner.  The pages isolated for migration are
      then migrated to the newly isolated free pages.
      
      [aarcange@redhat.com: Fix unsafe optimisation]
      [mel@csn.ul.ie: do not schedule work on other CPUs for compaction]
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      748446bb
    • Mel Gorman's avatar
      mm: move definition for LRU isolation modes to a header · c175a0ce
      Mel Gorman authored
      Currently, vmscan.c defines the isolation modes for __isolate_lru_page().
      Memory compaction needs access to these modes for isolating pages for
      migration.  This patch exports them.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c175a0ce
    • Mel Gorman's avatar
      mm: export fragmentation index via debugfs · f1a5ab12
      Mel Gorman authored
      The fragmentation fragmentation index, is only meaningful if an allocation
      would fail and indicates what the failure is due to.  A value of -1 such
      as in many of the examples above states that the allocation would succeed.
       If it would fail, the value is between 0 and 1.  A value tending towards
      0 implies the allocation failed due to a lack of memory.  A value tending
      towards 1 implies it failed due to external fragmentation.
      
      For the most part, the huge page size will be the size of interest but not
      necessarily so it is exported on a per-order and per-zo basis via
      /sys/kernel/debug/extfrag/extfrag_index
      
      > cat /sys/kernel/debug/extfrag/extfrag_index
      Node 0, zone      DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.00
      Node 0, zone   Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 0.954
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f1a5ab12
    • Mel Gorman's avatar
      mm: export unusable free space index via debugfs · d7a5752c
      Mel Gorman authored
      The unusable free space index measures how much of the available free
      memory cannot be used to satisfy an allocation of a given size and is a
      value between 0 and 1.  The higher the value, the more of free memory is
      unusable and by implication, the worse the external fragmentation is.  For
      the most part, the huge page size will be the size of interest but not
      necessarily so it is exported on a per-order and per-zone basis via
      /sys/kernel/debug/extfrag/unusable_index.
      
      > cat /sys/kernel/debug/extfrag/unusable_index
      Node 0, zone      DMA 0.000 0.000 0.000 0.001 0.005 0.013 0.021 0.037 0.037 0.101 0.230
      Node 0, zone   Normal 0.000 0.000 0.000 0.001 0.002 0.002 0.005 0.015 0.028 0.028 0.054
      
      [akpm@linux-foundation.org: Fix allnoconfig]
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d7a5752c
    • Mel Gorman's avatar
      mm: migration: avoid race between shift_arg_pages() and rmap_walk() during... · a8bef8ff
      Mel Gorman authored
      mm: migration: avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks
      
      Page migration requires rmap to be able to find all ptes mapping a page
      at all times, otherwise the migration entry can be instantiated, but it
      is possible to leave one behind if the second rmap_walk fails to find
      the page.  If this page is later faulted, migration_entry_to_page() will
      call BUG because the page is locked indicating the page was migrated by
      the migration PTE not cleaned up. For example
      
        kernel BUG at include/linux/swapops.h:105!
        invalid opcode: 0000 [#1] PREEMPT SMP
        ...
        Call Trace:
         [<ffffffff810e951a>] handle_mm_fault+0x3f8/0x76a
         [<ffffffff8130c7a2>] do_page_fault+0x44a/0x46e
         [<ffffffff813099b5>] page_fault+0x25/0x30
         [<ffffffff8114de33>] load_elf_binary+0x152a/0x192b
         [<ffffffff8111329b>] search_binary_handler+0x173/0x313
         [<ffffffff81114896>] do_execve+0x219/0x30a
         [<ffffffff8100a5c6>] sys_execve+0x43/0x5e
         [<ffffffff8100320a>] stub_execve+0x6a/0xc0
        RIP  [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129
      
      There is a race between shift_arg_pages and migration that triggers this
      bug.  A temporary stack is setup during exec and later moved.  If
      migration moves a page in the temporary stack and the VMA is then removed
      before migration completes, the migration PTE may not be found leading to
      a BUG when the stack is faulted.
      
      This patch causes pages within the temporary stack during exec to be
      skipped by migration.  It does this by marking the VMA covering the
      temporary stack with an otherwise impossible combination of VMA flags.
      These flags are cleared when the temporary stack is moved to its final
      location.
      
      [kamezawa.hiroyu@jp.fujitsu.com: idea for having migration skip temporary stacks]
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a8bef8ff
    • Mel Gorman's avatar
      mm: allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove · e9e96b39
      Mel Gorman authored
      CONFIG_MIGRATION currently depends on CONFIG_NUMA or on the architecture
      being able to hot-remove memory.  The main users of page migration such as
      sys_move_pages(), sys_migrate_pages() and cpuset process migration are
      only beneficial on NUMA so it makes sense.
      
      As memory compaction will operate within a zone and is useful on both NUMA
      and non-NUMA systems, this patch allows CONFIG_MIGRATION to be set if the
      user selects CONFIG_COMPACTION as an option.
      
      [akpm@linux-foundation.org: Depend on CONFIG_HUGETLB_PAGE]
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e9e96b39
    • Mel Gorman's avatar
      mm: migration: allow the migration of PageSwapCache pages · 3fe2011f
      Mel Gorman authored
      PageAnon pages that are unmapped may or may not have an anon_vma so are
      not currently migrated.  However, a swap cache page can be migrated and
      fits this description.  This patch identifies page swap caches and allows
      them to be migrated but ensures that no attempt to made to remap the pages
      would would potentially try to access an already freed anon_vma.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3fe2011f
    • Mel Gorman's avatar
      mm: migration: do not try to migrate unmapped anonymous pages · 67b9509b
      Mel Gorman authored
      rmap_walk_anon() was triggering errors in memory compaction that look like
      use-after-free errors.  The problem is that between the page being
      isolated from the LRU and rcu_read_lock() being taken, the mapcount of the
      page dropped to 0 and the anon_vma gets freed.  This can happen during
      memory compaction if pages being migrated belong to a process that exits
      before migration completes.  Hence, the use-after-free race looks like
      
       1. Page isolated for migration
       2. Process exits
       3. page_mapcount(page) drops to zero so anon_vma was no longer reliable
       4. unmap_and_move() takes the rcu_lock but the anon_vma is already garbage
       4. call try_to_unmap, looks up tha anon_vma and "locks" it but the lock
          is garbage.
      
      This patch checks the mapcount after the rcu lock is taken.  If the
      mapcount is zero, the anon_vma is assumed to be freed and no further
      action is taken.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      67b9509b
    • Mel Gorman's avatar
      mm: migration: share the anon_vma ref counts between KSM and page migration · 7f60c214
      Mel Gorman authored
      For clarity of review, KSM and page migration have separate refcounts on
      the anon_vma.  While clear, this is a waste of memory.  This patch gets
      KSM and page migration to share their toys in a spirit of harmony.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7f60c214
    • Mel Gorman's avatar
      mm: migration: take a reference to the anon_vma before migrating · 3f6c8272
      Mel Gorman authored
      This patchset is a memory compaction mechanism that reduces external
      fragmentation memory by moving GFP_MOVABLE pages to a fewer number of
      pageblocks.  The term "compaction" was chosen as there are is a number of
      mechanisms that are not mutually exclusive that can be used to defragment
      memory.  For example, lumpy reclaim is a form of defragmentation as was
      slub "defragmentation" (really a form of targeted reclaim).  Hence, this
      is called "compaction" to distinguish it from other forms of
      defragmentation.
      
      In this implementation, a full compaction run involves two scanners
      operating within a zone - a migration and a free scanner.  The migration
      scanner starts at the beginning of a zone and finds all movable pages
      within one pageblock_nr_pages-sized area and isolates them on a
      migratepages list.  The free scanner begins at the end of the zone and
      searches on a per-area basis for enough free pages to migrate all the
      pages on the migratepages list.  As each area is respectively migrated or
      exhausted of free pages, the scanners are advanced one area.  A compaction
      run completes within a zone when the two scanners meet.
      
      This method is a bit primitive but is easy to understand and greater
      sophistication would require maintenance of counters on a per-pageblock
      basis.  This would have a big impact on allocator fast-paths to improve
      compaction which is a poor trade-off.
      
      It also does not try relocate virtually contiguous pages to be physically
      contiguous.  However, assuming transparent hugepages were in use, a
      hypothetical khugepaged might reuse compaction code to isolate free pages,
      split them and relocate userspace pages for promotion.
      
      Memory compaction can be triggered in one of three ways.  It may be
      triggered explicitly by writing any value to /proc/sys/vm/compact_memory
      and compacting all of memory.  It can be triggered on a per-node basis by
      writing any value to /sys/devices/system/node/nodeN/compact where N is the
      node ID to be compacted.  When a process fails to allocate a high-order
      page, it may compact memory in an attempt to satisfy the allocation
      instead of entering direct reclaim.  Explicit compaction does not finish
      until the two scanners meet and direct compaction ends if a suitable page
      becomes available that would meet watermarks.
      
      The series is in 14 patches.  The first three are not "core" to the series
      but are important pre-requisites.
      
      Patch 1 reference counts anon_vma for rmap_walk_anon(). Without this
      	patch, it's possible to use anon_vma after free if the caller is
      	not holding a VMA or mmap_sem for the pages in question. While
      	there should be no existing user that causes this problem,
      	it's a requirement for memory compaction to be stable. The patch
      	is at the start of the series for bisection reasons.
      Patch 2 merges the KSM and migrate counts. It could be merged with patch 1
      	but would be slightly harder to review.
      Patch 3 skips over unmapped anon pages during migration as there are no
      	guarantees about the anon_vma existing. There is a window between
      	when a page was isolated and migration started during which anon_vma
      	could disappear.
      Patch 4 notes that PageSwapCache pages can still be migrated even if they
      	are unmapped.
      Patch 5 allows CONFIG_MIGRATION to be set without CONFIG_NUMA
      Patch 6 exports a "unusable free space index" via debugfs. It's
      	a measure of external fragmentation that takes the size of the
      	allocation request into account. It can also be calculated from
      	userspace so can be dropped if requested
      Patch 7 exports a "fragmentation index" which only has meaning when an
      	allocation request fails. It determines if an allocation failure
      	would be due to a lack of memory or external fragmentation.
      Patch 8 moves the definition for LRU isolation modes for use by compaction
      Patch 9 is the compaction mechanism although it's unreachable at this point
      Patch 10 adds a means of compacting all of memory with a proc trgger
      Patch 11 adds a means of compacting a specific node with a sysfs trigger
      Patch 12 adds "direct compaction" before "direct reclaim" if it is
      	determined there is a good chance of success.
      Patch 13 adds a sysctl that allows tuning of the threshold at which the
      	kernel will compact or direct reclaim
      Patch 14 temporarily disables compaction if an allocation failure occurs
      	after compaction.
      
      Testing of compaction was in three stages.  For the test, debugging,
      preempt, the sleep watchdog and lockdep were all enabled but nothing nasty
      popped out.  min_free_kbytes was tuned as recommended by hugeadm to help
      fragmentation avoidance and high-order allocations.  It was tested on X86,
      X86-64 and PPC64.
      
      Ths first test represents one of the easiest cases that can be faced for
      lumpy reclaim or memory compaction.
      
      1. Machine freshly booted and configured for hugepage usage with
      	a) hugeadm --create-global-mounts
      	b) hugeadm --pool-pages-max DEFAULT:8G
      	c) hugeadm --set-recommended-min_free_kbytes
      	d) hugeadm --set-recommended-shmmax
      
      	The min_free_kbytes here is important. Anti-fragmentation works best
      	when pageblocks don't mix. hugeadm knows how to calculate a value that
      	will significantly reduce the worst of external-fragmentation-related
      	events as reported by the mm_page_alloc_extfrag tracepoint.
      
      2. Load up memory
      	a) Start updatedb
      	b) Create in parallel a X files of pagesize*128 in size. Wait
      	   until files are created. By parallel, I mean that 4096 instances
      	   of dd were launched, one after the other using &. The crude
      	   objective being to mix filesystem metadata allocations with
      	   the buffer cache.
      	c) Delete every second file so that pageblocks are likely to
      	   have holes
      	d) kill updatedb if it's still running
      
      	At this point, the system is quiet, memory is full but it's full with
      	clean filesystem metadata and clean buffer cache that is unmapped.
      	This is readily migrated or discarded so you'd expect lumpy reclaim
      	to have no significant advantage over compaction but this is at
      	the POC stage.
      
      3. In increments, attempt to allocate 5% of memory as hugepages.
      	   Measure how long it took, how successful it was, how many
      	   direct reclaims took place and how how many compactions. Note
      	   the compaction figures might not fully add up as compactions
      	   can take place for orders other than the hugepage size
      
      X86				vanilla		compaction
      Final page count                    913                916 (attempted 1002)
      pages reclaimed                   68296               9791
      
      X86-64				vanilla		compaction
      Final page count:                   901                902 (attempted 1002)
      Total pages reclaimed:           112599              53234
      
      PPC64				vanilla		compaction
      Final page count:                    93                 94 (attempted 110)
      Total pages reclaimed:           103216              61838
      
      There was not a dramatic improvement in success rates but it wouldn't be
      expected in this case either.  What was important is that fewer pages were
      reclaimed in all cases reducing the amount of IO required to satisfy a
      huge page allocation.
      
      The second tests were all performance related - kernbench, netperf, iozone
      and sysbench.  None showed anything too remarkable.
      
      The last test was a high-order allocation stress test.  Many kernel
      compiles are started to fill memory with a pressured mix of unmovable and
      movable allocations.  During this, an attempt is made to allocate 90% of
      memory as huge pages - one at a time with small delays between attempts to
      avoid flooding the IO queue.
      
                                                   vanilla   compaction
      Percentage of request allocated X86               98           99
      Percentage of request allocated X86-64            95           98
      Percentage of request allocated PPC64             55           70
      
      This patch:
      
      rmap_walk_anon() does not use page_lock_anon_vma() for looking up and
      locking an anon_vma and it does not appear to have sufficient locking to
      ensure the anon_vma does not disappear from under it.
      
      This patch copies an approach used by KSM to take a reference on the
      anon_vma while pages are being migrated.  This should prevent rmap_walk()
      running into nasty surprises later because anon_vma has been freed.
      Signed-off-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3f6c8272
    • David Rientjes's avatar
      mm: default to node zonelist ordering when nodes have only lowmem · e325c90f
      David Rientjes authored
      There are two types of zonelist ordering methodologies:
      
       - node order, preferring allocations on a node to stay local to and
      
       - zone order, preferring allocations come from a higher zone to avoid
         allocating in lowmem zones even though they may not be local.
      
      The ordering technique used by the kernel is configurable on the command
      line, but also has some logic to determine what the default should be.
      
      This logic currently lacks knowledge of systems where a node may only have
      lowmem.  For such systems, it is necessary to use node order so that
      GFP_KERNEL allocations may be satisfied by nodes consisting of only
      lowmem.
      
      If zone order is used, GFP_KERNEL allocations to such nodes are actually
      allocated on a node with local affinity that includes ZONE_NORMAL.
      
      This change defaults to node zonelist ordering if any node lacks
      ZONE_NORMAL.
      
      To force zone order, append 'numa_zonelist_order=zone' to the kernel
      command line.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e325c90f
    • Naoya Horiguchi's avatar
      pagemap: add #ifdefs CONFIG_HUGETLB_PAGE on code walking hugetlb vma · 1a5cb814
      Naoya Horiguchi authored
      If !CONFIG_HUGETLB_PAGE, pagemap_hugetlb_range() is never called.  So put
      it (and its calling function) into #ifdef block.
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarMatt Mackall <mpm@selenic.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1a5cb814
    • Johannes Weiner's avatar
      mincore: do nested page table walks · e48293fd
      Johannes Weiner authored
      Do page table walks with the well-known nested loops we use in several
      other places already.
      
      This avoids doing full page table walks after every pte range and also
      allows to handle unmapped areas bigger than one pte range in one go.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e48293fd
    • Johannes Weiner's avatar
      mincore: pass ranges as start,end address pairs · 25ef0e50
      Johannes Weiner authored
      Instead of passing a start address and a number of pages into the helper
      functions, convert them to use a start and an end address.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      25ef0e50
    • Johannes Weiner's avatar
      mincore: break do_mincore() into logical pieces · f4884010
      Johannes Weiner authored
      Split out functions to handle hugetlb ranges, pte ranges and unmapped
      ranges, to improve readability but also to prepare the file structure for
      nested page table walks.
      
      No semantic changes intended.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f4884010
    • Johannes Weiner's avatar
      mincore: cleanups · 6a60f1b3
      Johannes Weiner authored
      This fixes some minor issues that bugged me while going over the code:
      
      o adjust argument order of do_mincore() to match the syscall
      o simplify range length calculation
      o drop superfluous shift in huge tlb calculation, address is page aligned
      o drop dead nr_huge calculation
      o check pte_none() before pte_present()
      o comment and whitespace fixes
      
      No semantic changes intended.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6a60f1b3
    • Miao Xie's avatar
      cpuset,mm: fix no node to alloc memory when changing cpuset's mems · c0ff7453
      Miao Xie authored
      Before applying this patch, cpuset updates task->mems_allowed and
      mempolicy by setting all new bits in the nodemask first, and clearing all
      old unallowed bits later.  But in the way, the allocator may find that
      there is no node to alloc memory.
      
      The reason is that cpuset rebinds the task's mempolicy, it cleans the
      nodes which the allocater can alloc pages on, for example:
      
      (mpol: mempolicy)
      	task1			task1's mpol	task2
      	alloc page		1
      	  alloc on node0? NO	1
      				1		change mems from 1 to 0
      				1		rebind task1's mpol
      				0-1		  set new bits
      				0	  	  clear disallowed bits
      	  alloc on node1? NO	0
      	  ...
      	can't alloc page
      	  goto oom
      
      This patch fixes this problem by expanding the nodes range first(set newly
      allowed bits) and shrink it lazily(clear newly disallowed bits).  So we
      use a variable to tell the write-side task that read-side task is reading
      nodemask, and the write-side task clears newly disallowed nodes after
      read-side task ends the current memory allocation.
      
      [akpm@linux-foundation.org: fix spello]
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Paul Menage <menage@google.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Ravikiran Thirumalai <kiran@scalex86.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c0ff7453
    • Miao Xie's avatar
      mempolicy: restructure rebinding-mempolicy functions · 708c1bbc
      Miao Xie authored
      Nick Piggin reported that the allocator may see an empty nodemask when
      changing cpuset's mems[1].  It happens only on the kernel that do not do
      atomic nodemask_t stores.  (MAX_NUMNODES > BITS_PER_LONG)
      
      But I found that there is also a problem on the kernel that can do atomic
      nodemask_t stores.  The problem is that the allocator can't find a node to
      alloc page when changing cpuset's mems though there is a lot of free
      memory.  The reason is like this:
      
      (mpol: mempolicy)
      	task1			task1's mpol	task2
      	alloc page		1
      	  alloc on node0? NO	1
      				1		change mems from 1 to 0
      				1		rebind task1's mpol
      				0-1		  set new bits
      				0	  	  clear disallowed bits
      	  alloc on node1? NO	0
      	  ...
      	can't alloc page
      	  goto oom
      
      I can use the attached program reproduce it by the following step:
      
      # mkdir /dev/cpuset
      # mount -t cpuset cpuset /dev/cpuset
      # mkdir /dev/cpuset/1
      # echo `cat /dev/cpuset/cpus` > /dev/cpuset/1/cpus
      # echo `cat /dev/cpuset/mems` > /dev/cpuset/1/mems
      # echo $$ > /dev/cpuset/1/tasks
      # numactl --membind=`cat /dev/cpuset/mems` ./cpuset_mem_hog <nr_tasks> &
         <nr_tasks> = max(nr_cpus - 1, 1)
      # killall -s SIGUSR1 cpuset_mem_hog
      # ./change_mems.sh
      
      several hours later, oom will happen though there is a lot of free memory.
      
      This patchset fixes this problem by expanding the nodes range first(set
      newly allowed bits) and shrink it lazily(clear newly disallowed bits).  So
      we use a variable to tell the write-side task that read-side task is
      reading nodemask, and the write-side task clears newly disallowed nodes
      after read-side task ends the current memory allocation.
      
      This patch:
      
      In order to fix no node to alloc memory, when we want to update mempolicy
      and mems_allowed, we expand the set of nodes first (set all the newly
      nodes) and shrink the set of nodes lazily(clean disallowed nodes), But the
      mempolicy's rebind functions may breaks the expanding.
      
      So we restructure the mempolicy's rebind functions and split the rebind
      work to two steps, just like the update of cpuset's mems: The 1st step:
      expand the set of the mempolicy's nodes.  The 2nd step: shrink the set of
      the mempolicy's nodes.  It is used when there is no real lock to protect
      the mempolicy in the read-side.  Otherwise we can do rebind work at once.
      
      In order to implement it, we define
      
      	enum mpol_rebind_step {
      		MPOL_REBIND_ONCE,
      		MPOL_REBIND_STEP1,
      		MPOL_REBIND_STEP2,
      		MPOL_REBIND_NSTEP,
      	};
      
      If the mempolicy needn't be updated by two steps, we can pass
      MPOL_REBIND_ONCE to the rebind functions.  Or we can pass
      MPOL_REBIND_STEP1 to do the first step of the rebind work and pass
      MPOL_REBIND_STEP2 to do the second step work.
      
      Besides that, it maybe long time between these two step and we have to
      release the lock that protects mempolicy and mems_allowed.  If we hold the
      lock once again, we must check whether the current mempolicy is under the
      rebinding (the first step has been done) or not, because the task may
      alloc a new mempolicy when we don't hold the lock.  So we defined the
      following flag to identify it:
      
      #define MPOL_F_REBINDING (1 << 2)
      
      The new functions will be used in the next patch.
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Paul Menage <menage@google.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Ravikiran Thirumalai <kiran@scalex86.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      708c1bbc