1. 07 Feb, 2008 40 commits
    • KAMEZAWA Hiroyuki's avatar
      per-zone and reclaim enhancements for memory controller: per-zone-lock for cgroup · 072c56c1
      KAMEZAWA Hiroyuki authored
      Now, lru is per-zone.
      
      Then, lru_lock can be (should be) per-zone, too.
      This patch implementes per-zone lru lock.
      
      lru_lock is placed into mem_cgroup_per_zone struct.
      
      lock can be accessed by
         mz = mem_cgroup_zoneinfo(mem_cgroup, node, zone);
         &mz->lru_lock
      
         or
         mz = page_cgroup_zoneinfo(page_cgroup);
         &mz->lru_lock
      Signed-off-by: default avatarKAMEZAWA hiroyuki <kmaezawa.hiroyu@jp.fujitsu.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Paul Menage <menage@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      072c56c1
    • KAMEZAWA Hiroyuki's avatar
      per-zone and reclaim enhancements for memory controller: per zone lru for cgroup · 1ecaab2b
      KAMEZAWA Hiroyuki authored
      This patch implements per-zone lru for memory cgroup.
      This patch makes use of mem_cgroup_per_zone struct for per zone lru.
      
      LRU can be accessed by
      
         mz = mem_cgroup_zoneinfo(mem_cgroup, node, zone);
         &mz->active_list
         &mz->inactive_list
      
         or
         mz = page_cgroup_zoneinfo(page_cgroup);
         &mz->active_list
         &mz->inactive_list
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Paul Menage <menage@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1ecaab2b
    • KAMEZAWA Hiroyuki's avatar
      per-zone and reclaim enhancements for memory controller: modifies vmscan.c for... · 1cfb419b
      KAMEZAWA Hiroyuki authored
      per-zone and reclaim enhancements for memory controller: modifies vmscan.c for isolate globa/cgroup lru activity
      
      When using memory controller, there are 2 levels of memory reclaim.
       1. zone memory reclaim because of system/zone memory shortage.
       2. memory cgroup memory reclaim because of hitting limit.
      
      These two can be distinguished by sc->mem_cgroup parameter.
      (scan_global_lru() macro)
      
      This patch tries to make memory cgroup reclaim routine avoid affecting
      system/zone memory reclaim. This patch inserts if (scan_global_lru()) and
      hook to memory_cgroup reclaim support functions.
      
      This patch can be a help for isolating system lru activity and group lru
      activity and shows what additional functions are necessary.
      
       * mem_cgroup_calc_mapped_ratio() ... calculate mapped ratio for cgroup.
       * mem_cgroup_reclaim_imbalance() ... calculate active/inactive balance in
                                              cgroup.
       * mem_cgroup_calc_reclaim_active() ... calculate the number of active pages to
                                      be scanned in this priority in mem_cgroup.
      
       * mem_cgroup_calc_reclaim_inactive() ... calculate the number of inactive pages
                                      to be scanned in this priority in mem_cgroup.
      
       * mem_cgroup_all_unreclaimable() .. checks cgroup's page is all unreclaimable
                                           or not.
       * mem_cgroup_get_reclaim_priority() ...
       * mem_cgroup_note_reclaim_priority() ... record reclaim priority (temporal)
       * mem_cgroup_remember_reclaim_priority()
                                   .... record reclaim priority as
                                        zone->prev_priority.
                                        This value is used for calc reclaim_mapped.
      
      [akpm@linux-foundation.org: fix unused var warning]
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Paul Menage <menage@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1cfb419b
    • KAMEZAWA Hiroyuki's avatar
      per-zone and reclaim enhancements for memory controller: calculate the number... · cc38108e
      KAMEZAWA Hiroyuki authored
      per-zone and reclaim enhancements for memory controller: calculate the number of pages to be scanned per cgroup
      
      Define function for calculating the number of scan target on each Zone/LRU.
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Paul Menage <menage@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cc38108e
    • KAMEZAWA Hiroyuki's avatar
      per-zone and reclaim enhancements for memory controller: remember reclaim priority in memory cgroup · 6c48a1d0
      KAMEZAWA Hiroyuki authored
      Functions to remember reclaim priority per cgroup (as zone->prev_priority)
      
      [akpm@linux-foundation.org: build fixes]
      [akpm@linux-foundation.org: more build fixes]
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Paul Menage <menage@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6c48a1d0
    • KAMEZAWA Hiroyuki's avatar
      per-zone and reclaim enhancements for memory controller: calculate... · 5932f367
      KAMEZAWA Hiroyuki authored
      per-zone and reclaim enhancements for memory controller: calculate active/inactive imbalance per cgroup
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Paul Menage <menage@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5932f367
    • KAMEZAWA Hiroyuki's avatar
      per-zone and reclaim enhancements for memory controller: calculate mapper_ratio per cgroup · 58ae83db
      KAMEZAWA Hiroyuki authored
      Define function for calculating mapped_ratio in memory cgroup.
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Paul Menage <menage@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      58ae83db
    • KAMEZAWA Hiroyuki's avatar
      per-zone and reclaim enhancements for memory controller: per-zone active inactive counter · 6d12e2d8
      KAMEZAWA Hiroyuki authored
      This patch adds per-zone status in memory cgroup.  These values are often read
      (as per-zone value) by page reclaiming.
      
      In current design, per-zone stat is just a unsigned long value and not an
      atomic value because they are modified only under lru_lock.  (So, atomic_ops
      is not necessary.)
      
      This patch adds ACTIVE and INACTIVE per-zone status values.
      
      For handling per-zone status, this patch adds
        struct mem_cgroup_per_zone {
      		...
        }
      and some helper functions. This will be useful to add per-zone objects
      in mem_cgroup.
      
      This patch turns memory controller's early_init to be 0 for calling
      kmalloc() in initialization.
      Acked-by: default avatarBalbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Paul Menage <menage@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6d12e2d8
    • KAMEZAWA Hiroyuki's avatar
      per-zone and reclaim enhancements for memory controller: nid/zid helper function for cgroup · c0149530
      KAMEZAWA Hiroyuki authored
      Add macro to get node_id and zone_id of page_cgroup.  Will be used in
      per-zone-xxx patches and others.
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Paul Menage <menage@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c0149530
    • KAMEZAWA Hiroyuki's avatar
      per-zone and reclaim enhancements for memory controller: add scan_global_lru macro · 91a45470
      KAMEZAWA Hiroyuki authored
      This is used to detect which scan_control scans global lru or mem_cgroup lru.
      And compiled to be static value (1) when memory controller is not configured.
      This may make the meaning obvious.
      Acked-by: default avatarBalbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Paul Menage <menage@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      91a45470
    • KAMEZAWA Hiroyuki's avatar
      memory cgroup enhancements: implicit force_empty() at rmdir · df878fb0
      KAMEZAWA Hiroyuki authored
      Add pre_destroy handler for mem_cgroup and try to make mem_cgroup empty at
      rmdir().
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Paul Menage <menage@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      df878fb0
    • KAMEZAWA Hiroyuki's avatar
      memory cgroup enhancements: add- pre_destroy() handler · 4fca88c8
      KAMEZAWA Hiroyuki authored
      Add a handler "pre_destroy" to cgroup_subsys.  It is called before
      cgroup_rmdir() checks all subsys's refcnt.
      
      I think this is useful for subsys which have some extra refs even if there
      are no tasks in cgroup.  By adding pre_destroy(), the kernel keeps the rule
      "destroy() against subsystem is called only when refcnt=0." and allows css
      ref to be used by other objects than tasks.
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Paul Menage <menage@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4fca88c8
    • KAMEZAWA Hiroyuki's avatar
      memory cgroup enhancements: add memory.stat file · d2ceb9b7
      KAMEZAWA Hiroyuki authored
      Show accounted information of memory cgroup by memory.stat file
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix printk warning]
      Signed-off-by: default avatarYAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d2ceb9b7
    • KAMEZAWA Hiroyuki's avatar
      memory cgroup enhancements: add status accounting function for memory cgroup · d52aa412
      KAMEZAWA Hiroyuki authored
      Add statistics account infrastructure for memory controller.  All account
      information is stored per-cpu and caller will not have to take lock or use
      atomic ops.  This will be used by memory.stat file later.
      
      CACHE includes swapcache now. I'd like to divide it to
      PAGECACHE and SWAPCACHE later.
      
      This patch adds 3 functions for accounting.
       * __mem_cgroup_stat_add() ... for usual routine.
       * __mem_cgroup_stat_add_safe ... for calling under irq_disabled section.
       * mem_cgroup_read_stat() ... for reading stat value.
       * renamed PAGECACHE to CACHE (because it may include swapcache *now*)
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix smp_processor_id-in-preemptible]
      [akpm@linux-foundation.org: uninline things]
      [akpm@linux-foundation.org: remove dead code]
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarYAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Paul Menage <menage@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d52aa412
    • KAMEZAWA Hiroyuki's avatar
      memory cgroup enhancements: remember "a page is on active list of cgroup or not" · 3564c7c4
      KAMEZAWA Hiroyuki authored
      Remember page_cgroup is on active_list or not in page_cgroup->flags.
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarYAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3564c7c4
    • Hugh Dickins's avatar
      memcgroup: fix hang with shmem/tmpfs · 82369553
      Hugh Dickins authored
      The memcgroup regime relies upon a cgroup reclaiming pages from itself within
      add_to_page_cache: which may involve some waiting.  Whereas shmem and tmpfs
      rely upon using add_to_page_cache while holding a spinlock: when it cannot
      wait.  The consequence is that when a cgroup reaches its limit, shmem_getpage
      just hangs - unless there is outside memory pressure too, neither kswapd nor
      radix_tree_preload get it out of the retry loop.
      
      In most cases we can mem_cgroup_cache_charge the page waitably first, to
      attach the page_cgroup in advance, so add_to_page_cache will do no more than
      increment a count; then mem_cgroup_uncharge_page after (in both success and
      failure cases) to balance the books again.
      
      And where there used to be a congestion_wait for kswapd (recently made
      redundant by radix_tree_preload), use mem_cgroup_cache_charge with NULL page
      to go through a cycle of allocation and freeing, without accounting to any
      particular page, and without updating the statistics vector.  This brings the
      cgroup below its limit so the next try usually succeeds.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      82369553
    • Hugh Dickins's avatar
      memcgroup: tidy up mem_cgroup_charge_common · 3be91277
      Hugh Dickins authored
      Tidy up mem_cgroup_charge_common before extending it.  Adjust some comments,
      but mainly clean up its loop: I've an aversion to loops full of continues,
      then a break or a goto at the bottom.  And the is_atomic test should be on the
      __GFP_WAIT bit, not GFP_ATOMIC bits.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3be91277
    • Balbir Singh's avatar
      Memory controller use rcu_read_lock() in mem_cgroup_cache_charge() · ac44d354
      Balbir Singh authored
      Hugh Dickins noticed that we were using rcu_dereference() without
      rcu_read_lock() in the cache charging routine. The patch below fixes
      this problem
      Signed-off-by: default avatarBalbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ac44d354
    • KAMEZAWA Hiroyuki's avatar
      memory cgroup enhancements: remember "a page is charged as page cache" · 217bc319
      KAMEZAWA Hiroyuki authored
      Add a flag to page_cgroup to remember "this page is
      charged as cache."
      cache here includes page caches and swap cache.
      This is useful for implementing precise accounting in memory cgroup.
      TODO:
        distinguish page-cache and swap-cache
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarYAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      217bc319
    • KAMEZAWA Hiroyuki's avatar
      memory cgroup enhancements: force_empty interface for dropping all account in empty cgroup · cc847582
      KAMEZAWA Hiroyuki authored
      This patch adds an interface "memory.force_empty".  Any write to this file
      will drop all charges in this cgroup if there is no task under.
      
      %echo 1 > /....../memory.force_empty
      
      will drop all charges of memory cgroup if cgroup's tasks is empty.
      
      This is useful to invoke rmdir() against memory cgroup successfully.
      
      Tested and worked well on x86_64/fake-NUMA system.
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cc847582
    • KAMEZAWA Hiroyuki's avatar
      memory cgroup enhancements: fix zone handling in try_to_free_mem_cgroup_page · 417eead3
      KAMEZAWA Hiroyuki authored
      Because NODE_DATA(node)->node_zonelists[] is guaranteed to contain all
      necessary zones, it is not necessary to use for_each_online_node.
      
      And this for_each_online_node() makes reclaim routine start always
      from node 0. This is not good. This patch makes reclaim start from
      caller's node and just use usual (default) zonelist order.
      
      [akpm@linux-foundation.org: fix warning]
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      417eead3
    • Hugh Dickins's avatar
      memcgroup: revert swap_state mods · fa1de900
      Hugh Dickins authored
      If we're charging rss and we're charging cache, it seems obvious that we
      should be charging swapcache - as has been done.  But in practice that
      doesn't work out so well: both swapin readahead and swapoff leave the
      majority of pages charged to the wrong cgroup (the cgroup that happened to
      read them in, rather than the cgroup to which they belong).
      
      (Which is why unuse_pte's GFP_KERNEL while holding pte lock never showed up
      as a problem: no allocation was ever done there, every page read being
      already charged to the cgroup which initiated the swapoff.)
      
      It all works rather better if we leave the charging to do_swap_page and
      unuse_pte, and do nothing for swapcache itself: revert mm/swap_state.c to
      what it was before the memory-controller patches.  This also speeds up
      significantly a contained process working at its limit: because it no
      longer needs to keep waiting for swap writeback to complete.
      
      Is it unfair that swap pages become uncharged once they're unmapped, even
      though they're still clearly private to particular cgroups?  For a short
      while, yes; but PageReclaim arranges for those pages to go to the end of
      the inactive list and be reclaimed soon if necessary.
      
      shmem/tmpfs pages are a distinct case: their charging also benefits from
      this change, but their second life on the lists as swapcache pages may
      prove more unfair - that I need to check next.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Acked-by: default avatarBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fa1de900
    • Hugh Dickins's avatar
      memcgroup: fix zone isolation OOM · 436c6541
      Hugh Dickins authored
      mem_cgroup_charge_common shows a tendency to OOM without good reason, when
      a memhog goes well beyond its rss limit but with plenty of swap available.
      Seen on x86 but not on PowerPC; seen when the next patch omits swapcache
      from memcgroup, but we presume it can happen without.
      
      mem_cgroup_isolate_pages is not quite satisfying reclaim's criteria for OOM
      avoidance.  Already it has to scan beyond the nr_to_scan limit when it
      finds a !LRU page or an active page when handling inactive or an inactive
      page when handling active.  It needs to do exactly the same when it finds a
      page from the wrong zone (the x86 tests had two zones, the PowerPC tests
      had only one).
      
      Don't increment scan and then decrement it in these cases, just move the
      incrementation down.  Fix recent off-by-one when checking against
      nr_to_scan.  Cut out "Check if the meta page went away from under us",
      presumably left over from early debugging: no amount of such checks could
      save us if this list really were being updated without locking.
      
      This change does make the unlimited scan while holding two spinlocks
      even worse - bad for latency and bad for containment; but that's a
      separate issue which is better left to be fixed a little later.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Acked-by: default avatarBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      436c6541
    • KAMEZAWA Hiroyuki's avatar
      bugfix for memory cgroup controller: avoid !PageLRU page in mem_cgroup_isolate_pages · ff7283fa
      KAMEZAWA Hiroyuki authored
      This patch makes mem_cgroup_isolate_pages() to be
      
        - ignore !PageLRU pages.
        - fixes the bug that isolation makes no progress if page_zone(page) != zone
          page once find. (just increment scan in this case.)
      
      kswapd and memory migration removes a page from list when it handles
      a page for reclaiming/migration.
      
      Because __isolate_lru_page() doesn't moves page !PageLRU pages, it will
      be safe to avoid touching !PageLRU() page and its page_cgroup.
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ff7283fa
    • KAMEZAWA Hiroyuki's avatar
      bugfix for memory cgroup controller: migration under memory controller fix · ae41be37
      KAMEZAWA Hiroyuki authored
      While using memory control cgroup, page-migration under it works as following.
      ==
       1. uncharge all refs at try to unmap.
       2. charge regs again remove_migration_ptes()
      ==
      This is simple but has following problems.
      ==
       The page is uncharged and charged back again if *mapped*.
          - This means that cgroup before migration can be different from one after
            migration
          - If page is not mapped but charged as page cache, charge is just ignored
            (because not mapped, it will not be uncharged before migration)
            This is memory leak.
      ==
      This patch tries to keep memory cgroup at page migration by increasing
      one refcnt during it. 3 functions are added.
      
       mem_cgroup_prepare_migration() --- increase refcnt of page->page_cgroup
       mem_cgroup_end_migration()     --- decrease refcnt of page->page_cgroup
       mem_cgroup_page_migration() --- copy page->page_cgroup from old page to
                                       new page.
      
      During migration
        - old page is under PG_locked.
        - new page is under PG_locked, too.
        - both old page and new page is not on LRU.
      
      These 3 facts guarantee that page_cgroup() migration has no race.
      
      Tested and worked well in x86_64/fake-NUMA box.
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ae41be37
    • KAMEZAWA Hiroyuki's avatar
      bugfix for memory controller: add helper function for assigning cgroup to page · 9175e031
      KAMEZAWA Hiroyuki authored
      This patch adds following functions.
         - clear_page_cgroup(page, pc)
         - page_cgroup_assign_new_page_group(page, pc)
      
      Mainly for cleanup.
      
      A manner "check page->cgroup again after lock_page_cgroup()" is
      implemented in straight way.
      
      A comment in mem_cgroup_uncharge() will be removed by force-empty patch
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9175e031
    • Rik van Riel's avatar
      kswapd should only wait on IO if there is IO · f1a9ee75
      Rik van Riel authored
      The current kswapd (and try_to_free_pages) code has an oddity where the
      code will wait on IO, even if there is no IO in flight.  This problem is
      notable especially when the system scans through many unfreeable pages,
      causing unnecessary stalls in the VM.
      
      Additionally, tasks without __GFP_FS or __GFP_IO in the direct reclaim path
      will sleep if a significant number of pages are encountered that should be
      written out.  This gives kswapd a chance to write out those pages, while
      the direct reclaim task sleeps.
      Signed-off-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f1a9ee75
    • David Rientjes's avatar
      oom: add sysctl to enable task memory dump · fef1bdd6
      David Rientjes authored
      Adds a new sysctl, 'oom_dump_tasks', that enables the kernel to produce a
      dump of all system tasks (excluding kernel threads) when performing an
      OOM-killing.  Information includes pid, uid, tgid, vm size, rss, cpu,
      oom_adj score, and name.
      
      This is helpful for determining why there was an OOM condition and which
      rogue task caused it.
      
      It is configurable so that large systems, such as those with several
      thousand tasks, do not incur a performance penalty associated with dumping
      data they may not desire.
      
      If an OOM was triggered as a result of a memory controller, the tasklist
      shall be filtered to exclude tasks that are not a member of the same
      cgroup.
      
      Cc: Andrea Arcangeli <andrea@suse.de>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fef1bdd6
    • David Rientjes's avatar
      memcontrol: move oom task exclusion to tasklist scan · 4c4a2214
      David Rientjes authored
      Creates a helper function to return non-zero if a task is a member of a
      memory controller:
      
      	int task_in_mem_cgroup(const struct task_struct *task,
      			       const struct mem_cgroup *mem);
      
      When the OOM killer is constrained by the memory controller, the exclusion
      of tasks that are not a member of that controller was previously misplaced
      and appeared in the badness scoring function.  It should be excluded
      during the tasklist scan in select_bad_process() instead.
      
      [akpm@linux-foundation.org: build fix]
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4c4a2214
    • Badari Pulavarty's avatar
      mem-controller gfp-mask fix · 4c6bc8dd
      Badari Pulavarty authored
      Need to strip __GFP_HIGHMEM flag while passing to mem_container_cache_charge().
      Signed-off-by: default avatarBadari Pulavarty <pbadari@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4c6bc8dd
    • Balbir Singh's avatar
      memory controller BUG_ON() · 35c754d7
      Balbir Singh authored
      Move mem_controller_cache_charge() above radix_tree_preload().
      radix_tree_preload() disables preemption, even though the gfp_mask passed
      contains __GFP_WAIT, we cannot really do __GFP_WAIT allocations, thus we
      hit a BUG_ON() in kmem_cache_alloc().
      
      This patch moves mem_controller_cache_charge() to above radix_tree_preload()
      for cache charging.
      Signed-off-by: default avatarBalbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      35c754d7
    • Hugh Dickins's avatar
      memcgroup: reinstate swapoff mod · 044d66c1
      Hugh Dickins authored
      This patch reinstates the "swapoff: scan ptes preemptibly" mod we started
      with: in due course it should be rendered down into the earlier patches,
      leaving us with a more straightforward mem_cgroup_charge mod to unuse_pte,
      allocating with GFP_KERNEL while holding no spinlock and no atomic kmap.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Acked-by: default avatarBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      044d66c1
    • David Rientjes's avatar
      memcontrol: move mm_cgroup to header file · 3062fc67
      David Rientjes authored
      Inline functions must preceed their use, so mm_cgroup() should be defined
      in linux/memcontrol.h.
      
      include/linux/memcontrol.h:48: warning: 'mm_cgroup' declared inline after
      	being called
      include/linux/memcontrol.h:48: warning: previous declaration of
      	'mm_cgroup' was here
      
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: nuther build fix]
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3062fc67
    • Balbir Singh's avatar
      Memory controller: make charging gfp mask aware · e1a1cd59
      Balbir Singh authored
      Nick Piggin pointed out that swap cache and page cache addition routines
      could be called from non GFP_KERNEL contexts.  This patch makes the
      charging routine aware of the gfp context.  Charging might fail if the
      cgroup is over it's limit, in which case a suitable error is returned.
      
      This patch was tested on a Powerpc box.  I am still looking at being able
      to test the path, through which allocations happen in non GFP_KERNEL
      contexts.
      
      [kamezawa.hiroyu@jp.fujitsu.com: problem with ZONE_MOVABLE]
      Signed-off-by: default avatarBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e1a1cd59
    • Balbir Singh's avatar
      Memory controller: make page_referenced() cgroup aware · bed7161a
      Balbir Singh authored
      Make page_referenced() cgroup aware.  Without this patch, page_referenced()
      can cause a page to be skipped while reclaiming pages.  This patch ensures
      that other cgroups do not hold pages in a particular cgroup hostage.  It
      is required to ensure that shared pages are freed from a cgroup when they
      are not actively referenced from the cgroup that brought them in
      Signed-off-by: default avatarBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bed7161a
    • Balbir Singh's avatar
      Memory controller: add switch to control what type of pages to limit · 8697d331
      Balbir Singh authored
      Choose if we want cached pages to be accounted or not.  By default both are
      accounted for.  A new set of tunables are added.
      
      echo -n 1 > mem_control_type
      
      switches the accounting to account for only mapped pages
      
      echo -n 3 > mem_control_type
      
      switches the behaviour back
      
      [bunk@kernel.org: mm/memcontrol.c: clenups]
      [akpm@linux-foundation.org: fix sparc32 build]
      Signed-off-by: default avatarBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: default avatarAdrian Bunk <bunk@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8697d331
    • Pavel Emelianov's avatar
      Memory controller: OOM handling · c7ba5c9e
      Pavel Emelianov authored
      Out of memory handling for cgroups over their limit. A task from the
      cgroup over limit is chosen using the existing OOM logic and killed.
      
      TODO:
      1. As discussed in the OLS BOF session, consider implementing a user
      space policy for OOM handling.
      
      [akpm@linux-foundation.org: fix build due to oom-killer changes]
      Signed-off-by: default avatarPavel Emelianov <xemul@openvz.org>
      Signed-off-by: default avatarBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c7ba5c9e
    • Balbir Singh's avatar
      Memory controller improve user interface · 0eea1030
      Balbir Singh authored
      Change the interface to use bytes instead of pages.  Page sizes can vary
      across platforms and configurations.  A new strategy routine has been added
      to the resource counters infrastructure to format the data as desired.
      
      Suggested by David Rientjes, Andrew Morton and Herbert Poetzl
      
      Tested on a UML setup with the config for memory control enabled.
      
      [kamezawa.hiroyu@jp.fujitsu.com: possible race fix in res_counter]
      Signed-off-by: default avatarBalbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: default avatarPavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0eea1030
    • Balbir Singh's avatar
      Memory controller: add per cgroup LRU and reclaim · 66e1707b
      Balbir Singh authored
      Add the page_cgroup to the per cgroup LRU.  The reclaim algorithm has
      been modified to make the isolate_lru_pages() as a pluggable component.  The
      scan_control data structure now accepts the cgroup on behalf of which
      reclaims are carried out.  try_to_free_pages() has been extended to become
      cgroup aware.
      
      [akpm@linux-foundation.org: fix warning]
      [Lee.Schermerhorn@hp.com: initialize all scan_control's isolate_pages member]
      [bunk@kernel.org: make do_try_to_free_pages() static]
      [hugh@veritas.com: memcgroup: fix try_to_free order]
      [kamezawa.hiroyu@jp.fujitsu.com: this unlock_page_cgroup() is unnecessary]
      Signed-off-by: default avatarPavel Emelianov <xemul@openvz.org>
      Signed-off-by: default avatarBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      66e1707b
    • Balbir Singh's avatar
      Memory controller: task migration · 67e465a7
      Balbir Singh authored
      Allow tasks to migrate from one cgroup to the other.  We migrate
      mm_struct's mem_cgroup only when the thread group id migrates.
      Signed-off-by: default avatarBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      67e465a7