Commit 8c7c6e34 authored by KAMEZAWA Hiroyuki, committed by Linus Torvalds

memcg: mem+swap controller core

This patch implements a per-cgroup limit on memory+swap usage.  Because
a page can sit in SwapCache, care is taken to avoid double counting the
swap-cache page and its swap entry.

The mem+swap controller works as follows.
  - memory usage is limited by memory.limit_in_bytes.
  - memory + swap usage is limited by memory.memsw.limit_in_bytes.
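
As a rough illustration of the two limits, the sketch below charges a page
against both a memory counter and a mem+swap counter and rolls back the
first charge if the second fails. The struct and function names are
hypothetical stand-ins, not the kernel's res_counter API, and locking and
reclaim are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a resource counter (not the kernel's res_counter). */
struct counter {
	uint64_t usage;
	uint64_t limit;
};

/* One cgroup in this sketch: two counters, two limits. */
struct group {
	struct counter mem;	/* bounded by memory.limit_in_bytes */
	struct counter memsw;	/* bounded by memory.memsw.limit_in_bytes */
};

/* Add bytes to the counter unless that would exceed its limit. */
static bool counter_charge(struct counter *c, uint64_t bytes)
{
	assert(c->usage <= c->limit);
	if (bytes > c->limit - c->usage)
		return false;
	c->usage += bytes;
	return true;
}

static void counter_uncharge(struct counter *c, uint64_t bytes)
{
	assert(c->usage >= bytes);
	c->usage -= bytes;
}

/* Charging a mapped page must succeed on both counters or on neither. */
static bool group_charge_page(struct group *g, uint64_t bytes)
{
	if (!counter_charge(&g->mem, bytes))
		return false;
	if (!counter_charge(&g->memsw, bytes)) {
		counter_uncharge(&g->mem, bytes);	/* roll back */
		return false;
	}
	return true;
}
```

In the real controller a failed charge would normally trigger reclaim
before giving up; the sketch only shows the bookkeeping.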

This has the following benefits.
  - A user can limit total resource usage of mem+swap.

    Without this, because the memory resource controller does not account
    swap usage, a process can exhaust all of swap (for example, through a
    memory leak).  A mem+swap limit prevents this.

    Swap is also a shared resource, but a swap entry cannot be reclaimed
    (brought back into memory) until it is used.  This can cause trouble
    when memory is partitioned by cpuset or memcg.
    Assume group A and group B.
    After some applications run, the system can end up like this:

    Group A -- has a very large amount of free memory but occupies 99% of swap.
    Group B -- is short of memory but cannot use swap because it is nearly full.

    The ability to set an appropriate swap limit for each group is required.

Someone may wonder "why not swap but mem+swap?"

  - The global LRU (kswapd) can swap out arbitrary pages.  Swapping a page
    out moves its accounting from memory to swap, so the usage of mem+swap
    does not change.

    In other words, when we want to limit swap usage without affecting the
    global LRU, a mem+swap limit is better than a swap-only limit.
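
The argument above can be checked with a small model. The sketch below is
only an illustration with made-up names: it shows that a global swap-out
can push a group over a hypothetical swap-only limit, while the group's
mem+swap total is untouched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-group usage, in bytes. */
struct usage {
	uint64_t mem;
	uint64_t swap;
};

/* Global reclaim swaps out pages of the group without asking it. */
static void global_swap_out(struct usage *u, uint64_t bytes)
{
	assert(u->mem >= bytes);
	u->mem -= bytes;
	u->swap += bytes;	/* mem + swap is unchanged */
}

/* A swap-only limit can be violated by global reclaim alone. */
static bool over_swap_only_limit(const struct usage *u, uint64_t swap_limit)
{
	return u->swap > swap_limit;
}

/* A mem+swap limit cannot, because the sum never moves on swap-out. */
static bool over_memsw_limit(const struct usage *u, uint64_t memsw_limit)
{
	return u->mem + u->swap > memsw_limit;
}
```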

The accounting target is stored in swap_cgroup, which is a
per-swap-entry record.
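
A per-swap-entry record can be pictured as a table indexed by swap offset
that remembers which group owns each slot. The sketch below is a simplified
model under that assumption; the array, its size, and the function names
are hypothetical and are not the swap_cgroup interface itself.

```c
#include <assert.h>
#include <stddef.h>

#define NR_SWAP_SLOTS 8	/* illustrative size only */
#define NO_OWNER 0	/* id 0 means "nobody is charged for this slot" */

/* One owner id per swap slot; zero-initialized, so every slot starts unowned. */
static unsigned short swap_owner[NR_SWAP_SLOTS];

/* Store a new owner id for the slot and return the previous one. */
static unsigned short swap_owner_record(size_t offset, unsigned short id)
{
	unsigned short old;

	assert(offset < NR_SWAP_SLOTS);
	old = swap_owner[offset];
	swap_owner[offset] = id;
	return old;
}

/* Look up the current owner id of the slot without changing it. */
static unsigned short swap_owner_lookup(size_t offset)
{
	assert(offset < NR_SWAP_SLOTS);
	return swap_owner[offset];
}
```

Recording NO_OWNER doubles as "clear the record", and the returned old id
tells the caller which group, if any, still has to be uncharged.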

Charging is done as follows.
  map
    - charge page and memsw.

  unmap
    - uncharge page and memsw if the page is not SwapCache.

  swap-out (__delete_from_swap_cache)
    - uncharge page.
    - record mem_cgroup information in swap_cgroup.

  swap-in (do_swap_page)
    - charge page and memsw.
      The record in swap_cgroup is cleared and
      memsw accounting is decremented.

  swap-free (swap_free())
    - if the swap entry is freed, memsw is uncharged by PAGE_SIZE.
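
The five events above can be traced with a toy model. This is a sketch
under simplifying assumptions (no limits, no locking, one page per event);
all names are hypothetical and PAGE_SZ stands in for PAGE_SIZE.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SZ 4096UL	/* stand-in for PAGE_SIZE */
#define NR_SLOTS 4	/* illustrative number of swap slots */

struct group {
	unsigned long mem;	/* bytes charged as memory */
	unsigned long memsw;	/* bytes charged as memory + swap */
};

/* Owner recorded for each swap slot (the per-swap-entry record). */
static struct group *slot_owner[NR_SLOTS];

/* map: charge page and memsw. */
static void ev_map(struct group *g)
{
	g->mem += PAGE_SZ;
	g->memsw += PAGE_SZ;
}

/* unmap of a page that is not SwapCache: uncharge page and memsw. */
static void ev_unmap(struct group *g)
{
	assert(g->mem >= PAGE_SZ && g->memsw >= PAGE_SZ);
	g->mem -= PAGE_SZ;
	g->memsw -= PAGE_SZ;
}

/* swap-out: uncharge page only, remember the owner in the slot. */
static void ev_swap_out(struct group *g, size_t slot)
{
	assert(slot < NR_SLOTS && g->mem >= PAGE_SZ);
	g->mem -= PAGE_SZ;
	slot_owner[slot] = g;
}

/* Clear the slot's record and drop the swap share from its owner's memsw. */
static void uncharge_slot(size_t slot)
{
	struct group *old;

	assert(slot < NR_SLOTS);
	old = slot_owner[slot];
	slot_owner[slot] = NULL;
	if (old) {
		assert(old->memsw >= PAGE_SZ);
		old->memsw -= PAGE_SZ;
	}
}

/* swap-in: charge page and memsw, then clear the record. */
static void ev_swap_in(struct group *g, size_t slot)
{
	ev_map(g);
	uncharge_slot(slot);
}

/* swap-free: uncharge memsw only if a record is still present. */
static void ev_swap_free(size_t slot)
{
	uncharge_slot(slot);
}
```

Because swap-in already clears the record, a later swap-free of the same
slot finds nothing to uncharge, which is how double counting is avoided in
this model.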

Some people work in never-swap environments and consider swap something
bad.  For them, this mem+swap controller extension is pure overhead.  The
overhead can be avoided with a config or boot option.
(See Kconfig; the details are not in this patch.)
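
One common way to make such a feature optional is a compile-time switch
plus a runtime flag. The sketch below only illustrates that pattern: the
macro SKETCH_SWAP_ACCOUNTING and the flag swap_accounting_enabled are
hypothetical names, not the Kconfig symbol or the boot option of this
series.

```c
#include <assert.h>
#include <stdbool.h>

#ifndef SKETCH_SWAP_ACCOUNTING	/* hypothetical stand-in for the config option */
#define SKETCH_SWAP_ACCOUNTING 1
#endif

#if SKETCH_SWAP_ACCOUNTING
/* Hypothetical runtime switch; a boot option could clear it. */
static bool swap_accounting_enabled = true;

static void uncharge_swap(unsigned long *memsw, unsigned long bytes)
{
	if (!swap_accounting_enabled)
		return;		/* feature compiled in but switched off */
	assert(*memsw >= bytes);
	*memsw -= bytes;
}
#else
/* Feature compiled out: the hook is an empty inline with no cost. */
static inline void uncharge_swap(unsigned long *memsw, unsigned long bytes)
{
	(void)memsw;
	(void)bytes;
}
#endif
```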

TODO:
 - maybe more optimization can be done in the swap-in path (but it is not
   very safe).  For now we just do simple accounting.

[nishimura@mxp.nes.nec.co.jp: make resize limit hold mutex]
[hugh@veritas.com: memswap controller core swapcache fixes]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
parent 27a7faa0
@@ -137,12 +137,32 @@ behind this approach is that a cgroup that aggressively uses a shared
 page will eventually get charged for it (once it is uncharged from
 the cgroup that brought it in -- this will happen on memory pressure).
-Exception: When you do swapoff and make swapped-out pages of shmem(tmpfs) to
+Exception: If CONFIG_CGROUP_CGROUP_MEM_RES_CTLR_SWAP is not used..
+When you do swapoff and make swapped-out pages of shmem(tmpfs) to
 be backed into memory in force, charges for pages are accounted against the
 caller of swapoff rather than the users of shmem.
-2.4 Reclaim
+2.4 Swap Extension (CONFIG_CGROUP_MEM_RES_CTLR_SWAP)
+Swap Extension allows you to record charge for swap. A swapped-in page is
+charged back to original page allocator if possible.
+When swap is accounted, following files are added.
+- memory.memsw.usage_in_bytes.
+- memory.memsw.limit_in_bytes.
+usage of mem+swap is limited by memsw.limit_in_bytes.
+Note: why 'mem+swap' rather than swap.
+The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
+to move account from memory to swap...there is no change in usage of
+mem+swap.
+In other words, when we want to limit the usage of swap without affecting
+global LRU, mem+swap limit is better than just limiting swap from OS point
+of view.
+2.5 Reclaim
 Each cgroup maintains a per cgroup LRU that consists of an active
 and inactive list. When a cgroup goes over its limit, we first try
@@ -246,6 +266,11 @@ Such charges are freed(at default) or moved to its parent. When moved,
 both of RSS and CACHES are moved to parent.
 If both of them are busy, rmdir() returns -EBUSY. See 5.1 Also.
+Charges recorded in swap information is not updated at removal of cgroup.
+Recorded information is discarded and a cgroup which uses swap (swapcache)
+will be charged as a new owner of it.
 5. Misc. interfaces.
 5.1 force_empty
......
@@ -32,6 +32,8 @@ extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
 /* for swap handling */
 extern int mem_cgroup_try_charge(struct mm_struct *mm,
 		gfp_t gfp_mask, struct mem_cgroup **ptr);
+extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
+		struct page *page, gfp_t mask, struct mem_cgroup **ptr);
 extern void mem_cgroup_commit_charge_swapin(struct page *page,
 					struct mem_cgroup *ptr);
 extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr);
@@ -80,7 +82,6 @@ extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
 extern int do_swap_account;
 #endif
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;
@@ -97,7 +98,13 @@ static inline int mem_cgroup_cache_charge(struct page *page,
 }
 static inline int mem_cgroup_try_charge(struct mm_struct *mm,
 				gfp_t gfp_mask, struct mem_cgroup **ptr)
+{
+	return 0;
+}
+static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
+		struct page *page, gfp_t gfp_mask, struct mem_cgroup **ptr)
 {
 	return 0;
 }
......
@@ -214,7 +214,7 @@ static inline void lru_cache_add_active_file(struct page *page)
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask);
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
-							gfp_t gfp_mask);
+							gfp_t gfp_mask, bool noswap);
 extern int __isolate_lru_page(struct page *page, int mode, int file);
 extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
@@ -336,7 +336,7 @@ static inline void disable_swap_token(void)
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 extern int mem_cgroup_cache_charge_swapin(struct page *page,
 				struct mm_struct *mm, gfp_t mask, bool locked);
-extern void mem_cgroup_uncharge_swapcache(struct page *page);
+extern void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent);
 #else
 static inline
 int mem_cgroup_cache_charge_swapin(struct page *page,
@@ -344,7 +344,15 @@ int mem_cgroup_cache_charge_swapin(struct page *page,
 {
 	return 0;
 }
-static inline void mem_cgroup_uncharge_swapcache(struct page *page)
+static inline void
+mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
+{
+}
+#endif
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
+extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
+#else
+static inline void mem_cgroup_uncharge_swap(swp_entry_t ent)
 {
 }
 #endif
......
This diff is collapsed.
@@ -2431,7 +2431,8 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	lock_page(page);
 	delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
-	if (mem_cgroup_try_charge(mm, GFP_HIGHUSER_MOVABLE, &ptr) == -ENOMEM) {
+	if (mem_cgroup_try_charge_swapin(mm, page,
+				GFP_HIGHUSER_MOVABLE, &ptr) == -ENOMEM) {
 		ret = VM_FAULT_OOM;
 		unlock_page(page);
 		goto out;
@@ -2449,8 +2450,20 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto out_nomap;
 	}
-	/* The page isn't present yet, go ahead with the fault. */
+	/*
+	 * The page isn't present yet, go ahead with the fault.
+	 *
+	 * Be careful about the sequence of operations here.
+	 * To get its accounting right, reuse_swap_page() must be called
+	 * while the page is counted on swap but not yet in mapcount i.e.
+	 * before page_add_anon_rmap() and swap_free(); try_to_free_swap()
+	 * must be called after the swap_free(), or it will never succeed.
+	 * And mem_cgroup_commit_charge_swapin(), which uses the swp_entry
+	 * in page->private, must be called before reuse_swap_page(),
+	 * which may delete_from_swap_cache().
+	 */
+	mem_cgroup_commit_charge_swapin(page, ptr);
 	inc_mm_counter(mm, anon_rss);
 	pte = mk_pte(page, vma->vm_page_prot);
 	if (write_access && reuse_swap_page(page)) {
@@ -2461,7 +2474,6 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	flush_icache_page(vma, page);
 	set_pte_at(mm, address, page_table, pte);
 	page_add_anon_rmap(page, vma, address);
-	mem_cgroup_commit_charge_swapin(page, ptr);
 	swap_free(entry);
 	if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
......
@@ -17,6 +17,7 @@
 #include <linux/backing-dev.h>
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
+#include <linux/page_cgroup.h>
 #include <asm/pgtable.h>
@@ -108,6 +109,8 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask)
  */
 void __delete_from_swap_cache(struct page *page)
 {
+	swp_entry_t ent = {.val = page_private(page)};
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(!PageSwapCache(page));
 	VM_BUG_ON(PageWriteback(page));
@@ -118,7 +121,7 @@ void __delete_from_swap_cache(struct page *page)
 	total_swapcache_pages--;
 	__dec_zone_page_state(page, NR_FILE_PAGES);
 	INC_CACHE_INFO(del_total);
-	mem_cgroup_uncharge_swapcache(page);
+	mem_cgroup_uncharge_swapcache(page, ent);
 }
 /**
......
@@ -471,8 +471,9 @@ static struct swap_info_struct * swap_info_get(swp_entry_t entry)
 	return NULL;
 }
-static int swap_entry_free(struct swap_info_struct *p, unsigned long offset)
+static int swap_entry_free(struct swap_info_struct *p, swp_entry_t ent)
 {
+	unsigned long offset = swp_offset(ent);
 	int count = p->swap_map[offset];
 	if (count < SWAP_MAP_MAX) {
@@ -487,6 +488,7 @@ static int swap_entry_free(struct swap_info_struct *p, unsigned long offset)
 				swap_list.next = p - swap_info;
 			nr_swap_pages++;
 			p->inuse_pages--;
+			mem_cgroup_uncharge_swap(ent);
 		}
 	}
 	return count;
@@ -502,7 +504,7 @@ void swap_free(swp_entry_t entry)
 	p = swap_info_get(entry);
 	if (p) {
-		swap_entry_free(p, swp_offset(entry));
+		swap_entry_free(p, entry);
 		spin_unlock(&swap_lock);
 	}
 }
@@ -582,7 +584,7 @@ int free_swap_and_cache(swp_entry_t entry)
 	p = swap_info_get(entry);
 	if (p) {
-		if (swap_entry_free(p, swp_offset(entry)) == 1) {
+		if (swap_entry_free(p, entry) == 1) {
 			page = find_get_page(&swapper_space, entry.val);
 			if (page && !trylock_page(page)) {
 				page_cache_release(page);
@@ -696,7 +698,8 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	pte_t *pte;
 	int ret = 1;
-	if (mem_cgroup_try_charge(vma->vm_mm, GFP_HIGHUSER_MOVABLE, &ptr))
+	if (mem_cgroup_try_charge_swapin(vma->vm_mm, page,
+					GFP_HIGHUSER_MOVABLE, &ptr))
 		ret = -ENOMEM;
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
......
@@ -1661,7 +1661,8 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
-						gfp_t gfp_mask)
+						gfp_t gfp_mask,
+						bool noswap)
 {
 	struct scan_control sc = {
 		.may_writepage = !laptop_mode,
@@ -1674,6 +1675,9 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 	};
 	struct zonelist *zonelist;
+	if (noswap)
+		sc.may_swap = 0;
 	sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
 			(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
 	zonelist = NODE_DATA(numa_node_id())->node_zonelists;
......