1. 04 Sep, 2015 1 commit
    • Sebastian Andrzej Siewior's avatar
      mm: memcontrol: bring back the VM_BUG_ON() in mem_cgroup_swapout() · ce9ce665
      Sebastian Andrzej Siewior authored
      Clark stumbled over a VM_BUG_ON() in -RT which was then was removed by
      Johannes in commit f371763a
      
       ("mm: memcontrol: fix false-positive
      VM_BUG_ON() on -rt").  The comment before that patch was a tiny bit better
      than it is now.  While the patch claimed to fix a false-postive on -RT
      this was not the case.  None of the -RT folks ACKed it and it was not a
      false positive report.  That was a *real* problem.
      
      This patch updates the comment that is improper because it refers to
      "disabled preemption" as a consequence of that lock being taken.  A
      spin_lock() disables preemption, true, but in this case the code relies on
      the fact that the lock _also_ disables interrupts once it is acquired.
      And this is the important detail (which was checked the VM_BUG_ON()) which
      needs to be pointed out.  This is the hint one needs while looking at the
      code.  It was explained by Johannes on the list that the per-CPU variables
      are protected by local_irq_save().  The BUG_ON() was helpful.  This code
      has been workarounded in -RT in the meantime.  I wouldn't mind running
      into more of those if the code in question uses *special* kind of locking
      since now there is no verification (in terms of lockdep or BUG_ON()) and
      therefore I bring the VM_BUG_ON() check back in.
      
      The two functions after the comment could also have a "local_irq_save()"
      dance around them in order to serialize access to the per-CPU variables.
      This has been avoided because the interrupts should be off.
      Signed-off-by: default avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Clark Williams <williams@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ce9ce665
  2. 25 Jun, 2015 4 commits
    • Tejun Heo's avatar
      memcg: convert mem_cgroup->under_oom from atomic_t to int · c2b42d3c
      Tejun Heo authored
      
      memcg->under_oom tracks whether the memcg is under OOM conditions and is
      an atomic_t counter managed with mem_cgroup_[un]mark_under_oom().  While
      atomic_t appears to be simple synchronization-wise, when used as a
      synchronization construct like here, it's trickier and more error-prone
      due to weak memory ordering rules, especially around atomic_read(), and
      false sense of security.
      
      For example, both non-trivial read sites of memcg->under_oom are a bit
      problematic although not being actually broken.
      
      * mem_cgroup_oom_register_event()
      
        It isn't explicit what guarantees the memory ordering between event
        addition and memcg->under_oom check.  This isn't broken only because
        memcg_oom_lock is used for both event list and memcg->oom_lock.
      
      * memcg_oom_recover()
      
        The lockless test doesn't have any explanation why this would be
        safe.
      
      mem_cgroup_[un]mark_under_oom() are very cold paths and there's no point
      in avoiding locking memcg_oom_lock there.  This patch converts
      memcg->under_oom from atomic_t to int, puts their modifications under
      memcg_oom_lock and documents why the lockless test in
      memcg_oom_recover() is safe.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c2b42d3c
    • Tejun Heo's avatar
      memcg: remove unused mem_cgroup->oom_wakeups · f4b90b70
      Tejun Heo authored
      Since commit 49426420
      
       ("mm: memcg: handle non-error OOM situations
      more gracefully"), nobody uses mem_cgroup->oom_wakeups.  Remove it.
      
      While at it, also fold memcg_wakeup_oom() into memcg_oom_recover() which
      is its only user.  This cleanup was suggested by Michal.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f4b90b70
    • Johannes Weiner's avatar
      mm: oom_kill: simplify OOM killer locking · dc56401f
      Johannes Weiner authored
      
      The zonelist locking and the oom_sem are two overlapping locks that are
      used to serialize global OOM killing against different things.
      
      The historical zonelist locking serializes OOM kills from allocations with
      overlapping zonelists against each other to prevent killing more tasks
      than necessary in the same memory domain.  Only when neither tasklists nor
      zonelists from two concurrent OOM kills overlap (tasks in separate memcgs
      bound to separate nodes) are OOM kills allowed to execute in parallel.
      
      The younger oom_sem is a read-write lock to serialize OOM killing against
      the PM code trying to disable the OOM killer altogether.
      
      However, the OOM killer is a fairly cold error path, there is really no
      reason to optimize for highly performant and concurrent OOM kills.  And
      the oom_sem is just flat-out redundant.
      
      Replace both locking schemes with a single global mutex serializing OOM
      kills regardless of context.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acke...
      dc56401f
    • Johannes Weiner's avatar
      mm: oom_kill: clean up victim marking and exiting interfaces · 16e95196
      Johannes Weiner authored
      
      Rename unmark_oom_victim() to exit_oom_victim().  Marking and unmarking
      are related in functionality, but the interface is not symmetrical at
      all: one is an internal OOM killer function used during the killing, the
      other is for an OOM victim to signal its own death on exit later on.
      This has locking implications, see follow-up changes.
      
      While at it, rename mark_tsk_oom_victim() to mark_oom_victim(), which
      is easier on the eye.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      16e95196
  3. 10 Jun, 2015 2 commits
  4. 02 Jun, 2015 8 commits
    • Tejun Heo's avatar
      writeback: implement memcg writeback domain based throttling · c2aa723a
      Tejun Heo authored
      
      While cgroup writeback support now connects memcg and blkcg so that
      writeback IOs are properly attributed and controlled, the IO back
      pressure propagation mechanism implemented in balance_dirty_pages()
      and its subroutines wasn't aware of cgroup writeback.
      
      Processes belonging to a memcg may have access to only subset of total
      memory available in the system and not factoring this into dirty
      throttling rendered it completely ineffective for processes under
      memcg limits and memcg ended up building a separate ad-hoc degenerate
      mechanism directly into vmscan code to limit page dirtying.
      
      The previous patches updated balance_dirty_pages() and its subroutines
      so that they can deal with multiple wb_domain's (writeback domains)
      and defined per-memcg wb_domain.  Processes belonging to a non-root
      memcg are bound to two wb_domains, global wb_domain and memcg
      wb_domain, and should be throttled according to IO pressures from both
      domains.  This patch updates dirty throttling code so that it repeats
      similar calculations for the two domains - the differences between the
      two are few and minor - and applies the lower of the two sets of
      resulting constraints.
      
      wb_over_bg_thresh(), which controls when background writeback
      terminates, is also updated to consider both global and memcg
      wb_domains.  It returns true if dirty is over bg_thresh for either
      domain.
      
      This makes the dirty throttling mechanism operational for memcg
      domains including writeback-bandwidth-proportional dirty page
      distribution inside them but the ad-hoc memcg throttling mechanism in
      vmscan is still in place.  The next patch will rip it out.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      c2aa723a
    • Tejun Heo's avatar
      writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes · 2529bb3a
      Tejun Heo authored
      
      The amount of available memory to a memcg wb_domain can change as
      memcg configuration changes.  A domain's ->dirty_limit exists to
      smooth out sudden drops in dirty threshold; however, when a domain's
      size actually drops significantly, it hinders the dirty throttling
      from adjusting to the new configuration leading to unexpected
      behaviors including unnecessary OOM kills.
      
      This patch resolves the issue by adding wb_domain_size_changed() which
      resets ->dirty_limit[_tstmp] and making memcg call it on configuration
      changes.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      2529bb3a
    • Tejun Heo's avatar
      writeback: implement memcg wb_domain · 841710aa
      Tejun Heo authored
      
      Dirtyable memory is distributed to a wb (bdi_writeback) according to
      the relative bandwidth the wb is writing out in the whole system.
      This distribution is global - each wb is measured against all other
      wb's and gets the proportinately sized portion of the memory in the
      whole system.
      
      For cgroup writeback, the amount of dirtyable memory is scoped by
      memcg and thus each wb would need to be measured and controlled in its
      memcg.  IOW, a wb will belong to two writeback domains - the global
      and memcg domains.
      
      The previous patches laid the groundwork to support the two wb_domains
      and this patch implements memcg wb_domain.  memcg->cgwb_domain is
      initialized on css online and destroyed on css release,
      wb->memcg_completions is added, and __wb_writeout_inc() is updated to
      increment completions against both global and memcg wb_domains.
      
      The following patches will update balance_dirty_pages() and its
      subroutines to actually consider memcg wb_domain for throttling.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      841710aa
    • Tejun Heo's avatar
      memcg: make mem_cgroup_read_{stat|event}() iterate possible cpus instead of online · 733a572e
      Tejun Heo authored
      
      cpu_possible_mask represents the CPUs which are actually possible
      during that boot instance.  For systems which don't support CPU
      hotplug, this will match cpu_online_mask exactly in most cases.  Even
      for systems which support CPU hotplug, the number of possible CPU
      slots is highly unlikely to diverge greatly from the number of online
      CPUs.  The only cases where the difference between possible and online
      caused problems were when the boot code failed to initialize the
      possible mask and left it fully set at NR_CPUS - 1.
      
      As such, most per-cpu constructs allocate for all possible CPUs and
      often iterate over the possibles, which also has the benefit of
      avoiding the blocking CPU hotplug synchronization.
      
      memcg open codes per-cpu stat counting for mem_cgroup_read_stat() and
      mem_cgroup_read_events(), which iterates over online CPUs and handles
      CPU hotplug operations explicitly.  This complexity doesn't actually
      buy anything.  Switch to iterating over the possibles and drop the
      explicit CPU hotplug handling.
      
      Eventually, we want to convert memcg to use percpu_counter instead of
      its own custom implementation which also benefits from quick access
      w/o summing for cases where larger error margin is acceptable.
      
      This will allow mem_cgroup_read_stat() to be called from non-sleepable
      contexts which will be used by cgroup writeback.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      733a572e
    • Tejun Heo's avatar
      writeback: make backing_dev_info host cgroup-specific bdi_writebacks · 52ebea74
      Tejun Heo authored
      
      For the planned cgroup writeback support, on each bdi
      (backing_dev_info), each memcg will be served by a separate wb
      (bdi_writeback).  This patch updates bdi so that a bdi can host
      multiple wbs (bdi_writebacks).
      
      On the default hierarchy, blkcg implicitly enables memcg.  This allows
      using memcg's page ownership for attributing writeback IOs, and every
      memcg - blkcg combination can be served by its own wb by assigning a
      dedicated wb to each memcg.  This means that there may be multiple
      wb's of a bdi mapped to the same blkcg.  As congested state is per
      blkcg - bdi combination, those wb's should share the same congested
      state.  This is achieved by tracking congested state via
      bdi_writeback_congested structs which are keyed by blkcg.
      
      bdi->wb remains unchanged and will keep serving the root cgroup.
      cgwb's (cgroup wb's) for non-root cgroups are created on-demand or
      looked up while dirtying an inode according to the memcg of the page
      being dirtied or current task.  Each cgwb is indexed on bdi->cgwb_tree
      by its memcg id.  Once an inode is associated with its wb, it can be
      retrieved using inode_to_wb().
      
      Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
      pages will keep being associated with bdi->wb.
      
      v3: inode_attach_wb() in account_page_dirtied() moved inside
          mapping_cap_account_dirty() block where it's known to be !NULL.
          Also, an unnecessary NULL check before kfree() removed.  Both
          detected by the kbuild bot.
      
      v2: Updated so that wb association is per inode and wb is per memcg
          rather than blkcg.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: kbuild test robot <fengguang.wu@intel.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      52ebea74
    • Tejun Heo's avatar
      memcg: implement mem_cgroup_css_from_page() · ad7fa852
      Tejun Heo authored
      
      Implement mem_cgroup_css_from_page() which returns the
      cgroup_subsys_state of the memcg associated with a given page on the
      default hierarchy.  This will be used by cgroup writeback support.
      
      This function assumes that page->mem_cgroup association doesn't change
      until the page is released, which is true on the default hierarchy as
      long as replace_page_cache_page() is not used.  As the only user of
      replace_page_cache_page() is FUSE which won't support cgroup writeback
      for the time being, this works for now, and replace_page_cache_page()
      will soon be updated so that the invariant actually holds.
      
      Note that the RCU protected page->mem_cgroup access is consistent with
      other usages across memcg but ultimately incorrect.  These unlocked
      accesses are missing required barriers.  page->mem_cgroup should be
      made an RCU pointer and updated and accessed using RCU operations.
      
      v4: Instead of triggering WARN, return the root css on the traditional
          hierarchies.  This makes the function a lot easier to deal with
          especially as there's no light way to synchronize against
          hierarchy rebinding.
      
      v3: s/mem_cgroup_migrate()/mem_cgroup_css_from_page()/
      
      v2: Trigger WARN if the function is used on the traditional
          hierarchies and add comment about the assumed invariant.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      ad7fa852
    • Tejun Heo's avatar
      memcg: add mem_cgroup_root_css · 56161634
      Tejun Heo authored
      
      Add global mem_cgroup_root_css which points to the root memcg css.
      This will be used by cgroup writeback support.  If memcg is disabled,
      it's defined as ERR_PTR(-EINVAL).
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      aCc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      56161634
    • Greg Thelen's avatar
      memcg: add per cgroup dirty page accounting · c4843a75
      Greg Thelen authored
      When modifying PG_Dirty on cached file pages, update the new
      MEM_CGROUP_STAT_DIRTY counter.  This is done in the same places where
      global NR_FILE_DIRTY is managed.  The new memcg stat is visible in the
      per memcg memory.stat cgroupfs file.  The most recent past attempt at
      this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632
      
      The new accounting supports future efforts to add per cgroup dirty
      page throttling and writeback.  It also helps an administrator break
      down a container's memory usage and provides evidence to understand
      memcg oom kills (the new dirty count is included in memcg oom kill
      messages).
      
      The ability to move page accounting between memcg
      (memory.move_charge_at_immigrate) makes this accounting more
      complicated than the global counter.  The existing
      mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
      accounting with stat updates.
      Typical update operation:
      	memcg = mem_cgroup_begin_page_stat(page)
      	if (TestSetPageDirty()) {
      		[...]
      		mem_cgroup_update_page_stat(memcg)
      	}
      	mem_cgroup_end_page_stat(memcg)
      
      Summary of mem_cgroup_end_page_stat() overhead:
      - Without CONFIG_MEMCG it's a no-op
      - With CONFIG_MEMCG and no inter memcg task movement, it's just
        rcu_read_lock()
      - With CONFIG_MEMCG and inter memcg  task movement, it's
        rcu_read_lock() + spin_lock_irqsave()
      
      A memcg parameter is added to several routines because their callers
      now grab mem_cgroup_begin_page_stat() which returns the memcg later
      needed by for mem_cgroup_update_page_stat().
      
      Because mem_cgroup_begin_page_stat() may disable interrupts, some
      adjustments are needed:
      - move __mark_inode_dirty() from __set_page_dirty() to its caller.
        __mark_inode_dirty() locking does not want interrupts disabled.
      - use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
        __delete_from_page_cache(), replace_page_cache_page(),
        invalidate_complete_page2(), and __remove_mapping().
      
         text    data     bss      dec    hex filename
      8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
      8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
                                  +192 text bytes
      8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
      8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
                                  +773 text bytes
      
      Performance tests run on v4.0-rc1-36-g4f671fe2
      
      .  Lower is better for
      all metrics, they're all wall clock or cycle counts.  The read and write
      fault benchmarks just measure fault time, they do not include I/O time.
      
      * CONFIG_MEMCG not set:
                                  baseline                              patched
        kbuild                 1m25.030000(+-0.088% 3 samples)       1m25.426667(+-0.120% 3 samples)
        dd write 100 MiB          0.859211561 +-15.10%                  0.874162885 +-15.03%
        dd write 200 MiB          1.670653105 +-17.87%                  1.669384764 +-11.99%
        dd write 1000 MiB         8.434691190 +-14.15%                  8.474733215 +-14.77%
        read fault cycles       254.0(+-0.000% 10 samples)            253.0(+-0.000% 10 samples)
        write fault cycles     2021.2(+-3.070% 10 samples)           1984.5(+-1.036% 10 samples)
      
      * CONFIG_MEMCG=y root_memcg:
                                  baseline                              patched
        kbuild                 1m25.716667(+-0.105% 3 samples)       1m25.686667(+-0.153% 3 samples)
        dd write 100 MiB          0.855650830 +-14.90%                  0.887557919 +-14.90%
        dd write 200 MiB          1.688322953 +-12.72%                  1.667682724 +-13.33%
        dd write 1000 MiB         8.418601605 +-14.30%                  8.673532299 +-15.00%
        read fault cycles       266.0(+-0.000% 10 samples)            266.0(+-0.000% 10 samples)
        write fault cycles     2051.7(+-1.349% 10 samples)           2049.6(+-1.686% 10 samples)
      
      * CONFIG_MEMCG=y non-root_memcg:
                                  baseline                              patched
        kbuild                 1m26.120000(+-0.273% 3 samples)       1m25.763333(+-0.127% 3 samples)
        dd write 100 MiB          0.861723964 +-15.25%                  0.818129350 +-14.82%
        dd write 200 MiB          1.669887569 +-13.30%                  1.698645885 +-13.27%
        dd write 1000 MiB         8.383191730 +-14.65%                  8.351742280 +-14.52%
        read fault cycles       265.7(+-0.172% 10 samples)            267.0(+-0.000% 10 samples)
        write fault cycles     2070.6(+-1.512% 10 samples)           2084.4(+-2.148% 10 samples)
      
      As expected anon page faults are not affected by this patch.
      
      tj: Updated to apply on top of the recent cancel_dirty_page() changes.
      Signed-off-by: default avatarSha Zhengju <handai.szj@gmail.com>
      Signed-off-by: default avatarGreg Thelen <gthelen@google.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      c4843a75
  5. 15 Apr, 2015 3 commits
  6. 14 Apr, 2015 3 commits
  7. 13 Mar, 2015 1 commit
  8. 28 Feb, 2015 2 commits
    • Johannes Weiner's avatar
      mm: memcontrol: use "max" instead of "infinity" in control knobs · d2973697
      Johannes Weiner authored
      
      The memcg control knobs indicate the highest possible value using the
      symbolic name "infinity", which is long and awkward to type.
      
      Switch to the string "max", which is just as descriptive but shorter and
      sweeter.
      
      This changes a user interface, so do it before the release and before
      the development flag is dropped from the default hierarchy.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d2973697
    • Michal Hocko's avatar
      memcg: fix low limit calculation · 4e54dede
      Michal Hocko authored
      A memcg is considered low limited even when the current usage is equal to
      the low limit.  This leads to interesting side effects e.g.
      groups/hierarchies with no memory accounted are considered protected and
      so the reclaim will emit MEMCG_LOW event when encountering them.
      
      Another and much bigger issue was reported by Joonsoo Kim.  He has hit a
      NULL ptr dereference with the legacy cgroup API which even doesn't have
      low limit exposed.  The limit is 0 by default but the initial check fails
      for memcg with 0 consumption and parent_mem_cgroup() would return NULL if
      use_hierarchy is 0 and so page_counter_read would try to dereference NULL.
      
      I suppose that the current implementation is just an overlook because the
      documentation in Documentation/cgroups/unified-hierarchy.txt says:
      
        "The memory.low boundary on the other hand is a top-down allocated
        reserve.  A cgroup enjoys reclaim protection when it and all its
        ancestors are below their low boundaries"
      
      Fix the usage and the low limit comparision in mem_cgroup_low accordingly.
      
      Fixes: 241994ed
      
       (mm: memcontrol: default hierarchy interface for memory)
      Reported-by: default avatarJoonsoo Kim <js1304@gmail.com>
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4e54dede
  9. 13 Feb, 2015 8 commits
    • Vladimir Davydov's avatar
      memcg: cleanup static keys decrement · f48b80a5
      Vladimir Davydov authored
      
      Move memcg_socket_limit_enabled decrement to tcp_destroy_cgroup (called
      from memcg_destroy_kmem -> mem_cgroup_sockets_destroy) and zap a bunch of
      wrapper functions.
      
      Although this patch moves static keys decrement from __mem_cgroup_free to
      mem_cgroup_css_free, it does not introduce any functional changes, because
      the keys are incremented on setting the limit (tcp or kmem), which can
      only happen after successful mem_cgroup_css_online.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f48b80a5
    • Vladimir Davydov's avatar
      memcg: reparent list_lrus and free kmemcg_id on css offline · 2788cf0c
      Vladimir Davydov authored
      
      Now, the only reason to keep kmemcg_id till css free is list_lru, which
      uses it to distribute elements between per-memcg lists.  However, it can
      be easily sorted out - we only need to change kmemcg_id of an offline
      cgroup to its parent's id, making further list_lru_add()'s add elements to
      the parent's list, and then move all elements from the offline cgroup's
      list to the one of its parent.  It will work, because a racing
      list_lru_del() does not need to know the list it is deleting the element
      from.  It can decrement the wrong nr_items counter though, but the ongoing
      reparenting will fix it.  After list_lru reparenting is done we are free
      to release kmemcg_id saving a valuable slot in a per-memcg array for new
      cgroups.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel...
      2788cf0c
    • Vladimir Davydov's avatar
      memcg: free memcg_caches slot on css offline · 2a4db7eb
      Vladimir Davydov authored
      
      We need to look up a kmem_cache in ->memcg_params.memcg_caches arrays only
      on allocations, so there is no need to have the array entries set until
      css free - we can clear them on css offline.  This will allow us to reuse
      array entries more efficiently and avoid costly array relocations.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2a4db7eb
    • Vladimir Davydov's avatar
      slab: embed memcg_cache_params to kmem_cache · f7ce3190
      Vladimir Davydov authored
      
      Currently, kmem_cache stores a pointer to struct memcg_cache_params
      instead of embedding it.  The rationale is to save memory when kmem
      accounting is disabled.  However, the memcg_cache_params has shrivelled
      drastically since it was first introduced:
      
      * Initially:
      
      struct memcg_cache_params {
      	bool is_root_cache;
      	union {
      		struct kmem_cache *memcg_caches[0];
      		struct {
      			struct mem_cgroup *memcg;
      			struct list_head list;
      			struct kmem_cache *root_cache;
      			bool dead;
      			atomic_t nr_pages;
      			struct work_struct destroy;
      		};
      	};
      };
      
      * Now:
      
      struct memcg_cache_params {
      	bool is_root_cache;
      	union {
      		struct {
      			struct rcu_head rcu_head;
      			struct kmem_cache *memcg_caches[0];
      		};
      		struct {
      			struct mem_cgroup *memcg;
      			struct kmem_cache *root_cache;
      		};
      	};
      };
      
      So the memory saving does not seem to be a clear win anymore.
      
      OTOH, keeping a pointer to memcg_cache_params struct instead of embedding
      it results in touching one more cache line on kmem alloc/free hot paths.
      Besides, it makes linking kmem caches in a list chained by a field of
      struct memcg_cache_params really painful due to a level of indirection,
      while I want to make them linked in the following patch.  That said, let
      us embed it.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f7ce3190
    • Vladimir Davydov's avatar
      list_lru: introduce per-memcg lists · 60d3fd32
      Vladimir Davydov authored
      
      There are several FS shrinkers, including super_block::s_shrink, that
      keep reclaimable objects in the list_lru structure.  Hence to turn them
      to memcg-aware shrinkers, it is enough to make list_lru per-memcg.
      
      This patch does the trick.  It adds an array of lru lists to the
      list_lru_node structure (per-node part of the list_lru), one for each
      kmem-active memcg, and dispatches every item addition or removal to the
      list corresponding to the memcg which the item is accounted to.  So now
      the list_lru structure is not just per node, but per node and per memcg.
      
      Not all list_lrus need this feature, so this patch also adds a new
      method, list_lru_init_memcg, which initializes a list_lru as memcg
      aware.  Otherwise (i.e.  if initialized with old list_lru_init), the
      list_lru won't have per memcg lists.
      
      Just like per memcg caches arrays, the arrays of per-memcg lists are
      indexed by memcg_cache_id, so we must grow them whenever
      memcg_nr_cache_ids is increased.  So we introduce a callback,
      memcg_update_all_list_lrus, invoked by memcg_alloc_cache_id if the id
      space is full.
      
      The locking is implemented in a manner similar to lruvecs, i.e.  we have
      one lock per node that protects all lists (both global and per cgroup) on
      the node.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      60d3fd32
    • Vladimir Davydov's avatar
      memcg: add rwsem to synchronize against memcg_caches arrays relocation · 05257a1a
      Vladimir Davydov authored
      
      We need a stable value of memcg_nr_cache_ids in kmem_cache_create()
      (memcg_alloc_cache_params() wants it for root caches), where we only
      hold the slab_mutex and no memcg-related locks.  As a result, we have to
      update memcg_nr_cache_ids under the slab_mutex, which we can only take
      on the slab's side (see memcg_update_array_size).  This looks awkward
      and will become even worse when per-memcg list_lru is introduced, which
      also wants stable access to memcg_nr_cache_ids.
      
      To get rid of this dependency between the memcg_nr_cache_ids and the
      slab_mutex, this patch introduces a special rwsem.  The rwsem is held
      for writing during memcg_caches arrays relocation and memcg_nr_cache_ids
      updates.  Therefore one can take it for reading to get a stable access
      to memcg_caches arrays and/or memcg_nr_cache_ids.
      
      Currently the semaphore is taken for reading only from
      kmem_cache_create, right before taking the slab_mutex, so right now
      there's no much point in using rwsem instead of mutex.  However, once
      list_lru is made per-memcg it will allow list_lru initializations to
      proceed concurrently.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      05257a1a
    • Vladimir Davydov's avatar
      memcg: rename some cache id related variables · dbcf73e2
      Vladimir Davydov authored
      
      memcg_limited_groups_array_size, which defines the size of memcg_caches
      arrays, sounds rather cumbersome.  Also it doesn't point anyhow that
      it's related to kmem/caches stuff.  So let's rename it to
      memcg_nr_cache_ids.  It's concise and points us directly to
      memcg_cache_id.
      
      Also, rename kmem_limited_groups to memcg_cache_ida.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dbcf73e2
    • Vladimir Davydov's avatar
      vmscan: per memory cgroup slab shrinkers · cb731d6c
      Vladimir Davydov authored
      
      This patch adds SHRINKER_MEMCG_AWARE flag.  If a shrinker has this flag
      set, it will be called per memory cgroup.  The memory cgroup to scan
      objects from is passed in shrink_control->memcg.  If the memory cgroup
      is NULL, a memcg aware shrinker is supposed to scan objects from the
      global list.  Unaware shrinkers are only called on global pressure with
      memcg=NULL.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cb731d6c
  10. 12 Feb, 2015 8 commits
    • Naoya Horiguchi's avatar
      memcg: cleanup preparation for page table walk · 26bcd64a
      Naoya Horiguchi authored
      
      pagewalk.c can handle vma in itself, so we don't have to pass vma via
      walk->private.  And both of mem_cgroup_count_precharge() and
      mem_cgroup_move_charge() do for each vma loop themselves, but now it's
      done in pagewalk.c, so let's clean up them.
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      26bcd64a
    • Johannes Weiner's avatar
      mm: memcontrol: consolidate swap controller code · 21afa38e
      Johannes Weiner authored
      
      The swap controller code is scattered all over the file.  Gather all
      the code that isn't directly needed by the memory controller at the
      end of the file in its own CONFIG_MEMCG_SWAP section.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      21afa38e
    • Johannes Weiner's avatar
      mm: memcontrol: consolidate memory controller initialization · 95a045f6
      Johannes Weiner authored
      
      The initialization code for the per-cpu charge stock and the soft
      limit tree is compact enough to inline it into mem_cgroup_init().
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Reviewed-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      95a045f6
    • Johannes Weiner's avatar
      mm: memcontrol: simplify soft limit tree init code · 9c608dbe
      Johannes Weiner authored
      
      - No need to test the node for N_MEMORY.  node_online() is enough for
        node fallback to work in slab, use NUMA_NO_NODE for everything else.
      
      - Remove the BUG_ON() for allocation failure.  A NULL pointer crash is
        just as descriptive, and the absent return value check is obvious.
      
      - Move local variables to the inner-most blocks.
      
      - Point to the tree structure after its initialized, not before, it's
        just more logical that way.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9c608dbe
    • Michal Hocko's avatar
      oom, PM: make OOM detection in the freezer path raceless · c32b3cbe
      Michal Hocko authored
      Commit 5695be14
      
       ("OOM, PM: OOM killed task shouldn't escape PM
      suspend") has left a race window when OOM killer manages to
      note_oom_kill after freeze_processes checks the counter.  The race
      window is quite small and really unlikely and partial solution deemed
      sufficient at the time of submission.
      
      Tejun wasn't happy about this partial solution though and insisted on a
      full solution.  That requires the full OOM and freezer's task freezing
      exclusion, though.  This is done by this patch which introduces oom_sem
      RW lock and turns oom_killer_disable() into a full OOM barrier.
      
      oom_killer_disabled check is moved from the allocation path to the OOM
      level and we take oom_sem for reading for both the check and the whole
      OOM invocation.
      
      oom_killer_disable() takes oom_sem for writing so it waits for all
      currently running OOM killer invocations.  Then it disable all the further
      OOMs by setting oom_killer_disabled and checks for any oom victims.
      Victims are counted via mark_tsk_oom_victim resp.  unmark_oom_victim.  The
      last victim wakes up all waiters enqueued by oom_killer_disable().
      Therefore this function acts as the full OOM barrier.
      
      The page fault path is covered now as well although it was assumed to be
      safe before.  As per Tejun, "We used to have freezing points deep in file
      system code which may be reacheable from page fault." so it would be
      better and more robust to not rely on freezing points here.  Same applies
      to the memcg OOM killer.
      
      out_of_memory tells the caller whether the OOM was allowed to trigger and
      the callers are supposed to handle the situation.  The page allocation
      path simply fails the allocation same as before.  The page fault path will
      retry the fault (more on that later) and Sysrq OOM trigger will simply
      complain to the log.
      
      Normally there wouldn't be any unfrozen user tasks after
      try_to_freeze_tasks so the function will not block. But if there was an
      OOM killer racing with try_to_freeze_tasks and the OOM victim didn't
      finish yet then we have to wait for it. This should complete in a finite
      time, though, because
      
      	- the victim cannot loop in the page fault handler (it would die
      	  on the way out from the exception)
      	- it cannot loop in the page allocator because all the further
      	  allocation would fail and __GFP_NOFAIL allocations are not
      	  acceptable at this stage
      	- it shouldn't be blocked on any locks held by frozen tasks
      	  (try_to_freeze expects lockless context) and kernel threads and
      	  work queues are not frozen yet
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Suggested-by: default avatarTejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c32b3cbe
    • Michal Hocko's avatar
      oom: add helpers for setting and clearing TIF_MEMDIE · 49550b60
      Michal Hocko authored
      This patchset addresses a race which was described in the changelog for
      5695be14
      
       ("OOM, PM: OOM killed task shouldn't escape PM suspend"):
      
      : PM freezer relies on having all tasks frozen by the time devices are
      : getting frozen so that no task will touch them while they are getting
      : frozen.  But OOM killer is allowed to kill an already frozen task in order
      : to handle OOM situtation.  In order to protect from late wake ups OOM
      : killer is disabled after all tasks are frozen.  This, however, still keeps
      : a window open when a killed task didn't manage to die by the time
      : freeze_processes finishes.
      
      The original patch hasn't closed the race window completely because that
      would require a more complex solution as it can be seen by this patchset.
      
      The primary motivation was to close the race condition between OOM killer
      and PM freezer _completely_.  As Tejun pointed out, even though the race
      condition is unlikely the harder it would be to debug weird bugs deep in
      the PM freezer when the debugging options are reduced considerably.  I can
      only speculate what might happen when a task is still runnable
      unexpectedly.
      
      On a plus side and as a side effect the oom enable/disable has a better
      (full barrier) semantic without polluting hot paths.
      
      I have tested the series in KVM with 100M RAM:
      - many small tasks (20M anon mmap) which are triggering OOM continually
      - s2ram which resumes automatically is triggered in a loop
      	echo processors > /sys/power/pm_test
      	while true
      	do
      		echo mem > /sys/power/state
      		sleep 1s
      	done
      - simple module which allocates and frees 20M in 8K chunks. If it sees
        freezing(current) then it tries another round of allocation before calling
        try_to_freeze
      - debugging messages of PM stages and OOM killer enable/disable/fail added
        and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before
        it wakes up waiters.
      - rebased on top of the current mmotm which means some necessary updates
        in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but
        I think this should be OK because __thaw_task shouldn't interfere with any
        locking down wake_up_process. Oleg?
      
      As expected there are no OOM killed tasks after oom is disabled and
      allocations requested by the kernel thread are failing after all the tasks
      are frozen and OOM disabled.  I wasn't able to catch a race where
      oom_killer_disable would really have to wait but I kinda expected the race
      is really unlikely.
      
      [  242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB
      [  243.628071] Unmarking 2992 OOM victim. oom_victims: 1
      [  243.636072] (elapsed 2.837 seconds) done.
      [  243.641985] Trying to disable OOM killer
      [  243.643032] Waiting for concurent OOM victims
      [  243.644342] OOM killer disabled
      [  243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done.
      [  243.652983] Suspending console(s) (use no_console_suspend to debug)
      [  243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010
      [...]
      [  243.992600] PM: suspend of devices complete after 336.667 msecs
      [  243.993264] PM: late suspend of devices complete after 0.660 msecs
      [  243.994713] PM: noirq suspend of devices complete after 1.446 msecs
      [  243.994717] ACPI: Preparing to enter system sleep state S3
      [  243.994795] PM: Saving platform NVS memory
      [  243.994796] Disabling non-boot CPUs ...
      
      The first 2 patches are simple cleanups for OOM.  They should go in
      regardless the rest IMO.
      
      Patches 3 and 4 are trivial printk -> pr_info conversion and they should
      go in ditto.
      
      The main patch is the last one and I would appreciate acks from Tejun and
      Rafael.  I think the OOM part should be OK (except for __thaw_task vs.
      task_lock where a look from Oleg would appreciated) but I am not so sure I
      haven't screwed anything in the freezer code.  I have found several
      surprises there.
      
      This patch (of 5):
      
      This patch is just a preparatory and it doesn't introduce any functional
      change.
      
      Note:
      I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to
      wait for the oom victim and to prevent from new killing. This is
      just a side effect of the flag. The primary meaning is to give the oom
      victim access to the memory reserves and that shouldn't be necessary
      here.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      49550b60
    • Johannes Weiner's avatar
      mm: memcontrol: fold move_anon() and move_file() · 1dfab5ab
      Johannes Weiner authored
      
      Turn the move type enum into flags and give the flags field a shorter
      name.  Once that is done, move_anon() and move_file() are simple enough to
      just fold them into the callsites.
      
      [akpm@linux-foundation.org: tweak MOVE_MASK definition, per Michal]
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Reviewed-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1dfab5ab
    • Johannes Weiner's avatar
      mm: memcontrol: default hierarchy interface for memory · 241994ed
      Johannes Weiner authored
      
      Introduce the basic control files to account, partition, and limit
      memory using cgroups in default hierarchy mode.
      
      This interface versioning allows us to address fundamental design
      issues in the existing memory cgroup interface, further explained
      below.  The old interface will be maintained indefinitely, but a
      clearer model and improved workload performance should encourage
      existing users to switch over to the new one eventually.
      
      The control files are thus:
      
        - memory.current shows the current consumption of the cgroup and its
          descendants, in bytes.
      
        - memory.low configures the lower end of the cgroup's expected
          memory consumption range.  The kernel considers memory below that
          boundary to be a reserve - the minimum that the workload needs in
          order to make forward progress - and generally avoids reclaiming
          it, unless there is an imminent risk of entering an OOM situation.
      
        - memory.high configures the upper end of the cgroup's expected
          memory consumption range.  A cgroup whose consumption grows beyond
          this threshold is forced into direct reclaim, to work off the
          excess and to throttle new allocations heavily, but is generally
          allowed to continue and the OOM killer is not invoked.
      
        - memory.max configures the hard maximum amount of memory that the
          cgroup is allowed to consume before the OOM killer is invoked.
      
        - memory.events shows event counters that indicate how often the
          cgroup was reclaimed while below memory.low, how often it was
          forced to reclaim excess beyond memory.high, how often it hit
          memory.max, and how often it entered OOM due to memory.max.  This
          allows users to identify configuration problems when observing a
          degradation in workload performance.  An overcommitted system will
          have an increased rate of low boundary breaches, whereas increased
          rates of high limit breaches, maximum hits, or even OOM situations
          will indicate internally overcommitted cgroups.
      
      For existing users of memory cgroups, the following deviations from
      the current interface are worth pointing out and explaining:
      
        - The original lower boundary, the soft limit, is defined as a limit
          that is per default unset.  As a result, the set of cgroups that
          global reclaim prefers is opt-in, rather than opt-out.  The costs
          for optimizing these mostly negative lookups are so high that the
          implementation, despite its enormous size, does not even provide
          the basic desirable behavior.  First off, the soft limit has no
          hierarchical meaning.  All configured groups are organized in a
          global rbtree and treated like equal peers, regardless where they
          are located in the hierarchy.  This makes subtree delegation
          impossible.  Second, the soft limit reclaim pass is so aggressive
          that it not just introduces high allocation latencies into the
          system, but also impacts system performance due to overreclaim, to
          the point where the feature becomes self-defeating.
      
          The memory.low boundary on the other hand is a top-down allocated
          reserve.  A cgroup enjoys reclaim protection when it and all its
          ancestors are below their low boundaries, which makes delegation
          of subtrees possible.  Secondly, new cgroups have no reserve per
          default and in the common case most cgroups are eligible for the
          preferred reclaim pass.  This allows the new low boundary to be
          efficiently implemented with just a minor addition to the generic
          reclaim code, without the need for out-of-band data structures and
          reclaim passes.  Because the generic reclaim code considers all
          cgroups except for the ones running low in the preferred first
          reclaim pass, overreclaim of individual groups is eliminated as
          well, resulting in much better overall workload performance.
      
        - The original high boundary, the hard limit, is defined as a strict
          limit that can not budge, even if the OOM killer has to be called.
          But this generally goes against the goal of making the most out of
          the available memory.  The memory consumption of workloads varies
          during runtime, and that requires users to overcommit.  But doing
          that with a strict upper limit requires either a fairly accurate
          prediction of the working set size or adding slack to the limit.
          Since working set size estimation is hard and error prone, and
          getting it wrong results in OOM kills, most users tend to err on
          the side of a looser limit and end up wasting precious resources.
      
          The memory.high boundary on the other hand can be set much more
          conservatively.  When hit, it throttles allocations by forcing
          them into direct reclaim to work off the excess, but it never
          invokes the OOM killer.  As a result, a high boundary that is
          chosen too aggressively will not terminate the processes, but
          instead it will lead to gradual performance degradation.  The user
          can monitor this and make corrections until the minimal memory
          footprint that still gives acceptable performance is found.
      
          In extreme cases, with many concurrent allocations and a complete
          breakdown of reclaim progress within the group, the high boundary
          can be exceeded.  But even then it's mostly better to satisfy the
          allocation from the slack available in other groups or the rest of
          the system than killing the group.  Otherwise, memory.max is there
          to limit this type of spillover and ultimately contain buggy or
          even malicious applications.
      
        - The original control file names are unwieldy and inconsistent in
          many different ways.  For example, the upper boundary hit count is
          exported in the memory.failcnt file, but an OOM event count has to
          be manually counted by listening to memory.oom_control events, and
          lower boundary / soft limit events have to be counted by first
          setting a threshold for that value and then counting those events.
          Also, usage and limit files encode their units in the filename.
          That makes the filenames very long, even though this is not
          information that a user needs to be reminded of every time they
          type out those names.
      
          To address these naming issues, as well as to signal clearly that
          the new interface carries a new configuration model, the naming
          conventions in it necessarily differ from the old interface.
      
        - The original limit files indicate the state of an unset limit with
          a very high number, and a configured limit can be unset by echoing
          -1 into those files.  But that very high number is implementation
          and architecture dependent and not very descriptive.  And while -1
          can be understood as an underflow into the highest possible value,
          -2 or -10M etc. do not work, so it's not inconsistent.
      
          memory.low, memory.high, and memory.max will use the string
          "infinity" to indicate and set the highest possible value.
      
      [akpm@linux-foundation.org: use seq_puts() for basic strings]
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      241994ed