1. 19 Dec, 2012 3 commits
    • Linus Torvalds's avatar
      Merge tag 'byteswap-for-linus-20121219' of git://git.infradead.org/users/dwmw2/byteswap · 7f2de817
      Linus Torvalds authored
      Pull preparatory gcc intrinsics bswap patch from David Woodhouse:
       "This single patch is effectively a no-op for now.  It enables
        architectures to opt in to using GCC's __builtin_bswapXX() intrinsics
        for byteswapping, and if we merge this now then the architecture
        maintainers can enable it for their arch during the next cycle without
        dependency issues.
      
        It's worth making it a per-arch opt-in, because although in *theory*
        the compiler should never do worse than hand-coded assembler (and of
        course it also ought to do a lot better on platforms like Atom and
        PowerPC which have load-and-swap or store-and-swap instructions), that
        isn't always the case.  See
      
           http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46453
      
        for example."
      
      * tag 'byteswap-for-linus-20121219' of git://git.infradead.org/users/dwmw2/byteswap:
        byteorder: allow arch to opt to use GCC intrinsics for byteswapping
      7f2de817
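
The opt-in described in the pull message above can be sketched in plain C. This is only an illustration: `USE_BUILTIN_BSWAP`, `generic_swab32()` and `swab32()` are hypothetical names chosen here, not necessarily the symbols the patch introduces. `__builtin_bswap32()` is the GCC intrinsic the message refers to.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical opt-in switch; stands in for a per-arch config symbol. */
#ifndef USE_BUILTIN_BSWAP
#define USE_BUILTIN_BSWAP 1
#endif

/* Portable fallback: swap the four bytes with shifts and masks. */
static inline uint32_t generic_swab32(uint32_t x)
{
	return ((x & 0x000000ffu) << 24) |
	       ((x & 0x0000ff00u) <<  8) |
	       ((x & 0x00ff0000u) >>  8) |
	       ((x & 0xff000000u) >> 24);
}

/* Use the compiler intrinsic only when the (hypothetical) switch is on. */
static inline uint32_t swab32(uint32_t x)
{
#if USE_BUILTIN_BSWAP && defined(__GNUC__)
	return __builtin_bswap32(x);
#else
	return generic_swab32(x);
#endif
}
```

Keeping the fallback next to the intrinsic lets an architecture leave the switch off if the compiler generates worse code than the hand-written version.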
    • Linus Torvalds's avatar
      blk: avoid divide-by-zero with zero discard granularity · 59771079
      Linus Torvalds authored
      Commit 8dd2cb7e ("block: discard granularity might not be power of
      2") changed a couple of 'binary and' operations into modulus operations.
      Which turned the harmless case of a zero discard_granularity into a
      possible divide-by-zero.
      
      The code also had a much more subtle bug: it was doing the modulus of a
      value in bytes using 'sector_t'.  That was always conceptually wrong,
      but didn't actually matter back when the code assumed a power-of-two
      granularity: we only looked at the low bits anyway.
      
      But with potentially arbitrary sector numbers, using a 'sector_t' to
      express bytes is very very wrong: depending on configuration it limits
      the starting offset of the device to just 32 bits, and any overflow
      would result in a wrong value if the modulus wasn't a power-of-two.
      
      So re-write the code to not only protect against the divide-by-zero, but
      to do the starting sector arithmetic in sectors, and using the proper
      types.
      
      [ For any mathematicians out there: it also looks monumentally stupid to
        do the 'modulo granularity' operation *twice*, never mind having a "+
        granularity" in the second modulus op.
      
        But that's the easiest way to avoid negative values or overflow, and
        it is how the original code was done. ]
      Reported-by: Ingo Molnar <mingo@kernel.org>
      Reported-by: Doug Anderson <dianders@chromium.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Shaohua Li <shli@fusionio.com>
      Acked-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      59771079
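
A minimal sketch of the arithmetic the message above describes: guard against a zero granularity, do the starting-sector math in sectors with a 64-bit type, and apply the modulus twice (with a "+ granularity") so the intermediate value never goes negative. The function name and parameters are hypothetical, and the sketch assumes 512-byte sectors and that the two byte values are small enough for their sum not to overflow 32 bits.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not the kernel function.
 * start_sector is in 512-byte sectors; the other two inputs are in bytes.
 * Returns the discard alignment offset in bytes.
 */
static uint32_t discard_alignment_bytes(uint64_t start_sector,
					uint32_t alignment_bytes,
					uint32_t granularity_bytes)
{
	uint32_t alignment = alignment_bytes >> 9;	/* bytes -> sectors */
	uint32_t granularity = granularity_bytes >> 9;
	uint32_t offset;

	/* A zero granularity is harmless: nothing to align to. */
	if (granularity == 0)
		return 0;

	/* First modulus: where the start sector falls within a granule. */
	offset = (uint32_t)(start_sector % granularity);

	/* Second modulus: "+ granularity" keeps the value non-negative. */
	offset = (granularity + alignment - offset) % granularity;

	return offset << 9;				/* sectors -> bytes */
}
```

For example, with a start sector of 10, no alignment and a non-power-of-two granularity of 1536 bytes (3 sectors), the result is 1024 bytes.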
    • Linus Torvalds's avatar
      Merge branch 'i2c-embedded/for-next' of git://git.pengutronix.de/git/wsa/linux · 752451f0
      Linus Torvalds authored
      Pull i2c-embedded changes from Wolfram Sang:
       - CBUS driver (an I2C variant)
       - continued rework of the omap driver
       - s3c2410 gets lots of fixes and gains pinctrl support
       - at91 gains DMA support
       - the GPIO muxer gains devicetree probing
       - typical fixes and additions all over
      
      * 'i2c-embedded/for-next' of git://git.pengutronix.de/git/wsa/linux: (45 commits)
        i2c: omap: Remove the OMAP_I2C_FLAG_RESET_REGS_POSTIDLE flag
        i2c: at91: add dma support
        i2c: at91: change struct members indentation
        i2c: at91: fix compilation warning
        i2c: mxs: Do not disable the I2C SMBus quick mode
        i2c: mxs: Handle i2c DMA failure properly
        i2c: s3c2410: Remove recently introduced performance overheads
        i2c: ocores: Move grlib set/get functions into #ifdef CONFIG_OF block
        i2c: s3c2410: Add fix for i2c suspend/resume
        i2c: s3c2410: Fix code to free gpios
        i2c: i2c-cbus-gpio: introduce driver
        i2c: ocores: Add support for the GRLIB port of the controller and use function pointers for getreg and setreg functions
        i2c: ocores: Add irq support for sparc
        i2c: omap: Move the remove constraint
        ARM: dts: cfa10049: Add the i2c muxer buses to the CFA-10049
        i2c: s3c2410: do not special case HDMIPHY stuck bus detection
        i2c: s3c2410: use exponential back off while polling for bus idle
        i2c: s3c2410: do not generate STOP for QUIRK_HDMIPHY
        i2c: s3c2410: grab adapter lock while changing i2c clock
        i2c: s3c2410: Add support for pinctrl
        ...
      752451f0
  2. 18 Dec, 2012 37 commits
    • Linus Torvalds's avatar
      Merge branch 'akpm' (more patches from Andrew) · 673ab878
      Linus Torvalds authored
      Merge patches from Andrew Morton:
       "Most of the rest of MM, plus a few dribs and drabs.
      
        I still have quite a few irritating patches left around: ones with
        dubious testing results, lack of review, ones which should have gone
        via maintainer trees but the maintainers are slack, etc.
      
        I need to be more activist in getting these things wrapped up outside
        the merge window, but they're such a PITA."
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (48 commits)
        mm/vmscan.c: avoid possible deadlock caused by too_many_isolated()
        vmscan: comment too_many_isolated()
        mm/kmemleak.c: remove obsolete simple_strtoul
        mm/memory_hotplug.c: improve comments
        mm/hugetlb: create hugetlb cgroup file in hugetlb_init
        mm/mprotect.c: coding-style cleanups
        Documentation: ABI: /sys/devices/system/node/
        slub: drop mutex before deleting sysfs entry
        memcg: add comments clarifying aspects of cache attribute propagation
        kmem: add slab-specific documentation about the kmem controller
        slub: slub-specific propagation changes
        slab: propagate tunable values
        memcg: aggregate memcg cache values in slabinfo
        memcg/sl[au]b: shrink dead caches
        memcg/sl[au]b: track all the memcg children of a kmem_cache
        memcg: destroy memcg caches
        sl[au]b: allocate objects from memcg cache
        sl[au]b: always get the cache from its page in kmem_cache_free()
        memcg: skip memcg kmem allocations in specified code regions
        memcg: infrastructure to match an allocation to the right cache
        ...
      673ab878
    • Linus Torvalds's avatar
      Merge tag 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging · d7b96ca5
      Linus Torvalds authored
      Pull hwmon fixlet from Guenter Roeck:
       "Fix fallout from __devexit removal"
      
      * tag 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
        hwmon: (twl4030-madc-hwmon) Fix warning message caused by removal of __devexit
      d7b96ca5
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile · 5b3040a4
      Linus Torvalds authored
      Pull tile updates from Chris Metcalf:
       "These are a smattering of minor changes from Tilera and other folks,
        mostly in the ptrace area."
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile:
        arch/tile: set CORE_DUMP_USE_REGSET on tile
        arch/tile: implement arch_ptrace using user_regset on tile
        arch/tile: implement user_regset interface on tile
        arch/tile: clean up tile-specific PTRACE_SETOPTIONS
        arch/tile: provide PT_FLAGS_COMPAT value in pt_regs
        tile/PCI: use for_each_pci_dev to simplify the code
        tilegx: remove __init from pci fixup hook
      5b3040a4
    • Fengguang Wu's avatar
      mm/vmscan.c: avoid possible deadlock caused by too_many_isolated() · 3cf23841
      Fengguang Wu authored
      Neil found that if too_many_isolated() returns true while performing
      direct reclaim we can end up waiting for other threads to complete their
      direct reclaim.  If those threads are allowed to enter the FS or IO to
      free memory, but this thread is not, then it is possible that those
      threads will be waiting on this thread and so we get a circular deadlock.
      
      some task enters direct reclaim with GFP_KERNEL
        => too_many_isolated() false
          => vmscan and run into dirty pages
            => pageout()
              => take some FS lock
                => fs/block code does GFP_NOIO allocation
                  => enter direct reclaim again
                    => too_many_isolated() true
                      => waiting for others to progress, however the other
                         tasks may be circular waiting for the FS lock..
      
      The fix is to let !__GFP_IO and !__GFP_FS direct reclaims enjoy higher
      priority than normal ones, by lowering the throttle threshold for the
      latter.
      
      Allowing ~1/8 isolated pages in normal is large enough.  For example, for
      a 1GB LRU list, that's ~128MB isolated pages, or 1k blocked tasks (each
      isolates 32 4KB pages), or 64 blocked tasks per logical CPU (assuming 16
      logical CPUs per NUMA node).  So it's not likely that some CPU goes idle
      waiting (when it could make progress) because of this limit: there are
      many more sleeping reclaim tasks than CPUs, so the task may well be
      blocked by some low-level queue/lock anyway.
      
      Now !GFP_IOFS reclaims won't be waiting for GFP_IOFS reclaims to progress.
      They will be blocked only when there are too many concurrent !GFP_IOFS
      reclaims; however, that's very unlikely because the IO-less direct reclaims
      are able to progress much faster, and they won't deadlock each other.
      The threshold is raised high enough for them, so that there can be
      sufficient parallel progress of !GFP_IOFS reclaims.
      
      [akpm@linux-foundation.org: tweak comment]
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Torsten Kaiser <just.for.lkml@googlemail.com>
      Tested-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3cf23841
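
The throttling rule in the message above can be sketched as follows. This is an assumption-laden illustration, not the kernel code: the flag values and the function name are hypothetical, and comparing the isolated count against the inactive list size is assumed here as the baseline test. The sizing in the message works out as 1 GB / 8 = 128 MB; each task isolates 32 x 4 KB = 128 KB, so 128 MB / 128 KB = 1024 blocked tasks, or 1024 / 16 = 64 per logical CPU.

```c
#include <stdbool.h>

/* Hypothetical flag bits; the real GFP flag values are not shown here. */
#define SKETCH_GFP_IO	0x1u
#define SKETCH_GFP_FS	0x2u
#define SKETCH_GFP_IOFS	(SKETCH_GFP_IO | SKETCH_GFP_FS)

/*
 * Returns true when the caller should be throttled.
 * Reclaimers that may enter both IO and FS ("normal" ones) get a
 * threshold of 1/8 of the inactive list, so restricted reclaimers
 * (!IO or !FS) are never stuck waiting behind them.
 */
static bool too_many_isolated_sketch(unsigned long isolated,
				     unsigned long inactive,
				     unsigned int gfp_mask)
{
	if ((gfp_mask & SKETCH_GFP_IOFS) == SKETCH_GFP_IOFS)
		inactive >>= 3;

	return isolated > inactive;
}
```

With 1000 inactive and 200 isolated pages, a normal reclaimer is throttled (200 > 125) while an IO-only reclaimer is not (200 <= 1000).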
    • Fengguang Wu's avatar
      vmscan: comment too_many_isolated() · d37dd5dc
      Fengguang Wu authored
      Comment "Why it's doing so" rather than "What it does" as proposed by
      Andrew Morton.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d37dd5dc
    • Abhijit Pawar's avatar
      mm/kmemleak.c: remove obsolete simple_strtoul · dc053733
      Abhijit Pawar authored
      Replace the obsolete simple_strtoul() with kstrtoul().
      Signed-off-by: Abhijit Pawar <abhi.c.pawar@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc053733
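
The practical difference behind this replacement is strictness: simple_strtoul() stops silently at the first invalid character, while kstrtoul() returns an error code for malformed input and tolerates a single trailing newline. The following is a userspace analogue built on the C library's strtoul(), offered only as a sketch of that behaviour; `parse_ulong_strict()` is a hypothetical name and this is not the kernel implementation.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Userspace analogue of a strict string-to-unsigned-long parser.
 * Returns 0 on success, -EINVAL on malformed input, -ERANGE on overflow.
 */
static int parse_ulong_strict(const char *s, unsigned int base,
			      unsigned long *res)
{
	char *end;
	unsigned long val;

	/* strtoul() would accept these; a strict parser should not. */
	if (*s == '\0' || *s == '-' || isspace((unsigned char)*s))
		return -EINVAL;

	errno = 0;
	val = strtoul(s, &end, (int)base);
	if (errno == ERANGE)
		return -ERANGE;
	if (end == s)
		return -EINVAL;		/* no digits consumed */

	if (*end == '\n')		/* allow one trailing newline */
		end++;
	if (*end != '\0')
		return -EINVAL;		/* trailing garbage */

	*res = val;
	return 0;
}
```

Input such as "42\n" parses to 42, while "42abc" is rejected instead of being silently truncated to 42.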
    • Tang Chen's avatar
      mm/memory_hotplug.c: improve comments · 79a4dcef
      Tang Chen authored
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      79a4dcef
    • Jianguo Wu's avatar
      mm/hugetlb: create hugetlb cgroup file in hugetlb_init · 7179e7bf
      Jianguo Wu authored
      Build a kernel with CONFIG_HUGETLBFS=y, CONFIG_HUGETLB_PAGE=y and
      CONFIG_CGROUP_HUGETLB=y, then specify the hugepagesz=xx boot option: the
      system will fail to boot.
      
      This failure is caused by the following code path:
      
        setup_hugepagesz
          hugetlb_add_hstate
            hugetlb_cgroup_file_init
              cgroup_add_cftypes
                kzalloc <--slab is *not available* yet
      
      On this path slab is not available yet, so the memory allocation fails
      and triggers the WARN_ON() in hugetlb_cgroup_file_init().
      
      So I move hugetlb_cgroup_file_init() into hugetlb_init().
      
      [akpm@linux-foundation.org: tweak coding-style, remove pointless __init on inlined function]
      [akpm@linux-foundation.org: fix warning]
      Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7179e7bf
    • Andrew Morton's avatar
      mm/mprotect.c: coding-style cleanups · 7d12efae
      Andrew Morton authored
      A few gremlins have recently crept in.
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7d12efae
    • Davidlohr Bueso's avatar
      Documentation: ABI: /sys/devices/system/node/ · 5bbe1ec1
      Davidlohr Bueso authored
      Describe NUMA node sysfs files/attributes.
      
      Note that for the specific dates and contacts I couldn't find,
      I left it as default for Oct 2002 and linux-mm.
      Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5bbe1ec1
    • Glauber Costa's avatar
      slub: drop mutex before deleting sysfs entry · 5413dfba
      Glauber Costa authored
      Sasha Levin recently reported a lockdep problem resulting from the new
      attribute propagation introduced by the kmemcg series.  In short,
      slab_mutex is taken from within the sysfs attribute store function.  This
      creates a lock dependency that is later taken in the reverse order when a
      cache is destroyed, since destruction occurs with the slab_mutex held and
      then calls into the sysfs directory removal function.
      
      In this patch, I propose to adopt a strategy close to what
      __kmem_cache_create does before calling sysfs_slab_add, and release the
      lock before the call to sysfs_slab_remove.  This is pretty much the last
      operation in the kmem_cache_shutdown() path, so we could do better by
      splitting this and moving this call alone to later on.  This will fit
      nicely when sysfs handling is consistent between all caches, but will look
      weird now.
      
      Lockdep info:
      
        ======================================================
        [ INFO: possible circular locking dependency detected ]
        3.7.0-rc4-next-20121106-sasha-00008-g353b62f #117 Tainted: G        W
        -------------------------------------------------------
        trinity-child13/6961 is trying to acquire lock:
         (s_active#43){++++.+}, at:  sysfs_addrm_finish+0x31/0x60
      
        but task is already holding lock:
         (slab_mutex){+.+.+.}, at:  kmem_cache_destroy+0x22/0xe0
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
        -> #1 (slab_mutex){+.+.+.}:
                lock_acquire+0x1aa/0x240
                __mutex_lock_common+0x59/0x5a0
                mutex_lock_nested+0x3f/0x50
                slab_attr_store+0xde/0x110
                sysfs_write_file+0xfa/0x150
                vfs_write+0xb0/0x180
                sys_pwrite64+0x60/0xb0
                tracesys+0xe1/0xe6
        -> #0 (s_active#43){++++.+}:
                __lock_acquire+0x14df/0x1ca0
                lock_acquire+0x1aa/0x240
                sysfs_deactivate+0x122/0x1a0
                sysfs_addrm_finish+0x31/0x60
                sysfs_remove_dir+0x89/0xd0
                kobject_del+0x16/0x40
                __kmem_cache_shutdown+0x40/0x60
                kmem_cache_destroy+0x40/0xe0
                mon_text_release+0x78/0xe0
                __fput+0x122/0x2d0
                ____fput+0x9/0x10
                task_work_run+0xbe/0x100
                do_exit+0x432/0xbd0
                do_group_exit+0x84/0xd0
                get_signal_to_deliver+0x81d/0x930
                do_signal+0x3a/0x950
                do_notify_resume+0x3e/0x90
                int_signal+0x12/0x17
      
        other info that might help us debug this:
      
         Possible unsafe locking scenario:
      
               CPU0                    CPU1
               ----                    ----
          lock(slab_mutex);
                                       lock(s_active#43);
                                       lock(slab_mutex);
          lock(s_active#43);
      
         *** DEADLOCK ***
      
        2 locks held by trinity-child13/6961:
         #0:  (mon_lock){+.+.+.}, at:  mon_text_release+0x25/0xe0
         #1:  (slab_mutex){+.+.+.}, at:  kmem_cache_destroy+0x22/0xe0
      
        stack backtrace:
        Pid: 6961, comm: trinity-child13 Tainted: G        W    3.7.0-rc4-next-20121106-sasha-00008-g353b62f #117
        Call Trace:
          print_circular_bug+0x1fb/0x20c
          __lock_acquire+0x14df/0x1ca0
          lock_acquire+0x1aa/0x240
          sysfs_deactivate+0x122/0x1a0
          sysfs_addrm_finish+0x31/0x60
          sysfs_remove_dir+0x89/0xd0
          kobject_del+0x16/0x40
          __kmem_cache_shutdown+0x40/0x60
          kmem_cache_destroy+0x40/0xe0
          mon_text_release+0x78/0xe0
          __fput+0x122/0x2d0
          ____fput+0x9/0x10
          task_work_run+0xbe/0x100
          do_exit+0x432/0xbd0
          do_group_exit+0x84/0xd0
          get_signal_to_deliver+0x81d/0x930
          do_signal+0x3a/0x950
          do_notify_resume+0x3e/0x90
          int_signal+0x12/0x17
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5413dfba
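
The strategy proposed in the message above (release the mutex before calling into sysfs removal) can be illustrated with a toy model. Everything here is hypothetical: `toy_mutex` only records whether it is held, and the function names merely echo the roles described in the message. The sketch also assumes the mutex is retaken afterwards so the caller's later unlock stays balanced.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock that only tracks whether it is held (single-threaded model). */
struct toy_mutex { bool held; };

static struct toy_mutex slab_mutex_sketch;
static int removals;

static void toy_lock(struct toy_mutex *m)   { assert(!m->held); m->held = true; }
static void toy_unlock(struct toy_mutex *m) { assert(m->held);  m->held = false; }

/*
 * Stands in for the sysfs removal call, which takes its own lock
 * internally. Calling it with the slab mutex held would recreate the
 * inverted lock order from the lockdep report, so assert against it.
 */
static void sysfs_remove_sketch(void)
{
	assert(!slab_mutex_sketch.held);
	removals++;
}

/* Destruction path: the caller-side lock is held on entry and exit. */
static void cache_destroy_sketch(void)
{
	toy_lock(&slab_mutex_sketch);
	/* ... tear down state protected by the mutex ... */

	toy_unlock(&slab_mutex_sketch);	/* drop before calling out */
	sysfs_remove_sketch();
	toy_lock(&slab_mutex_sketch);	/* retake for the remaining work */

	toy_unlock(&slab_mutex_sketch);
}
```

Because the removal call never runs under the mutex, the two locks are never nested in the destroy path, which removes one edge of the cycle shown in the report.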
    • Glauber Costa's avatar
      memcg: add comments clarifying aspects of cache attribute propagation · ebe945c2
      Glauber Costa authored
      This patch clarifies two aspects of cache attribute propagation.
      
      First, the expected context for the for_each_memcg_cache macro in
      memcontrol.h.  The usages already in the codebase are safe.  In mm/slub.c,
      it is trivially safe because the lock is acquired right before the loop.
      In mm/slab.c, it is less so: the lock is acquired by an outer function a
      few steps back in the stack, so a VM_BUG_ON() is added to make sure it is
      indeed safe.
      
      A comment is also added to detail why we are returning the value of the
      parent cache and ignoring the children's when we propagate the attributes.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ebe945c2
    • Glauber Costa's avatar
      kmem: add slab-specific documentation about the kmem controller · 92e79349
      Glauber Costa authored
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      92e79349
    • Glauber Costa's avatar
      slub: slub-specific propagation changes · 107dab5c
      Glauber Costa authored
      SLUB allows us to tune a particular cache behavior with sysfs-based
      tunables.  When creating a new memcg cache copy, we'd like to preserve any
      tunables the parent cache already had.
      
      This can be done by tapping into the store attribute function provided by
      the allocator.  We of course don't need to mess with read-only fields.
      Since the attributes can have multiple types and are stored internally by
      sysfs, the best strategy is to issue a ->show() in the root cache, and
      then ->store() in the memcg cache.
      
      The drawback is that sysfs can allocate up to a page in buffering
      for show(), which we are likely not to need, but also can't guarantee.  To
      avoid always allocating a page for that, we can update the caches at store
      time with the maximum attribute size ever stored to the root cache.  We
      will then get a buffer big enough to hold it.  The corollary to this is
      that if no stores happened, nothing will be propagated.
      
      It can also happen that a root cache has its tunables updated during
      normal system operation.  In this case, we will propagate the change to
      all caches that are already active.
      
      [akpm@linux-foundation.org: tweak code to avoid __maybe_unused]
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      107dab5c
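
The show-then-store idea from the message above can be sketched in userspace C. The structures, the fixed 64-byte buffer and all function names are hypothetical stand-ins; `min_partial` is used only as an example of a writable tunable. An attribute without a store callback models a read-only field and is skipped.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical cache with a single tunable. */
struct cache_sketch { unsigned long min_partial; };

/* Hypothetical attribute: text-based show/store, store == NULL if read-only. */
struct attr_sketch {
	int (*show)(const struct cache_sketch *c, char *buf, size_t len);
	int (*store)(struct cache_sketch *c, const char *buf);
};

static int min_partial_show(const struct cache_sketch *c, char *buf, size_t len)
{
	return snprintf(buf, len, "%lu\n", c->min_partial);
}

static int min_partial_store(struct cache_sketch *c, const char *buf)
{
	c->min_partial = strtoul(buf, NULL, 10);
	return 0;
}

static const struct attr_sketch min_partial_attr = {
	min_partial_show, min_partial_store
};

/* Copy one tunable: show() on the root cache, store() on the child. */
static void propagate_attr(const struct attr_sketch *a,
			   const struct cache_sketch *root,
			   struct cache_sketch *child)
{
	char buf[64];	/* assumed large enough for this sketch */

	if (!a->store)
		return;	/* read-only fields are not propagated */
	if (a->show(root, buf, sizeof(buf)) < 0)
		return;
	a->store(child, buf);
}
```

Going through the text interface means the propagation code does not need to know each attribute's internal type, which is the reason the message gives for choosing this route.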
    • Glauber Costa's avatar
      slab: propagate tunable values · 943a451a
      Glauber Costa authored
      SLAB allows us to tune a particular cache behavior with tunables.  When
      creating a new memcg cache copy, we'd like to preserve any tunables the
      parent cache already had.
      
      This could be done by an explicit call to do_tune_cpucache() after the
      cache is created.  But this is not very convenient now that the caches are
      created from common code, since this function is SLAB-specific.
      
      Another method of doing that is taking advantage of the fact that
      do_tune_cpucache() is always called from enable_cpucache(), which is
      called at cache initialization.  We can just preset the values, and then
      things work as expected.
      
      It can also happen that a root cache has its tunables updated during
      normal system operation.  In this case, we will propagate the change to
      all caches that are already active.
      
      This change will require us to move the assignment of root_cache in
      memcg_params a bit earlier.  We need this to be already set - which
      memcg_kmem_register_cache will do - when we reach __kmem_cache_create().
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      943a451a
    • Glauber Costa's avatar
      memcg: aggregate memcg cache values in slabinfo · 749c5415
      Glauber Costa authored
      When we create caches in memcgs, we need to display their usage
      information somewhere.  We'll adopt a scheme similar to /proc/meminfo,
      with aggregate totals shown in the global file, and per-group information
      stored in the group itself.
      
      For the time being, only reads are allowed in the per-group cache.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      749c5415
    • Glauber Costa's avatar
      memcg/sl[au]b: shrink dead caches · 22933152
      Glauber Costa authored
      This means that when we destroy a memcg cache that happened to be empty,
      those caches may take a lot of time to go away: removing the memcg
      reference won't destroy them - because there are pending references, and
      the empty pages will stay there, until a shrinker is called upon for any
      reason.
      
      In this patch, we will call kmem_cache_shrink() for all dead caches that
      cannot be destroyed because of remaining pages.  After shrinking, it is
      possible that it could be freed.  If this is not the case, we'll schedule
      a lazy worker to keep trying.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      22933152
    • Glauber Costa's avatar
      memcg/sl[au]b: track all the memcg children of a kmem_cache · 7cf27982
      Glauber Costa authored
      This enables us to remove all the children of a kmem_cache being
      destroyed, if for example the kernel module it's being used in gets
      unloaded.  Otherwise, the children will still point to the destroyed
      parent.
      Signed-off-by: Suleiman Souhlal <suleiman@google.com>
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7cf27982
    • Glauber Costa's avatar
      memcg: destroy memcg caches · 1f458cbf
      Glauber Costa authored
      Implement destruction of memcg caches.  Right now, only caches where our
      reference counter is the last remaining are deleted.  If there are any
      other reference counters around, we just leave the caches lying around
      until they go away.
      
      When that happens, a destruction function is called from the cache code.
      Caches are only destroyed in process context, so we queue them up for
      later processing in the general case.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1f458cbf
    • Glauber Costa's avatar
      sl[au]b: allocate objects from memcg cache · d79923fa
      Glauber Costa authored
      We are able to match a cache allocation to a particular memcg.  If the
      task doesn't change groups during the allocation itself (a rare event),
      this will give us a good picture of which group is the first to touch a
      cache page.
      
      This patch uses the now available infrastructure by calling
      memcg_kmem_get_cache() before all the cache allocations.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d79923fa
    • Glauber Costa's avatar
      sl[au]b: always get the cache from its page in kmem_cache_free() · b9ce5ef4
      Glauber Costa authored
      struct page already has this information.  If we start chaining caches,
      this information will always be more trustworthy than whatever is passed
      into the function.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b9ce5ef4
    • Glauber Costa's avatar
      memcg: skip memcg kmem allocations in specified code regions · 0e9d92f2
      Glauber Costa authored
      Create a mechanism that skips memcg allocations during certain pieces of
      our core code.  It basically works in the same way as
      preempt_disable()/preempt_enable(): by marking a region within which all
      allocations will be accounted to the root memcg.
      
      We need this to prevent races in early cache creation, when we
      allocate data using caches that are not necessarily created already.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0e9d92f2
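As a rough illustration of the region marking described in 0e9d92f2, the sketch below models a nestable skip counter in plain userspace C. The names `toy_task`, `toy_stop_kmem_account()`, `toy_resume_kmem_account()` and `toy_charge_target()` are hypothetical and are not taken from the patch; only the preempt_disable()-style nesting idea comes from the commit message.

```c
#include <assert.h>

#define TOY_ROOT_MEMCG 0   /* hypothetical id standing in for the root memcg */

struct toy_task {
    int memcg_id;     /* group the task currently belongs to */
    int skip_depth;   /* > 0 while inside a "do not account" region */
};

/* Enter a region in which allocations go to the root group. Nestable. */
void toy_stop_kmem_account(struct toy_task *t)
{
    t->skip_depth++;
}

/* Leave the region; accounting resumes only when every level has exited. */
void toy_resume_kmem_account(struct toy_task *t)
{
    assert(t->skip_depth > 0);
    t->skip_depth--;
}

/* Which group would an allocation made right now be charged to? */
int toy_charge_target(const struct toy_task *t)
{
    return t->skip_depth > 0 ? TOY_ROOT_MEMCG : t->memcg_id;
}
```

A counter rather than a boolean lets nested regions compose, which is the property the comparison with preempt_disable()/preempt_enable() suggests.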
    • Glauber Costa's avatar
      memcg: infrastructure to match an allocation to the right cache · d7f25f8a
      Glauber Costa authored
      The page allocator is able to bind a page to a memcg when it is
      allocated.  But for the caches, we'd like to have as many objects as
      possible in a page belonging to the same cache.
      
      This is done in this patch by calling memcg_kmem_get_cache() at the
      beginning of every allocation function.  This function is patched out by
      static branches when the kernel memory controller is not being used.
      
      It assumes that the task allocating, which determines the memcg in the
      page allocator, belongs to the same cgroup throughout the whole process.
      Misaccounting can happen if the task calls memcg_kmem_get_cache() while
      belonging to a cgroup, and later on changes.  This is considered
      acceptable, and should only happen upon task migration.
      
      Before the cache is created by the memcg core, there is also a possible
      imbalance: the task belongs to a memcg, but the cache being allocated from
      is the global cache, since the child cache is not yet guaranteed to be
      ready.  This case is also fine, since in this case the GFP_KMEMCG will not
      be passed and the page allocator will not attempt any cgroup accounting.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d7f25f8a
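The cache selection described in d7f25f8a, including the fallback to the global cache while the child cache is not ready, can be sketched in userspace C roughly as follows. This is only a model under stated assumptions: `toy_cache`, `toy_get_cache()`, `TOY_MAX_MEMCGS` and the use of a negative index for a task in the root group are all hypothetical, and the real memcg_kmem_get_cache() is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_MEMCGS 4   /* hypothetical fixed size, for illustration only */

struct toy_cache {
    const char *name;
    /* Per-group child caches; NULL until the child has been created. */
    struct toy_cache *memcg_caches[TOY_MAX_MEMCGS];
};

/*
 * Pick the cache to allocate from. A negative memcg_idx stands for a
 * task in the root group. If accounting is off, or the child cache does
 * not exist yet, fall back to the root (global) cache.
 */
struct toy_cache *toy_get_cache(struct toy_cache *root, int memcg_idx,
                                bool kmem_enabled)
{
    struct toy_cache *child;

    assert(root != NULL);
    if (!kmem_enabled || memcg_idx < 0 || memcg_idx >= TOY_MAX_MEMCGS)
        return root;

    child = root->memcg_caches[memcg_idx];
    return child ? child : root;
}
```

Returning the root cache in the not-yet-ready case mirrors the commit's remark that such allocations are simply not accounted.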
    • Glauber Costa's avatar
      memcg: allocate memory for memcg caches whenever a new memcg appears · 55007d84
      Glauber Costa authored
      Every cache that is considered a root cache (basically the "original"
      caches, tied to the root memcg/no-memcg) will have an array that should be
      large enough to store a cache pointer per each memcg in the system.
      
      Theoretically, this is as high as 1 << sizeof(css_id), which is currently
      in the 64k pointers range.  Most of the time, we won't be using that much.
      
      What goes in this patch is a simple scheme to dynamically allocate such
      an array, in order to minimize memory usage for memcg caches.  Because we
      would also like to avoid allocations all the time, at least for now, the
      array will only grow.  It will tend to be big enough to hold the maximum
      number of kmem-limited memcgs ever achieved.
      
      We'll allocate it to be a minimum of 64 kmem-limited memcgs.  When we have
      more than that, we'll start doubling the size of this array every time the
      limit is reached.
      
      Because we are only considering kmem limited memcgs, a natural point for
      this to happen is when we write to the limit.  At that point, we already
      have set_limit_mutex held, so that will become our natural synchronization
      mechanism.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      55007d84
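The grow-only sizing policy from 55007d84 (start at 64 slots, double when the limit is reached, never shrink) might look like this in a userspace sketch. `toy_cache_array`, `toy_array_reserve()` and `TOY_MIN_SLOTS` are hypothetical names, and the locking the commit mentions (set_limit_mutex) is deliberately left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_MIN_SLOTS 64   /* minimum size named in the commit message */

struct toy_cache_array {
    void **slots;   /* one cache pointer per kmem-limited group */
    size_t size;    /* number of slots currently allocated */
};

/*
 * Make sure the array can hold num_groups entries. The array only grows:
 * it starts at TOY_MIN_SLOTS and doubles until it is large enough.
 * Returns 0 on success and -1 if the allocation fails.
 */
int toy_array_reserve(struct toy_cache_array *a, size_t num_groups)
{
    size_t new_size;
    void **new_slots;

    assert(a != NULL);
    if (num_groups <= a->size)
        return 0;               /* already big enough; never shrink */

    new_size = a->size ? a->size : TOY_MIN_SLOTS;
    while (new_size < num_groups)
        new_size *= 2;

    new_slots = calloc(new_size, sizeof(*new_slots));
    if (!new_slots)
        return -1;
    if (a->slots)
        memcpy(new_slots, a->slots, a->size * sizeof(*new_slots));
    free(a->slots);

    a->slots = new_slots;
    a->size = new_size;
    return 0;
}
```

Doubling keeps the number of reallocations logarithmic in the peak number of groups, at the cost of some unused slots.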
    • Glauber Costa's avatar
      slab/slub: consider a memcg parameter in kmem_create_cache · 2633d7a0
      Glauber Costa authored
      Allow a memcg parameter to be passed during cache creation.  When the slub
      allocator is being used, it will only merge caches that belong to the same
      memcg.  We'll do this by scanning the global list, and then translating
      the cache to a memcg-specific cache.
      
      The default function is created as a wrapper, passing NULL to the memcg
      version.  We only merge caches that belong to the same memcg.
      
      A helper, memcg_css_id, is provided because slub needs a unique cache
      name for sysfs.  Since this is visible, but not the canonical location for
      slab data, the cache name is not used; the css_id should suffice.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2633d7a0
    • Glauber Costa's avatar
      slab: annotate on-slab caches nodelist locks · 6ccfb5bc
      Glauber Costa authored
      We currently provide lockdep annotation for kmalloc caches, and also
      caches that have SLAB_DEBUG_OBJECTS enabled.  The reason for this is that
      we can quite frequently nest in the l3->list_lock lock, which is not
      something trivial to avoid.
      
      My proposal with this patch is to extend this to caches whose slab
      management object lives within the slab as well ("on_slab").  The need for
      this arose in the context of testing kmemcg-slab patches.  With such a
      patchset, we can have per-memcg kmalloc caches.  So the same path that led
      to nesting between kmalloc caches could then lead to in-memcg
      nesting.  Because they are not annotated, lockdep will trigger.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6ccfb5bc
    • Glauber Costa's avatar
      slab/slub: struct memcg_params · ba6c496e
      Glauber Costa authored
      For the kmem slab controller, we need to record some extra information in
      the kmem_cache structure.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Signed-off-by: Suleiman Souhlal <suleiman@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ba6c496e
    • Glauber Costa's avatar
      memcg: add documentation about the kmem controller · d5bdae7d
      Glauber Costa authored
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d5bdae7d
    • Glauber Costa's avatar
      fork: protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs · 2ad306b1
      Glauber Costa authored
      Because those architectures will draw their stacks directly from the page
      allocator, rather than the slab cache, we can directly pass __GFP_KMEMCG
      flag, and issue the corresponding free_pages.
      
      This code path is taken when the architecture doesn't define
      CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has
      THREAD_SIZE >= PAGE_SIZE.  Luckily, most - if not all - of the remaining
      architectures fall in this category.
      
      This will guarantee that every stack page is accounted to the memcg the
      process currently lives on, and will cause the allocations to fail if they
      go over the limit.
      
      For the time being, I am defining a new variant of THREADINFO_GFP, not to
      mess with the other path.  Once the slab is also tracked by memcg, we can
      get rid of that flag.
      
      Tested to successfully protect against :(){ :|:& };:
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Acked-by: Frederic Weisbecker <fweisbec@redhat.com>
      Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2ad306b1
    • Glauber Costa's avatar
      memcg: execute the whole memcg freeing in free_worker() · c8b2a36f
      Glauber Costa authored
      A lot of the initialization we do in mem_cgroup_create() is done with
      softirqs enabled.  This includes grabbing a css id, which holds
      &ss->id_lock->rlock, and the per-zone trees, which hold
      rtpz->lock->rlock.  All of those signal to the lockdep mechanism that
      those locks can be used in SOFTIRQ-ON-W context.
      
      This means that the freeing of memcg structure must happen in a
      compatible context, otherwise we'll get a deadlock, like the one below,
      caught by lockdep:
      
         free_accounted_pages+0x47/0x4c
         free_task+0x31/0x5c
         __put_task_struct+0xc2/0xdb
         put_task_struct+0x1e/0x22
         delayed_put_task_struct+0x7a/0x98
         __rcu_process_callbacks+0x269/0x3df
         rcu_process_callbacks+0x31/0x5b
         __do_softirq+0x122/0x277
      
      This usage pattern could not be triggered before kmem came into play.
      With the introduction of kmem stack handling, it is possible that we call
      the last mem_cgroup_put() from the task destructor, which is run in an rcu
      callback.  Such callbacks are run with softirqs disabled, leading to the
      offensive usage pattern.
      
      In general, we have little, if any, means to guarantee in which context
      the last memcg_put will happen.  The best we can do is test it and try to
      make sure no invalid context releases are happening.  But as we add more
      code to memcg, the possible interactions grow in number and expose more
      ways to get context conflicts.  One thing to keep in mind is that part of
      the freeing process, such as vfree(), is already deferred to a worker
      because it can only be called from process context.
      
      For the moment, the only two functions we really need moved away are:
      
        * free_css_id(), and
        * mem_cgroup_remove_from_trees().
      
      But because the latter accesses per-zone info,
      free_mem_cgroup_per_zone_info() needs to be moved as well.  With that, we
      are left with the per_cpu stats only.  Better move it all.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Tested-by: Greg Thelen <gthelen@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c8b2a36f
    • Glauber Costa's avatar
      memcg: allow a memcg with kmem charges to be destructed · bea207c8
      Glauber Costa authored
      Because the ultimate goal of the kmem tracking in memcg is to track slab
      pages as well, we can't guarantee that we'll always be able to point a
      page to a particular process, and migrate the charges along with it -
      since in the common case, a page will contain data belonging to multiple
      processes.
      
      Because of that, when we destroy a memcg, we only make sure the
      destruction will succeed by discounting the kmem charges from the user
      charges when we try to empty the cgroup.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bea207c8
    • Glauber Costa's avatar
      memcg: use static branches when code not in use · a8964b9b
      Glauber Costa authored
      We can use static branches to patch the code in or out when not used.
      
      Because the _ACTIVE bit on kmem_accounted is only set after the increment
      is done, we guarantee that the root memcg will always be selected for kmem
      charges until all call sites are patched (see memcg_kmem_enabled).  This
      guarantees that no mischarges are applied.
      
      Static branch decrement happens when the last reference count from the
      kmem accounting in memcg dies.  This will only happen when the charges
      drop down to 0.
      
      When that happens, we need to disable the static branch only on those
      memcgs that enabled it.  To achieve this, we would be forced to complicate
      the code by keeping track of which memcgs were the ones that actually
      enabled limits, and which ones got it from its parents.
      
      It is a lot simpler just to do static_key_slow_inc() on every child
      that is accounted.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a8964b9b
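The reasoning in a8964b9b for calling static_key_slow_inc() on every accounted child can be illustrated with a plain user count. The sketch below is a userspace model only: `toy_static_key` and its helpers are hypothetical, and real static keys patch code at the call sites, which is not modelled here.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a static key: enabled while at least one user holds it. */
struct toy_static_key {
    int users;
};

/* Called once for every group that becomes kmem-accounted. */
void toy_key_inc(struct toy_static_key *k)
{
    k->users++;
}

/* Called once when such a group's last kmem charge goes away. */
void toy_key_dec(struct toy_static_key *k)
{
    assert(k->users > 0);
    k->users--;
}

bool toy_key_enabled(const struct toy_static_key *k)
{
    return k->users > 0;
}
```

Because every accounted group takes its own count, nobody has to remember whether a group enabled the limit itself or inherited it from a parent; each group just drops the count it took.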
    • Glauber Costa's avatar
      memcg: kmem accounting lifecycle management · 7de37682
      Glauber Costa authored
      Because kmem charges can outlive the cgroup, we need to make sure that we
      won't free the memcg structure while charges are still in flight.  For
      reviewing simplicity, the charge functions will issue mem_cgroup_get() at
      every charge, and mem_cgroup_put() at every uncharge.
      
      This can get expensive, however, and we can do better.  mem_cgroup_get()
      only really needs to be issued once: when the first limit is set.  In the
      same spirit, we only need to issue mem_cgroup_put() when the last charge
      is gone.
      
      We'll need an extra bit in kmem_account_flags for that:
      KMEM_ACCOUNTED_DEAD.  It will be set when the cgroup dies, if there are
      charges in the group.  If there aren't, we can proceed right away.
      
      Our uncharge function will have to test that bit every time the charges
      drop to 0.  Because that is not the likely output of res_counter_uncharge,
      this should not impose a big hit on us: it is certainly much better than a
      reference count decrease at every operation.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7de37682
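The lifecycle scheme from 7de37682 (one reference taken when the first limit is set, a "dead" bit set if the group dies with charges outstanding, and the reference dropped by whichever uncharge reaches zero) can be sketched as below. This is a single-threaded userspace model with hypothetical names (`toy_memcg`, `toy_set_kmem_limit()`, and so on); the kernel version needs atomic flag and counter operations that are not shown.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_memcg {
    long refs;         /* reference count on the structure */
    long kmem_usage;   /* outstanding kmem charges */
    bool kmem_active;  /* a kmem limit has been set at least once */
    bool kmem_dead;    /* group died while charges were still in flight */
};

/* Take the kmem reference once, when the first limit is set. */
void toy_set_kmem_limit(struct toy_memcg *m)
{
    if (!m->kmem_active) {
        m->kmem_active = true;
        m->refs++;
    }
}

/* Charging no longer touches the reference count. */
void toy_kmem_charge(struct toy_memcg *m, long n)
{
    m->kmem_usage += n;
}

/* The uncharge that reaches zero on a dead group drops the reference. */
void toy_kmem_uncharge(struct toy_memcg *m, long n)
{
    assert(m->kmem_usage >= n);
    m->kmem_usage -= n;
    if (m->kmem_usage == 0 && m->kmem_dead) {
        m->kmem_dead = false;
        m->refs--;
    }
}

/* Group removal: drop the reference now, or defer it via the dead bit. */
void toy_cgroup_dies(struct toy_memcg *m)
{
    if (!m->kmem_active)
        return;
    if (m->kmem_usage == 0)
        m->refs--;
    else
        m->kmem_dead = true;
}
```

The zero check sits on the uncommon path of uncharge, which is the trade-off the commit message argues for over a get/put pair on every charge.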
    • Glauber Costa's avatar
      res_counter: return amount of charges after res_counter_uncharge() · 50bdd430
      Glauber Costa authored
      It is useful to know how many charges are still left after a call to
      res_counter_uncharge.  While it is possible to issue a res_counter_read
      after uncharge, this can be racy.
      
      If we need, for instance, to take some action when the counters drop down
      to 0, only one of the callers should see it.  This is the same semantics
      as the atomic variables in the kernel.
      
      Since the current return value is void, we don't need to worry about
      anything breaking due to this change: nobody relied on that, and only
      users appearing from now on will be checking this value.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      50bdd430
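To make the race argument in 50bdd430 concrete: if uncharge returns the remaining amount as part of the same atomic step, exactly one caller can observe zero, whereas a separate read after the uncharge could show zero to several callers. The sketch below uses C11 atomics in userspace; `toy_counter` and its helpers are hypothetical, and the kernel's res_counter has its own locking rather than this mechanism.

```c
#include <assert.h>
#include <stdatomic.h>

struct toy_counter {
    atomic_ulong usage;   /* outstanding charges */
};

void toy_counter_charge(struct toy_counter *c, unsigned long val)
{
    atomic_fetch_add(&c->usage, val);
}

/*
 * Uncharge and report what is left, in one atomic step. The value is
 * derived from the fetch-and-subtract itself, so two concurrent callers
 * cannot both be told that the counter dropped to zero.
 */
unsigned long toy_counter_uncharge(struct toy_counter *c, unsigned long val)
{
    unsigned long old = atomic_fetch_sub(&c->usage, val);

    assert(old >= val);   /* never uncharge more than was charged */
    return old - val;
}
```

A caller that needs to act when the counter empties can then simply test the return value for zero.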
    • Glauber Costa's avatar
      mm: allocate kernel pages to the right memcg · 6a1a0d3b
      Glauber Costa authored
      When a process tries to allocate a page with the __GFP_KMEMCG flag, the
      page allocator will call the corresponding memcg functions to validate
      the allocation.  Tasks in the root memcg can always proceed.
      
      To avoid adding markers to the page (and the kmem flag that would
      necessarily follow), as well as doing page_cgroup lookups for no reason,
      whoever marks its allocations with the __GFP_KMEMCG flag is responsible
      for telling the page allocator that this is such an allocation at
      free_pages() time.  This is done by invoking
      __free_accounted_pages() and free_accounted_pages().
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6a1a0d3b
    • Glauber Costa's avatar
      memcg: kmem controller infrastructure · 7ae1e1d0
      Glauber Costa authored
      Introduce infrastructure for tracking kernel memory pages to a given
      memcg.  This will happen whenever the caller includes the
      __GFP_KMEMCG flag, and the task belongs to a memcg other than the root.
      
      In memcontrol.h those functions are wrapped in inline accessors.  The idea
      is to later patch those with static branches, so we don't incur any
      overhead when no mem cgroups with limited kmem are being used.
      
      Users of this functionality shall interact with the memcg core code
      through the following functions:
      
      memcg_kmem_newpage_charge: will return true if the group can handle the
                                 allocation. At this point, struct page is not
                                 yet allocated.
      
      memcg_kmem_commit_charge: will either revert the charge, if struct page
                                allocation failed, or embed memcg information
                                into page_cgroup.
      
      memcg_kmem_uncharge_page: called at free time, will revert the charge.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7ae1e1d0
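The three entry points listed in 7ae1e1d0 form a charge / commit / uncharge protocol. The userspace sketch below models that sequence with hypothetical types and names (`toy_group`, `toy_page`, `toy_newpage_charge()`, `toy_commit_charge()`, `toy_uncharge_page()`); the signatures are assumptions made for illustration and are not those of the kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_group {
    long usage;   /* pages currently charged */
    long limit;   /* maximum pages allowed */
};

struct toy_page {
    struct toy_group *owner;   /* group the page is accounted to, if any */
};

/* Step 1: before the page exists, check that the group can take it. */
bool toy_newpage_charge(struct toy_group *g, long nr)
{
    if (g->usage + nr > g->limit)
        return false;
    g->usage += nr;
    return true;
}

/* Step 2: on allocation failure revert the charge, else record the owner. */
void toy_commit_charge(struct toy_page *page, struct toy_group *g, long nr)
{
    if (!page) {
        g->usage -= nr;
        return;
    }
    page->owner = g;
}

/* Step 3: at free time, give the charge back to the recorded owner. */
void toy_uncharge_page(struct toy_page *page, long nr)
{
    if (!page->owner)
        return;   /* page was never accounted */
    assert(page->owner->usage >= nr);
    page->owner->usage -= nr;
    page->owner = NULL;
}
```

Splitting charge from commit lets the limit check happen before the allocation is attempted, so a group that is over its limit never causes a page to be allocated at all.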
    • Glauber Costa's avatar
      mm: add a __GFP_KMEMCG flag · 7a64bf05
      Glauber Costa authored
      This flag is used to indicate to the callees that this allocation is a
      kernel allocation in process context, and should be accounted to current's
      memcg.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Frederic Weisbecker <fweisbec@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7a64bf05