Commits · 19529784785d8fd164079e1008c8b1970d6062ee · Kirill Smelkov / linux

19 Dec, 2012 16 commits

hwmon: (it87) Support PECI for additional chips · 19529784

Guenter Roeck authored Dec 19, 2012

Extend support for reporting and selecting PECI temperature sensors
to IT8718, IT8720, IT8782, and IT8783. For IT8721, report the sensor
type for temp2 as Intel PECI (6) if the chip is configured to report
the PCH temperature.
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jean Delvare <khali@linux-fr.org>

19529784

hwmon: (it87) Report thermal sensor type as Intel PECI if appropriate · 5d8d2f2b

Guenter Roeck authored Dec 19, 2012

IT8721 and IT8728 support Intel PECI temperature reporting. Each sensor
can be programmed to display the temperature reported on the PECI interface.

If configured for Intel PECI, the driver reported the wrong sensor type for
the respective thermal sensor. Fix the code to correctly report it as
"Intel PECI (6)".
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jean Delvare <khali@linux-fr.org>

5d8d2f2b

hwmon: (it87) Manage device specific features with table · 483db43e

Guenter Roeck authored Dec 19, 2012

This simplifies the code, improves runtime performance, reduces
code size (about 280 bytes on x86_64), and makes it easier
to add support for new devices.
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jean Delvare <khali@linux-fr.org>

483db43e
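
The table-driven approach described in this commit can be illustrated with a minimal user-space sketch. The chip identifiers and feature flags below are hypothetical placeholders, not the driver's actual definitions; the point is only that one lookup replaces a chain of per-chip comparisons.

```c
#include <stdbool.h>

/* Hypothetical feature bits; the real driver's flags may differ. */
#define FEAT_12MV_ADC   (1u << 0)
#define FEAT_16BIT_FANS (1u << 1)
#define FEAT_TEMP_PECI  (1u << 2)

/* Hypothetical chip identifiers, used as indices into the table. */
enum chip_id { CHIP_A, CHIP_B, CHIP_C, CHIP_COUNT };

struct chip_info {
	const char *name;
	unsigned int features;
};

/* One row per supported device; adding a chip means adding a row. */
static const struct chip_info chip_table[CHIP_COUNT] = {
	[CHIP_A] = { "chip_a", 0 },
	[CHIP_B] = { "chip_b", FEAT_16BIT_FANS },
	[CHIP_C] = { "chip_c", FEAT_12MV_ADC | FEAT_16BIT_FANS | FEAT_TEMP_PECI },
};

/* A single table lookup answers "does this chip support X?". */
static bool chip_has(enum chip_id id, unsigned int feature)
{
	return (chip_table[id].features & feature) != 0;
}
```

With this layout, a feature test such as `chip_has(CHIP_C, FEAT_TEMP_PECI)` is one load and one mask rather than a comparison against every chip type.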

hwmon: (it87) Replace pwm group macro with direct attribute definitions · c4458db3

Guenter Roeck authored Dec 19, 2012

Fix checkpatch error:

ERROR: Macros with multiple statements should be enclosed in a do - while loop
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jean Delvare <khali@linux-fr.org>

c4458db3

hwmon: (it87) Avoid quoted string splits across lines · 1d9bcf6a

Guenter Roeck authored Dec 19, 2012

Fix the respective checkpatch warnings.
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jean Delvare <khali@linux-fr.org>

1d9bcf6a

hwmon: (it87) Save fan registers in 2-dimensional array · e1169ba0

Guenter Roeck authored Dec 19, 2012

Also unify fan functions to use the same code for 8 and 16 bit fans.

This patch reduces code size by approximately 1,200 bytes on x86_64.
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jean Delvare <khali@linux-fr.org>

e1169ba0
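
As a rough illustration of why a 2-dimensional array shrinks the code, the sketch below keeps every fan value in one array indexed by fan number and value kind, so a single accessor pair serves all of them. The structure layout, index names, and fan count are assumptions made for the example, not the driver's real data structure.

```c
#include <stdint.h>

#define NUM_FANS 3	/* hypothetical fan count */

/* Hypothetical second-dimension indices: current reading and low limit. */
enum { FAN_INPUT = 0, FAN_MIN = 1, FAN_NUM_VALUES };

struct sensor_data {
	/* 16-bit storage holds both 8-bit and 16-bit fan readings. */
	uint16_t fan[NUM_FANS][FAN_NUM_VALUES];
};

/* One getter covers every (fan, value) pair. */
static uint16_t fan_get(const struct sensor_data *d, int nr, int index)
{
	return d->fan[nr][index];
}

/* One setter covers every (fan, value) pair. */
static void fan_set(struct sensor_data *d, int nr, int index, uint16_t val)
{
	d->fan[nr][index] = val;
}
```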

hwmon: (it87) Introduce support for tempX_offset sysfs attribute · 161d898a

Guenter Roeck authored Dec 19, 2012

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jean Delvare <khali@linux-fr.org>

161d898a

hwmon: (it87) Replace macro defining tempX_type sensors with direct definitions · 2cece01f

Guenter Roeck authored Dec 19, 2012

The macro name show_sensor_offset is confusing since it relates to the sensor
type, not an offset - even more so when we introduce offset attributes later on.
Replace it with direct definitions, and replace the show_sensor/set_sensor
function names with show_temp_type/set_temp_type. This also resolves a
checkpatch error.
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jean Delvare <khali@linux-fr.org>

2cece01f

hwmon: (it87) Save voltage register values in 2-dimensional array · 929c6a56

Guenter Roeck authored Dec 19, 2012

Reduces code size (more than 600 bytes on x86_64),
and gets rid of some checkpatch errors.
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jean Delvare <khali@linux-fr.org>

929c6a56

hwmon: (it87) Save temperature registers in 2-dimensional array · 60ca385a

Guenter Roeck authored Dec 19, 2012

Cleaner code, fewer checkpatch errors, and reduced code size
(saves more than 500 bytes on x86-64).
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jean Delvare <khali@linux-fr.org>

60ca385a

hwmon: (w83627ehf) Get rid of smatch warnings · 45633fb3

Jean Delvare authored Dec 19, 2012

The smatch static code analyzer complains:

drivers/hwmon/w83627ehf.c:911 w83627ehf_update_device() error: buffer overflow 'W83627EHF_REG_TEMP_OFFSET' 3 <= 8
drivers/hwmon/w83627ehf.c:909 w83627ehf_update_device() error: buffer overflow 'data->temp_offset' 3 <= 8
drivers/hwmon/w83627ehf.c:2672 w83627ehf_resume() error: buffer overflow 'W83627EHF_REG_TEMP_OFFSET' 3 <= 8
drivers/hwmon/w83627ehf.c:2673 w83627ehf_resume() error: buffer overflow 'data->temp_offset' 3 <= 8

A deeper analysis of the code shows that these are false positives, as
only the lower 3 bits of data->have_temp_offset can be set so the
write is never attempted with i >= 3. However this shows that the code
isn't very robust and future changes could easily introduce a buffer
overflow. So let's add a safety check to prevent that and make smatch
happy.
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Cc: Peter Huewe <PeterHuewe@gmx.de>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>

45633fb3
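
The kind of safety check this commit describes can be sketched in user space as follows. The register values, array names, and sensor count are placeholders chosen for the example, not the driver's definitions; the explicit bound test is what keeps the index in range even if a later change allows a higher mask bit to be set.

```c
#include <stddef.h>
#include <stdint.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))
#define NUM_TEMP 9	/* hypothetical: more sensors than offset registers */

/* Placeholder register addresses; only three sensors have an offset register. */
static const uint16_t reg_temp_offset[] = { 0x10, 0x11, 0x12 };

/*
 * Collect the offset registers for every sensor flagged in have_mask.
 * regs_out must have room for ARRAY_SIZE(reg_temp_offset) entries.
 * Returns the number of registers stored.
 */
static size_t offsets_to_read(unsigned int have_mask, uint16_t *regs_out)
{
	size_t n = 0;

	for (size_t i = 0; i < NUM_TEMP; i++) {
		if (!(have_mask & (1u << i)))
			continue;
		/* Safety check: never index past the register table. */
		if (i >= ARRAY_SIZE(reg_temp_offset))
			continue;
		regs_out[n++] = reg_temp_offset[i];
	}
	return n;
}
```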

hwmon: (w83627hf) Don't touch nonexistent I2C address registers · 8f3c7c54

Jean Delvare authored Dec 19, 2012

Only the W83627HF could be accessed through I2C. All other supported
chips are LPC-only, so they do not have I2C address registers. Don't
write to nonexistent or reserved registers on these chips.
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Acked-by: Guenter Roeck <linux@roeck-us.net>

8f3c7c54

hwmon: (w83627ehf) Add support for suspend · 7e630bb5

Jean Delvare authored Dec 19, 2012

On suspend some register values are lost, most notably the Value RAM
areas but also other limits and settings. Restore them on resume.
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>

7e630bb5

hwmon: (w83627hf) Add support for suspend · 275b7d6e

Jean Delvare authored Dec 19, 2012

On suspend some register values are lost, most notably the Value RAM
areas but also other limits. Restore them on resume. On top of that,
some fixups are needed to work around BIOS bugs, in particular when
the BIOS omits running the same initialization sequence on resume
that it does after boot. In that case we have to carry initialization
over suspend.
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>

275b7d6e

hwmon: Fix PCI device reference leak in quirk · d6dab7dd

Jean Delvare authored Dec 19, 2012

Thankfully this only affects systems with one specific south bridge
and is most probably harmless unless the hwmon module is heavily
cycled.
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Acked-by: Guenter Roeck <linux@roeck-us.net>

d6dab7dd

Merge branch 'i2c-embedded/for-next' of git://git.pengutronix.de/git/wsa/linux · 752451f0

Linus Torvalds authored Dec 18, 2012

Pull i2c-embedded changes from Wolfram Sang:
 - CBUS driver (an I2C variant)
 - continued rework of the omap driver
 - s3c2410 gets lots of fixes and gains pinctrl support
 - at91 gains DMA support
 - the GPIO muxer gains devicetree probing
 - typical fixes and additions all over

* 'i2c-embedded/for-next' of git://git.pengutronix.de/git/wsa/linux: (45 commits)
  i2c: omap: Remove the OMAP_I2C_FLAG_RESET_REGS_POSTIDLE flag
  i2c: at91: add dma support
  i2c: at91: change struct members indentation
  i2c: at91: fix compilation warning
  i2c: mxs: Do not disable the I2C SMBus quick mode
  i2c: mxs: Handle i2c DMA failure properly
  i2c: s3c2410: Remove recently introduced performance overheads
  i2c: ocores: Move grlib set/get functions into #ifdef CONFIG_OF block
  i2c: s3c2410: Add fix for i2c suspend/resume
  i2c: s3c2410: Fix code to free gpios
  i2c: i2c-cbus-gpio: introduce driver
  i2c: ocores: Add support for the GRLIB port of the controller and use function pointers for getreg and setreg functions
  i2c: ocores: Add irq support for sparc
  i2c: omap: Move the remove constraint
  ARM: dts: cfa10049: Add the i2c muxer buses to the CFA-10049
  i2c: s3c2410: do not special case HDMIPHY stuck bus detection
  i2c: s3c2410: use exponential back off while polling for bus idle
  i2c: s3c2410: do not generate STOP for QUIRK_HDMIPHY
  i2c: s3c2410: grab adapter lock while changing i2c clock
  i2c: s3c2410: Add support for pinctrl
  ...

752451f0

18 Dec, 2012 24 commits

Merge branch 'akpm' (more patches from Andrew) · 673ab878

Linus Torvalds authored Dec 18, 2012

Merge patches from Andrew Morton:
 "Most of the rest of MM, plus a few dribs and drabs.

  I still have quite a few irritating patches left around: ones with
  dubious testing results, lack of review, ones which should have gone
  via maintainer trees but the maintainers are slack, etc.

  I need to be more activist in getting these things wrapped up outside
  the merge window, but they're such a PITA."

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (48 commits)
  mm/vmscan.c: avoid possible deadlock caused by too_many_isolated()
  vmscan: comment too_many_isolated()
  mm/kmemleak.c: remove obsolete simple_strtoul
  mm/memory_hotplug.c: improve comments
  mm/hugetlb: create hugetlb cgroup file in hugetlb_init
  mm/mprotect.c: coding-style cleanups
  Documentation: ABI: /sys/devices/system/node/
  slub: drop mutex before deleting sysfs entry
  memcg: add comments clarifying aspects of cache attribute propagation
  kmem: add slab-specific documentation about the kmem controller
  slub: slub-specific propagation changes
  slab: propagate tunable values
  memcg: aggregate memcg cache values in slabinfo
  memcg/sl[au]b: shrink dead caches
  memcg/sl[au]b: track all the memcg children of a kmem_cache
  memcg: destroy memcg caches
  sl[au]b: allocate objects from memcg cache
  sl[au]b: always get the cache from its page in kmem_cache_free()
  memcg: skip memcg kmem allocations in specified code regions
  memcg: infrastructure to match an allocation to the right cache
  ...

673ab878

Merge tag 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging · d7b96ca5

Linus Torvalds authored Dec 18, 2012

Pull hwmon fixlet from Guenter Roeck:
 "Fix fallout from __devexit removal"

* tag 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
  hwmon: (twl4030-madc-hwmon) Fix warning message caused by removal of __devexit

d7b96ca5

Merge git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile · 5b3040a4

Linus Torvalds authored Dec 18, 2012

Pull tile updates from Chris Metcalf:
 "These are a smattering of minor changes from Tilera and other folks,
  mostly in the ptrace area."

* git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile:
  arch/tile: set CORE_DUMP_USE_REGSET on tile
  arch/tile: implement arch_ptrace using user_regset on tile
  arch/tile: implement user_regset interface on tile
  arch/tile: clean up tile-specific PTRACE_SETOPTIONS
  arch/tile: provide PT_FLAGS_COMPAT value in pt_regs
  tile/PCI: use for_each_pci_dev to simplify the code
  tilegx: remove __init from pci fixup hook

5b3040a4

mm/vmscan.c: avoid possible deadlock caused by too_many_isolated() · 3cf23841

Fengguang Wu authored Dec 18, 2012

Neil found that if too_many_isolated() returns true while performing
direct reclaim we can end up waiting for other threads to complete their
direct reclaim.  If those threads are allowed to enter the FS or IO to
free memory, but this thread is not, then it is possible that those
threads will be waiting on this thread and so we get a circular deadlock.

some task enters direct reclaim with GFP_KERNEL
  => too_many_isolated() false
    => vmscan and run into dirty pages
      => pageout()
        => take some FS lock
          => fs/block code does GFP_NOIO allocation
            => enter direct reclaim again
              => too_many_isolated() true
                => waiting for others to progress, however the other
                   tasks may be circular waiting for the FS lock..

The fix is to let !__GFP_IO and !__GFP_FS direct reclaims enjoy higher
priority than normal ones, by lowering the throttle threshold for the
latter.

Allowing ~1/8 isolated pages in normal is large enough.  For example, for
a 1GB LRU list, that's ~128MB isolated pages, or 1k blocked tasks (each
isolates 32 4KB pages), or 64 blocked tasks per logical CPU (assuming 16
logical CPUs per NUMA node).  So it's not likely some CPU goes idle
waiting (when it could make progress) because of this limit: there are
many more sleeping reclaim tasks than CPUs, so the task may
well be blocked by some low level queue/lock anyway.

Now !GFP_IOFS reclaims won't be waiting for GFP_IOFS reclaims to progress.
They will be blocked only when there are too many concurrent !GFP_IOFS
reclaims, however that's very unlikely because the IO-less direct reclaims
are able to progress much faster, and they won't deadlock each other.
The threshold is raised high enough for them, so that there can be
sufficient parallel progress of !GFP_IOFS reclaims.

[akpm@linux-foundation.org: tweak comment]
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Torsten Kaiser <just.for.lkml@googlemail.com>
Tested-by: NeilBrown <neilb@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

3cf23841
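
The throttling rule and the arithmetic in this commit message can be sketched as a small user-space function. This is an illustration of the described behaviour, not the kernel source: the parameter `may_io_and_fs` is an assumed stand-in for "the allocation allows both __GFP_IO and __GFP_FS", and the page counts are passed in directly.

```c
#include <stdbool.h>

/*
 * Returns true when the caller should be throttled.
 * Normal reclaim (allowed to enter both FS and IO) is throttled once
 * isolated pages exceed 1/8 of the inactive list; restricted reclaim
 * keeps the full threshold and so is throttled much later.
 */
static bool too_many_isolated(unsigned long inactive,
			      unsigned long isolated,
			      bool may_io_and_fs)
{
	if (may_io_and_fs)
		inactive >>= 3;	/* lower the threshold to ~1/8 */

	return isolated > inactive;
}

/*
 * Worked numbers from the message, for a 1GB LRU list of 4KB pages:
 *   1GB / 4KB          = 262144 pages
 *   262144 >> 3        = 32768 pages (128MB) may be isolated
 *   32768 / 32 pages   = 1024 blocked tasks
 *   1024 / 16 CPUs     = 64 blocked tasks per logical CPU
 */
```

Because a restricted reclaimer compares against the full list size, it cannot end up waiting behind normal reclaimers that may themselves be blocked on a lock it holds.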

vmscan: comment too_many_isolated() · d37dd5dc

Fengguang Wu authored Dec 18, 2012

Comment "Why it's doing so" rather than "What it does" as proposed by
Andrew Morton.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

d37dd5dc

mm/kmemleak.c: remove obsolete simple_strtoul · dc053733

Abhijit Pawar authored Dec 18, 2012

Replace the obsolete simple_strtoul() with kstrtoul().
Signed-off-by: Abhijit Pawar <abhi.c.pawar@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

dc053733
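
The practical difference is that simple_strtoul() silently stops at the first invalid character and gives no overflow indication, while kstrtoul() returns an error code. Since kstrtoul() is only available inside the kernel, the sketch below is a user-space analogue built on the standard strtoul(); the helper name `parse_ulong_strict` is hypothetical, and it only approximates the strict behaviour (whole string must be a number, one trailing newline tolerated).

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical user-space helper approximating strict parsing.
 * Returns 0 and stores the value in *res on success,
 * -EINVAL for malformed input, -ERANGE on overflow.
 */
static int parse_ulong_strict(const char *s, int base, unsigned long *res)
{
	char *end;
	unsigned long val;

	/* strtoul() would accept leading whitespace and '-'; reject both. */
	if (s[0] == '-' || isspace((unsigned char)s[0]))
		return -EINVAL;

	errno = 0;
	val = strtoul(s, &end, base);
	if (errno == ERANGE)
		return -ERANGE;
	if (end == s)
		return -EINVAL;	/* no digits at all */

	if (*end == '\n')
		end++;		/* tolerate a single trailing newline */
	if (*end != '\0')
		return -EINVAL;	/* trailing garbage */

	*res = val;
	return 0;
}
```

Callers check the return value instead of inspecting an end pointer, which is the usage pattern the replacement encourages.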

mm/memory_hotplug.c: improve comments · 79a4dcef

Tang Chen authored Dec 18, 2012

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

79a4dcef

mm/hugetlb: create hugetlb cgroup file in hugetlb_init · 7179e7bf

Jianguo Wu authored Dec 18, 2012

If the kernel is built with CONFIG_HUGETLBFS=y, CONFIG_HUGETLB_PAGE=y and
CONFIG_CGROUP_HUGETLB=y and booted with the hugepagesz=xx option, the
system will fail to boot.

This failure is caused by the following code path:

  setup_hugepagesz
    hugetlb_add_hstate
      hugetlb_cgroup_file_init
        cgroup_add_cftypes
          kzalloc <--slab is *not available* yet

On this path slab is not available yet, so the memory allocation fails
and triggers the WARN_ON() in hugetlb_cgroup_file_init().

So I move hugetlb_cgroup_file_init() into hugetlb_init().

[akpm@linux-foundation.org: tweak coding-style, remove pointless __init on inlined function]
[akpm@linux-foundation.org: fix warning]
Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7179e7bf

mm/mprotect.c: coding-style cleanups · 7d12efae

Andrew Morton authored Dec 18, 2012

A few gremlins have recently crept in.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7d12efae

Documentation: ABI: /sys/devices/system/node/ · 5bbe1ec1

Davidlohr Bueso authored Dec 18, 2012

Describe NUMA node sysfs files/attributes.

Note that for the specific dates and contacts I couldn't find,
I left it as default for Oct 2002 and linux-mm.
Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

5bbe1ec1

slub: drop mutex before deleting sysfs entry · 5413dfba

Glauber Costa authored Dec 18, 2012

Sasha Levin recently reported a lockdep problem resulting from the new
attribute propagation introduced by kmemcg series.  In short, slab_mutex
will be called from within the sysfs attribute store function.  This will
create a dependency, that will later be held backwards when a cache is
destroyed - since destruction occurs with the slab_mutex held, and then
calls in to the sysfs directory removal function.

In this patch, I propose to adopt a strategy close to what
__kmem_cache_create does before calling sysfs_slab_add, and release the
lock before the call to sysfs_slab_remove.  This is pretty much the last
operation in the kmem_cache_shutdown() path, so we could do better by
splitting this and moving this call alone to later on.  This will fit
nicely when sysfs handling is consistent between all caches, but will look
weird now.

Lockdep info:

  ======================================================
  [ INFO: possible circular locking dependency detected ]
  3.7.0-rc4-next-20121106-sasha-00008-g353b62f #117 Tainted: G        W
  -------------------------------------------------------
  trinity-child13/6961 is trying to acquire lock:
   (s_active#43){++++.+}, at:  sysfs_addrm_finish+0x31/0x60

  but task is already holding lock:
   (slab_mutex){+.+.+.}, at:  kmem_cache_destroy+0x22/0xe0

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:
  -> #1 (slab_mutex){+.+.+.}:
          lock_acquire+0x1aa/0x240
          __mutex_lock_common+0x59/0x5a0
          mutex_lock_nested+0x3f/0x50
          slab_attr_store+0xde/0x110
          sysfs_write_file+0xfa/0x150
          vfs_write+0xb0/0x180
          sys_pwrite64+0x60/0xb0
          tracesys+0xe1/0xe6
  -> #0 (s_active#43){++++.+}:
          __lock_acquire+0x14df/0x1ca0
          lock_acquire+0x1aa/0x240
          sysfs_deactivate+0x122/0x1a0
          sysfs_addrm_finish+0x31/0x60
          sysfs_remove_dir+0x89/0xd0
          kobject_del+0x16/0x40
          __kmem_cache_shutdown+0x40/0x60
          kmem_cache_destroy+0x40/0xe0
          mon_text_release+0x78/0xe0
          __fput+0x122/0x2d0
          ____fput+0x9/0x10
          task_work_run+0xbe/0x100
          do_exit+0x432/0xbd0
          do_group_exit+0x84/0xd0
          get_signal_to_deliver+0x81d/0x930
          do_signal+0x3a/0x950
          do_notify_resume+0x3e/0x90
          int_signal+0x12/0x17

  other info that might help us debug this:

   Possible unsafe locking scenario:

         CPU0                    CPU1
         ----                    ----
    lock(slab_mutex);
                                 lock(s_active#43);
                                 lock(slab_mutex);
    lock(s_active#43);

   *** DEADLOCK ***

  2 locks held by trinity-child13/6961:
   #0:  (mon_lock){+.+.+.}, at:  mon_text_release+0x25/0xe0
   #1:  (slab_mutex){+.+.+.}, at:  kmem_cache_destroy+0x22/0xe0

  stack backtrace:
  Pid: 6961, comm: trinity-child13 Tainted: G        W    3.7.0-rc4-next-20121106-sasha-00008-g353b62f #117
  Call Trace:
    print_circular_bug+0x1fb/0x20c
    __lock_acquire+0x14df/0x1ca0
    lock_acquire+0x1aa/0x240
    sysfs_deactivate+0x122/0x1a0
    sysfs_addrm_finish+0x31/0x60
    sysfs_remove_dir+0x89/0xd0
    kobject_del+0x16/0x40
    __kmem_cache_shutdown+0x40/0x60
    kmem_cache_destroy+0x40/0xe0
    mon_text_release+0x78/0xe0
    __fput+0x122/0x2d0
    ____fput+0x9/0x10
    task_work_run+0xbe/0x100
    do_exit+0x432/0xbd0
    do_group_exit+0x84/0xd0
    get_signal_to_deliver+0x81d/0x930
    do_signal+0x3a/0x950
    do_notify_resume+0x3e/0x90
    int_signal+0x12/0x17
Signed-off-by: Glauber Costa <glommer@parallels.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

5413dfba

memcg: add comments clarifying aspects of cache attribute propagation · ebe945c2

Glauber Costa authored Dec 18, 2012

This patch clarifies two aspects of cache attribute propagation.

First, the expected context for the for_each_memcg_cache macro in
memcontrol.h.  The usages already in the codebase are safe.  In mm/slub.c,
it is trivially safe because the lock is acquired right before the loop.
In mm/slab.c, it is less so: the lock is acquired by an outer function a
few steps back in the stack, so a VM_BUG_ON() is added to make sure it is
indeed safe.

A comment is also added to detail why we are returning the value of the
parent cache and ignoring the children's when we propagate the attributes.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ebe945c2

kmem: add slab-specific documentation about the kmem controller · 92e79349

Glauber Costa authored Dec 18, 2012

Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: JoonSoo Kim <js1304@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

92e79349

slub: slub-specific propagation changes · 107dab5c

Glauber Costa authored Dec 18, 2012

SLUB allows us to tune a particular cache behavior with sysfs-based
tunables.  When creating a new memcg cache copy, we'd like to preserve any
tunables the parent cache already had.

This can be done by tapping into the store attribute function provided by
the allocator.  We of course don't need to mess with read-only fields.
Since the attributes can have multiple types and are stored internally by
sysfs, the best strategy is to issue a ->show() in the root cache, and
then ->store() in the memcg cache.

The drawback is that sysfs can allocate up to a page of buffering for
show(), which we are unlikely to need but also can't rule out.  To
avoid always allocating a page for that, we can update the caches at store
time with the maximum attribute size ever stored to the root cache.  We
will then get a buffer big enough to hold it.  The corollary is
that if no stores happened, nothing will be propagated.

It can also happen that a root cache has its tunables updated during
normal system operation.  In this case, we will propagate the change to
all caches that are already active.

[akpm@linux-foundation.org: tweak code to avoid __maybe_unused]
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: JoonSoo Kim <js1304@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

107dab5c

slab: propagate tunable values · 943a451a

Glauber Costa authored Dec 18, 2012

SLAB allows us to tune a particular cache behavior with tunables.  When
creating a new memcg cache copy, we'd like to preserve any tunables the
parent cache already had.

This could be done by an explicit call to do_tune_cpucache() after the
cache is created.  But this is not very convenient now that the caches are
created from common code, since this function is SLAB-specific.

Another method of doing that is taking advantage of the fact that
do_tune_cpucache() is always called from enable_cpucache(), which is
called at cache initialization.  We can just preset the values, and then
things work as expected.

It can also happen that a root cache has its tunables updated during
normal system operation.  In this case, we will propagate the change to
all caches that are already active.

This change will require us to move the assignment of root_cache in
memcg_params a bit earlier.  We need this to be already set - which
memcg_kmem_register_cache will do - when we reach __kmem_cache_create()
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: JoonSoo Kim <js1304@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

943a451a

memcg: aggregate memcg cache values in slabinfo · 749c5415

Glauber Costa authored Dec 18, 2012

When we create caches in memcgs, we need to display their usage
information somewhere.  We'll adopt a scheme similar to /proc/meminfo,
with aggregate totals shown in the global file, and per-group information
stored in the group itself.

For the time being, only reads are allowed in the per-group cache.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: JoonSoo Kim <js1304@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

749c5415

memcg/sl[au]b: shrink dead caches · 22933152

Glauber Costa authored Dec 18, 2012

This means that when we destroy a memcg cache that happened to be empty,
those caches may take a lot of time to go away: removing the memcg
reference won't destroy them - because there are pending references, and
the empty pages will stay there, until a shrinker is called upon for any
reason.

In this patch, we will call kmem_cache_shrink() for all dead caches that
cannot be destroyed because of remaining pages.  After shrinking, it is
possible that it could be freed.  If this is not the case, we'll schedule
a lazy worker to keep trying.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: JoonSoo Kim <js1304@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

22933152

memcg/sl[au]b: track all the memcg children of a kmem_cache · 7cf27982

Glauber Costa authored Dec 18, 2012

This enables us to remove all the children of a kmem_cache being
destroyed, if for example the kernel module it's being used in gets
unloaded.  Otherwise, the children will still point to the destroyed
parent.
Signed-off-by: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: JoonSoo Kim <js1304@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7cf27982

memcg: destroy memcg caches · 1f458cbf

Glauber Costa authored Dec 18, 2012

Implement destruction of memcg caches.  Right now, only caches where our
reference counter is the last remaining are deleted.  If there are any
other reference counters around, we just leave the caches lying around
until they go away.

When that happens, a destruction function is called from the cache code.
Caches are only destroyed in process context, so we queue them up for
later processing in the general case.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: JoonSoo Kim <js1304@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1f458cbf

sl[au]b: allocate objects from memcg cache · d79923fa

Glauber Costa authored Dec 18, 2012

We are able to match a cache allocation to a particular memcg.  If the
task doesn't change groups during the allocation itself - a rare event -
this will give us a good picture of which group is the first to touch a
cache page.

This patch uses the now available infrastructure by calling
memcg_kmem_get_cache() before all the cache allocations.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: JoonSoo Kim <js1304@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

d79923fa

sl[au]b: always get the cache from its page in kmem_cache_free() · b9ce5ef4

Glauber Costa authored Dec 18, 2012

struct page already has this information.  If we start chaining caches,
this information will always be more trustworthy than whatever is passed
into the function.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: JoonSoo Kim <js1304@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b9ce5ef4
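
A rough user-space illustration of the idea in the commit above: the free path trusts the owner recorded in the page rather than the cache passed by the caller. Everything here (`toy_page`, `toy_virt_to_page()`, `toy_cache_free()`, the fixed-size page array) is a hypothetical model, not the slab allocator's real data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 256
#define TOY_NR_PAGES  4

struct toy_cache {
    const char *name;
    int nr_freed;               /* objects returned to this cache */
};

struct toy_page {
    struct toy_cache *slab_cache;  /* owner, recorded when the page joins a cache */
};

static unsigned char toy_mem[TOY_NR_PAGES][TOY_PAGE_SIZE];
static struct toy_page toy_pages[TOY_NR_PAGES];

/* Map an object address to the descriptor of the page that contains it. */
static struct toy_page *toy_virt_to_page(const void *obj)
{
    uintptr_t off = (uintptr_t)obj - (uintptr_t)toy_mem;

    assert(off < sizeof(toy_mem));
    return &toy_pages[off / TOY_PAGE_SIZE];
}

/* Free an object; the caller's cache is only a hint and is ignored. */
static struct toy_cache *toy_cache_free(struct toy_cache *hint, const void *obj)
{
    struct toy_cache *owner = toy_virt_to_page(obj)->slab_cache;

    (void)hint;
    owner->nr_freed++;
    return owner;
}
```

If a caller passes the root cache while the object actually lives in a per-memcg child cache, the object is still returned to the child.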

memcg: skip memcg kmem allocations in specified code regions · 0e9d92f2

Glauber Costa authored Dec 18, 2012

Create a mechanism that skips memcg allocations during certain pieces of
our core code.  It basically works in the same way as
preempt_disable()/preempt_enable(): by marking a region under which all
allocations will be accounted to the root memcg.

We need this to prevent races in early cache creation, when we
allocate data using caches that are not necessarily created already.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: JoonSoo Kim <js1304@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

0e9d92f2
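
The preempt_disable()/preempt_enable() analogy in the commit above suggests a nestable per-task counter. The sketch below models that in user space under that assumption; `toy_task`, `toy_stop_accounting()`, `toy_resume_accounting()`, and `toy_charge_target()` are hypothetical names and are not taken from the patch.

```c
#include <assert.h>

#define TOY_ROOT_MEMCG 0

/* Hypothetical stand-in for the per-task state. */
struct toy_task {
    int memcg_id;               /* group the task belongs to */
    unsigned int skip_account;  /* nesting depth of "skip" regions */
};

/* Enter a region in which allocations go to the root memcg. */
static void toy_stop_accounting(struct toy_task *t)
{
    t->skip_account++;
}

/* Leave the region; calls must be balanced, like preempt_enable(). */
static void toy_resume_accounting(struct toy_task *t)
{
    assert(t->skip_account > 0);
    t->skip_account--;
}

/* Which memcg would an allocation by this task be accounted to? */
static int toy_charge_target(const struct toy_task *t)
{
    return t->skip_account ? TOY_ROOT_MEMCG : t->memcg_id;
}
```

Because the counter nests, a helper that marks its own region can be called safely from code that is already inside one.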

memcg: infrastructure to match an allocation to the right cache · d7f25f8a

Glauber Costa authored Dec 18, 2012

The page allocator is able to bind a page to a memcg when it is
allocated.  But for the caches, we'd like to have as many objects as
possible in a page belonging to the same cache.

This is done in this patch by calling memcg_kmem_get_cache() at the
beginning of every allocation function.  This function is patched out by
static branches when the kernel memory controller is not being used.

It assumes that the task allocating, which determines the memcg in the
page allocator, belongs to the same cgroup throughout the whole process.
Misaccounting can happen if the task calls memcg_kmem_get_cache() while
belonging to a cgroup, and later moves to another one.  This is considered
acceptable, and should only happen upon task migration.

Before the cache is created by the memcg core, there is also a possible
imbalance: the task belongs to a memcg, but the cache being allocated from
is the global cache, since the child cache is not yet guaranteed to be
ready.  This case is also fine, since in this case the GFP_KMEMCG will not
be passed and the page allocator will not attempt any cgroup accounting.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: JoonSoo Kim <js1304@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

d7f25f8a
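
A simplified user-space model of the lookup described in the commit above. Only the name memcg_kmem_get_cache() comes from the message; `toy_get_cache()`, the fixed-size `memcg_caches` array, and the `toy_kmem_enabled` flag (standing in for the static branch) are assumptions made for illustration. The point is the fallback: when the controller is off, the task is in the root group, or the child cache is not ready yet, the allocation simply uses the root cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_MEMCGS 8

struct toy_cache {
    const char *name;
    /* For a root cache: one slot per memcg, NULL until the child exists. */
    struct toy_cache *memcg_caches[TOY_MAX_MEMCGS];
};

/* Stands in for the static branch that patches the lookup out. */
static bool toy_kmem_enabled;

/*
 * Pick the cache an allocation should use.  A negative memcg_id
 * models a task that lives in the root group.
 */
static struct toy_cache *toy_get_cache(struct toy_cache *root, int memcg_id)
{
    struct toy_cache *child;

    if (!toy_kmem_enabled)
        return root;
    if (memcg_id < 0 || memcg_id >= TOY_MAX_MEMCGS)
        return root;

    child = root->memcg_caches[memcg_id];
    return child ? child : root;  /* child not ready: use the global cache */
}
```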

memcg: allocate memory for memcg caches whenever a new memcg appears · 55007d84

Glauber Costa authored Dec 18, 2012

Every cache that is considered a root cache (basically the "original"
caches, tied to the root memcg/no-memcg) will have an array that should be
large enough to store one cache pointer for each memcg in the system.

Theoretically, this is as high as 1 << sizeof(css_id), which is currently
in the 64k pointers range.  Most of the time, we won't be using that much.

What goes in this patch is a simple scheme to dynamically allocate such
an array, in order to minimize memory usage for memcg caches.  Because we
would also like to avoid allocations all the time, at least for now, the
array will only grow.  It will tend to be big enough to hold the maximum
number of kmem-limited memcgs ever achieved.

We'll allocate it to be a minimum of 64 kmem-limited memcgs.  When we have
more than that, we'll start doubling the size of this array every time the
limit is reached.

Because we are only considering kmem limited memcgs, a natural point for
this to happen is when we write to the limit.  At that point, we already
have set_limit_mutex held, so that will become our natural synchronization
mechanism.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: JoonSoo Kim <js1304@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

55007d84
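
The growth policy in the commit above (start at 64 slots, double when the limit is reached, never shrink) can be sketched in user space as follows. The names `toy_array_size()`, `toy_array_reserve()`, and `toy_cache_array` are hypothetical, and calloc()/free() stand in for the kernel allocator; the locking that the message attributes to set_limit_mutex is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TOY_MIN_SLOTS 64

/* Size the array should have to hold nr_memcgs entries; never shrinks. */
static size_t toy_array_size(size_t cur_size, size_t nr_memcgs)
{
    size_t size = cur_size < TOY_MIN_SLOTS ? TOY_MIN_SLOTS : cur_size;

    while (size < nr_memcgs)
        size *= 2;
    return size;
}

struct toy_cache_array {
    void **slots;   /* one cache pointer per kmem-limited memcg */
    size_t size;    /* number of slots currently allocated */
};

/* Grow the array if needed, keeping existing pointers; 0 on success. */
static int toy_array_reserve(struct toy_cache_array *arr, size_t nr_memcgs)
{
    size_t new_size = toy_array_size(arr->size, nr_memcgs);
    void **slots;

    if (new_size == arr->size)
        return 0;

    slots = calloc(new_size, sizeof(*slots));
    if (!slots)
        return -1;
    if (arr->size)
        memcpy(slots, arr->slots, arr->size * sizeof(*slots));
    free(arr->slots);
    arr->slots = slots;
    arr->size = new_size;
    return 0;
}
```

Doubling keeps the number of reallocations logarithmic in the peak number of kmem-limited groups, at the cost of some unused slots.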