  1. 17 Apr, 2015 6 commits
    • mm: rcu-protected get_mm_exe_file() · 90f31d0e
      Konstantin Khlebnikov authored
      This patch removes the use of mm->mmap_sem on the mm->exe_file read
      side.  It also kills dup_mm_exe_file() and moves exe_file duplication
      into dup_mmap(), where both mmap_sems are held.
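      A minimal sketch of the resulting lockless read side (get_file_rcu() is
      essentially an atomic_long_inc_not_zero() on the file's refcount):

      	struct file *get_mm_exe_file(struct mm_struct *mm)
      	{
      		struct file *exe_file;

      		rcu_read_lock();
      		exe_file = rcu_dereference(mm->exe_file);
      		/* Skip files whose last reference is already being dropped. */
      		if (exe_file && !get_file_rcu(exe_file))
      			exe_file = NULL;
      		rcu_read_unlock();
      		return exe_file;
      	}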
      
      [akpm@linux-foundation.org: fix comment typo]
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      90f31d0e
    • kernel/sysctl.c: threads-max observe limits · 16db3d3f
      Heinrich Schuchardt authored
      Users can change the maximum number of threads by writing to
      /proc/sys/kernel/threads-max.
      
      With this patch, the value written is checked against the same limits
      that apply when fork_init() is called.
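      A sketch of the bounds-checked handler (limit names MIN_THREADS and
      MAX_THREADS follow the fork_init() logic; treat the exact shape as
      illustrative):

      	int sysctl_max_threads(struct ctl_table *table, int write,
      			       void __user *buffer, size_t *lenp, loff_t *ppos)
      	{
      		struct ctl_table t;
      		int ret, threads = max_threads;
      		int min = MIN_THREADS;
      		int max = MAX_THREADS;

      		/* Route the write through min/max validation. */
      		t = *table;
      		t.data = &threads;
      		t.extra1 = &min;
      		t.extra2 = &max;

      		ret = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
      		if (ret || !write)
      			return ret;

      		set_max_threads(threads);
      		return 0;
      	}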
      Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      16db3d3f
    • kernel/fork.c: avoid division by zero · ac1b398d
      Heinrich Schuchardt authored
      PAGE_SIZE is not guaranteed to be equal to or less than 8 times
      THREAD_SIZE.

      For example, the hexagon architecture may have a page size of 1 MiB and
      a thread size of 4096 bytes.  This would lead to a division by zero in
      the calculation of max_threads.
      
      With a 32-bit calculation there is no formula that delivers valid
      results for all possible combinations of the parameters.  Since the code
      is only called once, a 64-bit calculation can be used instead.
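      A sketch of the 64-bit calculation (fls64() guards the multiplication
      against overflow; MIN_THREADS/MAX_THREADS as introduced by this patch):

      	static void set_max_threads(unsigned int max_threads_suggested)
      	{
      		u64 threads;

      		/*
      		 * Thread structures may consume at most 1/8 of available
      		 * memory; fall back to the cap if the product overflows.
      		 */
      		if (fls64(totalram_pages) + fls64(PAGE_SIZE) > 64)
      			threads = MAX_THREADS;
      		else
      			threads = div64_u64((u64) totalram_pages * (u64) PAGE_SIZE,
      					    (u64) THREAD_SIZE * 8UL);

      		if (threads > max_threads_suggested)
      			threads = max_threads_suggested;

      		max_threads = clamp_t(u64, threads, MIN_THREADS, MAX_THREADS);
      	}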
      
      [akpm@linux-foundation.org: use clamp_t(), per Oleg]
      Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ac1b398d
    • kernel/fork.c: new function for max_threads · ff691f6e
      Heinrich Schuchardt authored
      PAGE_SIZE is not guaranteed to be equal to or less than 8 times
      THREAD_SIZE.

      For example, the hexagon architecture may have a page size of 1 MiB and
      a thread size of 4096 bytes.  This would lead to a division by zero in
      the calculation of max_threads.
      
      This patch moves the buggy code into a separate function,
      set_max_threads().  The error itself is not fixed here.
      
      Once the problem is fixed in a separate patch, the new function can be
      reused to adjust max_threads after adding or removing memory.
      
      The mempages argument of fork_init() is removed, as totalram_pages is an
      exported symbol.
      
      The creation of separate patches for refactoring to a new function and for
      fixing the logic was suggested by Ingo Molnar.
      Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ff691f6e
    • fork_init: update max_threads comment · 3ea7f5e2
      Jean Delvare authored
      The comment explaining what value max_threads is set to is outdated.  The
      maximum memory consumption ratio for thread structures was 1/2 until
      February 2002, then it was briefly changed to 1/16 before being set to 1/8
      which we still use today.  The comment was never updated to reflect that
      change; it's about time.
      Signed-off-by: Jean Delvare <jdelvare@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3ea7f5e2
    • fork: report pid reservation failure properly · 35f71bc0
      Michal Hocko authored
      copy_process() currently reports any failure in alloc_pid() as ENOMEM,
      which is misleading because pid allocation might fail not only when
      memory is short but also when the pid space is already exhausted.
      
      The current man page even mentions this case:
      
      : EAGAIN
      :
      :       A system-imposed limit on the number of threads was encountered.
      :       There are a number of limits that may trigger this error: the
      :       RLIMIT_NPROC soft resource limit (set via setrlimit(2)), which
      :       limits the number of processes and threads for a real user ID, was
      :       reached; the kernel's system-wide limit on the number of processes
      :       and threads, /proc/sys/kernel/threads-max, was reached (see
      :       proc(5)); or the maximum number of PIDs, /proc/sys/kernel/pid_max,
      :       was reached (see proc(5)).
      
      so the current behavior is also incorrect with respect to the
      documentation.  The POSIX man page also suggests returning EAGAIN when
      the process count limit is reached.
      
      This patch simply propagates the error code from alloc_pid() and makes
      sure we return -EAGAIN on a pid reservation failure.  This will make the
      behavior of fork closer to both our documentation and POSIX.
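      The gist of the change in copy_process() (a sketch; surrounding cleanup
      labels abbreviated):

      	pid = alloc_pid(p->nsproxy->pid_ns_for_children);
      	if (IS_ERR(pid)) {
      		/* -EAGAIN for pid-space exhaustion, -ENOMEM otherwise. */
      		retval = PTR_ERR(pid);
      		goto bad_fork_cleanup_io;
      	}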
      
      alloc_pid() might also fail when the reaper in the pid namespace is dead
      (the namespace basically disallows all new processes), and there is no
      good error code that would match the documented ones.  We have
      traditionally returned ENOMEM for this case, which is misleading as
      well, but as per Eric W. Biederman this behavior is documented in
      pid_namespaces(7):
      
      : If the "init" process of a PID namespace terminates, the kernel
      : terminates all of the processes in the namespace via a SIGKILL signal.
      : This behavior reflects the fact that the "init" process is essential for
      : the correct operation of a PID namespace.  In this case, a subsequent
      : fork(2) into this PID namespace will fail with the error ENOMEM; it is
      : not possible to create a new process in a PID namespace whose "init"
      : process has terminated.
      
      and introducing a new error code would be too risky, so let's stick to
      ENOMEM for this case.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      35f71bc0
  2. 12 Apr, 2015 1 commit
  3. 13 Feb, 2015 1 commit
  4. 12 Feb, 2015 2 commits
    • mm: fix false-positive warning on exit due mm_nr_pmds(mm) · b30fe6c7
      Kirill A. Shutemov authored
      The problem is that we check nr_ptes/nr_pmds in exit_mmap(), which
      happens *before* pgd_free().  If an arch does pte/pmd allocation in
      pgd_alloc() and frees them in pgd_free(), the counters are skewed by the
      time of the checks.
      
      We tried to work around this by offsetting the expected counter value
      according to FIRST_USER_ADDRESS for both nr_pte and nr_pmd in
      exit_mmap(), but it doesn't work in some cases:
      
      1. ARM with LPAE enabled also has a non-zero USER_PGTABLES_CEILING, but
         the upper addresses are occupied by huge pmd entries, so the trick of
         offsetting the expected counter value would get really ugly: we would
         have to apply it to nr_pmds, but not to nr_ptes.
      
      2. Metag has a non-zero FIRST_USER_ADDRESS but doesn't allocate pte/pmd
         page tables in pgd_alloc(); it just sets up a pgd entry which is
         allocated at boot and shared across all processes.
      
      The proposal is to move the check into check_mm(), which happens *after*
      pgd_free(), and to do proper accounting during pgd_alloc() and
      pgd_free(), which brings the counters to zero if nothing leaked.
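      A sketch of the relocated check (check_mm() runs on the final mm drop,
      i.e. after pgd_free(); the other checks in the real function are
      omitted here):

      	static void check_mm(struct mm_struct *mm)
      	{
      		if (atomic_long_read(&mm->nr_ptes))
      			pr_alert("BUG: non-zero nr_ptes on freeing mm: %ld\n",
      				 atomic_long_read(&mm->nr_ptes));
      		if (mm_nr_pmds(mm))
      			pr_alert("BUG: non-zero nr_pmds on freeing mm: %ld\n",
      				 mm_nr_pmds(mm));
      	}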
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Tyler Baker <tyler.baker@linaro.org>
      Tested-by: Tyler Baker <tyler.baker@linaro.org>
      Tested-by: Nishanth Menon <nm@ti.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b30fe6c7
    • mm: account pmd page tables to the process · dc6c9a35
      Kirill A. Shutemov authored
      Dave noticed that an unprivileged process can allocate a significant
      amount of memory -- more than 500 MiB on x86_64 -- and stay unnoticed by
      the oom-killer and memory cgroup.  The trick is to allocate a lot of PMD
      page tables: the Linux kernel doesn't account PMD tables to the process,
      only PTE tables.

      The test case below uses a few tricks to allocate a lot of PMD page
      tables while keeping VmRSS and VmPTE low.  The oom_score for the process
      will be 0.
      
      	#include <errno.h>
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <sys/mman.h>
      	#include <sys/prctl.h>
      
      	#define PUD_SIZE (1UL << 30)
      	#define PMD_SIZE (1UL << 21)
      
      	#define NR_PUD 130000
      
      	int main(void)
      	{
      		char *addr = NULL;
      		unsigned long i;
      
      		prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);
      		for (i = 0; i < NR_PUD ; i++) {
      			addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
      					MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
      			if (addr == MAP_FAILED) {
      				perror("mmap");
      				break;
      			}
      			*addr = 'x';
      			munmap(addr, PMD_SIZE);
      			addr = mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
      					MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
      			if (addr == MAP_FAILED)
      				perror("re-mmap"), exit(1);
      		}
      		printf("PID %d consumed %lu KiB in PMD page tables\n",
      				getpid(), i * 4096 >> 10);
      		return pause();
      	}
      
      The patch addresses the issue by accounting PMD tables to the process
      the same way we account PTE tables.

      The main places where PMD tables are accounted are __pmd_alloc() and
      free_pmd_range(), but there are a few corner cases:

       - HugeTLB can share PMD page tables.  The patch handles this by
         accounting the table to every process that shares it.
      
       - x86 PAE pre-allocates a few PMD tables on fork.
      
       - Architectures with FIRST_USER_ADDRESS > 0.  We need to adjust the
         sanity check on exit(2).
      
      Accounting only happens on configurations where the PMD page table level
      is present (PMD is not folded).  As with nr_ptes, we use a per-mm
      counter.  The counter value is used to calculate the baseline for the
      badness score used by the oom-killer.
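      The per-mm counter is manipulated through small helpers, along these
      lines (compiled away when PMD is folded):

      	static inline void mm_inc_nr_pmds(struct mm_struct *mm)
      	{
      		atomic_long_inc(&mm->nr_pmds);
      	}

      	static inline void mm_dec_nr_pmds(struct mm_struct *mm)
      	{
      		atomic_long_dec(&mm->nr_pmds);
      	}

      	static inline unsigned long mm_nr_pmds(struct mm_struct *mm)
      	{
      		return atomic_long_read(&mm->nr_pmds);
      	}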
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: David Rientjes <rientjes@google.com>
      Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc6c9a35
  5. 10 Feb, 2015 1 commit
  6. 13 Dec, 2014 1 commit
  7. 29 Oct, 2014 1 commit
  8. 10 Oct, 2014 1 commit
  9. 03 Oct, 2014 1 commit
  10. 02 Oct, 2014 1 commit
  11. 19 Sep, 2014 1 commit
    • init/main.c: Give init_task a canary · d4311ff1
      Aaron Tomlin authored
      Tasks get the end of their stack set to STACK_END_MAGIC with the aim of
      catching stack overruns.  Currently this feature does not apply to
      init_task.  This patch removes that restriction.
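      The helper this change factors out looks roughly like this (and is now
      also called for init_task from start_kernel()):

      	void set_task_stack_end_magic(struct task_struct *tsk)
      	{
      		unsigned long *stackend;

      		stackend = end_of_stack(tsk);
      		*stackend = STACK_END_MAGIC;	/* for overflow detection */
      	}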
      
      Note that a similar patch was posted by Prarit Bhargava
      some time ago but was never merged:
      
        http://marc.info/?l=linux-kernel&m=127144305403241&w=2
      Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Cc: aneesh.kumar@linux.vnet.ibm.com
      Cc: dzickus@redhat.com
      Cc: bmr@redhat.com
      Cc: jcastillo@redhat.com
      Cc: jgh@redhat.com
      Cc: minchan@kernel.org
      Cc: tglx@linutronix.de
      Cc: hannes@cmpxchg.org
      Cc: Alex Thorlton <athorlton@sgi.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Daeseok Youn <daeseok.youn@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Fabian Frederick <fabf@skynet.be>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Michael Opdenacker <michael.opdenacker@free-electrons.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Seiji Aguchi <seiji.aguchi@hds.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: linuxppc-dev@lists.ozlabs.org
      Link: http://lkml.kernel.org/r/1410527779-8133-2-git-send-email-atomlin@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d4311ff1
  12. 08 Sep, 2014 1 commit
    • time, signal: Protect resource use statistics with seqlock · e78c3496
      Rik van Riel authored
      Both times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) have scalability
      issues on large systems, due to both functions being serialized with a
      lock.
      
      The lock protects against reporting a wrong value when a thread in the
      task group exits and its statistics are rolled up into the signal
      struct, which could otherwise cause the exited task's statistics to be
      counted twice (or not at all).
      
      Protecting that with a lock results in times() and clock_gettime() being
      completely serialized on large systems.
      
      This can be fixed by using a seqlock around the events that gather and
      propagate statistics. As an additional benefit, the protection code can
      be moved into thread_group_cputime(), slightly simplifying the calling
      functions.
      
      In the case of posix_cpu_clock_get_task() things can be simplified a
      lot, because the calling function already ensures that the task sticks
      around, and the rest is now taken care of in thread_group_cputime().
      
      This way the statistics reporting code can run lockless.
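      A sketch of the reader side (hypothetical helper name; the real code in
      thread_group_cputime() additionally falls back to taking the lock via
      read_seqbegin_or_lock() under contention):

      	static void cputime_snapshot(struct signal_struct *sig,
      				     cputime_t *utime, cputime_t *stime)
      	{
      		unsigned int seq;

      		/* Retry if an exiting thread updated the stats meanwhile. */
      		do {
      			seq = read_seqbegin(&sig->stats_lock);
      			*utime = sig->utime;
      			*stime = sig->stime;
      		} while (read_seqretry(&sig->stats_lock, seq));
      	}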
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alex Thorlton <athorlton@sgi.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Daeseok Youn <daeseok.youn@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guillaume Morin <guillaume@morinfr.org>
      Cc: Ionut Alexa <ionut.m.alexa@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Michal Schmidt <mschmidt@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: umgwanakikbuti@gmail.com
      Cc: fweisbec@gmail.com
      Cc: srao@redhat.com
      Cc: lwoodman@redhat.com
      Cc: atheurer@redhat.com
      Link: http://lkml.kernel.org/r/20140816134010.26a9b572@annuminas.surriel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e78c3496
  13. 11 Aug, 2014 1 commit
  14. 08 Aug, 2014 7 commits
    • mm: allow drivers to prevent new writable mappings · 4bb5f5d9
      David Herrmann authored
      This patch (of 6):
      
      The i_mmap_writable field counts existing writable mappings of an
      address_space.  To allow drivers to prevent new writable mappings, make
      this counter signed and prevent new writable mappings if it is negative.
      This is modelled after i_writecount and DENYWRITE.
      
      This will be required by the shmem-sealing infrastructure to prevent any
      new writable mappings after the WRITE seal has been set.  In case there
      exists a writable mapping, this operation will fail with EBUSY.
      
      Note that we rely on the fact that you can increase the counter without
      using the helpers only if you already own a writable mapping.  This is
      the same as what we do for i_writecount.
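      The helpers work along these lines (a sketch based on the description
      above):

      	/* Map: fails once a seal has driven the counter negative. */
      	static inline int mapping_map_writable(struct address_space *mapping)
      	{
      		return atomic_inc_unless_negative(&mapping->i_mmap_writable) ?
      			0 : -EPERM;
      	}

      	static inline void mapping_unmap_writable(struct address_space *mapping)
      	{
      		atomic_dec(&mapping->i_mmap_writable);
      	}

      	/* Deny: fails with EBUSY while writable mappings exist. */
      	static inline int mapping_deny_writable(struct address_space *mapping)
      	{
      		return atomic_dec_unless_positive(&mapping->i_mmap_writable) ?
      			0 : -EBUSY;
      	}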
      Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ryan Lortie <desrt@desrt.ca>
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: Daniel Mack <zonque@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4bb5f5d9
    • shm: make exit_shm work proportional to task activity · ab602f79
      Jack Miller authored
      This is a small set of patches our team has had kicking around
      internally for a few versions that fixes tasks getting hung in
      exit_shm() when many threads hammer it at once.
      
      Anton wrote a simple test to cause the issue:
      
        http://ozlabs.org/~anton/junkcode/bust_shm_exit.c
      
      Before applying this patchset, this test code will cause either hanging
      tracebacks or pthread out of memory errors.
      
      After this patchset, it will still produce output like:
      
        root@somehost:~# ./bust_shm_exit 1024 160
        ...
        INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by 116, t=2111 jiffies, g=241, c=240, q=7113)
        INFO: Stall ended before state dump start
        ...
      
      But the task will continue to run along happily, so we consider this an
      improvement over hanging, even if it's a bit noisy.
      
      This patch (of 3):
      
      exit_shm() obtains the ipc_ns shm rwsem for write and holds it while it
      walks every shared memory segment in the namespace.  Thus the amount of
      work is related to the number of shm segments in the namespace, not the
      number of segments that might need to be cleaned up.

      In addition, this occurs after the task has been notified that the
      thread has exited, so the number of tasks waiting for the ns shm rwsem
      can grow without bound until memory is exhausted.
      
      Add a list to the task struct of all shmids allocated by this task.
      Init the list head in copy_process().  Use the ns->rwsem for locking.
      Add segments to the list after their id is added, and remove them before
      the id is removed.
      
      On unshare of CLONE_NEWIPC, orphan any ids as if the task had exited,
      similar to the handling of semaphore undo.
      
      I chose a define for the init sequence since it's a simple list init
      (see the sketch below); otherwise it would require a function call to
      avoid include loops between the semaphore code and the task struct.
      Converting the list_del to list_del_init for the unshare cases would
      remove the exit followed by init, but I left it to blow up if not
      inited.
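      The list and the init define look roughly like this:

      	struct sysv_shm {
      		struct list_head shm_clist;	/* shmids created by this task */
      	};

      	/* Simple list init; a define avoids include loops with sched.h. */
      	#define shm_init_task(task) \
      		INIT_LIST_HEAD(&(task)->sysvshm.shm_clist)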
      Signed-off-by: Milton Miller <miltonm@bga.com>
      Signed-off-by: Jack Miller <millerjo@us.ibm.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Anton Blanchard <anton@samba.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ab602f79
    • kernel/fork.c: make mm_init_owner static · 33144e84
      Vladimir Davydov authored
      It's only used in fork.c:mm_init().
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      33144e84
    • fork: copy mm's vm usage counters under mmap_sem · 4f7d4614
      Vladimir Davydov authored
      If a forking process has a thread calling (un)mmap (silly, but still),
      the child process may have some of its mm's vm usage counters (total_vm
      and friends) screwed up, because currently they are copied from oldmm
      without holding any locks (a memcpy in dup_mm).

      This patch moves the counter initialization into dup_mmap(), where it
      runs under oldmm->mmap_sem, which eliminates any possibility of a race.
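      The gist of the move (field names per the mm_struct of this era):

      	/* In dup_mmap(), with oldmm->mmap_sem already held for write: */
      	mm->total_vm = oldmm->total_vm;
      	mm->shared_vm = oldmm->shared_vm;
      	mm->exec_vm = oldmm->exec_vm;
      	mm->stack_vm = oldmm->stack_vm;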
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4f7d4614
    • fork: reset mm->pinned_vm · ce65cefa
      Vladimir Davydov authored
      mm->pinned_vm counts pages of mm's address space that were permanently
      pinned in memory by increasing their reference counter.  The counter was
      introduced by commit bc3e53f6 ("mm: distinguish between mlocked and
      pinned pages"); before it, locked_vm had been used for such pages.
      
      Obviously, we should reset the counter on fork if !CLONE_VM, just like
      we do with locked_vm, but currently we don't. Let's fix it.
      
      This patch will fix the contents of /proc/pid/status:VmPin.
      
      ib_umem_get[infiniband] and perf_mmap still check pinned_vm against
      RLIMIT_MEMLOCK.  It's left from the times when pinned pages were accounted
      under locked_vm, but today it looks wrong.  It isn't clear how we should
      deal with it.
      
      We still have some drivers accounting pinned pages under mm->locked_vm
      -- this is what commit bc3e53f6 was fighting against.  These are
      infiniband/usnic and vfio.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Sean Hefty <sean.hefty@intel.com>
      Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ce65cefa
    • fork/exec: cleanup mm initialization · 41f727fd
      Vladimir Davydov authored
      mm initialization on fork/exec is spread all over the place, which makes
      the code look inconsistent.
      
      We have mm_init(), which is supposed to init/nullify mm's internals, but
      it doesn't init all the fields it should:
      
       - on fork, ->mmap, mm_rb, vmacache_seqnum, map_count, mm_cpumask and
         locked_vm are zeroed in dup_mmap();
      
       - on fork ->pmd_huge_pte is zeroed in dup_mm(), immediately before
         calling mm_init();
      
       - ->cpu_vm_mask_var ptr is initialized by mm_init_cpumask(), which is
         called before mm_init() on both fork and exec;
      
       - ->context is initialized by init_new_context(), which is called after
         mm_init() on both fork and exec;
      
      Let's consolidate all the initializations in mm_init() to make the code
      look cleaner.
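      Abbreviated, the consolidated initializer looks like this:

      	static struct mm_struct *mm_init(struct mm_struct *mm,
      					 struct task_struct *p)
      	{
      		mm->mmap = NULL;
      		mm->mm_rb = RB_ROOT;
      		mm->vmacache_seqnum = 0;
      		mm->map_count = 0;
      		mm->locked_vm = 0;
      		mm->pinned_vm = 0;
      		mm_init_cpumask(mm);
      		/* ... remaining fields and error handling elided ... */
      		return mm;
      	}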
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      41f727fd
    • mm: memcontrol: use page lists for uncharge batching · 747db954
      Johannes Weiner authored
      Pages are now uncharged at release time, and all sources of batched
      uncharges operate on lists of pages.  Directly use those lists, and
      get rid of the per-task batching state.
      
      This also batches statistics accounting, in addition to the res
      counter charges, to reduce IRQ-disabling and re-enabling.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      747db954
  15. 23 Jul, 2014 2 commits
  16. 18 Jul, 2014 1 commit
    • seccomp: introduce writer locking · dbd95212
      Kees Cook authored
      Normally, task_struct.seccomp.filter is only ever read or modified by
      the task that owns it (current). This property aids in fast access
      during system call filtering as read access is lockless.
      
      Updating the pointer from another task, however, opens up race
      conditions. To allow cross-thread filter pointer updates, writes to the
      seccomp fields are now protected by the sighand spinlock (which is shared
      by all threads in the thread group). Read access remains lockless because
      pointer updates themselves are atomic.  However, writes (or cloning)
      often entail additional checking (like maximum instruction counts)
      which requires locking to perform safely.
      
      In the case of cloning threads, the child is invisible to the system
      until it enters the task list. To make sure a child can't be cloned from
      a thread and left in a prior state, seccomp duplication is additionally
      moved under the sighand lock.  Then parent and child are certain to have
      the same seccomp state when they exit the lock.
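      A sketch of the write-side discipline (roughly the shape of the new
      seccomp_assign_mode() helper):

      	static void seccomp_assign_mode(struct task_struct *task,
      					unsigned long seccomp_mode)
      	{
      		assert_spin_locked(&task->sighand->siglock);

      		task->seccomp.mode = seccomp_mode;
      		/* Make the mode visible before the TIF flag is observed. */
      		smp_mb();
      		set_tsk_thread_flag(task, TIF_SECCOMP);
      	}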
      
      Based on patches by Will Drewry and David Drysdale.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Andy Lutomirski <luto@amacapital.net>
      dbd95212
  17. 16 Jul, 2014 1 commit
  18. 21 Jun, 2014 1 commit
  19. 12 Jun, 2014 1 commit
  20. 06 Jun, 2014 1 commit
    • ptrace: fix fork event messages across pid namespaces · 4e52365f
      Matthew Dempsky authored
      When tracing a process in another pid namespace, it's important for fork
      event messages to contain the child's pid as seen from the tracer's pid
      namespace, not the parent's.  Otherwise, the tracer won't be able to
      correlate the fork event with later SIGTRAP signals it receives from the
      child.
      
      We still risk a race condition if a ptracer from a different pid
      namespace attaches after we compute the pid_t value.  However, sending a
      bogus fork event message in this unlikely scenario is still a vast
      improvement over the status quo where we always send bogus fork event
      messages to debuggers in a different pid namespace than the forking
      process.
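      The translation happens along these lines (a sketch of the helper the
      patch introduces):

      	static inline void ptrace_event_pid(int event, struct pid *pid)
      	{
      		unsigned long message = 0;
      		struct pid_namespace *ns;

      		rcu_read_lock();
      		ns = task_active_pid_ns(rcu_dereference(current->parent));
      		if (ns)
      			message = pid_nr_ns(pid, ns);	/* tracer's view */
      		rcu_read_unlock();

      		ptrace_event(event, message);
      	}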
      Signed-off-by: Matthew Dempsky <mdempsky@chromium.org>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Julien Tinnes <jln@chromium.org>
      Cc: Roland McGrath <mcgrathr@chromium.org>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4e52365f
  21. 04 Jun, 2014 2 commits
  22. 07 Apr, 2014 5 commits
    • kernel: use macros from compiler.h instead of __attribute__((...)) · 52f5684c
      Gideon Israel Dsouza authored
      To increase compiler portability there is <linux/compiler.h>, which
      provides convenience macros for various gcc constructs, e.g. __weak for
      __attribute__((weak)).  I've replaced all instances of gcc attributes
      with the right macro in the kernel subsystem.
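      For example (arch_setup() is a hypothetical function name):

      	/* before */
      	void __attribute__((weak)) arch_setup(void);

      	/* after, using the <linux/compiler.h> macro */
      	void __weak arch_setup(void);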
      Signed-off-by: Gideon Israel Dsouza <gidisrael@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      52f5684c
    • mm, mempolicy: remove per-process flag · f0432d15
      David Rientjes authored
      PF_MEMPOLICY is an unnecessary optimization for CONFIG_SLAB users.
      There's no significant performance degradation to checking
      current->mempolicy rather than current->flags & PF_MEMPOLICY in the
      allocation path, especially since this is considered unlikely().
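      The shape of the change in the slab allocation path (an illustrative
      fragment, not the exact diff):

      	/* before: per-process flag */
      	if (unlikely(current->flags & PF_MEMPOLICY))
      		objp = alternate_node_alloc(cachep, flags);

      	/* after: test the pointer directly */
      	if (unlikely(current->mempolicy))
      		objp = alternate_node_alloc(cachep, flags);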
      
      Running TCP_RR with netperf-2.4.5 through localhost on a 16-cpu machine
      with 64 GB of memory and without a mempolicy:
      
      	threads		before		after
      	16		1249409		1244487
      	32		1281786		1246783
      	48		1239175		1239138
      	64		1244642		1241841
      	80		1244346		1248918
      	96		1266436		1254316
      	112		1307398		1312135
      	128		1327607		1326502
      
      Per-process flags are a scarce resource, so we should free them up
      whenever possible and make them available for reuse.  We'll be using one
      shortly for memcg oom reserves.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Tim Hockin <thockin@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f0432d15
    • fork: collapse copy_flags into copy_process · 514ddb44
      David Rientjes authored
      copy_flags() does not use its clone_flags parameter and can be collapsed
      into copy_process() for cleaner code.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Tim Hockin <thockin@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      514ddb44
    • mm: per-thread vma caching · 615d6e87
      Davidlohr Bueso authored
      This patch is a continuation of efforts to optimize find_vma(), avoiding
      potentially expensive rbtree walks to locate a vma upon faults.  The
      original approach (https://lkml.org/lkml/2013/11/1/410), where the
      largest vma was also cached, ended up being too specific and random,
      thus further comparison with other approaches was needed.  There are two
      things to consider when dealing with this: the cache hit rate and the
      latency of find_vma().  Improving the hit rate does not necessarily
      translate into finding the vma any faster, as the overhead of any fancy
      caching scheme can be too high to consider.
      
      We currently cache the last used vma for the whole address space, which
      provides a nice optimization, reducing the total cycles in find_vma() by
      up to 250%, for workloads with good locality.  On the other hand, this
      simple scheme is pretty much useless for workloads with poor locality.
      Analyzing ebizzy runs shows that, no matter how many threads are
      running, the mmap_cache hit rate is less than 2%, and in many situations
      below 1%.
      
      The proposed approach is to replace this scheme with a small per-thread
      cache, maximizing hit rates at a very low maintenance cost.
      Invalidations are performed by simply bumping up a 32-bit sequence
      number.  The only expensive operation is in the rare case of a seq
      number overflow, where all caches that share the same address space are
      flushed.  Upon a miss, the proposed replacement policy is based on the
      page number that contains the virtual address in question.  Concretely,
      the following results are seen on an 80 core, 8 socket x86-64 box:
      
      1) System bootup: Most programs are single threaded, so the per-thread
         scheme improves on the ~50% hit rate by just adding a few more slots
         to the cache.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 50.61%   | 19.90            |
      | patched        | 73.45%   | 13.58            |
      +----------------+----------+------------------+
      
      2) Kernel build: This one is already pretty good with the current
         approach as we're dealing with good locality.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 75.28%   | 11.03            |
      | patched        | 88.09%   | 9.31             |
      +----------------+----------+------------------+
      
      3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 70.66%   | 17.14            |
      | patched        | 91.15%   | 12.57            |
      +----------------+----------+------------------+
      
      4) Ebizzy: There's a fair amount of variation from run to run, but this
         approach always shows nearly perfect hit rates, while the baseline's
         are just about non-existent.  The cycle counts can fluctuate anywhere
         from ~60 to ~116 billion for the baseline scheme, but this approach
         reduces them considerably.  For instance, with 80 threads:
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 1.06%    | 91.54            |
      | patched        | 99.97%   | 14.18            |
      +----------------+----------+------------------+
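      A sketch of the cache structure described above (abbreviated; the real
      vmacache_invalidate() also handles sequence-number overflow by flushing
      all caches sharing the address space):

      	/* Four slots per thread; replacement is indexed by page number. */
      	#define VMACACHE_BITS	2
      	#define VMACACHE_SIZE	(1U << VMACACHE_BITS)
      	#define VMACACHE_MASK	(VMACACHE_SIZE - 1)
      	#define VMACACHE_HASH(addr) (((addr) >> PAGE_SHIFT) & VMACACHE_MASK)

      	void vmacache_update(unsigned long addr, struct vm_area_struct *newvma)
      	{
      		if (vmacache_valid_mm(newvma->vm_mm))
      			current->vmacache[VMACACHE_HASH(addr)] = newvma;
      	}

      	/* Invalidation just bumps the mm's 32-bit sequence number. */
      	static inline void vmacache_invalidate(struct mm_struct *mm)
      	{
      		mm->vmacache_seqnum++;
      	}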
      
      [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
      [akpm@linux-foundation.org: document vmacache_valid() logic]
      [akpm@linux-foundation.org: attempt to untangle header files]
      [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
      [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: adjust and enhance comments]
      Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Michel Lespinasse <walken@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Tested-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      615d6e87
    • mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE · a0715cc2
      Alex Thorlton authored
      This patch adds VM_INIT_DEF_MASK, to allow us to set the default flags
      for VMAs.  It also adds a prctl control which allows us to set the THP
      disable bit in mm->def_flags so that VMAs will pick up the setting as
      they are created.
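      From userspace the control is used like this (a minimal sketch):

      	#include <stdio.h>
      	#include <sys/prctl.h>

      	int main(void)
      	{
      		/* Disable THP for all VMAs created from now on (arg2 = 1). */
      		if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
      			perror("prctl");

      		/* The current setting is returned as the prctl() result. */
      		printf("THP disabled: %d\n",
      		       prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0));
      		return 0;
      	}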
      Signed-off-by: Alex Thorlton <athorlton@sgi.com>
      Suggested-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a0715cc2