  1. 28 Apr, 2020 2 commits
    • exec: Remove BUG_ON(has_group_leader_pid) · 610b8188
      Eric W. Biederman authored
      With the introduction of exchange_tids, thread_group_leader() and
      has_group_leader_pid() have become equivalent.  Further, at this point in the
      code a thread group has exactly two threads: the previous thread_group_leader,
      which is waiting to be reaped, and tsk.  So we know it is impossible for tsk
      to be the thread_group_leader.
      
      This is also the last user of has_group_leader_pid so removing this check
      will allow has_group_leader_pid to be removed.
      
      So remove the "BUG_ON(has_group_leader_pid)" that will never fire.
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
    • proc: Ensure we see the exit of each process tid exactly once · 6b03d130
      Eric W. Biederman authored
      When the thread group leader changes during exec and the old leader's
      thread is reaped, proc_flush_pid will flush the dentries for the entire
      process because the leader still has its original pid.
      
      Fix this by exchanging the pids in an RCU-safe manner,
      and wrapping the code to do that up in a helper, exchange_tids.
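
      A minimal sketch of what such an exchange_tids() helper can look like
      (illustrative only; it assumes a hlists_swap_heads_rcu() helper for
      swapping the two single-entry PIDTYPE_PID task lists):

      	void exchange_tids(struct task_struct *left, struct task_struct *right)
      	{
      		struct pid *pid1 = left->thread_pid;
      		struct pid *pid2 = right->thread_pid;
      		struct hlist_head *head1 = &pid1->tasks[PIDTYPE_PID];
      		struct hlist_head *head2 = &pid2->tasks[PIDTYPE_PID];

      		/* Swap the single-entry tid lists */
      		hlists_swap_heads_rcu(head1, head2);

      		/* Swap the per-task struct pid pointers */
      		rcu_assign_pointer(left->thread_pid, pid2);
      		rcu_assign_pointer(right->thread_pid, pid1);

      		/* Swap the cached pid values */
      		WRITE_ONCE(left->pid, pid_nr(pid2));
      		WRITE_ONCE(right->pid, pid_nr(pid1));
      	}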
      
      When I removed switch_exec_pids and introduced this behavior
      in d73d6529 ("[PATCH] pidhash: kill switch_exec_pids") there
      really was nothing that cared as flushing happened with
      the cached dentry and de_thread flushed both of them on exec.
      
      This lack of fully exchanging pids became a problem a few months later
      when I introduced 48e6484d ("[PATCH] proc: Rewrite the proc dentry
      flush on exit optimization"), which overlooked that the de_thread case
      was no longer swapping pids, and I was looking up proc dentries
      by task->pid.
      
      The current behavior isn't properly a bug, as everything in proc will
      continue to work correctly, just a little bit less efficiently.  Fix
      this just so there are no little surprise corner cases waiting to bite
      people.
      
      -- Oleg points out this could be an issue in next_tgid in proc where
         has_group_leader_pid is called, and reordering some of the assignments
         should fix that.
      
      -- Oleg points out this will break the 10 year old hack in __exit_signal()
      >	/*
      >	 * This can only happen if the caller is de_thread().
      >	 * FIXME: this is the temporary hack, we should teach
      >	 * posix-cpu-timers to handle this case correctly.
      >	 */
      >	if (unlikely(has_group_leader_pid(tsk)))
      >		posix_cpu_timers_exit_group(tsk);
      
      The code in next_tgid has been changed to use PIDTYPE_TGID,
      and the posix cpu timers code has been fixed so it does not
      need the 10 year old hack, so this should be safe to merge
      now.
      
      Link: https://lore.kernel.org/lkml/87h7x3ajll.fsf_-_@x220.int.ebiederm.org/
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Fixes: 48e6484d ("[PATCH] proc: Rewrite the proc dentry flush on exit optimization").
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
  2. 01 Apr, 2020 1 commit
    • signal: Extend exec_id to 64bits · d1e7fd64
      Eric W. Biederman authored
      Replace the 32bit exec_id with a 64bit exec_id to make it impossible
      to wrap the exec_id counter.  With care an attacker can cause exec_id
      wrap and send arbitrary signals to a newly exec'd parent.  This
      bypasses the signal sending checks if the parent changes their
      credentials during exec.
      
      The severity of this problem can be seen in that, in my limited testing
      of a 32bit exec_id, it can take as little as 19s to exec 65536 times,
      which means it can take as little as 14 days to wrap a 32bit
      exec_id.  Adam Zabrocki has succeeded in wrapping the self_exec_id in 7
      days.  Even my slower timing is within the uptime of a typical server,
      which means self_exec_id is simply a speed bump today, and if exec
      gets noticeably faster self_exec_id won't even be a speed bump.
      
      Extending self_exec_id to 64bits introduces a problem on 32bit
      architectures where reading self_exec_id is no longer atomic and can
      take two read instructions, which means it is possible to hit
      a window where the read value of exec_id does not match the written
      value.  So with very lucky timing this still remains exploitable
      after this change.
      
      I have updated the update of exec_id on exec to use WRITE_ONCE
      and the read of exec_id in do_notify_parent to use READ_ONCE
      to make it clear that there is no locking between these two
      locations.
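
      Roughly, the pattern is (sketch only; the surrounding conditions in
      do_notify_parent() are elided and not claimed to be exact):

      	/* exec side (fs/exec.c): bump and publish the 64-bit id in one store */
      	WRITE_ONCE(current->self_exec_id, current->self_exec_id + 1);

      	/* signal side (do_notify_parent()): sample the parent's value once */
      	if (tsk->parent_exec_id != READ_ONCE(tsk->parent->self_exec_id))
      		sig = SIGCHLD;	/* demote if the parent has re-exec'd */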
      
      Link: https://lore.kernel.org/kernel-hardening/20200324215049.GA3710@pi3.com.pl
      Fixes: 2.3.23pre2
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
  3. 25 Mar, 2020 5 commits
  4. 10 Feb, 2020 1 commit
    • firmware_loader: load files from the mount namespace of init · 901cff7c
      Topi Miettinen authored
      I have an experimental setup where almost every possible system
      service (even early startup ones) runs in a separate namespace, using a
      dedicated, minimal file system. In the process of minimizing the contents
      of the file systems with regard to modules and firmware files, I
      noticed that in my system the firmware files are loaded from three
      different mount namespaces: those of systemd-udevd, init and
      systemd-networkd. The logic of the source namespace is not very clear;
      it seems to depend on the driver, but the namespace of the current
      process is used.
      
      So, this patch tries to make things a bit clearer and changes firmware
      loading to use only the mount namespace of init. This
      may also improve security, though I think that using firmware files as
      an attack vector would be too impractical anyway.
      
      Later, it might make sense to make the mount namespace configurable,
      for example with a new file in /proc/sys/kernel/firmware_config/. That
      would allow a dedicated file system only for firmware files and those
      need not be present anywhere else. This configurability would make
      more sense if made also for kernel modules and /sbin/modprobe. Modules
      are already loaded from init namespace (usermodehelper uses kthreadd
      namespace) except when directly loaded by systemd-udevd.
      
      Instead of using the mount namespace of the current process to load
      firmware files, use the mount namespace of init process.
      
      Link: https://lore.kernel.org/lkml/bb46ebae-4746-90d9-ec5b-fce4c9328c86@gmail.com/
      Link: https://lore.kernel.org/lkml/0e3f7653-c59d-9341-9db2-c88f5b988c68@gmail.com/
      Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
      Link: https://lore.kernel.org/r/20200123125839.37168-1-toiwoton@gmail.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  5. 31 Jan, 2020 1 commit
    • execve: warn if process starts with executable stack · 47a2ebb7
      Alexey Dobriyan authored
      There were a few episodes of silent downgrade to an executable stack over
      the years:
      
      1) linking an innocent-looking assembly file will silently add an executable
         stack if the proper linker options are not given as well:
      
      	$ cat f.S
      	.intel_syntax noprefix
      	.text
      	.globl f
      	f:
      	        ret
      
      	$ cat main.c
      	void f(void);
      	int main(void)
      	{
      	        f();
      	        return 0;
      	}
      
      	$ gcc main.c f.S
      	$ readelf -l ./a.out
      	  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                               0x0000000000000000 0x0000000000000000  RWE    0x10
      			 					 ^^^
      
      2) converting C99 nested function into a closure
         https://nullprogram.com/blog/2019/11/15/
      
      	void intsort2(int *base, size_t nmemb, _Bool invert)
      	{
      	    int cmp(const void *a, const void *b)
      	    {
      	        int r = *(int *)a - *(int *)b;
      	        return invert ? -r : r;
      	    }
      	    qsort(base, nmemb, sizeof(*base), cmp);
      	}
      
      will silently require stack trampolines, while the non-closure version
      will not.
      
      Without doubt this behaviour is documented somewhere; add a warning so
      that developers and users can at least notice.  After so many years of
      x86_64 having proper executable stack support it should not cause too
      many problems.
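
      A plausible shape for that warning in fs/exec.c's setup_arg_pages()
      (sketch; the exact condition and format string are assumptions):

      	/* one-time notice when the new image requests an executable stack */
      	if (unlikely(vm_flags & VM_EXEC))
      		pr_warn_once("process '%pD4' started with executable stack\n",
      			     bprm->file);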
      
      Link: http://lkml.kernel.org/r/20191208171918.GC19716@avx2
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 23 Jan, 2020 1 commit
    • mm: remove arch_bprm_mm_init() hook · 42222eae
      Dave Hansen authored
      From: Dave Hansen <dave.hansen@linux.intel.com>
      
      MPX is being removed from the kernel due to a lack of support
      in the toolchain going forward (gcc).
      
      arch_bprm_mm_init() is used at execve() time.  The only non-stub
      implementation is on x86 for MPX.  Remove the hook entirely from
      all architectures and generic code.
      
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: x86@kernel.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: linux-arch@vger.kernel.org
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
  7. 20 Nov, 2019 1 commit
  8. 13 Nov, 2019 1 commit
  9. 23 Oct, 2019 1 commit
  10. 25 Sep, 2019 1 commit
    • sched/membarrier: Fix p->mm->membarrier_state racy load · 227a4aad
      Mathieu Desnoyers authored
      The membarrier_state field is located within the mm_struct, which
      is not guaranteed to exist when used from runqueue-lock-free iteration
      on runqueues by the membarrier system call.
      
      Copy the membarrier_state from the mm_struct into the scheduler runqueue
      when the scheduler switches between mm.
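
      A sketch of that copy on mm switch (assuming a helper of roughly this
      shape; not claimed to be the exact code):

      	static inline void membarrier_switch_mm(struct rq *rq,
      						struct mm_struct *prev_mm,
      						struct mm_struct *next_mm)
      	{
      		int membarrier_state;

      		if (prev_mm == next_mm)
      			return;

      		membarrier_state = atomic_read(&next_mm->membarrier_state);
      		if (READ_ONCE(rq->membarrier_state) == membarrier_state)
      			return;

      		WRITE_ONCE(rq->membarrier_state, membarrier_state);
      	}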
      
      When registering membarrier for mm, after setting the registration bit
      in the mm membarrier state, issue a synchronize_rcu() to ensure the
      scheduler observes the change. In order to take care of the case
      where a runqueue keeps executing the target mm without switching to
      another mm, iterate over each runqueue and issue an IPI to copy the
      membarrier_state from the mm_struct into each runqueue that runs the
      mm whose state has just been modified.
      
      Move the mm membarrier_state field closer to pgd in mm_struct to use
      a cache line already touched by the scheduler switch_mm.
      
      The membarrier_execve() hook (now membarrier_exec_mmap) now needs to
      clear the runqueue's membarrier state in addition to clearing the mm
      membarrier state, so move its implementation into the scheduler
      membarrier code so it can access the runqueue structure.
      
      Add memory barrier in membarrier_exec_mmap() prior to clearing
      the membarrier state, ensuring memory accesses executed prior to exec
      are not reordered with the stores clearing the membarrier state.
      
      As suggested by Linus, move all membarrier.c RCU read-side locks outside
      of the for each cpu loops.
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190919173705.2181-5-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  11. 25 Jul, 2019 1 commit
    • sched/fair: Don't free p->numa_faults with concurrent readers · 16d51a59
      Jann Horn authored
      When going through execve(), zero out the NUMA fault statistics instead of
      freeing them.
      
      During execve, the task is reachable through procfs and the scheduler. A
      concurrent /proc/*/sched reader can read data from a freed ->numa_faults
      allocation (confirmed by KASAN) and write it back to userspace.
      I believe that it would also be possible for a use-after-free read to occur
      through a race between a NUMA fault and execve(): task_numa_fault() can
      lead to task_numa_compare(), which invokes task_weight() on the currently
      running task of a different CPU.
      
      Another way to fix this would be to make ->numa_faults RCU-managed or add
      extra locking, but it seems easier to wipe the NUMA fault statistics on
      execve.
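
      The core of the change can be sketched as a `final` flag in
      task_numa_free() (illustrative; field and constant names follow the
      scheduler's NUMA code and may differ in detail):

      	/* sketch: inside task_numa_free(p, final) */
      	unsigned long *numa_faults = p->numa_faults;
      	int i;

      	if (!numa_faults)
      		return;

      	if (final) {
      		/* real exit: the allocation can finally be freed */
      		p->numa_faults = NULL;
      		kfree(numa_faults);
      	} else {
      		/* execve(): keep the buffer alive for concurrent readers,
      		 * just reset the statistics */
      		p->total_numa_faults = 0;
      		for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++)
      			numa_faults[i] = 0;
      	}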
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Fixes: 82727018 ("sched/numa: Call task_numa_free() from do_execve()")
      Link: https://lkml.kernel.org/r/20190716152047.14424-1-jannh@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  12. 27 May, 2019 1 commit
  13. 21 May, 2019 1 commit
  14. 15 May, 2019 1 commit
  15. 08 Mar, 2019 2 commits
  16. 19 Feb, 2019 1 commit
    • exec: Fix mem leak in kernel_read_file · f612acfa
      YueHaibing authored
      syzkaller reported this:
      BUG: memory leak
      unreferenced object 0xffffc9000488d000 (size 9195520):
        comm "syz-executor.0", pid 2752, jiffies 4294787496 (age 18.757s)
        hex dump (first 32 bytes):
          ff ff ff ff ff ff ff ff a8 00 00 00 01 00 00 00  ................
          02 00 00 00 00 00 00 00 80 a1 7a c1 ff ff ff ff  ..........z.....
        backtrace:
          [<000000000863775c>] __vmalloc_node mm/vmalloc.c:1795 [inline]
          [<000000000863775c>] __vmalloc_node_flags mm/vmalloc.c:1809 [inline]
          [<000000000863775c>] vmalloc+0x8c/0xb0 mm/vmalloc.c:1831
          [<000000003f668111>] kernel_read_file+0x58f/0x7d0 fs/exec.c:924
          [<000000002385813f>] kernel_read_file_from_fd+0x49/0x80 fs/exec.c:993
          [<0000000011953ff1>] __do_sys_finit_module+0x13b/0x2a0 kernel/module.c:3895
          [<000000006f58491f>] do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290
          [<00000000ee78baf4>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
          [<00000000241f889b>] 0xffffffffffffffff
      
      It should goto the 'out_free' label to free the allocated buf when kernel_read
      fails.
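
      That is, in the read loop (sketch of the one-line change described above):

      	while (pos < i_size) {
      		bytes = kernel_read(file, *buf + pos, i_size - pos, &pos);
      		if (bytes < 0) {
      			ret = bytes;
      			goto out_free;	/* was "goto out", leaking *buf */
      		}
      		if (bytes == 0)
      			break;
      	}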
      
      Fixes: 39d637af ("vfs: forbid write access when reading a file into memory")
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  17. 04 Feb, 2019 1 commit
    • sched/core: Convert sighand_struct.count to refcount_t · d036bda7
      Elena Reshetova authored
      atomic_t variables are currently used to implement reference
      counters with the following properties:
      
       - counter is initialized to 1 using atomic_set()
       - a resource is freed upon counter reaching zero
       - once counter reaches zero, its further
         increments aren't allowed
       - counter schema uses basic atomic operations
         (set, inc, inc_not_zero, dec_and_test, etc.)
      
      Such atomic variables should be converted to a newly provided
      refcount_t type and API that prevents accidental counter overflows
      and underflows. This is important since overflows and underflows
      can lead to use-after-free situation and be exploitable.
      
      The variable sighand_struct.count is used as pure reference counter.
      Convert it to refcount_t and fix up the operations.
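
      The conversion itself is mechanical, roughly (illustrative before/after,
      not the full diff):

      	/* before: plain atomic_t used as a reference count */
      	atomic_set(&sighand->count, 1);
      	atomic_inc(&sighand->count);
      	if (atomic_dec_and_test(&sighand->count))
      		kmem_cache_free(sighand_cachep, sighand);

      	/* after: refcount_t traps overflow and use-after-zero increments */
      	refcount_set(&sighand->count, 1);
      	refcount_inc(&sighand->count);
      	if (refcount_dec_and_test(&sighand->count))
      		kmem_cache_free(sighand_cachep, sighand);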
      
      ** Important note for maintainers:
      
      Some functions from refcount_t API defined in lib/refcount.c
      have different memory ordering guarantees than their atomic
      counterparts.
      
      The full comparison can be seen in
      https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon
      in state to be merged to the documentation tree.
      
      Normally the differences should not matter since refcount_t provides
      enough guarantees to satisfy the refcounting use cases, but in
      some rare cases it might matter.
      
      Please double check that you don't have some undocumented
      memory guarantees for this variable usage.
      
      For the sighand_struct.count it might make a difference
      in following places:
      
       - __cleanup_sighand: decrement in refcount_dec_and_test() only
         provides RELEASE ordering and control dependency on success
         vs. fully ordered atomic counterpart
      Suggested-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: David Windsor <dwindsor@gmail.com>
      Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
      Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: viro@zeniv.linux.org.uk
      Link: https://lkml.kernel.org/r/1547814450-18902-2-git-send-email-elena.reshetova@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  18. 04 Jan, 2019 2 commits
  19. 10 Dec, 2018 1 commit
  20. 04 Dec, 2018 1 commit
    • Revert "exec: make de_thread() freezable" · a72173ec
      Rafael J. Wysocki authored
      Revert commit c2239788 "exec: make de_thread() freezable" as
      requested by Ingo Molnar:
      
      "So there's a new regression in v4.20-rc4, my desktop produces this
      lockdep splat:
      
      [ 1772.588771] WARNING: pkexec/4633 still has locks held!
      [ 1772.588773] 4.20.0-rc4-custom-00213-g93a49841322b #1 Not tainted
      [ 1772.588775] ------------------------------------
      [ 1772.588776] 1 lock held by pkexec/4633:
      [ 1772.588778]  #0: 00000000ed85fbf8 (&sig->cred_guard_mutex){+.+.}, at: prepare_bprm_creds+0x2a/0x70
      [ 1772.588786] stack backtrace:
      [ 1772.588789] CPU: 7 PID: 4633 Comm: pkexec Not tainted 4.20.0-rc4-custom-00213-g93a49841322b #1
      [ 1772.588792] Call Trace:
      [ 1772.588800]  dump_stack+0x85/0xcb
      [ 1772.588803]  flush_old_exec+0x116/0x890
      [ 1772.588807]  ? load_elf_phdrs+0x72/0xb0
      [ 1772.588809]  load_elf_binary+0x291/0x1620
      [ 1772.588815]  ? sched_clock+0x5/0x10
      [ 1772.588817]  ? search_binary_handler+0x6d/0x240
      [ 1772.588820]  search_binary_handler+0x80/0x240
      [ 1772.588823]  load_script+0x201/0x220
      [ 1772.588825]  search_binary_handler+0x80/0x240
      [ 1772.588828]  __do_execve_file.isra.32+0x7d2/0xa60
      [ 1772.588832]  ? strncpy_from_user+0x40/0x180
      [ 1772.588835]  __x64_sys_execve+0x34/0x40
      [ 1772.588838]  do_syscall_64+0x60/0x1c0
      
      The warning gets triggered by an ancient lockdep check in the freezer:
      
      (gdb) list *0xffffffff812ece06
      0xffffffff812ece06 is in flush_old_exec (./include/linux/freezer.h:57).
      52	 * DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION
      53	 * If try_to_freeze causes a lockdep warning it means the caller may deadlock
      54	 */
      55	static inline bool try_to_freeze_unsafe(void)
      56	{
      57		might_sleep();
      58		if (likely(!freezing(current)))
      59			return false;
      60		return __refrigerator(false);
      61	}
      
      I reviewed the ->cred_guard_mutex code, and the mutex is held across all
      of exec() - and we always did this.
      
      But there's this recent -rc4 commit:
      
      > Chanho Min (1):
      >       exec: make de_thread() freezable
      
        c2239788: exec: make de_thread() freezable
      
      I believe this commit is bogus, you cannot call try_to_freeze() from
      de_thread(), because it's holding the ->cred_guard_mutex."
      Reported-by: Ingo Molnar <mingo@kernel.org>
      Tested-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  21. 19 Nov, 2018 1 commit
    • exec: make de_thread() freezable · c2239788
      Chanho Min authored
      Suspend fails due to the exec family of functions blocking the freezer.
      The cause is that de_thread() sleeps in TASK_UNINTERRUPTIBLE waiting for
      all sub-threads to die, and we have a deadlock if one of them is frozen.
      This also can occur with the schedule() waiting for the group thread leader
      to exit if it is frozen.
      
      In our machine, it causes a freeze timeout as below.
      
      Freezing of tasks failed after 20.010 seconds (1 tasks refusing to freeze, wq_busy=0):
      setcpushares-ls D ffffffc00008ed70     0  5817   1483 0x0040000d
       Call trace:
      [<ffffffc00008ed70>] __switch_to+0x88/0xa0
      [<ffffffc000d1c30c>] __schedule+0x1bc/0x720
      [<ffffffc000d1ca90>] schedule+0x40/0xa8
      [<ffffffc0001cd784>] flush_old_exec+0xdc/0x640
      [<ffffffc000220360>] load_elf_binary+0x2a8/0x1090
      [<ffffffc0001ccff4>] search_binary_handler+0x9c/0x240
      [<ffffffc00021c584>] load_script+0x20c/0x228
      [<ffffffc0001ccff4>] search_binary_handler+0x9c/0x240
      [<ffffffc0001ce8e0>] do_execveat_common.isra.14+0x4f8/0x6e8
      [<ffffffc0001cedd0>] compat_SyS_execve+0x38/0x48
      [<ffffffc00008de30>] el0_svc_naked+0x24/0x28
      
      To fix this, make de_thread() freezable. It looks safe and works fine.
      Suggested-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Chanho Min <chanho.min@lge.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Pavel Machek <pavel@ucw.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  22. 10 Oct, 2018 1 commit
    • vfs: require i_size <= SIZE_MAX in kernel_read_file() · 691115c3
      Eric Biggers authored
      On 32-bit systems, the buffer allocated by kernel_read_file() is too
      small if the file size is > SIZE_MAX, due to truncation to size_t.
      
      Fortunately, since the 'count' argument to kernel_read() is also
      truncated to size_t, only the allocated space is filled; then, -EIO is
      returned since 'pos != i_size' after the read loop.
      
      But this is not obvious and seems incidental.  We should be more
      explicit about this case.  So, fail early if i_size > SIZE_MAX.
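
      In other words, something along these lines near the top of
      kernel_read_file() (sketch; max_size is the caller-supplied bound the
      function already enforces, and the error code is assumed to be -EFBIG):

      	if (i_size > SIZE_MAX || (max_size > 0 && i_size > max_size)) {
      		ret = -EFBIG;
      		goto out;
      	}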
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
  23. 27 Jul, 2018 1 commit
    • mm: fix vma_is_anonymous() false-positives · bfd40eaf
      Kirill A. Shutemov authored
      vma_is_anonymous() relies on ->vm_ops being NULL to detect anonymous
      VMA.  This is unreliable as ->mmap may not set ->vm_ops.
      
      False-positive vma_is_anonymous() may lead to crashes:
      
      	next ffff8801ce5e7040 prev ffff8801d20eca50 mm ffff88019c1e13c0
      	prot 27 anon_vma ffff88019680cdd8 vm_ops 0000000000000000
      	pgoff 0 file ffff8801b2ec2d00 private_data 0000000000000000
      	flags: 0xff(read|write|exec|shared|mayread|maywrite|mayexec|mayshare)
      	------------[ cut here ]------------
      	kernel BUG at mm/memory.c:1422!
      	invalid opcode: 0000 [#1] SMP KASAN
      	CPU: 0 PID: 18486 Comm: syz-executor3 Not tainted 4.18.0-rc3+ #136
      	Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google
      	01/01/2011
      	RIP: 0010:zap_pmd_range mm/memory.c:1421 [inline]
      	RIP: 0010:zap_pud_range mm/memory.c:1466 [inline]
      	RIP: 0010:zap_p4d_range mm/memory.c:1487 [inline]
      	RIP: 0010:unmap_page_range+0x1c18/0x2220 mm/memory.c:1508
      	Call Trace:
      	 unmap_single_vma+0x1a0/0x310 mm/memory.c:1553
      	 zap_page_range_single+0x3cc/0x580 mm/memory.c:1644
      	 unmap_mapping_range_vma mm/memory.c:2792 [inline]
      	 unmap_mapping_range_tree mm/memory.c:2813 [inline]
      	 unmap_mapping_pages+0x3a7/0x5b0 mm/memory.c:2845
      	 unmap_mapping_range+0x48/0x60 mm/memory.c:2880
      	 truncate_pagecache+0x54/0x90 mm/truncate.c:800
      	 truncate_setsize+0x70/0xb0 mm/truncate.c:826
      	 simple_setattr+0xe9/0x110 fs/libfs.c:409
      	 notify_change+0xf13/0x10f0 fs/attr.c:335
      	 do_truncate+0x1ac/0x2b0 fs/open.c:63
      	 do_sys_ftruncate+0x492/0x560 fs/open.c:205
      	 __do_sys_ftruncate fs/open.c:215 [inline]
      	 __se_sys_ftruncate fs/open.c:213 [inline]
      	 __x64_sys_ftruncate+0x59/0x80 fs/open.c:213
      	 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
      	 entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Reproducer:
      
      	#include <stdio.h>
      	#include <stddef.h>
      	#include <stdint.h>
      	#include <stdlib.h>
      	#include <string.h>
      	#include <sys/types.h>
      	#include <sys/stat.h>
      	#include <sys/ioctl.h>
      	#include <sys/mman.h>
      	#include <unistd.h>
      	#include <fcntl.h>
      
      	#define KCOV_INIT_TRACE			_IOR('c', 1, unsigned long)
      	#define KCOV_ENABLE			_IO('c', 100)
      	#define KCOV_DISABLE			_IO('c', 101)
      	#define COVER_SIZE			(1024<<10)
      
      	#define KCOV_TRACE_PC  0
      	#define KCOV_TRACE_CMP 1
      
      	int main(int argc, char **argv)
      	{
      		int fd;
      		unsigned long *cover;
      
      		system("mount -t debugfs none /sys/kernel/debug");
      		fd = open("/sys/kernel/debug/kcov", O_RDWR);
      		ioctl(fd, KCOV_INIT_TRACE, COVER_SIZE);
      		cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
      				PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      		munmap(cover, COVER_SIZE * sizeof(unsigned long));
      		cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
      				PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
      		memset(cover, 0, COVER_SIZE * sizeof(unsigned long));
      		ftruncate(fd, 3UL << 20);
      		return 0;
      	}
      
      This can be fixed by assigning anonymous VMAs own vm_ops and not relying
      on it being NULL.
      
      If ->mmap() failed to set ->vm_ops, mmap_region() will set it to
      dummy_vm_ops.  This way we will have non-NULL ->vm_ops for all VMAs.
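
      That is, roughly (sketch following the description above; label names
      are illustrative):

      	/* in mmap_region(), after the driver's ->mmap() has run */
      	error = call_mmap(file, vma);
      	if (error)
      		goto unmap_and_free_vma;

      	/* never leave ->vm_ops NULL for a file-backed mapping, so that
      	 * vma_is_anonymous() cannot give a false positive */
      	if (!vma->vm_ops)
      		vma->vm_ops = &dummy_vm_ops;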
      
      Link: http://lkml.kernel.org/r/20180724121139.62570-4-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: syzbot+3f84280d52be9b7083cc@syzkaller.appspotmail.com
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  24. 21 Jul, 2018 3 commits
    • mm: make vm_area_alloc() initialize core fields · 490fc053
      Linus Torvalds authored
      Like vm_area_dup(), it initializes the anon_vma_chain head, and the
      basic mm pointer.
      
      The rest of the fields end up being different for different users,
      although the plan is to also initialize the 'vm_ops' field to a dummy
      entry.
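
      A sketch of the resulting helper (assuming the existing vm_area_cachep
      slab cache; not necessarily the exact code):

      	struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
      	{
      		struct vm_area_struct *vma;

      		vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
      		if (vma) {
      			vma->vm_mm = mm;
      			INIT_LIST_HEAD(&vma->anon_vma_chain);
      		}
      		return vma;
      	}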
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use helper functions for allocating and freeing vm_area structs · 3928d4f5
      Linus Torvalds authored
      The vm_area_struct is one of the most fundamental memory management
      objects, but the management of it is entirely open-coded everywhere,
      ranging from allocation and freeing (using kmem_cache_[z]alloc and
      kmem_cache_free) to initializing all the fields.
      
      We want to unify this in order to end up having some unified
      initialization of the vmas, and the first step to this is to at least
      have basic allocation functions.
      
      Right now those functions are literally just wrappers around the
      kmem_cache_*() calls.  This is a purely mechanical conversion:
      
          # new vma:
          kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL) -> vm_area_alloc()
      
          # copy old vma
          kmem_cache_alloc(vm_area_cachep, GFP_KERNEL) -> vm_area_dup(old)
      
          # free vma
          kmem_cache_free(vm_area_cachep, vma) -> vm_area_free(vma)
      
      to the point where the old vma passed in to the vm_area_dup() function
      isn't even used yet (because I've left all the old manual initialization
      alone).
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • pid: Implement PIDTYPE_TGID · 6883f81a
      Eric W. Biederman authored
      Everywhere except in the pid array we distinguish between a task's pid and
      a task's tgid (thread group id).  Even in the enumeration we want that
      distinction sometimes, so we have added __PIDTYPE_TGID.  With leader_pid
      we almost have an implementation of PIDTYPE_TGID in struct signal_struct.
      
      Add PIDTYPE_TGID as a first class member of the pid_type enumeration and
      into the pids array.  Then remove the __PIDTYPE_TGID special case and the
      leader_pid in signal_struct.
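
      For reference, the enumeration after this change reads roughly:

      	enum pid_type {
      		PIDTYPE_PID,
      		PIDTYPE_TGID,
      		PIDTYPE_PGID,
      		PIDTYPE_SID,
      		PIDTYPE_MAX,
      	};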
      
      The net size increase is just an extra pointer added to struct pid and
      an extra pair of pointers of an hlist_node added to task_struct.
      
      The effect on code maintenance is the removal of a number of special
      cases today and the potential to remove many more special cases as
      PIDTYPE_TGID gets used to its fullest.  The long term potential
      is allowing zombie thread group leaders to exit, which will remove
      a lot more special cases in the code.
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
  25. 06 Jun, 2018 1 commit
    • rseq: Introduce restartable sequences system call · d7822b1e
      Mathieu Desnoyers authored
      Expose a new system call allowing each thread to register one userspace
      memory area to be used as an ABI between kernel and user-space for two
      purposes: user-space restartable sequences and quick access to read the
      current CPU number value from user-space.
      
      * Restartable sequences (per-cpu atomics)
      
      Restartable sequences allow user-space to perform update operations on
      per-cpu data without requiring heavy-weight atomic operations.
      
      The restartable critical sections (percpu atomics) work has been started
      by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
      critical sections. [1] [2] The re-implementation proposed here brings a
      few simplifications to the ABI which facilitates porting to other
      architectures and speeds up the user-space fast path.
      
      Here are benchmarks of various rseq use-cases.
      
      Test hardware:
      
      arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
      x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading
      
      The following benchmarks were all performed on a single thread.
      
      * Per-CPU statistic counter increment
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:                344.0                 31.4          11.0
      x86-64:                15.3                  2.0           7.7
      
      * LTTng-UST: write event 32-bit header, 32-bit payload into tracer
                   per-cpu buffer
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:               2502.0                 2250.0         1.1
      x86-64:               117.4                   98.0         1.2
      
      * liburcu percpu: lock-unlock pair, dereference, read/compare word
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:                751.0                 128.5          5.8
      x86-64:                53.4                  28.6          1.9
      
      * jemalloc memory allocator adapted to use rseq
      
      Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
      rseq 2016 implementation):
      
      The production workload response-time has 1-2% gain avg. latency, and
      the P99 overall latency drops by 2-3%.
      
      * Reading the current CPU number
      
      Speeding up reading the current CPU number on which the caller thread is
      running is done by keeping the current CPU number up to date within the
      cpu_id field of the memory area registered by the thread. This is done
      by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
      current thread. Upon return to user-space, a notify-resume handler
      updates the current CPU value within the registered user-space memory
      area. User-space can then read the current CPU number directly from
      memory.
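
      For illustration, a minimal (hedged) user-space sketch that registers a
      struct rseq and reads cpu_id; it assumes <linux/rseq.h> and __NR_rseq
      are available on the build host, and note that a modern glibc may have
      already registered rseq for the thread, in which case the syscall fails
      with EBUSY:

      	#define _GNU_SOURCE
      	#include <linux/rseq.h>
      	#include <sys/syscall.h>
      	#include <unistd.h>
      	#include <stdio.h>

      	/* one registered area per thread, 32-byte aligned as the ABI requires */
      	static __thread struct rseq rs __attribute__((aligned(32)));

      	int main(void)
      	{
      		/* rseq(rseq, rseq_len, flags, sig): flags=0 registers the area */
      		if (syscall(__NR_rseq, &rs, sizeof(rs), 0, 0) != 0) {
      			perror("rseq register");
      			return 1;
      		}
      		/* the kernel keeps cpu_id up to date; read it straight from memory */
      		printf("running on cpu %u\n", rs.cpu_id);
      		return 0;
      	}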
      
      Keeping the current cpu id in a memory area shared between kernel and
      user-space is an improvement over current mechanisms available to read
      the current CPU number, which has the following benefits over
      alternative approaches:
      
      - 35x speedup on ARM vs system call through glibc
      - 20x speedup on x86 compared to calling glibc, which calls vdso
        executing a "lsl" instruction,
      - 14x speedup on x86 compared to inlined "lsl" instruction,
      - Unlike vdso approaches, this cpu_id value can be read from an inline
        assembly, which makes it a useful building block for restartable
        sequences.
      - The approach of reading the cpu id through memory mapping shared
        between kernel and user-space is portable (e.g. ARM), which is not the
        case for the lsl-based x86 vdso.
      
      On x86, yet another possible approach would be to use the gs segment
      selector to point to user-space per-cpu data. This approach performs
      similarly to the cpu id cache, but it has two disadvantages: it is
      not portable, and it is incompatible with existing applications already
      using the gs segment selector for other purposes.
      
      Benchmarking various approaches for reading the current CPU number:
      
      ARMv7 Processor rev 4 (v7l)
      Machine model: Cubietruck
      - Baseline (empty loop):                                    8.4 ns
      - Read CPU from rseq cpu_id:                               16.7 ns
      - Read CPU from rseq cpu_id (lazy register):               19.8 ns
      - glibc 2.19-0ubuntu6.6 getcpu:                           301.8 ns
      - getcpu system call:                                     234.9 ns
      
      x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
      - Baseline (empty loop):                                    0.8 ns
      - Read CPU from rseq cpu_id:                                0.8 ns
      - Read CPU from rseq cpu_id (lazy register):                0.8 ns
      - Read using gs segment selector:                           0.8 ns
      - "lsl" inline assembly:                                   13.0 ns
      - glibc 2.19-0ubuntu6 getcpu:                              16.6 ns
      - getcpu system call:                                      53.9 ns
      
      - Speed (benchmark taken on v8 of patchset)
      
      Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
      expectations, that enabling CONFIG_RSEQ slightly accelerates the
      scheduler:
      
      Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
      2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
      saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
      kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
      restartable sequences series applied.
      
      * CONFIG_RSEQ=n
      
      avg.:      41.37 s
      std.dev.:   0.36 s
      
      * CONFIG_RSEQ=y
      
      avg.:      40.46 s
      std.dev.:   0.33 s
      
      - Size
      
      On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
      567 bytes, and the data size increase of vmlinux is 5696 bytes.
      
      [1] https://lwn.net/Articles/650333/
      [2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-api@vger.kernel.org
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
      Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
      Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com
  26. 23 May, 2018 1 commit
    • umh: introduce fork_usermode_blob() helper · 449325b5
      Alexei Starovoitov authored
      Introduce helper:
      int fork_usermode_blob(void *data, size_t len, struct umh_info *info);
      struct umh_info {
             struct file *pipe_to_umh;
             struct file *pipe_from_umh;
             pid_t pid;
      };
      
      so that GPLed kernel modules (signed or unsigned) can use it to execute part
      of their own data as a swappable user mode process.
      
      The kernel will do:
      - allocate a unique file in tmpfs
      - populate that file with [data, data + len] bytes
      - user-mode-helper code will do_execve that file and, before the process
        starts, the kernel will create two unix pipes for bidirectional
        communication between kernel module and umh
      - close tmpfs file, effectively deleting it
      - the fork_usermode_blob will return zero on success and populate
        'struct umh_info' with two unix pipes and the pid of the user process
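
      A hedged sketch of module-side usage (my_umh_start/my_umh_end are
      hypothetical symbols bracketing the embedded user-mode ELF image; in
      bpfilter they come from linking the user-mode program into the module):

      	extern char my_umh_start[];
      	extern char my_umh_end[];

      	static struct umh_info info;

      	static int __init my_module_init(void)
      	{
      		int err;

      		err = fork_usermode_blob(my_umh_start,
      					 my_umh_end - my_umh_start, &info);
      		if (err)
      			return err;
      		pr_info("umh started, pid %d\n", info.pid);
      		return 0;
      	}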
      
      As the first step in the development of the bpfilter project
      the fork_usermode_blob() helper is introduced to allow user mode code
      to be invoked from a kernel module. The idea is that user mode code plus
      normal kernel module code are built as part of the kernel build
      and installed as traditional kernel module into distro specified location,
      such that from a distribution point of view, there is
      no difference between regular kernel modules and kernel modules + umh code.
      Such modules can be signed, modprobed, rmmod, etc. The use of this new helper
      by a kernel module doesn't make it any special from kernel and user space
      tooling point of view.
      
      Such an approach enables the kernel to delegate functionality traditionally done
      by kernel modules to user space processes (either root or !root) and
      reduces the security attack surface of the new code. Buggy umh code would crash
      the user process, but not the kernel. Another advantage is that umh code
      of the kernel module can be debugged and tested out of user space
      (e.g. opening the possibility to run clang sanitizers, fuzzers or
      user space test suites on the umh code).
      In case of the bpfilter project such architecture allows complex control plane
      to be done in the user space while bpf based data plane stays in the kernel.
      
      Since umh can crash, can be oom-ed by the kernel, killed by the admin,
      the kernel module that uses them (like bpfilter) needs to manage life
      time of umh on its own via two unix pipes and the pid of umh.
      
      The exit code of such kernel module should kill the umh it started,
      so that rmmod of the kernel module will cleanup the corresponding umh.
      Just like if the kernel module does kmalloc() it should kfree() it
      in the exit code.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  27. 11 Apr, 2018 3 commits
  28. 19 Mar, 2018 1 commit
  29. 03 Jan, 2018 1 commit