  1. 14 Jul, 2020 2 commits
    • selftests/seccomp: Test SECCOMP_IOCTL_NOTIF_ADDFD · c97aedc5
      Sargun Dhillon authored
      Test whether we can add file descriptors in response to notifications.
      This injects the file descriptors via notifications, and then uses kcmp
      to determine whether or not it has been successful.
      
      It also includes some basic sanity checking for arguments.
      Signed-off-by: Sargun Dhillon <sargun@sargun.me>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Chris Palmer <palmer@google.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Robert Sesek <rsesek@google.com>
      Cc: Tycho Andersen <tycho@tycho.ws>
      Cc: Matt Denton <mpdenton@google.com>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Link: https://lore.kernel.org/r/20200603011044.7972-5-sargun@sargun.me
      Co-developed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • seccomp: Introduce addfd ioctl to seccomp user notifier · 7cf97b12
      Sargun Dhillon authored
      The current SECCOMP_RET_USER_NOTIF API allows for syscall supervision over
      an fd. It is often used in settings where a supervising task emulates
      syscalls on behalf of a supervised task in userspace, either to further
      restrict the supervisee's syscall abilities or to circumvent kernel
      enforced restrictions the supervisor deems safe to lift (e.g. actually
      performing a mount(2) for an unprivileged container).
      
      While SECCOMP_RET_USER_NOTIF allows for the interception of any syscall,
      only a certain subset of syscalls could be correctly emulated. Over the
      last few development cycles, the set of syscalls which can't be emulated
      has been reduced due to the addition of pidfd_getfd(2). With this we are
      now able to, for example, intercept syscalls that require the supervisor
      to operate on file descriptors of the supervisee such as connect(2).
      
      However, syscalls that cause new file descriptors to be installed can not
      currently be correctly emulated since there is no way for the supervisor
      to inject file descriptors into the supervisee. This patch adds a
      new addfd ioctl to remove this restriction by allowing the supervisor to
      install file descriptors into the intercepted task. By implementing this
      feature via seccomp the supervisor effectively instructs the supervisee
      to install a set of file descriptors into its own file descriptor table
      during the intercepted syscall. This way it is possible to intercept
      syscalls such as open() or accept(), and install (or replace, like
      dup2(2)) the supervisor's resulting fd into the supervisee. One
      replacement use-case would be to redirect the stdout and stderr of a
      supervisee into log file descriptors opened by the supervisor.
      
      The ioctl handling is based on the discussions[1] of how Extensible
      Arguments should interact with ioctls. Instead of building size into
      the addfd structure, make it a function of the ioctl command (which
      is how sizes are normally passed to ioctls). To support forward and
      backward compatibility, just mask out the direction and size, and match
      everything. The size (and any future direction) checks are done along
      with copy_struct_from_user() logic.
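
      The "mask out the direction and size, and match everything" idea
      can be sketched in userspace as follows. This is only an
      illustration: the DEMO_* names and both struct layouts are
      hypothetical, and the real kernel dispatch is not reproduced here.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/ioctl.h>

/* Hypothetical v1 and v2 layouts of an extensible ioctl argument. */
struct demo_args_v1 { uint64_t id; uint32_t flags; uint32_t srcfd; };
struct demo_args_v2 { uint64_t id; uint32_t flags; uint32_t srcfd; uint64_t extra; };

#define DEMO_MAGIC  '!'   /* assumed magic, for illustration only */
#define DEMO_CMD_V1 _IOW(DEMO_MAGIC, 3, struct demo_args_v1)
#define DEMO_CMD_V2 _IOW(DEMO_MAGIC, 3, struct demo_args_v2)

/* The struct size is encoded in the command, so the two versions differ. */
static_assert(DEMO_CMD_V1 != DEMO_CMD_V2, "size is part of the command number");

/* Strip the direction and size bits, leaving only the type and number. */
static unsigned int demo_cmd_key(unsigned int cmd)
{
	return cmd & ~(IOC_INOUT | IOCSIZE_MASK);
}

/* Size the caller encoded in the command; it is validated separately. */
static unsigned int demo_cmd_size(unsigned int cmd)
{
	return _IOC_SIZE(cmd);
}
```

      Both versions of the command map to the same key, so an older or
      newer caller reaches the same handler, which can then decide how
      many bytes to copy based on demo_cmd_size().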
      
      As a note, the seccomp_notif_addfd structure is laid out based on 8-byte
      alignment without requiring packing as there have been packing issues
      with uapi highlighted before[2][3]. Although we could overload the
      newfd field and use -1 to indicate that it is not to be used, doing
      so requires changing the size of the fd field, and introduces struct
      packing complexity.
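
      A layout along these lines needs no packing attribute. The sketch
      below is an assumption based on the description above (a 64-bit
      ID followed by four 32-bit fields, with newfd kept separate from
      the flags rather than overloaded with -1); the struct name and
      the field names are illustrative and should be checked against
      the uapi header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative, naturally aligned layout; not the uapi definition itself. */
struct addfd_sketch {
	uint64_t id;          /* notification this request answers */
	uint32_t flags;       /* e.g. "use newfd" is a flag, not newfd == -1 */
	uint32_t srcfd;       /* fd in the supervisor to install */
	uint32_t newfd;       /* requested fd number, honoured only if flagged */
	uint32_t newfd_flags; /* flags for the installed fd */
};

/* One 8-byte field plus four 4-byte fields: no holes, no tail padding. */
static_assert(sizeof(struct addfd_sketch) == 24, "unexpected padding");
static_assert(sizeof(struct addfd_sketch) % 8 == 0, "size is a multiple of 8");
static_assert(offsetof(struct addfd_sketch, flags) == 8, "hole after id");
static_assert(offsetof(struct addfd_sketch, newfd) == 16, "hole before newfd");
```

      Because every field sits on its natural boundary, 32-bit and
      64-bit userspace see the same offsets, which is the property the
      commit message is after.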
      
      [1]: https://lore.kernel.org/lkml/87o8w9bcaf.fsf@mid.deneb.enyo.de/
      [2]: https://lore.kernel.org/lkml/a328b91d-fd8f-4f27-b3c2-91a9c45f18c0@rasmusvillemoes.dk/
      [3]: https://lore.kernel.org/lkml/20200612104629.GA15814@ircssh-2.c.rugged-nimbus-611.internal
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Tycho Andersen <tycho@tycho.ws>
      Cc: Jann Horn <jannh@google.com>
      Cc: Robert Sesek <rsesek@google.com>
      Cc: Chris Palmer <palmer@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-api@vger.kernel.org
      Suggested-by: Matt Denton <mpdenton@google.com>
      Link: https://lore.kernel.org/r/20200603011044.7972-4-sargun@sargun.me
      Signed-off-by: Sargun Dhillon <sargun@sargun.me>
      Reviewed-by: Will Drewry <wad@chromium.org>
      Co-developed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
  2. 13 Jul, 2020 7 commits
    • fs: Expand __receive_fd() to accept existing fd · 17381715
      Kees Cook authored
      Expand __receive_fd() with support for replace_fd() for the coming seccomp
      "addfd" ioctl(). Add new wrapper receive_fd_replace() for the new behavior
      and update existing wrappers to retain old behavior.
      
      Thanks to Colin Ian King <colin.king@canonical.com> for pointing out an
      uninitialized variable exposure in an earlier version of this patch.
      
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dmitry Kadashev <dkadashev@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Sargun Dhillon <sargun@sargun.me>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • pidfd: Replace open-coded receive_fd() · 910d2f16
      Kees Cook authored
      Replace the open-coded version of receive_fd() with a call to the
      new helper.
      
      Thanks to Vamshi K Sthambamkadi <vamshi.k.sthambamkadi@gmail.com> for
      catching a missed fput() in an earlier version of this patch.
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: netdev@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Reviewed-by: Sargun Dhillon <sargun@sargun.me>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • fs: Add receive_fd() wrapper for __receive_fd() · deefa7f3
      Kees Cook authored
      For both pidfd and seccomp, the __user pointer is not used. Update
      __receive_fd() to make writing to ufd optional via a NULL check. However,
      for the receive_fd_user() wrapper, ufd is NULL checked so an -EFAULT
      can be returned to avoid changing the SCM_RIGHTS interface behavior. Add
      new wrapper receive_fd() for pidfd and seccomp that does not use the ufd
      argument. For the new helper, the allocated fd needs to be returned on
      success. Update the existing callers to handle it.
      
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Sargun Dhillon <sargun@sargun.me>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • fs: Move __scm_install_fd() to __receive_fd() · 66590610
      Kees Cook authored
      In preparation for users of the "install a received file" logic outside
      of net/ (pidfd and seccomp), relocate and rename __scm_install_fd() from
      net/core/scm.c to __receive_fd() in fs/file.c, and provide a wrapper
      named receive_fd_user(), as future patches will change the interface
      to __receive_fd().
      
      Additionally add a comment to fd_install() as a counterpoint to how
      __receive_fd() interacts with fput().
      
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Dmitry Kadashev <dkadashev@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Sargun Dhillon <sargun@sargun.me>
      Cc: Ido Schimmel <idosch@idosch.org>
      Cc: Ioana Ciornei <ioana.ciornei@nxp.com>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: netdev@vger.kernel.org
      Reviewed-by: Sargun Dhillon <sargun@sargun.me>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • net/scm: Regularize compat handling of scm_detach_fds() · c0029de5
      Kees Cook authored
      Duplicate the cleanups from commit 2618d530 ("net/scm: cleanup
      scm_detach_fds") into the compat code.
      
      Replace open-coded __receive_sock() with a call to the helper.
      
      Move the check added in commit 1f466e1f ("net: cleanly handle kernel
      vs user buffers for ->msg_control") to before the compat call, even
      though it should be impossible for an in-kernel call to also be compat.
      
      Correct the int "flags" argument to unsigned int to match fd_install()
      and similar APIs.
      
      Regularize any remaining differences, including a whitespace issue,
      a checkpatch warning, and add the check from commit 6900317f ("net,
      scm: fix PaX detected msg_controllen overflow in scm_detach_fds") which
      fixed an overflow unique to 64-bit. To avoid confusion when comparing
      the compat handler to the native handler, just include the same check
      in the compat handler.
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Sargun Dhillon <sargun@sargun.me>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: netdev@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • pidfd: Add missing sock updates for pidfd_getfd() · 4969f8a0
      Kees Cook authored
      The sock counting (sock_update_netprioidx() and sock_update_classid())
      was missing from pidfd's implementation of received fd installation. Add
      a call to the new __receive_sock() helper.
      
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Sargun Dhillon <sargun@sargun.me>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: netdev@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: stable@vger.kernel.org
      Fixes: 8649c322 ("pid: Implement pidfd_getfd syscall")
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • net/compat: Add missing sock updates for SCM_RIGHTS · d9539752
      Kees Cook authored
      Add missed sock updates to compat path via a new helper, which will be
      used more in coming patches. (The net/core/scm.c code is left as-is here
      to assist with -stable backports for the compat path.)
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Sargun Dhillon <sargun@sargun.me>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: stable@vger.kernel.org
      Fixes: 48a87cc2 ("net: netprio: fd passed in SCM_RIGHTS datagram not set correctly")
      Fixes: d8429506 ("net: net_cls: fd passed in SCM_RIGHTS datagram not set correctly")
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
  3. 10 Jul, 2020 21 commits
    • selftests/seccomp: Check ENOSYS under tracing · 11eb004e
      Kees Cook authored
      There should be no difference between -1 and other negative syscalls
      while tracing.
      
      Cc: Keno Fischer <keno@juliacomputing.com>
      Tested-by: Will Deacon <will@kernel.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • selftests/seccomp: Refactor to use fixture variants · adeeec84
      Kees Cook authored
      Now that the selftest harness has variants, use them to eliminate a
      bunch of copy/paste duplication.
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Tested-by: Will Deacon <will@kernel.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • selftests/harness: Clean up kern-doc for fixtures · 9d1587ad
      Kees Cook authored
      The FIXTURE*() macro kern-doc examples had the wrong names for the C code
      examples associated with them. Fix those and clarify that FIXTURE_DATA()
      usage should be avoided.
      
      Cc: Shuah Khan <shuah@kernel.org>
      Fixes: 74bc7c97 ("kselftest: add fixture variants")
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • seccomp: Use -1 marker for end of mode 1 syscall list · fe4bfff8
      Kees Cook authored
      The terminator for the mode 1 syscalls list was a 0, but that could be
      a valid syscall number (e.g. x86_64 __NR_read). By luck, __NR_read was
      listed first and the loop construct would not test it, so there was no
      bug. However, this is fragile. Replace the terminator with -1 instead,
      and make the variable name for mode 1 syscall lists more descriptive.
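
      Why a 0 terminator is fragile can be seen in a small userspace
      sketch. The list and helper below are hypothetical; the numbers
      are the x86_64 values for read, write, exit and rt_sigreturn,
      used purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* 0 is a real entry here (read on x86_64), so it cannot mark the end. */
static const int demo_allowed[] = { 0, 1, 60, 15, -1 /* terminator */ };

/* Return 1 if nr appears before the -1 terminator, 0 otherwise. */
static int demo_is_allowed(const int *list, int nr)
{
	assert(list != NULL);
	for (; *list != -1; list++)
		if (*list == nr)
			return 1;
	return 0;
}
```

      With -1 as the marker, the position of syscall 0 in the list no
      longer matters, and a negative syscall number can never match an
      entry.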
      
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Will Drewry <wad@chromium.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • seccomp: Fix ioctl number for SECCOMP_IOCTL_NOTIF_ID_VALID · 47e33c05
      Kees Cook authored
      When SECCOMP_IOCTL_NOTIF_ID_VALID was first introduced it had the wrong
      direction flag set. While this isn't a big deal as nothing currently
      enforces these bits in the kernel, it should be defined correctly. Fix
      the define and provide support for the old command until it is no longer
      needed for backward compatibility.
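
      Supporting both command numbers could look roughly like the
      sketch below. It assumes the old define used the read direction
      and the corrected one uses the write direction (the ioctl passes
      an ID into the kernel); the DEMO_* names and the dispatcher are
      hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/ioctl.h>

#define DEMO_MAGIC              '!'   /* assumed magic, for illustration */
#define DEMO_ID_VALID           _IOW(DEMO_MAGIC, 2, uint64_t)
#define DEMO_ID_VALID_WRONG_DIR _IOR(DEMO_MAGIC, 2, uint64_t)

/* The direction bits are part of the number, so the two values differ. */
static_assert(DEMO_ID_VALID != DEMO_ID_VALID_WRONG_DIR, "direction is encoded");

/* Accept the corrected command and the legacy one; reject anything else. */
static int demo_dispatch(unsigned int cmd)
{
	switch (cmd) {
	case DEMO_ID_VALID_WRONG_DIR: /* kept for old binaries */
	case DEMO_ID_VALID:
		return 0;
	default:
		return -1;
	}
}
```

      Old binaries compiled against the original header keep working,
      while new builds pick up the corrected number.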
      
      Fixes: 6a21cc50 ("seccomp: add a return code to trap to userspace")
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • selftests/seccomp: Rename user_trap_syscall() to user_notif_syscall() · 279ed890
      Kees Cook authored
      The user_trap_syscall() helper creates a filter with
      SECCOMP_RET_USER_NOTIF. To avoid confusion with SECCOMP_RET_TRAP, rename
      the helper to user_notif_syscall().
      
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Will Drewry <wad@chromium.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Yonghong Song <yhs@fb.com>
      Cc: Andrii Nakryiko <andriin@fb.com>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: KP Singh <kpsingh@chromium.org>
      Cc: linux-kselftest@vger.kernel.org
      Cc: netdev@vger.kernel.org
      Cc: bpf@vger.kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • selftests/seccomp: Make kcmp() less required · cf8918db
      Kees Cook authored
      The seccomp tests are a bit noisy without CONFIG_CHECKPOINT_RESTORE (due
      to missing the kcmp() syscall). The seccomp tests are more accurate with
      kcmp(), but it's not strictly required. Refactor the tests to use
      alternatives (comparing fd numbers), and provide a central test for
      kcmp() so there is a single SKIP instead of many. Continue to produce
      warnings for the other tests, though.
      
      Additionally adds some more bad flag EINVAL tests to the addfd selftest.
      
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Will Drewry <wad@chromium.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Yonghong Song <yhs@fb.com>
      Cc: Andrii Nakryiko <andriin@fb.com>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: KP Singh <kpsingh@chromium.org>
      Cc: linux-kselftest@vger.kernel.org
      Cc: netdev@vger.kernel.org
      Cc: bpf@vger.kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • seccomp: Use pr_fmt · e68f9d49
      Kees Cook authored
      Avoid open-coding "seccomp: " prefixes for pr_*() calls.
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • selftests/seccomp: Improve calibration loop · 81a0c8bc
      Kees Cook authored
      The seccomp benchmark calibration loop did not need to take so long.
      Instead, use a simple 1 second timeout and multiply up to target. It
      does not need to be accurate.
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • selftests/seccomp: use 90s as timeout · bc32c9c8
      Thadeu Lima de Souza Cascardo authored
      As seccomp_benchmark tries to calibrate how many samples will take more
      than 5 seconds to execute, it may end up picking up a number of samples
      that take 10 (but up to 12) seconds. As the calibration will take double
      that time, it takes around 20 seconds. Then, it executes the whole thing
      again, and then once more, with some added overhead. So, the thing might
      take more than 40 seconds, which is too close to the 45s timeout.
      
      That is very dependent on the system where it's executed, so may not be
      observed always, but it has been observed on x86 VMs. Using a 90s timeout
      seems safe enough.
      Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
      Link: https://lore.kernel.org/r/20200601123202.1183526-1-cascardo@canonical.com
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • selftests/seccomp: Expand benchmark to per-filter measurements · d3a37ea9
      Kees Cook authored
      It's useful to see how much (at a minimum) each filter adds to the
      syscall overhead. Add additional calculations.
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • selftests/seccomp: Check for EPOLLHUP for user_notif · ad568218
      Christian Brauner authored
      This verifies we're correctly notified when a seccomp filter becomes
      unused when a notifier is in use.
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Link: https://lore.kernel.org/r/20200531115031.391515-4-christian.brauner@ubuntu.com
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • seccomp: notify about unused filter · 99cdb8b9
      Christian Brauner authored
      We've been making heavy use of the seccomp notifier to intercept and
      handle certain syscalls for containers. This patch allows a syscall
      supervisor listening on a given notifier to be notified when a seccomp
      filter has become unused.
      
      A container is often managed by a singleton supervisor process, the
      so-called "monitor". This monitor process has an event loop which has
      various event handlers registered. If the user specified a seccomp
      profile that included a notifier for various syscalls then we also
      register a seccomp notify event handler. For any container using a
      separate pid namespace the lifecycle of the seccomp notifier is bound to
      the init process of the pid namespace, i.e. when the init process exits
      the filter must be unused.
      
      If a new process attaches to a container we force it to assume a seccomp
      profile. This can either be the same seccomp profile as the container
      was started with or a modified one. If the attaching process makes use
      of the seccomp notifier we will register a new seccomp notifier handler
      in the monitor's event loop. However, when the attaching process exits
      we can't simply delete the handler since other child processes could've
      been created (daemons spawned etc.) that have inherited the seccomp
      filter and so we need to keep the seccomp notifier fd alive in the event
      loop. But this is problematic since we don't get a notification when the
      seccomp filter has become unused and so we currently never remove the
      seccomp notifier fd from the event loop and just keep accumulating fds
      in the event loop. We've had this issue for a while but it has recently
      become more pressing as more and larger users make use of this.
      
      To fix this, we introduce a new "users" reference counter that tracks any
      tasks and dependent filters making use of a filter. When a notifier is
      registered, waiting tasks will be notified that the filter is now empty
      by receiving an (E)POLLHUP event.
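
      On the supervisor side, reacting to the hang-up might look like
      the sketch below. The helper name is hypothetical, and any
      pollable fd behaves the same way for the purpose of the example
      (a pipe whose write end has been closed also reports POLLHUP), so
      the code is not tied to a real notifier fd.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/*
 * Check fd without blocking. Returns 1 and closes fd if it reports a
 * hang-up, 0 if it is still live, and -1 if poll() itself fails.
 */
static int demo_reap_if_hung_up(int fd)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };

	assert(fd >= 0);
	if (poll(&pfd, 1, 0) < 0)
		return -1;
	if (pfd.revents & POLLHUP) {
		close(fd); /* nothing can arrive on this fd any more */
		return 1;
	}
	return 0;
}
```

      An event loop would call something like this from its handler and
      drop the fd from its watch set when 1 is returned, instead of
      accumulating dead notifier fds.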
      
      The concept this patch introduces is the same as for signal_struct,
      i.e. reference counting for life-cycle management is decoupled from
      reference counting tasks using the object. There's probably some trickery
      possible but the second counter is just the correct way of doing this
      IMHO and has precedent.
      
      Cc: Tycho Andersen <tycho@tycho.ws>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matt Denton <mpdenton@google.com>
      Cc: Sargun Dhillon <sargun@sargun.me>
      Cc: Jann Horn <jannh@google.com>
      Cc: Chris Palmer <palmer@google.com>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Robert Sesek <rsesek@google.com>
      Cc: Jeffrey Vander Stoep <jeffv@google.com>
      Cc: Linux Containers <containers@lists.linux-foundation.org>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Link: https://lore.kernel.org/r/20200531115031.391515-3-christian.brauner@ubuntu.com
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • seccomp: Lift wait_queue into struct seccomp_filter · 76194c4e
      Christian Brauner authored
      Lift the wait_queue from struct notification into struct seccomp_filter.
      This is cleaner overall and lets us avoid having to take the notifier
      mutex in the future for EPOLLHUP notifications since we need to neither
      read nor modify the notifier specific aspects of the seccomp filter. In
      the exit path I'd very much like to avoid having to take the notifier mutex
      for each filter in the task's filter hierarchy.
      
      Cc: Tycho Andersen <tycho@tycho.ws>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matt Denton <mpdenton@google.com>
      Cc: Sargun Dhillon <sargun@sargun.me>
      Cc: Jann Horn <jannh@google.com>
      Cc: Chris Palmer <palmer@google.com>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Robert Sesek <rsesek@google.com>
      Cc: Jeffrey Vander Stoep <jeffv@google.com>
      Cc: Linux Containers <containers@lists.linux-foundation.org>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • seccomp: release filter after task is fully dead · 3a15fb6e
      Christian Brauner authored
      The seccomp filter used to be released in free_task() which is called
      asynchronously via call_rcu() and assorted mechanisms. Since we need
      to inform tasks waiting on the seccomp notifier when a filter goes empty
      we will notify them as soon as a task has been marked fully dead in
      release_task(). To not split seccomp cleanup into two parts, move
      filter release out of free_task() and into release_task() after we've
      unhashed struct task from struct pid, exited signals, and unlinked it
      from the threadgroups' thread list. We'll put the empty filter
      notification infrastructure into it in a follow up patch.
      
      This also renames put_seccomp_filter() to seccomp_filter_release() which
      is a more descriptive name of what we're doing here especially once
      we've added the empty filter notification mechanism in there.
      
      We're also NULL-ing the task's filter tree entrypoint which seems
      cleaner than leaving a dangling pointer in there. Note that this shouldn't
      need any memory barriers since we're calling this when the task is in
      release_task() which means it's EXIT_DEAD. So it can't modify its seccomp
      filters anymore. You can also see this from the point where we're calling
      seccomp_filter_release(). It's after __exit_signal() and at this point,
      tsk->sighand will already have been NULLed which is required for
      thread-sync and filter installation alike.
      
      Cc: Tycho Andersen <tycho@tycho.ws>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matt Denton <mpdenton@google.com>
      Cc: Sargun Dhillon <sargun@sargun.me>
      Cc: Jann Horn <jannh@google.com>
      Cc: Chris Palmer <palmer@google.com>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Robert Sesek <rsesek@google.com>
      Cc: Jeffrey Vander Stoep <jeffv@google.com>
      Cc: Linux Containers <containers@lists.linux-foundation.org>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Link: https://lore.kernel.org/r/20200531115031.391515-2-christian.brauner@ubuntu.com
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • seccomp: rename "usage" to "refs" and document · b707ddee
      Christian Brauner authored
      Naming the lifetime counter of a seccomp filter "usage" suggests a
      little too strongly that it's about tasks that are using this filter
      while it also tracks other references such as the user notifier or
      ptrace. This also updates the documentation to note this fact.
      
      We'll be introducing an actual usage counter in a follow-up patch.
      
      Cc: Tycho Andersen <tycho@tycho.ws>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matt Denton <mpdenton@google.com>
      Cc: Sargun Dhillon <sargun@sargun.me>
      Cc: Jann Horn <jannh@google.com>
      Cc: Chris Palmer <palmer@google.com>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Robert Sesek <rsesek@google.com>
      Cc: Jeffrey Vander Stoep <jeffv@google.com>
      Cc: Linux Containers <containers@lists.linux-foundation.org>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Link: https://lore.kernel.org/r/20200531115031.391515-1-christian.brauner@ubuntu.com
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • seccomp: Add find_notification helper · 9f87dcf1
      Sargun Dhillon authored
      This adds a helper which can iterate through a seccomp_filter to
      find a notification matching an ID. It removes several replicated
      chunks of code.
      Signed-off-by: Sargun Dhillon <sargun@sargun.me>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: Tycho Andersen <tycho@tycho.ws>
      Cc: Matt Denton <mpdenton@google.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Robert Sesek <rsesek@google.com>
      Cc: Chris Palmer <palmer@google.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Tycho Andersen <tycho@tycho.ws>
      Link: https://lore.kernel.org/r/20200601112532.150158-1-sargun@sargun.me
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • seccomp: Report number of loaded filters in /proc/$pid/status · c818c03b
      Kees Cook authored
      A common question asked when debugging seccomp filters is "how many
      filters are attached to your process?" Provide a way to easily answer
      this question through /proc/$pid/status with a "Seccomp_filters" line.
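
      Reading the new line from userspace could be done along these
      lines. The helper is hypothetical and works on the text of a
      status file that the caller has already read, so it also copes
      with kernels where the line is absent.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return the Seccomp_filters value found in status text, or -1 if absent. */
static long demo_seccomp_filters(const char *status)
{
	static const char key[] = "Seccomp_filters:";
	const char *line = status;
	long n;

	assert(status != NULL);
	while (line && *line) {
		/* Match the key at the start of a line, then parse the number. */
		if (strncmp(line, key, sizeof(key) - 1) == 0 &&
		    sscanf(line + sizeof(key) - 1, "%ld", &n) == 1)
			return n;
		line = strchr(line, '\n');
		if (line)
			line++; /* step past the newline to the next line */
	}
	return -1;
}
```

      A caller would read /proc/$pid/status into a buffer and pass it
      in; a return of -1 means the running kernel does not report the
      count.
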
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • selftests/seccomp: Set NNP for TSYNC ESRCH flag test · e4d05028
      Kees Cook authored
      The TSYNC ESRCH flag test will fail for regular users because NNP was
      not set yet. Add NNP setting.
      
      Fixes: 51891498 ("seccomp: allow TSYNC and USER_NOTIF together")
      Cc: stable@vger.kernel.org
      Reviewed-by: Tycho Andersen <tycho@tycho.ws>
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • selftests/seccomp: Add SKIPs for failed unshare() · d7d2e5bb
      Kees Cook authored
      Running the seccomp tests as a regular user shouldn't just fail tests
      that require CAP_SYS_ADMIN (for getting a PID namespace). Instead,
      detect those cases and SKIP them. Additionally, gracefully SKIP missing
      CONFIG_USER_NS (and add to "config" since we'd prefer to actually test
      this case).
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • selftests/seccomp: Rename XFAIL to SKIP · 8b1bc88c
      Kees Cook authored
      The kselftests will be renaming XFAIL to SKIP in the test harness, and
      to avoid painful conflicts, rename XFAIL to SKIP now in a future-proofed
      way.
      Signed-off-by: Kees Cook <keescook@chromium.org>
  4. 14 Jun, 2020 4 commits
    • Linux 5.8-rc1 · b3a9e3b9
      Linus Torvalds authored
    • Merge tag 'LSM-add-setgid-hook-5.8-author-fix' of git://github.com/micah-morton/linux · 4a87b197
      Linus Torvalds authored
      Pull SafeSetID update from Micah Morton:
       "Add additional LSM hooks for SafeSetID
      
        SafeSetID is capable of making allow/deny decisions for set*uid calls
        on a system, and we want to add similar functionality for set*gid
        calls.
      
        The work to do that is not yet complete, so probably won't make it in
        for v5.8, but we are looking to get this simple patch in for v5.8
        since we have it ready.
      
        We are planning on the rest of the work for extending the SafeSetID
        LSM being merged during the v5.9 merge window"
      
      * tag 'LSM-add-setgid-hook-5.8-author-fix' of git://github.com/micah-morton/linux:
        security: Add LSM hooks to set*gid syscalls
    • security: Add LSM hooks to set*gid syscalls · 39030e13
      Thomas Cedeno authored
      The SafeSetID LSM uses the security_task_fix_setuid hook to filter
      set*uid() syscalls according to its configured security policy. In
      preparation for adding analogous support in the LSM for set*gid()
      syscalls, we add the requisite hook here. Tested by putting print
      statements in the security_task_fix_setgid hook and seeing them get hit
      during kernel boot.
      Signed-off-by: Thomas Cedeno <thomascedeno@google.com>
      Signed-off-by: Micah Morton <mortonm@chromium.org>
      39030e13
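To make the idea of an allow/deny decision for set*gid() concrete, here is a simplified userspace sketch of a transition check. It is not the SafeSetID implementation and does not use the kernel's LSM hook signatures. The rule table, the `demo_` names and the "deny anything not listed" behaviour are all assumptions made for illustration.

```c
#include <stddef.h>
#include <errno.h>

/* Hypothetical policy entry: a permitted transition from one GID to another. */
struct demo_gid_rule {
	unsigned int from;
	unsigned int to;
};

/* Hypothetical policy table; a real policy would be configured at runtime. */
static const struct demo_gid_rule demo_rules[] = {
	{ 1000, 2000 },
	{ 1000, 3000 },
};

/* Return 0 if the transition is allowed, -EPERM otherwise.
 * Assumption: keeping the same GID is always allowed, and any
 * transition missing from the table is denied. */
static int demo_check_setgid(unsigned int old_gid, unsigned int new_gid)
{
	size_t i;

	if (old_gid == new_gid)
		return 0;

	for (i = 0; i < sizeof(demo_rules) / sizeof(demo_rules[0]); i++) {
		if (demo_rules[i].from == old_gid && demo_rules[i].to == new_gid)
			return 0;
	}

	return -EPERM;
}
```

A hook of this shape is called before the credential change takes effect, so a non-zero return can veto the syscall.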
    • Linus Torvalds's avatar
      Merge tag 'for-5.8-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 9d645db8
      Linus Torvalds authored
      Pull btrfs updates from David Sterba:
       "This reverts the direct io port to iomap infrastructure of btrfs
        merged in the first pull request. We found problems in invalidate page
        that don't seem to be fixable as regressions or without changing iomap
        code that would not affect other filesystems.
      
        There are four reverts in total, but three of them are followup
        cleanups needed to revert a43a67a2 cleanly. The result is the
        buffer head based implementation of direct io.
      
        Reverts are not great, but under current circumstances I don't see
        better options"
      
      * tag 'for-5.8-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        Revert "btrfs: switch to iomap_dio_rw() for dio"
        Revert "fs: remove dio_end_io()"
        Revert "btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK"
        Revert "btrfs: split btrfs_direct_IO to read and write part"
      9d645db8
  5. 13 Jun, 2020 6 commits
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 96144c58
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Fix cfg80211 deadlock, from Johannes Berg.
      
       2) RXRPC fails to send notifications, from David Howells.
      
       3) MPTCP RM_ADDR parsing has an off by one pointer error, fix from
          Geliang Tang.
      
       4) Fix crash when using MSG_PEEK with sockmap, from Anny Hu.
      
       5) The ucc_geth driver needs __netdev_watchdog_up exported, from
          Valentin Longchamp.
      
       6) Fix hashtable memory leak in dccp, from Wang Hai.
      
       7) Fix how nexthops are marked as FDB nexthops, from David Ahern.
      
       8) Fix mptcp races between shutdown and recvmsg, from Paolo Abeni.
      
       9) Fix crashes in tipc_disc_rcv(), from Tuong Lien.
      
      10) Fix link speed reporting in iavf driver, from Brett Creeley.
      
      11) When a channel is used for XSK and then reused again later for XSK,
          we forget to clear out the relevant data structures in mlx5 which
          causes all kinds of problems. Fix from Maxim Mikityanskiy.
      
      12) Fix memory leak in genetlink, from Cong Wang.
      
      13) Disallow sockmap attachments to UDP sockets, it simply won't work.
          From Lorenz Bauer.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (83 commits)
        net: ethernet: ti: ale: fix allmulti for nu type ale
        net: ethernet: ti: am65-cpsw-nuss: fix ale parameters init
        net: atm: Remove the error message according to the atomic context
        bpf: Undo internal BPF_PROBE_MEM in BPF insns dump
        libbpf: Support pre-initializing .bss global variables
        tools/bpftool: Fix skeleton codegen
        bpf: Fix memlock accounting for sock_hash
        bpf: sockmap: Don't attach programs to UDP sockets
        bpf: tcp: Recv() should return 0 when the peer socket is closed
        ibmvnic: Flush existing work items before device removal
        genetlink: clean up family attributes allocations
        net: ipa: header pad field only valid for AP->modem endpoint
        net: ipa: program upper nibbles of sequencer type
        net: ipa: fix modem LAN RX endpoint id
        net: ipa: program metadata mask differently
        ionic: add pcie_print_link_status
        rxrpc: Fix race between incoming ACK parser and retransmitter
        net/mlx5: E-Switch, Fix some error pointer dereferences
        net/mlx5: Don't fail driver on failure to create debugfs
        net/mlx5e: CT: Fix ipv6 nat header rewrite actions
        ...
      96144c58
    • David Sterba's avatar
      Revert "btrfs: switch to iomap_dio_rw() for dio" · 55e20bd1
      David Sterba authored
      This reverts commit a43a67a2.
      
      This patch reverts the main part of switching direct io implementation
      to iomap infrastructure. There's a problem in invalidate page that
      couldn't be solved as regression in this development cycle.
      
      The problem occurs when buffered and direct io are mixed, and the ranges
      overlap. Although this is not recommended, filesystems implement
      measures or fallbacks to make it somehow work. In this case, fallback to
      buffered IO would be an option for btrfs (this already happens when
      direct io is done on compressed data), but the change would be needed in
      the iomap code, bringing new semantics to other filesystems.
      
      Another problem arises when again the buffered and direct ios are mixed,
      invalidation fails, then -EIO is set on the mapping and fsync will fail,
      though there's no real error.
      
      There have been discussions how to fix that, but revert seems to be the
      least intrusive option.
      
      Link: https://lore.kernel.org/linux-btrfs/20200528192103.xm45qoxqmkw7i5yl@fiona/
      Signed-off-by: David Sterba <dsterba@suse.com>
      55e20bd1
    • Grygorii Strashko's avatar
      net: ethernet: ti: ale: fix allmulti for nu type ale · bc139119
      Grygorii Strashko authored
      On AM65xx MCU CPSW2G NUSS and 66AK2E/L NUSS allmulti setting does not allow
      unregistered mcast packets to pass.
      
      This happens because ALE VLAN entries on these SoCs do not contain port
      masks for reg/unreg mcast packets, but instead store indexes of
      ALE_VLAN_MASK_MUXx_REG registers, which are intended to store port masks
      for reg/unreg mcast packets.
      This path was missed by commit 9d1f6447 ("net: ethernet: ti: ale: fix
      seeing unreg mcast packets with promisc and allmulti disabled").
      
      Hence, fix it by taking into account ALE type in cpsw_ale_set_allmulti().
      
      Fixes: 9d1f6447 ("net: ethernet: ti: ale: fix seeing unreg mcast packets with promisc and allmulti disabled")
      Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bc139119
    • Grygorii Strashko's avatar
      net: ethernet: ti: am65-cpsw-nuss: fix ale parameters init · 2074f9ea
      Grygorii Strashko authored
      The ALE parameters structure is created on stack, so it has to be reset
      before passing to cpsw_ale_create() to avoid garbage values.
      
      Fixes: 93a76530 ("net: ethernet: ti: introduce am65x/j721e gigabit eth subsystem driver")
      Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2074f9ea
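The class of bug described above, an on-stack parameter structure handed to a constructor with only some fields set, can be sketched as follows. The struct layout and the `demo_` helpers are hypothetical and are not the real `cpsw_ale_params` or `cpsw_ale_create()`. The point is only the zero-initialization of the local variable.

```c
#include <stddef.h>

/* Hypothetical stand-in for a driver parameter block. */
struct demo_ale_params {
	int ale_ports;
	int ale_entries;
	unsigned long ale_ageout;
	const char *dev_id;
};

/* Hypothetical consumer: treats every non-zero field as an explicit setting,
 * so stale stack contents would be misread as configuration. */
static int demo_count_set_fields(const struct demo_ale_params *p)
{
	return (p->ale_ports != 0) + (p->ale_entries != 0) +
	       (p->ale_ageout != 0) + (p->dev_id != NULL);
}

static int demo_init(void)
{
	/* Zero-initialize the whole structure; without "= { 0 }" the fields
	 * not assigned below would hold whatever was left on the stack. */
	struct demo_ale_params params = { 0 };

	params.ale_ports = 2;		/* only the fields we care about */
	params.dev_id = "demo";

	return demo_count_set_fields(&params);
}
```

With the initializer in place, the consumer sees exactly the two fields that were set. Kernel code often uses `memset()` or an empty-brace initializer for the same purpose.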
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · fa7566a0
      David S. Miller authored
      Alexei Starovoitov says:
      
      ====================
      pull-request: bpf 2020-06-12
      
      The following pull-request contains BPF updates for your *net* tree.
      
      We've added 26 non-merge commits during the last 10 day(s) which contain
      a total of 27 files changed, 348 insertions(+), 93 deletions(-).
      
      The main changes are:
      
      1) sock_hash accounting fix, from Andrey.
      
      2) libbpf fix and probe_mem sanitizing, from Andrii.
      
      3) sock_hash fixes, from Jakub.
      
      4) devmap_val fix, from Jesper.
      
      5) load_bytes_relative fix, from YiFei.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fa7566a0
    • Liao Pingfang's avatar
      net: atm: Remove the error message according to the atomic context · bf97bac9
      Liao Pingfang authored
      Given the context (atomic!), the error message should be dropped.
      Signed-off-by: Liao Pingfang <liao.pingfang@zte.com.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bf97bac9