    Merge tag 'threads-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux · 83fa805b
    Linus Torvalds authored
    Pull thread management updates from Christian Brauner:
     "Sargun Dhillon over the last cycle has worked on the pidfd_getfd()
      syscall.
    
      This syscall allows for the retrieval of file descriptors of a process
      based on its pidfd. A task needs to have ptrace_may_access()
      permissions with PTRACE_MODE_ATTACH_REALCREDS (suggested by Oleg and
      Andy) on the target.
    
      One of the main use-cases is in combination with seccomp's user
      notification feature. As a reminder, seccomp's user notification
      feature was made available in v5.0. It allows a task to retrieve a
      file descriptor for its seccomp filter. The file descriptor is usually
      handed off to a more privileged supervising process. The supervisor can
      then listen for syscall events caught by the seccomp filter of the
      supervisee and perform actions in lieu of the supervisee, usually
      emulating syscalls. pidfd_getfd() is needed to expand its uses.
    
      There are currently two major users that wait on pidfd_getfd() and one
      future user:
    
       - Netflix, Sargun said, is working on a service mesh where users
         should be able to connect to a dns-based VIP. When a user connects
         to e.g. 1.2.3.4:80 that runs e.g. service "foo" they will be
         redirected to an envoy process. This service mesh uses seccomp user
         notifications and pidfd to intercept all connect calls and instead
         of connecting them to 1.2.3.4:80 connects them to e.g.
         127.0.0.1:8080.
    
       - LXD uses the seccomp notifier heavily to intercept and emulate
         mknod() and mount() syscalls for unprivileged containers/processes.
         With pidfd_getfd() more use-cases, e.g. bridging socket connections,
         will be possible.
    
       - The patchset has also seen some interest from the browser corner.
         Right now, Firefox is using a SECCOMP_RET_TRAP sandbox managed by a
         broker process. In the future glibc will start blocking all signals
         during dlopen() rendering this type of sandbox impossible. Hence,
         in the future Firefox will switch to a seccomp-user-notification
         based sandbox which also makes use of file descriptor retrieval.
         The thread for this can be found at
         https://sourceware.org/ml/libc-alpha/2019-12/msg00079.html
    
      With pidfd_getfd() it is possible, for example, to bridge socket
      connections for the supervisee (binding to a privileged port) and to
      take actions on file descriptors on behalf of the supervisee in general.
    
      Sargun's first version was using an ioctl on pidfds but various people
      pushed for it to be a proper syscall, which he duly implemented as
      well over various review cycles. Selftests are of course included.
      I've also added instructions on how to deal with merge conflicts below.
    
      There's also a small fix coming from the kernel mentee project to
      correctly annotate struct sighand_struct with __rcu to fix various
      sparse warnings. We've received a few more such fixes and even though
      they are mostly trivial I've decided to postpone them until after -rc1
      since they came in rather late and I don't want to risk introducing
      build warnings.
    
      Finally, there's a new prctl() command PR_{G,S}ET_IO_FLUSHER which is
      needed to avoid allocation recursions triggerable by storage drivers
      that have userspace parts that run in the IO path (e.g. dm-multipath,
      iscsi, etc). These allocation recursions deadlock the device.
    
      The new prctl() allows such privileged userspace components to avoid
      allocation recursions by setting the PF_MEMALLOC_NOIO and
      PF_LESS_THROTTLE flags. The patch carries the necessary acks from the
      relevant maintainers and is routed here as part of prctl()
      thread-management."
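
The prctl() pair described above could be exercised from userspace roughly as follows. This is a minimal sketch, not code from the patch: the helper names set_io_flusher() and get_io_flusher() are hypothetical, and the fallback constants (57 and 58) are assumed only for libc headers that predate this series. The calls are expected to fail for callers without sufficient privilege (CAP_SYS_RESOURCE, per prctl(2)) and on kernels without the feature.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Assumed fallback values for headers older than this series. */
#ifndef PR_SET_IO_FLUSHER
#define PR_SET_IO_FLUSHER 57
#endif
#ifndef PR_GET_IO_FLUSHER
#define PR_GET_IO_FLUSHER 58
#endif

/*
 * Hypothetical helper: mark (enable != 0) or unmark (enable == 0) the
 * calling thread as an IO flusher.
 * Returns 0 on success, -1 with errno set on failure.
 */
int set_io_flusher(int enable)
{
    return prctl(PR_SET_IO_FLUSHER, enable ? 1UL : 0UL, 0UL, 0UL, 0UL);
}

/*
 * Hypothetical helper: query the IO flusher state of the calling thread.
 * Returns 1 if set, 0 if not set, -1 with errno set on failure.
 */
int get_io_flusher(void)
{
    return prctl(PR_GET_IO_FLUSHER, 0UL, 0UL, 0UL, 0UL);
}
```

A storage daemon sitting in the IO path would presumably call set_io_flusher(1) once at startup, before it begins servicing requests, and treat EPERM or EINVAL as "not available" rather than as a fatal error.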
    
    * tag 'threads-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
      prctl: PR_{G,S}ET_IO_FLUSHER to support controlling memory reclaim
      sched.h: Annotate sighand_struct with __rcu
      test: Add test for pidfd getfd
      arch: wire up pidfd_getfd syscall
      pid: Implement pidfd_getfd syscall
      vfs, fdtable: Add fget_task helper
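
As a rough illustration of the pidfd_getfd() syscall described in the pull message, the sketch below duplicates a file descriptor out of another process. It is written under stated assumptions: grab_fd_from_pid() is a hypothetical helper name, the raw syscall(2) interface is used because libc wrappers may not exist, and the fallback syscall numbers (434 and 438) are assumed to match x86-64 and the asm-generic table; other architectures may differ. The caller needs ptrace_may_access() permission on the target, as noted above.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed fallback numbers for headers older than v5.6. */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_pidfd_getfd
#define SYS_pidfd_getfd 438
#endif

/*
 * Hypothetical helper: obtain a duplicate of file descriptor `targetfd`
 * belonging to process `pid`.
 * Returns the new descriptor in the caller, or -1 with errno set.
 */
int grab_fd_from_pid(pid_t pid, int targetfd)
{
    /* Get a pidfd referring to the target process. */
    int pidfd = (int)syscall(SYS_pidfd_open, pid, 0U);
    if (pidfd < 0)
        return -1;

    /* Ask the kernel to duplicate the target's descriptor; flags must be 0. */
    int fd = (int)syscall(SYS_pidfd_getfd, pidfd, targetfd, 0U);

    /* Preserve errno from pidfd_getfd() across close(). */
    int saved_errno = errno;
    close(pidfd);
    errno = saved_errno;

    return fd;
}
```

In the seccomp notifier scenario, a supervisor would presumably call something like grab_fd_from_pid(supervisee_pid, sockfd) after receiving a notification for connect(), then operate on the returned descriptor on behalf of the supervisee.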