    capabilities: Introduce CAP_CHECKPOINT_RESTORE · 124ea650
    Adrian Reber authored
    This patch introduces CAP_CHECKPOINT_RESTORE, a new capability facilitating
    checkpoint/restore for non-root users.
    
    Over the last few years, the CRIU (Checkpoint/Restore In Userspace) team has
    been asked numerous times if it is possible to checkpoint/restore a
    process as non-root. The answer usually was: 'almost'.
    
    The main blocker to restore a process as non-root was to control the PID
    of the restored process. This feature, available via the clone3 system
    call or via /proc/sys/kernel/ns_last_pid, is unfortunately guarded by
    CAP_SYS_ADMIN.
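
    As an illustration, a minimal sketch of requesting a specific PID through
    clone3() might look like the following. It assumes kernel headers that
    already define set_tid in struct clone_args (Linux 5.5 or later); the
    helper names make_restore_args() and fork_with_pid() are hypothetical
    and are not part of this patch.

```c
#define _GNU_SOURCE
#include <linux/sched.h>   /* struct clone_args */
#include <signal.h>        /* SIGCHLD */
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>   /* SYS_clone3 */
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: fill clone_args so that the child receives the PID
 * stored in *wanted_pid in the caller's PID namespace. The pointer must
 * stay valid until the clone3() call has returned.
 */
struct clone_args make_restore_args(const pid_t *wanted_pid)
{
    struct clone_args args;

    memset(&args, 0, sizeof(args));
    args.exit_signal = SIGCHLD;                      /* behave like fork() */
    args.set_tid = (uint64_t)(uintptr_t)wanted_pid;  /* array of one PID */
    args.set_tid_size = 1;                           /* one namespace level */
    return args;
}

/*
 * Hypothetical helper with fork()-like semantics: returns 0 in the child,
 * the child's PID in the parent, and -1 with errno set on failure.
 */
pid_t fork_with_pid(pid_t wanted_pid)
{
    struct clone_args args = make_restore_args(&wanted_pid);

    return (pid_t)syscall(SYS_clone3, &args, sizeof(args));
}
```

    Without the required capability for the PID namespace, the call is
    expected to fail with EPERM; if the requested PID is already in use,
    it fails with EEXIST.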
    
    In the past two years, requests for non-root checkpoint/restore have
    increased due to the following use cases:
    * Checkpoint/Restore in an HPC environment in combination with a
      resource manager distributing jobs where users are always running as
      non-root. There is a desire to provide a way to checkpoint and
      restore long running jobs.
    * Container migration as non-root
    * We have been in contact with JVM developers who are integrating
      CRIU into a Java VM to decrease the startup time. These
      checkpoint/restore applications are not meant to be running with
      CAP_SYS_ADMIN.
    
    We have seen the following workarounds:
    * Use a setuid wrapper around CRIU:
      See https://github.com/FredHutch/slurm-examples/blob/master/checkpointer/lib/checkpointer/checkpointer-suid.c
    * Use a setuid helper that writes to ns_last_pid.
      Unfortunately, this helper delegation technique is impossible to use
      with clone3, and is thus prone to races.
      See https://github.com/twosigma/set_ns_last_pid
    * Cycle through PIDs with fork() until the desired PID is reached:
      This has been demonstrated to work with cycling rates of 100,000 PIDs/s.
      See https://github.com/twosigma/set_ns_last_pid
    * Patch out the CAP_SYS_ADMIN check from the kernel
    * Run the desired application in a new user and PID namespace to provide
      a local CAP_SYS_ADMIN for controlling PIDs. This technique has limited
      use in typical container environments (e.g., Kubernetes) as /proc is
      typically protected with read-only layers (e.g., /proc/sys) for
      hardening purposes. Read-only layers prevent additional /proc mounts
      (due to proc's SB_I_USERNS_VISIBLE property), making the use of new
      PID namespaces limited as certain applications need access to /proc
      matching their PID namespace.
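
    The race in the ns_last_pid helper approach listed above can be sketched
    as follows. This is a minimal illustration and not the code of any of the
    linked projects; prime_last_pid() and fork_with_last_pid() are
    hypothetical names, and the path parameter exists only so that the helper
    can be exercised against a regular file.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <sys/types.h>
#include <unistd.h>

#define NS_LAST_PID "/proc/sys/kernel/ns_last_pid"

/*
 * Hypothetical helper: write wanted_pid - 1 to the file at `path` so that
 * the next PID allocated in the namespace is expected to be wanted_pid.
 * Returns the locked file descriptor, or -1 on failure.
 */
int prime_last_pid(const char *path, pid_t wanted_pid)
{
    char buf[16];
    int len, fd;

    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    /* Advisory lock: it only keeps cooperating restorers apart. */
    if (flock(fd, LOCK_EX) < 0)
        goto fail;
    len = snprintf(buf, sizeof(buf), "%d", (int)wanted_pid - 1);
    if (write(fd, buf, (size_t)len) != len)
        goto fail;
    return fd;
fail:
    close(fd);
    return -1;
}

/*
 * Hypothetical helper: fork() right after priming ns_last_pid. In the
 * parent, the caller must check that the returned PID equals wanted_pid.
 */
pid_t fork_with_last_pid(pid_t wanted_pid)
{
    int fd = prime_last_pid(NS_LAST_PID, wanted_pid);
    pid_t pid;

    if (fd < 0)
        return -1;
    /*
     * Race window: any other fork in this PID namespace that happens
     * between the write above and this fork() takes wanted_pid.
     */
    pid = fork();
    close(fd);  /* closing in parent and child drops the lock */
    return pid;
}
```

    When the write is delegated to a separate setuid helper process, the
    window between the write and the fork() only grows, which is why
    clone3() with an explicit PID is the preferable interface.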
    
    The introduced capability makes it possible to:
    * Control PIDs when the current user is CAP_CHECKPOINT_RESTORE capable
      for the corresponding PID namespace via ns_last_pid/clone3.
    * Open files in /proc/pid/map_files when the current user is
      CAP_CHECKPOINT_RESTORE capable in the root namespace, useful for
      recovering files that are unreachable via the file system such as
      deleted files, or memfd files.
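
    For the map_files case, a minimal sketch of opening the file behind one
    mapping could look like this. The helper names map_files_path() and
    open_mapped_file() are hypothetical; the start and end addresses are
    assumed to come from the corresponding line of /proc/<pid>/maps.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * Hypothetical helper: build "/proc/<pid>/map_files/<start>-<end>".
 * Entries are named by the mapping's start and end address in lowercase
 * hexadecimal without a "0x" prefix. Returns the snprintf() result.
 */
int map_files_path(char *out, size_t size, pid_t pid,
                   unsigned long start, unsigned long end)
{
    return snprintf(out, size, "/proc/%d/map_files/%lx-%lx",
                    (int)pid, start, end);
}

/*
 * Hypothetical helper: open the file backing one mapping of `pid`, even
 * if it has been deleted or is a memfd. Returns a file descriptor, or -1
 * on failure (EPERM when the caller lacks the required capability).
 */
int open_mapped_file(pid_t pid, unsigned long start, unsigned long end)
{
    char path[64];
    int n = map_files_path(path, sizeof(path), pid, start, end);

    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;
    return open(path, O_RDONLY);
}
```

    Following these links has so far required CAP_SYS_ADMIN; with this
    patch, CAP_CHECKPOINT_RESTORE in the root namespace is intended to be
    sufficient.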
    
    See corresponding selftest for an example with clone3().

    Signed-off-by: Adrian Reber <areber@redhat.com>
    Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
    Reviewed-by: Serge Hallyn <serge@hallyn.com>
    Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
    Link: https://lore.kernel.org/r/20200719100418.2112740-2-areber@redhat.com
    Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>