    memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy · 9876cfe8
    Aleksa Sarai authored
    This sysctl has the very unusual behaviour of not allowing any user (even
    CAP_SYS_ADMIN) to reduce the restriction setting, meaning that if you were
    to set this sysctl to a more restrictive option in the host pidns you
    would need to reboot your machine in order to reset it.
    
    The justification given in [1] is that this is a security feature and thus
    it should not be possible to disable.  Aside from the fact that we have
    plenty of security-related sysctls that can be disabled after being
    enabled (fs.protected_symlinks for instance), the protection provided by
    the sysctl is to stop users from being able to create a binary and then
    execute it.  A user with CAP_SYS_ADMIN can trivially do this without
    memfd_create(2):
    
      % cat mount-memfd.c
      #define _GNU_SOURCE		/* asprintf(3) */
      #include <assert.h>
      #include <fcntl.h>
      #include <string.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>
      #include <sys/mount.h>	/* fsopen(2) and friends, glibc >= 2.36 */
    
      #define SHELLCODE "#!/bin/echo this file was executed from this totally private tmpfs:"
    
      int main(void)
      {
      	int fsfd = fsopen("tmpfs", FSOPEN_CLOEXEC);
      	assert(fsfd >= 0);
      	assert(!fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));
    
      	int dfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
      	assert(dfd >= 0);
    
      	int execfd = openat(dfd, "exe", O_CREAT | O_RDWR | O_CLOEXEC, 0755);
      	assert(execfd >= 0);
      	assert(write(execfd, SHELLCODE, strlen(SHELLCODE)) == strlen(SHELLCODE));
      	assert(!close(execfd));
    
      	char *execpath = NULL;
      	char *argv[] = { "bad-exe", NULL }, *envp[] = { NULL };
      	execfd = openat(dfd, "exe", O_PATH | O_CLOEXEC);
      	assert(execfd >= 0);
      	assert(asprintf(&execpath, "/proc/self/fd/%d", execfd) > 0);
      	assert(!execve(execpath, argv, envp));
      }
      % ./mount-memfd
      this file was executed from this totally private tmpfs: /proc/self/fd/5
      %
    
    Given that it is possible for CAP_SYS_ADMIN users to create executable
    binaries without memfd_create(2) and without touching the host filesystem
    (not to mention the many other things a CAP_SYS_ADMIN process would be
    able to do that would be equivalent or worse), it seems strange to cause a
    fair amount of headache to admins when there doesn't appear to be an
    actual security benefit to blocking this.  There appear to be concerns
    about confused-deputy-esque attacks[2] but a confused deputy that can
    write to arbitrary sysctls is a bigger security issue than executable
    memfds.
    
    /* New API */
    
    The primary requirement from the original author appears to be the
    ability to restrict an entire system in a hierarchical manner[3], such
    that child namespaces cannot re-enable executable memfds.
    
    So, implement that behaviour explicitly -- the vm.memfd_noexec scope is
    evaluated up the pidns tree to &init_pid_ns and you have the most
    restrictive value applied to you.  The new lower limit you can set
    vm.memfd_noexec to is whatever limit applies to your parent.
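
    As a rough userspace illustration of the rule above (not the kernel
    code from this patch), the evaluation and the write check can be
    modelled as follows.  struct pidns, effective_scope(), set_scope() and
    the SCOPE_* names are hypothetical stand-ins; only the ordering of the
    values 0, 1 and 2 from least to most restrictive is taken from the
    sysctl, and the error codes are an assumption:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Illustrative names for the sysctl values, least to most restrictive. */
enum { SCOPE_EXEC = 0, SCOPE_NOEXEC_SEAL = 1, SCOPE_NOEXEC_ENFORCED = 2 };

/* Hypothetical stand-in for a pid namespace; only what the sketch needs. */
struct pidns {
	struct pidns *parent;	/* NULL for the initial namespace */
	int scope;		/* this namespace's own vm.memfd_noexec value */
};

/* Walk up to the root and return the most restrictive value seen. */
static int effective_scope(const struct pidns *ns)
{
	int scope = SCOPE_EXEC;

	for (; ns; ns = ns->parent)
		if (ns->scope > scope)
			scope = ns->scope;
	return scope;
}

/*
 * Accept a write only if it is at least as restrictive as the limit the
 * parent chain imposes.  Returns 0 on success, a negative errno otherwise
 * (the specific error codes are an assumption of this sketch).
 */
static int set_scope(struct pidns *ns, int value)
{
	assert(ns != NULL);
	if (value < SCOPE_EXEC || value > SCOPE_NOEXEC_ENFORCED)
		return -EINVAL;
	if (value < effective_scope(ns->parent))
		return -EPERM;
	ns->scope = value;
	return 0;
}
```

    In this model the initial namespace has no parent, so it can always
    lower its own value again, which is the point of dropping the ratchet.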
    
    Note that a pidns will inherit a copy of the parent pidns's effective
    vm.memfd_noexec setting at unshare() time.  This matches the existing
    behaviour, and it also ensures that a pidns will never have its
    vm.memfd_noexec setting *lowered* behind its back (but it will be raised
    if the parent raises theirs).
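
    The inheritance behaviour can be pictured with a small userspace
    model.  This is a sketch under assumptions, not the kernel
    implementation: struct pidns, effective_scope() and pidns_init_child()
    are hypothetical names, and the values are the plain sysctl integers:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a pid namespace. */
struct pidns {
	struct pidns *parent;	/* NULL for the initial namespace */
	int scope;		/* this namespace's own vm.memfd_noexec value */
};

/* Most restrictive (largest) value between ns and the root. */
static int effective_scope(const struct pidns *ns)
{
	int scope = 0;

	for (; ns; ns = ns->parent)
		if (ns->scope > scope)
			scope = ns->scope;
	return scope;
}

/*
 * Model of namespace creation: the child starts out with a copy of the
 * parent's effective value, so a later lowering in the parent does not
 * silently relax the child.
 */
static void pidns_init_child(struct pidns *child, struct pidns *parent)
{
	assert(child != NULL && parent != NULL);
	child->parent = parent;
	child->scope = effective_scope(parent);
}
```

    With a parent at 1, a new child starts at 1.  If the parent later
    drops to 0 the child still evaluates to 1, and if the parent moves to
    2 the child evaluates to 2 as well.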
    
    /* Backwards Compatibility */
    
    As the previous version of the sysctl didn't allow you to lower the
    setting at all, there are no backwards compatibility issues with this
    aspect of the change.
    
    However, it should be noted that the setting is now completely
    hierarchical.  Previously, a cloned pidns would just copy the current
    pidns setting, meaning that if the parent's vm.memfd_noexec was changed it
    wouldn't propagate to existing pid namespaces.  Now, the restriction
    applies recursively.  This is a uAPI change, however:
    
     * The sysctl is very new, having been merged in 6.3.
     * Several aspects of the sysctl were broken up until this patchset and
       the other patchset by Jeff Xu last month.
    
    And thus it seems incredibly unlikely that any real users would run into
    this issue.  In the worst case, if this causes userspace issues we could
    make it so that modifying the setting follows the hierarchical rules but
    the restriction checking uses the cached copy.
    
    [1]: https://lore.kernel.org/CABi2SkWnAgHK1i6iqSqPMYuNEhtHBkO8jUuCvmG3RmUB5TKHJw@mail.gmail.com/
    [2]: https://lore.kernel.org/CALmYWFs_dNCzw_pW1yRAo4bGCPEtykroEQaowNULp7svwMLjOg@mail.gmail.com/
    [3]: https://lore.kernel.org/CALmYWFuahdUF7cT4cm7_TGLqPanuHXJ-hVSfZt7vpTnc18DPrw@mail.gmail.com/
    
    Link: https://lkml.kernel.org/r/20230814-memfd-vm-noexec-uapi-fixes-v2-4-7ff9e3e10ba6@cyphar.com
    Fixes: 105ff533 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC")
    Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
    Cc: Dominique Martinet <asmadeus@codewreck.org>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Daniel Verkamp <dverkamp@chromium.org>
    Cc: Jeff Xu <jeffxu@google.com>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Shuah Khan <shuah@kernel.org>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
pid_sysctl.h 1.46 KB