• Christian Brauner's avatar
    Revert "vfs: Allow userns root to call mknod on owned filesystems." · 94f82008
    Christian Brauner authored
    This reverts commit 55956b59.
    
    commit 55956b59 ("vfs: Allow userns root to call mknod on owned filesystems.")
    enabled mknod() in user namespaces for userns root if CAP_MKNOD is
    available. However, these device nodes are useless since any filesystem
    mounted from a non-initial user namespace will set the SB_I_NODEV flag on
    the filesystem. Now, when a device node s created in a non-initial user
    namespace a call to open() on said device node will fail due to:
    
    bool may_open_dev(const struct path *path)
    {
            return !(path->mnt->mnt_flags & MNT_NODEV) &&
                    !(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
    }
    
    The problem with this is that as of the aforementioned commit mknod()
    creates partially functional device nodes in non-initial user namespaces.
    In particular, it has the consequence that as of the aforementioned commit
    open() will be more privileged with respect to device nodes than mknod().
    Before it was the other way around. Specifically, if mknod() succeeded
    then it was transparent for any userspace application that a fatal error
    must have occured when open() failed.
    
    All of this breaks multiple userspace workloads and a widespread assumption
    about how to handle mknod(). Basically, all container runtimes and systemd
    live by the slogan "ask for forgiveness not permission" when running user
    namespace workloads. For mknod() the assumption is that if the syscall
    succeeds the device nodes are useable irrespective of whether it succeeds
    in a non-initial user namespace or not. This logic was chosen explicitly
    to allow for the glorious day when mknod() will actually be able to create
    fully functional device nodes in user namespaces.
    A specific problem people are already running into when running 4.18 rc
    kernels are failing systemd services. For any distro that is run in a
    container systemd services started with the PrivateDevices= property set
    will fail to start since the device nodes in question cannot be
    opened (cf. the arguments in [1]).
    
    Full disclosure, Seth made the very sound argument that it is already
    possible to end up with partially functional device nodes. Any filesystem
    mounted with MS_NODEV set will allow mknod() to succeed but will not allow
    open() to succeed. The difference to the case here is that the MS_NODEV
    case is transparent to userspace since it is an explicitly set mount option
    while the SB_I_NODEV case is an implicit property enforced by the kernel
    and hence opaque to userspace.
    
    [1]: https://github.com/systemd/systemd/pull/9483Signed-off-by: default avatarChristian Brauner <christian@brauner.io>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Seth Forshee <seth.forshee@canonical.com>
    Cc: Serge Hallyn <serge@hallyn.com>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    94f82008
namei.c 123 KB