• Christian Brauner's avatar
    binfmt_misc: cleanup on filesystem umount · 1c5976ef
    Christian Brauner authored
    Currently, registering a new binary type pins the binfmt_misc
    filesystem. Specifically, this means that as long as there is at least
    one binary type registered the binfmt_misc filesystem survives all
    umounts, i.e. the superblock is not destroyed. Meaning that a umount
    followed by another mount will end up with the same superblock and the
    same binary type handlers. This is a behavior we tend to discourage for
    any new filesystems (apart from a few special filesystems such as e.g.
    configfs or debugfs). A umount operation without the filesystem being
    pinned - by e.g. someone holding a file descriptor to an open file -
    should usually result in the destruction of the superblock and all
    associated resources. This makes introspection easier and leads to
    clearly defined, simple and clean semantics. An administrator can rely
    on the fact that a umount will guarantee a clean slate making it
    possible to reinitialize a filesystem. Right now all binary types would
    need to be explicitly deleted before that can happen.
    
    This allows us to remove the heavy-handed calls to simple_pin_fs() and
    simple_release_fs() when creating and deleting binary types. This in
    turn allows us to replace the current brittle pinning mechanism abusing
    dget() which has caused a range of bugs judging from prior fixes in [2]
    and [3]. The additional dget() in load_misc_binary() pins the dentry but
    only does so for the sake to prevent ->evict_inode() from freeing the
    node when a user removes the binary type and kill_node() is run. Which
    would mean ->interpreter and ->interp_file would be freed causing a UAF.
    
    This isn't really nicely documented nor is it very clean because it
    relies on simple_pin_fs() pinning the filesystem as long as at least one
    binary type exists. Otherwise it would cause load_misc_binary() to hold
    on to a dentry belonging to a superblock that has been shutdown.
    Replace that implicit pinning with a clean and simple per-node refcount
    and get rid of the ugly dget() pinning. A similar mechanism exists for
    e.g. binderfs (cf. [4]). All the cleanup work can now be done in
    ->evict_inode().
    
    In a follow-up patch we will make it possible to use binfmt_misc in
    sandboxes. We will use the cleaner semantics where a umount for the
    filesystem will cause the superblock and all resources to be
    deallocated. In preparation for this apply the same semantics to the
    initial binfmt_misc mount. Note, that this is a user-visible change and
    as such a uapi change but one that we can reasonably risk. We've
    discussed this in earlier versions of this patchset (cf. [1]).
    
    The main user and provider of binfmt_misc is systemd. Systemd provides
    binfmt_misc via autofs since it is configurable as a kernel module and
    is used by a few exotic packages and users. As such a binfmt_misc mount
    is triggered when /proc/sys/fs/binfmt_misc is accessed and is only
    provided on demand. Other autofs on demand filesystems include EFI ESP
    which systemd umounts if the mountpoint stays idle for a certain amount
    of time. This doesn't apply to the binfmt_misc autofs mount which isn't
    touched once it is mounted meaning this change can't accidently wipe
    binary type handlers without someone having explicitly unmounted
    binfmt_misc. After speaking to systemd folks they don't expect this
    change to affect them.
    
    In line with our general policy, if we see a regression for systemd or
    other users with this change we will switch back to the old behavior for
    the initial binfmt_misc mount and have binary types pin the filesystem
    again. But while we touch this code let's take the chance and let's
    improve on the status quo.
    
    [1]: https://lore.kernel.org/r/20191216091220.465626-2-laurent@vivier.eu
    [2]: commit 43a4f261 ("exec: binfmt_misc: fix race between load_misc_binary() and kill_node()"
    [3]: commit 83f91827 ("exec: binfmt_misc: shift filp_close(interp_file) from kill_node() to bm_evict_inode()")
    [4]: commit f0fe2c0f ("binder: prevent UAF for binderfs devices II")
    
    Link: https://lore.kernel.org/r/20211028103114.2849140-1-brauner@kernel.org (v1)
    Cc: Sargun Dhillon <sargun@sargun.me>
    Cc: Serge Hallyn <serge@hallyn.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Henning Schild <henning.schild@siemens.com>
    Cc: Andrei Vagin <avagin@gmail.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Laurent Vivier <laurent@vivier.eu>
    Cc: linux-fsdevel@vger.kernel.org
    Acked-by: default avatarSerge Hallyn <serge@hallyn.com>
    Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
    Signed-off-by: default avatarChristian Brauner <brauner@kernel.org>
    Signed-off-by: default avatarKees Cook <keescook@chromium.org>
    ---
    /* v2 */
    - Christian Brauner <christian.brauner@ubuntu.com>:
      - Add more comments that explain what's going on.
      - Rename functions while changing them to better reflect what they are
        doing to make the code easier to understand.
      - In the first version when a specific binary type handler was removed
        either through a write to the entry's file or all binary type
        handlers were removed by a write to the binfmt_misc mount's status
        file all cleanup work happened during inode eviction.
        That includes removal of the relevant entries from entry list. While
        that works fine I disliked that model after thinking about it for a
        bit. Because it means that there was a window were someone has
        already removed a or all binary handlers but they could still be
        safely reached from load_misc_binary() when it has managed to take
        the read_lock() on the entries list while inode eviction was already
        happening. Again, that perfectly benign but it's cleaner to remove
        the binary handler from the list immediately meaning that ones the
        write to then entry's file or the binfmt_misc status file returns
        the binary type cannot be executed anymore. That gives stronger
        guarantees to the user.
    1c5976ef
binfmt_misc.c 21.6 KB