1. 23 Jul, 2016 1 commit
    • fs: Call d_automount with the filesystems creds · aeaa4a79
      Eric W. Biederman authored
      Seth Forshee reported a mount regression in nfs automounts
      with "fs: Add user namespace member to struct super_block".
      
      It turns out that the assumption that current->cred is something
      reasonable during mount, while necessary to improve support of
      unprivileged mounts, is wrong in the automount path.
      
      To fix the existing filesystems, override current->cred with
      init_cred before calling d_automount and restore current->cred after
      d_automount completes.
      
      Supporting unprivileged mounts would require a more nuanced cred
      selection, so fail on unprivileged mounts for the time being.  As none
      of the filesystems that currently set FS_USERNS_MOUNT implement
      d_automount, this check is only good for preventing future problems.
      
      Fixes: 6e4eab57 ("fs: Add user namespace member to struct super_block")
      Tested-by: Seth Forshee <seth.forshee@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
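The override-and-restore pattern this commit message describes can be sketched in userspace C. This is a minimal model under stated assumptions, not the kernel code: `struct cred`, `current_cred`, `init_cred_model`, `override_creds_model()`, `revert_creds_model()`, `automount_with_fs_creds()` and the `mounted_from_init_userns` flag are hypothetical stand-ins chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct cred. */
struct cred {
    unsigned int uid;
};

/* Model of init_cred: the credentials of the initial (root) context. */
static const struct cred init_cred_model = { .uid = 0 };

/* Model of current->cred: the credentials of the calling task. */
static const struct cred *current_cred;

/* Install new credentials and return the ones that were active before. */
static const struct cred *override_creds_model(const struct cred *new_cred)
{
    const struct cred *old = current_cred;

    current_cred = new_cred;
    return old;
}

/* Put back the credentials saved by override_creds_model(). */
static void revert_creds_model(const struct cred *old)
{
    current_cred = old;
}

/* Callback type standing in for a filesystem's d_automount method. */
typedef int (*automount_fn)(void);

/*
 * Run the automount callback with the initial creds, then restore the
 * caller's creds. Returns -1 without calling the callback when the
 * filesystem was not mounted from the initial user namespace (assumed flag).
 */
static int automount_with_fs_creds(automount_fn fn, bool mounted_from_init_userns)
{
    const struct cred *old;
    int ret;

    assert(fn != NULL);
    if (!mounted_from_init_userns)
        return -1;

    old = override_creds_model(&init_cred_model);
    ret = fn();
    revert_creds_model(old);
    return ret;
}

/* Example callback: reports the uid it observes while it runs. */
static int report_uid(void)
{
    return (int)current_cred->uid;
}
```

With `current_cred` pointing at an ordinary user's creds, `automount_with_fs_creds(report_uid, true)` lets the callback observe uid 0, and the caller's creds are back in place once it returns.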
  2. 05 Jul, 2016 7 commits
    • fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns · 81754357
      Seth Forshee authored
      For filesystems mounted from a user namespace, on-disk ids should
      be translated relative to s_user_ns rather than init_user_ns.
      
      When an id in the filesystem doesn't exist in s_user_ns the
      associated id in the inode will be set to INVALID_[UG]ID, which
      turns these into de facto "nobody" ids. This actually maps pretty
      well into the way most code already works, and those places where
      it didn't were fixed in previous patches. Moving forward vfs code
      needs to be careful to handle instances where ids in inodes may
      be invalid.
      Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
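The translation described in this commit message can be illustrated with a small userspace model. It is a sketch under assumptions: the single-extent `struct id_extent` map, `make_kid()`, `from_kid()`, `i_uid_write_model()` and `i_uid_read_model()` are hypothetical names, and real user namespaces can carry several mapping extents.

```c
#include <stdint.h>

/* Models INVALID_UID / INVALID_GID, i.e. (uid_t)-1. */
#define INVALID_ID UINT32_MAX

/*
 * One mapping extent: ids [first, first + count) as stored on disk map
 * to kernel ids [lower_first, lower_first + count).
 */
struct id_extent {
    uint32_t first;
    uint32_t lower_first;
    uint32_t count;
};

/* Hypothetical superblock carrying the id map of its s_user_ns. */
struct super_block_model {
    struct id_extent s_user_ns_map;
};

struct inode_model {
    const struct super_block_model *i_sb;
    uint32_t i_uid; /* kernel-internal id, or INVALID_ID */
};

/* On-disk id -> kernel id; INVALID_ID when the id has no mapping. */
static uint32_t make_kid(const struct id_extent *m, uint32_t id)
{
    if (id != INVALID_ID && id >= m->first && id - m->first < m->count)
        return m->lower_first + (id - m->first);
    return INVALID_ID;
}

/* Kernel id -> on-disk id; INVALID_ID when the id has no mapping. */
static uint32_t from_kid(const struct id_extent *m, uint32_t kid)
{
    if (kid != INVALID_ID && kid >= m->lower_first &&
        kid - m->lower_first < m->count)
        return m->first + (kid - m->lower_first);
    return INVALID_ID;
}

/* Store an on-disk uid in the inode, translated relative to s_user_ns. */
static void i_uid_write_model(struct inode_model *inode, uint32_t disk_uid)
{
    inode->i_uid = make_kid(&inode->i_sb->s_user_ns_map, disk_uid);
}

/* Read the inode's uid back in on-disk terms. */
static uint32_t i_uid_read_model(const struct inode_model *inode)
{
    return from_kid(&inode->i_sb->s_user_ns_map, inode->i_uid);
}
```

An on-disk id outside the extent ends up as `INVALID_ID` in the inode, which is the de facto "nobody" case the message refers to.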
    • evm: Translate user/group ids relative to s_user_ns when computing HMAC · 0b3c9761
      Seth Forshee authored
      The EVM HMAC should be calculated using the on disk user and
      group ids, so the k[ug]ids in the inode must be translated
      relative to the s_user_ns of the inode's super block.
      Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • dquot: For now explicitly don't support filesystems outside of init_user_ns · 5c004828
      Eric W. Biederman authored
      Mostly supporting filesystems outside of init_user_ns is
      s/&init_user_ns/dquot->dq_sb->s_user_ns/.  An actual need for
      supporting quotas on filesystems outside of init_user_ns is quite a
      ways away, and to be done responsibly it needs an audit of what can
      happen with hostile quota files.  Until that audit is complete, don't
      attempt to support quota files on filesystems outside of init_user_ns.
      
      Cc: Jan Kara <jack@suse.cz>
      Acked-by: Seth Forshee <seth.forshee@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • quota: Handle quota data stored in s_user_ns in quota_setxquota · cfd4c70a
      Eric W. Biederman authored
      In Q_XSETQLIMIT use sb->s_user_ns to detect when we are dealing with
      the filesystem's notion of id 0.
      
      Cc: Jan Kara <jack@suse.cz>
      Acked-by: Seth Forshee <seth.forshee@canonical.com>
      Inspired-by: Seth Forshee <seth.forshee@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • quota: Ensure qids map to the filesystem · d49d3762
      Eric W. Biederman authored
      Introduce the helper qid_has_mapping and use it to ensure that the
      quota system only considers qids that map to the filesystem's
      s_user_ns.
      
      In practice for quota supporting filesystems today this is the exact
      same check as qid_valid, as only 0xffffffff aka (qid_t)-1 does not
      map into init_user_ns.
      
      Replace the qid_valid calls with qid_has_mapping as values come in
      from userspace.  This is harmless today and it prepares the quota
      system to work on filesystems with quotas but mounted by unprivileged
      users.
      
      Call qid_has_mapping from dqget.  This ensures the passed in qid has a
      representation on the underlying filesystem.  Previously this was
      unnecessary as filesystems never had qids that could not map.  With
      the introduction of filesystems outside of init_user_ns this will not
      remain true.
      
      All of this ensures the quota code never has to deal with qids that
      don't map to the underlying filesystem.
      
      Cc: Jan Kara <jack@suse.cz>
      Acked-by: Seth Forshee <seth.forshee@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • vfs: Don't create inodes with a uid or gid unknown to the vfs · 036d5236
      Eric W. Biederman authored
      It is expected that filesystems cannot represent uids and gids from
      outside of their user namespace.  Keep things simple by not even
      trying to create filesystem nodes with nonsense uids and gids.
      Acked-by: Seth Forshee <seth.forshee@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • vfs: Don't modify inodes with a uid or gid unknown to the vfs · 0bd23d09
      Eric W. Biederman authored
      When a filesystem outside of init_user_ns is mounted it could have
      uids and gids stored in it that do not map to init_user_ns.
      
      The plan is to allow those filesystems to set i_uid to INVALID_UID and
      i_gid to INVALID_GID for unmapped uids and gids and then to handle
      that strange case in the vfs to ensure there is consistent robust
      handling of the weirdness.
      
      Upon a careful review of the vfs and filesystems about the only case
      where there is any possibility of confusion or trouble is when the
      inode is written back to disk.  In that case filesystems typically
      read the inode->i_uid and inode->i_gid and write them to disk even
      when just an inode timestamp is being updated.
      
      Which leads to a rule that is very simple to implement and understand:
      inodes whose i_uid or i_gid is not valid may not be written.
      
      In dealing with access times this means treat those inodes as if the
      inode flag S_NOATIME was set.  Reads of the inodes appear safe and
      useful, but any write or modification is disallowed.  The only inode
      write that is allowed is a chown that sets the uid and gid on the
      inode to valid values.  After such a chown the inode is normal and may
      be treated as such.
      
      Denying all writes to inodes with uids or gids unknown to the vfs also
      prevents several oddball cases where corruption would have occurred
      because the vfs does not have complete information.
      
      One problem case that is prevented is attempting to use the gid of a
      directory for new inodes where the directory's sgid bit is set but the
      directory's gid is not mapped.
      
      Another problem case avoided is attempting to update the evm hash
      after setxattr, removexattr, and setattr.  As the evm hash includes
      the inode->i_uid or inode->i_gid, not knowing the uid or gid prevents
      a correct evm hash from being computed.  evm hash verification also
      fails when i_uid or i_gid is unknown but that is essentially harmless
      as it does not cause filesystem corruption.
      Acked-by: Seth Forshee <seth.forshee@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
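The rule stated in this commit message (no writes to an inode with an invalid uid or gid, except a chown that makes both valid) can be modelled as below. This is an illustrative sketch: `struct inode_model`, `has_unmapped_id()`, `touch_atime_model()` and `chown_model()` are hypothetical names, and the kernel's real checks are spread across several code paths.

```c
#include <stdbool.h>
#include <stdint.h>

/* Models INVALID_UID / INVALID_GID. */
#define INVALID_ID UINT32_MAX

struct inode_model {
    uint32_t i_uid;
    uint32_t i_gid;
    long atime;
};

/* True when the vfs does not know the inode's uid or gid. */
static bool has_unmapped_id(const struct inode_model *inode)
{
    return inode->i_uid == INVALID_ID || inode->i_gid == INVALID_ID;
}

/*
 * Access time update: inodes with unknown ids are treated as if
 * S_NOATIME were set. Returns true when the timestamp was written.
 */
static bool touch_atime_model(struct inode_model *inode, long now)
{
    if (has_unmapped_id(inode))
        return false;
    inode->atime = now;
    return true;
}

/*
 * chown: the only write permitted on an inode with unknown ids, and only
 * when it leaves both ids valid. Returns 0 on success, -1 when refused.
 */
static int chown_model(struct inode_model *inode,
                       bool set_uid, uint32_t uid,
                       bool set_gid, uint32_t gid)
{
    /* The new ids must themselves be valid. */
    if ((set_uid && uid == INVALID_ID) || (set_gid && gid == INVALID_ID))
        return -1;
    /* An invalid id that is not being replaced would stay invalid. */
    if (!set_uid && inode->i_uid == INVALID_ID)
        return -1;
    if (!set_gid && inode->i_gid == INVALID_ID)
        return -1;

    if (set_uid)
        inode->i_uid = uid;
    if (set_gid)
        inode->i_gid = gid;
    return 0;
}
```

After a successful `chown_model()` call the inode has valid ids and `touch_atime_model()` behaves normally again, matching the "after such a chown the inode is normal" statement.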
  3. 30 Jun, 2016 4 commits
  4. 28 Jun, 2016 1 commit
  5. 24 Jun, 2016 5 commits
    • selinux: Add support for unprivileged mounts from user namespaces · aad82892
      Seth Forshee authored
      Security labels from unprivileged mounts in user namespaces must
      be ignored. Force superblocks from user namespaces whose labeling
      behavior is to use xattrs to use mountpoint labeling instead.
      For the mountpoint label, default to converting the current task
      context into a form suitable for file objects, but also allow the
      policy writer to specify a different label through policy
      transition rules.
      
      Pieced together from code snippets provided by Stephen Smalley.
      Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
      Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
      Acked-by: James Morris <james.l.morris@oracle.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • Smack: Handle labels consistently in untrusted mounts · 809c02e0
      Seth Forshee authored
      The SMACK64, SMACK64EXEC, and SMACK64MMAP labels are all handled
      differently in untrusted mounts. This is confusing and
      potentially problematic. Change this to handle them all the same
      way that SMACK64 is currently handled; that is, read the label
      from disk and check it at use time. For SMACK64 and SMACK64MMAP
      access is denied if the label does not match smk_root. To be
      consistent with suid, a SMACK64EXEC label which does not match
      smk_root will still allow execution of the file but will not run
      with the label supplied in the xattr.
      Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
      Acked-by: Casey Schaufler <casey@schaufler-ca.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • Smack: Add support for unprivileged mounts from user namespaces · 9f50eda2
      Seth Forshee authored
      Security labels from unprivileged mounts cannot be trusted.
      Ideally for these mounts we would assign the objects in the
      filesystem the same label as the inode for the backing device
      passed to mount. Unfortunately it's currently impossible to
      determine which inode this is from the LSM mount hooks, so we
      settle for the label of the process doing the mount.
      
      This label is assigned to s_root, and also to smk_default to
      ensure that new inodes receive this label. The transmute property
      is also set on s_root to make this behavior more explicit, even
      though it is technically not necessary.
      
      If a filesystem has existing security labels, access to inodes is
      permitted if the label is the same as smk_root, otherwise access
      is denied. The SMACK64EXEC xattr is completely ignored.
      
      Explicit setting of security labels continues to require
      CAP_MAC_ADMIN in init_user_ns.
      
      Altogether, this ensures that filesystem objects are not
      accessible to subjects which cannot already access the backing
      store, that MAC is not violated for any objects in the filesystem
      which are already labeled, and that a user cannot use an
      unprivileged mount to gain elevated MAC privileges.
      
      sysfs, tmpfs, and ramfs are already mountable from user
      namespaces and support security labels. We can't rule out the
      possibility that these filesystems may already be used in mounts
      from user namespaces with security labels set from the init
      namespace, so failing to trust labels in these filesystems may
      introduce regressions. It is safe to trust labels from these
      filesystems, since the unprivileged user does not control the
      backing store and thus cannot supply security labels, so an
      explicit exception is made to trust labels from these
      filesystems.
      Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
      Acked-by: Casey Schaufler <casey@schaufler-ca.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • fs: Treat foreign mounts as nosuid · 380cf5ba
      Andy Lutomirski authored
      If a process gets access to a mount from a different user
      namespace, that process should not be able to take advantage of
      setuid files or selinux entrypoints from that filesystem.  Prevent
      this by treating mounts from other mount namespaces and those not
      owned by current_user_ns() or an ancestor as nosuid.
      
      This will make it safer to allow more complex filesystems to be
      mounted in non-root user namespaces.
      
      This does not remove the need for MNT_LOCK_NOSUID.  The setuid,
      setgid, and file capability bits can no longer be abused if code in
      a user namespace were to clear nosuid on an untrusted filesystem,
      but this patch, by itself, is insufficient to protect the system
      from abuse of files that, when execed, would increase MAC privilege.
      
      As a more concrete explanation, any task that can manipulate a
      vfsmount associated with a given user namespace already has
      capabilities in that namespace and all of its descendants.  If they
      can cause a malicious setuid, setgid, or file-caps executable to
      appear in that mount, then that executable will only allow them to
      elevate privileges in exactly the set of namespaces in which they
      are already privileged.
      
      On the other hand, if they can cause a malicious executable to
      appear with a dangerous MAC label, running it could change the
      caller's security context in a way that should not have been
      possible, even inside the namespace in which the task is confined.
      
      As a hardening measure, this would have made CVE-2014-5207 much
      more difficult to exploit.
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
      Acked-by: James Morris <james.l.morris@oracle.com>
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
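The "foreign mounts are nosuid" condition described in this commit message can be approximated with the following userspace model. Everything here is a hypothetical sketch: the flag value, `struct task_model`, `struct mount_model`, the integer `mnt_ns_id` used to identify a mount namespace, and `mnt_may_suid_model()` are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag value, chosen only for this model. */
#define MNT_NOSUID_MODEL 0x01u

/* A user namespace knows only its parent; the initial one has none. */
struct user_ns_model {
    const struct user_ns_model *parent;
};

struct mount_model {
    unsigned int mnt_flags;
    int mnt_ns_id;                         /* mount namespace holding the mount */
    const struct user_ns_model *s_user_ns; /* owner of the superblock */
};

struct task_model {
    int mnt_ns_id;                         /* the task's mount namespace */
    const struct user_ns_model *user_ns;   /* the task's user namespace */
};

/* True when ns is target itself or one of target's descendants. */
static bool ns_is_self_or_descendant(const struct user_ns_model *ns,
                                     const struct user_ns_model *target)
{
    for (; ns != NULL; ns = ns->parent) {
        if (ns == target)
            return true;
    }
    return false;
}

/*
 * setuid/setgid/file-caps bits are honoured only when the mount is not
 * nosuid, lives in the task's own mount namespace, and is owned by the
 * task's user namespace or one of its ancestors.
 */
static bool mnt_may_suid_model(const struct task_model *task,
                               const struct mount_model *mnt)
{
    return !(mnt->mnt_flags & MNT_NOSUID_MODEL) &&
           mnt->mnt_ns_id == task->mnt_ns_id &&
           ns_is_self_or_descendant(task->user_ns, mnt->s_user_ns);
}
```

In this model a task in the initial user namespace that reaches a mount owned by a child namespace gets `false`, which is the case the hardening is aimed at.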
    • fs: Limit file caps to the user namespace of the super block · d07b846f
      Seth Forshee authored
      Capability sets attached to files must be ignored except in the
      user namespaces where the mounter is privileged, i.e. s_user_ns
      and its descendants. Otherwise a vector exists for gaining
      privileges in namespaces where a user is not already privileged.
      
      Add a new helper function, current_in_userns(), to test whether the
      current user namespace is the same as or a descendant of a given
      namespace.  Use this helper to determine whether a file's capability
      set should be applied to the caps constructed during exec.
      
      --EWB Replaced in_userns with the simpler current_in_userns.
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
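The "same as or a descendant of" test used to decide whether file capabilities apply can be sketched as a walk up the namespace parent chain. This is a userspace model under assumptions: `struct user_ns_model`, `in_userns_model()` and `file_caps_apply()` are hypothetical names, and the task's namespace is passed explicitly rather than read from the current task.

```c
#include <stdbool.h>
#include <stddef.h>

/* A user namespace knows only its parent; the initial one has none. */
struct user_ns_model {
    const struct user_ns_model *parent;
};

/*
 * Walk from the task's namespace towards the root. Finding target_ns on
 * the way means the task is in target_ns or in one of its descendants.
 */
static bool in_userns_model(const struct user_ns_model *task_ns,
                            const struct user_ns_model *target_ns)
{
    for (const struct user_ns_model *ns = task_ns; ns != NULL; ns = ns->parent) {
        if (ns == target_ns)
            return true;
    }
    return false;
}

/*
 * File capabilities are honoured only in s_user_ns and its descendants,
 * i.e. where the mounter of the filesystem is privileged.
 */
static bool file_caps_apply(const struct user_ns_model *task_ns,
                            const struct user_ns_model *s_user_ns)
{
    return in_userns_model(task_ns, s_user_ns);
}
```

A task in a sibling or ancestor namespace of `s_user_ns` never reaches it while walking upwards, so the file's capability set is ignored for that task.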
  6. 23 Jun, 2016 12 commits
    • userns: Remove the now unnecessary FS_USERNS_DEV_MOUNT flag · cc50a07a
      Eric W. Biederman authored
      Now that SB_I_NODEV controls the nodev behavior, devpts can just clear
      this flag during mount.  This simplifies the code and makes it easier
      to audit how the code works, while still preserving the invariant
      that s_iflags is only modified during mount.
      Acked-by: Seth Forshee <seth.forshee@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • userns: Remove implicit MNT_NODEV fragility. · 67690f93
      Eric W. Biederman authored
      Replace the implicit setting of MNT_NODEV on mounts that happen with
      just user namespace permissions with an implicit setting of SB_I_NODEV
      in s_iflags.  The visibility of the implicit MNT_NODEV has caused
      problems in the past.
      
      With this change the fragile case where an implicit MNT_NODEV needs to
      be preserved in do_remount is removed.  Using SB_I_NODEV is much less
      fragile as s_iflags are set during the original mount and never
      changed.
      
      In do_new_mount with the implicit setting of MNT_NODEV gone, the only
      code that can affect mnt_flags is fs_fully_visible so simplify the if
      statement and reduce the indentation of the code to make that clear.
      Acked-by: Seth Forshee <seth.forshee@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • mnt: Simplify mount_too_revealing · a1935c17
      Eric W. Biederman authored
      Verify all filesystems that we check in mount_too_revealing set
      SB_I_NOEXEC and SB_I_NODEV in sb->s_iflags.  That is true for today
      and it should remain true in the future.
      
      Remove the now unnecessary checks from mnt_already_visible that
      ensure MNT_LOCK_NOSUID, MNT_LOCK_NOEXEC, and MNT_LOCK_NODEV are
      preserved, making the code shorter and easier to read.
      
      Relying on SB_I_NOEXEC and SB_I_NODEV instead of the user visible
      MNT_NOSUID, MNT_NOEXEC, and MNT_NODEV ensures that the many current
      systems where proc and sysfs are mounted with "nosuid, nodev, noexec",
      and where several slightly buggy container applications don't bother
      to set those flags, continue to work.
      Acked-by: Seth Forshee <seth.forshee@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • vfs: Generalize filesystem nodev handling. · a2982cc9
      Eric W. Biederman authored
      Introduce a function may_open_dev that tests MNT_NODEV and a new
      superblock flag SB_I_NODEV.  Use this new function in all of the
      places where MNT_NODEV was previously tested.
      
      Add the new SB_I_NODEV s_iflag to proc, sysfs, and mqueuefs as those
      filesystems should never support device nodes, and a simple superblock
      flag makes that very hard to get wrong.  With SB_I_NODEV set, if any
      device nodes somehow manage to show up on a filesystem those
      device nodes will be unopenable.
      Acked-by: Seth Forshee <seth.forshee@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
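The combined check this commit message describes, where a device node may be opened only if neither the mount nor the superblock forbids it, can be sketched as follows. This is a userspace model: the flag values, `struct super_block_model`, `struct mount_model` and `may_open_dev_model()` are hypothetical and do not reproduce the kernel's definitions.

```c
#include <stdbool.h>

/* Hypothetical flag values, chosen only for this model. */
#define MNT_NODEV_MODEL  0x02u /* per-mount, user visible "nodev" option */
#define SB_I_NODEV_MODEL 0x04u /* per-superblock, set once at mount time */

struct super_block_model {
    unsigned long s_iflags;
};

struct mount_model {
    unsigned int mnt_flags;
    const struct super_block_model *sb;
};

/* Device nodes are openable only if both levels allow them. */
static bool may_open_dev_model(const struct mount_model *mnt)
{
    return !(mnt->mnt_flags & MNT_NODEV_MODEL) &&
           !(mnt->sb->s_iflags & SB_I_NODEV_MODEL);
}
```

Because the superblock flag is checked independently, a filesystem such as proc in this model stays closed to device opens even when a mount of it omits the `nodev` option.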
    • ipc/mqueue: The mqueue filesystem should never contain executables · 3ee69014
      Eric W. Biederman authored
      Set SB_I_NOEXEC on mqueuefs to ensure small implementation mistakes
      do not result in executables on mqueuefs by accident.
      Acked-by: Seth Forshee <seth.forshee@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • kernfs: The cgroup filesystem also benefits from SB_I_NOEXEC · 29a517c2
      Eric W. Biederman authored
      The cgroup filesystem is in the same boat as sysfs.  No one ever
      permits executables of any kind on the cgroup filesystem, and there is
      no reasonable case for supporting executables in the future.
      
      Therefore move the setting of SB_I_NOEXEC, which makes the code proof
      against future mistakes of accidentally creating executables, from
      sysfs to kernfs itself.  This makes the code simpler and covers the
      sysfs, cgroup, and cgroup2 filesystems.
      Acked-by: Seth Forshee <seth.forshee@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • mnt: Move the FS_USERNS_MOUNT check into sget_userns · a001e74c
      Eric W. Biederman authored
      Allowing a filesystem to be mounted by other than root in the initial
      user namespace is a filesystem property not a mount namespace property
      and as such should be checked in filesystem specific code.  Move the
      FS_USERNS_MOUNT test into super.c:sget_userns().
      Acked-by: Seth Forshee <seth.forshee@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • fs: Add user namespace member to struct super_block · 6e4eab57
      Eric W. Biederman authored
      Start marking filesystems with a user namespace owner, s_user_ns.  In
      this change this is only used for permission checks of who may mount a
      filesystem.  Ultimately s_user_ns will be used for translating ids and
      checking capabilities for filesystems mounted from user namespaces.
      
      The default policy for setting s_user_ns is implemented in sget(),
      which arranges for s_user_ns to be set to current_user_ns() and to
      ensure that the mounter of the filesystem has CAP_SYS_ADMIN in that
      user_ns.
      
      The guts of sget are split out into another function sget_userns().
      The function sget_userns calls alloc_super with the specified user
      namespace or it verifies the existing superblock that was found
      has the expected user namespace, and fails with EBUSY when it is not.
      This failing prevents users with the wrong privileges mounting a
      filesystem.
      
      The reason for the split of sget_userns from sget is that in some
      cases such as mount_ns and kernfs_mount_ns a different policy for
      permission checking of mounts and setting s_user_ns is necessary, and
      the existence of sget_userns() allows those policies to be
      implemented.
      
      The helper mount_ns is expected to be used for filesystems such as
      proc and mqueuefs which present per namespace information.  The
      function mount_ns is modified to call sget_userns instead of sget to
      ensure the user namespace owner of the namespace whose information is
      presented by the filesystem is used on the superblock.
      
      For sysfs and cgroup the appropriate permission checks are already in
      place, and kernfs_mount_ns is modified to call sget_userns so that
      the init_user_ns is the only user namespace used.
      
      For the cgroup filesystem cgroup namespace mounts are bind mounts of a
      subset of the full cgroup filesystem and as such s_user_ns must be the
      same for all of them as there is only a single superblock.
      
      Mounts of sysfs that vary based on the network namespace could in principle
      change s_user_ns but it keeps the analysis and implementation of kernfs
      simpler if that is not supported, and at present there appear to be no
      benefits from supporting a different s_user_ns on any sysfs mount.
      
      Getting the details of setting s_user_ns correct has been
      a long process.  Thanks to Pavel Tikhomirov who spotted a leak
      in sget_userns.  Thanks to Seth Forshee who has kept the work alive.
      
      Thanks-to: Seth Forshee <seth.forshee@canonical.com>
      Thanks-to: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
      Acked-by: Seth Forshee <seth.forshee@canonical.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • proc: Convert proc_mount to use mount_ns. · e94591d0
      Eric W. Biederman authored
      Move the call of get_pid_ns, the call of proc_parse_options, and
      the setting of s_iflags into proc_fill_super so that mount_ns
      can be used.
      
      Convert proc_mount to call mount_ns and remove the now unnecessary
      code.
      Acked-by: Seth Forshee <seth.forshee@canonical.com>
      Reviewed-by: Djalal Harouni <tixxdz@gmail.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • vfs: Pass data, ns, and ns->userns to mount_ns · d91ee87d
      Eric W. Biederman authored
      Today what is normally called data (the mount options) is not passed
      to fill_super through mount_ns.
      
      Pass the mount options and the namespace separately to mount_ns so
      that filesystems such as proc that have mount options, can use
      mount_ns.
      
      Pass the user namespace to mount_ns so that the standard permission
      check that verifies the mounter has permissions over the namespace can
      be performed in mount_ns instead of in each filesystems .mount method.
      Thus removing the duplication between mqueuefs and proc in terms of
      permission checks.  The extra permission check does not currently
      affect the rpc_pipefs filesystem and the nfsd filesystem as those
      filesystems do not currently allow unprivileged mounts.  Without
      unprivileged mounts it is guaranteed that the caller has already passed
      capable(CAP_SYS_ADMIN), which guarantees the extra permission check
      will pass.
      
      Update rpc_pipefs and the nfsd filesystem to ensure that the network
      namespace reference is always taken in fill_super and always put in kill_sb
      so that the logic is simpler and so that errors originating inside of
      fill_super do not cause a network namespace leak.
      Acked-by: Seth Forshee <seth.forshee@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • ipc: Initialize ipc_namespace->user_ns early. · b236017a
      Eric W. Biederman authored
      Allow the ipc namespace initialization code to depend on ns->user_ns
      being set during initialization.
      
      In particular this allows mq_init_ns to use ns->user_ns for permission
      checks and initializing s_user_ns while the mq filesystem is
      being mounted.
      Acked-by: Seth Forshee <seth.forshee@canonical.com>
      Suggested-by: Seth Forshee <seth.forshee@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • mnt: Refactor fs_fully_visible into mount_too_revealing · 8654df4e
      Eric W. Biederman authored
      Replace the call of fs_fully_visible in do_new_mount from before the
      new superblock is allocated with a call of mount_too_revealing after
      the superblock is allocated.   This winds up being a much better location
      for maintainability of the code.
      
      The first change this enables is the replacement of FS_USERNS_VISIBLE
      with SB_I_USERNS_VISIBLE, moving the flag from struct file_system_type
      to s_iflags on the superblock.
      
      Unfortunately mount_too_revealing fundamentally needs to touch
      mnt_flags, adding several MNT_LOCK_XXX flags at the appropriate
      times.  If the mnt_flags did not need to be touched the code
      could be easily moved into the filesystem specific mount code.
      Acked-by: Seth Forshee <seth.forshee@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  7. 15 Jun, 2016 1 commit
    • mnt: Account for MS_RDONLY in fs_fully_visible · 695e9df0
      Eric W. Biederman authored
      In rare cases it is possible for s_flags & MS_RDONLY to be set but
      MNT_READONLY to be clear.  This starting combination can cause
      fs_fully_visible to fail to ensure that the new mount is readonly.
      Therefore force MNT_LOCK_READONLY in the new mount if MS_RDONLY
      is set on the source filesystem of the mount.
      
      In general both MS_RDONLY and MNT_READONLY are set at the same time for
      mounts so I don't expect any programs to care.  Nor do I expect
      MS_RDONLY to be set on proc or sysfs in the initial user namespace,
      which further decreases the likelihood of problems.
      
      Which means this change should only affect system configurations by
      paranoid sysadmins who should welcome the additional protection
      as it keeps people from wriggling out of their policies.
      
      Cc: stable@vger.kernel.org
      Fixes: 8c6cf9cc ("mnt: Modify fs_fully_visible to deal with locked ro nodev and atime")
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
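The fix described in this commit message, treating a read-only superblock as a locked read-only mount when deriving the flags for a new mount, can be illustrated with a short model. The flag values and the helper `locked_flags_for_new_mount()` are hypothetical and exist only for this sketch.

```c
/* Hypothetical flag values, chosen only for this model. */
#define MS_RDONLY_MODEL         0x01u     /* superblock is read-only */
#define MNT_READONLY_MODEL      0x40u     /* mount is read-only */
#define MNT_LOCK_READONLY_MODEL 0x400000u /* read-only may not be cleared */

/*
 * Compute which lock bits a new mount must inherit from an existing
 * mount of the same filesystem.
 */
static unsigned int locked_flags_for_new_mount(unsigned int src_mnt_flags,
                                               unsigned long src_s_flags)
{
    unsigned int flags = src_mnt_flags;

    /* Don't miss a read-only state recorded only on the superblock. */
    if (src_s_flags & MS_RDONLY_MODEL)
        flags |= MNT_LOCK_READONLY_MODEL;

    /* Only the lock bit is carried over in this model. */
    return flags & MNT_LOCK_READONLY_MODEL;
}
```

Without the superblock test, a source with `MS_RDONLY` set but no read-only mount flags would yield no lock bit, which is the gap the commit message describes.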
  8. 07 Jun, 2016 2 commits
  9. 05 Jun, 2016 7 commits
    • Linux 4.7-rc2 · af8c34ce
      Linus Torvalds authored
    • Merge branch 'parisc-4.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux · 5975b2c0
      Linus Torvalds authored
      Pull parisc fixes from Helge Deller:
      
       - Fix printk time stamps on SMP systems which got wrong due to a patch
         which was added during the merge window
      
       - Fix two bugs in the stack backtrace code: Races in module unloading
         and possible invalid accesses to memory due to wrong instruction
         decoding (Mikulas Patocka)
      
       - Fix userspace crash when syscalls access invalid unaligned userspace
         addresses.  Those syscalls will now return EFAULT as expected.
         (tagged for stable kernel series)
      
      * 'parisc-4.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
        parisc: Move die_if_kernel() prototype into traps.h header
        parisc: Fix pagefault crash in unaligned __get_user() call
        parisc: Fix printk time during boot
        parisc: Fix backtrace on PA-RISC
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security · d834502e
      Linus Torvalds authored
      Pull key handling update from James Morris:
       "This alters a new keyctl function added in the current merge window to
        allow for a future extension planned for the next merge window"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
        KEYS: Add placeholder for KDF usage with DH
    • devpts: Make each mount of devpts an independent filesystem. · eedf265a
      Eric W. Biederman authored
      The /dev/ptmx device node is changed to lookup the directory entry "pts"
      in the same directory as the /dev/ptmx device node was opened in.  If
      there is a "pts" entry and that entry is a devpts filesystem /dev/ptmx
      uses that filesystem.  Otherwise the open of /dev/ptmx fails.
      
      The DEVPTS_MULTIPLE_INSTANCES configuration option is removed, so that
      userspace can now safely depend on each mount of devpts creating a new
      instance of the filesystem.
      
      Each mount of devpts is now a separate and equal filesystem.
      
      Reserved ttys are now available to all instances of devpts where the
      mounter is in the initial mount namespace.
      
      A new vfs helper path_pts is introduced that finds a directory entry
      named "pts" in the directory of the passed in path, and changes the
      passed in path to point to it.  The helper path_pts uses a function
      path_parent_directory that was factored out of follow_dotdot.
      
      In the implementation of devpts:
       - devpts_mnt is killed as it is no longer meaningful if all mounts of
         devpts are equal.
       - pts_sb_from_inode is replaced by just inode->i_sb as all cached
         inodes in the tty layer are now from the devpts filesystem.
       - devpts_add_ref is rolled into the new function devpts_ptmx, and the
         unnecessary inode hold is removed.
       - devpts_del_ref is renamed devpts_release and reduced to just a
         deactivate_super.
       - The newinstance mount option continues to be accepted but is now
         ignored.
      
      In devpts_fs.h definitions for when !CONFIG_UNIX98_PTYS are removed as
      they are never used.
      
      Documentation/filesystems/devices.txt is updated to describe the current
      situation.
      
      This has been verified to work properly on openwrt-15.05, centos5,
      centos6, centos7, debian-6.0.2, debian-7.9, debian-8.2, ubuntu-14.04.3,
      ubuntu-15.10, fedora23, mageia-5, mint-17.3, opensuse-42.1,
      slackware-14.1, gentoo-20151225 (13.0?), archlinux-2015-12-01.  One
      caveat: on centos6 and slackware-14.1 two instances of the devpts
      filesystem end up mounted on /dev/pts, and the lower copy does not
      get used.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Aurelien Jarno <aurelien@aurel32.net>
      Cc: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>
      Cc: Jann Horn <jann@thejh.net>
      Cc: Jiri Slaby <jslaby@suse.com>
      Cc: Florian Weimer <fw@deneb.enyo.de>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eedf265a
    • Helge Deller's avatar
      58f1c654
    • Helge Deller's avatar
      parisc: Fix pagefault crash in unaligned __get_user() call · 8b78f260
      Helge Deller authored
      One of the debian buildd servers had this crash in the syslog without
      any other information:
      
       Unaligned handler failed, ret = -2
       clock_adjtime (pid 22578): Unaligned data reference (code 28)
       CPU: 1 PID: 22578 Comm: clock_adjtime Tainted: G  E  4.5.0-2-parisc64-smp #1 Debian 4.5.4-1
       task: 000000007d9960f8 ti: 00000001bde7c000 task.ti: 00000001bde7c000
      
            YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
       PSW: 00001000000001001111100000001111 Tainted: G            E
       r00-03  000000ff0804f80f 00000001bde7c2b0 00000000402d2be8 00000001bde7c2b0
       r04-07  00000000409e1fd0 00000000fa6f7fff 00000001bde7c148 00000000fa6f7fff
       r08-11  0000000000000000 00000000ffffffff 00000000fac9bb7b 000000000002b4d4
       r12-15  000000000015241c 000000000015242c 000000000000002d 00000000fac9bb7b
       r16-19  0000000000028800 0000000000000001 0000000000000070 00000001bde7c218
       r20-23  0000000000000000 00000001bde7c210 0000000000000002 0000000000000000
       r24-27  0000000000000000 0000000000000000 00000001bde7c148 00000000409e1fd0
       r28-31  0000000000000001 00000001bde7c320 00000001bde7c350 00000001bde7c218
       sr00-03  0000000001200000 0000000001200000 0000000000000000 0000000001200000
       sr04-07  0000000000000000 0000000000000000 0000000000000000 0000000000000000
      
       IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000402d2e84 00000000402d2e88
        IIR: 0ca0d089    ISR: 0000000001200000  IOR: 00000000fa6f7fff
        CPU:        1   CR30: 00000001bde7c000 CR31: ffffffffffffffff
        ORIG_R28: 00000002369fe628
        IAOQ[0]: compat_get_timex+0x2dc/0x3c0
        IAOQ[1]: compat_get_timex+0x2e0/0x3c0
        RP(r2): compat_get_timex+0x40/0x3c0
       Backtrace:
        [<00000000402d4608>] compat_SyS_clock_adjtime+0x40/0xc0
        [<0000000040205024>] syscall_exit+0x0/0x14
      
      This means the userspace program clock_adjtime called the clock_adjtime()
      syscall and then crashed inside the compat_get_timex() function.
      Syscalls should never crash programs, but instead return EFAULT.
      
      The IIR register contains the executed instruction, which disassembles
      into "ldw 0(sr3,r5),r9".
      This load-word instruction is part of __get_user(), which tried to read the
      word at %r5/IOR (0xfa6f7fff), so the unaligned handler stepped in.  The
      unaligned handler is able to emulate all ldw instructions, but it fails if
      it cannot read the source, e.g. because of a page fault.
      
      The following program reproduces the problem:
      
      #define _GNU_SOURCE
      #include <unistd.h>
      #include <sys/syscall.h>
      #include <sys/mman.h>
      
      int main(void) {
              /* allocate 8k */
              char *ptr = mmap(NULL, 2*4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
              /* free second half (upper 4k) and make it invalid. */
              munmap(ptr+4096, 4096);
              /* syscall where the first int is unaligned and extends into the invalid memory region */
              /* syscall should return EFAULT */
              return syscall(__NR_clock_adjtime, 0, ptr+4095);
      }
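
      To see the error code directly rather than through the exit status, a
      variant along the following lines could be used.  It is only a sketch:
      the function name probe_unaligned_clock_adjtime is hypothetical, and it
      takes the page size from sysconf() instead of assuming 4096.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical probe: returns the errno set by clock_adjtime() when its
 * struct pointer starts one byte before an unmapped page, 0 if the call
 * unexpectedly succeeds, or -1 if the memory setup itself fails.
 */
int probe_unaligned_clock_adjtime(void)
{
	long page = sysconf(_SC_PAGESIZE);
	char *ptr;
	long ret;
	int err;

	if (page <= 0)
		return -1;

	/* map two pages, then unmap the upper one so it becomes invalid */
	ptr = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (ptr == MAP_FAILED)
		return -1;
	if (munmap(ptr + page, page) != 0) {
		munmap(ptr, 2 * page);
		return -1;
	}

	/* the argument starts at the last valid byte and runs into the hole */
	errno = 0;
	ret = syscall(__NR_clock_adjtime, 0, ptr + page - 1);
	err = (ret == -1) ? errno : 0;

	munmap(ptr, page);
	return err;
}
```

      On a kernel that handles the fault correctly the function is expected
      to return EFAULT; a sandbox that filters the syscall may report a
      different error instead.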
      
      To fix this issue we simply need to check if the faulting instruction address
      is in the exception fixup table when the unaligned handler failed. If it
      is, call the fixup routine instead of crashing.
      
      While looking at the unaligned handler I found another issue as well: The
      target register should not be modified if the handler was unsuccessful.
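
      The two points above can be modelled in plain userspace C.  This is a
      simplified simulation, not the parisc code: struct ex_entry,
      struct fake_regs, find_fixup and handle_unaligned are hypothetical
      names, the table is assumed to be sorted by instruction address, and
      the result of the emulation attempt is passed in as a parameter.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical miniature of an exception table entry. */
struct ex_entry {
	uintptr_t insn;		/* address of an instruction allowed to fault */
	uintptr_t fixup;	/* address to resume at if it does fault */
};

/* Hypothetical register file: program counter plus 32 general registers. */
struct fake_regs {
	uintptr_t pc;
	unsigned long gr[32];
};

enum outcome { EMULATED, FIXED_UP, FATAL };

/* Binary search over a table sorted by insn; NULL if addr is not listed. */
const struct ex_entry *find_fixup(const struct ex_entry *tbl, size_t n,
				  uintptr_t addr)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (tbl[mid].insn == addr)
			return &tbl[mid];
		if (tbl[mid].insn < addr)
			lo = mid + 1;
		else
			hi = mid;
	}
	return NULL;
}

/*
 * Model of the handler's decision.  emulate_ok says whether the emulated
 * load managed to read its source; value is what it read.
 */
enum outcome handle_unaligned(struct fake_regs *regs, int target,
			      int emulate_ok, unsigned long value,
			      const struct ex_entry *tbl, size_t n)
{
	const struct ex_entry *e;

	if (emulate_ok) {
		regs->gr[target] = value;	/* write target only on success */
		regs->pc += 4;			/* skip the emulated instruction */
		return EMULATED;
	}

	/* Emulation failed: leave gr[target] alone and look for a fixup. */
	e = find_fixup(tbl, n, regs->pc);
	if (e) {
		regs->pc = e->fixup;		/* resume in the fixup routine */
		return FIXED_UP;
	}
	return FATAL;				/* no fixup entry: a genuine crash */
}
```

      In this model a failed load at a listed address resumes at the fixup
      address with the target register untouched, which corresponds to the
      syscall returning EFAULT instead of the process being killed.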
      Signed-off-by: Helge Deller <deller@gmx.de>
      Cc: stable@vger.kernel.org
      8b78f260
    • Helge Deller's avatar
      parisc: Fix printk time during boot · 0032c088
      Helge Deller authored
      Avoid showing invalid printk time stamps during boot.
      Signed-off-by: Helge Deller <deller@gmx.de>
      Reviewed-by: Aaro Koskinen <aaro.koskinen@iki.fi>
      0032c088