1. 13 Feb, 2017 1 commit
    • proc/sysctl: prune stale dentries during unregistering · d6cffbbe
      Konstantin Khlebnikov authored
      Currently, unregistering a sysctl table does not prune its dentries.
      Stale dentries can slow down sysctl operations significantly.
      
      For example, command:
      
       # for i in {1..100000} ; do unshare -n -- sysctl -a &> /dev/null ; done
      creates millions of stale dentries around the sysctls of the loopback interface:
      
       # sysctl fs.dentry-state
       fs.dentry-state = 25812579  24724135        45      0       0       0
      
      All of them have matching names, so lookup has to scan through the whole
      hash chain and call d_compare (proc_sys_compare), which checks them
      under a system-wide spinlock (sysctl_lock).
      
       # time sysctl -a > /dev/null
       real    1m12.806s
       user    0m0.016s
       sys     1m12.400s
      
      Currently only the memory reclaimer can remove this garbage,
      but without significant memory pressure this never happens.

      This patch collects sysctl inodes into a list on the sysctl table header and
      prunes all their dentries once that table unregisters.
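
A minimal userspace sketch of the bookkeeping described above, not the kernel code: every entry created for a table is linked onto that table's header, so unregistering can find and drop all of them without waiting for memory pressure. The names `struct header`, `struct entry`, `add_entry` and `unregister_header` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a cached dentry/inode pair. */
struct entry {
    struct entry *next_in_header;   /* link in the owning header's list */
    int cached;                     /* 1 while the entry sits in the cache */
};

/* Hypothetical stand-in for a sysctl table header. */
struct header {
    struct entry *entries;          /* every entry created for this table */
    int live;                       /* number of entries still cached */
};

/* Create an entry and record it on the header so it can be found later. */
static struct entry *add_entry(struct header *h)
{
    struct entry *e = calloc(1, sizeof(*e));
    if (e == NULL)
        return NULL;
    e->cached = 1;
    e->next_in_header = h->entries;
    h->entries = e;
    h->live++;
    return e;
}

/* Unregister: prune every entry recorded on the header; returns the count. */
static int unregister_header(struct header *h)
{
    int pruned = 0;
    while (h->entries != NULL) {
        struct entry *e = h->entries;
        h->entries = e->next_in_header;
        assert(e->cached);
        free(e);
        h->live--;
        pruned++;
    }
    return pruned;
}
```

With this shape the cost of cleanup is proportional to the entries of the table being removed, rather than to the size of the whole cache.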
      
      Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:
      > On 10.02.2017 10:47, Al Viro wrote:
      >> how about the matching stats *after* that patch?
      >
      > dcache size doesn't grow endlessly, so stats are fine
      >
      > # sysctl fs.dentry-state
      > fs.dentry-state = 92712	58376	45	0	0	0
      >
      > # time sysctl -a &>/dev/null
      >
      > real	0m0.013s
      > user	0m0.004s
      > sys	0m0.008s
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
  2. 03 Feb, 2017 5 commits
    • mnt: Tuck mounts under others instead of creating shadow/side mounts. · 1064f874
      Eric W. Biederman authored
      Ever since mount propagation was introduced, in cases where a mount is
      propagated to a parent mount and mountpoint pair that is already in use, the
      code has placed the new mount behind the old mount in the mount hash
      table.
      
      This implementation detail is problematic as it allows creating
      arbitrary length mount hash chains.
      
      Furthermore, it invalidates the constraint maintained elsewhere in the
      mount code that a parent mount and a mountpoint pair will have exactly
      one mount upon them, making it hard to deal with and to talk about
      this special case in the mount code.
      
      Modify mount propagation to notice when there is already a mount at
      the parent mount and mountpoint where a new mount is propagating to
      and place that preexisting mount on top of the new mount.
      
      Modify unmount propagation to notice when a mount that is being
      unmounted has another mount on top of it (and no other children), and
      to replace the unmounted mount with the mount on top of it.
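
The two rules above can be sketched with a small userspace model. This is an illustration under simplifying assumptions, not the code in fs/namespace.c: mountpoints are plain integers, `ROOT_DENTRY` stands for "the root of the mount underneath", and `new_mount`, `lookup_mnt`, `attach_mnt` and `umount_mnt` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_MOUNTS 16
#define ROOT_DENTRY 0   /* hypothetical id for "the root of the parent mount" */

struct mount {
    struct mount *parent;   /* mount this one sits on, NULL if not attached */
    int mountpoint;         /* dentry id within the parent */
    int in_use;
};

static struct mount table[MAX_MOUNTS];

static struct mount *new_mount(void)
{
    for (size_t i = 0; i < MAX_MOUNTS; i++)
        if (!table[i].in_use) {
            table[i].in_use = 1;
            table[i].parent = NULL;
            table[i].mountpoint = ROOT_DENTRY;
            return &table[i];
        }
    return NULL;
}

/* Return the single mount attached at (parent, mountpoint), or NULL. */
static struct mount *lookup_mnt(struct mount *parent, int mountpoint)
{
    assert(parent != NULL);
    for (size_t i = 0; i < MAX_MOUNTS; i++)
        if (table[i].in_use && table[i].parent == parent &&
            table[i].mountpoint == mountpoint)
            return &table[i];
    return NULL;
}

/* Attach m at (parent, mountpoint); a preexisting mount is moved on top. */
static void attach_mnt(struct mount *m, struct mount *parent, int mountpoint)
{
    struct mount *old = lookup_mnt(parent, mountpoint);
    m->parent = parent;
    m->mountpoint = mountpoint;
    if (old != NULL) {              /* tuck m underneath the old mount */
        old->parent = m;
        old->mountpoint = ROOT_DENTRY;
    }
}

/* Unmount m; a lone mount covering m's root takes over m's place. */
static void umount_mnt(struct mount *m)
{
    struct mount *topper = lookup_mnt(m, ROOT_DENTRY);
    int children = 0;
    for (size_t i = 0; i < MAX_MOUNTS; i++)
        if (table[i].in_use && table[i].parent == m)
            children++;
    if (topper != NULL && children == 1) {
        topper->parent = m->parent;
        topper->mountpoint = m->mountpoint;
    }
    m->parent = NULL;
    m->in_use = 0;
}
```

In this model every (parent, mountpoint) pair holds at most one mount at any time, which is the invariant the commit restores.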
      
      Move the MNT_UMOUNT test from __lookup_mnt_last into
      __propagate_umount as that is the only call of __lookup_mnt_last where
      MNT_UMOUNT may be set on any mount visible in the mount hash table.
      
      These modifications allow:
       - __lookup_mnt_last to be removed.
       - attach_shadows to be renamed __attach_mnt and its shadow
         handling to be removed.
       - commit_tree to be simplified
       - copy_tree to be simplified
      
      The result is an easier to understand tree of mounts that does not
      allow creation of arbitrary length hash chains in the mount hash table.
      
      The result is also a very slight userspace visible difference in semantics.
      The following two cases now behave identically, where before order
      mattered:
      
      case 1: (explicit user action)
      	B is a slave of A
      	mount something on A/a , it will propagate to B/a
      	and then mount something on B/a
      
      case 2: (tucked mount)
      	B is a slave of A
      	mount something on B/a
      	and then mount something on A/a
      
      Historically umount A/a would fail in case 1 and succeed in case 2.
      Now umount A/a succeeds in both configurations.

      This very small change in semantics appears, if anything, to be a bug
      fix to me, and my survey of userspace leads me to believe that no programs
      will notice or care about this subtle semantic change.
      
      v2: Updated mnt_change_mountpoint to not call dput or mntput
      and instead to decrement the counts directly.  It is guaranteed
      that there will be other references when mnt_change_mountpoint is
      called so this is safe.
      
      v3: Moved put_mountpoint under mount_lock in attach_recursive_mnt
          As the locking in fs/namespace.c changed between v2 and v3.
      
      v4: Reworked the logic in propagate_mount_busy and __propagate_umount
          that detects when a mount completely covers another mount.
      
      v5: Removed unnecessary tests whose result is always true in
          find_topper and attach_recursive_mnt.
      
      v6: Document the user space visible semantic difference.
      
      Cc: stable@vger.kernel.org
      Fixes: b90fa9ae ("[PATCH] shared mount handling: bind and rbind")
      Tested-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • prctl: propagate has_child_subreaper flag to every descendant · 749860ce
      Pavel Tikhomirov authored
      If a process forks children while it has the is_child_subreaper
      flag enabled, they inherit the has_child_subreaper flag (first
      group). Children forked while is_child_subreaper is disabled do
      not inherit it (second group). So a child-subreaper does not reparent
      all of its descendants when their parents die. Having these two
      differently behaving groups can lead to confusion. It is also
      a problem for CRIU: when we restore a process tree we need to
      somehow determine which descendants belong to which group and,
      much harder, put them into exactly these groups.

      To simplify this we can add propagation of the has_child_subreaper
      flag on PR_SET_CHILD_SUBREAPER, walking all descendants of the
      child-subreaper to set the has_child_subreaper flag.
      
      In the common case, where a process like systemd first sets itself to
      be a child-subreaper and only after that forks its services, we will
      have a zero-length list of descendants to walk. Testing with a binary
      subtree of 2^15 processes, prctl took < 0.007 sec and showed a close
      to linear dependency (~0.2 * n * usec) on lower numbers of processes.

      Moreover, I doubt anyone intentionally pre-forks children which
      should reparent to init before becoming a subreaper, because some
      ancestor of ours might have had the is_child_subreaper flag while
      forking our sub-tree, in which case our children all inherit the
      has_child_subreaper flag and we have no way to influence it. The only
      way to check that we have no has_child_subreaper flag is to create
      some children, kill them and see where they reparent to.
      
      Using walk_process_tree helper to walk subtree, thanks to Oleg! Timing
      seems to be the same.
      
      Optimize:
      
      a) When a descendant already has the has_child_subreaper flag, all of
      its subtree has it too.

      * For a) to be true we need to move has_child_subreaper inheritance under
      the same tasklist_lock as adding the task to its ->real_parent->children.
      Without it a process can inherit a zero has_child_subreaper, then we
      set its parent's flag to 1, check that the parent has no more children,
      and only after that the child with the wrong flag is added to the tree.

      * Also make this inheritance clearer by using real_parent instead of
      current, as on clone(CLONE_PARENT) if current has is_child_subreaper
      and real_parent has no is_child_subreaper or has_child_subreaper, the
      child will have the has_child_subreaper flag set without actually
      having a subreaper among its ancestors.

      b) When some descendant is a child_reaper, its subtree is in a different
      pidns from us (the original child-subreaper) and processes from another
      pidns will never reparent to us.

      So we can skip the subtrees in cases a) and b) during the walk.
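
The walk with both optimizations can be sketched in userspace as follows. This is a simplified model rather than the kernel implementation: it ignores locking and threads, and `struct task`, `is_pidns_reaper` and `propagate_subreaper` are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CHILDREN 4

struct task {
    struct task *children[MAX_CHILDREN];
    size_t nr_children;
    bool has_child_subreaper;   /* some ancestor is a child-subreaper */
    bool is_pidns_reaper;       /* init of a nested PID namespace */
};

/* Flag every descendant of 't'; returns how many tasks were newly flagged. */
static int propagate_subreaper(struct task *t)
{
    int marked = 0;

    for (size_t i = 0; i < t->nr_children; i++) {
        struct task *c = t->children[i];

        /* (a) already flagged: its whole subtree is flagged as well.
           (b) namespace init: its subtree never reparents to us. */
        if (c->has_child_subreaper || c->is_pidns_reaper)
            continue;

        c->has_child_subreaper = true;
        marked += 1 + propagate_subreaper(c);
    }
    return marked;
}
```

Calling the function a second time on the same tree does no work below the first level, which is the effect optimization a) is after.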
      
      v2: switch to walk_process_tree() general helper, move
      has_child_subreaper inheritance
      v3: remove csr_descendant leftover, change current to real_parent
      in has_child_subreaper inheritance
      v4: small commit message fix
      
      Fixes: ebec18a6 ("prctl: add PR_{SET,GET}_CHILD_SUBREAPER to allow simple process supervision")
      Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • introduce the walk_process_tree() helper · 0f1b92cb
      Oleg Nesterov authored
      Add a new helper to walk the process tree; the next patch adds a user.
      Note that it visits group leaders only. proc_visitor can do
      for_each_thread itself, or we can trivially extend walk_process_tree() to
      do this.
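
A rough userspace sketch of a visitor-driven tree walk in this spirit. The return convention used here (negative stops the walk, zero skips the subtree, positive descends) is an assumption made for the example, and `struct proc`, `walk_tree` and `count_visitor` are hypothetical names, not the kernel's.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_KIDS 4

struct proc {
    int pid;
    struct proc *kids[MAX_KIDS];
    size_t nr_kids;
};

/* Visitor result: <0 stops the walk, 0 skips the subtree, >0 descends. */
typedef int (*proc_visitor)(struct proc *p, void *data);

/* Visit every descendant of 'top' in pre-order; returns <0 if stopped. */
static int walk_tree(struct proc *top, proc_visitor visitor, void *data)
{
    for (size_t i = 0; i < top->nr_kids; i++) {
        int res = visitor(top->kids[i], data);

        if (res < 0)
            return res;
        if (res > 0 && walk_tree(top->kids[i], visitor, data) < 0)
            return -1;
    }
    return 0;
}

/* Example visitor: count processes, skipping the subtree under pid 3. */
static int count_visitor(struct proc *p, void *data)
{
    (*(int *)data)++;
    return p->pid == 3 ? 0 : 1;
}
```

Letting the visitor decide whether to descend keeps the traversal generic while still allowing callers to prune whole subtrees.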
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • Merge branch 'nsfs-discovery' · 015bb305
      Eric W. Biederman authored
      Michael Kerrisk <mtk.manpages@gmail.com> writes:
      
      I would like to write code that discovers the namespace setup on a live
      system.  The NS_GET_PARENT and NS_GET_USERNS ioctl() operations added in
      Linux 4.9 provide much of what I want, but there are still a couple of
      small pieces missing. Those pieces are added with this patch series.
      
      Here's an example program that makes use of the new ioctl() operations.
      
      8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---
      /* ns_capable.c
      
         (C) 2016 Michael Kerrisk, <mtk.manpages@gmail.com>
      
         Licensed under the GNU General Public License v2 or later.
      
         Test whether a process (identified by PID) might (subject to LSM checks)
         have capabilities in a namespace (identified by a /proc/PID/ns/xxx file).
      */
      
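      #define _GNU_SOURCE
      #include <sched.h>
      #include <stdlib.h>
      #include <unistd.h>
      #include <stdio.h>
      #include <errno.h>
      #include <fcntl.h>
      #include <string.h>
      #include <limits.h>
      #include <sys/stat.h>
      #include <sys/ioctl.h>
      #include <sys/capability.h>
      #include <linux/nsfs.h>

      #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
      			} while (0)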
      
      /* Display capabilities sets of process with specified PID */
      
      static void
      show_cap(pid_t pid)
      {
          cap_t caps;
          char *cap_string;
      
          caps = cap_get_pid(pid);
          if (caps == NULL)
      	errExit("cap_get_pid");
      
          cap_string = cap_to_text(caps, NULL);
          if (cap_string == NULL)
      	errExit("cap_to_text");
      
          printf("Capabilities: %s\n", cap_string);
      }
      
      /* Obtain the effective UID of the process 'pid' by
         scanning its /proc/PID/status file */
      
      static uid_t
      get_euid_of_process(pid_t pid)
      {
          char path[PATH_MAX];
          char line[1024];
          int uid;
      
          snprintf(path, sizeof(path), "/proc/%ld/status", (long) pid);
      
          FILE *fp;
          fp = fopen(path, "r");
          if (fp == NULL)
      	errExit("fopen-/proc/PID/status");
      
          for (;;) {
      	if (fgets(line, sizeof(line), fp) == NULL) {
      
      	    /* Should never happen... */
      
      	    fprintf(stderr, "Failure scanning %s\n", path);
      	    exit(EXIT_FAILURE);
      	}
      
      	if (strstr(line, "Uid:") == line) {
      	    sscanf(line, "Uid: %*d %d %*d %*d", &uid);
      	    return uid;
      	}
          }
      }
      
      int
      main(int argc, char *argv[])
      {
          int ns_fd, userns_fd, pid_userns_fd;
          int nstype;
          int next_fd;
          struct stat pid_stat;
          struct stat target_stat;
          char *pid_str;
          pid_t pid;
          char path[PATH_MAX];
      
          if (argc < 2) {
      	fprintf(stderr, "Usage: %s PID [ns-file]\n", argv[0]);
      	fprintf(stderr, "\t'ns-file' is a /proc/PID/ns/xxxx file; "
      		        "if omitted, use the namespace\n"
      			"\treferred to by standard input "
      			"(file descriptor 0)\n");
      	exit(EXIT_FAILURE);
          }
      
          pid_str = argv[1];
          pid = atoi(pid_str);
      
          if (argc <= 2) {
      	ns_fd = STDIN_FILENO;
          } else {
              ns_fd = open(argv[2], O_RDONLY);
              if (ns_fd == -1)
      	    errExit("open-ns-file");
          }
      
          /* Get the relevant user namespace FD, which is 'ns_fd' if 'ns_fd' refers
             to a user namespace, otherwise the user namespace that owns 'ns_fd' */
      
          nstype = ioctl(ns_fd, NS_GET_NSTYPE);
          if (nstype == -1)
      	errExit("ioctl-NS_GET_NSTYPE");
      
          if (nstype == CLONE_NEWUSER) {
      	userns_fd = ns_fd;
          } else {
      	userns_fd = ioctl(ns_fd, NS_GET_USERNS);
              if (userns_fd == -1)
      	    errExit("ioctl-NS_GET_USERNS");
          }
      
          /* Obtain 'stat' info for the user namespace of the specified PID */
      
          snprintf(path, sizeof(path), "/proc/%s/ns/user", pid_str);
      
          pid_userns_fd = open(path, O_RDONLY);
          if (pid_userns_fd == -1)
      	errExit("open-PID");
      
          if (fstat(pid_userns_fd, &pid_stat) == -1)
      	errExit("fstat-PID");
      
          /* Get 'stat' info for the target user namespace */

          if (fstat(userns_fd, &target_stat) == -1)
      	errExit("fstat-userns");
      
          /* If the PID is in the target user namespace, then it has
             whatever capabilities are in its sets. */
      
          if (pid_stat.st_dev == target_stat.st_dev &&
      		pid_stat.st_ino == target_stat.st_ino) {
              printf("PID is in target namespace\n");
      	printf("Subject to LSM checks, it has the following capabilities\n");
      
      	show_cap(pid);
      
      	exit(EXIT_SUCCESS);
          }
      
          /* Otherwise, we need to walk through the ancestors of the target
             user namespace to see if PID is in an ancestor namespace */
      
          for (;;) {
      	int f;
      
      	next_fd = ioctl(userns_fd, NS_GET_PARENT);
      
      	if (next_fd == -1) {
      
      	    /* The error here should be EPERM... */
      
      	    if (errno != EPERM)
      	        errExit("ioctl-NS_GET_PARENT");
      
      	    printf("PID is not in an ancestor namespace\n");
      	    printf("It has no capabilities in the target namespace\n");
      
      	    exit(EXIT_SUCCESS);
      	}
      
              if (fstat(next_fd, &target_stat) == -1)
      	    errExit("fstat-PID");
      
      	/* If the 'stat' info for this user namespace matches the 'stat'
      	 * info for 'next_fd', then the PID is in an ancestor namespace */
      
              if (pid_stat.st_dev == target_stat.st_dev &&
      		    pid_stat.st_ino == target_stat.st_ino)
      	    break;
      
      	/* Next time round, get the next parent */
      
      	f = userns_fd;
      	userns_fd = next_fd;
      	close(f);
          }
      
          /* At this point, we found that PID is in an ancestor of the target
             user namespace, and 'userns_fd' refers to the immediate descendant
             user namespace of PID in the chain of user namespaces from PID to
             the target user namespace. If the effective UID of PID matches the
             owner UID of descendant user namespace, then PID has all
             capabilities in the descendant namespace(s); otherwise, it just has
             the capabilities that are in its sets. */
      
          uid_t owner_uid, uid;
          if (ioctl(userns_fd, NS_GET_OWNER_UID, &owner_uid) == -1) {
      	perror("ioctl-NS_GET_OWNER_UID");
      	exit(EXIT_FAILURE);
          }
      
          uid = get_euid_of_process(pid);
      
          printf("PID is in an ancestor namespace\n");
          if (owner_uid == uid) {
      	printf("And its effective UID matches the owner "
      		"of the namespace\n");
      	printf("Subject to LSM checks, PID has all capabilities in "
      		"that namespace!\n");
          } else {
      	printf("But its effective UID does not match the owner "
      		"of the namespace\n");
      	printf("Subject to LSM checks, it has the following capabilities\n");
      	show_cap(pid);
          }
      
          exit(EXIT_SUCCESS);
      }
      8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---
      
      Michael Kerrisk (2):
        nsfs: Add an ioctl() to return the namespace type
        nsfs: Add an ioctl() to return owner UID of a userns
      
       fs/nsfs.c                 | 13 +++++++++++++
       include/uapi/linux/nsfs.h |  9 +++++++--
       2 files changed, 20 insertions(+), 2 deletions(-)
    • nsfs: Add an ioctl() to return owner UID of a userns · d95fa3c7
      Michael Kerrisk (man-pages) authored
      I'd like to write code that discovers the user namespace hierarchy on a
      running system, and also shows who owns the various user namespaces.
      Currently, there is no way of getting the owner UID of a user namespace.
      Therefore, this patch adds a new NS_GET_OWNER_UID ioctl() that fetches
      the UID (as seen in the user namespace of the caller) of the creator of
      the user namespace referred to by the specified file descriptor.
      
      If the supplied file descriptor does not refer to a user namespace,
      the operation fails with the error EINVAL. If the owner UID does
      not have a mapping in the caller's user namespace, the operation
      returns the overflow UID, as that appears easier to deal with in
      practice in user-space applications.
      
      -- EWB Changed the handling of unmapped UIDs from -EOVERFLOW
         back to the overflow uid.  Per conversation with
         Michael Kerrisk after examining his test code.
      Acked-by: Andrey Vagin <avagin@openvz.org>
      Signed-off-by: Michael Kerrisk <mtk-manpages@gmail.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
  3. 01 Feb, 2017 3 commits
    • fs: Better permission checking for submounts · 93faccbb
      Eric W. Biederman authored
      To support unprivileged users mounting filesystems, two permission
      checks have to be performed: a test to see if the user is allowed to
      create a mount in the mount namespace, and a test to see if
      the user is allowed to access the specified filesystem.

      The automount case is special in that mounting the original filesystem
      grants permission to mount the sub-filesystems to any user who
      happens to stumble across their mountpoint and satisfies the
      ordinary filesystem permission checks.
      
      Attempting to handle the automount case by using override_creds
      almost works.  It preserves the idea that permission to mount
      the original filesystem is permission to mount the sub-filesystem.
      Unfortunately, using override_creds messes up the filesystem's
      ordinary permission checks.
      
      Solve this by being explicit that a mount is a submount by introducing
      vfs_submount, and using it where appropriate.
      
      vfs_submount uses a new internal mount flag, MS_SUBMOUNT, to let
      sget and friends know that a mount is a submount so they can take appropriate
      action.
      
      sget and sget_userns are modified to not perform any permission checks
      on submounts.
      
      follow_automount is modified to stop using override_creds as that
      has proven problematic.

      do_mount is modified to always remove the new MS_SUBMOUNT flag so
      that we know userspace will never be able to specify it.
      
      autofs4 is modified to stop using current_real_cred that was put in
      there to handle the previous version of submount permission checking.
      
      cifs is modified to pass the mountpoint all of the way down to vfs_submount.
      
      debugfs is modified to pass the mountpoint all of the way down to
      trace_automount by adding a new parameter.  To make this change easier
      a new typedef debugfs_automount_t is introduced to capture the type of
      the debugfs automount function.
      
      Cc: stable@vger.kernel.org
      Fixes: 069d5ac9 ("autofs:  Fix automounts by using current_real_cred()->uid")
      Fixes: aeaa4a79 ("fs: Call d_automount with the filesystems creds")
      Reviewed-by: Trond Myklebust <trond.myklebust@primarydata.com>
      Reviewed-by: Seth Forshee <seth.forshee@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction · c6c70f44
      Oleg Nesterov authored
      find_new_reaper() checks same_thread_group(reaper, child_reaper) to
      prevent the cross-namespace reparenting but this is not enough if the
      exiting parent was injected by setns() + fork().
      
      Suppose we have a process P in the root namespace and some namespace X.
      P does setns() to enter the X namespace, and forks the child C.
      C forks a grandchild G and exits.
      
      The grandchild G should be re-parented to X->child_reaper, but in this
      case the ->real_parent chain does not lead to ->child_reaper, so it will
      be wrongly reparented to P's sub-reaper or the global init.
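
A simplified userspace model of the reaper selection may help here. It assumes the fix works by only accepting sub-reaper ancestors that sit at the same PID-namespace nesting level as the exiting parent; that detail, along with the names `struct task`, `ns_level` and this `find_new_reaper` signature, is an assumption of the sketch and not taken from the message above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct task {
    struct task *real_parent;
    int ns_level;               /* nesting depth of the task's PID namespace */
    bool is_child_subreaper;
};

/*
 * Pick the new parent for the children of 'father'. Only ancestors at the
 * same PID-namespace level are considered as sub-reapers; otherwise fall
 * back to the namespace's own child reaper.
 */
static struct task *find_new_reaper(struct task *father,
                                    struct task *child_reaper)
{
    for (struct task *r = father->real_parent;
         r != NULL && r->ns_level == father->ns_level;
         r = r->real_parent) {
        if (r->is_child_subreaper)
            return r;
    }
    return child_reaper;
}
```

In the scenario from the message, C lives at X's level while its real parent P lives one level up, so the loop ends immediately and G goes to X's child reaper.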
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • vfs: open() with O_CREAT should not create inodes with unknown ids · 1328c727
      Seth Forshee authored
      may_create() rejects creation of inodes with ids which lack a
      mapping into s_user_ns. However, for O_CREAT may_o_create() is
      used instead. Add a similar check there.
      
      Fixes: 036d5236 ("vfs: Don't create inodes with a uid or gid unknown to the vfs")
      Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  4. 25 Jan, 2017 1 commit
    • nsfs: Add an ioctl() to return the namespace type · e5ff5ce6
      Michael Kerrisk (man-pages) authored
      Linux 4.9 added two ioctl() operations that can be used to discover:
      
      * the parental relationships for hierarchical namespaces (user and PID)
        [NS_GET_PARENT]
      * the user namespace that owns a specified non-user-namespace
        [NS_GET_USERNS]
      
      For no good reason that I can glean, NS_GET_USERNS was made synonymous
      with NS_GET_PARENT for user namespaces. It might have been better if
      NS_GET_USERNS had returned an error if the supplied file descriptor
      referred to a user namespace, since it suggests that the caller may be
      confused. More particularly, if it had generated an error, then I wouldn't
      need the new ioctl() operation proposed here. (On the other hand, what
      I propose here may be more generally useful.)
      
      I would like to write code that discovers namespace relationships for
      the purpose of understanding the namespace setup on a running system.
      In particular, given a file descriptor (or pathname) for a namespace,
      N, I'd like to obtain the corresponding user namespace.  Namespace N
      might be a user namespace (in which case my code would just use N) or
      a non-user namespace (in which case my code will use NS_GET_USERNS to
      get the user namespace associated with N). The problem is that there
      is no way to tell the difference by looking at the file descriptor
      (and if I try to use NS_GET_USERNS on an N that is a user namespace, I
      get the parent user namespace of N, which is not what I want).
      
      This patch therefore adds a new ioctl(), NS_GET_NSTYPE, which, given
      a file descriptor that refers to a namespace, returns the
      namespace type (one of the CLONE_NEW* constants).
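
A short sketch of how a caller might use the new operation, assuming a kernel and headers that provide it. The fallback `#define` mirrors the value in `<linux/nsfs.h>`; the helper names `ns_type_name` and `ns_type_of_path` are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/sched.h>    /* CLONE_NEW* constants */

/* Fallback for older headers; same value as in <linux/nsfs.h>. */
#ifndef NS_GET_NSTYPE
#define NS_GET_NSTYPE _IO(0xb7, 0x3)
#endif

/* Map a CLONE_NEW* value to a short name; "unknown" for anything else. */
static const char *ns_type_name(int nstype)
{
    switch (nstype) {
    case CLONE_NEWUSER: return "user";
    case CLONE_NEWPID:  return "pid";
    case CLONE_NEWNET:  return "net";
    case CLONE_NEWNS:   return "mnt";
    case CLONE_NEWUTS:  return "uts";
    case CLONE_NEWIPC:  return "ipc";
    default:            return "unknown";
    }
}

/* Return the namespace type of the nsfs file at 'path', or -1 on error. */
static int ns_type_of_path(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    int nstype = ioctl(fd, NS_GET_NSTYPE);
    close(fd);
    return nstype;
}
```

A caller could, for instance, pass `/proc/self/ns/user` and compare the result against `CLONE_NEWUSER` before deciding whether NS_GET_USERNS is needed.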
      Signed-off-by: Michael Kerrisk <mtk-manpages@gmail.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
  5. 23 Jan, 2017 6 commits
    • proc: Better ownership of files for non-dumpable tasks in user namespaces · 68eb94f1
      Eric W. Biederman authored
      Instead of making the files owned by the GLOBAL_ROOT_USER, make
      non-dumpable files whose mm has always lived in a user namespace owned
      by the user namespace root.  This allows the container root to have
      things work as expected in a container.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • exec: Remove LSM_UNSAFE_PTRACE_CAP · 9227dd2a
      Eric W. Biederman authored
      With previous changes every location that tests for
      LSM_UNSAFE_PTRACE_CAP also tests for LSM_UNSAFE_PTRACE making the
      LSM_UNSAFE_PTRACE_CAP redundant, so remove it.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • exec: Test the ptracer's saved cred to see if the tracee can gain caps · 20523132
      Eric W. Biederman authored
      Now that we have user namespaces and non-global capabilities, verify
      that the tracer has capabilities in the relevant user namespace instead
      of in the current_user_ns().
      
      As the test for setting LSM_UNSAFE_PTRACE_CAP is currently
      ptracer_capable(p, current_user_ns()) and the new task credentials are
      in current_user_ns() this change does not have any user visible change
      and simply moves the test to where it is used, making the code easier
      to read.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • exec: Don't reset euid and egid when the tracee has CAP_SETUID · 70169420
      Eric W. Biederman authored
      Don't reset euid and egid when the tracee has CAP_SETUID in
      its user namespace.  I punted on relaxing this permission check
      long ago, but now that I have read this code closely it is clear
      it is safe to test against CAP_SETUID in the user namespace.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • inotify: Convert to using per-namespace limits · 1cce1eea
      Nikolay Borisov authored
      This patchset converts inotify to using the newly introduced
      per-userns sysctl infrastructure.
      
      Currently the inotify instances/watches are accounted in the
      user_struct structure. This means that in setups where multiple
      users in unprivileged containers map to the same underlying
      real user (i.e. pointing to the same user_struct), the inotify limits
      are shared as well, allowing one user (or application) to exhaust
      the limits of all the others.
      
      Fix this by switching the inotify sysctls to using the
      per-namespace/per-user limits. This will allow the server admin to
      set sensible global limits, which can further be tuned inside every
      individual user namespace. Additionally, in order to preserve the
      sysctl ABI make the existing inotify instances/watches sysctls
      modify the values of the initial user namespace.
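
The layering of a global limit with per-namespace limits can be illustrated with a small hierarchical counter. This is a userspace sketch of the general idea, keyed by namespace only and without any locking; `struct ns_counter`, `inc_counter` and `dec_counter` are hypothetical names rather than the kernel's ucount API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct ns_counter {
    struct ns_counter *parent;  /* counter in the parent user namespace */
    int count;
    int max;                    /* per-namespace limit (the sysctl value) */
};

/* Charge one object to 'c' and every ancestor; all-or-nothing. */
static bool inc_counter(struct ns_counter *c)
{
    struct ns_counter *iter;

    for (iter = c; iter != NULL; iter = iter->parent) {
        if (iter->count >= iter->max)
            goto fail;
        iter->count++;
    }
    return true;

fail:
    /* Undo the charges made below the level that refused. */
    for (struct ns_counter *undo = c; undo != iter; undo = undo->parent)
        undo->count--;
    return false;
}

/* Release one object from 'c' and every ancestor. */
static void dec_counter(struct ns_counter *c)
{
    for (; c != NULL; c = c->parent) {
        assert(c->count > 0);
        c->count--;
    }
}
```

Because every charge is also applied to the ancestors, a limit set in the initial namespace caps the total, while a tighter limit inside a child namespace only affects that namespace.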
      Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Serge Hallyn <serge@hallyn.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • userns: Make ucounts lock irq-safe · 880a3854
      Nikolay Borisov authored
      The ucounts_lock is being used to protect various ucounts lifecycle
      management functionalities. However, those services can also be invoked
      when a pidns is being freed in an RCU callback (e.g. softirq context).
      This can lead to deadlocks. There were already efforts trying to
      prevent similar deadlocks in add7c65c ("pid: fix lockdep deadlock
      warning due to ucount_lock"), however they just moved the context
      from hardirq to softirq. Fix this issue once and for all by explicitly
      making the lock disable irqs altogether.
      
      Dmitry Vyukov <dvyukov@google.com> reported:
      
      > I've got the following deadlock report while running syzkaller fuzzer
      > on eec0d3d065bfcdf9cd5f56dd2a36b94d12d32297 of linux-next (on odroid
      > device if it matters):
      >
      > =================================
      > [ INFO: inconsistent lock state ]
      > 4.10.0-rc3-next-20170112-xc2-dirty #6 Not tainted
      > ---------------------------------
      > inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
      > swapper/2/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
      >  (ucounts_lock){+.?...}, at: [<     inline     >] spin_lock
      > ./include/linux/spinlock.h:302
      >  (ucounts_lock){+.?...}, at: [<ffff2000081678c8>]
      > put_ucounts+0x60/0x138 kernel/ucount.c:162
      > {SOFTIRQ-ON-W} state was registered at:
      > [<ffff2000081c82d8>] mark_lock+0x220/0xb60 kernel/locking/lockdep.c:3054
      > [<     inline     >] mark_irqflags kernel/locking/lockdep.c:2941
      > [<ffff2000081c97a8>] __lock_acquire+0x388/0x3260 kernel/locking/lockdep.c:3295
      > [<ffff2000081cce24>] lock_acquire+0xa4/0x138 kernel/locking/lockdep.c:3753
      > [<     inline     >] __raw_spin_lock ./include/linux/spinlock_api_smp.h:144
      > [<ffff200009798128>] _raw_spin_lock+0x90/0xd0 kernel/locking/spinlock.c:151
      > [<     inline     >] spin_lock ./include/linux/spinlock.h:302
      > [<     inline     >] get_ucounts kernel/ucount.c:131
      > [<ffff200008167c28>] inc_ucount+0x80/0x6c8 kernel/ucount.c:189
      > [<     inline     >] inc_mnt_namespaces fs/namespace.c:2818
      > [<ffff200008481850>] alloc_mnt_ns+0x78/0x3a8 fs/namespace.c:2849
      > [<ffff200008487298>] create_mnt_ns+0x28/0x200 fs/namespace.c:2959
      > [<     inline     >] init_mount_tree fs/namespace.c:3199
      > [<ffff200009bd6674>] mnt_init+0x258/0x384 fs/namespace.c:3251
      > [<ffff200009bd60bc>] vfs_caches_init+0x6c/0x80 fs/dcache.c:3626
      > [<ffff200009bb1114>] start_kernel+0x414/0x460 init/main.c:648
      > [<ffff200009bb01e8>] __primary_switched+0x6c/0x70 arch/arm64/kernel/head.S:456
      > irq event stamp: 2316924
      > hardirqs last  enabled at (2316924): [<     inline     >] rcu_do_batch
      > kernel/rcu/tree.c:2911
      > hardirqs last  enabled at (2316924): [<     inline     >]
      > invoke_rcu_callbacks kernel/rcu/tree.c:3182
      > hardirqs last  enabled at (2316924): [<     inline     >]
      > __rcu_process_callbacks kernel/rcu/tree.c:3149
      > hardirqs last  enabled at (2316924): [<ffff200008210414>]
      > rcu_process_callbacks+0x7a4/0xc28 kernel/rcu/tree.c:3166
      > hardirqs last disabled at (2316923): [<     inline     >] rcu_do_batch
      > kernel/rcu/tree.c:2900
      > hardirqs last disabled at (2316923): [<     inline     >]
      > invoke_rcu_callbacks kernel/rcu/tree.c:3182
      > hardirqs last disabled at (2316923): [<     inline     >]
      > __rcu_process_callbacks kernel/rcu/tree.c:3149
      > hardirqs last disabled at (2316923): [<ffff20000820fe80>]
      > rcu_process_callbacks+0x210/0xc28 kernel/rcu/tree.c:3166
      > softirqs last  enabled at (2316912): [<ffff20000811b4c4>]
      > _local_bh_enable+0x4c/0x80 kernel/softirq.c:155
      > softirqs last disabled at (2316913): [<     inline     >]
      > do_softirq_own_stack ./include/linux/interrupt.h:488
      > softirqs last disabled at (2316913): [<     inline     >]
      > invoke_softirq kernel/softirq.c:371
      > softirqs last disabled at (2316913): [<ffff20000811c994>]
      > irq_exit+0x264/0x308 kernel/softirq.c:405
      >
      > other info that might help us debug this:
      >  Possible unsafe locking scenario:
      >
      >        CPU0
      >        ----
      >   lock(ucounts_lock);
      >   <Interrupt>
      >     lock(ucounts_lock);
      >
      >  *** DEADLOCK ***
      >
      > 1 lock held by swapper/2/0:
      >  #0:  (rcu_callback){......}, at: [<     inline     >] __rcu_reclaim
      > kernel/rcu/rcu.h:108
      >  #0:  (rcu_callback){......}, at: [<     inline     >] rcu_do_batch
      > kernel/rcu/tree.c:2919
      >  #0:  (rcu_callback){......}, at: [<     inline     >]
      > invoke_rcu_callbacks kernel/rcu/tree.c:3182
      >  #0:  (rcu_callback){......}, at: [<     inline     >]
      > __rcu_process_callbacks kernel/rcu/tree.c:3149
      >  #0:  (rcu_callback){......}, at: [<ffff200008210390>]
      > rcu_process_callbacks+0x720/0xc28 kernel/rcu/tree.c:3166
      >
      > stack backtrace:
      > CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.10.0-rc3-next-20170112-xc2-dirty #6
      > Hardware name: Hardkernel ODROID-C2 (DT)
      > Call trace:
      > [<ffff20000808fa60>] dump_backtrace+0x0/0x440 arch/arm64/kernel/traps.c:500
      > [<ffff20000808fec0>] show_stack+0x20/0x30 arch/arm64/kernel/traps.c:225
      > [<ffff2000088a99e0>] dump_stack+0x110/0x168
      > [<ffff2000082fa2b4>] print_usage_bug.part.27+0x49c/0x4bc
      > kernel/locking/lockdep.c:2387
      > [<     inline     >] print_usage_bug kernel/locking/lockdep.c:2357
      > [<     inline     >] valid_state kernel/locking/lockdep.c:2400
      > [<     inline     >] mark_lock_irq kernel/locking/lockdep.c:2617
      > [<ffff2000081c89ec>] mark_lock+0x934/0xb60 kernel/locking/lockdep.c:3065
      > [<     inline     >] mark_irqflags kernel/locking/lockdep.c:2923
      > [<ffff2000081c9a60>] __lock_acquire+0x640/0x3260 kernel/locking/lockdep.c:3295
      > [<ffff2000081cce24>] lock_acquire+0xa4/0x138 kernel/locking/lockdep.c:3753
      > [<     inline     >] __raw_spin_lock ./include/linux/spinlock_api_smp.h:144
      > [<ffff200009798128>] _raw_spin_lock+0x90/0xd0 kernel/locking/spinlock.c:151
      > [<     inline     >] spin_lock ./include/linux/spinlock.h:302
      > [<ffff2000081678c8>] put_ucounts+0x60/0x138 kernel/ucount.c:162
      > [<ffff200008168364>] dec_ucount+0xf4/0x158 kernel/ucount.c:214
      > [<     inline     >] dec_pid_namespaces kernel/pid_namespace.c:89
      > [<ffff200008293dc8>] delayed_free_pidns+0x40/0xe0 kernel/pid_namespace.c:156
      > [<     inline     >] __rcu_reclaim kernel/rcu/rcu.h:118
      > [<     inline     >] rcu_do_batch kernel/rcu/tree.c:2919
      > [<     inline     >] invoke_rcu_callbacks kernel/rcu/tree.c:3182
      > [<     inline     >] __rcu_process_callbacks kernel/rcu/tree.c:3149
      > [<ffff2000082103d8>] rcu_process_callbacks+0x768/0xc28 kernel/rcu/tree.c:3166
      > [<ffff2000080821dc>] __do_softirq+0x324/0x6e0 kernel/softirq.c:284
      > [<     inline     >] do_softirq_own_stack ./include/linux/interrupt.h:488
      > [<     inline     >] invoke_softirq kernel/softirq.c:371
      > [<ffff20000811c994>] irq_exit+0x264/0x308 kernel/softirq.c:405
      > [<ffff2000081ecc28>] __handle_domain_irq+0xc0/0x150 kernel/irq/irqdesc.c:636
      > [<ffff200008081c80>] gic_handle_irq+0x68/0xd8
      > Exception stack(0xffff8000648e7dd0 to 0xffff8000648e7f00)
      > 7dc0:                                   ffff8000648d4b3c 0000000000000007
      > 7de0: 0000000000000000 1ffff0000c91a967 1ffff0000c91a967 1ffff0000c91a967
      > 7e00: ffff20000a4b6b68 0000000000000001 0000000000000007 0000000000000001
      > 7e20: 1fffe4000149ae90 ffff200009d35000 0000000000000000 0000000000000002
      > 7e40: 0000000000000000 0000000000000000 0000000002624a1a 0000000000000000
      > 7e60: 0000000000000000 ffff200009cbcd88 000060006d2ed000 0000000000000140
      > 7e80: ffff200009cff000 ffff200009cb6000 ffff200009cc2020 ffff200009d2159d
      > 7ea0: 0000000000000000 ffff8000648d4380 0000000000000000 ffff8000648e7f00
      > 7ec0: ffff20000820a478 ffff8000648e7f00 ffff20000820a47c 0000000010000145
      > 7ee0: 0000000000000140 dfff200000000000 ffffffffffffffff ffff20000820a478
      > [<ffff2000080837f8>] el1_irq+0xb8/0x130 arch/arm64/kernel/entry.S:486
      > [<     inline     >] arch_local_irq_restore
      > ./arch/arm64/include/asm/irqflags.h:81
      > [<ffff20000820a47c>] rcu_idle_exit+0x64/0xa8 kernel/rcu/tree.c:1030
      > [<     inline     >] cpuidle_idle_call kernel/sched/idle.c:200
      > [<ffff2000081bcbfc>] do_idle+0x1dc/0x2d0 kernel/sched/idle.c:243
      > [<ffff2000081bd1cc>] cpu_startup_entry+0x24/0x28 kernel/sched/idle.c:345
      > [<ffff200008099f8c>] secondary_start_kernel+0x2cc/0x358
      > arch/arm64/kernel/smp.c:276
      > [<000000000279f1a4>] 0x279f1a4
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Tested-by: Dmitry Vyukov <dvyukov@google.com>
      Fixes: add7c65c ("pid: fix lockdep deadlock warning due to ucount_lock")
      Fixes: f333c700 ("pidns: Add a limit on the number of pid namespaces")
      Cc: stable@vger.kernel.org
      Link: https://www.spinics.net/lists/kernel/msg2426637.html
      Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      880a3854
  6. 10 Jan, 2017 4 commits
    • Zhou Chengming's avatar
      sysctl: Drop reference added by grab_header in proc_sys_readdir · 93362fa4
      Zhou Chengming authored
      Fixes CVE-2016-9191: proc_sys_readdir doesn't drop the reference
      added by grab_header when returning from the !dir_emit_dots path.
      This can cause any path that calls unregister_sysctl_table to
      wait forever.
      
      The calltrace of CVE-2016-9191:
      
      [ 5535.960522] Call Trace:
      [ 5535.963265]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
      [ 5535.968817]  [<ffffffff817d33fb>] schedule_timeout+0x3db/0x6f0
      [ 5535.975346]  [<ffffffff817cf055>] ? wait_for_completion+0x45/0x130
      [ 5535.982256]  [<ffffffff817cf0d3>] wait_for_completion+0xc3/0x130
      [ 5535.988972]  [<ffffffff810d1fd0>] ? wake_up_q+0x80/0x80
      [ 5535.994804]  [<ffffffff8130de64>] drop_sysctl_table+0xc4/0xe0
      [ 5536.001227]  [<ffffffff8130de17>] drop_sysctl_table+0x77/0xe0
      [ 5536.007648]  [<ffffffff8130decd>] unregister_sysctl_table+0x4d/0xa0
      [ 5536.014654]  [<ffffffff8130deff>] unregister_sysctl_table+0x7f/0xa0
      [ 5536.021657]  [<ffffffff810f57f5>] unregister_sched_domain_sysctl+0x15/0x40
      [ 5536.029344]  [<ffffffff810d7704>] partition_sched_domains+0x44/0x450
      [ 5536.036447]  [<ffffffff817d0761>] ? __mutex_unlock_slowpath+0x111/0x1f0
      [ 5536.043844]  [<ffffffff81167684>] rebuild_sched_domains_locked+0x64/0xb0
      [ 5536.051336]  [<ffffffff8116789d>] update_flag+0x11d/0x210
      [ 5536.057373]  [<ffffffff817cf61f>] ? mutex_lock_nested+0x2df/0x450
      [ 5536.064186]  [<ffffffff81167acb>] ? cpuset_css_offline+0x1b/0x60
      [ 5536.070899]  [<ffffffff810fce3d>] ? trace_hardirqs_on+0xd/0x10
      [ 5536.077420]  [<ffffffff817cf61f>] ? mutex_lock_nested+0x2df/0x450
      [ 5536.084234]  [<ffffffff8115a9f5>] ? css_killed_work_fn+0x25/0x220
      [ 5536.091049]  [<ffffffff81167ae5>] cpuset_css_offline+0x35/0x60
      [ 5536.097571]  [<ffffffff8115aa2c>] css_killed_work_fn+0x5c/0x220
      [ 5536.104207]  [<ffffffff810bc83f>] process_one_work+0x1df/0x710
      [ 5536.110736]  [<ffffffff810bc7c0>] ? process_one_work+0x160/0x710
      [ 5536.117461]  [<ffffffff810bce9b>] worker_thread+0x12b/0x4a0
      [ 5536.123697]  [<ffffffff810bcd70>] ? process_one_work+0x710/0x710
      [ 5536.130426]  [<ffffffff810c3f7e>] kthread+0xfe/0x120
      [ 5536.135991]  [<ffffffff817d4baf>] ret_from_fork+0x1f/0x40
      [ 5536.142041]  [<ffffffff810c3e80>] ? kthread_create_on_node+0x230/0x230
      
      One cgroup maintainer mentioned that "cgroup is trying to offline
      a cpuset css, which takes place under cgroup_mutex.  The offlining
      ends up trying to drain active usages of a sysctl table which apparently
      is not happening."
      The real reason is that proc_sys_readdir doesn't drop the reference added
      by grab_header when returning from the !dir_emit_dots path. So this cpuset
      offline path will wait here forever.
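
The leak pattern described above can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: `struct header`, `grab()`, `finish()`, and both `readdir_*` functions are hypothetical stand-ins, not the kernel's `grab_header` or `proc_sys_readdir` code.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a sysctl table header's use count. */
struct header {
    int used;
};

/* Take and drop a reference; both names are illustrative only. */
void grab(struct header *h)   { h->used++; }
void finish(struct header *h) { h->used--; }

/* Buggy shape: the early return skips finish(), leaking one reference. */
int readdir_leaky(struct header *h, bool dots_ok)
{
    grab(h);
    if (!dots_ok)
        return 0;           /* reference taken by grab() is never dropped */
    /* ... entries would be emitted here ... */
    finish(h);
    return 0;
}

/* Fixed shape: every exit path funnels through the single release point. */
int readdir_fixed(struct header *h, bool dots_ok)
{
    grab(h);
    if (!dots_ok)
        goto out;
    /* ... entries would be emitted here ... */
out:
    finish(h);
    return 0;
}
```

In this model, anything that waits for `used` to reach zero never makes progress once `readdir_leaky()` has taken the early return, which mirrors the hang in the call trace.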
      
      See here for details: http://www.openwall.com/lists/oss-security/2016/11/04/13
      
      Fixes: f0c3b509 ("[readdir] convert procfs")
      Cc: stable@vger.kernel.org
      Reported-by: CAI Qian <caiqian@redhat.com>
      Tested-by: Yang Shukui <yangshukui@huawei.com>
      Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com>
      Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      93362fa4
    • Andrei Vagin's avatar
      pid: fix lockdep deadlock warning due to ucount_lock · add7c65c
      Andrei Vagin authored
      =========================================================
      [ INFO: possible irq lock inversion dependency detected ]
      4.10.0-rc2-00024-g4aecec9-dirty #118 Tainted: G        W
      ---------------------------------------------------------
      swapper/1/0 just changed the state of lock:
       (&(&sighand->siglock)->rlock){-.....}, at: [<ffffffffbd0a1bc6>] __lock_task_sighand+0xb6/0x2c0
      but this lock took another, HARDIRQ-unsafe lock in the past:
       (ucounts_lock){+.+...}
      and interrupts could create inverse lock ordering between them.
      other info that might help us debug this:
      Chain exists of:                 &(&sighand->siglock)->rlock --> &(&tty->ctrl_lock)->rlock --> ucounts_lock
       Possible interrupt unsafe locking scenario:
             CPU0                    CPU1
             ----                    ----
        lock(ucounts_lock);
                                     local_irq_disable();
                                     lock(&(&sighand->siglock)->rlock);
                                     lock(&(&tty->ctrl_lock)->rlock);
        <Interrupt>
          lock(&(&sighand->siglock)->rlock);
      
       *** DEADLOCK ***
      
      This patch removes the dependency between the siglock rlock and ucounts_lock.
      
      Fixes: f333c700 ("pidns: Add a limit on the number of pid namespaces")
      Cc: stable@vger.kernel.org
      Signed-off-by: Andrei Vagin <avagin@openvz.org>
      Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      add7c65c
    • Eric W. Biederman's avatar
      libfs: Modify mount_pseudo_xattr to be clear it is not a userspace mount · 75422726
      Eric W. Biederman authored
      Add MS_KERNMOUNT to the flags that are passed.
      Use sget_userns and force &init_user_ns instead of calling sget so that
      even if called from a weird context the internal filesystem will be
      considered to be in the initial user namespace.
      
      Luis Ressel reported that the failure to pass MS_KERNMOUNT into
      mount_pseudo broke his in-development graphics driver that uses the
      generic drm infrastructure.  I am not certain the driver was bug
      free in its usage of that infrastructure, but since
      mount_pseudo_xattr can never be triggered by userspace it is clearer,
      less error prone, and less problematic for the code to be explicit.
      Reported-by: Luis Ressel <aranea@aixah.de>
      Tested-by: Luis Ressel <aranea@aixah.de>
      Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      75422726
    • Eric W. Biederman's avatar
      mnt: Protect the mountpoint hashtable with mount_lock · 3895dbf8
      Eric W. Biederman authored
      Protecting the mountpoint hashtable with namespace_sem was sufficient
      until a call to umount_mnt was added to mntput_no_expire, at which
      point it became possible for multiple calls of put_mountpoint on
      the same hash chain to happen at the same time.
      
      Krister Johansen <kjlx@templeofstupid.com> reported:
      > This can cause a panic when simultaneous callers of put_mountpoint
      > attempt to free the same mountpoint.  This occurs because some callers
      > hold the mount_hash_lock, while others hold the namespace lock.  Some
      > even hold both.
      >
      > In this submitter's case, the panic manifested itself as a GP fault in
      > put_mountpoint() when it called hlist_del() and attempted to dereference
      > a m_hash.pprev that had been poisoned by another thread.
      
      Al Viro observed that the simple fix is to switch from using the namespace_sem
      to the mount_lock to protect the mountpoint hash table.
      
      I have taken Al's suggested patch, moved put_mountpoint in pivot_root
      (instead of taking mount_lock an additional time), and replaced
      new_mountpoint with get_mountpoint, a function that does the hash table
      lookup and addition under the mount_lock.  The introduction of get_mountpoint
      ensures that only the mount_lock is needed to manipulate the mountpoint
      hashtable.
      
      d_set_mounted is modified to only set DCACHE_MOUNTED if it is not
      already set.  This allows get_mountpoint to use the setting of
      DCACHE_MOUNTED to ensure adding a struct mountpoint for a dentry
      happens exactly once.
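
The "exactly once" property above comes from a set-if-not-already-set operation that reports whether the caller was the one who set the flag. A minimal userspace sketch follows. `FLAG_MOUNTED` and `set_mounted_once()` are hypothetical names, and the sketch swaps in a C11 atomic fetch-or in place of whatever locking the kernel's d_set_mounted actually uses.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for DCACHE_MOUNTED. */
#define FLAG_MOUNTED 0x1u

/*
 * Set the flag only if it is not already set.
 * Returns true for the single caller that performed the transition,
 * false for every caller that found the flag already set.
 */
bool set_mounted_once(atomic_uint *flags)
{
    unsigned int old = atomic_fetch_or(flags, FLAG_MOUNTED);
    return !(old & FLAG_MOUNTED);
}
```

A caller that gets `true` back would go on to add the per-dentry structure; a caller that gets `false` knows another caller already did, or is doing, that work.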
      
      Cc: stable@vger.kernel.org
      Fixes: ce07d891 ("mnt: Honor MNT_LOCKED when detaching mounts")
      Reported-by: Krister Johansen <kjlx@templeofstupid.com>
      Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
      Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      3895dbf8
  7. 01 Jan, 2017 2 commits
    • Linus Torvalds's avatar
      Linux 4.10-rc2 · 0c744ea4
      Linus Torvalds authored
      0c744ea4
    • Linus Torvalds's avatar
      Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm · 4759d386
      Linus Torvalds authored
      Pull DAX updates from Dan Williams:
       "The completion of Jan's DAX work for 4.10.
      
        As I mentioned in the libnvdimm-for-4.10 pull request, these are some
        final fixes for the DAX dirty-cacheline-tracking invalidation work
        that was merged through the -mm, ext4, and xfs trees in -rc1. These
        patches were prepared prior to the merge window, but we waited for
        4.10-rc1 to have a stable merge base after all the prerequisites were
        merged.
      
        Quoting Jan on the overall changes in these patches:
      
           "So I'd like all these 6 patches to go for rc2. The first three
            patches fix invalidation of exceptional DAX entries (a bug which
            is there for a long time) - without these patches data loss can
            occur on power failure even though user called fsync(2). The other
            three patches change locking of DAX faults so that ->iomap_begin()
            is called in a more relaxed locking context and we are safe to
            start a transaction there for ext4"
      
        These have received a build success notification from the kbuild
        robot, and pass the latest libnvdimm unit tests. There have not been
        any -next releases since -rc1, so they have not appeared there"
      
      * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
        ext4: Simplify DAX fault path
        dax: Call ->iomap_begin without entry lock during dax fault
        dax: Finish fault completely when loading holes
        dax: Avoid page invalidation races and unnecessary radix tree traversals
        mm: Invalidate DAX radix tree entries only if appropriate
        ext2: Return BH_New buffers for zeroed blocks
      4759d386
  8. 30 Dec, 2016 2 commits
  9. 29 Dec, 2016 2 commits
    • Olof Johansson's avatar
      mm/filemap: fix parameters to test_bit() · 98473f9f
      Olof Johansson authored
       mm/filemap.c: In function 'clear_bit_unlock_is_negative_byte':
        mm/filemap.c:933:9: error: too few arguments to function 'test_bit'
          return test_bit(PG_waiters);
               ^~~~~~~~
      
      Fixes: b91e1302 ('mm: optimize PageWaiters bit use for unlock_page()')
      Signed-off-by: Olof Johansson <olof@lixom.net>
      Brown-paper-bag-by: Linus Torvalds <dummy@duh.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      98473f9f
    • Linus Torvalds's avatar
      mm: optimize PageWaiters bit use for unlock_page() · b91e1302
      Linus Torvalds authored
      In commit 62906027 ("mm: add PageWaiters indicating tasks are
      waiting for a page bit") Nick Piggin made our page locking no longer
      unconditionally touch the hashed page waitqueue, which not only helps
      performance in general, but is particularly helpful on NUMA machines
      where the hashed wait queues can bounce around a lot.
      
      However, the "clear lock bit atomically and then test the waiters bit"
      sequence turns out to be much more expensive than it needs to be,
      because you get a nasty stall when trying to access the same word that
      just got updated atomically.
      
      On architectures where locking is done with LL/SC, this would be trivial
      to fix with a new primitive that clears one bit and tests another
      atomically, but that ends up not working on x86, where the only atomic
      operations that return the result end up being cmpxchg and xadd.  The
      atomic bit operations return the old value of the same bit we changed,
      not the value of an unrelated bit.
      
      On x86, we could put the lock bit in the high bit of the byte, and use
      "xadd" with that bit (where the overflow ends up not touching other
      bits), and look at the other bits of the result.  However, an even
      simpler model is to just use a regular atomic "and" to clear the lock
      bit, and then the sign bit in eflags will indicate the resulting state
      of the unrelated bit #7.
      
      So by moving the PageWaiters bit up to bit #7, we can atomically clear
      the lock bit and test the waiters bit on x86 too.  And on architectures
      with LL/SC (which is all the usual RISC suspects), the particular bit
      doesn't matter, so they are fine with this approach too.
      
      This avoids the extra access to the same atomic word, and thus avoids
      the costly stall at page unlock time.
      
      The only downside is that the interface ends up being a bit odd and
      specialized: clear a bit in a byte, and test the sign bit.  Nick doesn't
      love the resulting name of the new primitive, but I'd rather make the
      name be descriptive and very clear about the limitation imposed by
      trying to work across all relevant architectures than make it be some
      generic thing that doesn't make the odd semantics explicit.
      
      So this introduces the new architecture primitive
      
          clear_bit_unlock_is_negative_byte();
      
      and adds the trivial implementation for x86.  We have a generic
      non-optimized fallback (that just does a "clear_bit()"+"test_bit(7)"
      combination) which can be overridden by any architecture that can do
      better.  According to Nick, Power has the same hiccup x86 has, for
      example, but some other architectures may not even care.
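
The two strategies described above can be modelled in userspace with C11 atomics. This is only a sketch: `LOCK_BIT`, `WAITERS_BIT`, and both function names are hypothetical, and the x86 version described in this commit reads the sign flag left by an atomic "and" rather than inspecting a returned value as done here.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit layout for this sketch; only "waiters at bit 7" comes from the text. */
#define LOCK_BIT     0
#define WAITERS_BIT  7

/*
 * Model of the generic fallback: clear the bit, then test bit 7 with a
 * second access to the word that was just updated atomically.
 */
bool clear_then_test(_Atomic unsigned char *flags, unsigned int nr)
{
    unsigned char mask = (unsigned char)~(1u << nr);

    atomic_fetch_and_explicit(flags, mask, memory_order_release);
    return (atomic_load_explicit(flags, memory_order_relaxed) >> WAITERS_BIT) & 1;
}

/*
 * Model of the single-access variant: one atomic AND clears the bit, and
 * bit 7 of the resulting byte (its sign bit) is computed from the value
 * that the same operation returned. nr must not be 7.
 */
bool clear_and_test_sign(_Atomic unsigned char *flags, unsigned int nr)
{
    unsigned char mask = (unsigned char)~(1u << nr);
    unsigned char old = atomic_fetch_and_explicit(flags, mask, memory_order_release);

    return ((old & mask) & 0x80) != 0;
}
```

Both functions return the same answer when no other thread touches the byte; the difference the commit cares about is that the second form never goes back to the word after the atomic update.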
      
      All these optimizations mean that my page locking stress-test (which is
      just executing a lot of small short-lived shell scripts: "make test" in
      the git source tree) no longer makes our page locking look horribly bad.
      Before all these optimizations, just the unlock_page() costs were just
      over 3% of all CPU overhead on "make test".  After this, it's down to
      0.66%, so just a quarter of the cost it used to be.
      
      (The difference on NUMA is bigger, but there this micro-optimization is
      likely less noticeable, since the big issue on NUMA was not the accesses
      to 'struct page', but the waitqueue accesses that were already removed
      by Nick's earlier commit).
      Acked-by: Nick Piggin <npiggin@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Andrew Lutomirski <luto@kernel.org>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b91e1302
  10. 28 Dec, 2016 2 commits
    • Linus Torvalds's avatar
      Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 · 2d706e79
      Linus Torvalds authored
      Pull crypto fix from Herbert Xu:
       "This fixes a hash corruption bug in the marvell driver"
      
      * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
        crypto: marvell - Copy IVDIG before launching partial DMA ahash requests
      2d706e79
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 8f18e4d0
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Various ipvlan fixes from Eric Dumazet and Mahesh Bandewar.
      
          The most important is to not assume the packet is RX just because
          the destination address matches that of the device. Such an
          assumption causes problems when an interface is put into loopback
          mode.
      
       2) If we retry when creating a new tc entry (because we dropped the
          RTNL mutex in order to load a module, for example) we end up with
          -EAGAIN and then loop trying to replay the request. But we didn't
          reset some state when looping back to the top like this, and if
          another thread meanwhile inserted the same tc entry we were trying
          to, we re-link it creating an endless loop in the tc chain. Fix from
          Daniel Borkmann.
      
       3) There are two different WRITE bits in the MDIO address register for
          the stmmac chip, depending upon the chip variant. Due to a bug we
          could set them both, fix from Hock Leong Kweh.
      
       4) Fix mlx4 bug in XDP_TX handling, from Tariq Toukan.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
        net: stmmac: fix incorrect bit set in gmac4 mdio addr register
        r8169: add support for RTL8168 series add-on card.
        net: xdp: remove unused bfp_warn_invalid_xdp_buffer()
        openvswitch: upcall: Fix vlan handling.
        ipv4: Namespaceify tcp_tw_reuse knob
        net: korina: Fix NAPI versus resources freeing
        net, sched: fix soft lockup in tc_classify
        net/mlx4_en: Fix user prio field in XDP forward
        tipc: don't send FIN message from connectionless socket
        ipvlan: fix multicast processing
        ipvlan: fix various issues in ipvlan_process_multicast()
      8f18e4d0
  11. 27 Dec, 2016 12 commits