1. 18 Jan, 2020 2 commits
    • Aleksa Sarai's avatar
      selftests: add openat2(2) selftests · b28a10ae
      Aleksa Sarai authored
      Test all of the various openat2(2) flags. A small stress-test of a
      symlink-rename attack is included to show that the protections against
      ".."-based attacks are sufficient.
      
      The main things these self-tests are enforcing are:
      
        * The struct+usize ABI for openat2(2) and copy_struct_from_user() to
          ensure that upgrades will be handled gracefully (in addition,
          ensuring that misaligned structures are also handled correctly).
      
        * The -EINVAL checks for openat2(2) are all correctly handled to avoid
          userspace passing unknown or conflicting flag sets (most
          importantly, ensuring that invalid flag combinations are checked).
      
        * All of the RESOLVE_* semantics (including errno values) are
          correctly handled with various combinations of paths and flags.
      
        * RESOLVE_IN_ROOT correctly protects against the symlink rename(2)
          attack that has been responsible for several CVEs (and likely will
          be responsible for several more).
      
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: default avatarAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      b28a10ae
    • Aleksa Sarai's avatar
      open: introduce openat2(2) syscall · fddb5d43
      Aleksa Sarai authored
      /* Background. */
      For a very long time, extending openat(2) with new features has been
      incredibly frustrating. This stems from the fact that openat(2) is
      possibly the most famous counter-example to the mantra "don't silently
      accept garbage from userspace" -- it doesn't check whether unknown flags
      are present[1].
      
      This means that (generally) the addition of new flags to openat(2) has
      been fraught with backwards-compatibility issues (O_TMPFILE has to be
      defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
      kernels gave errors, since it's insecure to silently ignore the
      flag[2]). All new security-related flags therefore have a tough road to
      being added to openat(2).
      
      Userspace also has a hard time figuring out whether a particular flag is
      supported on a particular kernel. While it is now possible with
      contemporary kernels (thanks to [3]), older kernels will expose unknown
      flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during
      openat(2) time matches modern syscall designs and is far more
      fool-proof.
      
      In addition, the newly-added path resolution restriction LOOKUP flags
      (which we would like to expose to user-space) don't feel related to the
      pre-existing O_* flag set -- they affect all components of path lookup.
      We'd therefore like to add a new flag argument.
      
      Adding a new syscall allows us to finally fix the flag-ignoring problem,
      and we can make it extensible enough so that we will hopefully never
      need an openat3(2).
      
      /* Syscall Prototype. */
        /*
         * open_how is an extensible structure (similar in interface to
         * clone3(2) or sched_setattr(2)). The size parameter must be set to
         * sizeof(struct open_how), to allow for future extensions. All future
         * extensions will be appended to open_how, with their zero value
         * acting as a no-op default.
         */
        struct open_how { /* ... */ };
      
        int openat2(int dfd, const char *pathname,
                    struct open_how *how, size_t size);
      
      /* Description. */
      The initial version of 'struct open_how' contains the following fields:
      
        flags
          Used to specify openat(2)-style flags. However, any unknown flag
          bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR)
          will result in -EINVAL. In addition, this field is 64-bits wide to
          allow for more O_ flags than currently permitted with openat(2).
      
        mode
          The file mode for O_CREAT or O_TMPFILE.
      
          Must be set to zero if flags does not contain O_CREAT or O_TMPFILE.
      
        resolve
          Restrict path resolution (in contrast to O_* flags they affect all
          path components). The current set of flags are as follows (at the
          moment, all of the RESOLVE_ flags are implemented as just passing
          the corresponding LOOKUP_ flag).
      
          RESOLVE_NO_XDEV       => LOOKUP_NO_XDEV
          RESOLVE_NO_SYMLINKS   => LOOKUP_NO_SYMLINKS
          RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS
          RESOLVE_BENEATH       => LOOKUP_BENEATH
          RESOLVE_IN_ROOT       => LOOKUP_IN_ROOT
      
      open_how does not contain an embedded size field, because it is of
      little benefit (userspace can figure out the kernel open_how size at
      runtime fairly easily without it). It also only contains u64s (even
      though ->mode arguably should be a u16) to avoid having padding fields
      which are never used in the future.
      
      Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE
      is no longer permitted for openat(2). As far as I can tell, this has
      always been a bug and appears to not be used by userspace (and I've not
      seen any problems on my machines by disallowing it). If it turns out
      this breaks something, we can special-case it and only permit it for
      openat(2) but not openat2(2).
      
      After input from Florian Weimer, the new open_how and flag definitions
      are inside a separate header from uapi/linux/fcntl.h, to avoid problems
      that glibc has with importing that header.
      
      /* Testing. */
      In a follow-up patch there are over 200 selftests which ensure that this
      syscall has the correct semantics and will correctly handle several
      attack scenarios.
      
      In addition, I've written a userspace library[4] which provides
      convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary
      because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care
      must be taken when using RESOLVE_IN_ROOT'd file descriptors with other
      syscalls). During the development of this patch, I've run numerous
      verification tests using libpathrs (showing that the API is reasonably
      usable by userspace).
      
      /* Future Work. */
      Additional RESOLVE_ flags have been suggested during the review period.
      These can be easily implemented separately (such as blocking auto-mount
      during resolution).
      
      Furthermore, there are some other proposed changes to the openat(2)
      interface (the most obvious example is magic-link hardening[5]) which
      would be a good opportunity to add a way for userspace to restrict how
      O_PATH file descriptors can be re-opened.
      
      Another possible avenue of future work would be some kind of
      CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace
      which openat2(2) flags and fields are supported by the current kernel
      (to avoid userspace having to go through several guesses to figure it
      out).
      
      [1]: https://lwn.net/Articles/588444/
      [2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
      [3]: commit 629e014b ("fs: completely ignore unknown open flags")
      [4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523
      [5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/
      [6]: https://youtu.be/ggD-eb3yPVsSuggested-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: default avatarAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      fddb5d43
  2. 09 Dec, 2019 10 commits
    • Aleksa Sarai's avatar
      namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution · ab87f9a5
      Aleksa Sarai authored
      Allow LOOKUP_BENEATH and LOOKUP_IN_ROOT to safely permit ".." resolution
      (in the case of LOOKUP_BENEATH the resolution will still fail if ".."
      resolution would resolve a path outside of the root -- while
      LOOKUP_IN_ROOT will chroot(2)-style scope it). Magic-link jumps are
      still disallowed entirely[*].
      
      As Jann explains[1,2], the need for this patch (and the original no-".."
      restriction) is explained by observing there is a fairly easy-to-exploit
      race condition with chroot(2) (and thus by extension LOOKUP_IN_ROOT and
      LOOKUP_BENEATH if ".." is allowed) where a rename(2) of a path can be
      used to "skip over" nd->root and thus escape to the filesystem above
      nd->root.
      
        thread1 [attacker]:
          for (;;)
            renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
        thread2 [victim]:
          for (;;)
            openat2(dirb, "b/c/../../etc/shadow",
                    { .flags = O_PATH, .resolve = RESOLVE_IN_ROOT } );
      
      With fairly significant regularity, thread2 will resolve to
      "/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar
      (though somewhat more privileged) attack using MS_MOVE.
      
      With this patch, such cases will be detected *during* ".." resolution
      and will return -EAGAIN for userspace to decide to either retry or abort
      the lookup. It should be noted that ".." is the weak point of chroot(2)
      -- walking *into* a subdirectory tautologically cannot result in you
      walking *outside* nd->root (except through a bind-mount or magic-link).
      There is also no other way for a directory's parent to change (which is
      the primary worry with ".." resolution here) other than a rename or
      MS_MOVE.
      
      The primary reason for deferring to userspace with -EAGAIN is that an
      in-kernel retry loop (or doing a path_is_under() check after re-taking
      the relevant seqlocks) can become unreasonably expensive on machines
      with lots of VFS activity (nfsd can cause lots of rename_lock updates).
      Thus it should be up to userspace how many times they wish to retry the
      lookup -- the selftests for this attack indicate that there is a ~35%
      chance of the lookup succeeding on the first try even with an attacker
      thrashing rename_lock.
      
      A variant of the above attack is included in the selftests for
      openat2(2) later in this patch series. I've run this test on several
      machines for several days and no instances of a breakout were detected.
      While this is not concrete proof that this is safe, when combined with
      the above argument it should lend some trustworthiness to this
      construction.
      
      [*] It may be acceptable in the future to do a path_is_under() check for
          magic-links after they are resolved. However this seems unlikely to
          be a feature that people *really* need -- it can be added later if
          it turns out a lot of people want it.
      
      [1]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6xa53w@mail.gmail.com/
      [2]: https://lore.kernel.org/lkml/CAG48ez30WJhbsro2HOc_DR7V91M+hNFzBP5ogRMZaxbAORvqzg@mail.gmail.com/
      
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Suggested-by: default avatarJann Horn <jannh@google.com>
      Suggested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      ab87f9a5
    • Aleksa Sarai's avatar
      namei: LOOKUP_IN_ROOT: chroot-like scoped resolution · 8db52c7e
      Aleksa Sarai authored
      /* Background. */
      Container runtimes or other administrative management processes will
      often interact with root filesystems while in the host mount namespace,
      because the cost of doing a chroot(2) on every operation is too
      prohibitive (especially in Go, which cannot safely use vfork). However,
      a malicious program can trick the management process into doing
      operations on files outside of the root filesystem through careful
      crafting of symlinks.
      
      Most programs that need this feature have attempted to make this process
      safe, by doing all of the path resolution in userspace (with symlinks
      being scoped to the root of the malicious root filesystem).
      Unfortunately, this method is prone to foot-guns and usually such
      implementations have subtle security bugs.
      
      Thus, what userspace needs is a way to resolve a path as though it were
      in a chroot(2) -- with all absolute symlinks being resolved relative to
      the dirfd root (and ".." components being stuck under the dirfd root).
      It is much simpler and more straight-forward to provide this
      functionality in-kernel (because it can be done far more cheaply and
      correctly).
      
      More classical applications that also have this problem (which have
      their own potentially buggy userspace path sanitisation code) include
      web servers, archive extraction tools, network file servers, and so on.
      
      /* Userspace API. */
      LOOKUP_IN_ROOT will be exposed to userspace through openat2(2).
      
      /* Semantics. */
      Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
      LOOKUP_IN_ROOT applies to all components of the path.
      
      With LOOKUP_IN_ROOT, any path component which attempts to cross the
      starting point of the pathname lookup (the dirfd passed to openat) will
      remain at the starting point. Thus, all absolute paths and symlinks will
      be scoped within the starting point.
      
      There is a slight change in behaviour regarding pathnames -- if the
      pathname is absolute then the dirfd is still used as the root of
      resolution of LOOKUP_IN_ROOT is specified (this is to avoid obvious
      foot-guns, at the cost of a minor API inconsistency).
      
      As with LOOKUP_BENEATH, Jann's security concern about ".."[1] applies to
      LOOKUP_IN_ROOT -- therefore ".." resolution is blocked. This restriction
      will be lifted in a future patch, but requires more work to ensure that
      permitting ".." is done safely.
      
      Magic-link jumps are also blocked, because they can beam the path lookup
      across the starting point. It would be possible to detect and block
      only the "bad" crossings with path_is_under() checks, but it's unclear
      whether it makes sense to permit magic-links at all. However, userspace
      is recommended to pass LOOKUP_NO_MAGICLINKS if they want to ensure that
      magic-link crossing is entirely disabled.
      
      /* Testing. */
      LOOKUP_IN_ROOT is tested as part of the openat2(2) selftests.
      
      [1]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6xa53w@mail.gmail.com/
      
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: default avatarAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      8db52c7e
    • Aleksa Sarai's avatar
      namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution · adb21d2b
      Aleksa Sarai authored
      /* Background. */
      There are many circumstances when userspace wants to resolve a path and
      ensure that it doesn't go outside of a particular root directory during
      resolution. Obvious examples include archive extraction tools, as well as
      other security-conscious userspace programs. FreeBSD spun out O_BENEATH
      from their Capsicum project[1,2], so it also seems reasonable to
      implement similar functionality for Linux.
      
      This is part of a refresh of Al's AT_NO_JUMPS patchset[3] (which was a
      variation on David Drysdale's O_BENEATH patchset[4], which in turn was
      based on the Capsicum project[5]).
      
      /* Userspace API. */
      LOOKUP_BENEATH will be exposed to userspace through openat2(2).
      
      /* Semantics. */
      Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
      LOOKUP_BENEATH applies to all components of the path.
      
      With LOOKUP_BENEATH, any path component which attempts to "escape" the
      starting point of the filesystem lookup (the dirfd passed to openat)
      will yield -EXDEV. Thus, all absolute paths and symlinks are disallowed.
      
      Due to a security concern brought up by Jann[6], any ".." path
      components are also blocked. This restriction will be lifted in a future
      patch, but requires more work to ensure that permitting ".." is done
      safely.
      
      Magic-link jumps are also blocked, because they can beam the path lookup
      across the starting point. It would be possible to detect and block
      only the "bad" crossings with path_is_under() checks, but it's unclear
      whether it makes sense to permit magic-links at all. However, userspace
      is recommended to pass LOOKUP_NO_MAGICLINKS if they want to ensure that
      magic-link crossing is entirely disabled.
      
      /* Testing. */
      LOOKUP_BENEATH is tested as part of the openat2(2) selftests.
      
      [1]: https://reviews.freebsd.org/D2808
      [2]: https://reviews.freebsd.org/D17547
      [3]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/
      [4]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@google.com/
      [5]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@google.com/
      [6]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6xa53w@mail.gmail.com/
      
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Suggested-by: default avatarDavid Drysdale <drysdale@google.com>
      Suggested-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Suggested-by: default avatarAndy Lutomirski <luto@kernel.org>
      Suggested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      adb21d2b
    • Aleksa Sarai's avatar
      namei: LOOKUP_NO_XDEV: block mountpoint crossing · 72ba2929
      Aleksa Sarai authored
      /* Background. */
      The need to contain path operations within a mountpoint has been a
      long-standing usecase that userspace has historically implemented
      manually with liberal usage of stat(). find, rsync, tar and
      many other programs implement these semantics -- but it'd be much
      simpler to have a fool-proof way of refusing to open a path if it
      crosses a mountpoint.
      
      This is part of a refresh of Al's AT_NO_JUMPS patchset[1] (which was a
      variation on David Drysdale's O_BENEATH patchset[2], which in turn was
      based on the Capsicum project[3]).
      
      /* Userspace API. */
      LOOKUP_NO_XDEV will be exposed to userspace through openat2(2).
      
      /* Semantics. */
      Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
      LOOKUP_NO_XDEV applies to all components of the path.
      
      With LOOKUP_NO_XDEV, any path component which crosses a mount-point
      during path resolution (including "..") will yield an -EXDEV. Absolute
      paths, absolute symlinks, and magic-links will only yield an -EXDEV if
      the jump involved changing mount-points.
      
      /* Testing. */
      LOOKUP_NO_XDEV is tested as part of the openat2(2) selftests.
      
      [1]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/
      [2]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@google.com/
      [3]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@google.com/
      
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Suggested-by: default avatarDavid Drysdale <drysdale@google.com>
      Suggested-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Suggested-by: default avatarAndy Lutomirski <luto@kernel.org>
      Suggested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      72ba2929
    • Aleksa Sarai's avatar
      namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution · 4b99d499
      Aleksa Sarai authored
      /* Background. */
      There has always been a special class of symlink-like objects in procfs
      (and a few other pseudo-filesystems) which allow for non-lexical
      resolution of paths using nd_jump_link(). These "magic-links" do not
      follow traditional mount namespace boundaries, and have been used
      consistently in container escape attacks because they can be used to
      trick unsuspecting privileged processes into resolving unexpected paths.
      
      It is also non-trivial for userspace to unambiguously avoid resolving
      magic-links, because they do not have a reliable indication that they
      are a magic-link (in order to verify them you'd have to manually open
      the path given by readlink(2) and then verify that the two file
      descriptors reference the same underlying file, which is plagued with
      possible race conditions or supplementary attack scenarios).
      
      It would therefore be very helpful for userspace to be able to avoid
      these symlinks easily, thus hopefully removing a tool from attackers'
      toolboxes.
      
      This is part of a refresh of Al's AT_NO_JUMPS patchset[1] (which was a
      variation on David Drysdale's O_BENEATH patchset[2], which in turn was
      based on the Capsicum project[3]).
      
      /* Userspace API. */
      LOOKUP_NO_MAGICLINKS will be exposed to userspace through openat2(2).
      
      /* Semantics. */
      Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
      LOOKUP_NO_MAGICLINKS applies to all components of the path.
      
      With LOOKUP_NO_MAGICLINKS, any magic-link path component encountered
      during path resolution will yield -ELOOP. The handling of ~LOOKUP_FOLLOW
      for a trailing magic-link is identical to LOOKUP_NO_SYMLINKS.
      
      LOOKUP_NO_SYMLINKS implies LOOKUP_NO_MAGICLINKS.
      
      /* Testing. */
      LOOKUP_NO_MAGICLINKS is tested as part of the openat2(2) selftests.
      
      [1]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/
      [2]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@google.com/
      [3]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@google.com/
      
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Suggested-by: default avatarDavid Drysdale <drysdale@google.com>
      Suggested-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Suggested-by: default avatarAndy Lutomirski <luto@kernel.org>
      Suggested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      4b99d499
    • Aleksa Sarai's avatar
      namei: LOOKUP_NO_SYMLINKS: block symlink resolution · 27812141
      Aleksa Sarai authored
      /* Background. */
      Userspace cannot easily resolve a path without resolving symlinks, and
      would have to manually resolve each path component with O_PATH and
      O_NOFOLLOW. This is clearly inefficient, and can be fairly easy to screw
      up (resulting in possible security bugs). Linus has mentioned that Git
      has a particular need for this kind of flag[1]. It also resolves a
      fairly long-standing perceived deficiency in O_NOFOLLOw -- that it only
      blocks the opening of trailing symlinks.
      
      This is part of a refresh of Al's AT_NO_JUMPS patchset[2] (which was a
      variation on David Drysdale's O_BENEATH patchset[3], which in turn was
      based on the Capsicum project[4]).
      
      /* Userspace API. */
      LOOKUP_NO_SYMLINKS will be exposed to userspace through openat2(2).
      
      /* Semantics. */
      Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
      LOOKUP_NO_SYMLINKS applies to all components of the path.
      
      With LOOKUP_NO_SYMLINKS, any symlink path component encountered during
      path resolution will yield -ELOOP. If the trailing component is a
      symlink (and no other components were symlinks), then O_PATH|O_NOFOLLOW
      will not error out and will instead provide a handle to the trailing
      symlink -- without resolving it.
      
      /* Testing. */
      LOOKUP_NO_SYMLINKS is tested as part of the openat2(2) selftests.
      
      [1]: https://lore.kernel.org/lkml/CA+55aFyOKM7DW7+0sdDFKdZFXgptb5r1id9=Wvhd8AgSP7qjwQ@mail.gmail.com/
      [2]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/
      [3]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@google.com/
      [4]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@google.com/
      
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Suggested-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Suggested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      27812141
    • Aleksa Sarai's avatar
      namei: allow set_root() to produce errors · 740a1678
      Aleksa Sarai authored
      For LOOKUP_BENEATH and LOOKUP_IN_ROOT it is necessary to ensure that
      set_root() is never called, and thus (for hardening purposes) it should
      return an error rather than permit a breakout from the root. In
      addition, move all of the repetitive set_root() calls to nd_jump_root().
      Signed-off-by: default avatarAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      740a1678
    • Aleksa Sarai's avatar
      namei: allow nd_jump_link() to produce errors · 1bc82070
      Aleksa Sarai authored
      In preparation for LOOKUP_NO_MAGICLINKS, it's necessary to add the
      ability for nd_jump_link() to return an error which the corresponding
      get_link() caller must propogate back up to the VFS.
      Suggested-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      1bc82070
    • Aleksa Sarai's avatar
      nsfs: clean-up ns_get_path() signature to return int · ce623f89
      Aleksa Sarai authored
      ns_get_path() and ns_get_path_cb() only ever return either NULL or an
      ERR_PTR. It is far more idiomatic to simply return an integer, and it
      makes all of the callers of ns_get_path() more straightforward to read.
      
      Fixes: e149ed2b ("take the targets of /proc/*/ns/* symlinks to separate fs")
      Signed-off-by: default avatarAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      ce623f89
    • Aleksa Sarai's avatar
      namei: only return -ECHILD from follow_dotdot_rcu() · 2b98149c
      Aleksa Sarai authored
      It's over-zealous to return hard errors under RCU-walk here, given that
      a REF-walk will be triggered for all other cases handling ".." under
      RCU.
      
      The original purpose of this check was to ensure that if a rename occurs
      such that a directory is moved outside of the bind-mount which the
      resolution started in, it would be detected and blocked to avoid being
      able to mess with paths outside of the bind-mount. However, triggering a
      new REF-walk is just as effective a solution.
      
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Fixes: 397d425d ("vfs: Test for and handle paths that are unreachable from their mnt_root")
      Suggested-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      2b98149c
  3. 08 Dec, 2019 12 commits
    • Linus Torvalds's avatar
      Linux 5.5-rc1 · e42617b8
      Linus Torvalds authored
      e42617b8
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 95e6ba51
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) More jumbo frame fixes in r8169, from Heiner Kallweit.
      
       2) Fix bpf build in minimal configuration, from Alexei Starovoitov.
      
       3) Use after free in slcan driver, from Jouni Hogander.
      
       4) Flower classifier port ranges don't work properly in the HW offload
          case, from Yoshiki Komachi.
      
       5) Use after free in hns3_nic_maybe_stop_tx(), from Yunsheng Lin.
      
       6) Out of bounds access in mqprio_dump(), from Vladyslav Tarasiuk.
      
       7) Fix flow dissection in dsa TX path, from Alexander Lobakin.
      
       8) Stale syncookie timestampe fixes from Guillaume Nault.
      
      [ Did an evil merge to silence a warning introduced by this pull - Linus ]
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (84 commits)
        r8169: fix rtl_hw_jumbo_disable for RTL8168evl
        net_sched: validate TCA_KIND attribute in tc_chain_tmplt_add()
        r8169: add missing RX enabling for WoL on RTL8125
        vhost/vsock: accept only packets with the right dst_cid
        net: phy: dp83867: fix hfs boot in rgmii mode
        net: ethernet: ti: cpsw: fix extra rx interrupt
        inet: protect against too small mtu values.
        gre: refetch erspan header from skb->data after pskb_may_pull()
        pppoe: remove redundant BUG_ON() check in pppoe_pernet
        tcp: Protect accesses to .ts_recent_stamp with {READ,WRITE}_ONCE()
        tcp: tighten acceptance of ACKs not matching a child socket
        tcp: fix rejected syncookies due to stale timestamps
        lpc_eth: kernel BUG on remove
        tcp: md5: fix potential overestimation of TCP option space
        net: sched: allow indirect blocks to bind to clsact in TC
        net: core: rename indirect block ingress cb function
        net-sysfs: Call dev_hold always in netdev_queue_add_kobject
        net: dsa: fix flow dissection on Tx path
        net/tls: Fix return values to avoid ENOTSUPP
        net: avoid an indirect call in ____sys_recvmsg()
        ...
      95e6ba51
    • Linus Torvalds's avatar
      Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 138f371d
      Linus Torvalds authored
      Pull more SCSI updates from James Bottomley:
       "Eleven patches, all in drivers (no core changes) that are either minor
        cleanups or small fixes.
      
        They were late arriving, but still safe for -rc1"
      
      * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: MAINTAINERS: Add the linux-scsi mailing list to the ISCSI entry
        scsi: megaraid_sas: Make poll_aen_lock static
        scsi: sd_zbc: Improve report zones error printout
        scsi: qla2xxx: Fix qla2x00_request_irqs() for MSI
        scsi: qla2xxx: unregister ports after GPN_FT failure
        scsi: qla2xxx: fix rports not being mark as lost in sync fabric scan
        scsi: pm80xx: Remove unused include of linux/version.h
        scsi: pm80xx: fix logic to break out of loop when register value is 2 or 3
        scsi: scsi_transport_sas: Fix memory leak when removing devices
        scsi: lpfc: size cpu map by last cpu id set
        scsi: ibmvscsi_tgt: Remove unneeded variable rc
      138f371d
    • Linus Torvalds's avatar
      Merge tag '5.5-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6 · a78f7cdd
      Linus Torvalds authored
      Pull cifs fixes from Steve French:
       "Nine cifs/smb3 fixes:
      
         - one fix for stable (oops during oplock break)
      
         - two timestamp fixes including important one for updating mtime at
           close to avoid stale metadata caching issue on dirty files (also
           improves perf by using SMB2_CLOSE_FLAG_POSTQUERY_ATTRIB over the
           wire)
      
         - two fixes for "modefromsid" mount option for file create (now
           allows mode bits to be set more atomically and accurately on create
           by adding "sd_context" on create when modefromsid specified on
           mount)
      
         - two fixes for multichannel found in testing this week against
           different servers
      
         - two small cleanup patches"
      
      * tag '5.5-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
        smb3: improve check for when we send the security descriptor context on create
        smb3: fix mode passed in on create for modetosid mount option
        cifs: fix possible uninitialized access and race on iface_list
        cifs: Fix lookup of SMB connections on multichannel
        smb3: query attributes on file close
        smb3: remove unused flag passed into close functions
        cifs: remove redundant assignment to pointer pneg_ctxt
        fs: cifs: Fix atime update check vs mtime
        CIFS: Fix NULL-pointer dereference in smb2_push_mandatory_locks
      a78f7cdd
    • Linus Torvalds's avatar
      Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 5bf9a06a
      Linus Torvalds authored
      Pull misc vfs cleanups from Al Viro:
       "No common topic, just three cleanups".
      
      * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        make __d_alloc() static
        fs/namespace: add __user to open_tree and move_mount syscalls
        fs/fnctl: fix missing __user in fcntl_rw_hint()
      5bf9a06a
    • Linus Torvalds's avatar
      Merge tag 'ntb-5.5' of git://github.com/jonmason/ntb · 9455d25f
      Linus Torvalds authored
      Pull NTB update from Jon Mason:
       "Just a simple patch to add a new Hygon Device ID to the AMD NTB device
        driver"
      
      * tag 'ntb-5.5' of git://github.com/jonmason/ntb:
        NTB: Add Hygon Device ID
      9455d25f
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input · 73721451
      Linus Torvalds authored
      Pull more input updates from Dmitry Torokhov:
      
       - fixups for Synaptics RMI4 driver
      
       - a quirk for Goodinx touchscreen on Teclast tablet
      
       - a new keycode definition for activating privacy screen feature found
         on a few "enterprise" laptops
      
       - updates to snvs_pwrkey driver
      
       - polling uinput device for writing (which is always allowed) now works
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
        Input: synaptics-rmi4 - don't increment rmiaddr for SMBus transfers
        Input: synaptics-rmi4 - re-enable IRQs in f34v7_do_reflash
        Input: goodix - add upside-down quirk for Teclast X89 tablet
        Input: add privacy screen toggle keycode
        Input: uinput - fix returning EPOLLOUT from uinput_poll
        Input: snvs_pwrkey - remove gratuitous NULL initializers
        Input: snvs_pwrkey - send key events for i.MX6 S, DL and Q
      73721451
    • Linus Torvalds's avatar
      Merge tag 'iomap-5.5-merge-14' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux · 95207d55
      Linus Torvalds authored
      Pull iomap fixes from Darrick Wong:
       "Fix a race condition and a use-after-free error:
      
         - Fix a UAF when reporting writeback errors
      
         - Fix a race condition when handling page uptodate on fragmented file
           with blocksize < pagesize"
      
      * tag 'iomap-5.5-merge-14' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
        iomap: stop using ioend after it's been freed in iomap_finish_ioend()
        iomap: fix sub-page uptodate handling
      95207d55
    • Linus Torvalds's avatar
      Merge tag 'xfs-5.5-merge-17' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux · 50caca9d
      Linus Torvalds authored
      Pull xfs fixes from Darrick Wong:
       "Fix a couple of resource management errors and a hang:
      
         - fix a crash in the log setup code when log mounting fails
      
         - fix a hang when allocating space on the realtime device
      
         - fix a block leak when freeing space on the realtime device"
      
      * tag 'xfs-5.5-merge-17' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
        xfs: fix mount failure crash on invalid iclog memory access
        xfs: don't check for AG deadlock for realtime files in bunmapi
        xfs: fix realtime file data space leak
      50caca9d
    • Linus Torvalds's avatar
      Merge tag 'for-linus-5.5-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux · 316933cf
      Linus Torvalds authored
      Pull orangefs update from Mike Marshall:
       "orangefs: posix open permission checking...
      
        Orangefs has no open, and orangefs checks file permissions on each
        file access. Posix requires that file permissions be checked on open
        and nowhere else. Orangefs-through-the-kernel needs to seem posix
        compliant.
      
        The VFS opens files, even if the filesystem provides no method. We can
        see if a file was successfully opened for read and or for write by
        looking at file->f_mode.
      
        When writes are flowing from the page cache, file is no longer
        available. We can trust the VFS to have checked file->f_mode before
        writing to the page cache.
      
        The mode of a file might change between when it is opened and IO
        commences, or it might be created with an arbitrary mode.
      
        We'll make sure we don't hit EACCES during the IO stage by using
        UID 0"
      
      [ This is "posixish", but not a great solution in the long run, since a
        proper secure network server shouldn't really trust the client like this.
        But proper and secure POSIX behavior requires an open method and a
        resulting cookie for IO of some kind, or similar.    - Linus ]
      
      * tag 'for-linus-5.5-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
        orangefs: posix open permission checking...
      316933cf
    • Linus Torvalds's avatar
      Merge tag 'nfsd-5.5' of git://linux-nfs.org/~bfields/linux · 911d137a
      Linus Torvalds authored
      Pull nfsd updates from Bruce Fields:
       "This is a relatively quiet cycle for nfsd, mainly various bugfixes.
      
        Possibly most interesting is Trond's fixes for some callback races
        that were due to my incomplete understanding of rpc client shutdown.
        Unfortunately at the last minute I've started noticing a new
        intermittent failure to send callbacks. As the logic seems basically
        correct, I'm leaving Trond's patches in for now, and hope to find a
        fix in the next week so I don't have to revert those patches"
      
      * tag 'nfsd-5.5' of git://linux-nfs.org/~bfields/linux: (24 commits)
        nfsd: depend on CRYPTO_MD5 for legacy client tracking
        NFSD fixing possible null pointer derefering in copy offload
        nfsd: check for EBUSY from vfs_rmdir/vfs_unink.
        nfsd: Ensure CLONE persists data and metadata changes to the target file
        SUNRPC: Fix backchannel latency metrics
        nfsd: restore NFSv3 ACL support
        nfsd: v4 support requires CRYPTO_SHA256
        nfsd: Fix cld_net->cn_tfm initialization
        lockd: remove __KERNEL__ ifdefs
        sunrpc: remove __KERNEL__ ifdefs
        race in exportfs_decode_fh()
        nfsd: Drop LIST_HEAD where the variable it declares is never used.
        nfsd: document callback_wq serialization of callback code
        nfsd: mark cb path down on unknown errors
        nfsd: Fix races between nfsd4_cb_release() and nfsd4_shutdown_callback()
        nfsd: minor 4.1 callback cleanup
        SUNRPC: Fix svcauth_gss_proxy_init()
        SUNRPC: Trace gssproxy upcall results
        sunrpc: fix crash when cache_head become valid before update
        nfsd: remove private bin2hex implementation
        ...
      911d137a
    • Linus Torvalds's avatar
      Merge tag 'nfs-for-5.5-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs · fb9bf40c
      Linus Torvalds authored
      Pull NFS client updates from Trond Myklebust:
       "Highlights include:
      
        Features:
      
         - NFSv4.2 now supports cross device offloaded copy (i.e. offloaded
           copy of a file from one source server to a different target
           server).
      
         - New RDMA tracepoints for debugging congestion control and Local
           Invalidate WRs.
      
        Bugfixes and cleanups
      
         - Drop the NFSv4.1 session slot if nfs4_delegreturn_prepare waits for
           layoutreturn
      
         - Handle bad/dead sessions correctly in nfs41_sequence_process()
      
         - Various bugfixes to the delegation return operation.
      
         - Various bugfixes pertaining to delegations that have been revoked.
      
         - Cleanups to the NFS timespec code to avoid unnecessary conversions
           between timespec and timespec64.
      
         - Fix unstable RDMA connections after a reconnect
      
         - Close race between waking an RDMA sender and posting a receive
      
         - Wake pending RDMA tasks if connection fails
      
         - Fix MR list corruption, and clean up MR usage
      
         - Fix another RPCSEC_GSS issue with MIC buffer space"
      
      * tag 'nfs-for-5.5-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (79 commits)
        SUNRPC: Capture completion of all RPC tasks
        SUNRPC: Fix another issue with MIC buffer space
        NFS4: Trace lock reclaims
        NFS4: Trace state recovery operation
        NFSv4.2 fix memory leak in nfs42_ssc_open
        NFSv4.2 fix kfree in __nfs42_copy_file_range
        NFS: remove duplicated include from nfs4file.c
        NFSv4: Make _nfs42_proc_copy_notify() static
        NFS: Fallocate should use the nfs4_fattr_bitmap
        NFS: Return -ETXTBSY when attempting to write to a swapfile
        fs: nfs: sysfs: Remove NULL check before kfree
        NFS: remove unneeded semicolon
        NFSv4: add declaration of current_stateid
        NFSv4.x: Drop the slot if nfs4_delegreturn_prepare waits for layoutreturn
        NFSv4.x: Handle bad/dead sessions correctly in nfs41_sequence_process()
        nfsv4: Move NFSPROC4_CLNT_COPY_NOTIFY to end of list
        SUNRPC: Avoid RPC delays when exiting suspend
        NFS: Add a tracepoint in nfs_fh_to_dentry()
        NFSv4: Don't retry the GETATTR on old stateid in nfs4_delegreturn_done()
        NFSv4: Handle NFS4ERR_OLD_STATEID in delegreturn
        ...
      fb9bf40c
  4. 07 Dec, 2019 16 commits
    • Steve French's avatar
      smb3: improve check for when we send the security descriptor context on create · 231e2a0b
      Steve French authored
      We had cases in the previous patch where we were sending the security
      descriptor context on SMB3 open (file create) in cases when we hadn't
      mounted with with "modefromsid" mount option.
      
      Add check for that mount flag before calling ad_sd_context in
      open init.
      Signed-off-by: default avatarSteve French <stfrench@microsoft.com>
      Reviewed-by: default avatarPavel Shilovsky <pshilov@microsoft.com>
      231e2a0b
    • Linus Torvalds's avatar
      Merge tag 'vfio-v5.5-rc1' of git://github.com/awilliam/linux-vfio · 94e89b40
      Linus Torvalds authored
      Pull VFIO updates from Alex Williamson:
      
       - Remove hugepage checks for reserved pfns (Ben Luo)
      
       - Fix irq-bypass unregister ordering (Jiang Yi)
      
      * tag 'vfio-v5.5-rc1' of git://github.com/awilliam/linux-vfio:
        vfio/pci: call irq_bypass_unregister_producer() before freeing irq
        vfio/type1: remove hugepage checks in is_invalid_reserved_pfn()
      94e89b40
    • Linus Torvalds's avatar
      Merge tag 'for-linus-5.5b-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip · f74fd13f
      Linus Torvalds authored
      Pull more xen updates from Juergen Gross:
      
       - a patch to fix a build warning
      
       - a cleanup of no longer needed code in the Xen event handling
      
       - a small series for the Xen grant driver avoiding high order
         allocations and replacing an insane global limit by a per-call one
      
       - a small series fixing Xen frontend/backend module referencing
      
      * tag 'for-linus-5.5b-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
        xen-blkback: allow module to be cleanly unloaded
        xen/xenbus: reference count registered modules
        xen/gntdev: switch from kcalloc() to kvcalloc()
        xen/gntdev: replace global limit of mapped pages by limit per call
        xen/gntdev: remove redundant non-zero check on ret
        xen/events: remove event handling recursion detection
      f74fd13f
    • Linus Torvalds's avatar
      Merge branch 'akpm' (patches from Andrew) · 6dc517a3
      Linus Torvalds authored
      Merge misc Kconfig updates from Andrew Morton:
       "A number of changes to Kconfig files under lib/ from Changbin Du and
        Krzysztof Kozlowski"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        lib/: fix Kconfig indentation
        kernel-hacking: move DEBUG_FS to 'Generic Kernel Debugging Instruments'
        kernel-hacking: move DEBUG_BUGVERBOSE to 'printk and dmesg options'
        kernel-hacking: create a submenu for scheduler debugging options
        kernel-hacking: move SCHED_STACK_END_CHECK after DEBUG_STACK_USAGE
        kernel-hacking: move Oops into 'Lockups and Hangs'
        kernel-hacking: move kernel testing and coverage options to same submenu
        kernel-hacking: group kernel data structures debugging together
        kernel-hacking: create submenu for arch special debugging options
        kernel-hacking: group sysrq/kgdb/ubsan into 'Generic Kernel Debugging Instruments'
      6dc517a3
    • Heiner Kallweit's avatar
      r8169: fix rtl_hw_jumbo_disable for RTL8168evl · 0fc75219
      Heiner Kallweit authored
      In referenced fix we removed the RTL8168e-specific jumbo config for
      RTL8168evl in rtl_hw_jumbo_enable(). We have to do the same in
      rtl_hw_jumbo_disable().
      
      v2: fix referenced commit id
      
      Fixes: 14012c9f ("r8169: fix jumbo configuration for RTL8168evl")
      Signed-off-by: default avatarHeiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0fc75219
    • Linus Torvalds's avatar
      pipe: don't use 'pipe_wait() for basic pipe IO · 85190d15
      Linus Torvalds authored
      pipe_wait() may be simple, but since it relies on the pipe lock, it
      means that we have to do the wakeup while holding the lock.  That's
      unfortunate, because the very first thing the waked entity will want to
      do is to get the pipe lock for itself.
      
      So get rid of the pipe_wait() usage by simply releasing the pipe lock,
      doing the wakeup (if required) and then using wait_event_interruptible()
      to wait on the right condition instead.
      
      wait_event_interruptible() handles races on its own by comparing the
      wakeup condition before and after adding itself to the wait queue, so
      you can use an optimistic unlocked condition for it.
      
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      85190d15
    • Jiasen Lin's avatar
      NTB: Add Hygon Device ID · 9b5b99a8
      Jiasen Lin authored
      Signed-off-by: default avatarJiasen Lin <linjiasen@hygon.cn>
      Signed-off-by: default avatarJon Mason <jdmason@kudzu.us>
      9b5b99a8
    • Linus Torvalds's avatar
      pipe: remove 'waiting_writers' merging logic · a28c8b9d
      Linus Torvalds authored
      This code is ancient, and goes back to when we only had a single page
      for the pipe buffers.  The exact history is hidden in the mists of time
      (ie "before git", and in fact predates the BK repository too).
      
      At that long-ago point in time, it actually helped to try to merge big
      back-and-forth pipe reads and writes, and not limit pipe reads to the
      single pipe buffer in length just because that was all we had at a time.
      
      However, since then we've expanded the pipe buffers to multiple pages,
      and this logic really doesn't seem to make sense.  And a lot of it is
      somewhat questionable (ie "hmm, the user asked for a non-blocking read,
      but we see that there's a writer pending, so let's wait anyway to get
      the extra data that the writer will have").
      
      But more importantly, it makes the "go to sleep" logic much less
      obvious, and considering the wakeup issues we've had, I want to make for
      less of those kinds of things.
      
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a28c8b9d
    • Linus Torvalds's avatar
      pipe: fix and clarify pipe read wakeup logic · f467a6a6
      Linus Torvalds authored
      This is the read side version of the previous commit: it simplifies the
      logic to only wake up waiting writers when necessary, and makes sure to
      use a synchronous wakeup.  This time not so much for GNU make jobserver
      reasons (that pipe never fills up), but simply to get the writer going
      quickly again.
      
      A bit less verbose commentary this time, if only because I assume that
      the write side commentary isn't going to be ignored if you touch this
      code.
      
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f467a6a6
    • Linus Torvalds's avatar
      pipe: fix and clarify pipe write wakeup logic · 1b6b26ae
      Linus Torvalds authored
      The pipe rework ends up having been extra painful, partly becaused of
      actual bugs with ordering and caching of the pipe state, but also
      because of subtle performance issues.
      
      In particular, the pipe rework caused the kernel build to inexplicably
      slow down.
      
      The reason turns out to be that the GNU make jobserver (which limits the
      parallelism of the build) uses a pipe to implement a "token" system: a
      parallel submake will read a character from the pipe to get the job
      token before starting a new job, and will write a character back to the
      pipe when it is done.  The overall job limit is thus easily controlled
      by just writing the appropriate number of initial token characters into
      the pipe.
      
      But to work well, that really means that the old behavior of write
      wakeups being synchronous (WF_SYNC) is very important - when the pipe
      writer wakes up a reader, we want the reader to actually get scheduled
      immediately.  Otherwise you lose the parallelism of the build.
      
      The pipe rework lost that synchronous wakeup on write, and we had
      clearly all forgotten the reasons and rules for it.
      
      This rewrites the pipe write wakeup logic to do the required Wsync
      wakeups, but also clarifies the logic and avoids extraneous wakeups.
      
      It also ends up addign a number of comments about what oit does and why,
      so that we hopefully don't end up forgetting about this next time we
      change this code.
      
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1b6b26ae
    • Eric Dumazet's avatar
      net_sched: validate TCA_KIND attribute in tc_chain_tmplt_add() · 2dd5616e
      Eric Dumazet authored
      Use the new tcf_proto_check_kind() helper to make sure user
      provided value is well formed.
      
      BUG: KMSAN: uninit-value in string_nocheck lib/vsprintf.c:606 [inline]
      BUG: KMSAN: uninit-value in string+0x4be/0x600 lib/vsprintf.c:668
      CPU: 0 PID: 12358 Comm: syz-executor.1 Not tainted 5.4.0-rc8-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1c9/0x220 lib/dump_stack.c:118
       kmsan_report+0x128/0x220 mm/kmsan/kmsan_report.c:108
       __msan_warning+0x64/0xc0 mm/kmsan/kmsan_instr.c:245
       string_nocheck lib/vsprintf.c:606 [inline]
       string+0x4be/0x600 lib/vsprintf.c:668
       vsnprintf+0x218f/0x3210 lib/vsprintf.c:2510
       __request_module+0x2b1/0x11c0 kernel/kmod.c:143
       tcf_proto_lookup_ops+0x171/0x700 net/sched/cls_api.c:139
       tc_chain_tmplt_add net/sched/cls_api.c:2730 [inline]
       tc_ctl_chain+0x1904/0x38a0 net/sched/cls_api.c:2850
       rtnetlink_rcv_msg+0x115a/0x1580 net/core/rtnetlink.c:5224
       netlink_rcv_skb+0x431/0x620 net/netlink/af_netlink.c:2477
       rtnetlink_rcv+0x50/0x60 net/core/rtnetlink.c:5242
       netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
       netlink_unicast+0xf3e/0x1020 net/netlink/af_netlink.c:1328
       netlink_sendmsg+0x110f/0x1330 net/netlink/af_netlink.c:1917
       sock_sendmsg_nosec net/socket.c:637 [inline]
       sock_sendmsg net/socket.c:657 [inline]
       ___sys_sendmsg+0x14ff/0x1590 net/socket.c:2311
       __sys_sendmsg net/socket.c:2356 [inline]
       __do_sys_sendmsg net/socket.c:2365 [inline]
       __se_sys_sendmsg+0x305/0x460 net/socket.c:2363
       __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2363
       do_syscall_64+0xb6/0x160 arch/x86/entry/common.c:291
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x45a649
      Code: ad b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f0790795c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 000000000045a649
      RDX: 0000000000000000 RSI: 0000000020000300 RDI: 0000000000000006
      RBP: 000000000075bfc8 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f07907966d4
      R13: 00000000004c8db5 R14: 00000000004df630 R15: 00000000ffffffff
      
      Uninit was created at:
       kmsan_save_stack_with_flags mm/kmsan/kmsan.c:149 [inline]
       kmsan_internal_poison_shadow+0x5c/0x110 mm/kmsan/kmsan.c:132
       kmsan_slab_alloc+0x97/0x100 mm/kmsan/kmsan_hooks.c:86
       slab_alloc_node mm/slub.c:2773 [inline]
       __kmalloc_node_track_caller+0xe27/0x11a0 mm/slub.c:4381
       __kmalloc_reserve net/core/skbuff.c:141 [inline]
       __alloc_skb+0x306/0xa10 net/core/skbuff.c:209
       alloc_skb include/linux/skbuff.h:1049 [inline]
       netlink_alloc_large_skb net/netlink/af_netlink.c:1174 [inline]
       netlink_sendmsg+0x783/0x1330 net/netlink/af_netlink.c:1892
       sock_sendmsg_nosec net/socket.c:637 [inline]
       sock_sendmsg net/socket.c:657 [inline]
       ___sys_sendmsg+0x14ff/0x1590 net/socket.c:2311
       __sys_sendmsg net/socket.c:2356 [inline]
       __do_sys_sendmsg net/socket.c:2365 [inline]
       __se_sys_sendmsg+0x305/0x460 net/socket.c:2363
       __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2363
       do_syscall_64+0xb6/0x160 arch/x86/entry/common.c:291
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: 6f96c3c6 ("net_sched: fix backward compatibility for TCA_KIND")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Acked-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2dd5616e
    • Heiner Kallweit's avatar
      r8169: add missing RX enabling for WoL on RTL8125 · 00222d13
      Heiner Kallweit authored
      RTL8125 also requires to enable RX for WoL.
      
      v2: add missing Fixes tag
      
      Fixes: f1bce4ad ("r8169: add support for RTL8125")
      Signed-off-by: default avatarHeiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      00222d13
    • Stefano Garzarella's avatar
      vhost/vsock: accept only packets with the right dst_cid · 8a3cc29c
      Stefano Garzarella authored
      When we receive a new packet from the guest, we check if the
      src_cid is correct, but we forgot to check the dst_cid.
      
      The host should accept only packets where dst_cid is
      equal to the host CID.
      Signed-off-by: default avatarStefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8a3cc29c
    • Grygorii Strashko's avatar
      net: phy: dp83867: fix hfs boot in rgmii mode · fafc5db2
      Grygorii Strashko authored
      The commit ef87f7da ("net: phy: dp83867: move dt parsing to probe")
      causes regression on TI dra71x-evm and dra72x-evm, where DP83867 PHY is
      used in "rgmii-id" mode - the networking stops working.
      Unfortunately, it's not enough to just move DT parsing code to .probe() as
      it depends on phydev->interface value, which is set to correct value abter
      the .probe() is completed and before calling .config_init(). So, RGMII
      configuration can't be loaded from DT.
      
      To fix and issue
      - move RGMII validation code to .config_init()
      - parse RGMII parameters in dp83867_of_init(), but consider them as
      optional.
      
      Fixes: ef87f7da ("net: phy: dp83867: move dt parsing to probe")
      Signed-off-by: default avatarGrygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fafc5db2
    • Grygorii Strashko's avatar
      net: ethernet: ti: cpsw: fix extra rx interrupt · 51302f77
      Grygorii Strashko authored
      Now RX interrupt is triggered twice every time, because in
      cpsw_rx_interrupt() it is asked first and then disabled. So there will be
      pending interrupt always, when RX interrupt is enabled again in NAPI
      handler.
      
      Fix it by first disabling IRQ and then do ask.
      
      Fixes: 870915fe ("drivers: net: cpsw: remove disable_irq/enable_irq as irq can be masked from cpsw itself")
      Signed-off-by: default avatarGrygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      51302f77
    • Eric Dumazet's avatar
      inet: protect against too small mtu values. · 501a90c9
      Eric Dumazet authored
      syzbot was once again able to crash a host by setting a very small mtu
      on loopback device.
      
      Let's make inetdev_valid_mtu() available in include/net/ip.h,
      and use it in ip_setup_cork(), so that we protect both ip_append_page()
      and __ip_append_data()
      
      Also add a READ_ONCE() when the device mtu is read.
      
      Pairs this lockless read with one WRITE_ONCE() in __dev_set_mtu(),
      even if other code paths might write over this field.
      
      Add a big comment in include/linux/netdevice.h about dev->mtu
      needing READ_ONCE()/WRITE_ONCE() annotations.
      
      Hopefully we will add the missing ones in followup patches.
      
      [1]
      
      refcount_t: saturated; leaking memory.
      WARNING: CPU: 0 PID: 9464 at lib/refcount.c:22 refcount_warn_saturate+0x138/0x1f0 lib/refcount.c:22
      Kernel panic - not syncing: panic_on_warn set ...
      CPU: 0 PID: 9464 Comm: syz-executor850 Not tainted 5.4.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x197/0x210 lib/dump_stack.c:118
       panic+0x2e3/0x75c kernel/panic.c:221
       __warn.cold+0x2f/0x3e kernel/panic.c:582
       report_bug+0x289/0x300 lib/bug.c:195
       fixup_bug arch/x86/kernel/traps.c:174 [inline]
       fixup_bug arch/x86/kernel/traps.c:169 [inline]
       do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:267
       do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:286
       invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1027
      RIP: 0010:refcount_warn_saturate+0x138/0x1f0 lib/refcount.c:22
      Code: 06 31 ff 89 de e8 c8 f5 e6 fd 84 db 0f 85 6f ff ff ff e8 7b f4 e6 fd 48 c7 c7 e0 71 4f 88 c6 05 56 a6 a4 06 01 e8 c7 a8 b7 fd <0f> 0b e9 50 ff ff ff e8 5c f4 e6 fd 0f b6 1d 3d a6 a4 06 31 ff 89
      RSP: 0018:ffff88809689f550 EFLAGS: 00010286
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: ffffffff815e4336 RDI: ffffed1012d13e9c
      RBP: ffff88809689f560 R08: ffff88809c50a3c0 R09: fffffbfff15d31b1
      R10: fffffbfff15d31b0 R11: ffffffff8ae98d87 R12: 0000000000000001
      R13: 0000000000040100 R14: ffff888099041104 R15: ffff888218d96e40
       refcount_add include/linux/refcount.h:193 [inline]
       skb_set_owner_w+0x2b6/0x410 net/core/sock.c:1999
       sock_wmalloc+0xf1/0x120 net/core/sock.c:2096
       ip_append_page+0x7ef/0x1190 net/ipv4/ip_output.c:1383
       udp_sendpage+0x1c7/0x480 net/ipv4/udp.c:1276
       inet_sendpage+0xdb/0x150 net/ipv4/af_inet.c:821
       kernel_sendpage+0x92/0xf0 net/socket.c:3794
       sock_sendpage+0x8b/0xc0 net/socket.c:936
       pipe_to_sendpage+0x2da/0x3c0 fs/splice.c:458
       splice_from_pipe_feed fs/splice.c:512 [inline]
       __splice_from_pipe+0x3ee/0x7c0 fs/splice.c:636
       splice_from_pipe+0x108/0x170 fs/splice.c:671
       generic_splice_sendpage+0x3c/0x50 fs/splice.c:842
       do_splice_from fs/splice.c:861 [inline]
       direct_splice_actor+0x123/0x190 fs/splice.c:1035
       splice_direct_to_actor+0x3b4/0xa30 fs/splice.c:990
       do_splice_direct+0x1da/0x2a0 fs/splice.c:1078
       do_sendfile+0x597/0xd00 fs/read_write.c:1464
       __do_sys_sendfile64 fs/read_write.c:1525 [inline]
       __se_sys_sendfile64 fs/read_write.c:1511 [inline]
       __x64_sys_sendfile64+0x1dd/0x220 fs/read_write.c:1511
       do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x441409
      Code: e8 ac e8 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007fffb64c4f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441409
      RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000005
      RBP: 0000000000073b8a R08: 0000000000000010 R09: 0000000000000010
      R10: 0000000000010001 R11: 0000000000000246 R12: 0000000000402180
      R13: 0000000000402210 R14: 0000000000000000 R15: 0000000000000000
      Kernel Offset: disabled
      Rebooting in 86400 seconds..
      
      Fixes: 1470ddf7 ("inet: Remove explicit write references to sk/inet in ip_append_data")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      501a90c9