1. 21 Jul, 2017 38 commits
    • Andy Lutomirski's avatar
      selftests/capabilities: Fix the test_execve test · a2e0b1c1
      Andy Lutomirski authored
      commit 796a3bae upstream.
      
      test_execve does rather odd mount manipulations to safely create
      temporary setuid and setgid executables that aren't visible to the
      rest of the system.  Those executables end up in the test's cwd, but
      that cwd is MNT_DETACHed.
      
      The core namespace code considers MNT_DETACHed trees to belong to no
      mount namespace at all and, in general, MNT_DETACHed trees are only
      barely function.  This interacted with commit 380cf5ba ("fs:
      Treat foreign mounts as nosuid") to cause all MNT_DETACHed trees to
      act as though they're nosuid, breaking the test.
      
      Fix it by just not detaching the tree.  It's still in a private
      mount namespace and is therefore still invisible to the rest of the
      system (except via /proc, and the same nosuid logic will protect all
      other programs on the system from believing in test_execve's setuid
      bits).
      
      While we're at it, fix some blatant whitespace problems.
      Reported-by: default avatarNaresh Kamboju <naresh.kamboju@linaro.org>
      Fixes: 380cf5ba ("fs: Treat foreign mounts as nosuid")
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: linux-kselftest@vger.kernel.org
      Signed-off-by: default avatarAndy Lutomirski <luto@kernel.org>
      Acked-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarShuah Khan <shuahkh@osg.samsung.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a2e0b1c1
    • Eric W. Biederman's avatar
      mnt: Make propagate_umount less slow for overlapping mount propagation trees · f07288cf
      Eric W. Biederman authored
      commit 296990de upstream.
      
      Andrei Vagin pointed out that time to executue propagate_umount can go
      non-linear (and take a ludicrious amount of time) when the mount
      propogation trees of the mounts to be unmunted by a lazy unmount
      overlap.
      
      Make the walk of the mount propagation trees nearly linear by
      remembering which mounts have already been visited, allowing
      subsequent walks to detect when walking a mount propgation tree or a
      subtree of a mount propgation tree would be duplicate work and to skip
      them entirely.
      
      Walk the list of mounts whose propgatation trees need to be traversed
      from the mount highest in the mount tree to mounts lower in the mount
      tree so that odds are higher that the code will walk the largest trees
      first, allowing later tree walks to be skipped entirely.
      
      Add cleanup_umount_visitation to remover the code's memory of which
      mounts have been visited.
      
      Add the functions last_slave and skip_propagation_subtree to allow
      skipping appropriate parts of the mount propagation tree without
      needing to change the logic of the rest of the code.
      
      A script to generate overlapping mount propagation trees:
      
      $ cat runs.h
      set -e
      mount -t tmpfs zdtm /mnt
      mkdir -p /mnt/1 /mnt/2
      mount -t tmpfs zdtm /mnt/1
      mount --make-shared /mnt/1
      mkdir /mnt/1/1
      
      iteration=10
      if [ -n "$1" ] ; then
      	iteration=$1
      fi
      
      for i in $(seq $iteration); do
      	mount --bind /mnt/1/1 /mnt/1/1
      done
      
      mount --rbind /mnt/1 /mnt/2
      
      TIMEFORMAT='%Rs'
      nr=$(( ( 2 ** ( $iteration + 1 ) ) + 1 ))
      echo -n "umount -l /mnt/1 -> $nr        "
      time umount -l /mnt/1
      
      nr=$(cat /proc/self/mountinfo | grep zdtm | wc -l )
      time umount -l /mnt/2
      
      $ for i in $(seq 9 19); do echo $i; unshare -Urm bash ./run.sh $i; done
      
      Here are the performance numbers with and without the patch:
      
           mhash |  8192   |  8192  | 1048576 | 1048576
          mounts | before  | after  |  before | after
          ------------------------------------------------
            1025 |  0.040s | 0.016s |  0.038s | 0.019s
            2049 |  0.094s | 0.017s |  0.080s | 0.018s
            4097 |  0.243s | 0.019s |  0.206s | 0.023s
            8193 |  1.202s | 0.028s |  1.562s | 0.032s
           16385 |  9.635s | 0.036s |  9.952s | 0.041s
           32769 | 60.928s | 0.063s | 44.321s | 0.064s
           65537 |         | 0.097s |         | 0.097s
          131073 |         | 0.233s |         | 0.176s
          262145 |         | 0.653s |         | 0.344s
          524289 |         | 2.305s |         | 0.735s
         1048577 |         | 7.107s |         | 2.603s
      
      Andrei Vagin reports fixing the performance problem is part of the
      work to fix CVE-2016-6213.
      
      Fixes: a05964f3 ("[PATCH] shared mounts handling: umount")
      Reported-by: default avatarAndrei Vagin <avagin@openvz.org>
      Reviewed-by: default avatarAndrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f07288cf
    • Eric W. Biederman's avatar
      mnt: In propgate_umount handle visiting mounts in any order · fdb8f104
      Eric W. Biederman authored
      commit 99b19d16 upstream.
      
      While investigating some poor umount performance I realized that in
      the case of overlapping mount trees where some of the mounts are locked
      the code has been failing to unmount all of the mounts it should
      have been unmounting.
      
      This failure to unmount all of the necessary
      mounts can be reproduced with:
      
      $ cat locked_mounts_test.sh
      
      mount -t tmpfs test-base /mnt
      mount --make-shared /mnt
      mkdir -p /mnt/b
      
      mount -t tmpfs test1 /mnt/b
      mount --make-shared /mnt/b
      mkdir -p /mnt/b/10
      
      mount -t tmpfs test2 /mnt/b/10
      mount --make-shared /mnt/b/10
      mkdir -p /mnt/b/10/20
      
      mount --rbind /mnt/b /mnt/b/10/20
      
      unshare -Urm --propagation unchaged /bin/sh -c 'sleep 5; if [ $(grep test /proc/self/mountinfo | wc -l) -eq 1 ] ; then echo SUCCESS ; else echo FAILURE ; fi'
      sleep 1
      umount -l /mnt/b
      wait %%
      
      $ unshare -Urm ./locked_mounts_test.sh
      
      This failure is corrected by removing the prepass that marks mounts
      that may be umounted.
      
      A first pass is added that umounts mounts if possible and if not sets
      mount mark if they could be unmounted if they weren't locked and adds
      them to a list to umount possibilities.  This first pass reconsiders
      the mounts parent if it is on the list of umount possibilities, ensuring
      that information of umoutability will pass from child to mount parent.
      
      A second pass then walks through all mounts that are umounted and processes
      their children unmounting them or marking them for reparenting.
      
      A last pass cleans up the state on the mounts that could not be umounted
      and if applicable reparents them to their first parent that remained
      mounted.
      
      While a bit longer than the old code this code is much more robust
      as it allows information to flow up from the leaves and down
      from the trunk making the order in which mounts are encountered
      in the umount propgation tree irrelevant.
      
      Fixes: 0c56fe31 ("mnt: Don't propagate unmounts to locked mounts")
      Reviewed-by: default avatarAndrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fdb8f104
    • Eric W. Biederman's avatar
      mnt: In umount propagation reparent in a separate pass · 7cbc3955
      Eric W. Biederman authored
      commit 570487d3 upstream.
      
      It was observed that in some pathlogical cases that the current code
      does not unmount everything it should.  After investigation it
      was determined that the issue is that mnt_change_mntpoint can
      can change which mounts are available to be unmounted during mount
      propagation which is wrong.
      
      The trivial reproducer is:
      $ cat ./pathological.sh
      
      mount -t tmpfs test-base /mnt
      cd /mnt
      mkdir 1 2 1/1
      mount --bind 1 1
      mount --make-shared 1
      mount --bind 1 2
      mount --bind 1/1 1/1
      mount --bind 1/1 1/1
      echo
      grep test-base /proc/self/mountinfo
      umount 1/1
      echo
      grep test-base /proc/self/mountinfo
      
      $ unshare -Urm ./pathological.sh
      
      The expected output looks like:
      46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
      47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      49 54 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      50 53 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      51 49 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      54 47 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      53 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      52 50 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      
      46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
      47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      
      The output without the fix looks like:
      46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
      47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      49 54 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      50 53 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      51 49 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      54 47 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      53 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      52 50 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      
      46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
      47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      52 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      
      That last mount in the output was in the propgation tree to be unmounted but
      was missed because the mnt_change_mountpoint changed it's parent before the walk
      through the mount propagation tree observed it.
      
      Fixes: 1064f874 ("mnt: Tuck mounts under others instead of creating shadow/side mounts.")
      Acked-by: default avatarAndrei Vagin <avagin@virtuozzo.com>
      Reviewed-by: default avatarRam Pai <linuxram@us.ibm.com>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7cbc3955
    • Adam Borowski's avatar
      vt: fix unchecked __put_user() in tioclinux ioctls · 050b074e
      Adam Borowski authored
      commit 6987dc8a upstream.
      
      Only read access is checked before this call.
      
      Actually, at the moment this is not an issue, as every in-tree arch does
      the same manual checks for VERIFY_READ vs VERIFY_WRITE, relying on the MMU
      to tell them apart, but this wasn't the case in the past and may happen
      again on some odd arch in the future.
      
      If anyone cares about 3.7 and earlier, this is a security hole (untested)
      on real 80386 CPUs.
      Signed-off-by: default avatarAdam Borowski <kilobyte@angband.pl>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      050b074e
    • Kees Cook's avatar
      exec: Limit arg stack to at most 75% of _STK_LIM · 86949eb9
      Kees Cook authored
      commit da029c11 upstream.
      
      To avoid pathological stack usage or the need to special-case setuid
      execs, just limit all arg stack usage to at most 75% of _STK_LIM (6MB).
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      86949eb9
    • Kees Cook's avatar
      s390: reduce ELF_ET_DYN_BASE · 7888c029
      Kees Cook authored
      commit a73dc537 upstream.
      
      Now that explicitly executed loaders are loaded in the mmap region, we
      have more freedom to decide where we position PIE binaries in the
      address space to avoid possible collisions with mmap or stack regions.
      
      For 64-bit, align to 4GB to allow runtimes to use the entire 32-bit
      address space for 32-bit pointers.  On 32-bit use 4MB, which is the
      traditional x86 minimum load location, likely to avoid historically
      requiring a 4MB page table entry when only a portion of the first 4MB
      would be used (since the NULL address is avoided).  For s390 the
      position could be 0x10000, but that is needlessly close to the NULL
      address.
      
      Link: http://lkml.kernel.org/r/1498154792-49952-5-git-send-email-keescook@chromium.orgSigned-off-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Pratyush Anand <panand@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7888c029
    • Kees Cook's avatar
      powerpc: move ELF_ET_DYN_BASE to 4GB / 4MB · 72a333a0
      Kees Cook authored
      commit 47ebb09d upstream.
      
      Now that explicitly executed loaders are loaded in the mmap region, we
      have more freedom to decide where we position PIE binaries in the
      address space to avoid possible collisions with mmap or stack regions.
      
      For 64-bit, align to 4GB to allow runtimes to use the entire 32-bit
      address space for 32-bit pointers.  On 32-bit use 4MB, which is the
      traditional x86 minimum load location, likely to avoid historically
      requiring a 4MB page table entry when only a portion of the first 4MB
      would be used (since the NULL address is avoided).
      
      Link: http://lkml.kernel.org/r/1498154792-49952-4-git-send-email-keescook@chromium.orgSigned-off-by: default avatarKees Cook <keescook@chromium.org>
      Tested-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Acked-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Pratyush Anand <panand@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      72a333a0
    • Kees Cook's avatar
      arm64: move ELF_ET_DYN_BASE to 4GB / 4MB · 43cf90f7
      Kees Cook authored
      commit 02445990 upstream.
      
      Now that explicitly executed loaders are loaded in the mmap region, we
      have more freedom to decide where we position PIE binaries in the
      address space to avoid possible collisions with mmap or stack regions.
      
      For 64-bit, align to 4GB to allow runtimes to use the entire 32-bit
      address space for 32-bit pointers.  On 32-bit use 4MB, to match ARM.
      This could be 0x8000, the standard ET_EXEC load address, but that is
      needlessly close to the NULL address, and anyone running arm compat PIE
      will have an MMU, so the tight mapping is not needed.
      
      Link: http://lkml.kernel.org/r/1498251600-132458-4-git-send-email-keescook@chromium.orgSigned-off-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      43cf90f7
    • Kees Cook's avatar
      arm: move ELF_ET_DYN_BASE to 4MB · d2471b5e
      Kees Cook authored
      commit 6a9af90a upstream.
      
      Now that explicitly executed loaders are loaded in the mmap region, we
      have more freedom to decide where we position PIE binaries in the
      address space to avoid possible collisions with mmap or stack regions.
      
      4MB is chosen here mainly to have parity with x86, where this is the
      traditional minimum load location, likely to avoid historically
      requiring a 4MB page table entry when only a portion of the first 4MB
      would be used (since the NULL address is avoided).
      
      For ARM the position could be 0x8000, the standard ET_EXEC load address,
      but that is needlessly close to the NULL address, and anyone running PIE
      on 32-bit ARM will have an MMU, so the tight mapping is not needed.
      
      Link: http://lkml.kernel.org/r/1498154792-49952-2-git-send-email-keescook@chromium.orgSigned-off-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Pratyush Anand <panand@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
      Cc: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Qualys Security Advisory <qsa@qualys.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d2471b5e
    • Kees Cook's avatar
      binfmt_elf: use ELF_ET_DYN_BASE only for PIE · 7eb968cd
      Kees Cook authored
      commit eab09532 upstream.
      
      The ELF_ET_DYN_BASE position was originally intended to keep loaders
      away from ET_EXEC binaries.  (For example, running "/lib/ld-linux.so.2
      /bin/cat" might cause the subsequent load of /bin/cat into where the
      loader had been loaded.)
      
      With the advent of PIE (ET_DYN binaries with an INTERP Program Header),
      ELF_ET_DYN_BASE continued to be used since the kernel was only looking
      at ET_DYN.  However, since ELF_ET_DYN_BASE is traditionally set at the
      top 1/3rd of the TASK_SIZE, a substantial portion of the address space
      is unused.
      
      For 32-bit tasks when RLIMIT_STACK is set to RLIM_INFINITY, programs are
      loaded above the mmap region.  This means they can be made to collide
      (CVE-2017-1000370) or nearly collide (CVE-2017-1000371) with
      pathological stack regions.
      
      Lowering ELF_ET_DYN_BASE solves both by moving programs below the mmap
      region in all cases, and will now additionally avoid programs falling
      back to the mmap region by enforcing MAP_FIXED for program loads (i.e.
      if it would have collided with the stack, now it will fail to load
      instead of falling back to the mmap region).
      
      To allow for a lower ELF_ET_DYN_BASE, loaders (ET_DYN without INTERP)
      are loaded into the mmap region, leaving space available for either an
      ET_EXEC binary with a fixed location or PIE being loaded into mmap by
      the loader.  Only PIE programs are loaded offset from ELF_ET_DYN_BASE,
      which means architectures can now safely lower their values without risk
      of loaders colliding with their subsequently loaded programs.
      
      For 64-bit, ELF_ET_DYN_BASE is best set to 4GB to allow runtimes to use
      the entire 32-bit address space for 32-bit pointers.
      
      Thanks to PaX Team, Daniel Micay, and Rik van Riel for inspiration and
      suggestions on how to implement this solution.
      
      Fixes: d1fd836d ("mm: split ET_DYN ASLR from mmap ASLR")
      Link: http://lkml.kernel.org/r/20170621173201.GA114489@beastSigned-off-by: default avatarKees Cook <keescook@chromium.org>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: Qualys Security Advisory <qsa@qualys.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pratyush Anand <panand@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7eb968cd
    • Cyril Bur's avatar
      checkpatch: silence perl 5.26.0 unescaped left brace warnings · 4544e9eb
      Cyril Bur authored
      commit 8d81ae05 upstream.
      
      As of perl 5, version 26, subversion 0 (v5.26.0) some new warnings have
      occurred when running checkpatch.
      
      Unescaped left brace in regex is deprecated here (and will be fatal in
      Perl 5.30), passed through in regex; marked by <-- HERE in m/^(.\s*){
      <-- HERE \s*/ at scripts/checkpatch.pl line 3544.
      
      Unescaped left brace in regex is deprecated here (and will be fatal in
      Perl 5.30), passed through in regex; marked by <-- HERE in m/^(.\s*){
      <-- HERE \s*/ at scripts/checkpatch.pl line 3885.
      
      Unescaped left brace in regex is deprecated here (and will be fatal in
      Perl 5.30), passed through in regex; marked by <-- HERE in
      m/^(\+.*(?:do|\))){ <-- HERE / at scripts/checkpatch.pl line 4374.
      
      It seems perfectly reasonable to do as the warning suggests and simply
      escape the left brace in these three locations.
      
      Link: http://lkml.kernel.org/r/20170607060135.17384-1-cyrilbur@gmail.comSigned-off-by: default avatarCyril Bur <cyrilbur@gmail.com>
      Acked-by: default avatarJoe Perches <joe@perches.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4544e9eb
    • Sahitya Tummala's avatar
      fs/dcache.c: fix spin lockup issue on nlru->lock · 68b0f5d8
      Sahitya Tummala authored
      commit b17c070f upstream.
      
      __list_lru_walk_one() acquires nlru spin lock (nlru->lock) for longer
      duration if there are more number of items in the lru list.  As per the
      current code, it can hold the spin lock for upto maximum UINT_MAX
      entries at a time.  So if there are more number of items in the lru
      list, then "BUG: spinlock lockup suspected" is observed in the below
      path:
      
        spin_bug+0x90
        do_raw_spin_lock+0xfc
        _raw_spin_lock+0x28
        list_lru_add+0x28
        dput+0x1c8
        path_put+0x20
        terminate_walk+0x3c
        path_lookupat+0x100
        filename_lookup+0x6c
        user_path_at_empty+0x54
        SyS_faccessat+0xd0
        el0_svc_naked+0x24
      
      This nlru->lock is acquired by another CPU in this path -
      
        d_lru_shrink_move+0x34
        dentry_lru_isolate_shrink+0x48
        __list_lru_walk_one.isra.10+0x94
        list_lru_walk_node+0x40
        shrink_dcache_sb+0x60
        do_remount_sb+0xbc
        do_emergency_remount+0xb0
        process_one_work+0x228
        worker_thread+0x2e0
        kthread+0xf4
        ret_from_fork+0x10
      
      Fix this lockup by reducing the number of entries to be shrinked from
      the lru list to 1024 at once.  Also, add cond_resched() before
      processing the lru list again.
      
      Link: http://marc.info/?t=149722864900001&r=1&w=2
      Link: http://lkml.kernel.org/r/1498707575-2472-1-git-send-email-stummala@codeaurora.orgSigned-off-by: default avatarSahitya Tummala <stummala@codeaurora.org>
      Suggested-by: default avatarJan Kara <jack@suse.cz>
      Suggested-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Acked-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Alexander Polakov <apolyakov@beget.ru>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      68b0f5d8
    • Sahitya Tummala's avatar
      mm/list_lru.c: fix list_lru_count_node() to be race free · 2d0db02d
      Sahitya Tummala authored
      commit 2c80cd57 upstream.
      
      list_lru_count_node() iterates over all memcgs to get the total number of
      entries on the node but it can race with memcg_drain_all_list_lrus(),
      which migrates the entries from a dead cgroup to another.  This can return
      incorrect number of entries from list_lru_count_node().
      
      Fix this by keeping track of entries per node and simply return it in
      list_lru_count_node().
      
      Link: http://lkml.kernel.org/r/1498707555-30525-1-git-send-email-stummala@codeaurora.orgSigned-off-by: default avatarSahitya Tummala <stummala@codeaurora.org>
      Acked-by: default avatarVladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Alexander Polakov <apolyakov@beget.ru>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2d0db02d
    • Marcin Nowakowski's avatar
      kernel/extable.c: mark core_kernel_text notrace · 717ce69e
      Marcin Nowakowski authored
      commit c0d80dda upstream.
      
      core_kernel_text is used by MIPS in its function graph trace processing,
      so having this method traced leads to an infinite set of recursive calls
      such as:
      
        Call Trace:
           ftrace_return_to_handler+0x50/0x128
           core_kernel_text+0x10/0x1b8
           prepare_ftrace_return+0x6c/0x114
           ftrace_graph_caller+0x20/0x44
           return_to_handler+0x10/0x30
           return_to_handler+0x0/0x30
           return_to_handler+0x0/0x30
           ftrace_ops_no_ops+0x114/0x1bc
           core_kernel_text+0x10/0x1b8
           core_kernel_text+0x10/0x1b8
           core_kernel_text+0x10/0x1b8
           ftrace_ops_no_ops+0x114/0x1bc
           core_kernel_text+0x10/0x1b8
           prepare_ftrace_return+0x6c/0x114
           ftrace_graph_caller+0x20/0x44
           (...)
      
      Mark the function notrace to avoid it being traced.
      
      Link: http://lkml.kernel.org/r/1498028607-6765-1-git-send-email-marcin.nowakowski@imgtec.comSigned-off-by: default avatarMarcin Nowakowski <marcin.nowakowski@imgtec.com>
      Reviewed-by: default avatarMasami Hiramatsu <mhiramat@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Meyer <thomas@m3y3r.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      717ce69e
    • Ben Hutchings's avatar
      tools/lib/lockdep: Reduce MAX_LOCK_DEPTH to avoid overflowing lock_chain/: Depth · 0d6ee0be
      Ben Hutchings authored
      commit 98dcea0c upstream.
      
      liblockdep has been broken since commit 75dd602a ("lockdep: Fix
      lock_chain::base size"), as that adds a check that MAX_LOCK_DEPTH is
      within the range of lock_chain::depth and in liblockdep it is much
      too large.
      
      That should have resulted in a compiler error, but didn't because:
      
      - the check uses ARRAY_SIZE(), which isn't yet defined in liblockdep
        so is assumed to be an (undeclared) function
      - putting a function call inside a BUILD_BUG_ON() expression quietly
        turns it into some nonsense involving a variable-length array
      
      It did produce a compiler warning, but I didn't notice because
      liblockdep already produces too many warnings if -Wall is enabled
      (which I'll fix shortly).
      
      Even before that commit, which reduced lock_chain::depth from 8 bits
      to 6, MAX_LOCK_DEPTH was too large.
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: a.p.zijlstra@chello.nl
      Link: http://lkml.kernel.org/r/20170525130005.5947-3-alexander.levin@verizon.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0d6ee0be
    • Helge Deller's avatar
      parisc/mm: Ensure IRQs are off in switch_mm() · b2914574
      Helge Deller authored
      commit 649aa242 upstream.
      
      This is because of commit f98db601 ("sched/core: Add switch_mm_irqs_off()
      and use it in the scheduler") in which switch_mm_irqs_off() is called by the
      scheduler, vs switch_mm() which is used by use_mm().
      
      This patch lets the parisc code mirror the x86 and powerpc code, ie. it
      disables interrupts in switch_mm(), and optimises the scheduler case by
      defining switch_mm_irqs_off().
      Signed-off-by: default avatarHelge Deller <deller@gmx.de>
      Signed-off-by: default avatarHelge Deller <deller@gmx.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2914574
    • Thomas Bogendoerfer's avatar
      parisc: DMA API: return error instead of BUG_ON for dma ops on non dma devs · 635a5822
      Thomas Bogendoerfer authored
      commit 33f9e024 upstream.
      
      Enabling parport pc driver on a B2600 (and probably other 64bit PARISC
      systems) produced following BUG:
      
      CPU: 0 PID: 1 Comm: swapper Not tainted 4.12.0-rc5-30198-g1132d5e7 #156
      task: 000000009e050000 task.stack: 000000009e04c000
      
           YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
      PSW: 00001000000001101111111100001111 Not tainted
      r00-03  000000ff0806ff0f 000000009e04c990 0000000040871b78 000000009e04cac0
      r04-07  0000000040c14de0 ffffffffffffffff 000000009e07f098 000000009d82d200
      r08-11  000000009d82d210 0000000000000378 0000000000000000 0000000040c345e0
      r12-15  0000000000000005 0000000040c345e0 0000000000000000 0000000040c9d5e0
      r16-19  0000000040c345e0 00000000f00001c4 00000000f00001bc 0000000000000061
      r20-23  000000009e04ce28 0000000000000010 0000000000000010 0000000040b89e40
      r24-27  0000000000000003 0000000000ffffff 000000009d82d210 0000000040c14de0
      r28-31  0000000000000000 000000009e04ca90 000000009e04cb40 0000000000000000
      sr00-03  0000000000000000 0000000000000000 0000000000000000 0000000000000000
      sr04-07  0000000000000000 0000000000000000 0000000000000000 0000000000000000
      
      IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000404aece0 00000000404aece4
       IIR: 03ffe01f    ISR: 0000000010340000  IOR: 000001781304cac8
       CPU:        0   CR30: 000000009e04c000 CR31: 00000000e2976de2
       ORIG_R28: 0000000000000200
       IAOQ[0]: sba_dma_supported+0x80/0xd0
       IAOQ[1]: sba_dma_supported+0x84/0xd0
       RP(r2): parport_pc_probe_port+0x178/0x1200
      
      Cause is a call to dma_coerce_mask_and_coherenet in parport_pc_probe_port,
      which PARISC DMA API doesn't handle very nicely. This commit gives back
      DMA_ERROR_CODE for DMA API calls, if device isn't capable of DMA
      transaction.
      Signed-off-by: default avatarThomas Bogendoerfer <tsbogend@alpha.franken.de>
      Signed-off-by: default avatarHelge Deller <deller@gmx.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      635a5822
    • Eric Biggers's avatar
      parisc: use compat_sys_keyctl() · f265641d
      Eric Biggers authored
      commit b0f94efd upstream.
      
      Architectures with a compat syscall table must put compat_sys_keyctl()
      in it, not sys_keyctl().  The parisc architecture was not doing this;
      fix it.
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Acked-by: default avatarHelge Deller <deller@gmx.de>
      Signed-off-by: default avatarHelge Deller <deller@gmx.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f265641d
    • Helge Deller's avatar
      parisc: Report SIGSEGV instead of SIGBUS when running out of stack · e18ca17b
      Helge Deller authored
      commit 24746231 upstream.
      
      When a process runs out of stack the parisc kernel wrongly faults with SIGBUS
      instead of the expected SIGSEGV signal.
      
      This example shows how the kernel faults:
      do_page_fault() command='a.out' type=15 address=0xfaac2000 in libc-2.24.so[f8308000+16c000]
      trap #15: Data TLB miss fault, vm_start = 0xfa2c2000, vm_end = 0xfaac2000
      
      The vma->vm_end value is the first address which does not belong to the vma, so
      adjust the check to include vma->vm_end to the range for which to send the
      SIGSEGV signal.
      
      This patch unbreaks building the debian libsigsegv package.
      Signed-off-by: default avatarHelge Deller <deller@gmx.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e18ca17b
    • Suzuki K Poulose's avatar
      irqchip/gic-v3: Fix out-of-bound access in gic_set_affinity · 97061646
      Suzuki K Poulose authored
      commit 866d7c1b upstream.
      
      The GICv3 driver doesn't check if the target CPU for gic_set_affinity
      is valid before going ahead and making the changes. This triggers the
      following splat with KASAN:
      
      [  141.189434] BUG: KASAN: global-out-of-bounds in gic_set_affinity+0x8c/0x140
      [  141.189704] Read of size 8 at addr ffff200009741d20 by task swapper/1/0
      [  141.189958]
      [  141.190158] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.12.0-rc7
      [  141.190458] Hardware name: Foundation-v8A (DT)
      [  141.190658] Call trace:
      [  141.190908] [<ffff200008089d70>] dump_backtrace+0x0/0x328
      [  141.191224] [<ffff20000808a1b4>] show_stack+0x14/0x20
      [  141.191507] [<ffff200008504c3c>] dump_stack+0xa4/0xc8
      [  141.191858] [<ffff20000826c19c>] print_address_description+0x13c/0x250
      [  141.192219] [<ffff20000826c5c8>] kasan_report+0x210/0x300
      [  141.192547] [<ffff20000826ad54>] __asan_load8+0x84/0x98
      [  141.192874] [<ffff20000854eeec>] gic_set_affinity+0x8c/0x140
      [  141.193158] [<ffff200008148b14>] irq_do_set_affinity+0x54/0xb8
      [  141.193473] [<ffff200008148d2c>] irq_set_affinity_locked+0x64/0xf0
      [  141.193828] [<ffff200008148e00>] __irq_set_affinity+0x48/0x78
      [  141.194158] [<ffff200008bc48a4>] arm_perf_starting_cpu+0x104/0x150
      [  141.194513] [<ffff2000080d73bc>] cpuhp_invoke_callback+0x17c/0x1f8
      [  141.194783] [<ffff2000080d94ec>] notify_cpu_starting+0x8c/0xb8
      [  141.195130] [<ffff2000080911ec>] secondary_start_kernel+0x15c/0x200
      [  141.195390] [<0000000080db81b4>] 0x80db81b4
      [  141.195603]
      [  141.195685] The buggy address belongs to the variable:
      [  141.196012]  __cpu_logical_map+0x200/0x220
      [  141.196176]
      [  141.196315] Memory state around the buggy address:
      [  141.196586]  ffff200009741c00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      [  141.196913]  ffff200009741c80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      [  141.197158] >ffff200009741d00: 00 00 00 00 fa fa fa fa 00 00 00 00 00 00 00 00
      [  141.197487]                                ^
      [  141.197758]  ffff200009741d80: 00 00 00 00 00 00 00 00 fa fa fa fa 00 00 00 00
      [  141.198060]  ffff200009741e00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      [  141.198358] ==================================================================
      [  141.198609] Disabling lock debugging due to kernel taint
      [  141.198961] CPU1: Booted secondary processor [410fd051]
      
      This patch adds the check to make sure the cpu is valid.
      
      Fixes: commit 021f6537 ("irqchip: gic-v3: Initial support for GICv3")
      Signed-off-by: default avatarSuzuki K Poulose <suzuki.poulose@arm.com>
      Signed-off-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      97061646
    • Srinivas Dasari's avatar
      cfg80211: Check if PMKID attribute is of expected size · 2d3c10e2
      Srinivas Dasari authored
      commit 9361df14 upstream.
      
      nla policy checks for only maximum length of the attribute data
      when the attribute type is NLA_BINARY. If userspace sends less
      data than specified, the wireless drivers may access illegal
      memory. When type is NLA_UNSPEC, nla policy check ensures that
      userspace sends minimum specified length number of bytes.
      
      Remove type assignment to NLA_BINARY from nla_policy of
      NL80211_ATTR_PMKID to make this NLA_UNSPEC and to make sure minimum
      WLAN_PMKID_LEN bytes are received from userspace with
      NL80211_ATTR_PMKID.
      
      Fixes: 67fbb16b ("nl80211: PMKSA caching support")
      Signed-off-by: default avatarSrinivas Dasari <dasaris@qti.qualcomm.com>
      Signed-off-by: default avatarJouni Malinen <jouni@qca.qualcomm.com>
      Signed-off-by: default avatarJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2d3c10e2
    • Srinivas Dasari's avatar
      cfg80211: Validate frequencies nested in NL80211_ATTR_SCAN_FREQUENCIES · 24d04107
      Srinivas Dasari authored
      commit d7f13f74 upstream.
      
      validate_scan_freqs() retrieves frequencies from attributes
      nested in the attribute NL80211_ATTR_SCAN_FREQUENCIES with
      nla_get_u32(), which reads 4 bytes from each attribute
      without validating the size of data received. Attributes
      nested in NL80211_ATTR_SCAN_FREQUENCIES don't have an nla policy.
      
      Validate size of each attribute before parsing to avoid potential buffer
      overread.
      
      Fixes: 2a519311 ("cfg80211/nl80211: scanning (and mac80211 update to use it)")
      Signed-off-by: default avatarSrinivas Dasari <dasaris@qti.qualcomm.com>
      Signed-off-by: default avatarJouni Malinen <jouni@qca.qualcomm.com>
      Signed-off-by: default avatarJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      24d04107
    • Srinivas Dasari's avatar
      cfg80211: Define nla_policy for NL80211_ATTR_LOCAL_MESH_POWER_MODE · 05bf0b6e
      Srinivas Dasari authored
      commit 8feb69c7 upstream.
      
      Buffer overread may happen as nl80211_set_station() reads 4 bytes
      from the attribute NL80211_ATTR_LOCAL_MESH_POWER_MODE without
      validating the size of data received when userspace sends less
      than 4 bytes of data with NL80211_ATTR_LOCAL_MESH_POWER_MODE.
      Define nla_policy for NL80211_ATTR_LOCAL_MESH_POWER_MODE to avoid
      the buffer overread.
      
      Fixes: 3b1c5a53 ("{cfg,nl}80211: mesh power mode primitives and userspace access")
      Signed-off-by: default avatarSrinivas Dasari <dasaris@qti.qualcomm.com>
      Signed-off-by: default avatarJouni Malinen <jouni@qca.qualcomm.com>
      Signed-off-by: default avatarJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      05bf0b6e
    • Arend van Spriel's avatar
      brcmfmac: fix possible buffer overflow in brcmf_cfg80211_mgmt_tx() · 4c7021c2
      Arend van Spriel authored
      commit 8f44c9a4 upstream.
      
      The lower level nl80211 code in cfg80211 ensures that "len" is between
      25 and NL80211_ATTR_FRAME (2304).  We subtract DOT11_MGMT_HDR_LEN (24) from
      "len" so thats's max of 2280.  However, the action_frame->data[] buffer is
      only BRCMF_FIL_ACTION_FRAME_SIZE (1800) bytes long so this memcpy() can
      overflow.
      
      	memcpy(action_frame->data, &buf[DOT11_MGMT_HDR_LEN],
      	       le16_to_cpu(action_frame->len));
      
      Fixes: 18e2f61d ("brcmfmac: P2P action frame tx.")
      Reported-by: default avatar"freenerguo(郭大兴)" <freenerguo@tencent.com>
      Signed-off-by: default avatarArend van Spriel <arend.vanspriel@broadcom.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4c7021c2
    • Sowmini Varadhan's avatar
      rds: tcp: use sock_create_lite() to create the accept socket · 9618eb4a
      Sowmini Varadhan authored
      commit 0933a578 upstream.
      
      There are two problems with calling sock_create_kern() from
      rds_tcp_accept_one()
      1. it sets up a new_sock->sk that is wasteful, because this ->sk
         is going to get replaced by inet_accept() in the subsequent ->accept()
      2. The new_sock->sk is a leaked reference in sock_graft() which
         expects to find a null parent->sk
      
      Avoid these problems by calling sock_create_lite().
      Signed-off-by: default avatarSowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: default avatarSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9618eb4a
    • Nikolay Aleksandrov's avatar
      vrf: fix bug_on triggered by rx when destroying a vrf · 89e7f17f
      Nikolay Aleksandrov authored
      commit f630c38e upstream.
      
      When destroying a VRF device we cleanup the slaves in its ndo_uninit()
      function, but that causes packets to be switched (skb->dev == vrf being
      destroyed) even though we're pass the point where the VRF should be
      receiving any packets while it is being dismantled. This causes a BUG_ON
      to trigger if we have raw sockets (trace below).
      The reason is that the inetdev of the VRF has been destroyed but we're
      still sending packets up the stack with it, so let's free the slaves in
      the dellink callback as David Ahern suggested.
      
      Note that this fix doesn't prevent packets from going up when the VRF
      device is admin down.
      
      [   35.631371] ------------[ cut here ]------------
      [   35.631603] kernel BUG at net/ipv4/fib_frontend.c:285!
      [   35.631854] invalid opcode: 0000 [#1] SMP
      [   35.631977] Modules linked in:
      [   35.632081] CPU: 2 PID: 22 Comm: ksoftirqd/2 Not tainted 4.12.0-rc7+ #45
      [   35.632247] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
      [   35.632477] task: ffff88005ad68000 task.stack: ffff88005ad64000
      [   35.632632] RIP: 0010:fib_compute_spec_dst+0xfc/0x1ee
      [   35.632769] RSP: 0018:ffff88005ad67978 EFLAGS: 00010202
      [   35.632910] RAX: 0000000000000001 RBX: ffff880059a7f200 RCX: 0000000000000000
      [   35.633084] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff82274af0
      [   35.633256] RBP: ffff88005ad679f8 R08: 000000000001ef70 R09: 0000000000000046
      [   35.633430] R10: ffff88005ad679f8 R11: ffff880037731cb0 R12: 0000000000000001
      [   35.633603] R13: ffff8800599e3000 R14: 0000000000000000 R15: ffff8800599cb852
      [   35.634114] FS:  0000000000000000(0000) GS:ffff88005d900000(0000) knlGS:0000000000000000
      [   35.634306] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   35.634456] CR2: 00007f3563227095 CR3: 000000000201d000 CR4: 00000000000406e0
      [   35.634632] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   35.634865] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [   35.635055] Call Trace:
      [   35.635271]  ? __lock_acquire+0xf0d/0x1117
      [   35.635522]  ipv4_pktinfo_prepare+0x82/0x151
      [   35.635831]  raw_rcv_skb+0x17/0x3c
      [   35.636062]  raw_rcv+0xe5/0xf7
      [   35.636287]  raw_local_deliver+0x169/0x1d9
      [   35.636534]  ip_local_deliver_finish+0x87/0x1c4
      [   35.636820]  ip_local_deliver+0x63/0x7f
      [   35.637058]  ip_rcv_finish+0x340/0x3a1
      [   35.637295]  ip_rcv+0x314/0x34a
      [   35.637525]  __netif_receive_skb_core+0x49f/0x7c5
      [   35.637780]  ? lock_acquire+0x13f/0x1d7
      [   35.638018]  ? lock_acquire+0x15e/0x1d7
      [   35.638259]  __netif_receive_skb+0x1e/0x94
      [   35.638502]  ? __netif_receive_skb+0x1e/0x94
      [   35.638748]  netif_receive_skb_internal+0x74/0x300
      [   35.639002]  ? dev_gro_receive+0x2ed/0x411
      [   35.639246]  ? lock_is_held_type+0xc4/0xd2
      [   35.639491]  napi_gro_receive+0x105/0x1a0
      [   35.639736]  receive_buf+0xc32/0xc74
      [   35.639965]  ? detach_buf+0x67/0x153
      [   35.640201]  ? virtqueue_get_buf_ctx+0x120/0x176
      [   35.640453]  virtnet_poll+0x128/0x1c5
      [   35.640690]  net_rx_action+0x103/0x343
      [   35.640932]  __do_softirq+0x1c7/0x4b7
      [   35.641171]  run_ksoftirqd+0x23/0x5c
      [   35.641403]  smpboot_thread_fn+0x24f/0x26d
      [   35.641646]  ? sort_range+0x22/0x22
      [   35.641878]  kthread+0x129/0x131
      [   35.642104]  ? __list_add+0x31/0x31
      [   35.642335]  ? __list_add+0x31/0x31
      [   35.642568]  ret_from_fork+0x2a/0x40
      [   35.642804] Code: 05 bd 87 a3 00 01 e8 1f ef 98 ff 4d 85 f6 48 c7 c7 f0 4a 27 82 41 0f 94 c4 31 c9 31 d2 41 0f b6 f4 e8 04 71 a1 ff 45 84 e4 74 02 <0f> 0b 0f b7 93 c4 00 00 00 4d 8b a5 80 05 00 00 48 03 93 d0 00
      [   35.644342] RIP: fib_compute_spec_dst+0xfc/0x1ee RSP: ffff88005ad67978
      
      Fixes: 193125db ("net: Introduce VRF device driver")
      Reported-by: default avatarChris Cormier <chriscormier@cumulusnetworks.com>
      Signed-off-by: default avatarNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Acked-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      [backport to 4.4 - gregkh]
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      89e7f17f
    • David Ahern's avatar
      net: ipv6: Compare lwstate in detecting duplicate nexthops · eb7bef1d
      David Ahern authored
      commit f06b7549 upstream.
      
      Lennert reported a failure to add different mpls encaps in a multipath
      route:
      
        $ ip -6 route add 1234::/16 \
              nexthop encap mpls 10 via fe80::1 dev ens3 \
              nexthop encap mpls 20 via fe80::1 dev ens3
        RTNETLINK answers: File exists
      
      The problem is that the duplicate nexthop detection does not compare
      lwtunnel configuration. Add it.
      
      Fixes: 19e42e45 ("ipv6: support for fib route lwtunnel encap attributes")
      Signed-off-by: default avatarDavid Ahern <dsahern@gmail.com>
      Reported-by: default avatarJoão Taveira Araújo <joao.taveira@gmail.com>
      Reported-by: default avatarLennert Buytenhek <buytenh@wantstofly.org>
      Acked-by: default avatarRoopa Prabhu <roopa@cumulusnetworks.com>
      Tested-by: default avatarLennert Buytenhek <buytenh@wantstofly.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      eb7bef1d
    • Sabrina Dubroca's avatar
      ipv6: dad: don't remove dynamic addresses if link is down · 0c32b01e
      Sabrina Dubroca authored
      commit ec8add2a upstream.
      
      Currently, when the link for $DEV is down, this command succeeds but the
      address is removed immediately by DAD (1):
      
          ip addr add 1111::12/64 dev $DEV valid_lft 3600 preferred_lft 1800
      
      In the same situation, this will succeed and not remove the address (2):
      
          ip addr add 1111::12/64 dev $DEV
          ip addr change 1111::12/64 dev $DEV valid_lft 3600 preferred_lft 1800
      
      The comment in addrconf_dad_begin() when !IF_READY makes it look like
      this is the intended behavior, but doesn't explain why:
      
           * If the device is not ready:
           * - keep it tentative if it is a permanent address.
           * - otherwise, kill it.
      
      We clearly cannot prevent userspace from doing (2), but we can make (1)
      work consistently with (2).
      
      addrconf_dad_stop() is only called in two cases: if DAD failed, or to
      skip DAD when the link is down. In that second case, the fix is to avoid
      deleting the address, like we already do for permanent addresses.
      
      Fixes: 3c21edbd ("[IPV6]: Defer IPv6 device initialization until the link becomes ready.")
      Signed-off-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0c32b01e
    • Michal Kubeček's avatar
      net: handle NAPI_GRO_FREE_STOLEN_HEAD case also in napi_frags_finish() · 38ae32c9
      Michal Kubeček authored
      commit e44699d2 upstream.
      
      Recently I started seeing warnings about pages with refcount -1. The
      problem was traced to packets being reused after their head was merged into
      a GRO packet by skb_gro_receive(). While bisecting the issue pointed to
      commit c21b48cc ("net: adjust skb->truesize in ___pskb_trim()") and
      I have never seen it on a kernel with it reverted, I believe the real
      problem appeared earlier when the option to merge head frag in GRO was
      implemented.
      
      Handling NAPI_GRO_FREE_STOLEN_HEAD state was only added to GRO_MERGED_FREE
      branch of napi_skb_finish() so that if the driver uses napi_gro_frags()
      and head is merged (which in my case happens after the skb_condense()
      call added by the commit mentioned above), the skb is reused including the
      head that has been merged. As a result, we release the page reference
      twice and eventually end up with negative page refcount.
      
      To fix the problem, handle NAPI_GRO_FREE_STOLEN_HEAD in napi_frags_finish()
      the same way it's done in napi_skb_finish().
      
      Fixes: d7e8883c ("net: make GRO aware of skb->head_frag")
      Signed-off-by: default avatarMichal Kubecek <mkubecek@suse.cz>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      38ae32c9
    • Daniel Borkmann's avatar
      bpf: prevent leaking pointer via xadd on unpriviledged · 1a4f13e0
      Daniel Borkmann authored
      commit 6bdf6abc upstream.
      
      Leaking kernel addresses on unpriviledged is generally disallowed,
      for example, verifier rejects the following:
      
        0: (b7) r0 = 0
        1: (18) r2 = 0xffff897e82304400
        3: (7b) *(u64 *)(r1 +48) = r2
        R2 leaks addr into ctx
      
      Doing pointer arithmetic on them is also forbidden, so that they
      don't turn into unknown value and then get leaked out. However,
      there's xadd as a special case, where we don't check the src reg
      for being a pointer register, e.g. the following will pass:
      
        0: (b7) r0 = 0
        1: (7b) *(u64 *)(r1 +48) = r0
        2: (18) r2 = 0xffff897e82304400 ; map
        4: (db) lock *(u64 *)(r1 +48) += r2
        5: (95) exit
      
      We could store the pointer into skb->cb, loose the type context,
      and then read it out from there again to leak it eventually out
      of a map value. Or more easily in a different variant, too:
      
         0: (bf) r6 = r1
         1: (7a) *(u64 *)(r10 -8) = 0
         2: (bf) r2 = r10
         3: (07) r2 += -8
         4: (18) r1 = 0x0
         6: (85) call bpf_map_lookup_elem#1
         7: (15) if r0 == 0x0 goto pc+3
         R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R6=ctx R10=fp
         8: (b7) r3 = 0
         9: (7b) *(u64 *)(r0 +0) = r3
        10: (db) lock *(u64 *)(r0 +0) += r6
        11: (b7) r0 = 0
        12: (95) exit
      
        from 7 to 11: R0=inv,min_value=0,max_value=0 R6=ctx R10=fp
        11: (b7) r0 = 0
        12: (95) exit
      
      Prevent this by checking xadd src reg for pointer types. Also
      add a couple of test cases related to this.
      
      Fixes: 1be7f75d ("bpf: enable non-root eBPF programs")
      Fixes: 17a52670 ("bpf: verifier (add verifier core)")
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Acked-by: default avatarEdward Cree <ecree@solarflare.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1a4f13e0
    • Eric Dumazet's avatar
      net: prevent sign extension in dev_get_stats() · d598f7ff
      Eric Dumazet authored
      commit 6f64ec74 upstream.
      
      Similar to the fix provided by Dominik Heidler in commit
      9b3dc0a1 ("l2tp: cast l2tp traffic counter to unsigned")
      we need to take care of 32bit kernels in dev_get_stats().
      
      When using atomic_long_read(), we add a 'long' to u64 and
      might misinterpret high order bit, unless we cast to unsigned.
      
      Fixes: caf586e5 ("net: add a core netdev->rx_dropped counter")
      Fixes: 015f0688 ("net: net: add a core netdev->tx_dropped counter")
      Fixes: 6e7333d3 ("net: add rx_nohandler stat counter")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Jarod Wilson <jarod@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d598f7ff
    • WANG Cong's avatar
      tcp: reset sk_rx_dst in tcp_disconnect() · 32a44f1b
      WANG Cong authored
      commit d747a7a5 upstream.
      
      We have to reset the sk->sk_rx_dst when we disconnect a TCP
      connection, because otherwise when we re-connect it this
      dst reference is simply overridden in tcp_finish_connect().
      
      This fixes a dst leak which leads to a loopback dev refcnt
      leak. It is a long-standing bug, Kevin reported a very similar
      (if not same) bug before. Thanks to Andrei for providing such
      a reliable reproducer which greatly narrows down the problem.
      
      Fixes: 41063e9d ("ipv4: Early TCP socket demux.")
      Reported-by: default avatarAndrei Vagin <avagin@gmail.com>
      Reported-by: default avatarKevin Xu <kaiwen.xu@hulu.com>
      Signed-off-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      32a44f1b
    • Richard Cochran's avatar
      net: dp83640: Avoid NULL pointer dereference. · ccff2f4a
      Richard Cochran authored
      commit db9d8b29 upstream.
      
      The function, skb_complete_tx_timestamp(), used to allow passing in a
      NULL pointer for the time stamps, but that was changed in commit
      62bccb8c ("net-timestamp: Make the
      clone operation stand-alone from phy timestamping"), and the existing
      call sites, all of which are in the dp83640 driver, were fixed up.
      
      Even though the kernel-doc was subsequently updated in commit
      7a76a021 ("net-timestamp: Update
      skb_complete_tx_timestamp comment"), still a bug fix from Manfred
      Rudigier came into the driver using the old semantics.  Probably
      Manfred derived that patch from an older kernel version.
      
      This fix should be applied to the stable trees as well.
      
      Fixes: 81e8f2e9 ("net: dp83640: Fix tx timestamp overflow handling.")
      Signed-off-by: default avatarRichard Cochran <richardcochran@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ccff2f4a
    • WANG Cong's avatar
      ipv6: avoid unregistering inet6_dev for loopback · 6a87cca3
      WANG Cong authored
      commit 60abc0be upstream.
      
      The per netns loopback_dev->ip6_ptr is unregistered and set to
      NULL when its mtu is set to smaller than IPV6_MIN_MTU, this
      leads to that we could set rt->rt6i_idev NULL after a
      rt6_uncached_list_flush_dev() and then crash after another
      call.
      
      In this case we should just bring its inet6_dev down, rather
      than unregistering it, at least prior to commit 176c39af
      ("netns: fix addrconf_ifdown kernel panic") we always
      override the case for loopback.
      
      Thanks a lot to Andrey for finding a reliable reproducer.
      
      Fixes: 176c39af ("netns: fix addrconf_ifdown kernel panic")
      Reported-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Daniel Lezcano <dlezcano@fr.ibm.com>
      Cc: David Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: default avatarDavid Ahern <dsahern@gmail.com>
      Tested-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6a87cca3
    • Zach Brown's avatar
      net/phy: micrel: configure intterupts after autoneg workaround · f71e5140
      Zach Brown authored
      commit b866203d upstream.
      
      The commit ("net/phy: micrel: Add workaround for bad autoneg") fixes an
      autoneg failure case by resetting the hardware. This turns off
      intterupts. Things will work themselves out if the phy polls, as it will
      figure out it's state during a poll. However if the phy uses only
      intterupts, the phy will stall, since interrupts are off. This patch
      fixes the issue by calling config_intr after resetting the phy.
      
      Fixes: d2fd719b ("net/phy: micrel: Add workaround for bad autoneg ")
      Signed-off-by: default avatarZach Brown <zach.brown@ni.com>
      Reviewed-by: default avatarAndrew Lunn <andrew@lunn.ch>
      Reviewed-by: default avatarFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f71e5140
    • Gao Feng's avatar
      net: sched: Fix one possible panic when no destroy callback · c485792e
      Gao Feng authored
      commit c1a4872e upstream.
      
      When qdisc fail to init, qdisc_create would invoke the destroy callback
      to cleanup. But there is no check if the callback exists really. So it
      would cause the panic if there is no real destroy callback like the qdisc
      codel, fq, and so on.
      
      Take codel as an example following:
      When a malicious user constructs one invalid netlink msg, it would cause
      codel_init->codel_change->nla_parse_nested failed.
      Then kernel would invoke the destroy callback directly but qdisc codel
      doesn't define one. It causes one panic as a result.
      
      Now add one the check for destroy to avoid the possible panic.
      
      Fixes: 87b60cfa ("net_sched: fix error recovery at qdisc creation")
      Signed-off-by: default avatarGao Feng <gfree.wind@vip.163.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c485792e
    • Eric Dumazet's avatar
      net_sched: fix error recovery at qdisc creation · 0be4c96e
      Eric Dumazet authored
      commit 87b60cfa upstream.
      
      Dmitry reported uses after free in qdisc code [1]
      
      The problem here is that ops->init() can return an error.
      
      qdisc_create_dflt() then call ops->destroy(),
      while qdisc_create() does _not_ call it.
      
      Four qdisc chose to call their own ops->destroy(), assuming their caller
      would not.
      
      This patch makes sure qdisc_create() calls ops->destroy()
      and fixes the four qdisc to avoid double free.
      
      [1]
      BUG: KASAN: use-after-free in mq_destroy+0x242/0x290 net/sched/sch_mq.c:33 at addr ffff8801d415d440
      Read of size 8 by task syz-executor2/5030
      CPU: 0 PID: 5030 Comm: syz-executor2 Not tainted 4.3.5-smp-DEV #119
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
       0000000000000046 ffff8801b435b870 ffffffff81bbbed4 ffff8801db000400
       ffff8801d415d440 ffff8801d415dc40 ffff8801c4988510 ffff8801b435b898
       ffffffff816682b1 ffff8801b435b928 ffff8801d415d440 ffff8801c49880c0
      Call Trace:
       [<ffffffff81bbbed4>] __dump_stack lib/dump_stack.c:15 [inline]
       [<ffffffff81bbbed4>] dump_stack+0x6c/0x98 lib/dump_stack.c:51
       [<ffffffff816682b1>] kasan_object_err+0x21/0x70 mm/kasan/report.c:158
       [<ffffffff81668524>] print_address_description mm/kasan/report.c:196 [inline]
       [<ffffffff81668524>] kasan_report_error+0x1b4/0x4b0 mm/kasan/report.c:285
       [<ffffffff81668953>] kasan_report mm/kasan/report.c:305 [inline]
       [<ffffffff81668953>] __asan_report_load8_noabort+0x43/0x50 mm/kasan/report.c:326
       [<ffffffff82527b02>] mq_destroy+0x242/0x290 net/sched/sch_mq.c:33
       [<ffffffff82524bdd>] qdisc_destroy+0x12d/0x290 net/sched/sch_generic.c:953
       [<ffffffff82524e30>] qdisc_create_dflt+0xf0/0x120 net/sched/sch_generic.c:848
       [<ffffffff8252550d>] attach_default_qdiscs net/sched/sch_generic.c:1029 [inline]
       [<ffffffff8252550d>] dev_activate+0x6ad/0x880 net/sched/sch_generic.c:1064
       [<ffffffff824b1db1>] __dev_open+0x221/0x320 net/core/dev.c:1403
       [<ffffffff824b24ce>] __dev_change_flags+0x15e/0x3e0 net/core/dev.c:6858
       [<ffffffff824b27de>] dev_change_flags+0x8e/0x140 net/core/dev.c:6926
       [<ffffffff824f5bf6>] dev_ifsioc+0x446/0x890 net/core/dev_ioctl.c:260
       [<ffffffff824f61fa>] dev_ioctl+0x1ba/0xb80 net/core/dev_ioctl.c:546
       [<ffffffff82430509>] sock_do_ioctl+0x99/0xb0 net/socket.c:879
       [<ffffffff82430d30>] sock_ioctl+0x2a0/0x390 net/socket.c:958
       [<ffffffff816f3b68>] vfs_ioctl fs/ioctl.c:44 [inline]
       [<ffffffff816f3b68>] do_vfs_ioctl+0x8a8/0xe50 fs/ioctl.c:611
       [<ffffffff816f41a4>] SYSC_ioctl fs/ioctl.c:626 [inline]
       [<ffffffff816f41a4>] SyS_ioctl+0x94/0xc0 fs/ioctl.c:617
       [<ffffffff8123e357>] entry_SYSCALL_64_fastpath+0x12/0x17
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0be4c96e
  2. 15 Jul, 2017 2 commits