1. 06 Apr, 2019 24 commits
    • Linus Torvalds's avatar
      Merge branch 'parisc-5.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux · 373c3925
      Linus Torvalds authored
      Pull parisc fixes from Helge Deller:
       "A 32-bit boot regression fix introduced in the merge window, a QEMU
        detection fix and two fixes by Sven regarding ptrace & kprobes"
      
      * 'parisc-5.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
        parisc: Detect QEMU earlier in boot process
        parisc: also set iaoq_b in instruction_pointer_set()
        parisc: regs_return_value() should return gpr28
        Revert: parisc: Use F_EXTEND() macro in iosapic code
      373c3925
    • Helge Deller's avatar
      parisc: Detect QEMU earlier in boot process · d006e95b
      Helge Deller authored
      While adding LASI support to QEMU, I noticed that the QEMU detection in
      the kernel happens much too late. For example, when a LASI chip is found
      by the kernel, it registers the LASI LED driver as well.  But when we
      run on QEMU it makes sense to avoid spending unnecessary CPU cycles, so
      we need to access the running_on_QEMU flag earlier than before.
      
      This patch now makes the QEMU detection the fist task of the Linux
      kernel by moving it to where the kernel enters the C-coding.
      
      Fixes: 310d8278 ("parisc: qemu idle sleep support")
      Signed-off-by: default avatarHelge Deller <deller@gmx.de>
      Cc: stable@vger.kernel.org # v4.14+
      d006e95b
    • Sven Schnelle's avatar
      parisc: also set iaoq_b in instruction_pointer_set() · f324fa58
      Sven Schnelle authored
      When setting the instruction pointer on PA-RISC we also need
      to set the back of the instruction queue to the new offset, otherwise
      we will execute on instruction from the new location, and jumping
      back to the old location stored in iaoq_b.
      Signed-off-by: default avatarSven Schnelle <svens@stackframe.org>
      Signed-off-by: default avatarHelge Deller <deller@gmx.de>
      Fixes: 75ebedf1 ("parisc: Add HAVE_REGS_AND_STACK_ACCESS_API feature")
      Cc: stable@vger.kernel.org # 4.19+
      f324fa58
    • Sven Schnelle's avatar
      parisc: regs_return_value() should return gpr28 · 45efd871
      Sven Schnelle authored
      While working on kretprobes for PA-RISC I was wondering while the
      kprobes sanity test always fails on kretprobes. This is caused by
      returning gpr20 instead of gpr28.
      Signed-off-by: default avatarSven Schnelle <svens@stackframe.org>
      Signed-off-by: default avatarHelge Deller <deller@gmx.de>
      Cc: stable@vger.kernel.org # 4.14+
      45efd871
    • Helge Deller's avatar
      Revert: parisc: Use F_EXTEND() macro in iosapic code · c2f8d7cb
      Helge Deller authored
      Revert parts of commit 97d7e2e3 ("parisc: Use F_EXTEND() macro in
      iosapic code"). It breaks booting the 32-bit kernel on some machines.
      Reported-by: default avatarSven Schnelle <svens@stackframe.org>
      Tested-by: default avatarSven Schnelle <svens@stackframe.org>
      Fixes: 97d7e2e3 ("parisc: Use F_EXTEND() macro in iosapic code")
      Signed-off-by: default avatarHelge Deller <deller@gmx.de>
      c2f8d7cb
    • Kirill Smelkov's avatar
      fs: stream_open - opener for stream-like files so that read and write can run... · 10dce8af
      Kirill Smelkov authored
      fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
      
      Commit 9c225f26 ("vfs: atomic f_pos accesses as per POSIX") added
      locking for file.f_pos access and in particular made concurrent read and
      write not possible - now both those functions take f_pos lock for the
      whole run, and so if e.g. a read is blocked waiting for data, write will
      deadlock waiting for that read to complete.
      
      This caused regression for stream-like files where previously read and
      write could run simultaneously, but after that patch could not do so
      anymore. See e.g. commit 581d21a2 ("xenbus: fix deadlock on writes
      to /proc/xen/xenbus") which fixes such regression for particular case of
      /proc/xen/xenbus.
      
      The patch that added f_pos lock in 2014 did so to guarantee POSIX thread
      safety for read/write/lseek and added the locking to file descriptors of
      all regular files. In 2014 that thread-safety problem was not new as it
      was already discussed earlier in 2006.
      
      However even though 2006'th version of Linus's patch was adding f_pos
      locking "only for files that are marked seekable with FMODE_LSEEK (thus
      avoiding the stream-like objects like pipes and sockets)", the 2014
      version - the one that actually made it into the tree as 9c225f26 -
      is doing so irregardless of whether a file is seekable or not.
      
      See
      
          https://lore.kernel.org/lkml/53022DB1.4070805@gmail.com/
          https://lwn.net/Articles/180387
          https://lwn.net/Articles/180396
      
      for historic context.
      
      The reason that it did so is, probably, that there are many files that
      are marked non-seekable, but e.g. their read implementation actually
      depends on knowing current position to correctly handle the read. Some
      examples:
      
      	kernel/power/user.c		snapshot_read
      	fs/debugfs/file.c		u32_array_read
      	fs/fuse/control.c		fuse_conn_waiting_read + ...
      	drivers/hwmon/asus_atk0110.c	atk_debugfs_ggrp_read
      	arch/s390/hypfs/inode.c		hypfs_read_iter
      	...
      
      Despite that, many nonseekable_open users implement read and write with
      pure stream semantics - they don't depend on passed ppos at all. And for
      those cases where read could wait for something inside, it creates a
      situation similar to xenbus - the write could be never made to go until
      read is done, and read is waiting for some, potentially external, event,
      for potentially unbounded time -> deadlock.
      
      Besides xenbus, there are 14 such places in the kernel that I've found
      with semantic patch (see below):
      
      	drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write()
      	drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write()
      	drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write()
      	drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write()
      	net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write()
      	drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write()
      	drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write()
      	drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write()
      	net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write()
      	drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write()
      	drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write()
      	drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write()
      	drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write()
      	drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write()
      
      In addition to the cases above another regression caused by f_pos
      locking is that now FUSE filesystems that implement open with
      FOPEN_NONSEEKABLE flag, can no longer implement bidirectional
      stream-like files - for the same reason as above e.g. read can deadlock
      write locking on file.f_pos in the kernel.
      
      FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990 ("fuse:
      implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp
      in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and
      write routines not depending on current position at all, and with both
      read and write being potentially blocking operations:
      
      See
      
          https://github.com/libfuse/osspd
          https://lwn.net/Articles/308445
      
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510
      
      Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as
      "somewhat pipe-like files ..." with read handler not using offset.
      However that test implements only read without write and cannot exercise
      the deadlock scenario:
      
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216
      
      I've actually hit the read vs write deadlock for real while implementing
      my FUSE filesystem where there is /head/watch file, for which open
      creates separate bidirectional socket-like stream in between filesystem
      and its user with both read and write being later performed
      simultaneously. And there it is semantically not easy to split the
      stream into two separate read-only and write-only channels:
      
          https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169
      
      Let's fix this regression. The plan is:
      
      1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS -
         doing so would break many in-kernel nonseekable_open users which
         actually use ppos in read/write handlers.
      
      2. Add stream_open() to kernel to open stream-like non-seekable file
         descriptors. Read and write on such file descriptors would never use
         nor change ppos. And with that property on stream-like files read and
         write will be running without taking f_pos lock - i.e. read and write
         could be running simultaneously.
      
      3. With semantic patch search and convert to stream_open all in-kernel
         nonseekable_open users for which read and write actually do not
         depend on ppos and where there is no other methods in file_operations
         which assume @offset access.
      
      4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via
         steam_open if that bit is present in filesystem open reply.
      
         It was tempting to change fs/fuse/ open handler to use stream_open
         instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but
         grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
         and in particular GVFS which actually uses offset in its read and
         write handlers
      
      	https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481
      
         so if we would do such a change it will break a real user.
      
      5. Add stream_open and FOPEN_STREAM handling to stable kernels starting
         from v3.14+ (the kernel where 9c225f26 first appeared).
      
         This will allow to patch OSSPD and other FUSE filesystems that
         provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE
         in their open handler and this way avoid the deadlock on all kernel
         versions. This should work because fs/fuse/ ignores unknown open
         flags returned from a filesystem and so passing FOPEN_STREAM to a
         kernel that is not aware of this flag cannot hurt. In turn the kernel
         that is not aware of FOPEN_STREAM will be < v3.14 where just
         FOPEN_NONSEEKABLE is sufficient to implement streams without read vs
         write deadlock.
      
      This patch adds stream_open, converts /proc/xen/xenbus to it and adds
      semantic patch to automatically locate in-kernel places that are either
      required to be converted due to read vs write deadlock, or that are just
      safe to be converted because read and write do not use ppos and there
      are no other funky methods in file_operations.
      
      Regarding semantic patch I've verified each generated change manually -
      that it is correct to convert - and each other nonseekable_open instance
      left - that it is either not correct to convert there, or that it is not
      converted due to current stream_open.cocci limitations.
      
      The script also does not convert files that should be valid to convert,
      but that currently have .llseek = noop_llseek or generic_file_llseek for
      unknown reason despite file being opened with nonseekable_open (e.g.
      drivers/input/mousedev.c)
      
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Yongzhi Pan <panyongzhi@gmail.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Julia Lawall <Julia.Lawall@lip6.fr>
      Cc: Nikolaus Rath <Nikolaus@rath.org>
      Cc: Han-Wen Nienhuys <hanwen@google.com>
      Signed-off-by: Kirill Smelkov's avatarKirill Smelkov <kirr@nexedi.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      10dce8af
    • Linus Torvalds's avatar
      Merge tag 'rtc-5.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux · be76865d
      Linus Torvalds authored
      Pull RTC fixes from Alexandre Belloni:
      
       - Various alarm fixes for da9063, cros-ec and sh
      
       - sd3078 manufacturer name fix as this was introduced this cycle
      
      * tag 'rtc-5.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux:
        rtc: da9063: set uie_unsupported when relevant
        rtc: sd3078: fix manufacturer name
        rtc: sh: Fix invalid alarm warning for non-enabled alarm
        rtc: cros-ec: Fail suspend/resume if wake IRQ can't be configured
      be76865d
    • Linus Torvalds's avatar
      Merge branch 'akpm' (patches from Andrew) · f654f0fc
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "14 fixes"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        kernel/sysctl.c: fix out-of-bounds access when setting file-max
        mm/util.c: fix strndup_user() comment
        sh: fix multiple function definition build errors
        MAINTAINERS: add maintainer and replacing reviewer ARM/NUVOTON NPCM
        MAINTAINERS: fix bad pattern in ARM/NUVOTON NPCM
        mm: writeback: use exact memcg dirty counts
        psi: clarify the units used in pressure files
        mm/huge_memory.c: fix modifying of page protection by insert_pfn_pmd()
        hugetlbfs: fix memory leak for resv_map
        mm: fix vm_fault_t cast in VM_FAULT_GET_HINDEX()
        lib/lzo: fix bugs for very short or empty input
        include/linux/bitrev.h: fix constant bitrev
        kmemleak: powerpc: skip scanning holes in the .bss section
        lib/string.c: implement a basic bcmp
      f654f0fc
    • Will Deacon's avatar
      kernel/sysctl.c: fix out-of-bounds access when setting file-max · 9002b214
      Will Deacon authored
      Commit 32a5ad9c ("sysctl: handle overflow for file-max") hooked up
      min/max values for the file-max sysctl parameter via the .extra1 and
      .extra2 fields in the corresponding struct ctl_table entry.
      
      Unfortunately, the minimum value points at the global 'zero' variable,
      which is an int.  This results in a KASAN splat when accessed as a long
      by proc_doulongvec_minmax on 64-bit architectures:
      
        | BUG: KASAN: global-out-of-bounds in __do_proc_doulongvec_minmax+0x5d8/0x6a0
        | Read of size 8 at addr ffff2000133d1c20 by task systemd/1
        |
        | CPU: 0 PID: 1 Comm: systemd Not tainted 5.1.0-rc3-00012-g40b114779944 #2
        | Hardware name: linux,dummy-virt (DT)
        | Call trace:
        |  dump_backtrace+0x0/0x228
        |  show_stack+0x14/0x20
        |  dump_stack+0xe8/0x124
        |  print_address_description+0x60/0x258
        |  kasan_report+0x140/0x1a0
        |  __asan_report_load8_noabort+0x18/0x20
        |  __do_proc_doulongvec_minmax+0x5d8/0x6a0
        |  proc_doulongvec_minmax+0x4c/0x78
        |  proc_sys_call_handler.isra.19+0x144/0x1d8
        |  proc_sys_write+0x34/0x58
        |  __vfs_write+0x54/0xe8
        |  vfs_write+0x124/0x3c0
        |  ksys_write+0xbc/0x168
        |  __arm64_sys_write+0x68/0x98
        |  el0_svc_common+0x100/0x258
        |  el0_svc_handler+0x48/0xc0
        |  el0_svc+0x8/0xc
        |
        | The buggy address belongs to the variable:
        |  zero+0x0/0x40
        |
        | Memory state around the buggy address:
        |  ffff2000133d1b00: 00 00 00 00 00 00 00 00 fa fa fa fa 04 fa fa fa
        |  ffff2000133d1b80: fa fa fa fa 04 fa fa fa fa fa fa fa 04 fa fa fa
        | >ffff2000133d1c00: fa fa fa fa 04 fa fa fa fa fa fa fa 00 00 00 00
        |                                ^
        |  ffff2000133d1c80: fa fa fa fa 00 fa fa fa fa fa fa fa 00 00 00 00
        |  ffff2000133d1d00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      
      Fix the splat by introducing a unsigned long 'zero_ul' and using that
      instead.
      
      Link: http://lkml.kernel.org/r/20190403153409.17307-1-will.deacon@arm.com
      Fixes: 32a5ad9c ("sysctl: handle overflow for file-max")
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Acked-by: default avatarChristian Brauner <christian@brauner.io>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Matteo Croce <mcroce@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9002b214
    • Andrew Morton's avatar
      mm/util.c: fix strndup_user() comment · e9145521
      Andrew Morton authored
      The kerneldoc misdescribes strndup_user()'s return value.
      
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Timur Tabi <timur@freescale.com>
      Cc: Mihai Caraman <mihai.caraman@freescale.com>
      Cc: Kumar Gala <galak@kernel.crashing.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e9145521
    • Randy Dunlap's avatar
      sh: fix multiple function definition build errors · acaf892e
      Randy Dunlap authored
      Many of the sh CPU-types have their own plat_irq_setup() and
      arch_init_clk_ops() functions, so these same (empty) functions in
      arch/sh/boards/of-generic.c are not needed and cause build errors.
      
      If there is some case where these empty functions are needed, they can
      be retained by marking them as "__weak" while at the same time making
      builds that do not need them succeed.
      
      Fixes these build errors:
      
      arch/sh/boards/of-generic.o: In function `plat_irq_setup':
      (.init.text+0x134): multiple definition of `plat_irq_setup'
      arch/sh/kernel/cpu/sh2/setup-sh7619.o:(.init.text+0x30): first defined here
      arch/sh/boards/of-generic.o: In function `arch_init_clk_ops':
      (.init.text+0x118): multiple definition of `arch_init_clk_ops'
      arch/sh/kernel/cpu/sh2/clock-sh7619.o:(.init.text+0x0): first defined here
      
      Link: http://lkml.kernel.org/r/9ee4e0c5-f100-86a2-bd4d-1d3287ceab31@infradead.orgSigned-off-by: default avatarRandy Dunlap <rdunlap@infradead.org>
      Reported-by: default avatarkbuild test robot <lkp@intel.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      acaf892e
    • Tomer Maimon's avatar
      MAINTAINERS: add maintainer and replacing reviewer ARM/NUVOTON NPCM · 803cfadc
      Tomer Maimon authored
      Add Tali Perry as Nuvoton NPCM maintainer, replace Brendan Higgins
      Nuvoton NPCM reviewer with Benjamin Fair.
      
      Link: http://lkml.kernel.org/r/20190328235752.334462-2-tmaimon77@gmail.comSigned-off-by: default avatarTomer Maimon <tmaimon77@gmail.com>
      Reviewed-by: default avatarBrendan Higgins <brendanhiggins@google.com>
      Reviewed-by: default avatarBenjamin Fair <benjaminfair@google.com>
      Reviewed-by: default avatarMukesh Ojha <mojha@codeaurora.org>
      Cc: Joe Perches <joe@perches.com>
      Cc: Avi Fishman <avifishman70@gmail.com>
      Cc: Patrick Venture <venture@google.com>
      Cc: Nancy Yuen <yuenn@google.com>
      Cc: Tali Perry <tali.perry1@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      803cfadc
    • Tomer Maimon's avatar
      MAINTAINERS: fix bad pattern in ARM/NUVOTON NPCM · 166dbd93
      Tomer Maimon authored
      In the process of upstreaming architecture support for ARM/NUVOTON NPCM
      include/dt-bindings/clock/nuvoton,npcm7xx-clks.h was renamed
      include/dt-bindings/clock/nuvoton,npcm7xx-clock.h without updating
      MAINTAINERS.  This updates the MAINTAINERS pattern to match the new name
      of this file.
      
      Link: http://lkml.kernel.org/r/20190328235752.334462-1-tmaimon77@gmail.com
      Fixes: 6a498e06 ("MAINTAINERS: Add entry for the Nuvoton NPCM architecture")
      Signed-off-by: default avatarBrendan Higgins <brendanhiggins@google.com>
      Signed-off-by: default avatarTomer Maimon <tmaimon77@gmail.com>
      Reported-by: default avatarJoe Perches <joe@perches.com>
      Reviewed-by: default avatarBenjamin Fair <benjaminfair@google.com>
      Cc: Avi Fishman <avifishman70@gmail.com>
      Cc: Mukesh Ojha <mojha@codeaurora.org>
      Cc: Nancy Yuen <yuenn@google.com>
      Cc: Patrick Venture <venture@google.com>
      Cc: Tali Perry <tali.perry1@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      166dbd93
    • Greg Thelen's avatar
      mm: writeback: use exact memcg dirty counts · 0b3d6e6f
      Greg Thelen authored
      Since commit a983b5eb ("mm: memcontrol: fix excessive complexity in
      memory.stat reporting") memcg dirty and writeback counters are managed
      as:
      
       1) per-memcg per-cpu values in range of [-32..32]
      
       2) per-memcg atomic counter
      
      When a per-cpu counter cannot fit in [-32..32] it's flushed to the
      atomic.  Stat readers only check the atomic.  Thus readers such as
      balance_dirty_pages() may see a nontrivial error margin: 32 pages per
      cpu.
      
      Assuming 100 cpus:
         4k x86 page_size:  13 MiB error per memcg
        64k ppc page_size: 200 MiB error per memcg
      
      Considering that dirty+writeback are used together for some decisions the
      errors double.
      
      This inaccuracy can lead to undeserved oom kills.  One nasty case is
      when all per-cpu counters hold positive values offsetting an atomic
      negative value (i.e.  per_cpu[*]=32, atomic=n_cpu*-32).
      balance_dirty_pages() only consults the atomic and does not consider
      throttling the next n_cpu*32 dirty pages.  If the file_lru is in the
      13..200 MiB range then there's absolutely no dirty throttling, which
      burdens vmscan with only dirty+writeback pages thus resorting to oom
      kill.
      
      It could be argued that tiny containers are not supported, but it's more
      subtle.  It's the amount the space available for file lru that matters.
      If a container has memory.max-200MiB of non reclaimable memory, then it
      will also suffer such oom kills on a 100 cpu machine.
      
      The following test reliably ooms without this patch.  This patch avoids
      oom kills.
      
        $ cat test
        mount -t cgroup2 none /dev/cgroup
        cd /dev/cgroup
        echo +io +memory > cgroup.subtree_control
        mkdir test
        cd test
        echo 10M > memory.max
        (echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo)
        (echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100)
      
        $ cat memcg-writeback-stress.c
        /*
         * Dirty pages from all but one cpu.
         * Clean pages from the non dirtying cpu.
         * This is to stress per cpu counter imbalance.
         * On a 100 cpu machine:
         * - per memcg per cpu dirty count is 32 pages for each of 99 cpus
         * - per memcg atomic is -99*32 pages
         * - thus the complete dirty limit: sum of all counters 0
         * - balance_dirty_pages() only sees atomic count -99*32 pages, which
         *   it max()s to 0.
         * - So a workload can dirty -99*32 pages before balance_dirty_pages()
         *   cares.
         */
        #define _GNU_SOURCE
        #include <err.h>
        #include <fcntl.h>
        #include <sched.h>
        #include <stdlib.h>
        #include <stdio.h>
        #include <sys/stat.h>
        #include <sys/sysinfo.h>
        #include <sys/types.h>
        #include <unistd.h>
      
        static char *buf;
        static int bufSize;
      
        static void set_affinity(int cpu)
        {
        	cpu_set_t affinity;
      
        	CPU_ZERO(&affinity);
        	CPU_SET(cpu, &affinity);
        	if (sched_setaffinity(0, sizeof(affinity), &affinity))
        		err(1, "sched_setaffinity");
        }
      
        static void dirty_on(int output_fd, int cpu)
        {
        	int i, wrote;
      
        	set_affinity(cpu);
        	for (i = 0; i < 32; i++) {
        		for (wrote = 0; wrote < bufSize; ) {
        			int ret = write(output_fd, buf+wrote, bufSize-wrote);
        			if (ret == -1)
        				err(1, "write");
        			wrote += ret;
        		}
        	}
        }
      
        int main(int argc, char **argv)
        {
        	int cpu, flush_cpu = 1, output_fd;
        	const char *output;
      
        	if (argc != 2)
        		errx(1, "usage: output_file");
      
        	output = argv[1];
        	bufSize = getpagesize();
        	buf = malloc(getpagesize());
        	if (buf == NULL)
        		errx(1, "malloc failed");
      
        	output_fd = open(output, O_CREAT|O_RDWR);
        	if (output_fd == -1)
        		err(1, "open(%s)", output);
      
        	for (cpu = 0; cpu < get_nprocs(); cpu++) {
        		if (cpu != flush_cpu)
        			dirty_on(output_fd, cpu);
        	}
      
        	set_affinity(flush_cpu);
        	if (fsync(output_fd))
        		err(1, "fsync(%s)", output);
        	if (close(output_fd))
        		err(1, "close(%s)", output);
        	free(buf);
        }
      
      Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
      collect exact per memcg counters.  This avoids the aforementioned oom
      kills.
      
      This does not affect the overhead of memory.stat, which still reads the
      single atomic counter.
      
      Why not use percpu_counter? memcg already handles cpus going offline, so
      no need for that overhead from percpu_counter.  And the percpu_counter
      spinlocks are more heavyweight than is required.
      
      It probably also makes sense to use exact dirty and writeback counters
      in memcg oom reports.  But that is saved for later.
      
      Link: http://lkml.kernel.org/r/20190329174609.164344-1-gthelen@google.comSigned-off-by: default avatarGreg Thelen <gthelen@google.com>
      Reviewed-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.16+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0b3d6e6f
    • Waiman Long's avatar
      psi: clarify the units used in pressure files · be87ab0a
      Waiman Long authored
      The output of the PSI files show a bunch of numbers with no unit.  The
      psi.txt documentation file also does not indicate what units are used.
      One can only find out by looking at the source code.  The units are
      percentage for the averages and useconds for the total.  Make the
      information easier to find by documenting the units in psi.txt.
      
      Link: http://lkml.kernel.org/r/20190402193810.3450-1-longman@redhat.comSigned-off-by: default avatarWaiman Long <longman@redhat.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      be87ab0a
    • Aneesh Kumar K.V's avatar
      mm/huge_memory.c: fix modifying of page protection by insert_pfn_pmd() · c6f3c5ee
      Aneesh Kumar K.V authored
      With some architectures like ppc64, set_pmd_at() cannot cope with a
      situation where there is already some (different) valid entry present.
      
      Use pmdp_set_access_flags() instead to modify the pfn which is built to
      deal with modifying existing PMD entries.
      
      This is similar to commit cae85cb8 ("mm/memory.c: fix modifying of
      page protection by insert_pfn()")
      
      We also do similar update w.r.t insert_pfn_pud eventhough ppc64 don't
      support pud pfn entries now.
      
      Without this patch we also see the below message in kernel log "BUG:
      non-zero pgtables_bytes on freeing mm:"
      
      Link: http://lkml.kernel.org/r/20190402115125.18803-1-aneesh.kumar@linux.ibm.comSigned-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reported-by: default avatarChandan Rajendra <chandan@linux.ibm.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c6f3c5ee
    • Mike Kravetz's avatar
      hugetlbfs: fix memory leak for resv_map · 58b6e5e8
      Mike Kravetz authored
      When mknod is used to create a block special file in hugetlbfs, it will
      allocate an inode and kmalloc a 'struct resv_map' via resv_map_alloc().
      inode->i_mapping->private_data will point the newly allocated resv_map.
      However, when the device special file is opened bd_acquire() will set
      inode->i_mapping to bd_inode->i_mapping.  Thus the pointer to the
      allocated resv_map is lost and the structure is leaked.
      
      Programs to reproduce:
              mount -t hugetlbfs nodev hugetlbfs
              mknod hugetlbfs/dev b 0 0
              exec 30<> hugetlbfs/dev
              umount hugetlbfs/
      
      resv_map structures are only needed for inodes which can have associated
      page allocations.  To fix the leak, only allocate resv_map for those
      inodes which could possibly be associated with page allocations.
      
      Link: http://lkml.kernel.org/r/20190401213101.16476-1-mike.kravetz@oracle.comSigned-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reported-by: default avatarYufen Yu <yuyufen@huawei.com>
      Suggested-by: default avatarYufen Yu <yuyufen@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      58b6e5e8
    • Jann Horn's avatar
      mm: fix vm_fault_t cast in VM_FAULT_GET_HINDEX() · fcae96ff
      Jann Horn authored
      Symmetrically to VM_FAULT_SET_HINDEX(), we need a force-cast in
      VM_FAULT_GET_HINDEX() to tell sparse that this is intentional.
      
      Sparse complains about the current code when building a kernel with
      CONFIG_MEMORY_FAILURE:
      
        arch/x86/mm/fault.c:1058:53: warning: restricted vm_fault_t degrades to integer
      
      Link: http://lkml.kernel.org/r/20190327204117.35215-1-jannh@google.com
      Fixes: 3d353901 ("mm: create the new vm_fault_t type")
      Signed-off-by: default avatarJann Horn <jannh@google.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fcae96ff
    • Dave Rodgman's avatar
      lib/lzo: fix bugs for very short or empty input · b11ed18e
      Dave Rodgman authored
      For very short input data (0 - 1 bytes), lzo-rle was not behaving
      correctly.  Fix this behaviour and update documentation accordingly.
      
      For zero-length input, lzo v0 outputs an end-of-stream marker only,
      which was misinterpreted by lzo-rle as a bitstream version number.
      Ensure bitstream versions > 0 require a minimum stream length of 5.
      
      Also fixes a bug in handling the tail for very short inputs when a
      bitstream version is present.
      
      Link: http://lkml.kernel.org/r/20190326165857.34613-1-dave.rodgman@arm.comSigned-off-by: default avatarDave Rodgman <dave.rodgman@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b11ed18e
    • Arnd Bergmann's avatar
      include/linux/bitrev.h: fix constant bitrev · 6147e136
      Arnd Bergmann authored
      clang points out with hundreds of warnings that the bitrev macros have a
      problem with constant input:
      
        drivers/hwmon/sht15.c:187:11: error: variable '__x' is uninitialized when used within its own initialization
              [-Werror,-Wuninitialized]
                u8 crc = bitrev8(data->val_status & 0x0F);
                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        include/linux/bitrev.h:102:21: note: expanded from macro 'bitrev8'
                __constant_bitrev8(__x) :                       \
                ~~~~~~~~~~~~~~~~~~~^~~~
        include/linux/bitrev.h:67:11: note: expanded from macro '__constant_bitrev8'
                u8 __x = x;                     \
                   ~~~   ^
      
      Both the bitrev and the __constant_bitrev macros use an internal
      variable named __x, which goes horribly wrong when passing one to the
      other.
      
      The obvious fix is to rename one of the variables, so this adds an extra
      '_'.
      
      It seems we got away with this because
      
       - there are only a few drivers using bitrev macros
      
       - usually there are no constant arguments to those
      
       - when they are constant, they tend to be either 0 or (unsigned)-1
         (drivers/isdn/i4l/isdnhdlc.o, drivers/iio/amplifiers/ad8366.c) and
         give the correct result by pure chance.
      
      In fact, the only driver that I could find that gets different results
      with this is drivers/net/wan/slic_ds26522.c, which in turn is a driver
      for fairly rare hardware (adding the maintainer to Cc for testing).
      
      Link: http://lkml.kernel.org/r/20190322140503.123580-1-arnd@arndb.de
      Fixes: 556d2f05 ("ARM: 8187/1: add CONFIG_HAVE_ARCH_BITREVERSE to support rbit instruction")
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Reviewed-by: default avatarNick Desaulniers <ndesaulniers@google.com>
      Cc: Zhao Qiang <qiang.zhao@nxp.com>
      Cc: Yalin Wang <yalin.wang@sonymobile.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6147e136
    • Catalin Marinas's avatar
      kmemleak: powerpc: skip scanning holes in the .bss section · 298a32b1
      Catalin Marinas authored
      Commit 2d4f5671 ("KVM: PPC: Introduce kvm_tmp framework") adds
      kvm_tmp[] into the .bss section and then free the rest of unused spaces
      back to the page allocator.
      
      kernel_init
        kvm_guest_init
          kvm_free_tmp
            free_reserved_area
              free_unref_page
                free_unref_page_prepare
      
      With DEBUG_PAGEALLOC=y, it will unmap those pages from kernel.  As the
      result, kmemleak scan will trigger a panic when it scans the .bss
      section with unmapped pages.
      
      This patch creates dedicated kmemleak objects for the .data, .bss and
      potentially .data..ro_after_init sections to allow partial freeing via
      the kmemleak_free_part() in the powerpc kvm_free_tmp() function.
      
      Link: http://lkml.kernel.org/r/20190321171917.62049-1-catalin.marinas@arm.comSigned-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Reported-by: default avatarQian Cai <cai@lca.pw>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Tested-by: default avatarQian Cai <cai@lca.pw>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      298a32b1
    • Nick Desaulniers's avatar
      lib/string.c: implement a basic bcmp · 5f074f3e
      Nick Desaulniers authored
      A recent optimization in Clang (r355672) lowers comparisons of the
      return value of memcmp against zero to comparisons of the return value
      of bcmp against zero.  This helps some platforms that implement bcmp
      more efficiently than memcmp.  glibc simply aliases bcmp to memcmp, but
      an optimized implementation is in the works.
      
      This results in linkage failures for all targets with Clang due to the
      undefined symbol.  For now, just implement bcmp as a tailcail to memcmp
      to unbreak the build.  This routine can be further optimized in the
      future.
      
      Other ideas discussed:
      
       * A weak alias was discussed, but breaks for architectures that define
         their own implementations of memcmp since aliases to declarations are
         not permitted (only definitions). Arch-specific memcmp
         implementations typically declare memcmp in C headers, but implement
         them in assembly.
      
       * -ffreestanding also is used sporadically throughout the kernel.
      
       * -fno-builtin-bcmp doesn't work when doing LTO.
      
      Link: https://bugs.llvm.org/show_bug.cgi?id=41035
      Link: https://code.woboq.org/userspace/glibc/string/memcmp.c.html#bcmp
      Link: https://github.com/llvm/llvm-project/commit/8e16d73346f8091461319a7dfc4ddd18eedcff13
      Link: https://github.com/ClangBuiltLinux/linux/issues/416
      Link: http://lkml.kernel.org/r/20190313211335.165605-1-ndesaulniers@google.comSigned-off-by: default avatarNick Desaulniers <ndesaulniers@google.com>
      Reported-by: default avatarNathan Chancellor <natechancellor@gmail.com>
      Reported-by: default avatarAdhemerval Zanella <adhemerval.zanella@linaro.org>
      Suggested-by: default avatarArnd Bergmann <arnd@arndb.de>
      Suggested-by: default avatarJames Y Knight <jyknight@google.com>
      Suggested-by: default avatarMasahiro Yamada <yamada.masahiro@socionext.com>
      Suggested-by: default avatarNathan Chancellor <natechancellor@gmail.com>
      Suggested-by: default avatarRasmus Villemoes <linux@rasmusvillemoes.dk>
      Acked-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Reviewed-by: default avatarNathan Chancellor <natechancellor@gmail.com>
      Tested-by: default avatarNathan Chancellor <natechancellor@gmail.com>
      Reviewed-by: default avatarMasahiro Yamada <yamada.masahiro@socionext.com>
      Reviewed-by: default avatarAndy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5f074f3e
    • Linus Torvalds's avatar
      Merge tag 'for-5.1/dm-fixes' of... · 4f1cbe07
      Linus Torvalds authored
      Merge tag 'for-5.1/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
      
      Pull device mapper fixes from Mike Snitzer:
      
       - Two queue_limits stacking fixes: disable discards if underlying
         driver does. And propagate BDI_CAP_STABLE_WRITES to fix sporadic
         checksum errors.
      
       - Fix that reverts a DM core limit that wasn't needed given that
         dm-crypt was already updated to impose an equivalent limit.
      
       - Fix dm-init to properly establish 'const' for __initconst array.
      
       - Fix deadlock in DM integrity target that occurs when overlapping IO
         is being issued to it. And two smaller fixes to the DM integrity
         target.
      
      * tag 'for-5.1/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
        dm integrity: fix deadlock with overlapping I/O
        dm: disable DISCARD if the underlying storage no longer supports it
        dm table: propagate BDI_CAP_STABLE_WRITES to fix sporadic checksum errors
        dm: revert 8f50e358 ("dm: limit the max bio size as BIO_MAX_PAGES * PAGE_SIZE")
        dm init: fix const confusion for dm_allowed_targets array
        dm integrity: make dm_integrity_init and dm_integrity_exit static
        dm integrity: change memcmp to strncmp in dm_integrity_ctr
      4f1cbe07
    • Linus Torvalds's avatar
      Merge tag 'vfio-v5.1-rc4' of git://github.com/awilliam/linux-vfio · 3e28fb0f
      Linus Torvalds authored
      Pull VFIO fixes from Alex Williamson:
      
       - Fix clang printk format errors (Louis Taylor)
      
       - Declare structure static to fix sparse warning (Wang Hai)
      
       - Limit user DMA mappings per container (CVE-2019-3882) (Alex
         Williamson)
      
      * tag 'vfio-v5.1-rc4' of git://github.com/awilliam/linux-vfio:
        vfio/type1: Limit DMA mappings per container
        vfio/spapr_tce: Make symbol 'tce_iommu_driver_ops' static
        vfio/pci: use correct format characters
      3e28fb0f
  2. 05 Apr, 2019 16 commits
    • Linus Torvalds's avatar
      Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · bc5725f9
      Linus Torvalds authored
      Pull kvm fixes from Paolo Bonzini:
       "x86 fixes for overflows and other nastiness"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: x86: nVMX: fix x2APIC VTPR read intercept
        KVM: x86: nVMX: close leak of L0's x2APIC MSRs (CVE-2019-3887)
        KVM: SVM: prevent DBG_DECRYPT and DBG_ENCRYPT overflow
        kvm: svm: fix potential get_num_contig_pages overflow
      bc5725f9
    • Linus Torvalds's avatar
      Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · 2f9e10ac
      Linus Torvalds authored
      Pull arm64 fix from Catalin Marinas:
       "Fix unwind_frame() in the context of pseudo NMI"
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        arm64: fix wrong check of on_sdei_stack in nmi context
      2f9e10ac
    • Linus Torvalds's avatar
      Merge tag 'trace-5.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace · 970b766c
      Linus Torvalds authored
      Pull syscall-get-arguments cleanup and fixes from Steven Rostedt:
       "Andy Lutomirski approached me to tell me that the
        syscall_get_arguments() implementation in x86 was horrible and gcc
        certainly gets it wrong.
      
        He said that since the tracepoints only pass in 0 and 6 for i and n
        repectively, it should be optimized for that case. Inspecting the
        kernel, I discovered that all users pass in 0 for i and only one file
        passing in something other than 6 for the number of arguments. That
        code happens to be my own code used for the special syscall tracing.
      
        That can easily be converted to just using 0 and 6 as well, and only
        copying what is needed. Which is probably the faster path anyway for
        that case.
      
        Along the way, a couple of real fixes came from this as the
        syscall_get_arguments() function was incorrect for csky and riscv.
      
        x86 has been optimized to for the new interface that removes the
        variable number of arguments, but the other architectures could still
        use some loving and take more advantage of the simpler interface"
      
      * tag 'trace-5.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
        syscalls: Remove start and number from syscall_set_arguments() args
        syscalls: Remove start and number from syscall_get_arguments() args
        csky: Fix syscall_get_arguments() and syscall_set_arguments()
        riscv: Fix syscall_get_arguments() and syscall_set_arguments()
        tracing/syscalls: Pass in hardcoded 6 into syscall_get_arguments()
        ptrace: Remove maxargs from task_current_syscall()
      970b766c
    • Mikulas Patocka's avatar
      dm integrity: fix deadlock with overlapping I/O · 4ed319c6
      Mikulas Patocka authored
      dm-integrity will deadlock if overlapping I/O is issued to it, the bug
      was introduced by commit 724376a0 ("dm integrity: implement fair
      range locks").  Users rarely use overlapping I/O so this bug went
      undetected until now.
      
      Fix this bug by correcting, likely cut-n-paste, typos in
      ranges_overlap() and also remove a flawed ranges_overlap() check in
      remove_range_unlocked().  This condition could leave unprocessed bios
      hanging on wait_list forever.
      
      Cc: stable@vger.kernel.org # v4.19+
      Fixes: 724376a0 ("dm integrity: implement fair range locks")
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      4ed319c6
    • Marc Orr's avatar
      KVM: x86: nVMX: fix x2APIC VTPR read intercept · c73f4c99
      Marc Orr authored
      Referring to the "VIRTUALIZING MSR-BASED APIC ACCESSES" chapter of the
      SDM, when "virtualize x2APIC mode" is 1 and "APIC-register
      virtualization" is 0, a RDMSR of 808H should return the VTPR from the
      virtual APIC page.
      
      However, for nested, KVM currently fails to disable the read intercept
      for this MSR. This means that a RDMSR exit takes precedence over
      "virtualize x2APIC mode", and KVM passes through L1's TPR to L2,
      instead of sourcing the value from L2's virtual APIC page.
      
      This patch fixes the issue by disabling the read intercept, in VMCS02,
      for the VTPR when "APIC-register virtualization" is 0.
      
      The issue described above and fix prescribed here, were verified with
      a related patch in kvm-unit-tests titled "Test VMX's virtualize x2APIC
      mode w/ nested".
      Signed-off-by: default avatarMarc Orr <marcorr@google.com>
      Reviewed-by: default avatarJim Mattson <jmattson@google.com>
      Fixes: c992384b ("KVM: vmx: speed up MSR bitmap merge")
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      c73f4c99
    • Marc Orr's avatar
      KVM: x86: nVMX: close leak of L0's x2APIC MSRs (CVE-2019-3887) · acff7847
      Marc Orr authored
      The nested_vmx_prepare_msr_bitmap() function doesn't directly guard the
      x2APIC MSR intercepts with the "virtualize x2APIC mode" MSR. As a
      result, we discovered the potential for a buggy or malicious L1 to get
      access to L0's x2APIC MSRs, via an L2, as follows.
      
      1. L1 executes WRMSR(IA32_SPEC_CTRL, 1). This causes the spec_ctrl
      variable, in nested_vmx_prepare_msr_bitmap() to become true.
      2. L1 disables "virtualize x2APIC mode" in VMCS12.
      3. L1 enables "APIC-register virtualization" in VMCS12.
      
      Now, KVM will set VMCS02's x2APIC MSR intercepts from VMCS12, and then
      set "virtualize x2APIC mode" to 0 in VMCS02. Oops.
      
      This patch closes the leak by explicitly guarding VMCS02's x2APIC MSR
      intercepts with VMCS12's "virtualize x2APIC mode" control.
      
      The scenario outlined above and fix prescribed here, were verified with
      a related patch in kvm-unit-tests titled "Add leak scenario to
      virt_x2apic_mode_test".
      
      Note, it looks like this issue may have been introduced inadvertently
      during a merge---see 15303ba5.
      Signed-off-by: default avatarMarc Orr <marcorr@google.com>
      Reviewed-by: default avatarJim Mattson <jmattson@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      acff7847
    • David Rientjes's avatar
      KVM: SVM: prevent DBG_DECRYPT and DBG_ENCRYPT overflow · b86bc285
      David Rientjes authored
      This ensures that the address and length provided to DBG_DECRYPT and
      DBG_ENCRYPT do not cause an overflow.
      
      At the same time, pass the actual number of pages pinned in memory to
      sev_unpin_memory() as a cleanup.
      Reported-by: default avatarCfir Cohen <cfir@google.com>
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      b86bc285
    • David Rientjes's avatar
      kvm: svm: fix potential get_num_contig_pages overflow · ede885ec
      David Rientjes authored
      get_num_contig_pages() could potentially overflow int so make its type
      consistent with its usage.
      Reported-by: default avatarCfir Cohen <cfir@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      ede885ec
    • Linus Torvalds's avatar
      Merge tag 'mm-compaction-5.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux · 7f46774c
      Linus Torvalds authored
      Pull mm/compaction fixes from Mel Gorman:
       "The merge window for 5.1 introduced a number of compaction-related
        patches. with intermittent reports of corruption and functional
        issues. The bugs are due to sloopy checking of zone boundaries and a
        corner case where invalid indexes are used to access the free lists.
      
        Reports are not common but at least two users and 0-day have tripped
        over them. There is a chance that one of the syzbot reports are
        related but it has not been confirmed properly.
      
        The normal submission path is with Andrew but there have been some
        delays and I consider them urgent enough that they should be picked up
        before RC4 to avoid duplicate reports.
      
        All of these have been successfully tested on older RC windows. This
        will make this branch look like a rebase but in fact, they've simply
        been lifted again from Andrew's tree and placed on a fresh branch.
        I've no reason to believe that this has invalidated the testing given
        the lack of change in compaction and the nature of the fixes"
      
      * tag 'mm-compaction-5.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux:
        mm/compaction.c: abort search if isolation fails
        mm/compaction.c: correct zone boundary handling when resetting pageblock skip hints
      7f46774c
    • Greg Kroah-Hartman's avatar
      tty: mark Siemens R3964 line discipline as BROKEN · c7084edc
      Greg Kroah-Hartman authored
      The n_r3964 line discipline driver was written in a different time, when
      SMP machines were rare, and users were trusted to do the right thing.
      Since then, the world has moved on but not this code, it has stayed
      rooted in the past with its lovely hand-crafted list structures and
      loads of "interesting" race conditions all over the place.
      
      After attempting to clean up most of the issues, I just gave up and am
      now marking the driver as BROKEN so that hopefully someone who has this
      hardware will show up out of the woodwork (I know you are out there!)
      and will help with debugging a raft of changes that I had laying around
      for the code, but was too afraid to commit as odds are they would break
      things.
      
      Many thanks to Jann and Linus for pointing out the initial problems in
      this codebase, as well as many reviews of my attempts to fix the issues.
      It was a case of whack-a-mole, and as you can see, the mole won.
      Reported-by: default avatarJann Horn <jannh@google.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c7084edc
    • Steven Rostedt (VMware)'s avatar
      syscalls: Remove start and number from syscall_set_arguments() args · 32d92586
      Steven Rostedt (VMware) authored
      After removing the start and count arguments of syscall_get_arguments() it
      seems reasonable to remove them from syscall_set_arguments(). Note, as of
      today, there are no users of syscall_set_arguments(). But we are told that
      there will be soon. But for now, at least make it consistent with
      syscall_get_arguments().
      
      Link: http://lkml.kernel.org/r/20190327222014.GA32540@altlinux.org
      
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: Dave Martin <dave.martin@arm.com>
      Cc: "Dmitry V. Levin" <ldv@altlinux.org>
      Cc: x86@kernel.org
      Cc: linux-snps-arc@lists.infradead.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-c6x-dev@linux-c6x.org
      Cc: uclinux-h8-devel@lists.sourceforge.jp
      Cc: linux-hexagon@vger.kernel.org
      Cc: linux-ia64@vger.kernel.org
      Cc: linux-mips@vger.kernel.org
      Cc: nios2-dev@lists.rocketboards.org
      Cc: openrisc@lists.librecores.org
      Cc: linux-parisc@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: linux-riscv@lists.infradead.org
      Cc: linux-s390@vger.kernel.org
      Cc: linux-sh@vger.kernel.org
      Cc: sparclinux@vger.kernel.org
      Cc: linux-um@lists.infradead.org
      Cc: linux-xtensa@linux-xtensa.org
      Cc: linux-arch@vger.kernel.org
      Acked-by: Max Filippov <jcmvbkbc@gmail.com> # For xtensa changes
      Acked-by: Will Deacon <will.deacon@arm.com> # For the arm64 bits
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de> # for x86
      Reviewed-by: default avatarDmitry V. Levin <ldv@altlinux.org>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      32d92586
    • Steven Rostedt (Red Hat)'s avatar
      syscalls: Remove start and number from syscall_get_arguments() args · b35f549d
      Steven Rostedt (Red Hat) authored
      At Linux Plumbers, Andy Lutomirski approached me and pointed out that the
      function call syscall_get_arguments() implemented in x86 was horribly
      written and not optimized for the standard case of passing in 0 and 6 for
      the starting index and the number of system calls to get. When looking at
      all the users of this function, I discovered that all instances pass in only
      0 and 6 for these arguments. Instead of having this function handle
      different cases that are never used, simply rewrite it to return the first 6
      arguments of a system call.
      
      This should help out the performance of tracing system calls by ptrace,
      ftrace and perf.
      
      Link: http://lkml.kernel.org/r/20161107213233.754809394@goodmis.org
      
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: Dave Martin <dave.martin@arm.com>
      Cc: "Dmitry V. Levin" <ldv@altlinux.org>
      Cc: x86@kernel.org
      Cc: linux-snps-arc@lists.infradead.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-c6x-dev@linux-c6x.org
      Cc: uclinux-h8-devel@lists.sourceforge.jp
      Cc: linux-hexagon@vger.kernel.org
      Cc: linux-ia64@vger.kernel.org
      Cc: linux-mips@vger.kernel.org
      Cc: nios2-dev@lists.rocketboards.org
      Cc: openrisc@lists.librecores.org
      Cc: linux-parisc@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: linux-riscv@lists.infradead.org
      Cc: linux-s390@vger.kernel.org
      Cc: linux-sh@vger.kernel.org
      Cc: sparclinux@vger.kernel.org
      Cc: linux-um@lists.infradead.org
      Cc: linux-xtensa@linux-xtensa.org
      Cc: linux-arch@vger.kernel.org
      Acked-by: Paul Burton <paul.burton@mips.com> # MIPS parts
      Acked-by: Max Filippov <jcmvbkbc@gmail.com> # For xtensa changes
      Acked-by: Will Deacon <will.deacon@arm.com> # For the arm64 bits
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de> # for x86
      Reviewed-by: default avatarDmitry V. Levin <ldv@altlinux.org>
      Reported-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      b35f549d
    • Linus Torvalds's avatar
      Merge tag 'drm-fixes-2019-04-05' of git://anongit.freedesktop.org/drm/drm · ea2cec24
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
       "Pretty quiet week, just some amdgpu and i915 fixes.
      
        i915:
         - deadlock fix
         - gvt fixes
      
        amdgpu:
         - PCIE dpm feature fix
         - Powerplay fixes"
      
      * tag 'drm-fixes-2019-04-05' of git://anongit.freedesktop.org/drm/drm:
        drm/i915/gvt: Fix kerneldoc typo for intel_vgpu_emulate_hotplug
        drm/i915/gvt: Correct the calculation of plane size
        drm/amdgpu: remove unnecessary rlc reset function on gfx9
        drm/i915: Always backoff after a drm_modeset_lock() deadlock
        drm/i915/gvt: do not let pin count of shadow mm go negative
        drm/i915/gvt: do not deliver a workload if its creation fails
        drm/amd/display: VBIOS can't be light up HDMI when restart system
        drm/amd/powerplay: fix possible hang with 3+ 4K monitors
        drm/amd/powerplay: correct data type to avoid overflow
        drm/amd/powerplay: add ECC feature bit
        drm/amd/amdgpu: fix PCIe dpm feature issue (v3)
      ea2cec24
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 0548740e
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Several hash table refcount fixes in batman-adv, from Sven
          Eckelmann.
      
       2) Use after free in bpf_evict_inode(), from Daniel Borkmann.
      
       3) Fix mdio bus registration in ixgbe, from Ivan Vecera.
      
       4) Unbounded loop in __skb_try_recv_datagram(), from Paolo Abeni.
      
       5) ila rhashtable corruption fix from Herbert Xu.
      
       6) Don't allow upper-devices to be added to vrf devices, from Sabrina
          Dubroca.
      
       7) Add qmi_wwan device ID for Olicard 600, from Bjørn Mork.
      
       8) Don't leave skb->next poisoned in __netif_receive_skb_list_ptype,
          from Alexander Lobakin.
      
       9) Missing IDR checks in mlx5 driver, from Aditya Pakki.
      
      10) Fix false connection termination in ktls, from Jakub Kicinski.
      
      11) Work around some ASPM issues with r8169 by disabling rx interrupt
          coalescing on certain chips. From Heiner Kallweit.
      
      12) Properly use per-cpu qstat values on NOLOCK qdiscs, from Paolo
          Abeni.
      
      13) Fully initialize sockaddr_in structures in SCTP, from Xin Long.
      
      14) Various BPF flow dissector fixes from Stanislav Fomichev.
      
      15) Divide by zero in act_sample, from Davide Caratti.
      
      16) Fix bridging multicast regression introduced by rhashtable
          conversion, from Nikolay Aleksandrov.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits)
        ibmvnic: Fix completion structure initialization
        ipv6: sit: reset ip header pointer in ipip6_rcv
        net: bridge: always clear mcast matching struct on reports and leaves
        libcxgb: fix incorrect ppmax calculation
        vlan: conditional inclusion of FCoE hooks to match netdevice.h and bnx2x
        sch_cake: Make sure we can write the IP header before changing DSCP bits
        sch_cake: Use tc_skb_protocol() helper for getting packet protocol
        tcp: Ensure DCTCP reacts to losses
        net/sched: act_sample: fix divide by zero in the traffic path
        net: thunderx: fix NULL pointer dereference in nicvf_open/nicvf_stop
        net: hns: Fix sparse: some warnings in HNS drivers
        net: hns: Fix WARNING when remove HNS driver with SMMU enabled
        net: hns: fix ICMP6 neighbor solicitation messages discard problem
        net: hns: Fix probabilistic memory overwrite when HNS driver initialized
        net: hns: Use NAPI_POLL_WEIGHT for hns driver
        net: hns: fix KASAN: use-after-free in hns_nic_net_xmit_hw()
        flow_dissector: rst'ify documentation
        ipv6: Fix dangling pointer when ipv6 fragment
        net-gro: Fix GRO flush when receiving a GSO packet.
        flow_dissector: document BPF flow dissector environment
        ...
      0548740e
    • Thomas Falcon's avatar
      ibmvnic: Fix completion structure initialization · bbd669a8
      Thomas Falcon authored
      Fix device initialization completion handling for vNIC adapters.
      Initialize the completion structure on probe and reinitialize when needed.
      This also fixes a race condition during kdump where the driver can attempt
      to access the completion struct before it is initialized:
      
      Unable to handle kernel paging request for data at address 0x00000000
      Faulting instruction address: 0xc0000000081acbe0
      Oops: Kernel access of bad area, sig: 11 [#1]
      LE SMP NR_CPUS=2048 NUMA pSeries
      Modules linked in: ibmvnic(+) ibmveth sunrpc overlay squashfs loop
      CPU: 19 PID: 301 Comm: systemd-udevd Not tainted 4.18.0-64.el8.ppc64le #1
      NIP:  c0000000081acbe0 LR: c0000000081ad964 CTR: c0000000081ad900
      REGS: c000000027f3f990 TRAP: 0300   Not tainted  (4.18.0-64.el8.ppc64le)
      MSR:  800000010280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]> CR: 28228288  XER: 00000006
      CFAR: c000000008008934 DAR: 0000000000000000 DSISR: 40000000 IRQMASK: 1
      GPR00: c0000000081ad964 c000000027f3fc10 c0000000095b5800 c0000000221b4e58
      GPR04: 0000000000000003 0000000000000001 000049a086918581 00000000000000d4
      GPR08: 0000000000000007 0000000000000000 ffffffffffffffe8 d0000000014dde28
      GPR12: c0000000081ad900 c000000009a00c00 0000000000000001 0000000000000100
      GPR16: 0000000000000038 0000000000000007 c0000000095e2230 0000000000000006
      GPR20: 0000000000400140 0000000000000001 c00000000910c880 0000000000000000
      GPR24: 0000000000000000 0000000000000006 0000000000000000 0000000000000003
      GPR28: 0000000000000001 0000000000000001 c0000000221b4e60 c0000000221b4e58
      NIP [c0000000081acbe0] __wake_up_locked+0x50/0x100
      LR [c0000000081ad964] complete+0x64/0xa0
      Call Trace:
      [c000000027f3fc10] [c000000027f3fc60] 0xc000000027f3fc60 (unreliable)
      [c000000027f3fc60] [c0000000081ad964] complete+0x64/0xa0
      [c000000027f3fca0] [d0000000014dad58] ibmvnic_handle_crq+0xce0/0x1160 [ibmvnic]
      [c000000027f3fd50] [d0000000014db270] ibmvnic_tasklet+0x98/0x130 [ibmvnic]
      [c000000027f3fda0] [c00000000813f334] tasklet_action_common.isra.3+0xc4/0x1a0
      [c000000027f3fe00] [c000000008cd13f4] __do_softirq+0x164/0x400
      [c000000027f3fef0] [c00000000813ed64] irq_exit+0x184/0x1c0
      [c000000027f3ff20] [c0000000080188e8] __do_irq+0xb8/0x210
      [c000000027f3ff90] [c00000000802d0a4] call_do_irq+0x14/0x24
      [c000000026a5b010] [c000000008018adc] do_IRQ+0x9c/0x130
      [c000000026a5b060] [c000000008008ce4] hardware_interrupt_common+0x114/0x120
      Signed-off-by: default avatarThomas Falcon <tlfalcon@linux.ibm.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      bbd669a8
    • Lorenzo Bianconi's avatar
      ipv6: sit: reset ip header pointer in ipip6_rcv · bb9bd814
      Lorenzo Bianconi authored
      ipip6 tunnels run iptunnel_pull_header on received skbs. This can
      determine the following use-after-free accessing iph pointer since
      the packet will be 'uncloned' running pskb_expand_head if it is a
      cloned gso skb (e.g if the packet has been sent though a veth device)
      
      [  706.369655] BUG: KASAN: use-after-free in ipip6_rcv+0x1678/0x16e0 [sit]
      [  706.449056] Read of size 1 at addr ffffe01b6bd855f5 by task ksoftirqd/1/=
      [  706.669494] Hardware name: HPE ProLiant m400 Server/ProLiant m400 Server, BIOS U02 08/19/2016
      [  706.771839] Call trace:
      [  706.801159]  dump_backtrace+0x0/0x2f8
      [  706.845079]  show_stack+0x24/0x30
      [  706.884833]  dump_stack+0xe0/0x11c
      [  706.925629]  print_address_description+0x68/0x260
      [  706.982070]  kasan_report+0x178/0x340
      [  707.025995]  __asan_report_load1_noabort+0x30/0x40
      [  707.083481]  ipip6_rcv+0x1678/0x16e0 [sit]
      [  707.132623]  tunnel64_rcv+0xd4/0x200 [tunnel4]
      [  707.185940]  ip_local_deliver_finish+0x3b8/0x988
      [  707.241338]  ip_local_deliver+0x144/0x470
      [  707.289436]  ip_rcv_finish+0x43c/0x14b0
      [  707.335447]  ip_rcv+0x628/0x1138
      [  707.374151]  __netif_receive_skb_core+0x1670/0x2600
      [  707.432680]  __netif_receive_skb+0x28/0x190
      [  707.482859]  process_backlog+0x1d0/0x610
      [  707.529913]  net_rx_action+0x37c/0xf68
      [  707.574882]  __do_softirq+0x288/0x1018
      [  707.619852]  run_ksoftirqd+0x70/0xa8
      [  707.662734]  smpboot_thread_fn+0x3a4/0x9e8
      [  707.711875]  kthread+0x2c8/0x350
      [  707.750583]  ret_from_fork+0x10/0x18
      
      [  707.811302] Allocated by task 16982:
      [  707.854182]  kasan_kmalloc.part.1+0x40/0x108
      [  707.905405]  kasan_kmalloc+0xb4/0xc8
      [  707.948291]  kasan_slab_alloc+0x14/0x20
      [  707.994309]  __kmalloc_node_track_caller+0x158/0x5e0
      [  708.053902]  __kmalloc_reserve.isra.8+0x54/0xe0
      [  708.108280]  __alloc_skb+0xd8/0x400
      [  708.150139]  sk_stream_alloc_skb+0xa4/0x638
      [  708.200346]  tcp_sendmsg_locked+0x818/0x2b90
      [  708.251581]  tcp_sendmsg+0x40/0x60
      [  708.292376]  inet_sendmsg+0xf0/0x520
      [  708.335259]  sock_sendmsg+0xac/0xf8
      [  708.377096]  sock_write_iter+0x1c0/0x2c0
      [  708.424154]  new_sync_write+0x358/0x4a8
      [  708.470162]  __vfs_write+0xc4/0xf8
      [  708.510950]  vfs_write+0x12c/0x3d0
      [  708.551739]  ksys_write+0xcc/0x178
      [  708.592533]  __arm64_sys_write+0x70/0xa0
      [  708.639593]  el0_svc_handler+0x13c/0x298
      [  708.686646]  el0_svc+0x8/0xc
      
      [  708.739019] Freed by task 17:
      [  708.774597]  __kasan_slab_free+0x114/0x228
      [  708.823736]  kasan_slab_free+0x10/0x18
      [  708.868703]  kfree+0x100/0x3d8
      [  708.905320]  skb_free_head+0x7c/0x98
      [  708.948204]  skb_release_data+0x320/0x490
      [  708.996301]  pskb_expand_head+0x60c/0x970
      [  709.044399]  __iptunnel_pull_header+0x3b8/0x5d0
      [  709.098770]  ipip6_rcv+0x41c/0x16e0 [sit]
      [  709.146873]  tunnel64_rcv+0xd4/0x200 [tunnel4]
      [  709.200195]  ip_local_deliver_finish+0x3b8/0x988
      [  709.255596]  ip_local_deliver+0x144/0x470
      [  709.303692]  ip_rcv_finish+0x43c/0x14b0
      [  709.349705]  ip_rcv+0x628/0x1138
      [  709.388413]  __netif_receive_skb_core+0x1670/0x2600
      [  709.446943]  __netif_receive_skb+0x28/0x190
      [  709.497120]  process_backlog+0x1d0/0x610
      [  709.544169]  net_rx_action+0x37c/0xf68
      [  709.589131]  __do_softirq+0x288/0x1018
      
      [  709.651938] The buggy address belongs to the object at ffffe01b6bd85580
                      which belongs to the cache kmalloc-1024 of size 1024
      [  709.804356] The buggy address is located 117 bytes inside of
                      1024-byte region [ffffe01b6bd85580, ffffe01b6bd85980)
      [  709.946340] The buggy address belongs to the page:
      [  710.003824] page:ffff7ff806daf600 count:1 mapcount:0 mapping:ffffe01c4001f600 index:0x0
      [  710.099914] flags: 0xfffff8000000100(slab)
      [  710.149059] raw: 0fffff8000000100 dead000000000100 dead000000000200 ffffe01c4001f600
      [  710.242011] raw: 0000000000000000 0000000000380038 00000001ffffffff 0000000000000000
      [  710.334966] page dumped because: kasan: bad access detected
      
      Fix it resetting iph pointer after iptunnel_pull_header
      
      Fixes: a09a4c8d ("tunnels: Remove encapsulation offloads on decap")
      Tested-by: default avatarJianlin Shi <jishi@redhat.com>
      Signed-off-by: default avatarLorenzo Bianconi <lorenzo.bianconi@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      bb9bd814