1. 17 Apr, 2019 40 commits
    • Ilya Dryomov's avatar
      dm table: propagate BDI_CAP_STABLE_WRITES to fix sporadic checksum errors · 469b40a4
      Ilya Dryomov authored
      commit eb40c0ac upstream.
      
      Some devices don't use blk_integrity but still want stable pages
      because they do their own checksumming.  Examples include rbd and iSCSI
      when data digests are negotiated.  Stacking DM (and thus LVM) on top of
      these devices results in sporadic checksum errors.
      
      Set BDI_CAP_STABLE_WRITES if any underlying device has it set.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarIlya Dryomov <idryomov@gmail.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      469b40a4
    • Mikulas Patocka's avatar
      dm: revert 8f50e358 ("dm: limit the max bio size as BIO_MAX_PAGES * PAGE_SIZE") · 4f5c99e0
      Mikulas Patocka authored
      commit 75ae1936 upstream.
      
      The limit was already incorporated to dm-crypt with commit 4e870e94
      ("dm crypt: fix error with too large bios"), so we don't need to apply
      it globally to all targets. The quantity BIO_MAX_PAGES * PAGE_SIZE is
      wrong anyway because the variable ti->max_io_len it is supposed to be in
      the units of 512-byte sectors not in bytes.
      
      Reduction of the limit to 1048576 sectors could even cause data
      corruption in rare cases - suppose that we have a dm-striped device with
      stripe size 768MiB. The target will call dm_set_target_max_io_len with
      the value 1572864. The buggy code would reduce it to 1048576. Now, the
      dm-core will errorneously split the bios on 1048576-sector boundary
      insetad of 1572864-sector boundary and pass these stripe-crossing bios
      to the striped target.
      
      Cc: stable@vger.kernel.org # v4.16+
      Fixes: 8f50e358 ("dm: limit the max bio size as BIO_MAX_PAGES * PAGE_SIZE")
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Acked-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4f5c99e0
    • Mikulas Patocka's avatar
      dm integrity: change memcmp to strncmp in dm_integrity_ctr · 30dc4d7b
      Mikulas Patocka authored
      commit 0d74e6a3 upstream.
      
      If the string opt_string is small, the function memcmp can access bytes
      that are beyond the terminating nul character. In theory, it could cause
      segfault, if opt_string were located just below some unmapped memory.
      
      Change from memcmp to strncmp so that we don't read bytes beyond the end
      of the string.
      
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      30dc4d7b
    • Sergey Miroshnichenko's avatar
      PCI: pciehp: Ignore Link State Changes after powering off a slot · 5be6e02c
      Sergey Miroshnichenko authored
      commit 3943af9d upstream.
      
      During a safe hot remove, the OS powers off the slot, which may cause a
      Data Link Layer State Changed event.  The slot has already been set to
      OFF_STATE, so that event results in re-enabling the device, making it
      impossible to safely remove it.
      
      Clear out the Presence Detect Changed and Data Link Layer State Changed
      events when the disabled slot has settled down.
      
      It is still possible to re-enable the device if it remains in the slot
      after pressing the Attention Button by pressing it again.
      
      Fixes the problem that Micah reported below: an NVMe drive power button may
      not actually turn off the drive.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=203237Reported-by: default avatarMicah Parrish <micah.parrish@hpe.com>
      Tested-by: default avatarMicah Parrish <micah.parrish@hpe.com>
      Signed-off-by: default avatarSergey Miroshnichenko <s.miroshnichenko@yadro.com>
      [bhelgaas: changelog, add bugzilla URL]
      Signed-off-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      Reviewed-by: default avatarLukas Wunner <lukas@wunner.de>
      Cc: stable@vger.kernel.org	# v4.19+
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5be6e02c
    • Andre Przywara's avatar
      PCI: Add function 1 DMA alias quirk for Marvell 9170 SATA controller · 250fef8d
      Andre Przywara authored
      commit 9cde402a upstream.
      
      There is a Marvell 88SE9170 PCIe SATA controller I found on a board here.
      Some quick testing with the ARM SMMU enabled reveals that it suffers from
      the same requester ID mixup problems as the other Marvell chips listed
      already.
      
      Add the PCI vendor/device ID to the list of chips which need the
      workaround.
      Signed-off-by: default avatarAndre Przywara <andre.przywara@arm.com>
      Signed-off-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      CC: stable@vger.kernel.org
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      250fef8d
    • Lendacky, Thomas's avatar
      x86/perf/amd: Remove need to check "running" bit in NMI handler · 05626465
      Lendacky, Thomas authored
      commit 3966c3fe upstream.
      
      Spurious interrupt support was added to perf in the following commit, almost
      a decade ago:
      
        63e6be6d ("perf, x86: Catch spurious interrupts after disabling counters")
      
      The two previous patches (resolving the race condition when disabling a
      PMC and NMI latency mitigation) allow for the removal of this older
      spurious interrupt support.
      
      Currently in x86_pmu_stop(), the bit for the PMC in the active_mask bitmap
      is cleared before disabling the PMC, which sets up a race condition. This
      race condition was mitigated by introducing the running bitmap. That race
      condition can be eliminated by first disabling the PMC, waiting for PMC
      reset on overflow and then clearing the bit for the PMC in the active_mask
      bitmap. The NMI handler will not re-enable a disabled counter.
      
      If x86_pmu_stop() is called from the perf NMI handler, the NMI latency
      mitigation support will guard against any unhandled NMI messages.
      Signed-off-by: default avatarTom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org> # 4.14.x-
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: https://lkml.kernel.org/r/Message-ID:
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      05626465
    • Lendacky, Thomas's avatar
      x86/perf/amd: Resolve NMI latency issues for active PMCs · 23d39b0a
      Lendacky, Thomas authored
      commit 6d3edaae upstream.
      
      On AMD processors, the detection of an overflowed PMC counter in the NMI
      handler relies on the current value of the PMC. So, for example, to check
      for overflow on a 48-bit counter, bit 47 is checked to see if it is 1 (not
      overflowed) or 0 (overflowed).
      
      When the perf NMI handler executes it does not know in advance which PMC
      counters have overflowed. As such, the NMI handler will process all active
      PMC counters that have overflowed. NMI latency in newer AMD processors can
      result in multiple overflowed PMC counters being processed in one NMI and
      then a subsequent NMI, that does not appear to be a back-to-back NMI, not
      finding any PMC counters that have overflowed. This may appear to be an
      unhandled NMI resulting in either a panic or a series of messages,
      depending on how the kernel was configured.
      
      To mitigate this issue, add an AMD handle_irq callback function,
      amd_pmu_handle_irq(), that will invoke the common x86_pmu_handle_irq()
      function and upon return perform some additional processing that will
      indicate if the NMI has been handled or would have been handled had an
      earlier NMI not handled the overflowed PMC. Using a per-CPU variable, a
      minimum value of the number of active PMCs or 2 will be set whenever a
      PMC is active. This is used to indicate the possible number of NMIs that
      can still occur. The value of 2 is used for when an NMI does not arrive
      at the LAPIC in time to be collapsed into an already pending NMI. Each
      time the function is called without having handled an overflowed counter,
      the per-CPU value is checked. If the value is non-zero, it is decremented
      and the NMI indicates that it handled the NMI. If the value is zero, then
      the NMI indicates that it did not handle the NMI.
      Signed-off-by: default avatarTom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org> # 4.14.x-
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: https://lkml.kernel.org/r/Message-ID:
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      23d39b0a
    • Lendacky, Thomas's avatar
      x86/perf/amd: Resolve race condition when disabling PMC · e5a791b4
      Lendacky, Thomas authored
      commit 914123fa upstream.
      
      On AMD processors, the detection of an overflowed counter in the NMI
      handler relies on the current value of the counter. So, for example, to
      check for overflow on a 48 bit counter, bit 47 is checked to see if it
      is 1 (not overflowed) or 0 (overflowed).
      
      There is currently a race condition present when disabling and then
      updating the PMC. Increased NMI latency in newer AMD processors makes this
      race condition more pronounced. If the counter value has overflowed, it is
      possible to update the PMC value before the NMI handler can run. The
      updated PMC value is not an overflowed value, so when the perf NMI handler
      does run, it will not find an overflowed counter. This may appear as an
      unknown NMI resulting in either a panic or a series of messages, depending
      on how the kernel is configured.
      
      To eliminate this race condition, the PMC value must be checked after
      disabling the counter. Add an AMD function, amd_pmu_disable_all(), that
      will wait for the NMI handler to reset any active and overflowed counter
      after calling x86_pmu_disable_all().
      Signed-off-by: default avatarTom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org> # 4.14.x-
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: https://lkml.kernel.org/r/Message-ID:
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e5a791b4
    • Alexander Potapenko's avatar
      x86/asm: Use stricter assembly constraints in bitops · 4b004504
      Alexander Potapenko authored
      commit 5b77e95d upstream.
      
      There's a number of problems with how arch/x86/include/asm/bitops.h
      is currently using assembly constraints for the memory region
      bitops are modifying:
      
      1) Use memory clobber in bitops that touch arbitrary memory
      
      Certain bit operations that read/write bits take a base pointer and an
      arbitrarily large offset to address the bit relative to that base.
      Inline assembly constraints aren't expressive enough to tell the
      compiler that the assembly directive is going to touch a specific memory
      location of unknown size, therefore we have to use the "memory" clobber
      to indicate that the assembly is going to access memory locations other
      than those listed in the inputs/outputs.
      
      To indicate that BTR/BTS instructions don't necessarily touch the first
      sizeof(long) bytes of the argument, we also move the address to assembly
      inputs.
      
      This particular change leads to size increase of 124 kernel functions in
      a defconfig build. For some of them the diff is in NOP operations, other
      end up re-reading values from memory and may potentially slow down the
      execution. But without these clobbers the compiler is free to cache
      the contents of the bitmaps and use them as if they weren't changed by
      the inline assembly.
      
      2) Use byte-sized arguments for operations touching single bytes.
      
      Passing a long value to ANDB/ORB/XORB instructions makes the compiler
      treat sizeof(long) bytes as being clobbered, which isn't the case. This
      may theoretically lead to worse code in the case of heavy optimization.
      
      Practical impact:
      
      I've built a defconfig kernel and looked through some of the functions
      generated by GCC 7.3.0 with and without this clobber, and didn't spot
      any miscompilations.
      
      However there is a (trivial) theoretical case where this code leads to
      miscompilation:
      
        https://lkml.org/lkml/2019/3/28/393
      
      using just GCC 8.3.0 with -O2.  It isn't hard to imagine someone writes
      such a function in the kernel someday.
      
      So the primary motivation is to fix an existing misuse of the asm
      directive, which happens to work in certain configurations now, but
      isn't guaranteed to work under different circumstances.
      
      [ --mingo: Added -stable tag because defconfig only builds a fraction
        of the kernel and the trivial testcase looks normal enough to
        be used in existing or in-development code. ]
      Signed-off-by: default avatarAlexander Potapenko <glider@google.com>
      Cc: <stable@vger.kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: James Y Knight <jyknight@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20190402112813.193378-1-glider@google.com
      [ Edited the changelog, tidied up one of the defines. ]
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4b004504
    • Rasmus Villemoes's avatar
      x86/asm: Remove dead __GNUC__ conditionals · 356ae4de
      Rasmus Villemoes authored
      commit 88ca66d8 upstream.
      
      The minimum supported gcc version is >= 4.6, so these can be removed.
      Signed-off-by: default avatarRasmus Villemoes <linux@rasmusvillemoes.dk>
      Signed-off-by: default avatarBorislav Petkov <bp@suse.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20190111084931.24601-1-linux@rasmusvillemoes.dkSigned-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      356ae4de
    • Max Filippov's avatar
      xtensa: fix return_address · f7b778b9
      Max Filippov authored
      commit ada770b1 upstream.
      
      return_address returns the address that is one level higher in the call
      stack than requested in its argument, because level 0 corresponds to its
      caller's return address. Use requested level as the number of stack
      frames to skip.
      
      This fixes the address reported by might_sleep and friends.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarMax Filippov <jcmvbkbc@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f7b778b9
    • Mel Gorman's avatar
      sched/fair: Do not re-read ->h_load_next during hierarchical load calculation · cb75a0c5
      Mel Gorman authored
      commit 0e9f0245 upstream.
      
      A NULL pointer dereference bug was reported on a distribution kernel but
      the same issue should be present on mainline kernel. It occured on s390
      but should not be arch-specific.  A partial oops looks like:
      
        Unable to handle kernel pointer dereference in virtual kernel address space
        ...
        Call Trace:
          ...
          try_to_wake_up+0xfc/0x450
          vhost_poll_wakeup+0x3a/0x50 [vhost]
          __wake_up_common+0xbc/0x178
          __wake_up_common_lock+0x9e/0x160
          __wake_up_sync_key+0x4e/0x60
          sock_def_readable+0x5e/0x98
      
      The bug hits any time between 1 hour to 3 days. The dereference occurs
      in update_cfs_rq_h_load when accumulating h_load. The problem is that
      cfq_rq->h_load_next is not protected by any locking and can be updated
      by parallel calls to task_h_load. Depending on the compiler, code may be
      generated that re-reads cfq_rq->h_load_next after the check for NULL and
      then oops when reading se->avg.load_avg. The dissassembly showed that it
      was possible to reread h_load_next after the check for NULL.
      
      While this does not appear to be an issue for later compilers, it's still
      an accident if the correct code is generated. Full locking in this path
      would have high overhead so this patch uses READ_ONCE to read h_load_next
      only once and check for NULL before dereferencing. It was confirmed that
      there were no further oops after 10 days of testing.
      
      As Peter pointed out, it is also necessary to use WRITE_ONCE() to avoid any
      potential problems with store tearing.
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarValentin Schneider <valentin.schneider@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@vger.kernel.org>
      Fixes: 68520796 ("sched: Move h_load calculation to task_h_load()")
      Link: https://lkml.kernel.org/r/20190319123610.nsivgf3mjbjjesxb@techsingularity.netSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cb75a0c5
    • Dan Carpenter's avatar
      xen: Prevent buffer overflow in privcmd ioctl · ed3adb56
      Dan Carpenter authored
      commit 42d8644b upstream.
      
      The "call" variable comes from the user in privcmd_ioctl_hypercall().
      It's an offset into the hypercall_page[] which has (PAGE_SIZE / 32)
      elements.  We need to put an upper bound on it to prevent an out of
      bounds access.
      
      Cc: stable@vger.kernel.org
      Fixes: 1246ae0b ("xen: add variable hypercall caller")
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: default avatarBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: default avatarJuergen Gross <jgross@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ed3adb56
    • Will Deacon's avatar
      arm64: backtrace: Don't bother trying to unwind the userspace stack · 84c6c2af
      Will Deacon authored
      commit 1e6f5440 upstream.
      
      Calling dump_backtrace() with a pt_regs argument corresponding to
      userspace doesn't make any sense and our unwinder will simply print
      "Call trace:" before unwinding the stack looking for user frames.
      
      Rather than go through this song and dance, just return early if we're
      passed a user register state.
      
      Cc: <stable@vger.kernel.org>
      Fixes: 1149aad1 ("arm64: Add dump_backtrace() in show_regs")
      Reported-by: default avatarKefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      84c6c2af
    • Peter Geis's avatar
      arm64: dts: rockchip: fix rk3328 rgmii high tx error rate · 1ec54cee
      Peter Geis authored
      commit 6fd8b978 upstream.
      
      Several rk3328 based boards experience high rgmii tx error rates.
      This is due to several pins in the rk3328.dtsi rgmii pinmux that are
      missing a defined pull strength setting.
      This causes the pinmux driver to default to 2ma (bit mask 00).
      
      These pins are only defined in the rk3328.dtsi, and are not listed in
      the rk3328 specification.
      The TRM only lists them as "Reserved"
      (RK3328 TRM V1.1, 3.3.3 Detail Register Description, GRF_GPIO0B_IOMUX,
      GRF_GPIO0C_IOMUX, GRF_GPIO0D_IOMUX).
      However, removal of these pins from the rgmii pinmux definition causes
      the interface to fail to transmit.
      
      Also, the rgmii tx and rx pins defined in the dtsi are not consistent
      with the rk3328 specification, with tx pins currently set to 12ma and
      rx pins set to 2ma.
      
      Fix this by setting tx pins to 8ma and the rx pins to 4ma, consistent
      with the specification.
      Defining the drive strength for the undefined pins eliminated the high
      tx packet error rate observed under heavy data transfers.
      Aligning the drive strength to the TRM values eliminated the occasional
      packet retry errors under iperf3 testing.
      This allows much higher data rates with no recorded tx errors.
      
      Tested on the rk3328-roc-cc board.
      
      Fixes: 52e02d37 ("arm64: dts: rockchip: add core dtsi file for RK3328 SoCs")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarPeter Geis <pgwipeout@gmail.com>
      Signed-off-by: default avatarHeiko Stuebner <heiko@sntech.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1ec54cee
    • Will Deacon's avatar
      arm64: futex: Fix FUTEX_WAKE_OP atomic ops with non-zero result value · 82a30a5d
      Will Deacon authored
      commit 045afc24 upstream.
      
      Rather embarrassingly, our futex() FUTEX_WAKE_OP implementation doesn't
      explicitly set the return value on the non-faulting path and instead
      leaves it holding the result of the underlying atomic operation. This
      means that any FUTEX_WAKE_OP atomic operation which computes a non-zero
      value will be reported as having failed. Regrettably, I wrote the buggy
      code back in 2011 and it was upstreamed as part of the initial arm64
      support in 2012.
      
      The reasons we appear to get away with this are:
      
        1. FUTEX_WAKE_OP is rarely used and therefore doesn't appear to get
           exercised by futex() test applications
      
        2. If the result of the atomic operation is zero, the system call
           behaves correctly
      
        3. Prior to version 2.25, the only operation used by GLIBC set the
           futex to zero, and therefore worked as expected. From 2.25 onwards,
           FUTEX_WAKE_OP is not used by GLIBC at all.
      
      Fix the implementation by ensuring that the return value is either 0
      to indicate that the atomic operation completed successfully, or -EFAULT
      if we encountered a fault when accessing the user mapping.
      
      Cc: <stable@kernel.org>
      Fixes: 6170a974 ("arm64: Atomic operations")
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      82a30a5d
    • David Engraf's avatar
      ARM: dts: at91: Fix typo in ISC_D0 on PC9 · 4362ff97
      David Engraf authored
      commit e7dfb6d0 upstream.
      
      The function argument for the ISC_D0 on PC9 was incorrect. According to
      the documentation it should be 'C' aka 3.
      Signed-off-by: default avatarDavid Engraf <david.engraf@sysgo.com>
      Reviewed-by: default avatarNicolas Ferre <nicolas.ferre@microchip.com>
      Signed-off-by: default avatarLudovic Desroches <ludovic.desroches@microchip.com>
      Fixes: 7f16cb67 ("ARM: at91/dt: add sama5d2 pinmux")
      Cc: <stable@vger.kernel.org> # v4.4+
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4362ff97
    • Peter Ujfalusi's avatar
      ARM: dts: am335x-evm: Correct the regulators for the audio codec · 627a7d5a
      Peter Ujfalusi authored
      commit 4f96dc0a upstream.
      
      Correctly map the regulators used by tlv320aic3106.
      Both 1.8V and 3.3V for the codec is derived from VBAT via fixed regulators.
      
      Cc: <Stable@vger.kernel.org> # v4.14+
      Signed-off-by: default avatarPeter Ujfalusi <peter.ujfalusi@ti.com>
      Signed-off-by: default avatarTony Lindgren <tony@atomide.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      627a7d5a
    • Peter Ujfalusi's avatar
      ARM: dts: am335x-evmsk: Correct the regulators for the audio codec · 57a9c1f4
      Peter Ujfalusi authored
      commit 66913706 upstream.
      
      Correctly map the regulators used by tlv320aic3106.
      Both 1.8V and 3.3V for the codec is derived from VBAT via fixed regulators.
      
      Cc: <Stable@vger.kernel.org> # v4.14+
      Signed-off-by: default avatarPeter Ujfalusi <peter.ujfalusi@ti.com>
      Signed-off-by: default avatarTony Lindgren <tony@atomide.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      57a9c1f4
    • Jonas Karlman's avatar
      ARM: dts: rockchip: fix rk3288 cpu opp node reference · 3ba48b3c
      Jonas Karlman authored
      commit 6b2fde3d upstream.
      
      The following error can be seen during boot:
      
        of: /cpus/cpu@501: Couldn't find opp node
      
      Change cpu nodes to use operating-points-v2 in order to fix this.
      
      Fixes: ce76de98 ("ARM: dts: rockchip: convert rk3288 to operating-points-v2")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJonas Karlman <jonas@kwiboo.se>
      Signed-off-by: default avatarHeiko Stuebner <heiko@sntech.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3ba48b3c
    • Cornelia Huck's avatar
      virtio: Honour 'may_reduce_num' in vring_create_virtqueue · 32fdac09
      Cornelia Huck authored
      commit cf94db21 upstream.
      
      vring_create_virtqueue() allows the caller to specify via the
      may_reduce_num parameter whether the vring code is allowed to
      allocate a smaller ring than specified.
      
      However, the split ring allocation code tries to allocate a
      smaller ring on allocation failure regardless of what the
      caller specified. This may cause trouble for e.g. virtio-pci
      in legacy mode, which does not support ring resizing. (The
      packed ring code does not resize in any case.)
      
      Let's fix this by bailing out immediately in the split ring code
      if the requested size cannot be allocated and may_reduce_num has
      not been specified.
      
      While at it, fix a typo in the usage instructions.
      
      Fixes: 2a2d1382 ("virtio: Add improved queue allocation API")
      Cc: stable@vger.kernel.org # v4.6+
      Signed-off-by: default avatarCornelia Huck <cohuck@redhat.com>
      Signed-off-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      Reviewed-by: default avatarHalil Pasic <pasic@linux.ibm.com>
      Reviewed-by: default avatarJens Freimann <jfreimann@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      32fdac09
    • Kefeng Wang's avatar
      genirq: Initialize request_mutex if CONFIG_SPARSE_IRQ=n · 8b4f68b4
      Kefeng Wang authored
      commit e8458e7a upstream.
      
      When CONFIG_SPARSE_IRQ is disable, the request_mutex in struct irq_desc
      is not initialized which causes malfunction.
      
      Fixes: 9114014c ("genirq: Add mutex to irq desc to serialize request/free_irq()")
      Signed-off-by: default avatarKefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarMukesh Ojha <mojha@codeaurora.org>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: <linux-arm-kernel@lists.infradead.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190404074512.145533-1-wangkefeng.wang@huawei.comSigned-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8b4f68b4
    • Stephen Boyd's avatar
      genirq: Respect IRQCHIP_SKIP_SET_WAKE in irq_chip_set_wake_parent() · cd5b06a9
      Stephen Boyd authored
      commit 325aa195 upstream.
      
      If a child irqchip calls irq_chip_set_wake_parent() but its parent irqchip
      has the IRQCHIP_SKIP_SET_WAKE flag set an error is returned.
      
      This is inconsistent behaviour vs. set_irq_wake_real() which returns 0 when
      the irqchip has the IRQCHIP_SKIP_SET_WAKE flag set. It doesn't attempt to
      walk the chain of parents and set irq wake on any chips that don't have the
      flag set either. If the intent is to call the .irq_set_wake() callback of
      the parent irqchip, then we expect irqchip implementations to omit the
      IRQCHIP_SKIP_SET_WAKE flag and implement an .irq_set_wake() function that
      calls irq_chip_set_wake_parent().
      
      The problem has been observed on a Qualcomm sdm845 device where set wake
      fails on any GPIO interrupts after applying work in progress wakeup irq
      patches to the GPIO driver. The chain of chips looks like this:
      
           QCOM GPIO -> QCOM PDC (SKIP) -> ARM GIC (SKIP)
      
      The GPIO controllers parent is the QCOM PDC irqchip which in turn has ARM
      GIC as parent.  The QCOM PDC irqchip has the IRQCHIP_SKIP_SET_WAKE flag
      set, and so does the grandparent ARM GIC.
      
      The GPIO driver doesn't know if the parent needs to set wake or not, so it
      unconditionally calls irq_chip_set_wake_parent() causing this function to
      return a failure because the parent irqchip (PDC) doesn't have the
      .irq_set_wake() callback set. Returning 0 instead makes everything work and
      irqs from the GPIO controller can be configured for wakeup.
      
      Make it consistent by returning 0 (success) from irq_chip_set_wake_parent()
      when a parent chip has IRQCHIP_SKIP_SET_WAKE set.
      
      [ tglx: Massaged changelog ]
      
      Fixes: 08b55e2a ("genirq: Add irqchip_set_wake_parent")
      Signed-off-by: default avatarStephen Boyd <swboyd@chromium.org>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Acked-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-gpio@vger.kernel.org
      Cc: Lina Iyer <ilina@codeaurora.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190325181026.247796-1-swboyd@chromium.orgSigned-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cd5b06a9
    • Jason Yan's avatar
      block: fix the return errno for direct IO · 543bb48d
      Jason Yan authored
      commit a89afe58 upstream.
      
      If the last bio returned is not dio->bio, the status of the bio will
      not assigned to dio->bio if it is error. This will cause the whole IO
      status wrong.
      
          ksoftirqd/21-117   [021] ..s.  4017.966090:   8,0    C   N 4883648 [0]
                <idle>-0     [018] ..s.  4017.970888:   8,0    C  WS 4924800 + 1024 [0]
                <idle>-0     [018] ..s.  4017.970909:   8,0    D  WS 4935424 + 1024 [<idle>]
                <idle>-0     [018] ..s.  4017.970924:   8,0    D  WS 4936448 + 321 [<idle>]
          ksoftirqd/21-117   [021] ..s.  4017.995033:   8,0    C   R 4883648 + 336 [65475]
          ksoftirqd/21-117   [021] d.s.  4018.001988: myprobe1: (blkdev_bio_end_io+0x0/0x168) bi_status=7
          ksoftirqd/21-117   [021] d.s.  4018.001992: myprobe: (aio_complete_rw+0x0/0x148) x0=0xffff802f2595ad80 res=0x12a000 res2=0x0
      
      We always have to assign bio->bi_status to dio->bio.bi_status because we
      will only check dio->bio.bi_status when we return the whole IO to
      the upper layer.
      
      Fixes: 542ff7bf ("block: new direct I/O implementation")
      Cc: stable@vger.kernel.org
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJason Yan <yanaijie@huawei.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      543bb48d
    • Jérôme Glisse's avatar
      block: do not leak memory in bio_copy_user_iov() · 2591bfc6
      Jérôme Glisse authored
      commit a3761c3c upstream.
      
      When bio_add_pc_page() fails in bio_copy_user_iov() we should free
      the page we just allocated otherwise we are leaking it.
      
      Cc: linux-block@vger.kernel.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: stable@vger.kernel.org
      Reviewed-by: default avatarChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2591bfc6
    • Dmitry V. Levin's avatar
      riscv: Fix syscall_get_arguments() and syscall_set_arguments() · 7af20b60
      Dmitry V. Levin authored
      commit 10a16997 upstream.
      
      RISC-V syscall arguments are located in orig_a0,a1..a5 fields
      of struct pt_regs.
      
      Due to an off-by-one bug and a bug in pointer arithmetic
      syscall_get_arguments() was reading s3..s7 fields instead of a1..a5.
      Likewise, syscall_set_arguments() was writing s3..s7 fields
      instead of a1..a5.
      
      Link: http://lkml.kernel.org/r/20190329171221.GA32456@altlinux.org
      
      Fixes: e2c0cdfb ("RISC-V: User-facing API")
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Will Drewry <wad@chromium.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: linux-riscv@lists.infradead.org
      Cc: stable@vger.kernel.org # v4.15+
      Acked-by: default avatarPalmer Dabbelt <palmer@sifive.com>
      Signed-off-by: default avatarDmitry V. Levin <ldv@altlinux.org>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7af20b60
    • Anand Jain's avatar
      btrfs: prop: fix vanished compression property after failed set · 54fb5c9d
      Anand Jain authored
      commit 272e5326 upstream.
      
      The compression property resets to NULL, instead of the old value if we
      fail to set the new compression parameter.
      
        $ btrfs prop get /btrfs compression
          compression=lzo
        $ btrfs prop set /btrfs compression zli
          ERROR: failed to set compression for /btrfs: Invalid argument
        $ btrfs prop get /btrfs compression
      
      This is because the compression property ->validate() is successful for
      'zli' as the strncmp() used the length passed from the userspace.
      
      Fix it by using the expected string length in strncmp().
      
      Fixes: 63541927 ("Btrfs: add support for inode properties")
      Fixes: 5c1aab1d ("btrfs: Add zstd support")
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      54fb5c9d
    • Anand Jain's avatar
      btrfs: prop: fix zstd compression parameter validation · fbfbb996
      Anand Jain authored
      commit 50398fde upstream.
      
      We let pass zstd compression parameter even if it is not fully valid.
      For example:
      
        $ btrfs prop set /btrfs compression zst
        $ btrfs prop get /btrfs compression
           compression=zst
      
      zlib and lzo are fine.
      
      Fix it by checking the correct prefix length.
      
      Fixes: 5c1aab1d ("btrfs: Add zstd support")
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fbfbb996
    • Filipe Manana's avatar
      Btrfs: do not allow trimming when a fs is mounted with the nologreplay option · 16515acd
      Filipe Manana authored
      commit f35f06c3 upstream.
      
      Whan a filesystem is mounted with the nologreplay mount option, which
      requires it to be mounted in RO mode as well, we can not allow discard on
      free space inside block groups, because log trees refer to extents that
      are not pinned in a block group's free space cache (pinning the extents is
      precisely the first phase of replaying a log tree).
      
      So do not allow the fitrim ioctl to do anything when the filesystem is
      mounted with the nologreplay option, because later it can be mounted RW
      without that option, which causes log replay to happen and result in
      either a failure to replay the log trees (leading to a mount failure), a
      crash or some silent corruption.
      Reported-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Fixes: 96da0919 ("btrfs: Introduce new mount option to disable tree log replay")
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      16515acd
    • S.j. Wang's avatar
      ASoC: fsl_esai: fix channel swap issue when stream starts · e6265e36
      S.j. Wang authored
      commit 0ff4e8c6 upstream.
      
      There is very low possibility ( < 0.1% ) that channel swap happened
      in beginning when multi output/input pin is enabled. The issue is
      that hardware can't send data to correct pin in the beginning with
      the normal enable flow.
      
      This is hardware issue, but there is no errata, the workaround flow
      is that: Each time playback/recording, firstly clear the xSMA/xSMB,
      then enable TE/RE, then enable xSMB and xSMA (xSMB must be enabled
      before xSMA). Which is to use the xSMA as the trigger start register,
      previously the xCR_TE or xCR_RE is the bit for starting.
      
      Fixes commit 43d24e76 ("ASoC: fsl_esai: Add ESAI CPU DAI driver")
      Cc: <stable@vger.kernel.org>
      Reviewed-by: default avatarFabio Estevam <festevam@gmail.com>
      Acked-by: default avatarNicolin Chen <nicoleotsuka@gmail.com>
      Signed-off-by: default avatarShengjiu Wang <shengjiu.wang@nxp.com>
      Signed-off-by: default avatarMark Brown <broonie@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e6265e36
    • Guenter Roeck's avatar
      ASoC: intel: Fix crash at suspend/resume after failed codec registration · 19b0a7f5
      Guenter Roeck authored
      commit 8f71370f upstream.
      
      If codec registration fails after the ASoC Intel SST driver has been probed,
      the kernel will Oops and crash at suspend/resume.
      
      general protection fault: 0000 [#1] PREEMPT SMP KASAN PTI
      CPU: 1 PID: 2811 Comm: cat Tainted: G        W         4.19.30 #15
      Hardware name: GOOGLE Clapper, BIOS Google_Clapper.5216.199.7 08/22/2014
      RIP: 0010:snd_soc_suspend+0x5a/0xd21
      Code: 03 80 3c 10 00 49 89 d7 74 0b 48 89 df e8 71 72 c4 fe 4c 89
      fa 48 8b 03 48 89 45 d0 48 8d 98 a0 01 00 00 48 89 d8 48 c1 e8 03
      <8a> 04 10 84 c0 0f 85 85 0c 00 00 80 3b 00 0f 84 6b 0c 00 00 48 8b
      RSP: 0018:ffff888035407750 EFLAGS: 00010202
      RAX: 0000000000000034 RBX: 00000000000001a0 RCX: 0000000000000000
      RDX: dffffc0000000000 RSI: 0000000000000008 RDI: ffff88805c417098
      RBP: ffff8880354077b0 R08: dffffc0000000000 R09: ffffed100b975718
      R10: 0000000000000001 R11: ffffffff949ea4a3 R12: 1ffff1100b975746
      R13: dffffc0000000000 R14: ffff88805cba4588 R15: dffffc0000000000
      FS:  0000794a78e91b80(0000) GS:ffff888068d00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007bd5283ccf58 CR3: 000000004b7aa000 CR4: 00000000001006e0
      Call Trace:
      ? dpm_complete+0x67b/0x67b
      ? i915_gem_suspend+0x14d/0x1ad
      sst_soc_prepare+0x91/0x1dd
      ? sst_be_hw_params+0x7e/0x7e
      dpm_prepare+0x39a/0x88b
      dpm_suspend_start+0x13/0x9d
      suspend_devices_and_enter+0x18f/0xbd7
      ? arch_suspend_enable_irqs+0x11/0x11
      ? printk+0xd9/0x12d
      ? lock_release+0x95f/0x95f
      ? log_buf_vmcoreinfo_setup+0x131/0x131
      ? rcu_read_lock_sched_held+0x140/0x22a
      ? __bpf_trace_rcu_utilization+0xa/0xa
      ? __pm_pr_dbg+0x186/0x190
      ? pm_notifier_call_chain+0x39/0x39
      ? suspend_test+0x9d/0x9d
      pm_suspend+0x2f4/0x728
      ? trace_suspend_resume+0x3da/0x3da
      ? lock_release+0x95f/0x95f
      ? kernfs_fop_write+0x19f/0x32d
      state_store+0xd8/0x147
      ? sysfs_kf_read+0x155/0x155
      kernfs_fop_write+0x23e/0x32d
      __vfs_write+0x108/0x608
      ? vfs_read+0x2e9/0x2e9
      ? rcu_read_lock_sched_held+0x140/0x22a
      ? __bpf_trace_rcu_utilization+0xa/0xa
      ? debug_smp_processor_id+0x10/0x10
      ? selinux_file_permission+0x1c5/0x3c8
      ? rcu_sync_lockdep_assert+0x6a/0xad
      ? __sb_start_write+0x129/0x2ac
      vfs_write+0x1aa/0x434
      ksys_write+0xfe/0x1be
      ? __ia32_sys_read+0x82/0x82
      do_syscall_64+0xcd/0x120
      entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      In the observed situation, the problem is seen because the codec driver
      failed to probe due to a hardware problem.
      
      max98090 i2c-193C9890:00: Failed to read device revision: -1
      max98090 i2c-193C9890:00: ASoC: failed to probe component -1
      cht-bsw-max98090 cht-bsw-max98090: ASoC: failed to instantiate card -1
      cht-bsw-max98090 cht-bsw-max98090: snd_soc_register_card failed -1
      cht-bsw-max98090: probe of cht-bsw-max98090 failed with error -1
      
      The problem is similar to the problem solved with commit 2fc995a8
      ("ASoC: intel: Fix crash at suspend/resume without card registration"),
      but codec registration fails at a later point. At that time, the pointer
      checked with the above mentioned commit is already set, but it is not
      cleared if the device is subsequently removed. Adding a remove function
      to clear the pointer fixes the problem.
      
      Cc: stable@vger.kernel.org
      Cc: Jarkko Nikula <jarkko.nikula@linux.intel.com>
      Cc: Curtis Malainey <cujomalainey@chromium.org>
      Signed-off-by: default avatarGuenter Roeck <linux@roeck-us.net>
      Acked-by: default avatarPierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
      Signed-off-by: default avatarMark Brown <broonie@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      19b0a7f5
    • Greg Thelen's avatar
      mm: writeback: use exact memcg dirty counts · 43f47331
      Greg Thelen authored
      commit 0b3d6e6f upstream.
      
      Since commit a983b5eb ("mm: memcontrol: fix excessive complexity in
      memory.stat reporting") memcg dirty and writeback counters are managed
      as:
      
       1) per-memcg per-cpu values in range of [-32..32]
      
       2) per-memcg atomic counter
      
      When a per-cpu counter cannot fit in [-32..32] it's flushed to the
      atomic.  Stat readers only check the atomic.  Thus readers such as
      balance_dirty_pages() may see a nontrivial error margin: 32 pages per
      cpu.
      
      Assuming 100 cpus:
         4k x86 page_size:  13 MiB error per memcg
        64k ppc page_size: 200 MiB error per memcg
      
      Considering that dirty+writeback are used together for some decisions the
      errors double.
      
      This inaccuracy can lead to undeserved oom kills.  One nasty case is
      when all per-cpu counters hold positive values offsetting an atomic
      negative value (i.e.  per_cpu[*]=32, atomic=n_cpu*-32).
      balance_dirty_pages() only consults the atomic and does not consider
      throttling the next n_cpu*32 dirty pages.  If the file_lru is in the
      13..200 MiB range then there's absolutely no dirty throttling, which
      burdens vmscan with only dirty+writeback pages thus resorting to oom
      kill.
      
      It could be argued that tiny containers are not supported, but it's more
      subtle.  It's the amount the space available for file lru that matters.
      If a container has memory.max-200MiB of non reclaimable memory, then it
      will also suffer such oom kills on a 100 cpu machine.
      
      The following test reliably ooms without this patch.  This patch avoids
      oom kills.
      
        $ cat test
        mount -t cgroup2 none /dev/cgroup
        cd /dev/cgroup
        echo +io +memory > cgroup.subtree_control
        mkdir test
        cd test
        echo 10M > memory.max
        (echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo)
        (echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100)
      
        $ cat memcg-writeback-stress.c
        /*
         * Dirty pages from all but one cpu.
         * Clean pages from the non dirtying cpu.
         * This is to stress per cpu counter imbalance.
         * On a 100 cpu machine:
         * - per memcg per cpu dirty count is 32 pages for each of 99 cpus
         * - per memcg atomic is -99*32 pages
         * - thus the complete dirty limit: sum of all counters 0
         * - balance_dirty_pages() only sees atomic count -99*32 pages, which
         *   it max()s to 0.
         * - So a workload can dirty -99*32 pages before balance_dirty_pages()
         *   cares.
         */
        #define _GNU_SOURCE
        #include <err.h>
        #include <fcntl.h>
        #include <sched.h>
        #include <stdlib.h>
        #include <stdio.h>
        #include <sys/stat.h>
        #include <sys/sysinfo.h>
        #include <sys/types.h>
        #include <unistd.h>
      
        static char *buf;
        static int bufSize;
      
        static void set_affinity(int cpu)
        {
        	cpu_set_t affinity;
      
        	CPU_ZERO(&affinity);
        	CPU_SET(cpu, &affinity);
        	if (sched_setaffinity(0, sizeof(affinity), &affinity))
        		err(1, "sched_setaffinity");
        }
      
        static void dirty_on(int output_fd, int cpu)
        {
        	int i, wrote;
      
        	set_affinity(cpu);
        	for (i = 0; i < 32; i++) {
        		for (wrote = 0; wrote < bufSize; ) {
        			int ret = write(output_fd, buf+wrote, bufSize-wrote);
        			if (ret == -1)
        				err(1, "write");
        			wrote += ret;
        		}
        	}
        }
      
        int main(int argc, char **argv)
        {
        	int cpu, flush_cpu = 1, output_fd;
        	const char *output;
      
        	if (argc != 2)
        		errx(1, "usage: output_file");
      
        	output = argv[1];
        	bufSize = getpagesize();
        	buf = malloc(getpagesize());
        	if (buf == NULL)
        		errx(1, "malloc failed");
      
        	output_fd = open(output, O_CREAT|O_RDWR);
        	if (output_fd == -1)
        		err(1, "open(%s)", output);
      
        	for (cpu = 0; cpu < get_nprocs(); cpu++) {
        		if (cpu != flush_cpu)
        			dirty_on(output_fd, cpu);
        	}
      
        	set_affinity(flush_cpu);
        	if (fsync(output_fd))
        		err(1, "fsync(%s)", output);
        	if (close(output_fd))
        		err(1, "close(%s)", output);
        	free(buf);
        }
      
      Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
      collect exact per memcg counters.  This avoids the aforementioned oom
      kills.
      
      This does not affect the overhead of memory.stat, which still reads the
      single atomic counter.
      
      Why not use percpu_counter? memcg already handles cpus going offline, so
      no need for that overhead from percpu_counter.  And the percpu_counter
      spinlocks are more heavyweight than is required.
      
      It probably also makes sense to use exact dirty and writeback counters
      in memcg oom reports.  But that is saved for later.
      
      Link: http://lkml.kernel.org/r/20190329174609.164344-1-gthelen@google.comSigned-off-by: default avatarGreg Thelen <gthelen@google.com>
      Reviewed-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.16+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      43f47331
    • Arnd Bergmann's avatar
      include/linux/bitrev.h: fix constant bitrev · 576f22ac
      Arnd Bergmann authored
      commit 6147e136 upstream.
      
      clang points out with hundreds of warnings that the bitrev macros have a
      problem with constant input:
      
        drivers/hwmon/sht15.c:187:11: error: variable '__x' is uninitialized when used within its own initialization
              [-Werror,-Wuninitialized]
                u8 crc = bitrev8(data->val_status & 0x0F);
                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        include/linux/bitrev.h:102:21: note: expanded from macro 'bitrev8'
                __constant_bitrev8(__x) :                       \
                ~~~~~~~~~~~~~~~~~~~^~~~
        include/linux/bitrev.h:67:11: note: expanded from macro '__constant_bitrev8'
                u8 __x = x;                     \
                   ~~~   ^
      
      Both the bitrev and the __constant_bitrev macros use an internal
      variable named __x, which goes horribly wrong when passing one to the
      other.
      
      The obvious fix is to rename one of the variables, so this adds an extra
      '_'.
      
      It seems we got away with this because
      
       - there are only a few drivers using bitrev macros
      
       - usually there are no constant arguments to those
      
       - when they are constant, they tend to be either 0 or (unsigned)-1
         (drivers/isdn/i4l/isdnhdlc.o, drivers/iio/amplifiers/ad8366.c) and
         give the correct result by pure chance.
      
      In fact, the only driver that I could find that gets different results
      with this is drivers/net/wan/slic_ds26522.c, which in turn is a driver
      for fairly rare hardware (adding the maintainer to Cc for testing).
      
      Link: http://lkml.kernel.org/r/20190322140503.123580-1-arnd@arndb.de
      Fixes: 556d2f05 ("ARM: 8187/1: add CONFIG_HAVE_ARCH_BITREVERSE to support rbit instruction")
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Reviewed-by: default avatarNick Desaulniers <ndesaulniers@google.com>
      Cc: Zhao Qiang <qiang.zhao@nxp.com>
      Cc: Yalin Wang <yalin.wang@sonymobile.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      576f22ac
    • David Rientjes's avatar
      kvm: svm: fix potential get_num_contig_pages overflow · c4f103f6
      David Rientjes authored
      commit ede885ec upstream.
      
      get_num_contig_pages() could potentially overflow int so make its type
      consistent with its usage.
      Reported-by: default avatarCfir Cohen <cfir@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c4f103f6
    • Dave Airlie's avatar
      drm/udl: add a release method and delay modeset teardown · 93d60348
      Dave Airlie authored
      commit 9b39b013 upstream.
      
      If we unplug a udl device, the usb callback with deinit the
      mode_config struct, however userspace will still have an open
      file descriptor and a framebuffer on that device. When userspace
      closes the fd, we'll oops because it'll try and look stuff up
      in the object idr which we've destroyed.
      
      This punts destroying the mode objects until release time instead.
      
      Cc: stable@vger.kernel.org
      Reviewed-by: default avatarDaniel Vetter <daniel.vetter@ffwll.ch>
      Signed-off-by: default avatarDave Airlie <airlied@redhat.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190405031715.5959-2-airlied@gmail.comSigned-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      93d60348
    • Yan Zhao's avatar
      drm/i915/gvt: do not deliver a workload if its creation fails · df4106f2
      Yan Zhao authored
      commit dade58ed upstream.
      
      in workload creation routine, if any failure occurs, do not queue this
      workload for delivery. if this failure is fatal, enter into failsafe
      mode.
      
      Fixes: 6d763035 ("drm/i915/gvt: Move common vGPU workload creation into scheduler.c")
      Cc: stable@vger.kernel.org #4.19+
      Cc: zhenyuw@linux.intel.com
      Signed-off-by: default avatarYan Zhao <yan.y.zhao@intel.com>
      Signed-off-by: default avatarZhenyu Wang <zhenyuw@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      df4106f2
    • Andrei Vagin's avatar
      alarmtimer: Return correct remaining time · a5277bcc
      Andrei Vagin authored
      commit 07d7e120 upstream.
      
      To calculate a remaining time, it's required to subtract the current time
      from the expiration time. In alarm_timer_remaining() the arguments of
      ktime_sub are swapped.
      
      Fixes: d653d845 ("alarmtimer: Implement remaining callback")
      Signed-off-by: default avatarAndrei Vagin <avagin@gmail.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarMukesh Ojha <mojha@codeaurora.org>
      Cc: Stephen Boyd <sboyd@kernel.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190408041542.26338-1-avagin@gmail.comSigned-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a5277bcc
    • Sven Schnelle's avatar
      parisc: also set iaoq_b in instruction_pointer_set() · 5db86e2a
      Sven Schnelle authored
      commit f324fa58 upstream.
      
      When setting the instruction pointer on PA-RISC we also need
      to set the back of the instruction queue to the new offset, otherwise
      we will execute on instruction from the new location, and jumping
      back to the old location stored in iaoq_b.
      Signed-off-by: default avatarSven Schnelle <svens@stackframe.org>
      Signed-off-by: default avatarHelge Deller <deller@gmx.de>
      Fixes: 75ebedf1 ("parisc: Add HAVE_REGS_AND_STACK_ACCESS_API feature")
      Cc: stable@vger.kernel.org # 4.19+
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5db86e2a
    • Sven Schnelle's avatar
      parisc: regs_return_value() should return gpr28 · 53bb8444
      Sven Schnelle authored
      commit 45efd871 upstream.
      
      While working on kretprobes for PA-RISC I was wondering while the
      kprobes sanity test always fails on kretprobes. This is caused by
      returning gpr20 instead of gpr28.
      Signed-off-by: default avatarSven Schnelle <svens@stackframe.org>
      Signed-off-by: default avatarHelge Deller <deller@gmx.de>
      Cc: stable@vger.kernel.org # 4.14+
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      53bb8444
    • Helge Deller's avatar
      parisc: Detect QEMU earlier in boot process · 41cf8111
      Helge Deller authored
      commit d006e95b upstream.
      
      While adding LASI support to QEMU, I noticed that the QEMU detection in
      the kernel happens much too late. For example, when a LASI chip is found
      by the kernel, it registers the LASI LED driver as well.  But when we
      run on QEMU it makes sense to avoid spending unnecessary CPU cycles, so
      we need to access the running_on_QEMU flag earlier than before.
      
      This patch now makes the QEMU detection the fist task of the Linux
      kernel by moving it to where the kernel enters the C-coding.
      
      Fixes: 310d8278 ("parisc: qemu idle sleep support")
      Signed-off-by: default avatarHelge Deller <deller@gmx.de>
      Cc: stable@vger.kernel.org # v4.14+
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      41cf8111