1. 14 Feb, 2020 35 commits
  2. 11 Feb, 2020 5 commits
    • Greg Kroah-Hartman's avatar
      Linux 4.19.103 · 35766839
      Greg Kroah-Hartman authored
      35766839
    • David Howells's avatar
      rxrpc: Fix service call disconnection · 06748661
      David Howells authored
      [ Upstream commit b39a934e ]
      
      The recent patch that substituted a flag on an rxrpc_call for the
      connection pointer being NULL as an indication that a call was disconnected
      puts the set_bit in the wrong place for service calls.  This is only a
      problem if a call is implicitly terminated by a new call coming in on the
      same connection channel instead of a terminating ACK packet.
      
      In such a case, rxrpc_input_implicit_end_call() calls
      __rxrpc_disconnect_call(), which is now (incorrectly) setting the
      disconnection bit, meaning that when rxrpc_release_call() is later called,
      it doesn't call rxrpc_disconnect_call() and so the call isn't removed from
      the peer's error distribution list and the list gets corrupted.
      
      KASAN finds the issue as an access after release on a call, but the
      position at which it occurs is confusing as it appears to be related to a
      different call (the call site is where the latter call is being removed
      from the error distribution list and either the next or pprev pointer
      points to a previously released call).
      
      Fix this by moving the setting of the flag from __rxrpc_disconnect_call()
      to rxrpc_disconnect_call() in the same place that the connection pointer
      was being cleared.
      
      Fixes: 5273a191 ("rxrpc: Fix NULL pointer deref due to call->conn being cleared on disconnect")
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      06748661
    • Song Liu's avatar
      perf/core: Fix mlock accounting in perf_mmap() · a3623db4
      Song Liu authored
      commit 00346155 upstream.
      
      Decreasing sysctl_perf_event_mlock between two consecutive perf_mmap()s of
      a perf ring buffer may lead to an integer underflow in locked memory
      accounting. This may lead to the undesired behaviors, such as failures in
      BPF map creation.
      
      Address this by adjusting the accounting logic to take into account the
      possibility that the amount of already locked memory may exceed the
      current limit.
      
      Fixes: c4b75479 ("perf/core: Make the mlock accounting simple again")
      Suggested-by: default avatarAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Cc: <stable@vger.kernel.org>
      Acked-by: default avatarAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Link: https://lkml.kernel.org/r/20200123181146.2238074-1-songliubraving@fb.comSigned-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a3623db4
    • Konstantin Khlebnikov's avatar
      clocksource: Prevent double add_timer_on() for watchdog_timer · 6284d30e
      Konstantin Khlebnikov authored
      commit febac332 upstream.
      
      Kernel crashes inside QEMU/KVM are observed:
      
        kernel BUG at kernel/time/timer.c:1154!
        BUG_ON(timer_pending(timer) || !timer->function) in add_timer_on().
      
      At the same time another cpu got:
      
        general protection fault: 0000 [#1] SMP PTI of poinson pointer 0xdead000000000200 in:
      
        __hlist_del at include/linux/list.h:681
        (inlined by) detach_timer at kernel/time/timer.c:818
        (inlined by) expire_timers at kernel/time/timer.c:1355
        (inlined by) __run_timers at kernel/time/timer.c:1686
        (inlined by) run_timer_softirq at kernel/time/timer.c:1699
      
      Unfortunately kernel logs are badly scrambled, stacktraces are lost.
      
      Printing the timer->function before the BUG_ON() pointed to
      clocksource_watchdog().
      
      The execution of clocksource_watchdog() can race with a sequence of
      clocksource_stop_watchdog() .. clocksource_start_watchdog():
      
      expire_timers()
       detach_timer(timer, true);
        timer->entry.pprev = NULL;
       raw_spin_unlock_irq(&base->lock);
       call_timer_fn
        clocksource_watchdog()
      
      					clocksource_watchdog_kthread() or
      					clocksource_unbind()
      
      					spin_lock_irqsave(&watchdog_lock, flags);
      					clocksource_stop_watchdog();
      					 del_timer(&watchdog_timer);
      					 watchdog_running = 0;
      					spin_unlock_irqrestore(&watchdog_lock, flags);
      
      					spin_lock_irqsave(&watchdog_lock, flags);
      					clocksource_start_watchdog();
      					 add_timer_on(&watchdog_timer, ...);
      					 watchdog_running = 1;
      					spin_unlock_irqrestore(&watchdog_lock, flags);
      
        spin_lock(&watchdog_lock);
        add_timer_on(&watchdog_timer, ...);
         BUG_ON(timer_pending(timer) || !timer->function);
          timer_pending() -> true
          BUG()
      
      I.e. inside clocksource_watchdog() watchdog_timer could be already armed.
      
      Check timer_pending() before calling add_timer_on(). This is sufficient as
      all operations are synchronized by watchdog_lock.
      
      Fixes: 75c5158f ("timekeeping: Update clocksource with stop_machine")
      Signed-off-by: default avatarKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/158048693917.4378.13823603769948933793.stgit@buzzSigned-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6284d30e
    • Thomas Gleixner's avatar
      x86/apic/msi: Plug non-maskable MSI affinity race · 032a2bf9
      Thomas Gleixner authored
      commit 6f1a4891 upstream.
      
      Evan tracked down a subtle race between the update of the MSI message and
      the device raising an interrupt internally on PCI devices which do not
      support MSI masking. The update of the MSI message is non-atomic and
      consists of either 2 or 3 sequential 32bit wide writes to the PCI config
      space.
      
         - Write address low 32bits
         - Write address high 32bits (If supported by device)
         - Write data
      
      When an interrupt is migrated then both address and data might change, so
      the kernel attempts to mask the MSI interrupt first. But for MSI masking is
      optional, so there exist devices which do not provide it. That means that
      if the device raises an interrupt internally between the writes then a MSI
      message is sent built from half updated state.
      
      On x86 this can lead to spurious interrupts on the wrong interrupt
      vector when the affinity setting changes both address and data. As a
      consequence the device interrupt can be lost causing the device to
      become stuck or malfunctioning.
      
      Evan tried to handle that by disabling MSI accross an MSI message
      update. That's not feasible because disabling MSI has issues on its own:
      
       If MSI is disabled the PCI device is routing an interrupt to the legacy
       INTx mechanism. The INTx delivery can be disabled, but the disablement is
       not working on all devices.
      
       Some devices lose interrupts when both MSI and INTx delivery are disabled.
      
      Another way to solve this would be to enforce the allocation of the same
      vector on all CPUs in the system for this kind of screwed devices. That
      could be done, but it would bring back the vector space exhaustion problems
      which got solved a few years ago.
      
      Fortunately the high address (if supported by the device) is only relevant
      when X2APIC is enabled which implies interrupt remapping. In the interrupt
      remapping case the affinity setting is happening at the interrupt remapping
      unit and the PCI MSI message is programmed only once when the PCI device is
      initialized.
      
      That makes it possible to solve it with a two step update:
      
        1) Target the MSI msg to the new vector on the current target CPU
      
        2) Target the MSI msg to the new vector on the new target CPU
      
      In both cases writing the MSI message is only changing a single 32bit word
      which prevents the issue of inconsistency.
      
      After writing the final destination it is necessary to check whether the
      device issued an interrupt while the intermediate state #1 (new vector,
      current CPU) was in effect.
      
      This is possible because the affinity change is always happening on the
      current target CPU. The code runs with interrupts disabled, so the
      interrupt can be detected by checking the IRR of the local APIC. If the
      vector is pending in the IRR then the interrupt is retriggered on the new
      target CPU by sending an IPI for the associated vector on the target CPU.
      
      This can cause spurious interrupts on both the local and the new target
      CPU.
      
       1) If the new vector is not in use on the local CPU and the device
          affected by the affinity change raised an interrupt during the
          transitional state (step #1 above) then interrupt entry code will
          ignore that spurious interrupt. The vector is marked so that the
          'No irq handler for vector' warning is supressed once.
      
       2) If the new vector is in use already on the local CPU then the IRR check
          might see an pending interrupt from the device which is using this
          vector. The IPI to the new target CPU will then invoke the handler of
          the device, which got the affinity change, even if that device did not
          issue an interrupt
      
       3) If the new vector is in use already on the local CPU and the device
          affected by the affinity change raised an interrupt during the
          transitional state (step #1 above) then the handler of the device which
          uses that vector on the local CPU will be invoked.
      
      expose issues in device driver interrupt handlers which are not prepared to
      handle a spurious interrupt correctly. This not a regression, it's just
      exposing something which was already broken as spurious interrupts can
      happen for a lot of reasons and all driver handlers need to be able to deal
      with them.
      Reported-by: default avatarEvan Green <evgreen@chromium.org>
      Debugged-by: default avatarEvan Green <evgreen@chromium.org>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Tested-by: default avatarEvan Green <evgreen@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/87imkr4s7n.fsf@nanos.tec.linutronix.deSigned-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      032a2bf9