14 Feb, 2020 17 commits
  11 Feb, 2020 23 commits
    • Linux 4.19.103 · 35766839
      Greg Kroah-Hartman authored
    • rxrpc: Fix service call disconnection · 06748661
      David Howells authored
      [ Upstream commit b39a934e ]
      
      The recent patch that substituted a flag on an rxrpc_call for the
      connection pointer being NULL as an indication that a call was disconnected
      puts the set_bit in the wrong place for service calls.  This is only a
      problem if a call is implicitly terminated by a new call coming in on the
      same connection channel instead of a terminating ACK packet.
      
      In such a case, rxrpc_input_implicit_end_call() calls
      __rxrpc_disconnect_call(), which is now (incorrectly) setting the
      disconnection bit, meaning that when rxrpc_release_call() is later called,
      it doesn't call rxrpc_disconnect_call() and so the call isn't removed from
      the peer's error distribution list and the list gets corrupted.
      
      KASAN finds the issue as an access after release on a call, but the
      position at which it occurs is confusing as it appears to be related to a
      different call (the call site is where the latter call is being removed
      from the error distribution list and either the next or pprev pointer
      points to a previously released call).
      
      Fix this by moving the setting of the flag from __rxrpc_disconnect_call()
      to rxrpc_disconnect_call() in the same place that the connection pointer
      was being cleared.
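
      A plain C model of the bug class (hypothetical names, not the rxrpc
      code): if the shared inner helper sets the flag, the implicit-end path
      marks the call disconnected without ever detaching it from the peer's
      error list.

        #include <assert.h>
        #include <stdbool.h>

        struct call {
                bool disconnected;      /* models RXRPC_CALL_DISCONNECTED */
                bool on_error_list;     /* models the peer's error list */
        };

        /* Shared channel teardown, also used by the implicit-end path.
         * The buggy version set c->disconnected = true in here. */
        static void __disconnect_call(struct call *c) { (void)c; }

        /* Full disconnect: the only place the flag may be set. */
        static void disconnect_call(struct call *c)
        {
                __disconnect_call(c);
                c->disconnected = true;
                c->on_error_list = false;  /* detach from the error list */
        }

        static void input_implicit_end_call(struct call *c)
        {
                __disconnect_call(c);      /* must NOT set the flag */
        }

        static void release_call(struct call *c)
        {
                if (!c->disconnected)
                        disconnect_call(c);
                /* with the flag set too early, this entry would leak and
                 * later list walks would touch a released call */
                assert(!c->on_error_list);
        }

        int main(void)
        {
                struct call c = { false, true };

                input_implicit_end_call(&c);
                release_call(&c);          /* passes with the fix applied */
                return 0;
        }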
      
      Fixes: 5273a191 ("rxrpc: Fix NULL pointer deref due to call->conn being cleared on disconnect")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • perf/core: Fix mlock accounting in perf_mmap() · a3623db4
      Song Liu authored
      commit 00346155 upstream.
      
      Decreasing sysctl_perf_event_mlock between two consecutive perf_mmap()s of
      a perf ring buffer may lead to an integer underflow in locked memory
      accounting.  This may lead to undesired behaviors, such as failures in
      BPF map creation.
      
      Address this by adjusting the accounting logic to take into account the
      possibility that the amount of already locked memory may exceed the
      current limit.
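
      The underflow is ordinary unsigned arithmetic and can be shown outside
      the kernel; a minimal sketch with illustrative names, not the real
      perf_mmap() logic:

        #include <stdio.h>

        /* Pages we may still lock given the current limit; the naive
         * "limit - locked" underflows when an earlier mmap() already
         * locked more than a since-lowered sysctl_perf_event_mlock. */
        static unsigned long headroom(unsigned long limit,
                                      unsigned long locked)
        {
                return locked < limit ? limit - locked : 0;
        }

        int main(void)
        {
                unsigned long limit = 512, locked = 516;   /* pages */

                printf("naive:   %lu\n", limit - locked);  /* wraps */
                printf("clamped: %lu\n", headroom(limit, locked));
                return 0;
        }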
      
      Fixes: c4b75479 ("perf/core: Make the mlock accounting simple again")
      Suggested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Cc: <stable@vger.kernel.org>
      Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Link: https://lkml.kernel.org/r/20200123181146.2238074-1-songliubraving@fb.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • clocksource: Prevent double add_timer_on() for watchdog_timer · 6284d30e
      Konstantin Khlebnikov authored
      commit febac332 upstream.
      
      Kernel crashes inside QEMU/KVM are observed:
      
        kernel BUG at kernel/time/timer.c:1154!
        BUG_ON(timer_pending(timer) || !timer->function) in add_timer_on().
      
      At the same time another cpu got:
      
        general protection fault: 0000 [#1] SMP PTI of poison pointer 0xdead000000000200 in:
      
        __hlist_del at include/linux/list.h:681
        (inlined by) detach_timer at kernel/time/timer.c:818
        (inlined by) expire_timers at kernel/time/timer.c:1355
        (inlined by) __run_timers at kernel/time/timer.c:1686
        (inlined by) run_timer_softirq at kernel/time/timer.c:1699
      
      Unfortunately kernel logs are badly scrambled, stacktraces are lost.
      
      Printing the timer->function before the BUG_ON() pointed to
      clocksource_watchdog().
      
      The execution of clocksource_watchdog() can race with a sequence of
      clocksource_stop_watchdog() .. clocksource_start_watchdog():
      
      expire_timers()
       detach_timer(timer, true);
        timer->entry.pprev = NULL;
       raw_spin_unlock_irq(&base->lock);
       call_timer_fn
        clocksource_watchdog()
      
      					clocksource_watchdog_kthread() or
      					clocksource_unbind()
      
      					spin_lock_irqsave(&watchdog_lock, flags);
      					clocksource_stop_watchdog();
      					 del_timer(&watchdog_timer);
      					 watchdog_running = 0;
      					spin_unlock_irqrestore(&watchdog_lock, flags);
      
      					spin_lock_irqsave(&watchdog_lock, flags);
      					clocksource_start_watchdog();
      					 add_timer_on(&watchdog_timer, ...);
      					 watchdog_running = 1;
      					spin_unlock_irqrestore(&watchdog_lock, flags);
      
        spin_lock(&watchdog_lock);
        add_timer_on(&watchdog_timer, ...);
         BUG_ON(timer_pending(timer) || !timer->function);
          timer_pending() -> true
          BUG()
      
      I.e., inside clocksource_watchdog() the watchdog_timer could already be armed.
      
      Check timer_pending() before calling add_timer_on(). This is sufficient as
      all operations are synchronized by watchdog_lock.
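
      A compact model of the fix with stand-in types (in the kernel the check
      happens in clocksource_watchdog() with watchdog_lock held):

        #include <assert.h>
        #include <stdbool.h>

        struct timer {
                bool pending;            /* stands in for timer_pending() */
        };

        static void add_timer_on(struct timer *t)
        {
                assert(!t->pending);     /* models the BUG_ON() above */
                t->pending = true;
        }

        int main(void)
        {
                struct timer watchdog_timer = { false };

                /* A racing stop/start sequence already re-armed it... */
                add_timer_on(&watchdog_timer);

                /* ...so the watchdog must re-check before re-arming;
                 * watchdog_lock serializes all of this in the kernel. */
                if (!watchdog_timer.pending)
                        add_timer_on(&watchdog_timer);
                return 0;
        }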
      
      Fixes: 75c5158f ("timekeeping: Update clocksource with stop_machine")
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/158048693917.4378.13823603769948933793.stgit@buzz
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/apic/msi: Plug non-maskable MSI affinity race · 032a2bf9
      Thomas Gleixner authored
      commit 6f1a4891 upstream.
      
      Evan tracked down a subtle race between the update of the MSI message and
      the device raising an interrupt internally on PCI devices which do not
      support MSI masking. The update of the MSI message is non-atomic and
      consists of either 2 or 3 sequential 32bit wide writes to the PCI config
      space.
      
         - Write address low 32bits
         - Write address high 32bits (If supported by device)
         - Write data
      
      When an interrupt is migrated then both address and data might change, so
      the kernel attempts to mask the MSI interrupt first. But MSI masking is
      optional, so there exist devices which do not provide it. That means that
      if the device raises an interrupt internally between the writes then an
      MSI message built from half-updated state is sent.
      
      On x86 this can lead to spurious interrupts on the wrong interrupt
      vector when the affinity setting changes both address and data. As a
      consequence the device interrupt can be lost causing the device to
      become stuck or malfunctioning.
      
      Evan tried to handle that by disabling MSI across an MSI message
      update. That's not feasible because disabling MSI has issues of its own:
      
       If MSI is disabled the PCI device is routing an interrupt to the legacy
       INTx mechanism. The INTx delivery can be disabled, but the disablement
       does not work on all devices.
      
       Some devices lose interrupts when both MSI and INTx delivery are disabled.
      
      Another way to solve this would be to enforce the allocation of the same
      vector on all CPUs in the system for this kind of screwed device. That
      could be done, but it would bring back the vector space exhaustion
      problems which got solved a few years ago.
      
      Fortunately the high address (if supported by the device) is only relevant
      when X2APIC is enabled which implies interrupt remapping. In the interrupt
      remapping case the affinity setting is happening at the interrupt remapping
      unit and the PCI MSI message is programmed only once when the PCI device is
      initialized.
      
      That makes it possible to solve it with a two step update:
      
        1) Target the MSI msg to the new vector on the current target CPU
      
        2) Target the MSI msg to the new vector on the new target CPU
      
      In both cases writing the MSI message is only changing a single 32bit word
      which prevents the issue of inconsistency.
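
      A self-contained sketch of why this is safe; the struct mirrors the MSI
      message layout, the helper names are illustrative. Each step touches
      exactly one 32-bit word, so any message the device latches in between is
      fully valid:

        #include <stdint.h>
        #include <stdio.h>

        struct msi_msg {
                uint32_t address_lo;  /* destination APIC ID, bits 19:12 */
                uint32_t address_hi;  /* only used with interrupt remapping */
                uint32_t data;        /* vector, bits 7:0 */
        };

        /* Step 1: new vector, unchanged destination - one 32-bit write. */
        static void msi_set_vector(struct msi_msg *msg, uint8_t vector)
        {
                msg->data = (msg->data & ~0xffu) | vector;
        }

        /* Step 2: unchanged vector, new destination - one 32-bit write. */
        static void msi_set_dest(struct msi_msg *msg, uint8_t apic_id)
        {
                msg->address_lo = (msg->address_lo & ~0xff000u)
                                  | ((uint32_t)apic_id << 12);
        }

        int main(void)
        {
                struct msi_msg msg = { 0xfee00000u, 0, 0x30 };

                msi_set_vector(&msg, 0x31);  /* step 1: current CPU */
                msi_set_dest(&msg, 3);       /* step 2: new CPU */
                printf("addr=%#x data=%#x\n", msg.address_lo, msg.data);
                return 0;
        }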
      
      After writing the final destination it is necessary to check whether the
      device issued an interrupt while the intermediate state #1 (new vector,
      current CPU) was in effect.
      
      This is possible because the affinity change is always happening on the
      current target CPU. The code runs with interrupts disabled, so the
      interrupt can be detected by checking the IRR of the local APIC. If the
      vector is pending in the IRR then the interrupt is retriggered on the new
      target CPU by sending an IPI for the associated vector on the target CPU.
      
      This can cause spurious interrupts on both the local and the new target
      CPU.
      
       1) If the new vector is not in use on the local CPU and the device
          affected by the affinity change raised an interrupt during the
          transitional state (step #1 above) then interrupt entry code will
          ignore that spurious interrupt. The vector is marked so that the
          'No irq handler for vector' warning is suppressed once.
      
       2) If the new vector is in use already on the local CPU then the IRR check
          might see a pending interrupt from the device which is using this
          vector. The IPI to the new target CPU will then invoke the handler of
          the device, which got the affinity change, even if that device did not
          issue an interrupt.
      
       3) If the new vector is in use already on the local CPU and the device
          affected by the affinity change raised an interrupt during the
          transitional state (step #1 above) then the handler of the device which
          uses that vector on the local CPU will be invoked.
      

      Cases #2 and #3 can expose issues in device driver interrupt handlers
      which are not prepared to handle a spurious interrupt correctly. This is
      not a regression, it's just exposing something which was already broken,
      as spurious interrupts can happen for a lot of reasons and all driver
      handlers need to be able to deal with them.
      Reported-by: Evan Green <evgreen@chromium.org>
      Debugged-by: Evan Green <evgreen@chromium.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Evan Green <evgreen@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/87imkr4s7n.fsf@nanos.tec.linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • cifs: fail i/o on soft mounts if sessionsetup errors out · 71a47ed6
      Ronnie Sahlberg authored
      commit b0dd940e upstream.
      
      RHBZ: 1579050
      
      If we have a soft mount we should fail commands for session-setup
      failures (such as the password having changed, the account being
      deleted, ...) and return an error back to the application.
      Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
      CC: Stable <stable@vger.kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm/page_alloc.c: fix uninitialized memmaps on a partially populated last section · 0a69047d
      David Hildenbrand authored
      [ Upstream commit e822969c ]
      
      Patch series "mm: fix max_pfn not falling on section boundary", v2.
      
      Playing with different memory sizes for an x86-64 guest, I discovered that
      some memmaps (highest section if max_mem does not fall on the section
      boundary) are marked as being valid and online, but contain garbage.  We
      have to properly initialize these memmaps.
      
      Looking at /proc/kpageflags and friends, I found some more issues,
      partially related to this.
      
      This patch (of 3):
      
      If max_pfn is not aligned to a section boundary, we can easily run into
      BUGs.  This can, e.g., be triggered on x86-64 under QEMU by specifying a
      memory size that is not a multiple of 128MB (e.g., 4097MB, but also
      4160MB).  I was told that on real HW, we can easily have this scenario
      (esp., one of the main reasons sub-section hotadd of devmem was added).
      
      The issue is that we have a valid memmap (pfn_valid()) for the whole
      section, and the whole section will be marked "online".
      pfn_to_online_page() will succeed, but the memmap contains garbage.
      
      E.g., doing a "./page-types -r -a 0x144001" when QEMU was started with
      "-m 4160M" (see tools/vm/page-types.c):
      
      [  200.476376] BUG: unable to handle page fault for address: fffffffffffffffe
      [  200.477500] #PF: supervisor read access in kernel mode
      [  200.478334] #PF: error_code(0x0000) - not-present page
      [  200.479076] PGD 59614067 P4D 59614067 PUD 59616067 PMD 0
      [  200.479557] Oops: 0000 [#4] SMP NOPTI
      [  200.479875] CPU: 0 PID: 603 Comm: page-types Tainted: G      D W         5.5.0-rc1-next-20191209 #93
      [  200.480646] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4
      [  200.481648] RIP: 0010:stable_page_flags+0x4d/0x410
      [  200.482061] Code: f3 ff 41 89 c0 48 b8 00 00 00 00 01 00 00 00 45 84 c0 0f 85 cd 02 00 00 48 8b 53 08 48 8b 2b 48f
      [  200.483644] RSP: 0018:ffffb139401cbe60 EFLAGS: 00010202
      [  200.484091] RAX: fffffffffffffffe RBX: fffffbeec5100040 RCX: 0000000000000000
      [  200.484697] RDX: 0000000000000001 RSI: ffffffff9535c7cd RDI: 0000000000000246
      [  200.485313] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000
      [  200.485917] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000144001
      [  200.486523] R13: 00007ffd6ba55f48 R14: 00007ffd6ba55f40 R15: ffffb139401cbf08
      [  200.487130] FS:  00007f68df717580(0000) GS:ffff9ec77fa00000(0000) knlGS:0000000000000000
      [  200.487804] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  200.488295] CR2: fffffffffffffffe CR3: 0000000135d48000 CR4: 00000000000006f0
      [  200.488897] Call Trace:
      [  200.489115]  kpageflags_read+0xe9/0x140
      [  200.489447]  proc_reg_read+0x3c/0x60
      [  200.489755]  vfs_read+0xc2/0x170
      [  200.490037]  ksys_pread64+0x65/0xa0
      [  200.490352]  do_syscall_64+0x5c/0xa0
      [  200.490665]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      But it can be triggered much easier via "cat /proc/kpageflags > /dev/null"
      after cold/hot plugging a DIMM to such a system:
      
      [root@localhost ~]# cat /proc/kpageflags > /dev/null
      [  111.517275] BUG: unable to handle page fault for address: fffffffffffffffe
      [  111.517907] #PF: supervisor read access in kernel mode
      [  111.518333] #PF: error_code(0x0000) - not-present page
      [  111.518771] PGD a240e067 P4D a240e067 PUD a2410067 PMD 0
      
      This patch fixes that by at least zeroing out that memmap (so e.g.,
      page_to_pfn() will not crash).  Commit 907ec5fc ("mm: zero remaining
      unavailable struct pages") tried to fix a similar issue, but forgot to
      consider this special case.
      
      After this patch, there are still problems to solve.  E.g., not all of
      these pages falling into a memory hole will actually get initialized later
      and set PageReserved - they are only zeroed out - but at least the
      immediate crashes are gone.  A follow-up patch will take care of this.
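
      For illustration, the arithmetic behind the "-m 4160M" example, assuming
      the usual x86-64 constants (4KiB pages, 128MB sections):

        #include <stdio.h>

        #define PAGE_SHIFT 12
        #define PAGES_PER_SECTION ((128ul << 20) >> PAGE_SHIFT) /* 32768 */

        int main(void)
        {
                unsigned long max_pfn = (4160ul << 20) >> PAGE_SHIFT;
                unsigned long in_last = max_pfn % PAGES_PER_SECTION;

                /* 4160MB ends half a section past a boundary, so 16384
                 * pfns of the last section have a memmap and count as
                 * "online", but were never initialized. */
                printf("uninitialized tail: %lu pfns\n",
                       in_last ? PAGES_PER_SECTION - in_last : 0);
                return 0;
        }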
      
      Link: http://lkml.kernel.org/r/20191211163201.17179-2-david@redhat.com
      Fixes: f7f99100 ("mm: stop zeroing memory during allocation in vmemmap")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Tested-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Bob Picco <bob.picco@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: <stable@vger.kernel.org>	[4.15+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • mm: return zero_resv_unavail optimization · f19a50c1
      Pavel Tatashin authored
      [ Upstream commit ec393a0f ]
      
      When checking for valid pfns in zero_resv_unavail(), it is not necessary
      to verify that pfns within pageblock_nr_pages ranges are valid, only the
      first one needs to be checked.  This is because memory for struct pages
      is allocated in contiguous chunks that contain pageblock_nr_pages struct
      pages.
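
      A sketch of the resulting loop shape, with a stubbed pfn_valid() and
      user-space constants (x86-64 has 512 pages per 2MB pageblock):

        #include <stdbool.h>
        #include <stdio.h>

        #define PAGEBLOCK_NR_PAGES 512ul   /* 2MB blocks of 4KiB pages */

        /* Stub: pretend only even-numbered pageblocks have a memmap. */
        static bool pfn_valid(unsigned long pfn)
        {
                return ((pfn / PAGEBLOCK_NR_PAGES) & 1) == 0;
        }

        int main(void)
        {
                unsigned long pfn, zeroed = 0;
                unsigned long end = 4 * PAGEBLOCK_NR_PAGES;

                /* One pfn_valid() per pageblock: struct pages come in
                 * contiguous pageblock-sized chunks, so the first pfn
                 * speaks for the rest of the block. */
                for (pfn = 0; pfn < end; pfn += PAGEBLOCK_NR_PAGES) {
                        if (!pfn_valid(pfn))
                                continue;
                        zeroed += PAGEBLOCK_NR_PAGES; /* zero the chunk */
                }
                printf("zeroed %lu struct pages\n", zeroed);
                return 0;
        }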
      
      Link: http://lkml.kernel.org/r/20181002143821.5112-3-msys.mizuma@gmail.com
      Signed-off-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Reviewed-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • mm: zero remaining unavailable struct pages · 9ac5917a
      Naoya Horiguchi authored
      [ Upstream commit 907ec5fc ]
      
      Patch series "mm: Fix for movable_node boot option", v3.
      
      This patch series contains a fix for the movable_node boot option issue
      which was introduced by commit 124049de ("x86/e820: put !E820_TYPE_RAM
      regions into memblock.reserved").
      
      The commit breaks the option because it changed the memory gap range to
      reserved memblock.  So, the node is marked as a Normal zone even if the
      SRAT has hot-pluggable affinity.

      The first and second patches fix the original issue which the commit
      tried to fix; the third then reverts the commit.
      
      This patch (of 3):
      
      There is a kernel panic that is triggered when reading /proc/kpageflags on
      the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]':
      
        BUG: unable to handle kernel paging request at fffffffffffffffe
        PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0
        Oops: 0000 [#1] SMP PTI
        CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
        RIP: 0010:stable_page_flags+0x27/0x3c0
        Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7
        RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202
        RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0
        RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001
        R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0
        R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10
        FS:  00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0
        Call Trace:
         kpageflags_read+0xc7/0x120
         proc_reg_read+0x3c/0x60
         __vfs_read+0x36/0x170
         vfs_read+0x89/0x130
         ksys_pread64+0x71/0x90
         do_syscall_64+0x5b/0x160
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7efc42e75e23
        Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24
      
      According to kernel bisection, this problem became visible due to commit
      f7f99100 which changes how struct pages are initialized.
      
      Memblock layout affects the pfn ranges covered by node/zone.  Consider
      that we have a VM with 2 NUMA nodes and each node has 4GB memory, and the
      default (no memmap= given) memblock layout is like below:
      
        MEMBLOCK configuration:
         memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000
         memory.cnt  = 0x4
         memory[0x0]     [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
         memory[0x1]     [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
         memory[0x2]     [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0
         memory[0x3]     [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
         ...
      
      If you give memmap=1G!4G (so it just covers memory[0x2]),
      the range [0x100000000-0x13fffffff] is gone:
      
        MEMBLOCK configuration:
         memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000
         memory.cnt  = 0x3
         memory[0x0]     [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
         memory[0x1]     [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
         memory[0x2]     [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
         ...
      
      This causes node 0's pfn range to shrink because it is calculated from
      the address range of memblock.memory.  So some of the struct pages in
      the gap range are left uninitialized.
      
      We have a function zero_resv_unavail() which zeroes the struct pages
      outside memblock.memory, but currently it covers only the reserved
      unavailable range (i.e., memblock.reserved && !memblock.memory).  This
      patch extends it to cover all unavailable ranges, which fixes the
      reported issue.
      
      Link: http://lkml.kernel.org/r/20181002143821.5112-2-msys.mizuma@gmail.com
      Fixes: f7f99100 ("mm: stop zeroing memory during allocation in vmemmap")
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Tested-by: Oscar Salvador <osalvador@suse.de>
      Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • KVM: Play nice with read-only memslots when querying host page size · 21b70d9b
      Sean Christopherson authored
      [ Upstream commit 42cde48b ]
      
      Avoid the "writable" check in __gfn_to_hva_many(), which will always fail
      on read-only memslots due to gfn_to_hva() assuming writes.  Functionally,
      this allows x86 to create large mappings for read-only memslots that
      are backed by HugeTLB mappings.
      
      Note, the changelog for commit 05da4558 ("KVM: MMU: large page
      support") states "If the largepage contains write-protected pages, a
      large pte is not used.", but "write-protected" refers to pages that are
      temporarily read-only, e.g. read-only memslots didn't even exist at the
      time.
      
      Fixes: 4d8b81ab ("KVM: introduce readonly memslot")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      [Redone using kvm_vcpu_gfn_to_memslot_prot. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • KVM: Use vcpu-specific gva->hva translation when querying host page size · dabf1a10
      Sean Christopherson authored
      [ Upstream commit f9b84e19 ]
      
      Use kvm_vcpu_gfn_to_hva() when retrieving the host page size so that the
      correct set of memslots is used when handling x86 page faults in SMM.
      
      Fixes: 54bf36aa ("KVM: x86: use vcpu-specific functions to read/write/translate GFNs")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • KVM: nVMX: vmread should not set rflags to specify success in case of #PF · eb2c9541
      Miaohe Lin authored
      [ Upstream commit a4d956b9 ]
      
      In case writing to the vmread destination operand results in a #PF,
      vmread should not call nested_vmx_succeed() to set rflags to specify
      success, similar to what is done in VMPTRST (see handle_vmptrst()).
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • KVM: VMX: Add non-canonical check on writes to RTIT address MSRs · 57211b73
      Sean Christopherson authored
      [ Upstream commit fe6ed369 ]
      
      Reject writes to RTIT address MSRs if the data being written is a
      non-canonical address as the MSRs are subject to canonical checks, e.g.
      KVM will trigger an unchecked #GP when loading the values to hardware
      during pt_guest_enter().
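
      For reference, a canonical 48-bit virtual address is one whose bits
      63:48 are a sign-extension of bit 47. A hedged sketch of such a check
      (KVM's real helper also handles 57-bit LA57 paging):

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        /* Canonical for 48-bit virtual addresses: sign-extending from
         * bit 47 must reproduce the original value. */
        static bool is_canonical_48(uint64_t addr)
        {
                return (uint64_t)((int64_t)(addr << 16) >> 16) == addr;
        }

        int main(void)
        {
                printf("%d\n", is_canonical_48(0x00007fffffffffffull)); /* 1 */
                printf("%d\n", is_canonical_48(0xffff800000000000ull)); /* 1 */
                printf("%d\n", is_canonical_48(0x0001000000000000ull)); /* 0 */
                return 0;
        }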
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • KVM: x86: Use gpa_t for cr2/gpa to fix TDP support on 32-bit KVM · 9b376cb6
      Sean Christopherson authored
      [ Upstream commit 736c291c ]
      
      Convert a plethora of parameters and variables in the MMU and page fault
      flows from type gva_t to gpa_t to properly handle TDP on 32-bit KVM.
      
      Thanks to PSE and PAE paging, 32-bit kernels can access 64-bit physical
      addresses.  When TDP is enabled, the fault address is a guest physical
      address and thus can be a 64-bit value, even when both KVM and its guest
      are using 32-bit virtual addressing, e.g. VMX's VMCS.GUEST_PHYSICAL is a
      64-bit field, not a natural width field.
      
      Using a gva_t for the fault address means KVM will incorrectly drop the
      upper 32-bits of the GPA.  Ditto for gva_to_gpa() when it is used to
      translate L2 GPAs to L1 GPAs.
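
      The truncation is plain C narrowing, as the sketch below shows
      (illustrative typedefs: gva_t is an unsigned long, hence 32 bits on
      32-bit KVM, while gpa_t must always be 64 bits):

        #include <inttypes.h>
        #include <stdint.h>
        #include <stdio.h>

        typedef uint32_t gva_t;   /* unsigned long on a 32-bit kernel */
        typedef uint64_t gpa_t;   /* guest physical: always 64-bit */

        int main(void)
        {
                gpa_t fault = 0x1b3c00000ull; /* >4GB GPA, valid with PAE */
                gva_t cr2 = (gva_t)fault;     /* upper 32 bits dropped */

                printf("gpa=%#" PRIx64 " truncated=%#" PRIx32 "\n",
                       fault, cr2);
                return 0;
        }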
      
      Opportunistically rename variables and parameters to better reflect the
      dual address modes, e.g. use "cr2_or_gpa" for fault addresses and plain
      "addr" instead of "vaddr" when the address may be either a GVA or an L2
      GPA.  Similarly, use "gpa" in the nonpaging_page_fault() flows to avoid
      a confusing "gpa_t gva" declaration; this also sets the stage for a
      future patch to combine nonpaging_page_fault() and tdp_page_fault() with
      minimal churn.
      
      Sprinkle in a few comments to document flows where an address is known
      to be a GVA and thus can be safely truncated to a 32-bit value.  Add
      WARNs in kvm_handle_page_fault() and FNAME(gva_to_gpa_nested)() to help
      document such cases and detect bugs.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • KVM: x86/mmu: Apply max PA check for MMIO sptes to 32-bit KVM · c2e29d0f
      Sean Christopherson authored
      [ Upstream commit e30a7d62 ]
      
      Remove the bogus 64-bit only condition from the check that disables MMIO
      spte optimization when the system supports the max PA, i.e. doesn't have
      any reserved PA bits.  32-bit KVM always uses PAE paging for the shadow
      MMU, and per Intel's SDM:
      
        PAE paging translates 32-bit linear addresses to 52-bit physical
        addresses.
      
      The kernel's restrictions on max physical addresses are limits on how
      much memory the kernel can reasonably use, not what physical addresses
      are supported by hardware.
      
      Fixes: ce88decf ("KVM: MMU: mmio page fault support")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • btrfs: flush write bio if we loop in extent_write_cache_pages · 86047371
      Josef Bacik authored
      [ Upstream commit 96bf313ecb33567af4cb53928b0c951254a02759 ]
      
      There exists a deadlock with range_cyclic that has existed forever.  If
      we loop around with a bio already built we could deadlock with a writer
      who has the page locked that we're attempting to write, but who is
      waiting on a page in our bio to be written out.  The task traces are as
      follows:
      
        PID: 1329874  TASK: ffff889ebcdf3800  CPU: 33  COMMAND: "kworker/u113:5"
         #0 [ffffc900297bb658] __schedule at ffffffff81a4c33f
         #1 [ffffc900297bb6e0] schedule at ffffffff81a4c6e3
         #2 [ffffc900297bb6f8] io_schedule at ffffffff81a4ca42
         #3 [ffffc900297bb708] __lock_page at ffffffff811f145b
         #4 [ffffc900297bb798] __process_pages_contig at ffffffff814bc502
         #5 [ffffc900297bb8c8] lock_delalloc_pages at ffffffff814bc684
         #6 [ffffc900297bb900] find_lock_delalloc_range at ffffffff814be9ff
         #7 [ffffc900297bb9a0] writepage_delalloc at ffffffff814bebd0
         #8 [ffffc900297bba18] __extent_writepage at ffffffff814bfbf2
         #9 [ffffc900297bba98] extent_write_cache_pages at ffffffff814bffbd
      
        PID: 2167901  TASK: ffff889dc6a59c00  CPU: 14  COMMAND:
        "aio-dio-invalid"
         #0 [ffffc9003b50bb18] __schedule at ffffffff81a4c33f
         #1 [ffffc9003b50bba0] schedule at ffffffff81a4c6e3
         #2 [ffffc9003b50bbb8] io_schedule at ffffffff81a4ca42
         #3 [ffffc9003b50bbc8] wait_on_page_bit at ffffffff811f24d6
         #4 [ffffc9003b50bc60] prepare_pages at ffffffff814b05a7
         #5 [ffffc9003b50bcd8] btrfs_buffered_write at ffffffff814b1359
         #6 [ffffc9003b50bdb0] btrfs_file_write_iter at ffffffff814b5933
         #7 [ffffc9003b50be38] new_sync_write at ffffffff8128f6a8
         #8 [ffffc9003b50bec8] vfs_write at ffffffff81292b9d
         #9 [ffffc9003b50bf00] ksys_pwrite64 at ffffffff81293032
      
      I used drgn to find the respective pages we were stuck on:
      
      page_entry.page 0xffffea00fbfc7500 index 8148 bit 15 pid 2167901
      page_entry.page 0xffffea00f9bb7400 index 7680 bit 0 pid 1329874
      
      As you can see the kworker is waiting for bit 0 (PG_locked) on index
      7680, and aio-dio-invalid is waiting for bit 15 (PG_writeback) on index
      8148.  aio-dio-invalid holds 7680 locked, and the kworker's epd looks
      like the following:
      
        crash> struct extent_page_data ffffc900297bbbb0
        struct extent_page_data {
          bio = 0xffff889f747ed830,
          tree = 0xffff889eed6ba448,
          extent_locked = 0,
          sync_io = 0
        }
      
      Probably worth mentioning as well that it waits for writeback of the
      page to complete while holding a lock on it (at prepare_pages()).
      
      Using drgn I walked the bio pages looking for page 0xffffea00fbfc7500,
      which is the one we're waiting for writeback on:
      
        bio = Object(prog, 'struct bio', address=0xffff889f747ed830)
        for i in range(0, bio.bi_vcnt.value_()):
            bv = bio.bi_io_vec[i]
            if bv.bv_page.value_() == 0xffffea00fbfc7500:
                print("FOUND IT")
      
      which validated what I suspected.
      
      The fix for this is simple: flush the epd before we loop back around to
      the beginning of the file during writeout.
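
      The shape of the fix as a hedged sketch, with hypothetical helpers
      rather than the btrfs code: never carry a half-built bio across the wrap
      back to index 0, because a page already held in that bio may be exactly
      what the holder of the next page is waiting on.

        #include <stdio.h>

        struct epd { int bio_pages; };    /* pending, unsubmitted bio */

        static void flush_write_bio(struct epd *epd)
        {
                printf("submit bio with %d pages\n", epd->bio_pages);
                epd->bio_pages = 0;
        }

        /* Accumulate pages [start, end) into the pending bio. */
        static void write_range(struct epd *epd, long start, long end)
        {
                epd->bio_pages += (int)(end - start);
        }

        /* range_cyclic: start mid-file, wrap around to the beginning. */
        static void write_cache_pages_cyclic(struct epd *epd,
                                             long start, long len)
        {
                write_range(epd, start, start + len);
                if (start > 0) {
                        flush_write_bio(epd); /* the fix: flush first */
                        write_range(epd, 0, start);
                }
        }

        int main(void)
        {
                struct epd epd = { 0 };

                write_cache_pages_cyclic(&epd, 100, 28);
                flush_write_bio(&epd);
                return 0;
        }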
      
      Fixes: b293f02e ("Btrfs: Add writepages support")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • drm/dp_mst: Remove VCPI while disabling topology mgr · 4ecba33e
      Wayne Lin authored
      [ Upstream commit 64e62bdf ]
      
      [Why]
      
      This patch is trying to address the issue observed when hotplugging DP
      daisy-chained monitors.
      
      e.g.
      src-mstb-mstb-sst -> src (unplug) mstb-mstb-sst -> src-mstb-mstb-sst
      (plug in again)
      
      Once a DP MST capable device is unplugged, the driver calls
      drm_dp_mst_topology_mgr_set_mst() to disable MST. In this function,
      it cleans the data of the topology manager while disabling mst_state.
      However, it doesn't clean up the proposed_vcpis of the topology manager.
      If proposed_vcpi is not reset, then once MST daisy-chained monitors are
      plugged in again later, the code will fail at port validation while
      trying to allocate payloads.
      
      When an MST capable device is plugged in again and payloads are
      allocated by calling drm_dp_update_payload_part1(), this function
      iterates over all proposed virtual channels to see if any proposed
      VCPI's num_slots is greater than 0. If any proposed VCPI's num_slots is
      greater than 0 and the port which the specific virtual channel is
      directed to is not in the topology, the code then fails at the port
      validation. Since there are stale VCPI allocations from the previous
      topology enablement in proposed_vcpi[], the code will fail at port
      validation and return EINVAL.
      
      [How]
      
      Clean up the data of stale proposed_vcpi[] and reset mgr->proposed_vcpis
      to NULL while disabling mst in drm_dp_mst_topology_mgr_set_mst().
      
      Changes since v1:
      * Add more details in the commit message to describe the issue which the
        patch is trying to fix.
      Signed-off-by: Wayne Lin <Wayne.Lin@amd.com>
      [added cc to stable]
      Signed-off-by: Lyude Paul <lyude@redhat.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20191205090043.7580-1-Wayne.Lin@amd.com
      Cc: <stable@vger.kernel.org> # v3.17+
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • drm: atmel-hlcdc: enable clock before configuring timing engine · 1f1611dc
      Claudiu Beznea authored
      [ Upstream commit 2c1fb9d8 ]
      
      Changing the pixel clock source without having this clock source enabled
      will block the timing engine, and the operations after that (in this
      case, setting the ATMEL_HLCDC_CFG(5) settings in
      atmel_hlcdc_crtc_mode_set_nofb()) will fail. It is recommended (although
      not stated in the datasheet) to actually enable the pixel clock source
      before making any changes to the timing engine (only the SAM9X60
      datasheet specifies that the peripheral clock and pixel clock must be
      enabled before using the LCD controller).
      
      Fixes: 1a396789 ("drm: add Atmel HLCDC Display Controller support")
      Signed-off-by: Claudiu Beznea <claudiu.beznea@microchip.com>
      Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
      Cc: Boris Brezillon <boris.brezillon@free-electrons.com>
      Cc: <stable@vger.kernel.org> # v4.0+
      Link: https://patchwork.freedesktop.org/patch/msgid/1576672109-22707-3-git-send-email-claudiu.beznea@microchip.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • btrfs: free block groups after free'ing fs trees · 159db2ae
      Josef Bacik authored
      [ Upstream commit 4e19443d ]
      
      Sometimes when running generic/475 we would trip the
      WARN_ON(cache->reserved) check when freeing the block groups on umount.
      This is because sometimes we don't commit the transaction because of IO
      errors and thus do not clean up the tree logs until umount time.
      
      These blocks are still reserved until they are cleaned up, but they
      aren't cleaned up until _after_ we do the free block groups work.  Fix
      this by moving the free after freeing the fs roots, that way all of the
      tree logs are cleaned up and we have a properly cleaned fs.  A bunch of
      loops of generic/475 confirmed this fixes the problem.
      
      CC: stable@vger.kernel.org # 4.9+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • btrfs: use bool argument in free_root_pointers() · 381a16fa
      Anand Jain authored
      [ Upstream commit 4273eaff ]
      
      We don't need an int argument here, a bool will do in
      free_root_pointers().  Also rename the argument, as it confused two
      people.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • ext4: fix deadlock allocating crypto bounce page from mempool · 987bb7a3
      Eric Biggers authored
      [ Upstream commit 547c556f ]
      
      ext4_writepages() on an encrypted file has to encrypt the data, but it
      can't modify the pagecache pages in-place, so it encrypts the data into
      bounce pages and writes those instead.  All bounce pages are allocated
      from a mempool using GFP_NOFS.
      
      This is not correct use of a mempool, and it can deadlock.  This is
      because GFP_NOFS includes __GFP_DIRECT_RECLAIM, which enables the "never
      fail" mode for mempool_alloc() where a failed allocation will fall back
      to waiting for one of the preallocated elements in the pool.
      
      But since this mode is used for all of a bio's pages and not just the
      first, it can deadlock waiting for pages already in the bio to be freed.
      
      This deadlock can be reproduced by patching mempool_alloc() to pretend
      that pool->alloc() always fails (so that it always falls back to the
      preallocations), and then creating an encrypted file of size > 128 KiB.
      
      Fix it by only using GFP_NOFS for the first page in the bio.  For
      subsequent pages just use GFP_NOWAIT, and if any of those fail, just
      submit the bio and start a new one.
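
      A toy model of that allocation policy with a one-element pool (none of
      these helpers are the real fscrypt/ext4 functions): only the bio's first
      page may wait on the mempool; later pages fail fast and trigger a
      submit.

        #include <stdio.h>
        #include <stdlib.h>

        static int pool_free = 1;          /* one preallocated element */

        static void *alloc_can_wait(void)  /* models GFP_NOFS */
        {
                pool_free = 0;             /* may "wait" for the pool */
                return malloc(4096);
        }

        static void *alloc_nowait(void)    /* models GFP_NOWAIT */
        {
                return pool_free ? alloc_can_wait() : NULL;
        }

        struct bio { int nr_pages; };

        static void submit_bio(struct bio *b)
        {
                printf("submit bio with %d pages\n", b->nr_pages);
                b->nr_pages = 0;
                pool_free = 1;             /* completion frees pages */
        }

        static void *alloc_bounce_page(struct bio *b)
        {
                void *p;

                if (b->nr_pages == 0)      /* first page: bio is empty,
                                              so waiting cannot deadlock */
                        return alloc_can_wait();

                p = alloc_nowait();
                if (!p) {                  /* never wait on our own pages */
                        submit_bio(b);
                        p = alloc_can_wait();
                }
                return p;
        }

        int main(void)
        {
                struct bio b = { 0 };

                for (int i = 0; i < 3; i++) {
                        free(alloc_bounce_page(&b));
                        b.nr_pages++;
                }
                submit_bio(&b);
                return 0;
        }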
      
      This will need to be fixed in f2fs too, but that's less straightforward.
      
      Fixes: c9af28fd ("ext4 crypto: don't let data integrity writebacks fail with ENOMEM")
      Cc: stable@vger.kernel.org
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Link: https://lore.kernel.org/r/20191231181149.47619-1-ebiggers@kernel.org
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net: dsa: b53: Always use dev->vlan_enabled in b53_configure_vlan() · 25a1729e
      Florian Fainelli authored
      [ Upstream commit df373702 ]
      
      b53_configure_vlan() is called by the bcm_sf2 driver upon setup and
      indirectly through resume as well. During the initial setup, we are
      guaranteed that dev->vlan_enabled is false, so there is no change in
      behavior; however, across a suspend/resume cycle, we may have enabled
      VLANs before, so we do want to restore that setting.
      
      Fixes: dad8d7c6 ("net: dsa: b53: Properly account for VLAN filtering")
      Fixes: 967dd82f ("net: dsa: b53: Add support for Broadcom RoboSwitch")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: macb: Limit maximum GEM TX length in TSO · 62e5f512
      Harini Katakam authored
      [ Upstream commit f822e9c4 ]
      
      GEM_MAX_TX_LEN currently resolves to 0x3FF8 for any IP version supporting
      TSO with the full 14 bits of the length field in the payload descriptor.
      But an IP erratum causes a false amba_error (bit 6 of ISR) when the
      length in the payload descriptors is specified above 16387. The error
      occurs because the DMA falsely concludes that there is not enough space
      in SRAM for the incoming payload. These errors were observed continuously
      under stress of large packets using iperf on a version where SRAM was 16K
      for each queue. This erratum will be documented shortly and affects all
      versions since TSO functionality was added. Hence limit the max length to
      0x3FC0 (rounded).
      Signed-off-by: Harini Katakam <harini.katakam@xilinx.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>