1. 06 Feb, 2017 40 commits
    • MIPS: Malta: Fix IOCU disable switch read for MIPS64 · bc9f83ea
      Paul Burton authored
      commit 305723ab upstream.
      
      Malta boards used with CPU emulators feature a switch to disable use of
      an IOCU. Software has to check this switch & ignore any present IOCU if
      the switch is closed. The read used to do this was unsafe for 64 bit
      kernels, as it simply cast the address 0xbf403000 to a pointer &
      dereferenced it. Whilst in a 32 bit kernel this would access kseg1, in a
      64 bit kernel this attempts to access xuseg & results in an address
      error exception.
      
      Fix by accessing a correctly formed ckseg1 address generated using the
      CKSEG1ADDR macro.
      
      Whilst modifying this code, define the name of the register and the bit
      we care about within it, which indicates whether PCI DMA is routed to
      the IOCU or straight to DRAM. The code previously checked that bit 0 was
      also set, but the least significant 7 bits of the CONFIG_GEN0 register
      contain the value of the MReqInfo signal provided to the IOCU OCP bus,
      so singling out bit 0 makes little sense & that part of the check is
      dropped.
      Signed-off-by: Paul Burton <paul.burton@imgtec.com>
      Fixes: b6d92b4a ("MIPS: Add option to disable software I/O coherency.")
      Cc: Matt Redfearn <matt.redfearn@imgtec.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: linux-mips@linux-mips.org
      Cc: linux-kernel@vger.kernel.org
      Patchwork: https://patchwork.linux-mips.org/patch/14187/
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      bc9f83ea
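
The address problem described in the MIPS commit above can be modelled in plain userspace C. This is only a sketch: the helper names (`naive_cast`, `ckseg1addr_64`, `in_xuseg`) and the `XUSEG_END` bound are assumptions made for illustration, not the kernel's definitions. It shows that zero-extending the 32-bit constant lands in the low (user) part of the 64-bit address space, while a sign-extended ckseg1 address does not.

```c
#include <assert.h>
#include <stdint.h>

/* Constants local to this sketch (names are hypothetical). */
#define PHYS_MASK      UINT64_C(0x1fffffff)         /* low 512 MB reachable through kseg1 */
#define CKSEG1_BASE_64 UINT64_C(0xffffffffa0000000) /* 0xa0000000 sign-extended to 64 bits */
#define XUSEG_END      UINT64_C(0x4000000000000000) /* assumed first address past xuseg */

/* What casting the 32-bit constant to a 64-bit pointer amounts to: zero extension. */
static inline uint64_t naive_cast(uint32_t kseg1_addr)
{
    return (uint64_t)kseg1_addr;
}

/* Build a well-formed 64-bit ckseg1 address from the same constant. */
static inline uint64_t ckseg1addr_64(uint64_t addr)
{
    return (addr & PHYS_MASK) | CKSEG1_BASE_64;
}

/* Non-zero if the virtual address falls in the user segment. */
static inline int in_xuseg(uint64_t vaddr)
{
    return vaddr < XUSEG_END;
}
```

Calling `naive_cast(0xbf403000u)` gives an address inside xuseg, whereas `ckseg1addr_64(0xbf403000u)` gives an address in the uncached kernel segment.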
    • arm64: debug: avoid resetting stepping state machine when TIF_SINGLESTEP · 83099f39
      Will Deacon authored
      commit 3a402a70 upstream.
      
      When TIF_SINGLESTEP is set for a task, the single-step state machine is
      enabled and we must take care not to reset it to the active-not-pending
      state if it is already in the active-pending state.
      
      Unfortunately, that's exactly what user_enable_single_step does, by
      unconditionally setting the SS bit in the SPSR for the current task.
      This causes failures in the GDB testsuite, where GDB ends up missing
      expected step traps if the instruction being stepped generates another
      trap, e.g. PTRACE_EVENT_FORK from an SVC instruction.
      
      This patch fixes the problem by preserving the current state of the
      stepping state machine when TIF_SINGLESTEP is set on the current thread.
      
      Cc: <stable@vger.kernel.org>
      Reported-by: Yao Qi <yao.qi@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      83099f39
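
The single-step fix above can be illustrated with a small model. This is a sketch under stated assumptions, not the kernel code: `struct task_model` and both function names are hypothetical, and the only architectural fact used is that the SS bit sits at bit 21 of the SPSR. With stepping enabled, SS set means active-not-pending and SS clear means active-pending.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SPSR_SS (UINT32_C(1) << 21) /* software step bit in the saved program status */

/* Hypothetical stand-in for a task's thread flag and saved SPSR. */
struct task_model {
    bool tif_singlestep;
    uint32_t spsr;
};

/* Behaviour described as buggy: SS is set unconditionally. */
static inline void enable_single_step_old(struct task_model *t)
{
    t->tif_singlestep = true;
    t->spsr |= SPSR_SS;
}

/* Behaviour described as the fix: arm SS only when stepping was not already enabled. */
static inline void enable_single_step_fixed(struct task_model *t)
{
    bool was_enabled = t->tif_singlestep;

    t->tif_singlestep = true;
    if (!was_enabled)
        t->spsr |= SPSR_SS;
}
```

For a task that is already stepping and in the active-pending state (flag set, SS clear), the old variant moves it back to active-not-pending while the fixed variant leaves the state alone.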
    • arm64: spinlocks: implement smp_mb__before_spinlock() as smp_mb() · d65df517
      Will Deacon authored
      commit 872c63fb upstream.
      
      smp_mb__before_spinlock() is intended to upgrade a spin_lock() operation
      to a full barrier, such that prior stores are ordered with respect to
      loads and stores occurring inside the critical section.
      
      Unfortunately, the core code defines the barrier as smp_wmb(), which
      is insufficient to provide the required ordering guarantees when used in
      conjunction with our load-acquire-based spinlock implementation.
      
      This patch overrides the arm64 definition of smp_mb__before_spinlock()
      to map to a full smp_mb().
      
      Cc: Peter Zijlstra <peterz@infradead.org>
      Reported-by: Alan Stern <stern@rowland.harvard.edu>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      d65df517
    • arm64: Define AT_VECTOR_SIZE_ARCH for ARCH_DLINFO · 1fd5c7b6
      James Hogan authored
      commit 3146bc64 upstream.
      
      AT_VECTOR_SIZE_ARCH should be defined with the maximum number of
      NEW_AUX_ENT entries that ARCH_DLINFO can contain, but it wasn't defined
      for arm64 at all even though ARCH_DLINFO will contain one NEW_AUX_ENT
      for the VDSO address.
      
      This shouldn't be a problem as AT_VECTOR_SIZE_BASE includes space for
      AT_BASE_PLATFORM which arm64 doesn't use, but let's define it now and add
      the comment above ARCH_DLINFO as found in several other architectures to
      remind future modifiers of ARCH_DLINFO to keep AT_VECTOR_SIZE_ARCH up to
      date.
      
      Fixes: f668cd16 ("arm64: ELF definitions")
      Signed-off-by: James Hogan <james.hogan@imgtec.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      1fd5c7b6
    • arm64: avoid returning from bad_mode · e5471def
      Mark Rutland authored
      commit 7d9e8f71 upstream.
      
      Generally, taking an unexpected exception should be a fatal event, and
      bad_mode is intended to cater for this. However, it should be possible
      to contain unexpected synchronous exceptions from EL0 without bringing
      the kernel down, by sending a SIGILL to the task.
      
      We tried to apply this approach in commit 9955ac47 ("arm64:
      don't kill the kernel on a bad esr from el0"), by sending a signal for
      any bad_mode call resulting from an EL0 exception.
      
      However, this also applies to other unexpected exceptions, such as
      SError and FIQ. The entry paths for these exceptions branch to bad_mode
      without configuring the link register, and have no kernel_exit. Thus, if
      we take one of these exceptions from EL0, bad_mode will eventually
      return to the original user link register value.
      
      This patch fixes this by introducing a new bad_el0_sync handler to cater
      for the recoverable case, and restoring bad_mode to its original state,
      whereby it calls panic() and never returns. The recoverable case
      branches to bad_el0_sync with a bl, and returns to userspace via the
      usual ret_to_user mechanism.
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Fixes: 9955ac47 ("arm64: don't kill the kernel on a bad esr from el0")
      Reported-by: Mark Salter <msalter@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      e5471def
    • ARM: sa1111: fix pcmcia suspend/resume · eee1bdb5
      Russell King authored
      commit 06dfe5cc upstream.
      
      SA1111 PCMCIA was broken when PCMCIA switched to using dev_pm_ops for
      the PCMCIA socket class.  PCMCIA used to handle suspend/resume via the
      socket hosting device, which happened at normal device suspend/resume
      time.
      
      However, the referenced commit changed this: much of the resume now
      happens much earlier, in the noirq resume handler of dev_pm_ops.
      
      However, on SA1111, the PCMCIA device is not accessible as the SA1111
      has not been resumed at _noirq time.  It's slightly worse than that,
      because the SA1111 has already been put to sleep at _noirq time, so
      suspend doesn't work properly.
      
      Fix this by converting the core SA1111 code to use dev_pm_ops as well,
      and performing its own suspend/resume at noirq time.
      
      This fixes these errors in the kernel log:
      
      pcmcia_socket pcmcia_socket0: time out after reset
      pcmcia_socket pcmcia_socket1: time out after reset
      
      and the resulting lack of PCMCIA cards after a S2RAM cycle.
      
      Fixes: d7646f76 ("pcmcia: use dev_pm_ops for class pcmcia_socket_class")
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      eee1bdb5
    • ARM: sa1100: clear reset status prior to reboot · 5b4918cc
      Russell King authored
      commit da60626e upstream.
      
      Clear the current reset status prior to rebooting the platform.  This
      adds the bit missing from 04fef228 ("[ARM] pxa: introduce
      reset_status and clear_reset_status for driver's usage").
      
      Fixes: 04fef228 ("[ARM] pxa: introduce reset_status and clear_reset_status for driver's usage")
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      5b4918cc
    • ARM: 8618/1: decompressor: reset ttbcr fields to use TTBR0 on ARMv7 · 1774ca81
      Srinivas Ramana authored
      commit 117e5e9c upstream.
      
      If the bootloader uses the long descriptor format and jumps to
      kernel decompressor code, TTBCR may not be in the right state.
      Before enabling the MMU, it is required to clear the TTBCR.PD0
      field to use TTBR0 for translation table walks.
      
      The commit dbece458 ("ARM: 7501/1: decompressor:
      reset ttbcr for VMSA ARMv7 cores") does the reset of TTBCR.N, but
      doesn't consider all the bits for the size of TTBCR.N.
      
      Clear TTBCR.PD0 field and reset all the three bits of TTBCR.N to
      indicate the use of TTBR0 and the correct base address width.
      
      Fixes: dbece458 ("ARM: 7501/1: decompressor: reset ttbcr for VMSA ARMv7 cores")
      Acked-by: Robin Murphy <robin.murphy@arm.com>
      Signed-off-by: Srinivas Ramana <sramana@codeaurora.org>
      Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      1774ca81
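
The TTBCR reset described above comes down to a bit mask. The sketch below uses the ARMv7 short-descriptor layout (N in bits [2:0], PD0 in bit 4); the function names are hypothetical, and the "old" variant assumes the earlier code cleared only the low two bits of N, which is an inference from the commit text rather than something shown on this page.

```c
#include <assert.h>
#include <stdint.h>

#define TTBCR_N_MASK (UINT32_C(7) << 0) /* all three bits of TTBCR.N */
#define TTBCR_PD0    (UINT32_C(1) << 4) /* disables table walks through TTBR0 when set */

/* Assumed earlier behaviour: only two of the three N bits are cleared. */
static inline uint32_t ttbcr_reset_old(uint32_t ttbcr)
{
    return ttbcr & ~(UINT32_C(3) << 0);
}

/* Behaviour described by the fix: clear all of N and PD0 so TTBR0 is used. */
static inline uint32_t ttbcr_reset_new(uint32_t ttbcr)
{
    return ttbcr & ~(TTBCR_N_MASK | TTBCR_PD0);
}
```

With a bootloader-provided value that has N = 7 and PD0 set, the old mask leaves N bit 2 and PD0 behind, while the new mask clears both fields and leaves unrelated bits untouched.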
    • ARM: 8616/1: dt: Respect property size when parsing CPUs · 88654a15
      Robin Murphy authored
      commit ba6dea4f upstream.
      
      Whilst MPIDR values themselves are less than 32 bits, it is still
      perfectly valid for a DT to have #address-cells > 1 in the CPUs node,
      resulting in the "reg" property having leading zero cell(s). In that
      situation, the big-endian nature of the data conspires with the current
      behaviour of only reading the first cell to cause the kernel to think
      all CPUs have ID 0, and become resoundingly unhappy as a consequence.
      
      Take the full property length into account when parsing CPUs so as to
      be correct under any circumstances.
      
      Cc: Russell King <linux@armlinux.org.uk>
      Signed-off-by: Robin Murphy <robin.murphy@arm.com>
      Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      88654a15
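
The effect of reading only the first cell of a multi-cell "reg" property can be shown with a short sketch. The helper names below are hypothetical and the property is modelled as a raw big-endian byte array; the kernel's own parsing helpers are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one 32-bit big-endian cell. */
static inline uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Behaviour described as buggy: only the first cell is considered. */
static inline uint32_t cpu_id_first_cell(const uint8_t *reg)
{
    return read_be32(reg);
}

/* Behaviour described as the fix: fold every cell, most significant first.
 * Assumes ncells is 1 or 2, so the result fits in 64 bits. */
static inline uint64_t cpu_id_all_cells(const uint8_t *reg, size_t ncells)
{
    uint64_t value = 0;

    for (size_t i = 0; i < ncells; i++)
        value = (value << 32) | read_be32(reg + 4 * i);
    return value;
}
```

With #address-cells = 2 and a "reg" of `<0x0 0x102>`, the first-cell read returns 0 for every CPU, while folding both cells recovers 0x102.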
    • iommu/amd: Free domain id when free a domain of struct dma_ops_domain · 7a6111b8
      Baoquan He authored
      commit c3db901c upstream.
      
      The current code fails to free the domain id when freeing a domain of
      struct dma_ops_domain.
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Fixes: ec487d1a ('x86, AMD IOMMU: add domain allocation and deallocation functions')
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      7a6111b8
    • iommu/amd: Update Alias-DTE in update_device_table() · 56eb0df4
      Joerg Roedel authored
      commit 3254de6b upstream.
      
      Not doing so might cause IO-Page-Faults when a device uses
      an alias request-id and the alias-dte is left in a lower
      page-mode which does not cover the address allocated from
      the iova-allocator.
      
      Fixes: 492667da ('x86/amd-iommu: Remove amd_iommu_pd_table')
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      56eb0df4
    • x86/um: reuse asm-generic/barrier.h · 69c373d8
      Michael S. Tsirkin authored
      commit 577f183a upstream.
      
      On x86/um CONFIG_SMP is never defined.  As a result, several macros
      match the asm-generic variant exactly. Drop the local definitions and
      pull in asm-generic/barrier.h instead.
      
      This is in preparation for refactoring this code area.
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Richard Weinberger <richard@nod.at>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      69c373d8
    • x86/build: Build compressed x86 kernels as PIE · 186c5f34
      H.J. Lu authored
      commit 6d92bc9d upstream.
      
      The 32-bit x86 assembler in binutils 2.26 will generate R_386_GOT32X
      relocation to get the symbol address in PIC.  When the compressed x86
      kernel isn't built as PIC, the linker optimizes R_386_GOT32X relocations
      to their fixed symbol addresses.  However, when the compressed x86
      kernel is loaded at a different address, it leads to the following
      load failure:
      
        Failed to allocate space for phdrs
      
      during the decompression stage.
      
      If the compressed x86 kernel is relocatable at run-time, it should be
      compiled with -fPIE, instead of -fPIC, if possible and should be built as
      Position Independent Executable (PIE) so that linker won't optimize
      R_386_GOT32X relocation to its fixed symbol address.
      
      Older linkers generate R_386_32 relocations against locally defined
      symbols, _bss, _ebss, _got and _egot, in PIE.  It isn't wrong, just less
      optimal than R_386_RELATIVE.  But the x86 kernel fails to properly handle
      R_386_32 relocations when relocating the kernel.  To generate
      R_386_RELATIVE relocations, we mark _bss, _ebss, _got and _egot as
      hidden in both 32-bit and 64-bit x86 kernels.
      
      To build a 64-bit compressed x86 kernel as PIE, we need to disable the
      relocation overflow check to avoid relocation overflow errors. We do
      this with a new linker command-line option, -z noreloc-overflow, which
      got added recently:
      
       commit 4c10bbaa0912742322f10d9d5bb630ba4e15dfa7
       Author: H.J. Lu <hjl.tools@gmail.com>
       Date:   Tue Mar 15 11:07:06 2016 -0700
      
          Add -z noreloc-overflow option to x86-64 ld
      
          Add -z noreloc-overflow command-line option to the x86-64 ELF linker to
          disable relocation overflow check.  This can be used to avoid relocation
          overflow check if there will be no dynamic relocation overflow at
          run-time.
      
      The 64-bit compressed x86 kernel is built as PIE only if the linker supports
      -z noreloc-overflow.  So far 64-bit relocatable compressed x86 kernel
      boots fine even when it is built as a normal executable.
      Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      [ Edited the changelog and comments. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      186c5f34
    • x86/paravirt: Do not trace _paravirt_ident_*() functions · 6523fa8c
      Steven Rostedt authored
      commit 15301a57 upstream.
      
      Łukasz Daniluk reported that, on a RHEL kernel, his machine would lock up
      after enabling the function tracer. I asked him to bisect the functions within
      available_filter_functions, which he did and it came down to three:
      
        _paravirt_nop(), _paravirt_ident_32() and _paravirt_ident_64()
      
      It was found that this is only an issue when noreplace-paravirt is added
      to the kernel command line.
      
      This means that those functions are most likely called within critical
      sections of the function tracer, and must not be traced.
      
      In newer kernels _paravirt_nop() is defined within gcc asm(), and is no
      longer an issue.  But both _paravirt_ident_{32,64}() cause the
      following splat when they are traced:
      
       mm/pgtable-generic.c:33: bad pmd ffff8800d2435150(0000000001d00054)
       mm/pgtable-generic.c:33: bad pmd ffff8800d3624190(0000000001d00070)
       mm/pgtable-generic.c:33: bad pmd ffff8800d36a5110(0000000001d00054)
       mm/pgtable-generic.c:33: bad pmd ffff880118eb1450(0000000001d00054)
       NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [systemd-journal:469]
       Modules linked in: e1000e
       CPU: 2 PID: 469 Comm: systemd-journal Not tainted 4.6.0-rc4-test+ #513
       Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
       task: ffff880118f740c0 ti: ffff8800d4aec000 task.ti: ffff8800d4aec000
       RIP: 0010:[<ffffffff81134148>]  [<ffffffff81134148>] queued_spin_lock_slowpath+0x118/0x1a0
       RSP: 0018:ffff8800d4aefb90  EFLAGS: 00000246
       RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88011eb16d40
       RDX: ffffffff82485760 RSI: 000000001f288820 RDI: ffffea0000008030
       RBP: ffff8800d4aefb90 R08: 00000000000c0000 R09: 0000000000000000
       R10: ffffffff821c8e0e R11: 0000000000000000 R12: ffff880000200fb8
       R13: 00007f7a4e3f7000 R14: ffffea000303f600 R15: ffff8800d4b562e0
       FS:  00007f7a4e3d7840(0000) GS:ffff88011eb00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f7a4e3f7000 CR3: 00000000d3e71000 CR4: 00000000001406e0
       Call Trace:
         _raw_spin_lock+0x27/0x30
         handle_pte_fault+0x13db/0x16b0
         handle_mm_fault+0x312/0x670
         __do_page_fault+0x1b1/0x4e0
         do_page_fault+0x22/0x30
         page_fault+0x28/0x30
         __vfs_read+0x28/0xe0
         vfs_read+0x86/0x130
         SyS_read+0x46/0xa0
         entry_SYSCALL_64_fastpath+0x1e/0xa8
       Code: 12 48 c1 ea 0c 83 e8 01 83 e2 30 48 98 48 81 c2 40 6d 01 00 48 03 14 c5 80 6a 5d 82 48 89 0a 8b 41 08 85 c0 75 09 f3 90 8b 41 08 <85> c0 74 f7 4c 8b 09 4d 85 c9 74 08 41 0f 18 09 eb 02 f3 90 8b
      Reported-by: Łukasz Daniluk <lukasz.daniluk@intel.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      6523fa8c
    • x86/mm/pat, /dev/mem: Remove superfluous error message · 1eae225e
      Jiri Kosina authored
      commit 39380b80 upstream.
      
      Currently it's possible for broken (or malicious) userspace to flood a
      kernel log indefinitely with messages à la
      
      	Program dmidecode tried to access /dev/mem between f0000->100000
      
      because range_is_allowed(), in case of CONFIG_STRICT_DEVMEM being turned on,
      dumps this information each and every time devmem_is_allowed() fails.
      
      Reportedly userspace that is able to trigger a continuous flow of these
      messages exists.
      
      It would be possible to rate limit this message, but that'd have a
      questionable value; the administrator wouldn't get information about all
      the failing accesses, so then the information would be both superfluous
      and incomplete at the same time :)
      
      Returning EPERM (which is what is actually happening) is enough indication
      for userspace what has happened; no need to log this particular error as
      some sort of special condition.
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Link: http://lkml.kernel.org/r/alpine.LNX.2.00.1607081137020.24757@cbobk.fhfr.pm
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      1eae225e
    • x86/apic: Do not init irq remapping if ioapic is disabled · 928a2775
      Wanpeng Li authored
      commit 2e63ad4b upstream.
      
      native_smp_prepare_cpus
        -> default_setup_apic_routing
          -> enable_IR_x2apic
            -> irq_remapping_prepare
              -> intel_prepare_irq_remapping
                -> intel_setup_irq_remapping
      
      So the IR table is set up even if the "noapic" boot parameter is added. As a
      result we crash later, when the interrupt affinity is set, due to a
      half-initialized remapping infrastructure.
      
      Prevent remap initialization when IOAPIC is disabled.
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Joerg Roedel <joro@8bytes.org>
      Link: http://lkml.kernel.org/r/1471954039-3942-1-git-send-email-wanpeng.li@hotmail.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      928a2775
    • x86/mm: Disable preemption during CR3 read+write · b591901e
      Sebastian Andrzej Siewior authored
      commit 5cf0791d upstream.
      
      There's a subtle preemption race on UP kernels:
      
      Usually current->mm (and therefore mm->pgd) stays the same during the
      lifetime of a task so it does not matter if a task gets preempted during
      the read and write of the CR3.
      
      But then, there is this scenario on x86-UP:
      
      TaskA is in do_exit() and exit_mm() sets current->mm = NULL followed by:
      
       -> mmput()
       -> exit_mmap()
       -> tlb_finish_mmu()
       -> tlb_flush_mmu()
       -> tlb_flush_mmu_tlbonly()
       -> tlb_flush()
       -> flush_tlb_mm_range()
       -> __flush_tlb_up()
       -> __flush_tlb()
       ->  __native_flush_tlb()
      
      At this point current->mm is NULL but current->active_mm still points to
      the "old" mm.
      
      Let's preempt taskA _after_ native_read_cr3() by taskB. TaskB has its
      own mm so CR3 has changed.
      
      Now preempt back to taskA. TaskA has no ->mm set so it borrows taskB's
      mm and so CR3 remains unchanged. Once taskA gets active it continues
      where it was interrupted and that means it writes its old CR3 value
      back. Everything is fine because userland won't need its memory
      anymore.
      
      Now the fun part:
      
      Let's preempt taskA one more time and get back to taskB. This
      time switch_mm() won't do a thing because oldmm (->active_mm)
      is the same as mm (as per context_switch()). So we remain
      with a bad CR3 / PGD and return to userland.
      
      The next thing that happens is handle_mm_fault() with an address for
      the execution of its code in userland. handle_mm_fault() realizes that
      it has a PTE with proper rights so it returns doing nothing. But the
      CPU looks at the wrong PGD and insists that something is wrong and
      faults again. And again. And one more time…
      
      This pagefault circle continues until the scheduler gets tired of it and
      puts another task on the CPU. It gets a little difficult if the task is a
      RT task with a high priority. The system will either freeze or it gets
      fixed by the software watchdog thread which usually runs at RT-max prio.
      But waiting for the watchdog will increase the latency of the RT task
      which is no good.
      
      Fix this by disabling preemption across the critical code section.
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/1470404259-26290-1-git-send-email-bigeasy@linutronix.de
      [ Prettified the changelog. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      b591901e
    • x86/traps: Ignore high word of regs->cs in early_idt_handler_common · 213060d9
      Andy Lutomirski authored
      This is a backport of:
      commit fc0e81b2 upstream
      
      On the 80486 DX, it seems that some exceptions may leave garbage in
      the high bits of CS.  This causes sporadic failures in which
      early_fixup_exception() refuses to fix up an exception.
      
      As far as I can tell, this has been buggy for a long time, but the
      problem seems to have been exacerbated by commits:
      
        1e02ce4c ("x86: Store a per-cpu shadow copy of CR4")
        e1bfc11c ("x86/init: Fix cr4_init_shadow() on CR4-less machines")
      
      This appears to have been broken for as long as we've had early
      exception handling.
      
      [ This backport should apply to kernels from 3.4 - 4.5. ]
      
      Fixes: 4c5023a3 ("x86-32: Handle exception table entries during early boot")
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: stable@vger.kernel.org
      Reported-by: Matthew Whitehead <tedheadster@gmail.com>
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      213060d9
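
The idea of ignoring the high word of the saved CS can be sketched as a masked comparison. Everything named here is hypothetical: `KERNEL_CS_SELECTOR` is an arbitrary value picked for the example and the two functions only model the comparison. The real change lives in assembly and is not reproduced on this page.

```c
#include <assert.h>
#include <stdint.h>

#define KERNEL_CS_SELECTOR UINT32_C(0x0010) /* hypothetical selector value for this sketch */

/* Full 32-bit compare: fails if the CPU left garbage in bits 31:16. */
static inline int cs_is_kernel_unmasked(uint32_t saved_cs)
{
    return saved_cs == KERNEL_CS_SELECTOR;
}

/* Compare only the low 16 bits, which hold the actual selector. */
static inline int cs_is_kernel_masked(uint32_t saved_cs)
{
    return (saved_cs & UINT32_C(0xffff)) == KERNEL_CS_SELECTOR;
}
```

A saved value such as 0xdead0010 is rejected by the unmasked test but accepted by the masked one, which is the kind of sporadic mismatch the commit describes.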
    • x86/xen: fix upper bound of pmd loop in xen_cleanhighmap() · 506750b7
      Juergen Gross authored
      commit 1cf38741 upstream.
      
      xen_cleanhighmap() is operating on level2_kernel_pgt only. The upper
      bound of the loop setting non-kernel-image entries to zero should not
      exceed the size of level2_kernel_pgt.
      Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Juergen Gross <jgross@suse.com>
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      506750b7
    • xen-pciback: Add name prefix to global 'permissive' variable · f014225a
      Ben Hutchings authored
      commit 8014bcc8 upstream.
      
      The variable for the 'permissive' module parameter used to be static
      but was recently changed to be extern.  This puts it in the kernel
      global namespace if the driver is built-in, so its name should begin
      with a prefix identifying the driver.
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      Fixes: af6fc858 ("xen-pciback: limit guest control of command register")
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      f014225a
    • xen/pciback: Don't allow MSI-X ops if PCI_COMMAND_MEMORY is not set. · 1d9d642c
      Konrad Rzeszutek Wilk authored
      commit 408fb0e5 upstream.
      
      commit f598282f ("PCI: Fix the NIU MSI-X problem in a better way")
      teaches us that dealing with MSI-X can be troublesome.
      
      Further checks in the MSI-X architecture show that if the
      PCI_COMMAND_MEMORY bit is turned off in PCI_COMMAND, we
      may not be able to access the BARs (since they are memory regions).
      
      Since the MSI-X tables are located there, that can lead
      to us causing PCIe errors. Inhibit us performing any
      operation on the MSI-X unless the MEMORY bit is set.
      
      Note that Xen hypervisor with:
      "x86/MSI-X: access MSI-X table only after having enabled MSI-X"
      will return:
      xen_pciback: 0000:0a:00.1: error -6 enabling MSI-X for guest 3!
      
      This happens when the generic MSI code tries to set up the PIRQ
      without the MEMORY bit set, which means that with later versions
      of Xen (4.6) this patch is not necessary.
      
      This is part of XSA-157
      Reviewed-by: Jan Beulich <jbeulich@suse.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      1d9d642c
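
The guard described in the MSI-X commit above can be sketched as a check on the command register before any MSI-X operation. This is a model under stated assumptions: `struct dev_model` and `enable_msix_checked` are hypothetical, and returning `-ENXIO` is a choice made for this example. Only the `PCI_COMMAND_MEMORY` bit value (0x2) comes from the PCI specification.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define PCI_COMMAND_MEMORY 0x2 /* memory space enable bit in the PCI command register */

/* Hypothetical stand-in for the device state the backend inspects. */
struct dev_model {
    uint16_t command;
    bool msix_enabled;
};

/* Refuse MSI-X setup unless memory decoding is on, since the MSI-X
 * table lives in a memory BAR. */
static inline int enable_msix_checked(struct dev_model *dev)
{
    if (!(dev->command & PCI_COMMAND_MEMORY))
        return -ENXIO; /* error code chosen for the sketch */

    dev->msix_enabled = true;
    return 0;
}
```

A device whose command register has only the I/O bit set is rejected without touching its state, while one with the memory bit set proceeds.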
    • xen/pciback: For XEN_PCI_OP_disable_msi[|x] only disable if device has MSI(X) enabled. · 4c2e2fd2
      Konrad Rzeszutek Wilk authored
      commit 7cfb905b upstream.
      
      Otherwise just continue on, returning the same values as
      previously (return of 0, and op->result has the PIRQ value).
      
      This does not change the behavior of XEN_PCI_OP_disable_msi[|x].
      
      The pci_disable_msi or pci_disable_msix have the checks for
      msi_enabled or msix_enabled so they will error out immediately.
      
      However the guest can still call these operations and cause
      us to disable the 'ack_intr'. That means the backend IRQ handler
      for the legacy interrupt will not respond to interrupts anymore.
      
      If the device is causing an interrupt storm, this will lead
      the Linux generic code to disable the interrupt line.
      
      Naturally this will only happen if the device in question
      is plugged in on the motherboard on a shared level interrupt GSI.
      
      This is part of XSA-157
      Reviewed-by: David Vrabel <david.vrabel@citrix.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      4c2e2fd2
    • xen/pciback: Do not install an IRQ handler for MSI interrupts. · 5701f8a1
      Konrad Rzeszutek Wilk authored
      commit a396f3a2 upstream.
      
      Otherwise a guest can subvert the generic MSI code to trigger
      a BUG_ON condition during MSI interrupt freeing:
      
       for (i = 0; i < entry->nvec_used; i++)
              BUG_ON(irq_has_action(entry->irq + i));
      
      The Xen PCI backend installs an IRQ handler (request_irq) for
      the dev->irq whenever the guest writes PCI_COMMAND_MEMORY
      (or PCI_COMMAND_IO) to the PCI_COMMAND register. This is
      done in case the device has legacy interrupts and the GSI line
      is shared by the backend devices.
      
      To subvert the backend, the guest needs to make the backend
      change the dev->irq from the GSI to the MSI interrupt line,
      make the backend allocate an interrupt handler, and then command
      the backend to free the MSI interrupt and hit the BUG_ON.
      
      Since the backend only calls 'request_irq' when the guest
      writes to the PCI_COMMAND register the guest needs to call
      XEN_PCI_OP_enable_msi before any other operation. This will
      cause the generic MSI code to setup an MSI entry and
      populate dev->irq with the new PIRQ value.
      
      Then the guest can write to PCI_COMMAND PCI_COMMAND_MEMORY
      and cause the backend to setup an IRQ handler for dev->irq
      (which instead of the GSI value has the MSI pirq). See
      'xen_pcibk_control_isr'.
      
      Then the guest disables the MSI: XEN_PCI_OP_disable_msi
      which ends up triggering the BUG_ON condition in 'free_msi_irqs'
      as there is an IRQ handler for the entry->irq (dev->irq).
      
      Note that this cannot be done using MSI-X as the generic
      code does not over-write dev->irq with the MSI-X PIRQ values.
      
      The patch inhibits setting up the IRQ handler if MSI or
      MSI-X (for symmetry reasons) code had been called successfully.
      
      P.S.
      Xen PCIBack, when it sets up the device for guest consumption,
      ends up writing 0 to PCI_COMMAND (see xen_pcibk_reset_device).
      XSA-120 addendum patch removed that - however when upstreaming said
      addendum we found that it caused issues with qemu upstream. That
      has now been fixed in qemu upstream.
      
      This is part of XSA-157
      Reviewed-by: David Vrabel <david.vrabel@citrix.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      5701f8a1
    • Konrad Rzeszutek Wilk's avatar
      xen/pciback: Return error on XEN_PCI_OP_enable_msix when device has MSI or MSI-X enabled · d2dfdbab
      Konrad Rzeszutek Wilk authored
      commit 5e0ce145 upstream.
      
      The guest sequence of:
      
        a) XEN_PCI_OP_enable_msix
        b) XEN_PCI_OP_enable_msix
      
      results in hitting a NULL pointer dereference due to using freed pointers.
      
      The device passed in to the guest MUST have MSI-X capability.
      
      The a) constructs a SysFS representation of MSI and MSI groups.
      The b) adds a second set of them but adding them to SysFS fails (duplicate entry).
      'populate_msi_sysfs' frees the newly allocated msi_irq_groups (note that
      in a) pdev->msi_irq_groups is still set) and also frees ALL of the
      MSI-X entries of the device (the ones allocated in step a) and b)).
      
      The unwind code: 'free_msi_irqs' deletes all the entries and tries to
      delete the pdev->msi_irq_groups (which hasn't been set to NULL).
      However the pointers in the SysFS are already freed and we hit a
      NULL pointer further on when 'strlen' is attempted on a freed pointer.
      
      The patch adds a simple check in the XEN_PCI_OP_enable_msix to guard
      against that. The check for msi_enabled is not strictly necessary.
      
      This is part of XSA-157
      Reviewed-by: David Vrabel <david.vrabel@citrix.com>
      Reviewed-by: Jan Beulich <jbeulich@suse.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      d2dfdbab
    • Konrad Rzeszutek Wilk's avatar
      xen/pciback: Return error on XEN_PCI_OP_enable_msi when device has MSI or MSI-X enabled · 871e37be
      Konrad Rzeszutek Wilk authored
      commit 56441f3c upstream.
      
      The guest sequence of:
      
       a) XEN_PCI_OP_enable_msi
       b) XEN_PCI_OP_enable_msi
       c) XEN_PCI_OP_disable_msi
      
      results in hitting a BUG_ON condition in the msi.c code.
      
      The MSI code uses a dev->msi_list to which it adds MSI entries.
      Under the above conditions a BUG_ON() can be hit. The device
      passed in to the guest MUST have MSI capability.
      
      The a) adds the entry to the dev->msi_list and sets msi_enabled.
      The b) adds a second entry but adding it to SysFS fails (duplicate entry),
      which deletes all of the entries from msi_list and returns (with msi_enabled
      still set).  c) pci_disable_msi passes the msi_enabled checks and hits:
      
      BUG_ON(list_empty(dev_to_msi_list(&dev->dev)));
      
      and blows up.
      
      The patch adds a simple check in the XEN_PCI_OP_enable_msi to guard
      against that. The check for msix_enabled is not strictly necessary.
      
      This is part of XSA-157.
      Reviewed-by: David Vrabel <david.vrabel@citrix.com>
      Reviewed-by: Jan Beulich <jbeulich@suse.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      871e37be
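
The a), b), c) sequence described above can be modelled in plain userspace C. This is only an illustrative sketch, not the pciback patch: `struct model_dev`, the `model_*` functions and the `-EBUSY` / `-EEXIST` return values are hypothetical names and choices made here, and the list is reduced to a simple counter.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the few pieces of device state involved. */
struct model_dev {
    bool msi_enabled;
    bool msix_enabled;
    int msi_entries;    /* modelled length of dev->msi_list */
};

/* Modelled enable path: a second call fails on the duplicate SysFS
 * entry, empties the list, and leaves msi_enabled set. */
int model_pci_enable_msi(struct model_dev *dev)
{
    assert(dev != NULL);
    if (dev->msi_entries > 0) {
        dev->msi_entries = 0;
        return -EEXIST;
    }
    dev->msi_entries = 1;
    dev->msi_enabled = true;
    return 0;
}

/* Guarded backend operation: refuse if MSI or MSI-X is already on.
 * The error code is an assumption chosen for this sketch. */
int model_op_enable_msi(struct model_dev *dev)
{
    assert(dev != NULL);
    if (dev->msi_enabled || dev->msix_enabled)
        return -EBUSY;
    return model_pci_enable_msi(dev);
}

/* True when a later disable would pass the msi_enabled check and
 * then find an empty list, i.e. the BUG_ON case. */
bool model_disable_would_bug(const struct model_dev *dev)
{
    assert(dev != NULL);
    return dev->msi_enabled && dev->msi_entries == 0;
}
```

With the guard, the second enable is rejected before it can empty the list, so the modelled disable never sees `msi_enabled` set together with an empty list.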
    • Konrad Rzeszutek Wilk's avatar
      xen/pciback: Save the number of MSI-X entries to be copied later. · 28018142
      Konrad Rzeszutek Wilk authored
      commit d159457b upstream.
      
      Commit 8135cf8b (xen/pciback: Save
      xen_pci_op commands before processing it) broke enabling MSI-X because
      it would never copy the resulting vectors into the response.  The
      number of vectors requested was being overwritten by the return value
      (typically zero for success).
      
      Save the number of vectors before processing the op, so the correct
      number of vectors are copied afterwards.
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: <stable@vger.kernel.org>
      Reviewed-by: Jan Beulich <jbeulich@suse.com>
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      28018142
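
A small userspace model may help show why the count has to be saved first. Everything here is an assumption made for illustration (`struct model_op`, `MODEL_MAX_VEC`, the `model_*` functions, the fake vector numbers); it only mirrors the idea that one field carries the requested count on input and the status on output.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_MAX_VEC 4

/* Hypothetical request/response layout: 'value' is the number of
 * vectors requested on input and the status on output. */
struct model_op {
    int value;
    int vectors[MODEL_MAX_VEC];
};

/* Modelled operation: fills in vectors, then overwrites 'value'
 * with the return status (0 for success). */
void model_enable_msix(struct model_op *op)
{
    assert(op != NULL);
    for (int i = 0; i < op->value && i < MODEL_MAX_VEC; i++)
        op->vectors[i] = 100 + i;
    op->value = 0;
}

/* Works on a local copy and remembers the requested count before the
 * operation clobbers it, so the right number of vectors is copied back. */
int model_handle_op(struct model_op *shared)
{
    assert(shared != NULL);
    struct model_op local = *shared;
    int nr = local.value;           /* saved before processing */

    if (nr < 0)
        nr = 0;
    if (nr > MODEL_MAX_VEC)
        nr = MODEL_MAX_VEC;

    model_enable_msix(&local);

    shared->value = local.value;
    memcpy(shared->vectors, local.vectors,
           (size_t)nr * sizeof(local.vectors[0]));
    return nr;
}
```

Had the copy-back used `local.value` after processing, it would have copied zero vectors on success, which is the breakage the commit message describes.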
    • Konrad Rzeszutek Wilk's avatar
      xen/pciback: Save xen_pci_op commands before processing it · ffb6bca6
      Konrad Rzeszutek Wilk authored
      commit 8135cf8b upstream.
      
      Double fetch vulnerabilities happen when a variable is
      fetched twice from shared memory but a security check is only
      performed the first time.
      
      The xen_pcibk_do_op function performs a switch statement on the op->cmd
      value which is stored in shared memory. Interestingly this can result
      in a double fetch vulnerability depending on the compiler
      optimizations performed.
      
      This patch fixes it by saving the xen_pci_op command before
      processing it. We also use 'barrier' to make sure that the
      compiler does not perform any optimization.
      
      This is part of XSA155.
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Jan Beulich <JBeulich@suse.com>
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: "Jan Beulich" <JBeulich@suse.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      ffb6bca6
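
The snapshot-then-dispatch idea can be sketched in userspace C as follows. This is not the pciback code: the struct, the `MODEL_OP_*` constants and the return values are hypothetical, and the sketch uses a `volatile` source pointer where the commit message mentions 'barrier'.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_OP_READ  0u
#define MODEL_OP_WRITE 1u

/* Hypothetical layout of an operation living in shared memory. */
struct model_shared_op {
    uint32_t cmd;
    uint32_t value;
};

/* Takes one snapshot of the shared fields, then validates and
 * dispatches on the private copy only. Because the source is
 * volatile, each field is loaded exactly once. */
int model_do_op(const volatile struct model_shared_op *shared)
{
    assert(shared != NULL);

    struct model_shared_op op;
    op.cmd = shared->cmd;
    op.value = shared->value;

    switch (op.cmd) {
    case MODEL_OP_READ:
        return 1;
    case MODEL_OP_WRITE:
        return 2;
    default:
        return -1;      /* unknown command rejected */
    }
}
```

Whatever the other side writes to the shared structure after the snapshot cannot change which branch runs, because the switch never looks at shared memory again.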
    • Roger Pau Monné's avatar
      xen-blkback: only read request operation from shared ring once · d6837064
      Roger Pau Monné authored
      commit 1f13d75c upstream.
      
      A compiler may load a switch statement value multiple times, which could
      be bad when the value is in memory shared with the frontend.
      
      When converting a non-native request to a native one, ensure that
      src->operation is only loaded once by using READ_ONCE().
      
      This is part of XSA155.
      Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: "Jan Beulich" <JBeulich@suse.com>
      [wt: s/READ_ONCE/ACCESS_ONCE for 3.10]
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      d6837064
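
As a rough userspace illustration of loading the operation once, the sketch below defines a simplified stand-in for the kernel's READ_ONCE()/ACCESS_ONCE(). `MODEL_READ_ONCE`, the struct and the `MODEL_BLKIF_*` names are assumptions for this example, and the macro relies on the GNU `__typeof__` extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for READ_ONCE(): force a single volatile load. */
#define MODEL_READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

enum { MODEL_BLKIF_OP_READ = 0, MODEL_BLKIF_OP_WRITE = 1 };

/* Hypothetical request layout; the real ring request has more fields. */
struct model_blkif_request {
    uint8_t operation;
    uint64_t id;
};

/* Converts a request from the shared ring into a private one. The
 * operation is loaded once and that same value is both stored and
 * switched on. */
int model_convert(struct model_blkif_request *dst,
                  const struct model_blkif_request *src)
{
    assert(dst != NULL && src != NULL);

    uint8_t op = MODEL_READ_ONCE(src->operation);

    dst->operation = op;
    dst->id = MODEL_READ_ONCE(src->id);

    switch (op) {
    case MODEL_BLKIF_OP_READ:
    case MODEL_BLKIF_OP_WRITE:
        return 0;
    default:
        return -1;      /* unsupported operation */
    }
}
```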
    • David Vrabel's avatar
      xen-netback: use RING_COPY_REQUEST() throughout · 67434d33
      David Vrabel authored
      commit 68a33bfd upstream.
      
      Instead of open-coding memcpy()s and directly accessing Tx and Rx
      requests, use the new RING_COPY_REQUEST() that ensures the local copy
      is correct.
      
      This is more than is strictly necessary for guest Rx requests since
      only the id and gref fields are used and it is harmless if the
      frontend modifies these.
      
      This is part of XSA155.
      Reviewed-by: Wei Liu <wei.liu2@citrix.com>
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      [wt: adjustments for 3.10 : netbk_rx_meta instead of struct xenvif_rx_meta]
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      67434d33
    • David Vrabel's avatar
      xen-netback: don't use last request to determine minimum Tx credit · 85fed7d2
      David Vrabel authored
      commit 0f589967 upstream.
      
      The last request transmitted by the guest gives no indication about the
      minimum amount of credit that the guest might need to send a packet,
      since the last packet might have been a small one.
      
      Instead allow for the worst case 128 KiB packet.
      
      This is part of XSA155.
      Reviewed-by: Wei Liu <wei.liu2@citrix.com>
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      85fed7d2
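
One way to read "allow for the worst case 128 KiB packet" is as a fixed lower bound on the burst size that no longer depends on anything read from the ring. The helper below is a hypothetical sketch of that reading; the function name, the parameter and the choice to take the larger of the two values are assumptions, not the netback code.

```c
#include <assert.h>

/* Worst-case packet size named in the commit message: 128 KiB. */
#define MODEL_WORST_CASE_BYTES (128UL * 1024UL)

/* Returns the minimum credit a guest should be allowed to accumulate:
 * never less than one worst-case packet, and never less than the
 * configured credit per period. */
unsigned long model_max_burst(unsigned long credit_bytes)
{
    assert(MODEL_WORST_CASE_BYTES == 131072UL);
    return credit_bytes > MODEL_WORST_CASE_BYTES
               ? credit_bytes
               : MODEL_WORST_CASE_BYTES;
}
```

Because the bound is a constant, the guest cannot shrink it by queueing a small request last.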
    • David Vrabel's avatar
      xen: Add RING_COPY_REQUEST() · 0a5b4e4c
      David Vrabel authored
      commit 454d5d88 upstream.
      
      Using RING_GET_REQUEST() on a shared ring is easy to use incorrectly
      (i.e., by not considering that the other end may alter the data in the
      shared ring while it is being inspected).  Safe usage of a request
      generally requires taking a local copy.
      
      Provide a RING_COPY_REQUEST() macro to use instead of
      RING_GET_REQUEST() and an open-coded memcpy().  This takes care of
      ensuring that the copy is done correctly regardless of any possible
      compiler optimizations.
      
      Use a volatile source to prevent the compiler from reordering or
      omitting the copy.
      
      This is part of XSA155.
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      0a5b4e4c
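
The local-copy pattern can be sketched in userspace C as below. This is not the RING_COPY_REQUEST() macro itself: the ring layout, `MODEL_RING_SIZE` and the function name are hypothetical, and the copy is written field by field through a volatile pointer to keep the sketch portable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_RING_SIZE 8u      /* assumed power of two */

/* Hypothetical request and ring layout for this sketch. */
struct model_req {
    uint16_t id;
    uint32_t gref;
    uint16_t size;
};

struct model_ring {
    struct model_req req[MODEL_RING_SIZE];
};

/* Copies one request out of the shared ring into private memory.
 * Reading through a volatile pointer keeps the compiler from
 * dropping the copy or re-reading the ring later on. */
void model_ring_copy_request(const struct model_ring *ring,
                             unsigned int idx, struct model_req *out)
{
    assert(ring != NULL && out != NULL);

    const volatile struct model_req *src =
        &ring->req[idx & (MODEL_RING_SIZE - 1u)];

    out->id = src->id;
    out->gref = src->gref;
    out->size = src->size;
}
```

After the call, validation and use both operate on `out`, so later changes made by the other end of the ring have no effect on the request being processed.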
    • Jan Beulich's avatar
      x86/mm/xen: Suppress hugetlbfs in PV guests · 0679df0f
      Jan Beulich authored
      commit 103f6112 upstream.
      
      Huge pages are not normally available to PV guests. Not suppressing
      hugetlbfs use results in an endless loop of page faults when user mode
      code tries to access a hugetlbfs mapped area (since the hypervisor
      refuses to create such PTEs, but error indications can't be
      propagated out of xen_set_pte_at(), just like for various of its
      siblings), and, once the process is killed, in an oops like this:
      
        kernel BUG at .../fs/hugetlbfs/inode.c:428!
        invalid opcode: 0000 [#1] SMP
        ...
        RIP: e030:[<ffffffff811c333b>]  [<ffffffff811c333b>] remove_inode_hugepages+0x25b/0x320
        ...
        Call Trace:
         [<ffffffff811c3415>] hugetlbfs_evict_inode+0x15/0x40
         [<ffffffff81167b3d>] evict+0xbd/0x1b0
         [<ffffffff8116514a>] __dentry_kill+0x19a/0x1f0
         [<ffffffff81165b0e>] dput+0x1fe/0x220
         [<ffffffff81150535>] __fput+0x155/0x200
         [<ffffffff81079fc0>] task_work_run+0x60/0xa0
         [<ffffffff81063510>] do_exit+0x160/0x400
         [<ffffffff810637eb>] do_group_exit+0x3b/0xa0
         [<ffffffff8106e8bd>] get_signal+0x1ed/0x470
         [<ffffffff8100f854>] do_signal+0x14/0x110
         [<ffffffff810030e9>] prepare_exit_to_usermode+0xe9/0xf0
         [<ffffffff814178a5>] retint_user+0x8/0x13
      
      This is CVE-2016-3961 / XSA-174.
      Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Jan Beulich <jbeulich@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Juergen Gross <JGross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: xen-devel <xen-devel@lists.xenproject.org>
      Link: http://lkml.kernel.org/r/57188ED802000078000E431C@prv-mh.provo.novell.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      0679df0f
    • WANG Cong's avatar
      ppp: defer netns reference release for ppp channel · 068b35e9
      WANG Cong authored
      commit 205e1e25 upstream
      
      Matt reported that we have a NULL pointer dereference
      in ppp_pernet() from ppp_connect_channel(),
      i.e. pch->chan_net is NULL.
      
      This is because a parallel ppp_unregister_channel()
      could happen while we are in ppp_connect_channel(), during
      which pch->chan_net is set to NULL. Since we need a reference
      to net per channel, it makes sense to sync the refcnt
      with the lifetime of the channel, therefore we should
      release this reference when we destroy it.
      
      Fixes: 1f461dcd ("ppp: take reference on channels netns")
      Reported-by: Matt Bennett <Matt.Bennett@alliedtelesis.co.nz>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: linux-ppp@vger.kernel.org
      Cc: Guillaume Nault <g.nault@alphalink.fr>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      068b35e9
    • Xiaolong Ye's avatar
      PM / devfreq: Fix incorrect type issue. · 6e5ae963
      Xiaolong Ye authored
      commit 5f25f066 upstream
      
      time_in_state in struct devfreq is defined as unsigned long, so
      devm_kzalloc should use sizeof(unsigned long) as argument instead
      of sizeof(unsigned int), otherwise it will cause unexpected results
      on 64-bit systems.
      Signed-off-by: Xiaolong Ye <yexl@marvell.com>
      Signed-off-by: Kevin Liu <kliu5@marvell.com>
      Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      6e5ae963
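
The sizing issue can be shown with a userspace analogue. In this sketch `calloc` stands in for devm_kzalloc, and the function name and `max_state` parameter are hypothetical. Using `sizeof(*buf)` rather than a spelled-out type is a choice made here (the commit message itself names `sizeof(unsigned long)`); it keeps the element size tied to the pointer's type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Allocates a zeroed array of 'max_state' counters. Sizing by
 * sizeof(*buf) follows the element type automatically, so the buffer
 * cannot end up too small when unsigned long is wider than unsigned
 * int, as on typical 64-bit Linux targets. */
unsigned long *model_alloc_time_in_state(size_t max_state)
{
    assert(max_state > 0);

    unsigned long *buf = calloc(max_state, sizeof(*buf));

    return buf;     /* NULL on allocation failure; caller frees */
}
```

On an LP64 system `sizeof(unsigned int)` is 4 while `sizeof(unsigned long)` is 8, so sizing by the former would allocate half of what the array needs.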
    • Ignacio Alvarado's avatar
      KVM: Disable irq while unregistering user notifier · 01c22882
      Ignacio Alvarado authored
      commit 1650b4eb upstream.
      
      Function user_notifier_unregister should be called only once for each
      registered user notifier.
      
      Function kvm_arch_hardware_disable can be executed from an IPI context
      which could cause a race condition with a VCPU returning to user mode
      and attempting to unregister the notifier.
      Signed-off-by: Ignacio Alvarado <ikalvarado@google.com>
      Fixes: 18863bdd ("KVM: x86 shared msr infrastructure")
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      01c22882
    • Paolo Bonzini's avatar
      KVM: x86: fix missed SRCU usage in kvm_lapic_set_vapic_addr · 014db7f6
      Paolo Bonzini authored
      commit 7301d6ab upstream.
      
      Reported by syzkaller:
      
          [ INFO: suspicious RCU usage. ]
          4.9.0-rc4+ #47 Not tainted
          -------------------------------
          ./include/linux/kvm_host.h:536 suspicious rcu_dereference_check() usage!
      
          stack backtrace:
          CPU: 1 PID: 6679 Comm: syz-executor Not tainted 4.9.0-rc4+ #47
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
           ffff880039e2f6d0 ffffffff81c2e46b ffff88003e3a5b40 0000000000000000
           0000000000000001 ffffffff83215600 ffff880039e2f700 ffffffff81334ea9
           ffffc9000730b000 0000000000000004 ffff88003c4f8420 ffff88003d3f8000
          Call Trace:
           [<     inline     >] __dump_stack lib/dump_stack.c:15
           [<ffffffff81c2e46b>] dump_stack+0xb3/0x118 lib/dump_stack.c:51
           [<ffffffff81334ea9>] lockdep_rcu_suspicious+0x139/0x180 kernel/locking/lockdep.c:4445
           [<     inline     >] __kvm_memslots include/linux/kvm_host.h:534
           [<     inline     >] kvm_memslots include/linux/kvm_host.h:541
           [<ffffffff8105d6ae>] kvm_gfn_to_hva_cache_init+0xa1e/0xce0 virt/kvm/kvm_main.c:1941
           [<ffffffff8112685d>] kvm_lapic_set_vapic_addr+0xed/0x140 arch/x86/kvm/lapic.c:2217
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Fixes: fda4e2e8
      Cc: Andrew Honig <ahonig@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      014db7f6
    • Ido Yariv's avatar
      KVM: x86: fix wbinvd_dirty_mask use-after-free · a3e49e60
      Ido Yariv authored
      commit bd768e14 upstream.
      
      vcpu->arch.wbinvd_dirty_mask may still be used after freeing it,
      corrupting memory. For example, the following call trace may set a bit
      in an already freed cpu mask:
          kvm_arch_vcpu_load
          vcpu_load
          vmx_free_vcpu_nested
          vmx_free_vcpu
          kvm_arch_vcpu_free
      
      Fix this by deferring freeing of wbinvd_dirty_mask.
      Signed-off-by: Ido Yariv <ido@wizery.com>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      a3e49e60
    • James Hogan's avatar
      KVM: MIPS: Make ERET handle ERL before EXL · 010a6cc4
      James Hogan authored
      commit ede5f3e7 upstream.
      
      The ERET instruction to return from exception is used for returning from
      exception level (Status.EXL) and error level (Status.ERL). If both bits
      are set however we should be returning from ERL first, as ERL can
      interrupt EXL, for example when an NMI is taken. KVM however checks EXL
      first.
      
      Fix the order of the checks to match the pseudocode in the instruction
      set manual.
      
      Fixes: e685c689 ("KVM/MIPS32: Privileged instruction/target branch emulation.")
      Signed-off-by: James Hogan <james.hogan@imgtec.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: linux-mips@linux-mips.org
      Cc: kvm@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      010a6cc4
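
The corrected check order can be illustrated with a small userspace model of ERET. The struct, the `MODEL_ST0_*` names, and the choice to return -1 when neither bit is set are assumptions for this sketch; it is not the KVM emulation code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_ST0_EXL 0x2u      /* Status.EXL, exception level */
#define MODEL_ST0_ERL 0x4u      /* Status.ERL, error level */

/* Hypothetical subset of the guest's coprocessor 0 state. */
struct model_cp0 {
    uint32_t status;
    uint32_t epc;
    uint32_t error_epc;
};

/* Models ERET: ERL is checked first because an error-level exception
 * (for example an NMI) can interrupt exception-level code. Returns 0
 * and stores the new PC, or -1 if neither bit is set. */
int model_eret(struct model_cp0 *cp0, uint32_t *pc)
{
    assert(cp0 != NULL && pc != NULL);

    if (cp0->status & MODEL_ST0_ERL) {
        cp0->status &= ~MODEL_ST0_ERL;
        *pc = cp0->error_epc;
        return 0;
    }
    if (cp0->status & MODEL_ST0_EXL) {
        cp0->status &= ~MODEL_ST0_EXL;
        *pc = cp0->epc;
        return 0;
    }
    return -1;
}
```

With both bits set, the first ERET returns to ErrorEPC and leaves EXL set, and a second ERET then returns to EPC.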
    • Radim Krčmář's avatar
      KVM: nVMX: postpone VMCS changes on MSR_IA32_APICBASE write · a0753a8c
      Radim Krčmář authored
      commit dccbfcf5 upstream.
      
      If vmcs12 does not intercept APIC_BASE writes, then KVM will handle the
      write with vmcs02 as the current VMCS.
      This will incorrectly apply modifications intended for vmcs01 to vmcs02,
      and L2 can use it to gain access to L0's x2APIC registers by disabling
      virtualized x2APIC while using an MSR bitmap that assumes it is enabled.
      
      Postpone execution of vmx_set_virtual_x2apic_mode until vmcs01 is the
      current VMCS.  An alternative solution would temporarily make vmcs01 the
      current VMCS, but it requires more care.
      
      Fixes: 8d14695f ("x86, apicv: add virtual x2apic support")
      Reported-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      a0753a8c
    • James Hogan's avatar
      KVM: MIPS: Drop other CPU ASIDs on guest MMU changes · 4c7055f8
      James Hogan authored
      commit 91e4f1b6 upstream.
      
      When a guest TLB entry is replaced by TLBWI or TLBWR, we only invalidate
      TLB entries on the local CPU. This doesn't work correctly on an SMP host
      when the guest is migrated to a different physical CPU, as it could pick
      up stale TLB mappings from the last time the vCPU ran on that physical
      CPU.
      
      Therefore invalidate both user and kernel host ASIDs on other CPUs,
      which will cause new ASIDs to be generated when it next runs on those
      CPUs.
      
      We're careful only to do this if the TLB entry was already valid, and
      only for the kernel ASID where the virtual address it mapped is outside
      of the guest user address range.
      Signed-off-by: James Hogan <james.hogan@imgtec.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: linux-mips@linux-mips.org
      Cc: kvm@vger.kernel.org
      Cc: <stable@vger.kernel.org> # 3.10.x-
      Cc: Jiri Slaby <jslaby@suse.cz>
      [james.hogan@imgtec.com: Backport to 3.10..3.16]
      Signed-off-by: James Hogan <james.hogan@imgtec.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      4c7055f8