- 06 Feb, 2017 40 commits
-
-
Will Deacon authored
commit 872c63fb upstream. smp_mb__before_spinlock() is intended to upgrade a spin_lock() operation to a full barrier, such that prior stores are ordered with respect to loads and stores occuring inside the critical section. Unfortunately, the core code defines the barrier as smp_wmb(), which is insufficient to provide the required ordering guarantees when used in conjunction with our load-acquire-based spinlock implementation. This patch overrides the arm64 definition of smp_mb__before_spinlock() to map to a full smp_mb(). Cc: Peter Zijlstra <peterz@infradead.org> Reported-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
James Hogan authored
commit 3146bc64 upstream. AT_VECTOR_SIZE_ARCH should be defined with the maximum number of NEW_AUX_ENT entries that ARCH_DLINFO can contain, but it wasn't defined for arm64 at all even though ARCH_DLINFO will contain one NEW_AUX_ENT for the VDSO address. This shouldn't be a problem as AT_VECTOR_SIZE_BASE includes space for AT_BASE_PLATFORM which arm64 doesn't use, but lets define it now and add the comment above ARCH_DLINFO as found in several other architectures to remind future modifiers of ARCH_DLINFO to keep AT_VECTOR_SIZE_ARCH up to date. Fixes: f668cd16 ("arm64: ELF definitions") Signed-off-by: James Hogan <james.hogan@imgtec.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
Mark Rutland authored
commit 7d9e8f71 upstream. Generally, taking an unexpected exception should be a fatal event, and bad_mode is intended to cater for this. However, it should be possible to contain unexpected synchronous exceptions from EL0 without bringing the kernel down, by sending a SIGILL to the task. We tried to apply this approach in commit 9955ac47 ("arm64: don't kill the kernel on a bad esr from el0"), by sending a signal for any bad_mode call resulting from an EL0 exception. However, this also applies to other unexpected exceptions, such as SError and FIQ. The entry paths for these exceptions branch to bad_mode without configuring the link register, and have no kernel_exit. Thus, if we take one of these exceptions from EL0, bad_mode will eventually return to the original user link register value. This patch fixes this by introducing a new bad_el0_sync handler to cater for the recoverable case, and restoring bad_mode to its original state, whereby it calls panic() and never returns. The recoverable case branches to bad_el0_sync with a bl, and returns to userspace via the usual ret_to_user mechanism. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Fixes: 9955ac47 ("arm64: don't kill the kernel on a bad esr from el0") Reported-by: Mark Salter <msalter@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Cc: stable@vger.kernel.org Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
Russell King authored
commit 06dfe5cc upstream. SA1111 PCMCIA was broken when PCMCIA switched to using dev_pm_ops for the PCMCIA socket class. PCMCIA used to handle suspend/resume via the socket hosting device, which happened at normal device suspend/resume time. However, the referenced commit changed this: much of the resume now happens much earlier, in the noirq resume handler of dev_pm_ops. However, on SA1111, the PCMCIA device is not accessible as the SA1111 has not been resumed at _noirq time. It's slightly worse than that, because the SA1111 has already been put to sleep at _noirq time, so suspend doesn't work properly. Fix this by converting the core SA1111 code to use dev_pm_ops as well, and performing its own suspend/resume at noirq time. This fixes these errors in the kernel log: pcmcia_socket pcmcia_socket0: time out after reset pcmcia_socket pcmcia_socket1: time out after reset and the resulting lack of PCMCIA cards after a S2RAM cycle. Fixes: d7646f76 ("pcmcia: use dev_pm_ops for class pcmcia_socket_class") Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
Russell King authored
commit da60626e upstream. Clear the current reset status prior to rebooting the platform. This adds the bit missing from 04fef228 ("[ARM] pxa: introduce reset_status and clear_reset_status for driver's usage"). Fixes: 04fef228 ("[ARM] pxa: introduce reset_status and clear_reset_status for driver's usage") Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
Srinivas Ramana authored
commit 117e5e9c upstream. If the bootloader uses the long descriptor format and jumps to kernel decompressor code, TTBCR may not be in a right state. Before enabling the MMU, it is required to clear the TTBCR.PD0 field to use TTBR0 for translation table walks. The commit dbece458 ("ARM: 7501/1: decompressor: reset ttbcr for VMSA ARMv7 cores") does the reset of TTBCR.N, but doesn't consider all the bits for the size of TTBCR.N. Clear TTBCR.PD0 field and reset all the three bits of TTBCR.N to indicate the use of TTBR0 and the correct base address width. Fixes: dbece458 ("ARM: 7501/1: decompressor: reset ttbcr for VMSA ARMv7 cores") Acked-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Srinivas Ramana <sramana@codeaurora.org> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
Robin Murphy authored
commit ba6dea4f upstream. Whilst MPIDR values themselves are less than 32 bits, it is still perfectly valid for a DT to have #address-cells > 1 in the CPUs node, resulting in the "reg" property having leading zero cell(s). In that situation, the big-endian nature of the data conspires with the current behaviour of only reading the first cell to cause the kernel to think all CPUs have ID 0, and become resoundingly unhappy as a consequence. Take the full property length into account when parsing CPUs so as to be correct under any circumstances. Cc: Russell King <linux@armlinux.org.uk> Signed-off-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
Baoquan He authored
commit c3db901c upstream. The current code missed freeing domain id when free a domain of struct dma_ops_domain. Signed-off-by: Baoquan He <bhe@redhat.com> Fixes: ec487d1a ('x86, AMD IOMMU: add domain allocation and deallocation functions') Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
Joerg Roedel authored
commit 3254de6b upstream. Not doing so might cause IO-Page-Faults when a device uses an alias request-id and the alias-dte is left in a lower page-mode which does not cover the address allocated from the iova-allocator. Fixes: 492667da ('x86/amd-iommu: Remove amd_iommu_pd_table') Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
Michael S. Tsirkin authored
commit 577f183a upstream. On x86/um CONFIG_SMP is never defined. As a result, several macros match the asm-generic variant exactly. Drop the local definitions and pull in asm-generic/barrier.h instead. This is in preparation to refactoring this code area. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Richard Weinberger <richard@nod.at> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
H.J. Lu authored
commit 6d92bc9d upstream. The 32-bit x86 assembler in binutils 2.26 will generate R_386_GOT32X relocation to get the symbol address in PIC. When the compressed x86 kernel isn't built as PIC, the linker optimizes R_386_GOT32X relocations to their fixed symbol addresses. However, when the compressed x86 kernel is loaded at a different address, it leads to the following load failure: Failed to allocate space for phdrs during the decompression stage. If the compressed x86 kernel is relocatable at run-time, it should be compiled with -fPIE, instead of -fPIC, if possible and should be built as Position Independent Executable (PIE) so that linker won't optimize R_386_GOT32X relocation to its fixed symbol address. Older linkers generate R_386_32 relocations against locally defined symbols, _bss, _ebss, _got and _egot, in PIE. It isn't wrong, just less optimal than R_386_RELATIVE. But the x86 kernel fails to properly handle R_386_32 relocations when relocating the kernel. To generate R_386_RELATIVE relocations, we mark _bss, _ebss, _got and _egot as hidden in both 32-bit and 64-bit x86 kernels. To build a 64-bit compressed x86 kernel as PIE, we need to disable the relocation overflow check to avoid relocation overflow errors. We do this with a new linker command-line option, -z noreloc-overflow, which got added recently: commit 4c10bbaa0912742322f10d9d5bb630ba4e15dfa7 Author: H.J. Lu <hjl.tools@gmail.com> Date: Tue Mar 15 11:07:06 2016 -0700 Add -z noreloc-overflow option to x86-64 ld Add -z noreloc-overflow command-line option to the x86-64 ELF linker to disable relocation overflow check. This can be used to avoid relocation overflow check if there will be no dynamic relocation overflow at run-time. The 64-bit compressed x86 kernel is built as PIE only if the linker supports -z noreloc-overflow. So far 64-bit relocatable compressed x86 kernel boots fine even when it is built as a normal executable. Signed-off-by: H.J. Lu <hjl.tools@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org [ Edited the changelog and comments. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
Steven Rostedt authored
commit 15301a57 upstream. Åukasz Daniluk reported that on a RHEL kernel that his machine would lock up after enabling function tracer. I asked him to bisect the functions within available_filter_functions, which he did and it came down to three: _paravirt_nop(), _paravirt_ident_32() and _paravirt_ident_64() It was found that this is only an issue when noreplace-paravirt is added to the kernel command line. This means that those functions are most likely called within critical sections of the funtion tracer, and must not be traced. In newer kenels _paravirt_nop() is defined within gcc asm(), and is no longer an issue. But both _paravirt_ident_{32,64}() causes the following splat when they are traced: mm/pgtable-generic.c:33: bad pmd ffff8800d2435150(0000000001d00054) mm/pgtable-generic.c:33: bad pmd ffff8800d3624190(0000000001d00070) mm/pgtable-generic.c:33: bad pmd ffff8800d36a5110(0000000001d00054) mm/pgtable-generic.c:33: bad pmd ffff880118eb1450(0000000001d00054) NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [systemd-journal:469] Modules linked in: e1000e CPU: 2 PID: 469 Comm: systemd-journal Not tainted 4.6.0-rc4-test+ #513 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012 task: ffff880118f740c0 ti: ffff8800d4aec000 task.ti: ffff8800d4aec000 RIP: 0010:[<ffffffff81134148>] [<ffffffff81134148>] queued_spin_lock_slowpath+0x118/0x1a0 RSP: 0018:ffff8800d4aefb90 EFLAGS: 00000246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88011eb16d40 RDX: ffffffff82485760 RSI: 000000001f288820 RDI: ffffea0000008030 RBP: ffff8800d4aefb90 R08: 00000000000c0000 R09: 0000000000000000 R10: ffffffff821c8e0e R11: 0000000000000000 R12: ffff880000200fb8 R13: 00007f7a4e3f7000 R14: ffffea000303f600 R15: ffff8800d4b562e0 FS: 00007f7a4e3d7840(0000) GS:ffff88011eb00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f7a4e3f7000 CR3: 00000000d3e71000 CR4: 00000000001406e0 Call Trace: _raw_spin_lock+0x27/0x30 handle_pte_fault+0x13db/0x16b0 handle_mm_fault+0x312/0x670 __do_page_fault+0x1b1/0x4e0 do_page_fault+0x22/0x30 page_fault+0x28/0x30 __vfs_read+0x28/0xe0 vfs_read+0x86/0x130 SyS_read+0x46/0xa0 entry_SYSCALL_64_fastpath+0x1e/0xa8 Code: 12 48 c1 ea 0c 83 e8 01 83 e2 30 48 98 48 81 c2 40 6d 01 00 48 03 14 c5 80 6a 5d 82 48 89 0a 8b 41 08 85 c0 75 09 f3 90 8b 41 08 <85> c0 74 f7 4c 8b 09 4d 85 c9 74 08 41 0f 18 09 eb 02 f3 90 8b Reported-by: Åukasz Daniluk <lukasz.daniluk@intel.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
Jiri Kosina authored
commit 39380b80 upstream. Currently it's possible for broken (or malicious) userspace to flood a kernel log indefinitely with messages a-la Program dmidecode tried to access /dev/mem between f0000->100000 because range_is_allowed() is case of CONFIG_STRICT_DEVMEM being turned on dumps this information each and every time devmem_is_allowed() fails. Reportedly userspace that is able to trigger contignuous flow of these messages exists. It would be possible to rate limit this message, but that'd have a questionable value; the administrator wouldn't get information about all the failing accessess, so then the information would be both superfluous and incomplete at the same time :) Returning EPERM (which is what is actually happening) is enough indication for userspace what has happened; no need to log this particular error as some sort of special condition. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hp.com> Link: http://lkml.kernel.org/r/alpine.LNX.2.00.1607081137020.24757@cbobk.fhfr.pmSigned-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
Wanpeng Li authored
commit 2e63ad4b upstream. native_smp_prepare_cpus -> default_setup_apic_routing -> enable_IR_x2apic -> irq_remapping_prepare -> intel_prepare_irq_remapping -> intel_setup_irq_remapping So IR table is setup even if "noapic" boot parameter is added. As a result we crash later when the interrupt affinity is set due to a half initialized remapping infrastructure. Prevent remap initialization when IOAPIC is disabled. Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Joerg Roedel <joro@8bytes.org> Link: http://lkml.kernel.org/r/1471954039-3942-1-git-send-email-wanpeng.li@hotmail.comSigned-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
Sebastian Andrzej Siewior authored
commit 5cf0791d upstream. There's a subtle preemption race on UP kernels: Usually current->mm (and therefore mm->pgd) stays the same during the lifetime of a task so it does not matter if a task gets preempted during the read and write of the CR3. But then, there is this scenario on x86-UP: TaskA is in do_exit() and exit_mm() sets current->mm = NULL followed by: -> mmput() -> exit_mmap() -> tlb_finish_mmu() -> tlb_flush_mmu() -> tlb_flush_mmu_tlbonly() -> tlb_flush() -> flush_tlb_mm_range() -> __flush_tlb_up() -> __flush_tlb() -> __native_flush_tlb() At this point current->mm is NULL but current->active_mm still points to the "old" mm. Let's preempt taskA _after_ native_read_cr3() by taskB. TaskB has its own mm so CR3 has changed. Now preempt back to taskA. TaskA has no ->mm set so it borrows taskB's mm and so CR3 remains unchanged. Once taskA gets active it continues where it was interrupted and that means it writes its old CR3 value back. Everything is fine because userland won't need its memory anymore. Now the fun part: Let's preempt taskA one more time and get back to taskB. This time switch_mm() won't do a thing because oldmm (->active_mm) is the same as mm (as per context_switch()). So we remain with a bad CR3 / PGD and return to userland. The next thing that happens is handle_mm_fault() with an address for the execution of its code in userland. handle_mm_fault() realizes that it has a PTE with proper rights so it returns doing nothing. But the CPU looks at the wrong PGD and insists that something is wrong and faults again. And again. And one more time⦠This pagefault circle continues until the scheduler gets tired of it and puts another task on the CPU. It gets little difficult if the task is a RT task with a high priority. The system will either freeze or it gets fixed by the software watchdog thread which usually runs at RT-max prio. But waiting for the watchdog will increase the latency of the RT task which is no good. Fix this by disabling preemption across the critical code section. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/1470404259-26290-1-git-send-email-bigeasy@linutronix.de [ Prettified the changelog. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
Andy Lutomirski authored
This is a backport of: commit fc0e81b2 upstream On the 80486 DX, it seems that some exceptions may leave garbage in the high bits of CS. This causes sporadic failures in which early_fixup_exception() refuses to fix up an exception. As far as I can tell, this has been buggy for a long time, but the problem seems to have been exacerbated by commits: 1e02ce4c ("x86: Store a per-cpu shadow copy of CR4") e1bfc11c ("x86/init: Fix cr4_init_shadow() on CR4-less machines") This appears to have broken for as long as we've had early exception handling. [ This backport should apply to kernels from 3.4 - 4.5. ] Fixes: 4c5023a3 ("x86-32: Handle exception table entries during early boot") Cc: H. Peter Anvin <hpa@zytor.com> Cc: stable@vger.kernel.org Reported-by: Matthew Whitehead <tedheadster@gmail.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
Juergen Gross authored
commit 1cf38741 upstream. xen_cleanhighmap() is operating on level2_kernel_pgt only. The upper bound of the loop setting non-kernel-image entries to zero should not exceed the size of level2_kernel_pgt. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
Ben Hutchings authored
commit 8014bcc8 upstream. The variable for the 'permissive' module parameter used to be static but was recently changed to be extern. This puts it in the kernel global namespace if the driver is built-in, so its name should begin with a prefix identifying the driver. Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Fixes: af6fc858 ("xen-pciback: limit guest control of command register") Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
Konrad Rzeszutek Wilk authored
commit 408fb0e5 upstream. commit f598282f ("PCI: Fix the NIU MSI-X problem in a better way") teaches us that dealing with MSI-X can be troublesome. Further checks in the MSI-X architecture shows that if the PCI_COMMAND_MEMORY bit is turned of in the PCI_COMMAND we may not be able to access the BAR (since they are memory regions). Since the MSI-X tables are located in there.. that can lead to us causing PCIe errors. Inhibit us performing any operation on the MSI-X unless the MEMORY bit is set. Note that Xen hypervisor with: "x86/MSI-X: access MSI-X table only after having enabled MSI-X" will return: xen_pciback: 0000:0a:00.1: error -6 enabling MSI-X for guest 3! When the generic MSI code tries to setup the PIRQ without MEMORY bit set. Which means with later versions of Xen (4.6) this patch is not neccessary. This is part of XSA-157 Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
Konrad Rzeszutek Wilk authored
commit 7cfb905b upstream. Otherwise just continue on, returning the same values as previously (return of 0, and op->result has the PIRQ value). This does not change the behavior of XEN_PCI_OP_disable_msi[|x]. The pci_disable_msi or pci_disable_msix have the checks for msi_enabled or msix_enabled so they will error out immediately. However the guest can still call these operations and cause us to disable the 'ack_intr'. That means the backend IRQ handler for the legacy interrupt will not respond to interrupts anymore. This will lead to (if the device is causing an interrupt storm) for the Linux generic code to disable the interrupt line. Naturally this will only happen if the device in question is plugged in on the motherboard on shared level interrupt GSI. This is part of XSA-157 Reviewed-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
Konrad Rzeszutek Wilk authored
commit a396f3a2 upstream. Otherwise an guest can subvert the generic MSI code to trigger an BUG_ON condition during MSI interrupt freeing: for (i = 0; i < entry->nvec_used; i++) BUG_ON(irq_has_action(entry->irq + i)); Xen PCI backed installs an IRQ handler (request_irq) for the dev->irq whenever the guest writes PCI_COMMAND_MEMORY (or PCI_COMMAND_IO) to the PCI_COMMAND register. This is done in case the device has legacy interrupts the GSI line is shared by the backend devices. To subvert the backend the guest needs to make the backend to change the dev->irq from the GSI to the MSI interrupt line, make the backend allocate an interrupt handler, and then command the backend to free the MSI interrupt and hit the BUG_ON. Since the backend only calls 'request_irq' when the guest writes to the PCI_COMMAND register the guest needs to call XEN_PCI_OP_enable_msi before any other operation. This will cause the generic MSI code to setup an MSI entry and populate dev->irq with the new PIRQ value. Then the guest can write to PCI_COMMAND PCI_COMMAND_MEMORY and cause the backend to setup an IRQ handler for dev->irq (which instead of the GSI value has the MSI pirq). See 'xen_pcibk_control_isr'. Then the guest disables the MSI: XEN_PCI_OP_disable_msi which ends up triggering the BUG_ON condition in 'free_msi_irqs' as there is an IRQ handler for the entry->irq (dev->irq). Note that this cannot be done using MSI-X as the generic code does not over-write dev->irq with the MSI-X PIRQ values. The patch inhibits setting up the IRQ handler if MSI or MSI-X (for symmetry reasons) code had been called successfully. P.S. Xen PCIBack when it sets up the device for the guest consumption ends up writting 0 to the PCI_COMMAND (see xen_pcibk_reset_device). XSA-120 addendum patch removed that - however when upstreaming said addendum we found that it caused issues with qemu upstream. That has now been fixed in qemu upstream. This is part of XSA-157 Reviewed-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
Konrad Rzeszutek Wilk authored
commit 5e0ce145 upstream. The guest sequence of: a) XEN_PCI_OP_enable_msix b) XEN_PCI_OP_enable_msix results in hitting an NULL pointer due to using freed pointers. The device passed in the guest MUST have MSI-X capability. The a) constructs and SysFS representation of MSI and MSI groups. The b) adds a second set of them but adding in to SysFS fails (duplicate entry). 'populate_msi_sysfs' frees the newly allocated msi_irq_groups (note that in a) pdev->msi_irq_groups is still set) and also free's ALL of the MSI-X entries of the device (the ones allocated in step a) and b)). The unwind code: 'free_msi_irqs' deletes all the entries and tries to delete the pdev->msi_irq_groups (which hasn't been set to NULL). However the pointers in the SysFS are already freed and we hit an NULL pointer further on when 'strlen' is attempted on a freed pointer. The patch adds a simple check in the XEN_PCI_OP_enable_msix to guard against that. The check for msi_enabled is not stricly neccessary. This is part of XSA-157 Reviewed-by: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
Konrad Rzeszutek Wilk authored
commit 56441f3c upstream. The guest sequence of: a) XEN_PCI_OP_enable_msi b) XEN_PCI_OP_enable_msi c) XEN_PCI_OP_disable_msi results in hitting an BUG_ON condition in the msi.c code. The MSI code uses an dev->msi_list to which it adds MSI entries. Under the above conditions an BUG_ON() can be hit. The device passed in the guest MUST have MSI capability. The a) adds the entry to the dev->msi_list and sets msi_enabled. The b) adds a second entry but adding in to SysFS fails (duplicate entry) and deletes all of the entries from msi_list and returns (with msi_enabled is still set). c) pci_disable_msi passes the msi_enabled checks and hits: BUG_ON(list_empty(dev_to_msi_list(&dev->dev))); and blows up. The patch adds a simple check in the XEN_PCI_OP_enable_msi to guard against that. The check for msix_enabled is not stricly neccessary. This is part of XSA-157. Reviewed-by: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
Konrad Rzeszutek Wilk authored
commit d159457b upstream. Commit 8135cf8b (xen/pciback: Save xen_pci_op commands before processing it) broke enabling MSI-X because it would never copy the resulting vectors into the response. The number of vectors requested was being overwritten by the return value (typically zero for success). Save the number of vectors before processing the op, so the correct number of vectors are copied afterwards. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: <stable@vger.kernel.org> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
Konrad Rzeszutek Wilk authored
commit 8135cf8b upstream. Double fetch vulnerabilities that happen when a variable is fetched twice from shared memory but a security check is only performed the first time. The xen_pcibk_do_op function performs a switch statements on the op->cmd value which is stored in shared memory. Interestingly this can result in a double fetch vulnerability depending on the performed compiler optimization. This patch fixes it by saving the xen_pci_op command before processing it. We also use 'barrier' to make sure that the compiler does not perform any optimization. This is part of XSA155. Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: "Jan Beulich" <JBeulich@suse.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
Roger Pau Monné authored
commit 1f13d75c upstream. A compiler may load a switch statement value multiple times, which could be bad when the value is in memory shared with the frontend. When converting a non-native request to a native one, ensure that src->operation is only loaded once by using READ_ONCE(). This is part of XSA155. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: "Jan Beulich" <JBeulich@suse.com> [wt: s/READ_ONCE/ACCESS_ONCE for 3.10] Signed-off-by: Willy Tarreau <w@1wt.eu>
-
David Vrabel authored
commit 68a33bfd upstream. Instead of open-coding memcpy()s and directly accessing Tx and Rx requests, use the new RING_COPY_REQUEST() that ensures the local copy is correct. This is more than is strictly necessary for guest Rx requests since only the id and gref fields are used and it is harmless if the frontend modifies these. This is part of XSA155. Reviewed-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> [wt: adjustments for 3.10 : netbk_rx_meta instead of struct xenvif_rx_meta] Signed-off-by: Willy Tarreau <w@1wt.eu>
-
David Vrabel authored
commit 0f589967 upstream. The last from guest transmitted request gives no indication about the minimum amount of credit that the guest might need to send a packet since the last packet might have been a small one. Instead allow for the worst case 128 KiB packet. This is part of XSA155. Reviewed-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
David Vrabel authored
commit 454d5d88 upstream. Using RING_GET_REQUEST() on a shared ring is easy to use incorrectly (i.e., by not considering that the other end may alter the data in the shared ring while it is being inspected). Safe usage of a request generally requires taking a local copy. Provide a RING_COPY_REQUEST() macro to use instead of RING_GET_REQUEST() and an open-coded memcpy(). This takes care of ensuring that the copy is done correctly regardless of any possible compiler optimizations. Use a volatile source to prevent the compiler from reordering or omitting the copy. This is part of XSA155. Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
Jan Beulich authored
commit 103f6112 upstream. Huge pages are not normally available to PV guests. Not suppressing hugetlbfs use results in an endless loop of page faults when user mode code tries to access a hugetlbfs mapped area (since the hypervisor denies such PTEs to be created, but error indications can't be propagated out of xen_set_pte_at(), just like for various of its siblings), and - once killed in an oops like this: kernel BUG at .../fs/hugetlbfs/inode.c:428! invalid opcode: 0000 [#1] SMP ... RIP: e030:[<ffffffff811c333b>] [<ffffffff811c333b>] remove_inode_hugepages+0x25b/0x320 ... Call Trace: [<ffffffff811c3415>] hugetlbfs_evict_inode+0x15/0x40 [<ffffffff81167b3d>] evict+0xbd/0x1b0 [<ffffffff8116514a>] __dentry_kill+0x19a/0x1f0 [<ffffffff81165b0e>] dput+0x1fe/0x220 [<ffffffff81150535>] __fput+0x155/0x200 [<ffffffff81079fc0>] task_work_run+0x60/0xa0 [<ffffffff81063510>] do_exit+0x160/0x400 [<ffffffff810637eb>] do_group_exit+0x3b/0xa0 [<ffffffff8106e8bd>] get_signal+0x1ed/0x470 [<ffffffff8100f854>] do_signal+0x14/0x110 [<ffffffff810030e9>] prepare_exit_to_usermode+0xe9/0xf0 [<ffffffff814178a5>] retint_user+0x8/0x13 This is CVE-2016-3961 / XSA-174. Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Juergen Gross <JGross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: xen-devel <xen-devel@lists.xenproject.org> Link: http://lkml.kernel.org/r/57188ED802000078000E431C@prv-mh.provo.novell.comSigned-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
WANG Cong authored
commit 205e1e25 upstream Matt reported that we have a NULL pointer dereference in ppp_pernet() from ppp_connect_channel(), i.e. pch->chan_net is NULL. This is due to that a parallel ppp_unregister_channel() could happen while we are in ppp_connect_channel(), during which pch->chan_net set to NULL. Since we need a reference to net per channel, it makes sense to sync the refcnt with the life time of the channel, therefore we should release this reference when we destroy it. Fixes: 1f461dcd ("ppp: take reference on channels netns") Reported-by: Matt Bennett <Matt.Bennett@alliedtelesis.co.nz> Cc: Paul Mackerras <paulus@samba.org> Cc: linux-ppp@vger.kernel.org Cc: Guillaume Nault <g.nault@alphalink.fr> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
Xiaolong Ye authored
commit 5f25f066 upstream time_in_state in struct devfreq is defined as unsigned long, so devm_kzalloc should use sizeof(unsigned long) as argument instead of sizeof(unsigned int), otherwise it will cause unexpected result in 64bit system. Signed-off-by: Xiaolong Ye <yexl@marvell.com> Signed-off-by: Kevin Liu <kliu5@marvell.com> Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
Ignacio Alvarado authored
commit 1650b4eb upstream. Function user_notifier_unregister should be called only once for each registered user notifier. Function kvm_arch_hardware_disable can be executed from an IPI context which could cause a race condition with a VCPU returning to user mode and attempting to unregister the notifier. Signed-off-by: Ignacio Alvarado <ikalvarado@google.com> Fixes: 18863bdd ("KVM: x86 shared msr infrastructure") Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim KrÄmáŠ<rkrcmar@redhat.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
Paolo Bonzini authored
commit 7301d6ab upstream. Reported by syzkaller: [ INFO: suspicious RCU usage. ] 4.9.0-rc4+ #47 Not tainted ------------------------------- ./include/linux/kvm_host.h:536 suspicious rcu_dereference_check() usage! stack backtrace: CPU: 1 PID: 6679 Comm: syz-executor Not tainted 4.9.0-rc4+ #47 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 ffff880039e2f6d0 ffffffff81c2e46b ffff88003e3a5b40 0000000000000000 0000000000000001 ffffffff83215600 ffff880039e2f700 ffffffff81334ea9 ffffc9000730b000 0000000000000004 ffff88003c4f8420 ffff88003d3f8000 Call Trace: [< inline >] __dump_stack lib/dump_stack.c:15 [<ffffffff81c2e46b>] dump_stack+0xb3/0x118 lib/dump_stack.c:51 [<ffffffff81334ea9>] lockdep_rcu_suspicious+0x139/0x180 kernel/locking/lockdep.c:4445 [< inline >] __kvm_memslots include/linux/kvm_host.h:534 [< inline >] kvm_memslots include/linux/kvm_host.h:541 [<ffffffff8105d6ae>] kvm_gfn_to_hva_cache_init+0xa1e/0xce0 virt/kvm/kvm_main.c:1941 [<ffffffff8112685d>] kvm_lapic_set_vapic_addr+0xed/0x140 arch/x86/kvm/lapic.c:2217 Reported-by: Dmitry Vyukov <dvyukov@google.com> Fixes: fda4e2e8 Cc: Andrew Honig <ahonig@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim KrÄmáŠ<rkrcmar@redhat.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
Ido Yariv authored
commit bd768e14 upstream. vcpu->arch.wbinvd_dirty_mask may still be used after freeing it, corrupting memory. For example, the following call trace may set a bit in an already freed cpu mask: kvm_arch_vcpu_load vcpu_load vmx_free_vcpu_nested vmx_free_vcpu kvm_arch_vcpu_free Fix this by deferring freeing of wbinvd_dirty_mask. Signed-off-by: Ido Yariv <ido@wizery.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim KrÄmáŠ<rkrcmar@redhat.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
James Hogan authored
commit ede5f3e7 upstream. The ERET instruction to return from exception is used for returning from exception level (Status.EXL) and error level (Status.ERL). If both bits are set however we should be returning from ERL first, as ERL can interrupt EXL, for example when an NMI is taken. KVM however checks EXL first. Fix the order of the checks to match the pseudocode in the instruction set manual. Fixes: e685c689 ("KVM/MIPS32: Privileged instruction/target branch emulation.") Signed-off-by: James Hogan <james.hogan@imgtec.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Radim KrÄmáÅ" <rkrcmar@redhat.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: linux-mips@linux-mips.org Cc: kvm@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
Radim Kr�má� authored
commit dccbfcf5 upstream. If vmcs12 does not intercept APIC_BASE writes, then KVM will handle the write with vmcs02 as the current VMCS. This will incorrectly apply modifications intended for vmcs01 to vmcs02 and L2 can use it to gain access to L0's x2APIC registers by disabling virtualized x2APIC while using msr bitmap that assumes enabled. Postpone execution of vmx_set_virtual_x2apic_mode until vmcs01 is the current VMCS. An alternative solution would temporarily make vmcs01 the current VMCS, but it requires more care. Fixes: 8d14695f ("x86, apicv: add virtual x2apic support") Reported-by: Jim Mattson <jmattson@google.com> Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Radim KrÄmáŠ<rkrcmar@redhat.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
James Hogan authored
commit 91e4f1b6 upstream. When a guest TLB entry is replaced by TLBWI or TLBWR, we only invalidate TLB entries on the local CPU. This doesn't work correctly on an SMP host when the guest is migrated to a different physical CPU, as it could pick up stale TLB mappings from the last time the vCPU ran on that physical CPU. Therefore invalidate both user and kernel host ASIDs on other CPUs, which will cause new ASIDs to be generated when it next runs on those CPUs. We're careful only to do this if the TLB entry was already valid, and only for the kernel ASID where the virtual address it mapped is outside of the guest user address range. Signed-off-by: James Hogan <james.hogan@imgtec.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Radim KrÄmáÅ" <rkrcmar@redhat.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: linux-mips@linux-mips.org Cc: kvm@vger.kernel.org Cc: <stable@vger.kernel.org> # 3.10.x- Cc: Jiri Slaby <jslaby@suse.cz> [james.hogan@imgtec.com: Backport to 3.10..3.16] Signed-off-by: James Hogan <james.hogan@imgtec.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
James Hogan authored
commit e1e575f6 upstream. The advancing of the PC when completing an MMIO load is done before re-entering the guest, i.e. before restoring the guest ASID. However if the load is in a branch delay slot it may need to access guest code to read the prior branch instruction. This isn't safe in TLB mapped code at the moment, nor in the future when we'll access unmapped guest segments using direct user accessors too, as it could read the branch from host user memory instead. Therefore calculate the resume PC in advance while we're still in the right context and save it in the new vcpu->arch.io_pc (replacing the no longer needed vcpu->arch.pending_load_cause), and restore it on MMIO completion. Fixes: e685c689 ("KVM/MIPS32: Privileged instruction/target branch emulation.") Signed-off-by: James Hogan <james.hogan@imgtec.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Radim KrÄmáŠ<rkrcmar@redhat.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: linux-mips@linux-mips.org Cc: kvm@vger.kernel.org Cc: <stable@vger.kernel.org> # 3.10.x-3.16.x: 5f508c43: MIPS: KVM: Fix unused variable build warning Cc: <stable@vger.kernel.org> # 3.10.x-3.16.x Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [james.hogan@imgtec.com: Backport to 3.10..3.16] Signed-off-by: James Hogan <james.hogan@imgtec.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
-
Nicholas Mc Guire authored
commit 5f508c43 upstream. As kvm_mips_complete_mmio_load() did not yet modify PC at this point as James Hogans <james.hogan@imgtec.com> explained the curr_pc variable and the comments along with it can be dropped. Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org> Link: http://lkml.org/lkml/2015/5/8/422 Cc: Gleb Natapov <gleb@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: kvm@vger.kernel.org Cc: linux-mips@linux-mips.org Cc: linux-kernel@vger.kernel.org Patchwork: https://patchwork.linux-mips.org/patch/9993/Signed-off-by: Ralf Baechle <ralf@linux-mips.org> [james.hogan@imgtec.com: Backport to 3.10..3.16] Signed-off-by: James Hogan <james.hogan@imgtec.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
-