- 10 Sep, 2019 3 commits
-
-
Paolo Bonzini authored
Merge tag 'kvm-ppc-next-5.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into HEAD PPC KVM update for 5.4 - Some prep for extending the uses of the rmap array - Various minor fixes - Commits from the powerpc topic/ppc-kvm branch, which fix a problem with interrupts arriving after free_irq, causing host hangs and crashes.
-
Sean Christopherson authored
Manually generate the PDPTR reserved bit mask when explicitly loading PDPTRs. The reserved bits that are being tracked by the MMU reflect the current paging mode, which is unlikely to be PAE paging in the vast majority of flows that use load_pdptrs(), e.g. CR0 and CR4 emulation, __set_sregs(), etc... This can cause KVM to incorrectly signal a bad PDPTR, or more likely, miss a reserved bit check and subsequently fail a VM-Enter due to a bad VMCS.GUEST_PDPTR. Add a one off helper to generate the reserved bits instead of sharing code across the MMU's calculations and the PDPTR emulation. The PDPTR reserved bits are basically set in stone, and pushing a helper into the MMU's calculation adds unnecessary complexity without improving readability. Oppurtunistically fix/update the comment for load_pdptrs(). Note, the buggy commit also introduced a deliberate functional change, "Also remove bit 5-6 from rsvd_bits_mask per latest SDM.", which was effectively (and correctly) reverted by commit cd9ae5fe ("KVM: x86: Fix page-tables reserved bits"). A bit of SDM archaeology shows that the SDM from late 2008 had a bug (likely a copy+paste error) where it listed bits 6:5 as AVL and A for PDPTEs used for 4k entries but reserved for 2mb entries. I.e. the SDM contradicted itself, and bits 6:5 are and always have been reserved. Fixes: 20c466b5 ("KVM: Use rsvd_bits_mask in load_pdptrs()") Cc: stable@vger.kernel.org Cc: Nadav Amit <nadav.amit@gmail.com> Reported-by: Doug Reiland <doug.reiland@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Alexander Graf authored
We can easily route hardware interrupts directly into VM context when they target the "Fixed" or "LowPriority" delivery modes. However, on modes such as "SMI" or "Init", we need to go via KVM code to actually put the vCPU into a different mode of operation, so we can not post the interrupt Add code in the VMX and SVM PI logic to explicitly refuse to establish posted mappings for advanced IRQ deliver modes. This reflects the logic in __apic_accept_irq() which also only ever passes Fixed and LowPriority interrupts as posted interrupts into the guest. This fixes a bug I have with code which configures real hardware to inject virtual SMIs into my guest. Signed-off-by: Alexander Graf <graf@amazon.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Wanpeng Li <wanpengli@tencent.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 27 Aug, 2019 3 commits
-
-
Paul Mackerras authored
On POWER9, when userspace reads the value of the DPDES register on a vCPU, it is possible for 0 to be returned although there is a doorbell interrupt pending for the vCPU. This can lead to a doorbell interrupt being lost across migration. If the guest kernel uses doorbell interrupts for IPIs, then it could malfunction because of the lost interrupt. This happens because a newly-generated doorbell interrupt is signalled by setting vcpu->arch.doorbell_request to 1; the DPDES value in vcpu->arch.vcore->dpdes is not updated, because it can only be updated when holding the vcpu mutex, in order to avoid races. To fix this, we OR in vcpu->arch.doorbell_request when reading the DPDES value. Cc: stable@vger.kernel.org # v4.13+ Fixes: 57900694 ("KVM: PPC: Book3S HV: Virtualize doorbell facility on POWER9") Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
-
Paul Mackerras authored
When we are running multiple vcores on the same physical core, they could be from different VMs and so it is possible that one of the VMs could have its arch.mmu_ready flag cleared (for example by a concurrent HPT resize) when we go to run it on a physical core. We currently check the arch.mmu_ready flag for the primary vcore but not the flags for the other vcores that will be run alongside it. This adds that check, and also a check when we select the secondary vcores from the preempted vcores list. Cc: stable@vger.kernel.org # v4.14+ Fixes: 38c53af8 ("KVM: PPC: Book3S HV: Fix exclusion between HPT resizing and other HPT updates") Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
-
Paul Mackerras authored
There are some POWER9 machines where the OPAL firmware does not support the OPAL_XIVE_GET_QUEUE_STATE and OPAL_XIVE_SET_QUEUE_STATE calls. The impact of this is that a guest using XIVE natively will not be able to be migrated successfully. On the source side, the get_attr operation on the KVM native device for the KVM_DEV_XIVE_GRP_EQ_CONFIG attribute will fail; on the destination side, the set_attr operation for the same attribute will fail. This adds tests for the existence of the OPAL get/set queue state functions, and if they are not supported, the XIVE-native KVM device is not created and the KVM_CAP_PPC_IRQ_XIVE capability returns false. Userspace can then either provide a software emulation of XIVE, or else tell the guest that it does not have a XIVE controller available to it. Cc: stable@vger.kernel.org # v5.2+ Fixes: 3fab2d10 ("KVM: PPC: Book3S HV: XIVE: Activate XIVE exploitation mode") Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
-
- 23 Aug, 2019 3 commits
-
-
Suraj Jitindar Singh authored
The rmap array in the guest memslot is an array of size number of guest pages, allocated at memslot creation time. Each rmap entry in this array is used to store information about the guest page to which it corresponds. For example for a hpt guest it is used to store a lock bit, rc bits, a present bit and the index of a hpt entry in the guest hpt which maps this page. For a radix guest which is running nested guests it is used to store a pointer to a linked list of nested rmap entries which store the nested guest physical address which maps this guest address and for which there is a pte in the shadow page table. As there are currently two uses for the rmap array, and the potential for this to expand to more in the future, define a type field (being the top 8 bits of the rmap entry) to be used to define the type of the rmap entry which is currently present and define two values for this field for the two current uses of the rmap array. Since the nested case uses the rmap entry to store a pointer, define this type as having the two high bits set as is expected for a pointer. Define the hpt entry type as having bit 56 set (bit 7 IBM bit ordering). Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
-
Paul Menzel authored
Fix the error below triggered by `-Wimplicit-fallthrough`, by tagging it as an expected fall-through. arch/powerpc/kvm/book3s_32_mmu.c: In function ‘kvmppc_mmu_book3s_32_xlate_pte’: arch/powerpc/kvm/book3s_32_mmu.c:241:21: error: this statement may fall through [-Werror=implicit-fallthrough=] pte->may_write = true; ~~~~~~~~~~~~~~~^~~~~~ arch/powerpc/kvm/book3s_32_mmu.c:242:5: note: here case 3: ^~~~ Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
-
Paul Mackerras authored
This merges in fixes for the XIVE interrupt controller which touch both generic powerpc and PPC KVM code. To avoid merge conflicts, these commits will go upstream via the powerpc tree as well as the KVM tree. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
-
- 22 Aug, 2019 18 commits
-
-
Sean Christopherson authored
Fix an incorrect/stale comment regarding the vmx_vcpu pointer, as guest registers are now loaded using a direct pointer to the start of the register array. Opportunistically add a comment to document why the vmx_vcpu pointer is needed, its consumption via 'call vmx_update_host_rsp' is rather subtle. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
KVM implementations that wrap struct kvm_vcpu with a vendor specific struct, e.g. struct vcpu_vmx, must place the vcpu member at offset 0, otherwise the usercopy region intended to encompass struct kvm_vcpu_arch will instead overlap random chunks of the vendor specific struct. E.g. padding a large number of bytes before struct kvm_vcpu triggers a usercopy warn when running with CONFIG_HARDENED_USERCOPY=y. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Wanpeng Li authored
Add pv tlb shootdown tracepoint. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Remove a few stale checks for non-NULL ops now that the ops in question are implemented by both VMX and SVM. Note, this is **not** stable material, the Fixes tags are there purely to show when a particular op was first supported by both VMX and SVM. Fixes: 74f16909 ("kvm/svm: Setup MCG_CAP on AMD properly") Fixes: b31c114b ("KVM: X86: Provide a capability to disable PAUSE intercepts") Fixes: 411b44ba ("svm: Implements update_pi_irte hook to setup posted interrupt") Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Replace the open-coded "is MMIO SPTE" checks in the MMU warnings related to software-based access/dirty tracking to make the code slightly more self-documenting. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
When shadow paging is enabled, KVM tracks the allowed access type for MMIO SPTEs so that it can do a permission check on a MMIO GVA cache hit without having to walk the guest's page tables. The tracking is done by retaining the WRITE and USER bits of the access when inserting the MMIO SPTE (read access is implicitly allowed), which allows the MMIO page fault handler to retrieve and cache the WRITE/USER bits from the SPTE. Unfortunately for EPT, the mask used to retain the WRITE/USER bits is hardcoded using the x86 paging versions of the bits. This funkiness happens to work because KVM uses a completely different mask/value for MMIO SPTEs when EPT is enabled, and the EPT mask/value just happens to overlap exactly with the x86 WRITE/USER bits[*]. Explicitly define the access mask for MMIO SPTEs to accurately reflect that EPT does not want to incorporate any access bits into the SPTE, and so that KVM isn't subtly relying on EPT's WX bits always being set in MMIO SPTEs, e.g. attempting to use other bits for experimentation breaks horribly. Note, vcpu_match_mmio_gva() explicits prevents matching GVA==0, and all TDP flows explicit set mmio_gva to 0, i.e. zeroing vcpu->arch.access for EPT has no (known) functional impact. [*] Using WX to generate EPT misconfigurations (equivalent to reserved bit page fault) ensures KVM can employ its MMIO page fault tricks even platforms without reserved address bits. Fixes: ce88decf ("KVM: MMU: mmio page fault support") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Rename "access" to "mmio_access" to match the other MMIO cache members and to make it more obvious that it's tracking the access permissions for the MMIO cache. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Vitaly Kuznetsov authored
Just like we do with other intercepts, in vmrun_interception() we should be doing kvm_skip_emulated_instruction() and not just RIP += 3. Also, it is wrong to increment RIP before nested_svm_vmrun() as it can result in kvm_inject_gp(). We can't call kvm_skip_emulated_instruction() after nested_svm_vmrun() so move it inside. Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Vitaly Kuznetsov authored
Regardless of whether or not nested_svm_vmrun_msrpm() fails, we return 1 from vmrun_interception() so there's no point in doing goto. Also, nested_svm_vmrun_msrpm() call can be made from nested_svm_vmrun() where other nested launch issues are handled. nested_svm_vmrun() returns a bool, however, its result is ignored in vmrun_interception() as we always return '1'. As a preparatory change to putting kvm_skip_emulated_instruction() inside nested_svm_vmrun() make nested_svm_vmrun() return an int (always '1' for now). Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Vitaly Kuznetsov authored
Various intercepts hard-code the respective instruction lengths to optimize skip_emulated_instruction(): when next_rip is pre-set we skip kvm_emulate_instruction(vcpu, EMULTYPE_SKIP). The optimization is, however, incorrect: different (redundant) prefixes could be used to enlarge the instruction. We can't really avoid decoding. svm->next_rip is not used when CPU supports 'nrips' (X86_FEATURE_NRIPS) feature: next RIP is provided in VMCB. The feature is not really new (Opteron G3s had it already) and the change should have zero affect. Remove manual svm->next_rip setting with hard-coded instruction lengths. The only case where we now use svm->next_rip is EXIT_IOIO: the instruction length is provided to us by hardware. Hardcoded RIP advancement remains in vmrun_interception(), this is going to be taken care of separately. Reported-by: Jim Mattson <jmattson@google.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Vitaly Kuznetsov authored
To avoid hardcoding xsetbv length to '3' we need to support decoding it in the emulator. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Vitaly Kuznetsov authored
When doing x86_emulate_instruction(EMULTYPE_SKIP) interrupt shadow has to be cleared if and only if the skipping is successful. There are two immediate issues: - In SVM skip_emulated_instruction() we are not zapping interrupt shadow in case kvm_emulate_instruction(EMULTYPE_SKIP) is used to advance RIP (!nrpip_save). - In VMX handle_ept_misconfig() when running as a nested hypervisor we (static_cpu_has(X86_FEATURE_HYPERVISOR) case) forget to clear interrupt shadow. Note that we intentionally don't handle the case when the skipped instruction is supposed to prolong the interrupt shadow ("MOV/POP SS") as skip-emulation of those instructions should not happen under normal circumstances. Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Vitaly Kuznetsov authored
On AMD, kvm_x86_ops->skip_emulated_instruction(vcpu) can, in theory, fail: in !nrips case we call kvm_emulate_instruction(EMULTYPE_SKIP). Currently, we only do printk(KERN_DEBUG) when this happens and this is not ideal. Propagate the error up the stack. On VMX, skip_emulated_instruction() doesn't fail, we have two call sites calling it explicitly: handle_exception_nmi() and handle_task_switch(), we can just ignore the result. On SVM, we also have two explicit call sites: svm_queue_exception() and it seems we don't need to do anything there as we check if RIP was advanced or not. In task_switch_interception(), however, we are better off not proceeding to kvm_task_switch() in case skip_emulated_instruction() failed. Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Vitaly Kuznetsov authored
svm->next_rip is only used by skip_emulated_instruction() and in case kvm_set_msr() fails we rightfully don't do that. Move svm->next_rip advancement to 'else' branch to avoid creating false impression that it's always advanced (and make it look like rdmsr_interception()). This is a preparatory change to removing hardcoded RIP advancement from instruction intercepts, no functional change. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Jump to the common error handling in x86_decode_insn() if __do_insn_fetch_bytes() fails so that its error code is converted to the appropriate return type. Although the various helpers used by x86_decode_insn() return X86EMUL_* values, x86_decode_insn() itself returns EMULATION_FAILED or EMULATION_OK. This doesn't cause a functional issue as the sole caller, x86_emulate_instruction(), currently only cares about success vs. failure, and success is indicated by '0' for both types (X86EMUL_CONTINUE and EMULATION_OK). Fixes: 285ca9e9 ("KVM: emulate: speed up do_insn_fetch") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Similar to AMD bits, set the Intel bits from the vendor-independent feature and bug flags, because KVM_GET_SUPPORTED_CPUID does not care about the vendor and they should be set on AMD processors as well. Suggested-by: Jim Mattson <jmattson@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Even though it is preferrable to use SPEC_CTRL (represented by X86_FEATURE_AMD_SSBD) instead of VIRT_SPEC, VIRT_SPEC is always supported anyway because otherwise it would be impossible to migrate from old to new CPUs. Make this apparent in the result of KVM_GET_SUPPORTED_CPUID as well. However, we need to hide the bit on Intel processors, so move the setting to svm_set_supported_cpuid. Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reported-by: Eduardo Habkost <ehabkost@redhat.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
The AMD_* bits have to be set from the vendor-independent feature and bug flags, because KVM_GET_SUPPORTED_CPUID does not care about the vendor and they should be set on Intel processors as well. On top of this, SSBD, STIBP and AMD_SSB_NO bit were not set, and VIRT_SSBD does not have to be added manually because it is a cpufeature that comes directly from the host's CPUID bit. Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 21 Aug, 2019 3 commits
-
-
Vitaly Kuznetsov authored
test_msr_platform_info_disabled() generates EXIT_SHUTDOWN but VMCB state is undefined after that so an attempt to launch this guest again from test_msr_platform_info_enabled() fails. Reorder the tests to make test pass. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
This reverts commit 4e103134. Alex Williamson reported regressions with device assignment with this patch. Even though the bug is probably elsewhere and still latent, this is needed to fix the regression. Fixes: 4e103134 ("KVM: x86/mmu: Zap only the relevant pages when removing a memslot", 2019-02-05) Reported-by: Alex Willamson <alex.williamson@redhat.com> Cc: stable@vger.kernel.org Cc: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
state_test and smm_test are failing on older processors that do not have xcr0. This is because on those processor KVM does provide support for KVM_GET/SET_XSAVE (to avoid having to rely on the older KVM_GET/SET_FPU) but not for KVM_GET/SET_XCRS. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 16 Aug, 2019 4 commits
-
-
Paul Mackerras authored
Testing has revealed the existence of a race condition where a XIVE interrupt being shut down can be in one of the XIVE interrupt queues (of which there are up to 8 per CPU, one for each priority) at the point where free_irq() is called. If this happens, can return an interrupt number which has been shut down. This can lead to various symptoms: - irq_to_desc(irq) can be NULL. In this case, no end-of-interrupt function gets called, resulting in the CPU's elevated interrupt priority (numerically lowered CPPR) never gets reset. That then means that the CPU stops processing interrupts, causing device timeouts and other errors in various device drivers. - The irq descriptor or related data structures can be in the process of being freed as the interrupt code is using them. This typically leads to crashes due to bad pointer dereferences. This race is basically what commit 62e04686 ("genirq: Add optional hardware synchronization for shutdown", 2019-06-28) is intended to fix, given a get_irqchip_state() method for the interrupt controller being used. It works by polling the interrupt controller when an interrupt is being freed until the controller says it is not pending. With XIVE, the PQ bits of the interrupt source indicate the state of the interrupt source, and in particular the P bit goes from 0 to 1 at the point where the hardware writes an entry into the interrupt queue that this interrupt is directed towards. Normally, the code will then process the interrupt and do an end-of-interrupt (EOI) operation which will reset PQ to 00 (assuming another interrupt hasn't been generated in the meantime). However, there are situations where the code resets P even though a queue entry exists (for example, by setting PQ to 01, which disables the interrupt source), and also situations where the code leaves P at 1 after removing the queue entry (for example, this is done for escalation interrupts so they cannot fire again until they are explicitly re-enabled). The code already has a 'saved_p' flag for the interrupt source which indicates that a queue entry exists, although it isn't maintained consistently. This patch adds a 'stale_p' flag to indicate that P has been left at 1 after processing a queue entry, and adds code to set and clear saved_p and stale_p as necessary to maintain a consistent indication of whether a queue entry may or may not exist. With this, we can implement xive_get_irqchip_state() by looking at stale_p, saved_p and the ESB PQ bits for the interrupt. There is some additional code to handle escalation interrupts properly; because they are enabled and disabled in KVM assembly code, which does not have access to the xive_irq_data struct for the escalation interrupt. Hence, stale_p may be incorrect when the escalation interrupt is freed in kvmppc_xive_{,native_}cleanup_vcpu(). Fortunately, we can fix it up by looking at vcpu->arch.xive_esc_on, with some careful attention to barriers in order to ensure the correct result if xive_esc_irq() races with kvmppc_xive_cleanup_vcpu(). Finally, this adds code to make noise on the console (pr_crit and WARN_ON(1)) if we find an interrupt queue entry for an interrupt which does not have a descriptor. While this won't catch the race reliably, if it does get triggered it will be an indication that the race is occurring and needs to be debugged. Fixes: 243e2511 ("powerpc/xive: Native exploitation of the XIVE interrupt controller") Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190813100648.GE9567@blackberry
-
Paul Mackerras authored
At present, when running a guest on POWER9 using HV KVM but not using an in-kernel interrupt controller (XICS or XIVE), for example if QEMU is run with the kernel_irqchip=off option, the guest entry code goes ahead and tries to load the guest context into the XIVE hardware, even though no context has been set up. To fix this, we check that the "CAM word" is non-zero before pushing it to the hardware. The CAM word is initialized to a non-zero value in kvmppc_xive_connect_vcpu() and kvmppc_xive_native_connect_vcpu(), and is now cleared in kvmppc_xive_{,native_}cleanup_vcpu. Fixes: 5af50993 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller") Cc: stable@vger.kernel.org # v4.12+ Reported-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190813100100.GC9567@blackberry
-
Paul Mackerras authored
Escalation interrupts are interrupts sent to the host by the XIVE hardware when it has an interrupt to deliver to a guest VCPU but that VCPU is not running anywhere in the system. Hence we disable the escalation interrupt for the VCPU being run when we enter the guest and re-enable it when the guest does an H_CEDE hypercall indicating it is idle. It is possible that an escalation interrupt gets generated just as we are entering the guest. In that case the escalation interrupt may be using a queue entry in one of the interrupt queues, and that queue entry may not have been processed when the guest exits with an H_CEDE. The existing entry code detects this situation and does not clear the vcpu->arch.xive_esc_on flag as an indication that there is a pending queue entry (if the queue entry gets processed, xive_esc_irq() will clear the flag). There is a comment in the code saying that if the flag is still set on H_CEDE, we have to abort the cede rather than re-enabling the escalation interrupt, lest we end up with two occurrences of the escalation interrupt in the interrupt queue. However, the exit code doesn't do that; it aborts the cede in the sense that vcpu->arch.ceded gets cleared, but it still enables the escalation interrupt by setting the source's PQ bits to 00. Instead we need to set the PQ bits to 10, indicating that an interrupt has been triggered. We also need to avoid setting vcpu->arch.xive_esc_on in this case (i.e. vcpu->arch.xive_esc_on seen to be set on H_CEDE) because xive_esc_irq() will run at some point and clear it, and if we race with that we may end up with an incorrect result (i.e. xive_esc_on set when the escalation interrupt has just been handled). It is extremely unlikely that having two queue entries would cause observable problems; theoretically it could cause queue overflow, but the CPU would have to have thousands of interrupts targetted to it for that to be possible. However, this fix will also make it possible to determine accurately whether there is an unhandled escalation interrupt in the queue, which will be needed by the following patch. Fixes: 9b9b13a6 ("KVM: PPC: Book3S HV: Keep XIVE escalation interrupt masked unless ceded") Cc: stable@vger.kernel.org # v4.16+ Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190813100349.GD9567@blackberry
-
Cédric Le Goater authored
When a vCPU is brought done, the XIVE VP (Virtual Processor) is first disabled and then the event notification queues are freed. When freeing the queues, we check for possible escalation interrupts and free them also. But when a XIVE VP is disabled, the underlying XIVE ENDs also are disabled in OPAL. When an END (Event Notification Descriptor) is disabled, its ESB pages (ESn and ESe) are disabled and loads return all 1s. Which means that any access on the ESB page of the escalation interrupt will return invalid values. When an interrupt is freed, the shutdown handler computes a 'saved_p' field from the value returned by a load in xive_do_source_set_mask(). This value is incorrect for escalation interrupts for the reason described above. This has no impact on Linux/KVM today because we don't make use of it but we will introduce in future changes a xive_get_irqchip_state() handler. This handler will use the 'saved_p' field to return the state of an interrupt and 'saved_p' being incorrect, softlockup will occur. Fix the vCPU cleanup sequence by first freeing the escalation interrupts if any, then disable the XIVE VP and last free the queues. Fixes: 90c73795 ("KVM: PPC: Book3S HV: Add a new KVM device for the XIVE native exploitation mode") Fixes: 5af50993 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller") Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190806172538.5087-1-clg@kaod.org
-
- 15 Aug, 2019 3 commits
-
-
Paolo Bonzini authored
vmx_set_nested_state_test is trying to use the KVM_STATE_NESTED_EVMCS without enabling enlightened VMCS first. Correct the outcome of the test, and actually test that it succeeds after the capability is enabled. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
There are two tests already enabling eVMCS and a third is coming. Add a function that enables the capability and tests the result. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
This test is only covering various edge cases of the KVM_SET_NESTED_STATE ioctl. Running the VM does not really add anything. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 14 Aug, 2019 3 commits
-
-
Miaohe Lin authored
new_entry is reassigned a new value next line. So it's redundant and remove it. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
This is probably overdue---KVM x86 has quite a few contributors that usually review each other's patches, which is really helpful to me. Formalize this by listing them as reviewers. I am including people with various expertise: - Joerg for SVM (with designated reviewers, it makes more sense to have him in the main KVM/x86 stanza) - Sean for MMU and VMX - Jim for VMX - Vitaly for Hyper-V and possibly SVM - Wanpeng for LAPIC and paravirtualization. Please ack if you are okay with this arrangement, otherwise speak up. In other news, Radim is going to leave Red Hat soon. However, he has not been very much involved in upstream KVM development for some time, and in the immediate future he is still going to help maintain kvm/queue while I am on vacation. Since not much is going to change, I will let him decide whether he wants to keep the maintainer role after he leaves. Acked-by: Joerg Roedel <joro@8bytes.org> Acked-by: Vitaly Kuznetsov <vkuznets@redhat.com> Acked-by: Wanpeng Li <wanpengli@tencent.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
KVM/s390 does not have a list of its own, and linux-s390 is in the loop anyway thanks to the generic arch/s390 match. So use the generic KVM list for s390 patches. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-