- 13 Aug, 2021 11 commits
-
-
Sean Christopherson authored
Now that nested VMX pulls KVM's desired VMCS controls from vmcs01 instead of re-calculating on the fly, bury the helpers that do the calcluations in vmx.c. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210810171952.2758100-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Remove the secondary execution controls cache now that it's effectively dead code; it is only read immediately after it is written. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210810171952.2758100-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
When preparing controls for vmcs02, grab KVM's desired controls from vmcs01's shadow state instead of recalculating the controls from scratch, or in the secondary execution controls, instead of using the dedicated cache. Calculating secondary exec controls is eye-poppingly expensive due to the guest CPUID checks, hence the dedicated cache, but the other calculations aren't exactly free either. Explicitly clear several bits (x2APIC, DESC exiting, and load EFER on exit) as appropriate as they may be set in vmcs01, whereas the previous implementation relied on dynamic bits being cleared in the calculator. Intentionally propagate VM_{ENTRY,EXIT}_LOAD_IA32_PERF_GLOBAL_CTRL from vmcs01 to vmcs02. Whether or not PERF_GLOBAL_CTRL is loaded depends on whether or not perf itself is active, so unless perf stops between the exit from L1 and entry to L2, vmcs01 will hold the desired value. This is purely an optimization as atomic_switch_perf_msrs() will set/clear the control as needed at VM-Enter, i.e. it avoids two extra VMWRITEs in the case where perf is active (versus starting with the bits clear in vmcs02, which was the previous behavior). Cc: Zeng Guang <guang.zeng@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210810171952.2758100-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
These stores are copied and pasted from the "if" statements above. They are dead and while they are not really a bug, they can be confusing to anyone reading the code as well. Remove them. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
The commit efdab992 ("KVM: x86: fix escape of guest dr6 to the host") fixed a bug by resetting DR6 unconditionally when the vcpu being scheduled out. But writing to debug registers is slow, and it can be visible in perf results sometimes, even if neither the host nor the guest activate breakpoints. Since KVM_DEBUGREG_WONT_EXIT on Intel processors is the only case where DR6 gets the guest value, and it never happens at all on SVM, the register can be cleared in vmx.c right after reading it. Reported-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Commit c77fb5fe ("KVM: x86: Allow the guest to run with dirty debug registers") allows the guest accessing to DRs without exiting when KVM_DEBUGREG_WONT_EXIT and we need to ensure that they are synchronized on entry to the guest---including DR6 that was not synced before the commit. But the commit sets the hardware DR6 not only when KVM_DEBUGREG_WONT_EXIT, but also when KVM_DEBUGREG_BP_ENABLED. The second case is unnecessary and just leads to a more case which leaks stale DR6 to the host which has to be resolved by unconditionally reseting DR6 in kvm_arch_vcpu_put(). Even if KVM_DEBUGREG_WONT_EXIT, however, setting the host DR6 only matters on VMX because SVM always uses the DR6 value from the VMCB. So move this line to vmx.c and make it conditional on KVM_DEBUGREG_WONT_EXIT. Reported-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Lai Jiangshan authored
Commit ae561ede ("KVM: x86: DR0-DR3 are not clear on reset") added code to ensure eff_db are updated when they're modified through non-standard paths. But there is no reason to also update hardware DRs unless hardware breakpoints are active or DR exiting is disabled, and in those cases updating hardware is handled by KVM_DEBUGREG_WONT_EXIT and KVM_DEBUGREG_BP_ENABLED. KVM_DEBUGREG_RELOAD just causes unnecesarry load of hardware DRs and is better to be removed. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20210809174307.145263-1-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Merge topic branch with fixes for 5.14-rc6 and 5.15 merge window.
-
Sean Christopherson authored
Add yet another spinlock for the TDP MMU and take it when marking indirect shadow pages unsync. When using the TDP MMU and L1 is running L2(s) with nested TDP, KVM may encounter shadow pages for the TDP entries managed by L1 (controlling L2) when handling a TDP MMU page fault. The unsync logic is not thread safe, e.g. the kvm_mmu_page fields are not atomic, and misbehaves when a shadow page is marked unsync via a TDP MMU page fault, which runs with mmu_lock held for read, not write. Lack of a critical section manifests most visibly as an underflow of unsync_children in clear_unsync_child_bit() due to unsync_children being corrupted when multiple CPUs write it without a critical section and without atomic operations. But underflow is the best case scenario. The worst case scenario is that unsync_children prematurely hits '0' and leads to guest memory corruption due to KVM neglecting to properly sync shadow pages. Use an entirely new spinlock even though piggybacking tdp_mmu_pages_lock would functionally be ok. Usurping the lock could degrade performance when building upper level page tables on different vCPUs, especially since the unsync flow could hold the lock for a comparatively long time depending on the number of indirect shadow pages and the depth of the paging tree. For simplicity, take the lock for all MMUs, even though KVM could fairly easily know that mmu_lock is held for write. If mmu_lock is held for write, there cannot be contention for the inner spinlock, and marking shadow pages unsync across multiple vCPUs will be slow enough that bouncing the kvm_arch cacheline should be in the noise. Note, even though L2 could theoretically be given access to its own EPT entries, a nested MMU must hold mmu_lock for write and thus cannot race against a TDP MMU page fault. I.e. the additional spinlock only _needs_ to be taken by the TDP MMU, as opposed to being taken by any MMU for a VM that is running with the TDP MMU enabled. Holding mmu_lock for read also prevents the indirect shadow page from being freed. But as above, keep it simple and always take the lock. Alternative #1, the TDP MMU could simply pass "false" for can_unsync and effectively disable unsync behavior for nested TDP. Write protecting leaf shadow pages is unlikely to noticeably impact traditional L1 VMMs, as such VMMs typically don't modify TDP entries, but the same may not hold true for non-standard use cases and/or VMMs that are migrating physical pages (from L1's perspective). Alternative #2, the unsync logic could be made thread safe. In theory, simply converting all relevant kvm_mmu_page fields to atomics and using atomic bitops for the bitmap would suffice. However, (a) an in-depth audit would be required, (b) the code churn would be substantial, and (c) legacy shadow paging would incur additional atomic operations in performance sensitive paths for no benefit (to legacy shadow paging). Fixes: a2855afc ("KVM: x86/mmu: Allow parallel page faults for the TDP MMU") Cc: stable@vger.kernel.org Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210812181815.3378104-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Set the min_level for the TDP iterator at the root level when zapping all SPTEs to optimize the iterator's try_step_down(). Zapping a non-leaf SPTE will recursively zap all its children, thus there is no need for the iterator to attempt to step down. This avoids rereading the top-level SPTEs after they are zapped by causing try_step_down() to short-circuit. In most cases, optimizing try_step_down() will be in the noise as the cost of zapping SPTEs completely dominates the overall time. The optimization is however helpful if the zap occurs with relatively few SPTEs, e.g. if KVM is zapping in response to multiple memslot updates when userspace is adding and removing read-only memslots for option ROMs. In that case, the task doing the zapping likely isn't a vCPU thread, but it still holds mmu_lock for read and thus can be a noisy neighbor of sorts. Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210812181414.3376143-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Pass "all ones" as the end GFN to signal "zap all" for the TDP MMU and really zap all SPTEs in this case. As is, zap_gfn_range() skips non-leaf SPTEs whose range exceeds the range to be zapped. If shadow_phys_bits is not aligned to the range size of top-level SPTEs, e.g. 512gb with 4-level paging, the "zap all" flows will skip top-level SPTEs whose range extends beyond shadow_phys_bits and leak their SPs when the VM is destroyed. Use the current upper bound (based on host.MAXPHYADDR) to detect that the caller wants to zap all SPTEs, e.g. instead of using the max theoretical gfn, 1 << (52 - 12). The more precise upper bound allows the TDP iterator to terminate its walk earlier when running on hosts with MAXPHYADDR < 52. Add a WARN on kmv->arch.tdp_mmu_pages when the TDP MMU is destroyed to help future debuggers should KVM decide to leak SPTEs again. The bug is most easily reproduced by running (and unloading!) KVM in a VM whose host.MAXPHYADDR < 39, as the SPTE for gfn=0 will be skipped. ============================================================================= BUG kvm_mmu_page_header (Not tainted): Objects remaining in kvm_mmu_page_header on __kmem_cache_shutdown() ----------------------------------------------------------------------------- Slab 0x000000004d8f7af1 objects=22 used=2 fp=0x00000000624d29ac flags=0x4000000000000200(slab|zone=1) CPU: 0 PID: 1582 Comm: rmmod Not tainted 5.14.0-rc2+ #420 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: dump_stack_lvl+0x45/0x59 slab_err+0x95/0xc9 __kmem_cache_shutdown.cold+0x3c/0x158 kmem_cache_destroy+0x3d/0xf0 kvm_mmu_module_exit+0xa/0x30 [kvm] kvm_arch_exit+0x5d/0x90 [kvm] kvm_exit+0x78/0x90 [kvm] vmx_exit+0x1a/0x50 [kvm_intel] __x64_sys_delete_module+0x13f/0x220 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae Fixes: faaf05b0 ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU") Cc: stable@vger.kernel.org Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210812181414.3376143-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 10 Aug, 2021 2 commits
-
-
Paolo Bonzini authored
Merge common topic branch for 5.14-rc6 and 5.15 merge window.
-
Sean Christopherson authored
Use the secondary_exec_controls_get() accessor in vmx_has_waitpkg() to effectively get the controls for the current VMCS, as opposed to using vmx->secondary_exec_controls, which is the cached value of KVM's desired controls for vmcs01 and truly not reflective of any particular VMCS. While the waitpkg control is not dynamic, i.e. vmcs01 will always hold the same waitpkg configuration as vmx->secondary_exec_controls, the same does not hold true for vmcs02 if the L1 VMM hides the feature from L2. If L1 hides the feature _and_ does not intercept MSR_IA32_UMWAIT_CONTROL, L2 could incorrectly read/write L1's virtual MSR instead of taking a #GP. Fixes: 6e3ba4ab ("KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210810171952.2758100-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 06 Aug, 2021 8 commits
-
-
David Matlack authored
perf_test_util is used to set up KVM selftests where vCPUs touch a region of memory. The guest code is implemented in perf_test_util.c (not the calling selftests). The guest code requires a 1 parameter, the vcpuid, which has to be set by calling vcpu_args_set(vm, vcpu_id, 1, vcpu_id). Today all of the selftests that use perf_test_util are making this call. Instead, perf_test_util should just do it. This will save some code but more importantly prevents mistakes since totally non-obvious that this needs to be called and failing to do so results in vCPUs not accessing the right regions of memory. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20210805172821.2622793-1-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
Introduce a new option to dirty_log_perf_test: -x number_of_slots. This causes the test to attempt to split the region of memory into the given number of slots. If the region cannot be evenly divided, the test will fail. This allows testing with more than one slot and therefore measure how performance scales with the number of memslots. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20210804222844.1419481-8-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
gfn_to_rmap was removed in the previous patch so there is no need to retain the double underscore on __gfn_to_rmap. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20210804222844.1419481-7-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
rmap_add() and rmap_recycle() both run in the context of the vCPU and thus we can use kvm_vcpu_gfn_to_memslot() to look up the memslot. This enables rmap_add() and rmap_recycle() to take advantage of vcpu->last_used_slot and avoid expensive memslot searching. This change improves the performance of "Populate memory time" in dirty_log_perf_test with tdp_mmu=N. In addition to improving the performance, "Populate memory time" no longer scales with the number of memslots in the VM. Command | Before | After ------------------------------- | ---------------- | ------------- ./dirty_log_perf_test -v64 -x1 | 15.18001570s | 14.99469366s ./dirty_log_perf_test -v64 -x64 | 18.71336392s | 14.98675076s Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20210804222844.1419481-6-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
The existing TDP MMU methods to handle dirty logging are vcpu-agnostic since they can be driven by MMU notifiers and other non-vcpu-specific events in addition to page faults. However this means that the TDP MMU is not benefiting from the new vcpu->last_used_slot. Fix that by introducing a tdp_mmu_map_set_spte_atomic() which is only called during a TDP page fault and has access to the kvm_vcpu for fast slot lookups. This improves "Populate memory time" in dirty_log_perf_test by 5%: Command | Before | After ------------------------------- | ---------------- | ------------- ./dirty_log_perf_test -v64 -x64 | 5.472321072s | 5.169832886s Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20210804222844.1419481-5-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
The memslot for a given gfn is looked up multiple times during page fault handling. Avoid binary searching for it multiple times by caching the most recently used slot. There is an existing VM-wide last_used_slot but that does not work well for cases where vCPUs are accessing memory in different slots (see performance data below). Another benefit of caching the most recently use slot (versus looking up the slot once and passing around a pointer) is speeding up memslot lookups *across* faults and during spte prefetching. To measure the performance of this change I ran dirty_log_perf_test with 64 vCPUs and 64 memslots and measured "Populate memory time" and "Iteration 2 dirty memory time". Tests were ran with eptad=N to force dirty logging to use fast_page_fault so its performance could be measured. Config | Metric | Before | After ---------- | ----------------------------- | ------ | ------ tdp_mmu=Y | Populate memory time | 6.76s | 5.47s tdp_mmu=Y | Iteration 2 dirty memory time | 2.83s | 0.31s tdp_mmu=N | Populate memory time | 20.4s | 18.7s tdp_mmu=N | Iteration 2 dirty memory time | 2.65s | 0.30s The "Iteration 2 dirty memory time" results are especially compelling because they are equivalent to running the same test with a single memslot. In other words, fast_page_fault performance no longer scales with the number of memslots. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20210804222844.1419481-4-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
Make search_memslots unconditionally search all memslots and move the last_used_slot logic up one level to __gfn_to_memslot. This is in preparation for introducing a per-vCPU last_used_slot. As part of this change convert existing callers of search_memslots to __gfn_to_memslot to avoid making any functional changes. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20210804222844.1419481-3-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
lru_slot is used to keep track of the index of the most-recently used memslot. The correct acronym would be "mru" but that is not a common acronym. So call it last_used_slot which is a bit more obvious. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20210804222844.1419481-2-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 05 Aug, 2021 2 commits
-
-
Sean Christopherson authored
Take a signed 'long' instead of an 'unsigned long' for the number of pages to add/subtract to the total number of pages used by the MMU. This fixes a zero-extension bug on 32-bit kernels that effectively corrupts the per-cpu counter used by the shrinker. Per-cpu counters take a signed 64-bit value on both 32-bit and 64-bit kernels, whereas kvm_mod_used_mmu_pages() takes an unsigned long and thus an unsigned 32-bit value on 32-bit kernels. As a result, the value used to adjust the per-cpu counter is zero-extended (unsigned -> signed), not sign-extended (signed -> signed), and so KVM's intended -1 gets morphed to 4294967295 and effectively corrupts the counter. This was found by a staggering amount of sheer dumb luck when running kvm-unit-tests on a 32-bit KVM build. The shrinker just happened to kick in while running tests and do_shrink_slab() logged an error about trying to free a negative number of objects. The truly lucky part is that the kernel just happened to be a slightly stale build, as the shrinker no longer yells about negative objects as of commit 18bb473e ("mm: vmscan: shrink deferred objects proportional to priority"). vmscan: shrink_slab: mmu_shrink_scan+0x0/0x210 [kvm] negative objects to delete nr=-858993460 Fixes: bc8a3d89 ("kvm: mmu: Fix overflow on kvm mmu page limit calculation") Cc: stable@vger.kernel.org Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210804214609.1096003-1-seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
gfn_to_hva_cache is not thread-safe, so it is usually used only within a vCPU (whose code is protected by vcpu->mutex). The Xen interface implementation has such a cache in kvm->arch, but it is not really used except to store the location of the shared info page. Replace shinfo_set and shinfo_cache with just the value that is passed via KVM_XEN_ATTR_TYPE_SHARED_INFO; the only complication is that the initialization value is not zero anymore and therefore kvm_xen_init_vm needs to be introduced. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 04 Aug, 2021 8 commits
-
-
Maxim Levitsky authored
The test was mistakenly using addr_gpa2hva on a gva and that happened to work accidentally. Commit 106a2e76 ("KVM: selftests: Lower the min virtual address for misc page allocations") revealed this bug. Fixes: 2c7f76b4 ("selftests: kvm: Add basic Hyper-V clocksources tests", 2021-03-18) Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210804112057.409498-1-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Mingwei Zhang authored
KVM SEV code uses bitmaps to manage ASID states. ASID 0 was always skipped because it is never used by VM. Thus, in existing code, ASID value and its bitmap postion always has an 'offset-by-1' relationship. Both SEV and SEV-ES shares the ASID space, thus KVM uses a dynamic range [min_asid, max_asid] to handle SEV and SEV-ES ASIDs separately. Existing code mixes the usage of ASID value and its bitmap position by using the same variable called 'min_asid'. Fix the min_asid usage: ensure that its usage is consistent with its name; allocate extra size for ASID 0 to ensure that each ASID has the same value with its bitmap position. Add comments on ASID bitmap allocation to clarify the size change. Signed-off-by: Mingwei Zhang <mizhang@google.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Marc Orr <marcorr@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Alper Gun <alpergun@google.com> Cc: Dionna Glaze <dionnaglaze@google.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Vipin Sharma <vipinsh@google.com> Cc: Peter Gonda <pgonda@google.com> Cc: Joerg Roedel <joro@8bytes.org> Message-Id: <20210802180903.159381-1-mizhang@google.com> [Fix up sev_asid_free to also index by ASID, as suggested by Sean Christopherson, and use nr_asids in sev_cpu_init. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Use the raw ASID, not ASID-1, when nullifying the last used VMCB when freeing an SEV ASID. The consumer, pre_sev_run(), indexes the array by the raw ASID, thus KVM could get a false negative when checking for a different VMCB if KVM manages to reallocate the same ASID+VMCB combo for a new VM. Note, this cannot cause a functional issue _in the current code_, as pre_sev_run() also checks which pCPU last did VMRUN for the vCPU, and last_vmentry_cpu is initialized to -1 during vCPU creation, i.e. is guaranteed to mismatch on the first VMRUN. However, prior to commit 8a14fe4f ("kvm: x86: Move last_cpu into kvm_vcpu_arch as last_vmentry_cpu"), SVM tracked pCPU on its own and zero-initialized the last_cpu variable. Thus it's theoretically possible that older versions of KVM could miss a TLB flush if the first VMRUN is on pCPU0 and the ASID and VMCB exactly match those of a prior VM. Fixes: 70cd94e6 ("KVM: SVM: VMRUN should use associated ASID when SEV is enabled") Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
KVM creates a debugfs directory for each VM in order to store statistics about the virtual machine. The directory name is built from the process pid and a VM fd. While generally unique, it is possible to keep a file descriptor alive in a way that causes duplicate directories, which manifests as these messages: [ 471.846235] debugfs: Directory '20245-4' with parent 'kvm' already present! Even though this should not happen in practice, it is more or less expected in the case of KVM for testcases that call KVM_CREATE_VM and close the resulting file descriptor repeatedly and in parallel. When this happens, debugfs_create_dir() returns an error but kvm_create_vm_debugfs() goes on to allocate stat data structs which are later leaked. The slow memory leak was spotted by syzkaller, where it caused OOM reports. Since the issue only affects debugfs, do a lookup before calling debugfs_create_dir, so that the message is downgraded and rate-limited. While at it, ensure kvm->debugfs_dentry is NULL rather than an error if it is not created. This fixes kvm_destroy_vm_debugfs, which was not checking IS_ERR_OR_NULL correctly. Cc: stable@vger.kernel.org Fixes: 536a6f88 ("KVM: Create debugfs dir and stat files for each VM") Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Like Xu authored
Based on our observations, after any vm-exit associated with vPMU, there are at least two or more perf interfaces to be called for guest counter emulation, such as perf_event_{pause, read_value, period}(), and each one will {lock, unlock} the same perf_event_ctx. The frequency of calls becomes more severe when guest use counters in a multiplexed manner. Holding a lock once and completing the KVM request operations in the perf context would introduce a set of impractical new interfaces. So we can further optimize the vPMU implementation by avoiding repeated calls to these interfaces in the KVM context for at least one pattern: After we call perf_event_pause() once, the event will be disabled and its internal count will be reset to 0. So there is no need to pause it again or read its value. Once the event is paused, event period will not be updated until the next time it's resumed or reprogrammed. And there is also no need to call perf_event_period twice for a non-running counter, considering the perf_event for a running counter is never paused. Based on this implementation, for the following common usage of sampling 4 events using perf on a 4u8g guest: echo 0 > /proc/sys/kernel/watchdog echo 25 > /proc/sys/kernel/perf_cpu_time_max_percent echo 10000 > /proc/sys/kernel/perf_event_max_sample_rate echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent for i in `seq 1 1 10` do taskset -c 0 perf record \ -e cpu-cycles -e instructions -e branch-instructions -e cache-misses \ /root/br_instr a done the average latency of the guest NMI handler is reduced from 37646.7 ns to 32929.3 ns (~1.14x speed up) on the Intel ICX server. Also, in addition to collecting more samples, no loss of sampling accuracy was observed compared to before the optimization. Signed-off-by: Like Xu <likexu@tencent.com> Message-Id: <20210728120705.6855-1-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org>
-
Peter Xu authored
Using rmap_get_first() and rmap_remove() for zapping a huge rmap list could be slow. The easy way is to travers the rmap list, collecting the a/d bits and free the slots along the way. Provide a pte_list_destroy() and do exactly that. Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20210730220605.26377-1-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Peter Xu authored
Add a counter field into pte_list_desc, so as to simplify the add/remove/loop logic. E.g., we don't need to loop over the array any more for most reasons. This will make more sense after we've switched the array size to be larger otherwise the counter will be a waste. Initially I wanted to store a tail pointer at the head of the array list so we don't need to traverse the list at least for pushing new ones (if without the counter we traverse both the list and the array). However that'll need slightly more change without a huge lot benefit, e.g., after we grow entry numbers per array the list traversing is not so expensive. So let's be simple but still try to get as much benefit as we can with just these extra few lines of changes (not to mention the code looks easier too without looping over arrays). I used the same a test case to fork 500 child and recycle them ("./rmap_fork 500" [1]), this patch further speeds up the total fork time of about 4%, which is a total of 33% of vanilla kernel: Vanilla: 473.90 (+-5.93%) 3->15 slots: 366.10 (+-4.94%) Add counter: 351.00 (+-3.70%) [1] https://github.com/xzpeter/clibs/commit/825436f825453de2ea5aaee4bdb1c92281efe5b3Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20210730220602.26327-1-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Peter Xu authored
Currently rmap array element only contains 3 entries. However for EPT=N there could have a lot of guest pages that got tens of even hundreds of rmap entry. A normal distribution of a 6G guest (even if idle) shows this with rmap count statistics: Rmap_Count: 0 1 2-3 4-7 8-15 16-31 32-63 64-127 128-255 256-511 512-1023 Level=4K: 3089171 49005 14016 1363 235 212 15 7 0 0 0 Level=2M: 5951 227 0 0 0 0 0 0 0 0 0 Level=1G: 32 0 0 0 0 0 0 0 0 0 0 If we do some more fork some pages will grow even larger rmap counts. This patch makes PTE_LIST_EXT bigger so it'll be more efficient for the general use case of EPT=N as we do list reference less and the loops over PTE_LIST_EXT will be slightly more efficient; but still not too large so less waste when array not full. It should not affecting EPT=Y since EPT normally only has zero or one rmap entry for each page, so no array is even allocated. With a test case to fork 500 child and recycle them ("./rmap_fork 500" [1]), this patch speeds up fork time of about 29%. Before: 473.90 (+-5.93%) After: 366.10 (+-4.94%) [1] https://github.com/xzpeter/clibs/commit/825436f825453de2ea5aaee4bdb1c92281efe5b3Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20210730220455.26054-6-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 03 Aug, 2021 7 commits
-
-
Vitaly Kuznetsov authored
Check that #UD is raised if bit 16 is clear in HYPERV_CPUID_FEATURES.EDX and an 'XMM fast' hypercall is issued. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Siddharth Chandrasekaran <sidcha@amazon.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210730122625.112848-5-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Vitaly Kuznetsov authored
TLFS states that "Availability of the XMM fast hypercall interface is indicated via the “Hypervisor Feature Identification” CPUID Leaf (0x40000003, see section 2.4.4) ... Any attempt to use this interface when the hypervisor does not indicate availability will result in a #UD fault." Implement the check for 'strict' mode (KVM_CAP_HYPERV_ENFORCE_CPUID). Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Siddharth Chandrasekaran <sidcha@amazon.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210730122625.112848-4-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Vitaly Kuznetsov authored
Hypercall failures are unusual with potentially far going consequences so it would be useful to see their results when tracing. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Siddharth Chandrasekaran <sidcha@amazon.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210730122625.112848-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Vitaly Kuznetsov authored
In case guest doesn't have access to the particular hypercall we can avoid reading XMM registers. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Siddharth Chandrasekaran <sidcha@amazon.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210730122625.112848-2-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Hamza Mahfooz authored
As alluded to in commit f36f3f28 ("KVM: add "new" argument to kvm_arch_commit_memory_region"), a bunch of other places where struct kvm_memory_slot is used, needs to be refactored to preserve the "const"ness of struct kvm_memory_slot across-the-board. Signed-off-by: Hamza Mahfooz <someguy@effective-light.com> Message-Id: <20210713023338.57108-1-someguy@effective-light.com> [Do not touch body of slot_rmap_walk_init. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Avoid taking mmu_lock for .invalidate_range_{start,end}() notifications that are unrelated to KVM. This is possible now that memslot updates are blocked from range_start() to range_end(); that ensures that lock elision happens in both or none, and therefore that mmu_notifier_count updates (which must occur while holding mmu_lock for write) are always paired across start->end. Based on patches originally written by Ben Gardon. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
We would like to avoid taking mmu_lock for .invalidate_range_{start,end}() notifications that are unrelated to KVM. Because mmu_notifier_count must be modified while holding mmu_lock for write, and must always be paired across start->end to stay balanced, lock elision must happen in both or none. Therefore, in preparation for this change, this patch prevents memslot updates across range_start() and range_end(). Note, technically flag-only memslot updates could be allowed in parallel, but stalling a memslot update for a relatively short amount of time is not a scalability issue, and this is all more than complex enough. A long note on the locking: a previous version of the patch used an rwsem to block the memslot update while the MMU notifier run, but this resulted in the following deadlock involving the pseudo-lock tagged as "mmu_notifier_invalidate_range_start". ====================================================== WARNING: possible circular locking dependency detected 5.12.0-rc3+ #6 Tainted: G OE ------------------------------------------------------ qemu-system-x86/3069 is trying to acquire lock: ffffffff9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: __mmu_notifier_invalidate_range_end+0x5/0x190 but task is already holding lock: ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm] which lock already depends on the new lock. This corresponds to the following MMU notifier logic: invalidate_range_start take pseudo lock down_read() (*) release pseudo lock invalidate_range_end take pseudo lock (**) up_read() release pseudo lock At point (*) we take the mmu_notifiers_slots_lock inside the pseudo lock; at point (**) we take the pseudo lock inside the mmu_notifiers_slots_lock. This could cause a deadlock (ignoring for a second that the pseudo lock is not a lock): - invalidate_range_start waits on down_read(), because the rwsem is held by install_new_memslots - install_new_memslots waits on down_write(), because the rwsem is held till (another) invalidate_range_end finishes - invalidate_range_end sits waits on the pseudo lock, held by invalidate_range_start. Removing the fairness of the rwsem breaks the cycle (in lockdep terms, it would change the *shared* rwsem readers into *shared recursive* readers), so open-code the wait using a readers count and a spinlock. This also allows handling blockable and non-blockable critical section in the same way. Losing the rwsem fairness does theoretically allow MMU notifiers to block install_new_memslots forever. Note that mm/mmu_notifier.c's own retry scheme in mmu_interval_read_begin also uses wait/wake_up and is likewise not fair. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 02 Aug, 2021 2 commits
-
-
Paolo Bonzini authored
For an event to be in injected state when nested_svm_vmrun executes, it must have come from exitintinfo when svm_complete_interrupts ran: vcpu_enter_guest static_call(kvm_x86_run) -> svm_vcpu_run svm_complete_interrupts // now the event went from "exitintinfo" to "injected" static_call(kvm_x86_handle_exit) -> handle_exit svm_invoke_exit_handler vmrun_interception nested_svm_vmrun However, no event could have been in exitintinfo before a VMRUN vmexit. The code in svm.c is a bit more permissive than the one in vmx.c: if (is_external_interrupt(svm->vmcb->control.exit_int_info) && exit_code != SVM_EXIT_EXCP_BASE + PF_VECTOR && exit_code != SVM_EXIT_NPF && exit_code != SVM_EXIT_TASK_SWITCH && exit_code != SVM_EXIT_INTR && exit_code != SVM_EXIT_NMI) but in any case, a VMRUN instruction would not even start to execute during an attempted event delivery. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Preserve CR0.CD and CR0.NW on INIT instead of forcing them to '1', as defined by both Intel's SDM and AMD's APM. Note, current versions of Intel's SDM are very poorly written with respect to INIT behavior. Table 9-1. "IA-32 and Intel 64 Processor States Following Power-up, Reset, or INIT" quite clearly lists power-up, RESET, _and_ INIT as setting CR0=60000010H, i.e. CD/NW=1. But the SDM then attempts to qualify CD/NW behavior in a footnote: 2. The CD and NW flags are unchanged, bit 4 is set to 1, all other bits are cleared. Presumably that footnote is only meant for INIT, as the RESET case and especially the power-up case are rather non-sensical. Another footnote all but confirms that: 6. Internal caches are invalid after power-up and RESET, but left unchanged with an INIT. Bare metal testing shows that CD/NW are indeed preserved on INIT (someone else can hack their BIOS to check RESET and power-up :-D). Reported-by: Reiji Watanabe <reijiw@google.com> Reviewed-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-47-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-