1. 30 Sep, 2021 29 commits
    • KVM: X86: Don't flush current tlb on shadow page modification · bd047e54
      Lai Jiangshan authored
      After any shadow page modification, flushing the TLB only on the current
      vCPU is odd, because other vCPUs' TLBs might still be stale.
      
      In other words, if there is any mandatory tlb-flushing after shadow page
      modification, SET_SPTE_NEED_REMOTE_TLB_FLUSH or remote_flush should be
      set and the TLBs of all vCPUs should be flushed.  There is no point in
      flushing only the current TLB, except when the request stems from the
      vCPU's or pCPU's own activities.
      
      If there were a bug where mandatory TLB flushing is required but
      SET_SPTE_NEED_REMOTE_TLB_FLUSH/remote_flush fails to be set, this patch
      would expose the bug in a more destructive way.  The related code paths
      have been checked and no missing SET_SPTE_NEED_REMOTE_TLB_FLUSH has been
      found yet.
      
      Currently, no optional TLB flushing remains after the sync-page related
      code was changed to flush the TLB in a timely manner, so the local
      flushing code can simply be removed.
      Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210918005636.3675-5-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      bd047e54
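      A minimal sketch of the flushing rule the message describes, using
      stand-in types rather than the real KVM code: any mandatory flush after
      a shadow page modification must reach all vCPUs, so a local-only flush
      at this point is either redundant or hiding a missing remote_flush.

        #include <stdbool.h>
        #include <stdio.h>

        /* Stand-in for the MMU state; only the flag relevant here. */
        struct toy_mmu { bool remote_flush; };

        static void kvm_flush_remote_tlbs(struct toy_mmu *mmu)
        {
                puts("flushing TLBs on ALL vCPUs");  /* reaches every vCPU */
        }

        /* After a shadow page modification: either a remote flush was
         * flagged, or no flush is needed.  The removed code flushed only
         * the current vCPU here, which cannot fix staleness elsewhere. */
        static void finish_shadow_page_update(struct toy_mmu *mmu)
        {
                if (mmu->remote_flush)
                        kvm_flush_remote_tlbs(mmu);
        }

        int main(void)
        {
                struct toy_mmu mmu = { .remote_flush = true };
                finish_shadow_page_update(&mmu);
                return 0;
        }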
    • KVM: x86/mmu: Complete prefetch for trailing SPTEs for direct, legacy MMU · c6cecc4b
      Sean Christopherson authored
      Make a final call to direct_pte_prefetch_many() if there are "trailing"
      SPTEs to prefetch, i.e. SPTEs for GFNs following the faulting GFN.  The
      call to direct_pte_prefetch_many() in the loop only handles the case
      where there are !PRESENT SPTEs preceding a PRESENT SPTE.
      
      E.g. if the faulting GFN is a multiple of 8 (the prefetch size) and all
      SPTEs for the following GFNs are !PRESENT, the loop will terminate with
      "start = sptep+1" and not prefetch any SPTEs.
      
      Prefetching trailing SPTEs as intended can drastically reduce the number
      of guest page faults, e.g. when accessing the first byte of every 4kb
      page in a 6gb chunk of virtual memory in a VM with 8gb of preallocated
      memory, the number of pf_fixed events observed in L0 drops from ~1.75M
      to <0.27M.
      
      Note, this only affects memory that is backed by 4kb pages as KVM doesn't
      prefetch when installing hugepages.  Shadow paging prefetching is not
      affected as it does not batch the prefetches due to the need to process
      the corresponding guest PTE.  The TDP MMU is not affected because it
      doesn't have prefetching, yet...
      
      Fixes: 957ed9ef ("KVM: MMU: prefetch ptes when intercepted guest #PF")
      Cc: Sergey Senozhatsky <senozhatsky@google.com>
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210818235615.2047588-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c6cecc4b
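      A toy model of the loop shape described above (stand-in types, not the
      real __direct_pte_prefetch()): runs of !PRESENT SPTEs are handed to the
      prefetch helper only when a PRESENT SPTE terminates them, so a trailing
      run needs the final call this patch adds.

        #include <stddef.h>
        #include <stdio.h>

        #define PTE_PREFETCH_NUM 8

        static void prefetch_many(const unsigned long *start,
                                  const unsigned long *end)
        {
                printf("prefetching %zu SPTEs\n", (size_t)(end - start));
        }

        /* Nonzero stands in for a PRESENT SPTE. */
        static void direct_pte_prefetch(unsigned long spte[PTE_PREFETCH_NUM])
        {
                unsigned long *start = NULL, *p;

                for (p = spte; p < spte + PTE_PREFETCH_NUM; p++) {
                        if (*p) {               /* PRESENT: flush the run */
                                if (start)
                                        prefetch_many(start, p);
                                start = NULL;
                        } else if (!start) {
                                start = p;      /* start of !PRESENT run */
                        }
                }
                if (start)                      /* the fix: trailing run */
                        prefetch_many(start, p);
        }

        int main(void)
        {
                /* Entry 0 is PRESENT (e.g. the faulting GFN); the rest are
                 * not, so without the final call nothing is prefetched. */
                unsigned long sptes[PTE_PREFETCH_NUM] = { 1, 0, 0, 0, 0, 0, 0, 0 };
                direct_pte_prefetch(sptes);
                return 0;
        }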
    • KVM: selftests: Fix kvm_vm_free() in cr4_cpuid_sync and vmx_tsc_adjust tests · 22d7108c
      Thomas Huth authored
      The kvm_vm_free() statement here is currently dead code, since the loop
      in front of it can only be left with the "goto done" that jumps right
      after the kvm_vm_free(). Fix it by swapping the locations of the "done"
      label and the kvm_vm_free().
      Signed-off-by: Thomas Huth <thuth@redhat.com>
      Message-Id: <20210826074928.240942-1-thuth@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      22d7108c
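      The control-flow bug in miniature (stand-in helpers, not the actual
      selftest code): every exit from the loop jumps past the cleanup, so the
      cleanup must come after the label.

        #include <stdio.h>

        static int stage;
        static int run_stage(void) { return ++stage; }      /* stand-in */
        static void kvm_vm_free(void) { puts("vm freed"); } /* stand-in */

        int main(void)
        {
                for (;;) {
                        if (run_stage() == 3)
                                goto done;  /* the only way out of the loop */
                }
                /* A kvm_vm_free() here would be dead code: unreachable. */
        done:
                kvm_vm_free();  /* the fix: free after the label */
                return 0;
        }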
    • kvm: selftests: Fix spelling mistake "missmatch" -> "mismatch" · d22869af
      Colin Ian King authored
      There is a spelling mistake in an error message. Fix it.
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Message-Id: <20210826120752.12633-1-colin.king@canonical.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d22869af
    • KVM: x86: Manually retrieve CPUID.0x1 when getting FMS for RESET/INIT · 25b97845
      Sean Christopherson authored
      Manually look for a CPUID.0x1 entry instead of bouncing through
      kvm_cpuid() when retrieving the Family-Model-Stepping information for
      vCPU RESET/INIT.  This fixes a potential undefined behavior bug due to
      kvm_cpuid() using the uninitialized "dummy" param as the ECX _input_,
      a.k.a. the index.
      
      A more minimal fix would be to simply zero "dummy", but the extra work in
      kvm_cpuid() is wasteful, and KVM should be treating the FMS retrieval as
      an out-of-band access, e.g. same as how KVM computes guest.MAXPHYADDR.
      Both Intel's SDM and AMD's APM describe the RDX value at RESET/INIT as
      holding the CPU's FMS information, not as holding CPUID.0x1.EAX.  KVM's
      usage of CPUID entries to get FMS is simply a pragmatic approach to avoid
      having yet another way for userspace to provide inconsistent data.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Message-Id: <20210929222426.1855730-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      25b97845
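      A sketch of the manual lookup with simplified stand-in types (not the
      real kvm_cpuid_entry2 handling): scan the userspace-provided entries
      for leaf 0x1 and read EAX directly, so no ECX/index input is consumed.

        #include <stddef.h>
        #include <stdint.h>

        /* Simplified stand-in for struct kvm_cpuid_entry2. */
        struct cpuid_entry { uint32_t function, eax; };

        /* Family-Model-Stepping lives in CPUID.0x1.EAX; finding the entry
         * by function alone sidesteps kvm_cpuid() and its index input. */
        static uint32_t get_reset_fms(const struct cpuid_entry *e, size_t nent)
        {
                for (size_t i = 0; i < nent; i++) {
                        if (e[i].function == 0x1)
                                return e[i].eax;
                }
                return 0;  /* userspace provided no CPUID.0x1 entry */
        }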
    • KVM: x86: WARN on non-zero CRs at RESET to detect improper initialization · 62dd57dd
      Sean Christopherson authored
      WARN if CR0, CR3, or CR4 are non-zero at RESET, which, given the current
      KVM implementation, really means WARN if they're not zeroed at vCPU
      creation.  VMX in particular has several ->set_*() flows that read other
      registers to handle side effects, and because those flows are common to
      RESET and INIT, KVM subtly relies on emulated/virtualized registers to be
      zeroed at vCPU creation in order to do the right thing at RESET.
      
      Use CRs as a sentinel because they are most likely to be written as side
      effects, and because KVM specifically needs CR0.PG and CR0.PE to be '0'
      to correctly reflect the state of the vCPU's MMU.  CRs are also loaded
      and stored from/to the VMCS, so the check adds some level of coverage to
      verify that KVM doesn't conflate zero-allocating the VMCS with properly
      initializing the VMCS with VMWRITEs.
      
      Note, '0' is somewhat arbitrary; vCPU creation can technically stuff any
      value into a register so long as it's coherent with respect to the
      current vCPU state.  In practice, '0' works for all registers and is
      convenient.
      Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210921000303.400537-11-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      62dd57dd
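      The shape of the added sanity check, as a hedged sketch with simplified
      accessors (the real check lives in kvm_vcpu_reset()):

        #include <stdint.h>
        #include <stdio.h>

        struct toy_vcpu { uint64_t cr0, cr3, cr4; };  /* stand-in */

        #define WARN_ON_ONCE(cond) \
                do { if (cond) fprintf(stderr, "WARN: %s\n", #cond); } while (0)

        /* RESET (!init_event) expects the zero-allocated creation values;
         * INIT is exempt because select CR bits survive INIT. */
        static void check_cr_reset_state(const struct toy_vcpu *v, int init_event)
        {
                WARN_ON_ONCE(!init_event && (v->cr0 || v->cr3 || v->cr4));
        }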
    • KVM: SVM: Move RESET emulation to svm_vcpu_reset() · 9ebe530b
      Sean Christopherson authored
      Move RESET emulation for SVM vCPUs to svm_vcpu_reset(), and drop an extra
      init_vmcb() from svm_create_vcpu() in the process.  Hopefully KVM will
      someday expose a dedicated RESET ioctl(), and in the meantime separating
      "create" from "RESET" is a nice cleanup.
      
      Keep the call to svm_switch_vmcb() so that misuse of svm->vmcb at worst
      breaks the guest, e.g. premature accesses don't cause a NULL pointer
      dereference.
      
      Cc: Reiji Watanabe <reijiw@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210921000303.400537-10-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      9ebe530b
    • KVM: VMX: Move RESET emulation to vmx_vcpu_reset() · 06692e4b
      Sean Christopherson authored
      Move vCPU RESET emulation, including initialization of select VMCS state,
      to vmx_vcpu_reset().  Drop the open coded "vCPU load" sequence, as
      ->vcpu_reset() is invoked while the vCPU is properly loaded (which is
      kind of the point of ->vcpu_reset()...).  Hopefully KVM will someday
      expose a dedicated RESET ioctl(), and in the meantime separating "create"
      from "RESET" is a nice cleanup.
      
      Deferring VMCS initialization is effectively a nop as it's impossible to
      safely access the VMCS between the current call site and its new home, as
      both the vCPU and the pCPU are put immediately after init_vmcs(), i.e.
      the VMCS isn't guaranteed to be loaded.
      
      Note, task preemption is not a problem, as vmx_sched_in() _can't_ touch
      the VMCS: ->sched_in() is invoked before the vCPU, and thus the VMCS, is
      reloaded.  I.e. the preemption path also can't consume VMCS state.
      
      Cc: Reiji Watanabe <reijiw@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210921000303.400537-9-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      06692e4b
    • KVM: VMX: Drop explicit zeroing of MSR guest values at vCPU creation · d0656735
      Sean Christopherson authored
      Don't zero out user return and nested MSRs during vCPU creation, and
      instead rely on vcpu_vmx being zero-allocated.  Explicitly zeroing MSRs
      is not wrong, and is in fact necessary if KVM ever emulates vCPU RESET
      outside of vCPU creation, but zeroing only a subset of MSRs is confusing.
      
      Poking directly into KVM's backing store is also undesirable in that it
      doesn't scale and is error prone.  Ideally KVM would have a common RESET
      path for all MSRs, e.g. by expanding kvm_set_msr(), which would obviate
      the need for this out-of-band code (to support standalone RESET).
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210921000303.400537-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d0656735
    • KVM: x86: Fold fx_init() into kvm_arch_vcpu_create() · 583d369b
      Sean Christopherson authored
      Move the few bits of relevant fx_init() code into kvm_arch_vcpu_create(),
      dropping the superfluous check on vcpu->arch.guest_fpu that was blindly
      and wrongly added by commit ed02b213 ("KVM: SVM: Guest FPU state
      save/restore not needed for SEV-ES guest").
      
      Note, KVM currently allocates and then frees FPU state for SEV-ES guests,
      rather than avoiding the allocation in the first place.  While that
      approach is inarguably inefficient and unnecessary, fixing it is a
      cleanup left for the future.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210921000303.400537-7-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      583d369b
    • KVM: x86: Remove defunct setting of XCR0 for guest during vCPU create · e8f65b9b
      Sean Christopherson authored
      Drop code to initialize XCR0 during fx_init(), a.k.a. vCPU creation, as
      XCR0 has been initialized during kvm_vcpu_reset() (for RESET) since
      commit a554d207 ("KVM: X86: Processor States following Reset or INIT").
      
      Back when XCR0 support was added by commit 2acf923e ("KVM: VMX:
      Enable XSAVE/XRSTOR for guest"), KVM didn't differentiate between RESET
      and INIT.  Ignoring the fact that calling fx_init() for INIT is obviously
      wrong, e.g. FPU state after INIT is not the same as after RESET, setting
      XCR0 in fx_init() was correct.
      
      Eventually fx_init() got moved to kvm_arch_vcpu_init(), a.k.a. vCPU
      creation (ignore the terrible name) by commit 0ee6a517 ("x86/fpu,
      kvm: Simplify fx_init()").  Finally, commit 95a0d01e ("KVM: x86: Move
      all vcpu init code into kvm_arch_vcpu_create()") killed off
      kvm_arch_vcpu_init(), leaving behind the oddity of redundant setting of
      guest state during vCPU creation.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210921000303.400537-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e8f65b9b
    • KVM: x86: Remove defunct setting of CR0.ET for guests during vCPU create · 5ebbc470
      Sean Christopherson authored
      Drop code to set CR0.ET for the guest during initialization of the guest
      FPU.  The code was added as a misguided bug fix by commit 380102c8
      ("KVM Set the ET flag in CR0 after initializing FX") to resolve an issue
      where vcpu->cr0 (now vcpu->arch.cr0) was not correctly initialized on SVM
      systems.  While init_vmcb() did set CR0.ET, it only did so in the VMCB,
      and subtly did not update vcpu->cr0.  Stuffing CR0.ET worked around the
      immediate problem, but did not fix the real bug of vcpu->cr0 and the VMCB
      being out of sync.  That underlying bug was eventually remedied by commit
      18fa000a ("KVM: SVM: Reset cr0 properly on vcpu reset").
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210921000303.400537-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5ebbc470
    • KVM: x86: Do not mark all registers as avail/dirty during RESET/INIT · ff8828c8
      Sean Christopherson authored
      Do not blindly mark all registers as available+dirty at RESET/INIT, and
      instead rely on writes to registers to go through the proper mutators or
      to explicitly mark registers as dirty.  INIT in particular does not blindly
      overwrite all registers, e.g. select bits in CR0 are preserved across INIT,
      thus marking registers available+dirty without first reading the register
      from hardware is incorrect.
      
      In practice this is a benign bug as KVM doesn't let the guest control CR0
      bits that are preserved across INIT, and all other true registers are
      explicitly written during the RESET/INIT flows.  The PDPTRs and EX_INFO
      "registers" are not explicitly written, but accessing those values during
      RESET/INIT is nonsensical and would be a KVM bug regardless of register
      caching.
      
      Fixes: 66f7b72e ("KVM: x86: Make register state after reset conform to specification")
      [sean: !!! NOT FOR STABLE !!!]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210921000303.400537-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ff8828c8
    • KVM: x86: Simplify retrieving the page offset when loading PDPTRs · 94c641ba
      Sean Christopherson authored
      Replace impressively complex "logic" for computing the page offset from
      CR3 when loading PDPTRs.  Unlike other paging modes, the address held in
      CR3 for PAE paging is 32-byte aligned, i.e. occupies bits 31:5, thus bits
      11:5 need to be used as the offset from the gfn when reading PDPTRs.
      
      The existing calculation originated in commit 1342d353 ("[PATCH] KVM:
      MMU: Load the pae pdptrs on cr3 change like the processor does"), which
      read the PDPTRs from guest memory as individual 8-byte loads.  At the
      time, the so called "offset" was the base index of PDPTR0 as a _u64_, not
      a byte offset.  Naming aside, the computation was useful and arguably
      simplified the overall flow.
      
      Unfortunately, when commit 195aefde ("KVM: Add general accessors to
      read and write guest memory") added accessors with offsets at byte
      granularity, the cleverness of the original code was lost and KVM was
      left with convoluted code for a simple operation.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210831164224.1119728-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      94c641ba
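      The resulting computation in one line (values illustrative): with a
      32-byte-aligned PAE CR3, the byte offset of PDPTR0 within its page is
      simply CR3's bits 11:5.

        #include <stdint.h>
        #include <stdio.h>

        /* Bits 11:5 of CR3, i.e. GENMASK(11, 5) in kernel terms. */
        static uint32_t pdptr_page_offset(uint64_t cr3)
        {
                return cr3 & 0xfe0;
        }

        int main(void)
        {
                uint64_t cr3 = 0x12345fe0;  /* example 32-byte-aligned PAE CR3 */
                printf("offset = %#x\n", pdptr_page_offset(cr3));  /* 0xfe0 */
                return 0;
        }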
    • KVM: x86: Subsume nested GPA read helper into load_pdptrs() · 15cabbc2
      Sean Christopherson authored
      Open code the call to mmu->translate_gpa() when loading nested PDPTRs and
      kill off the existing helper, kvm_read_guest_page_mmu(), to discourage
      incorrect use.  Reading guest memory straight from an L2 GPA is extremely
      rare (as evidenced by the lack of users), as very few constructs in x86
      specify physical addresses, even fewer are virtualized by KVM, and even
      fewer yet require emulation of L2 by L0 KVM.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210831164224.1119728-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      15cabbc2
    • kvm: rename KVM_MAX_VCPU_ID to KVM_MAX_VCPU_IDS · a1c42dde
      Juergen Gross authored
      KVM_MAX_VCPU_ID does not specify the highest allowed vcpu-id, but rather
      the number of allowed vcpu-ids.  This has already led to confusion, so
      rename KVM_MAX_VCPU_ID to KVM_MAX_VCPU_IDS to make its semantics clear.
      Suggested-by: Eduardo Habkost <ehabkost@redhat.com>
      Signed-off-by: Juergen Gross <jgross@suse.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210913135745.13944-3-jgross@suse.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a1c42dde
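      Why the name matters, in a hypothetical declaration (the value and array
      are illustrative, not actual kernel code): the macro sizes
      vcpu-id-indexed structures, so it is a count of valid ids (0 .. N-1),
      not the largest id itself.

        #define KVM_MAX_VCPU_IDS 1024  /* illustrative value */

        /* Correct: one slot per possible vcpu-id, ids 0..KVM_MAX_VCPU_IDS-1. */
        static unsigned char vcpu_id_in_use[KVM_MAX_VCPU_IDS];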
    • Revert "x86/kvm: fix vcpu-id indexed array sizes" · 1e254d0d
      Juergen Gross authored
      This reverts commit 76b4f357.
      
      The commit's reasoning is wrong: KVM_MAX_VCPU_ID does not define the
      maximum allowed vcpu-id, as its name suggests, but rather the number of
      vcpu-ids.  So revert the patch.
      Suggested-by: Eduardo Habkost <ehabkost@redhat.com>
      Signed-off-by: Juergen Gross <jgross@suse.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210913135745.13944-2-jgross@suse.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1e254d0d
    • KVM: Make kvm_make_vcpus_request_mask() use pre-allocated cpu_kick_mask · 620b2438
      Vitaly Kuznetsov authored
      kvm_make_vcpus_request_mask() already disables preemption so just like
      kvm_make_all_cpus_request_except() it can be switched to using
      pre-allocated per-cpu cpumasks. This allows for improvements for both
      users of the function: in Hyper-V emulation code 'tlb_flush' can now be
      dropped from 'struct kvm_vcpu_hv' and kvm_make_scan_ioapic_request_mask()
      gets rid of dynamic allocation.
      
      The cpumask_available() checks in kvm_make_vcpu_request() and
      kvm_kick_many_cpus() can now be dropped, as they check for an impossible
      condition: kvm_init() makes sure the per-cpu masks are allocated.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210903075141.403071-9-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      620b2438
    • KVM: Pre-allocate cpumasks for kvm_make_all_cpus_request_except() · baff59cc
      Vitaly Kuznetsov authored
      Allocating a cpumask dynamically with zalloc_cpumask_var() is not ideal.
      The allocation is somewhat slow and can (in theory, when
      CPUMASK_OFFSTACK is enabled) fail.  kvm_make_all_cpus_request_except()
      already disables preemption, so pre-allocated per-cpu cpumasks can be
      used instead.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210903075141.403071-8-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      baff59cc
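      A kernel-style sketch of the pre-allocation pattern (simplified; error
      unwinding and the real call sites are omitted): one mask per pCPU,
      allocated once at init, and safe to use because preemption is disabled
      around the access.

        #include <linux/cpumask.h>
        #include <linux/errno.h>
        #include <linux/gfp.h>
        #include <linux/percpu.h>
        #include <linux/preempt.h>

        static DEFINE_PER_CPU(cpumask_var_t, cpu_kick_mask);

        /* Called once from kvm_init(): allocation can no longer fail or
         * sleep on the request path. */
        static int __init alloc_kick_masks(void)
        {
                int cpu;

                for_each_possible_cpu(cpu)
                        if (!zalloc_cpumask_var(per_cpu_ptr(&cpu_kick_mask, cpu),
                                                GFP_KERNEL))
                                return -ENOMEM;
                return 0;
        }

        static void make_request(void)
        {
                struct cpumask *cpus;

                preempt_disable();  /* pins us to this pCPU's mask */
                cpus = this_cpu_cpumask_var_ptr(cpu_kick_mask);
                cpumask_clear(cpus);
                /* ... fill 'cpus' and kick the targeted CPUs ... */
                preempt_enable();
        }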
    • KVM: Drop 'except' parameter from kvm_make_vcpus_request_mask() · 381cecc5
      Vitaly Kuznetsov authored
      Both remaining callers of kvm_make_vcpus_request_mask() pass 'NULL' for
      'except' parameter so it can just be dropped.
      
      No functional change intended.
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210903075141.403071-6-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      381cecc5
    • KVM: Optimize kvm_make_vcpus_request_mask() a bit · ae0946cd
      Vitaly Kuznetsov authored
      Iterating over set bits in 'vcpu_bitmap' should be faster than going
      through all vCPUs, especially when just a few bits are set.
      
      Drop the kvm_make_vcpus_request_mask() call from
      kvm_make_all_cpus_request_except() to avoid handling the special case
      when 'vcpu_bitmap' is NULL; move the code into
      kvm_make_all_cpus_request_except() itself.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210903075141.403071-5-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ae0946cd
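      The optimization in miniature (loop bodies elided; not the exact kernel
      diff): iterate only the set bits of the request mask instead of testing
      every vCPU.

        /* Before: cost scales with the number of online vCPUs. */
        for (i = 0; i < atomic_read(&kvm->online_vcpus); i++) {
                if (vcpu_bitmap && !test_bit(i, vcpu_bitmap))
                        continue;
                /* ... make the request for vCPU i ... */
        }

        /* After: cost scales with the number of set bits. */
        for_each_set_bit(i, vcpu_bitmap, KVM_MAX_VCPUS) {
                /* ... make the request for vCPU i ... */
        }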
    • KVM: x86: hyper-v: Avoid calling kvm_make_vcpus_request_mask() with vcpu_mask==NULL · 6470accc
      Vitaly Kuznetsov authored
      In preparation for making kvm_make_vcpus_request_mask() use
      for_each_set_bit(), switch kvm_hv_flush_tlb() to calling
      kvm_make_all_cpus_request() for the 'all cpus' case.
      
      Note: kvm_make_all_cpus_request() (unlike kvm_make_vcpus_request_mask())
      currently dynamically allocates cpumask on each call and this is suboptimal.
      Both kvm_make_all_cpus_request() and kvm_make_vcpus_request_mask() are
      going to be switched to using pre-allocated per-cpu masks.
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210903075141.403071-4-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6470accc
    • KVM: use vma_pages() helper · 11476d27
      Yang Li authored
      Use the vma_pages() helper on the vma object instead of the explicit
      computation.
      
      Fix the following coccicheck warning:
      ./virt/kvm/kvm_main.c:3526:29-35: WARNING: Consider using vma_pages
      helper on vma
      Reported-by: Abaci Robot <abaci@linux.alibaba.com>
      Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
      Message-Id: <1632900526-119643-1-git-send-email-yang.lee@linux.alibaba.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      11476d27
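      The helper simply encapsulates the open-coded computation (see its
      definition in include/linux/mm.h), so the two forms below are
      equivalent; variable names here are illustrative.

        /* Open-coded, as flagged by coccicheck: */
        npages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;

        /* With the helper: */
        npages = vma_pages(vma);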
    • KVM: nVMX: Reset vmxon_ptr upon VMXOFF emulation. · feb3162f
      Vitaly Kuznetsov authored
      Currently, 'vmx->nested.vmxon_ptr' is not reset upon VMXOFF emulation.
      This is not a problem per se, as it is never accessed when
      !vmx->nested.vmxon, but it should be reset to avoid any issue in the
      future.
      
      Also, initialize the vmxon_ptr when vcpu is created.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
      Message-Id: <20210929175154.11396-3-yu.c.zhang@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      feb3162f
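      A sketch of the change, using the INVALID_GPA constant from the
      companion patch below (simplified; the real code lives in the nested
      VMX teardown paths):

        /* On VMXOFF emulation, clear the mode flag and the stale pointer. */
        vmx->nested.vmxon = false;
        vmx->nested.vmxon_ptr = INVALID_GPA;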
    • KVM: nVMX: Use INVALID_GPA for pointers used in nVMX. · 64c78508
      Yu Zhang authored
      Clean up nested.c and vmx.c by using INVALID_GPA instead of "-1ull" to
      denote an invalid address in nested VMX.  Affected addresses are those
      of the VMXON region, the current VMCS, the VMCS link pointer, the
      virtual-APIC page, the ENCLS-exiting bitmap, the I/O bitmaps, etc.
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
      Message-Id: <20210929175154.11396-2-yu.c.zhang@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      64c78508
    • KVM: selftests: Ensure all migrations are performed when test is affined · 7b0035ea
      Sean Christopherson authored
      Rework the CPU selection in the migration worker to ensure the specified
      number of migrations are performed when the test itself is affined to a
      subset of CPUs.  The existing logic skips iterations if the target CPU is
      not in the original set of possible CPUs, which causes the test to fail
      if too many iterations are skipped.
      
        ==== Test Assertion Failure ====
        rseq_test.c:228: i > (NR_TASK_MIGRATIONS / 2)
        pid=10127 tid=10127 errno=4 - Interrupted system call
           1  0x00000000004018e5: main at rseq_test.c:227
           2  0x00007fcc8fc66bf6: ?? ??:0
           3  0x0000000000401959: _start at ??:?
        Only performed 4 KVM_RUNs, task stalled too much?
      
      Calculate the min/max possible CPUs as a cheap "best effort" to avoid
      high runtimes when the test is affined to a small percentage of CPUs.
      Alternatively, a list or xarray of the possible CPUs could be used, but
      even in a horrendously inefficient setup, such optimizations are not
      needed because the runtime is completely dominated by the cost of
      migrating the task, and the absolute runtime is well under a minute in
      even truly absurd setups, e.g. running on a subset of vCPUs in a VM that
      is heavily overcommitted (16 vCPUs per pCPU).
      
      Fixes: 61e52f16 ("KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs")
      Reported-by: Dongli Zhang <dongli.zhang@oracle.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210929234112.1862848-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7b0035ea
    • KVM: x86: Swap order of CPUID entry "index" vs. "significant flag" checks · e8a747d0
      Sean Christopherson authored
      Check whether a CPUID entry's index is significant before checking for a
      matching index to hack-a-fix an undefined behavior bug due to consuming
      uninitialized data.  RESET/INIT emulation uses kvm_cpuid() to retrieve
      CPUID.0x1, which does _not_ have a significant index, and fails to
      initialize the dummy variable that doubles as EBX/ECX/EDX output _and_
      ECX, a.k.a. index, input.
      
      Practically speaking, it's _extremely_ unlikely any compiler will yield
      code that causes problems, as the compiler would need to inline the
      kvm_cpuid() call to detect the uninitialized data, and intentionally hose
      the kernel, e.g. insert ud2, instead of simply ignoring the result of
      the index comparison.
      
      Although the sketchy "dummy" pattern was introduced in SVM by commit
      66f7b72e ("KVM: x86: Make register state after reset conform to
      specification"), it wasn't actually broken until commit 7ff6c035
      ("KVM: x86: Remove stateful CPUID handling") arbitrarily swapped the
      order of operations such that "index" was checked before the significant
      flag.
      
      Avoid consuming uninitialized data by reverting to checking the flag
      before the index purely so that the fix can be easily backported; the
      offending RESET/INIT code has been refactored, moved, and consolidated
      from vendor code to common x86 since the bug was introduced.  A future
      patch will directly address the bad RESET/INIT behavior.
      
      The undefined behavior was detected by syzbot + KernelMemorySanitizer.
      
        BUG: KMSAN: uninit-value in cpuid_entry2_find arch/x86/kvm/cpuid.c:68
        BUG: KMSAN: uninit-value in kvm_find_cpuid_entry arch/x86/kvm/cpuid.c:1103
        BUG: KMSAN: uninit-value in kvm_cpuid+0x456/0x28f0 arch/x86/kvm/cpuid.c:1183
         cpuid_entry2_find arch/x86/kvm/cpuid.c:68 [inline]
         kvm_find_cpuid_entry arch/x86/kvm/cpuid.c:1103 [inline]
         kvm_cpuid+0x456/0x28f0 arch/x86/kvm/cpuid.c:1183
         kvm_vcpu_reset+0x13fb/0x1c20 arch/x86/kvm/x86.c:10885
         kvm_apic_accept_events+0x58f/0x8c0 arch/x86/kvm/lapic.c:2923
         vcpu_enter_guest+0xfd2/0x6d80 arch/x86/kvm/x86.c:9534
         vcpu_run+0x7f5/0x18d0 arch/x86/kvm/x86.c:9788
         kvm_arch_vcpu_ioctl_run+0x245b/0x2d10 arch/x86/kvm/x86.c:10020
      
        Local variable ----dummy@kvm_vcpu_reset created at:
         kvm_vcpu_reset+0x1fb/0x1c20 arch/x86/kvm/x86.c:10812
         kvm_apic_accept_events+0x58f/0x8c0 arch/x86/kvm/lapic.c:2923
      
      Reported-by: syzbot+f3985126b746b3d59c9d@syzkaller.appspotmail.com
      Reported-by: Alexander Potapenko <glider@google.com>
      Fixes: 2a24be79 ("KVM: VMX: Set EDX at INIT with CPUID.0x1, Family-Model-Stepping")
      Fixes: 7ff6c035 ("KVM: x86: Remove stateful CPUID handling")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Message-Id: <20210929222426.1855730-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e8a747d0
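      The ordering fix in miniature, as a kernel-style fragment simplified
      from cpuid_entry2_find(): testing the flag first means an uninitialized
      index is never compared, thanks to short-circuit evaluation.
      (KVM_CPUID_FLAG_SIGNIFCANT_INDEX is the flag's actual, historically
      misspelled, UAPI name.)

        /* Match if functions agree and, only when the entry declares its
         * index significant, the indices agree too; short-circuiting keeps
         * 'index' unread for index-insignificant leaves such as CPUID.0x1. */
        static bool entry_matches(const struct kvm_cpuid_entry2 *e,
                                  u32 function, u64 index)
        {
                return e->function == function &&
                       (!(e->flags & KVM_CPUID_FLAG_SIGNIFCANT_INDEX) ||
                        e->index == index);
        }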
    • ptp: Fix ptp_kvm_getcrosststamp issue for x86 ptp_kvm · 773e89ab
      Zelin Deng authored
      hv_clock is preallocated to have only HVC_BOOT_ARRAY_SIZE (64) elements;
      if the PTP_SYS_OFFSET_PRECISE ioctl is executed on vCPUs whose index is
      64 or higher, retrieving the struct pvclock_vcpu_time_info pointer with
      "src = &hv_clock[cpu].pvti" will result in an out-of-bounds access and
      a wild pointer.  Change it to "this_cpu_pvti()", which is guaranteed to
      be valid.
      
      Fixes: 95a3d445 ("Switch kvmclock data to a PER_CPU variable")
      Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
      Cc: <stable@vger.kernel.org>
      Message-Id: <1632892429-101194-3-git-send-email-zelin.deng@linux.alibaba.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      773e89ab
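      The before/after in two lines (simplified from the x86 ptp_kvm code):
      indexing the static hv_clock array by raw CPU number overflows once the
      CPU index reaches HVC_BOOT_ARRAY_SIZE.

        /* Before: out of bounds for cpu >= HVC_BOOT_ARRAY_SIZE (64). */
        src = &hv_clock[cpu].pvti;

        /* After: resolves through the per-cpu pointer, valid for any CPU. */
        src = this_cpu_pvti();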
    • x86/kvmclock: Move this_cpu_pvti into kvmclock.h · ad9af930
      Zelin Deng authored
      Other modules, such as ptp_kvm, might use the hv_clock_per_cpu variable,
      so move it into kvmclock.h and export the symbol to make it visible to
      other modules.
      Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
      Cc: <stable@vger.kernel.org>
      Message-Id: <1632892429-101194-2-git-send-email-zelin.deng@linux.alibaba.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ad9af930
  2. 28 Sep, 2021 1 commit
  3. 27 Sep, 2021 1 commit
    • KVM: VMX: Fix a TSX_CTRL_CPUID_CLEAR field mask issue · 5c49d185
      Zhenzhong Duan authored
      When updating the host's mask for its MSR_IA32_TSX_CTRL user return entry,
      clear the mask in the found uret MSR instead of vmx->guest_uret_msrs[i].
      Modifying guest_uret_msrs directly is completely broken, as 'i' does not
      point at the MSR_IA32_TSX_CTRL entry.  In fact, it's guaranteed to be an
      out-of-bounds access, as 'i' is always set to kvm_nr_uret_msrs by a
      prior loop.  By sheer dumb luck, the fallout is limited to "only"
      failing to preserve the host's TSX_CTRL_CPUID_CLEAR.  The out-of-bounds
      access is benign, as it's guaranteed to clear a bit in a guest MSR
      value, which is always zero at vCPU creation on both x86-64 and i386.
      
      Cc: stable@vger.kernel.org
      Fixes: 8ea8b8d6 ("KVM: VMX: Use common x86's uret MSR list as the one true list")
      Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210926015545.281083-1-zhenzhong.duan@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5c49d185
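      The shape of the fix (simplified from the vmx code): after the search
      loop 'i' equals the loop bound, so the found entry, not
      guest_uret_msrs[i], must be modified.

        /* Before (broken): 'i' == kvm_nr_uret_msrs here, so this indexes
         * one past the end of the array:
         *      vmx->guest_uret_msrs[i].mask = ~(u64)TSX_CTRL_CPUID_CLEAR;
         */

        /* After: operate on the entry actually found. */
        tsx_ctrl = vmx_find_uret_msr(vmx, MSR_IA32_TSX_CTRL);
        if (tsx_ctrl)
                tsx_ctrl->mask = ~(u64)TSX_CTRL_CPUID_CLEAR;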
  4. 24 Sep, 2021 3 commits
  5. 23 Sep, 2021 6 commits