1. 08 Mar, 2022 2 commits
  2. 04 Mar, 2022 1 commit
  3. 02 Mar, 2022 2 commits
    • KVM: x86: pull kvm->srcu read-side to kvm_arch_vcpu_ioctl_run · 8d25b7be
      Paolo Bonzini authored
      kvm_arch_vcpu_ioctl_run is already doing srcu_read_lock/unlock in two
      places, namely vcpu_run and post_kvm_run_save, and a third is actually
      needed around the call to vcpu->arch.complete_userspace_io to avoid
      the following splat:
      
        WARNING: suspicious RCU usage
        arch/x86/kvm/pmu.c:190 suspicious rcu_dereference_check() usage!
        other info that might help us debug this:
        rcu_scheduler_active = 2, debug_locks = 1
        1 lock held by CPU 28/KVM/370841:
        #0: ff11004089f280b8 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x87/0x730 [kvm]
        Call Trace:
         <TASK>
         dump_stack_lvl+0x59/0x73
         reprogram_fixed_counter+0x15d/0x1a0 [kvm]
         kvm_pmu_trigger_event+0x1a3/0x260 [kvm]
         ? free_moved_vector+0x1b4/0x1e0
         complete_fast_pio_in+0x8a/0xd0 [kvm]
      
      This splat is not at all unexpected, since complete_userspace_io callbacks
      can execute similar code to vmexits.  For example, SVM with nrips=false
      will call into the emulator from svm_skip_emulated_instruction().
      
      While it's tempting to never acquire kvm->srcu for an uninitialized vCPU,
      practically speaking there's no penalty to acquiring kvm->srcu "early"
      as the KVM_MP_STATE_UNINITIALIZED path is a one-time thing per vCPU.  On
      the other hand, seemingly innocuous helpers like kvm_apic_accept_events()
      and sync_regs() can theoretically reach code that might access
      SRCU-protected data structures, e.g. sync_regs() can trigger forced
      exiting of nested mode via kvm_vcpu_ioctl_x86_set_vcpu_events().
      Reported-by: Like Xu <likexu@tencent.com>
      Co-developed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
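
      A minimal sketch of the locking shape described above (an editor's
      illustration, not the actual diff; vcpu_run() and post_kvm_run_save() are
      the existing x86.c helpers named in the message). The point is that a
      single SRCU read-side section now also covers complete_userspace_io():

          int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
          {
                  int r, idx;

                  idx = srcu_read_lock(&vcpu->kvm->srcu);

                  /* ... MP_STATE and sigset handling elided ... */

                  if (vcpu->arch.complete_userspace_io) {
                          int (*cui)(struct kvm_vcpu *) = vcpu->arch.complete_userspace_io;

                          vcpu->arch.complete_userspace_io = NULL;
                          r = cui(vcpu);  /* may touch SRCU-protected data, e.g. the PMU */
                          if (r <= 0)
                                  goto out;
                  }

                  r = vcpu_run(vcpu);     /* no longer takes kvm->srcu itself */

          out:
                  post_kvm_run_save(vcpu);
                  srcu_read_unlock(&vcpu->kvm->srcu, idx);
                  return r;
          }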
    • KVM: x86/mmu: Passing up the error state of mmu_alloc_shadow_roots() · c6c937d6
      Like Xu authored
      Just like on the optional mmu_alloc_direct_roots() path, once the shadow
      path reaches "r = -EIO" somewhere, the caller needs to know the actual
      state in order to enter error handling and avoid something worse.
      
      Fixes: 4a38162e ("KVM: MMU: load PDPTRs outside mmu_lock")
      Signed-off-by: Like Xu <likexu@tencent.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220301124941.48412-1-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
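
      A sketch of the kind of change the message describes (simplified; names
      are taken from the commit message, surrounding code elided): the
      shadow-roots return value is assigned and checked just like the
      direct-roots one, so an -EIO reaches the caller.

          static int kvm_mmu_load(struct kvm_vcpu *vcpu)
          {
                  int r;

                  if (vcpu->arch.mmu->direct_map)
                          r = mmu_alloc_direct_roots(vcpu);
                  else
                          r = mmu_alloc_shadow_roots(vcpu);  /* result was previously dropped */
                  if (r)
                          return r;

                  /* ... sync roots, load the PGD, etc. ... */
                  return 0;
          }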
  4. 01 Mar, 2022 24 commits
    • KVM: SVM: Disable preemption across AVIC load/put during APICv refresh · b652de1e
      Sean Christopherson authored
      Disable preemption when loading/putting the AVIC during an APICv refresh.
      If the vCPU task is preempted and migrated to a different pCPU, the
      unprotected avic_vcpu_load() could set the wrong pCPU in the physical ID
      cache/table.
      
      Pull the necessary code out of avic_vcpu_{,un}blocking() and into a new
      helper to reduce the probability of introducing this exact bug a third
      time.
      
      Fixes: df7e4827 ("KVM: SVM: call avic_vcpu_load/avic_vcpu_put when enabling/disabling AVIC")
      Cc: stable@vger.kernel.org
      Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
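
      A rough sketch of the refresh path described above (illustrative only;
      the exact upstream helper layout differs): preemption is disabled so the
      pCPU recorded in the physical ID cache/table is the one the vCPU is
      actually running on.

          void avic_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
          {
                  /* ... VMCB / inhibit-reason updates elided ... */

                  preempt_disable();
                  if (kvm_vcpu_apicv_active(vcpu))
                          avic_vcpu_load(vcpu, vcpu->cpu);
                  else
                          avic_vcpu_put(vcpu);
                  preempt_enable();
          }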
    • KVM: SVM: Exit to userspace on ENOMEM/EFAULT GHCB errors · aa9f5841
      Sean Christopherson authored
      Exit to userspace if setup_vmgexit_scratch() fails due to OOM or because
      copying data from guest (userspace) memory failed/faulted.  The OOM
      scenario is clearcut, it's userspace's decision as to whether it should
      terminate the guest, free memory, etc...
      
      As for -EFAULT, arguably, any guest issue is a violation of the guest's
      contract with userspace, and thus userspace needs to decide how to
      proceed.  E.g. userspace defines what is RAM vs. MMIO and communicates
      that directly to the guest, KVM is not involved in deciding what is/isn't
      RAM nor in communicating that information to the guest.  If the scratch
      GPA doesn't resolve to a memslot, then the guest is not honoring the
      memory configuration as defined by userspace.
      
      And if userspace unmaps an hva for whatever reason, then exiting to
      userspace with -EFAULT is absolutely the right thing to do.  KVM's ABI
      currently sucks and doesn't provide enough information to act on the
      -EFAULT, but that will hopefully be remedied in the future as there are
      multiple use cases, e.g. uffd and virtiofs truncation, that shouldn't
      require any work in KVM beyond returning -EFAULT with a small amount of
      metadata.
      
      KVM could define its ABI such that failure to access the scratch area is
      reflected into the guest, i.e. establish a contract with userspace, but
      that's undesirable as it limits KVM's options in the future, e.g. in the
      potential uffd case any failure on a uaccess needs to kick out to
      userspace.  KVM does have several cases where it reflects these errors
      into the guest, e.g. kvm_pv_clock_pairing() and Hyper-V emulation, but
      KVM would preferably "fix" those instead of propagating the falsehood
      that any memory failure is the guest's fault.
      
      Lastly, returning a boolean as an "error" from a helper that isn't
      named accordingly never works out well.
      
      Fixes: ad5b3532 ("KVM: SVM: Do not terminate SEV-ES guests on GHCB validation failure")
      Cc: Alper Gun <alpergun@google.com>
      Cc: Peter Gonda <pgonda@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220225205209.3881130-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
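
      A condensed sketch of the resulting control flow (illustrative): a
      negative return from setup_vmgexit_scratch() is propagated out of the
      VMGEXIT handler so the run ioctl fails back to userspace instead of being
      converted into a GHCB error code for the guest.

          ret = setup_vmgexit_scratch(svm, true, control->exit_info_2);
          if (ret)
                  return ret;     /* -ENOMEM or -EFAULT goes straight to userspace */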
    • KVM: WARN if is_unsync_root() is called on a root without a shadow page · 5d6a3221
      Sean Christopherson authored
      WARN and bail if is_unsync_root() is passed a root for which there is no
      shadow page, i.e. is passed the physical address of one of the special
      roots, which do not have an associated shadow page.  The current usage
      squeaks by without bug reports because neither kvm_mmu_sync_roots() nor
      kvm_mmu_sync_prev_roots() calls the helper with pae_root or pml4_root,
      and 5-level AMD CPUs are not generally available, i.e. no one can coerce
      KVM into calling is_unsync_root() on pml5_root.
      
      Note, this doesn't fix the mess with 5-level nNPT, it just (hopefully)
      prevents KVM from crashing.
      
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220225182248.3812651-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
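
      A sketch of the guard described above (simplified): if the root is not
      backed by a shadow page, warn once and report the root as in-sync rather
      than dereferencing a NULL kvm_mmu_page.

          static bool is_unsync_root(hpa_t root)
          {
                  struct kvm_mmu_page *sp;

                  if (!VALID_PAGE(root))
                          return false;

                  sp = to_shadow_page(root);

                  /* PAE/PML4/PML5 "special" roots have no shadow page. */
                  if (WARN_ON_ONCE(!sp))
                          return false;

                  return sp->unsync || sp->unsync_children;
          }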
    • KVM: Drop KVM_REQ_MMU_RELOAD and update vcpu-requests.rst documentation · e65a3b46
      Sean Christopherson authored
      Remove the now unused KVM_REQ_MMU_RELOAD, shift KVM_REQ_VM_DEAD into the
      unoccupied space, and update vcpu-requests.rst, which was missing an
      entry for KVM_REQ_VM_DEAD.  Switching KVM_REQ_VM_DEAD to entry '1' also
      fixes the stale comment about bits 4-7 being reserved.
      Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220225182248.3812651-7-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: s390: Replace KVM_REQ_MMU_RELOAD usage with arch specific request · cc65c3a1
      Sean Christopherson authored
      Add an arch request, KVM_REQ_REFRESH_GUEST_PREFIX, to deal with guest
      prefix changes instead of piggybacking KVM_REQ_MMU_RELOAD.  This will
      allow for the removal of the generic KVM_REQ_MMU_RELOAD, which isn't
      actually used by generic KVM.
      
      No functional change intended.
      Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220225182248.3812651-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Zap only obsolete roots if a root shadow page is zapped · 527d5cd7
      Sean Christopherson authored
      Zap only obsolete roots when responding to zapping a single root shadow
      page.  Because KVM keeps root_count elevated when stuffing a previous
      root into its PGD cache, shadowing a 64-bit guest means that zapping any
      root causes all vCPUs to reload all roots, even if their current root is
      not affected by the zap.
      
      For many kernels, zapping a single root is a frequent operation, e.g. in
      Linux it happens whenever an mm is dropped, e.g. process exits, etc...
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220225182248.3812651-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Drop kvm_reload_remote_mmus(), open code request in x86 users · 2f6f66cc
      Sean Christopherson authored
      Remove the generic kvm_reload_remote_mmus() and open code its
      functionality into the two x86 callers.  x86 is (obviously) the only
      architecture that uses the hook, and is also the only architecture that
      uses KVM_REQ_MMU_RELOAD in a way that's consistent with the name.  That
      will change in a future patch, as x86's usage when zapping a single
      shadow page doesn't actually _need_ to reload all vCPUs' MMUs; only
      MMUs whose root is being zapped actually need to be reloaded.
      
      s390 also uses KVM_REQ_MMU_RELOAD, but for a slightly different purpose.
      
      Drop the generic code in anticipation of implementing s390 and x86 arch
      specific requests, which will allow dropping KVM_REQ_MMU_RELOAD entirely.
      
      Opportunistically reword the x86 TDP MMU comment to avoid making
      references to functions (and requests!) when possible, and to remove the
      rather ambiguous "this".
      
      No functional change intended.
      
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220225182248.3812651-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
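
      The open-coded replacement is tiny; a sketch of what each x86 call site
      now does in place of the dropped helper:

          /* was: kvm_reload_remote_mmus(kvm); */
          kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD);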
    • KVM: x86: Invoke kvm_mmu_unload() directly on CR4.PCIDE change · f6d0a252
      Sean Christopherson authored
      Replace a KVM_REQ_MMU_RELOAD request with a direct kvm_mmu_unload() call
      when the guest's CR4.PCIDE changes.  This will allow tweaking the logic
      of KVM_REQ_MMU_RELOAD to free only obsolete/invalid roots, which is the
      historical intent of KVM_REQ_MMU_RELOAD.  The recent PCIDE behavior is
      the only user of KVM_REQ_MMU_RELOAD that doesn't mark affected roots as
      obsolete, needs to unconditionally unload the entire MMU, _and_ affects
      only the current vCPU.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220225182248.3812651-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/emulator: Move the unhandled outer privilege level logic of far... · 1e326ad4
      Hou Wenlong authored
      KVM: x86/emulator: Move the unhandled outer privilege level logic of far return into __load_segment_descriptor()
      
      Outer-privilege level return is not implemented in the emulator;
      move the unhandled logic into __load_segment_descriptor() to
      make it easier to understand why the checks for RET are
      incomplete.
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
      Message-Id: <5b7188e6388ac9f4567d14eab32db9adf3e00119.1644292363.git.houwenlong.hwl@antgroup.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/emulator: Fix wrong privilege check for code segment in __load_segment_descriptor() · 31c66dab
      Hou Wenlong authored
      A code segment descriptor can be loaded by jmp/call/ret, iret
      and int. The privilege checks differ between those instructions
      outside of real mode. Although the emulator has an x86_transfer_type
      enumeration to differentiate them, it is not really used in
      __load_segment_descriptor(). Note, far jump/call to a call gate,
      task gate or task state segment are not implemented in the emulator.
      
      As for far jump/call to code segment, if DPL > CPL for conforming
      code or (RPL > CPL or DPL != CPL) for non-conforming code, it
      should trigger #GP. The current checks are ok.
      
      As for far return, if RPL < CPL, or DPL > RPL for conforming
      code, or DPL != RPL for non-conforming code, it should trigger #GP.
      Outer-privilege level return is not implemented above virtual-8086
      mode in the emulator, so it implies that RPL <= CPL, but the current
      checks wouldn't trigger #GP if RPL < CPL.
      
      As for code segment loading in task switch, if DPL > RPL for conforming
      code or DPL != RPL for non-conforming code, it should trigger #TS. Since
      segment selector is loaded before segment descriptor when load state from
      tss, it implies that RPL = CPL, so the current checks are ok.
      
      The only problem in the current implementation is the missing RPL < CPL
      check for far return. However, changing the code to follow the manual is
      better.
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
      Message-Id: <e01f5ea70fc1f18f23da1182acdbc5c97c0e5886.1644292363.git.houwenlong.hwl@antgroup.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
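
      A standalone restatement of the far-return rule described above (an
      editor's illustration, not the emulator code): the RPL < CPL case is the
      one the old checks let through.

          #include <stdbool.h>

          static bool far_return_cs_check_fails(unsigned int cpl, unsigned int rpl,
                                                unsigned int dpl, bool conforming)
          {
                  if (rpl < cpl)
                          return true;            /* previously not rejected */
                  if (conforming)
                          return dpl > rpl;       /* conforming code segment */
                  return dpl != rpl;              /* non-conforming code segment */
          }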
    • KVM: x86/emulator: Defer not-present segment check in __load_segment_descriptor() · ca85f002
      Hou Wenlong authored
      Per Intel's SDM "Instruction Set Reference", when loading a segment
      descriptor the not-present segment check should come after all type
      and privilege checks. But the emulator checks it first, so #NP is
      triggered instead of #GP if a privilege check fails and the segment
      is not present. Put the not-present segment check after the type and
      privilege checks in __load_segment_descriptor().
      
      Fixes: 38ba30ba ("KVM: x86 emulator: Emulate task switch in emulator.c")
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
      Message-Id: <52573c01d369f506cadcf7233812427cf7db81a7.1644292363.git.houwenlong.hwl@antgroup.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
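
      A sketch of the resulting order of checks (type_checks_pass() and
      privilege_checks_pass() are placeholders for the emulator's actual
      per-transfer checks): a privilege problem now raises #GP even when the
      segment is also not present.

          /* type and privilege checks come first, both raising #GP on failure */
          if (!type_checks_pass(&seg_desc, transfer) ||
              !privilege_checks_pass(cpl, rpl, dpl, transfer)) {
                  err_vec = GP_VECTOR;
                  goto exception;
          }

          /* only now look at the present bit, so #NP cannot mask a #GP */
          if (!seg_desc.p) {
                  err_vec = (seg == VCPU_SREG_SS) ? SS_VECTOR : NP_VECTOR;
                  goto exception;
          }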
    • KVM: selftests: Add test to verify KVM handling of ICR · 85c68eb4
      Sean Christopherson authored
      The main thing that the selftest verifies is that KVM copies x2APIC's
      ICR[63:32] to/from ICR2 when userspace accesses the vAPIC page via
      KVM_{G,S}ET_LAPIC.  KVM previously split x2APIC ICR to ICR+ICR2 at the
      time of write (from the guest), and so KVM must preserve that behavior
      for backwards compatibility between different versions of KVM.
      
      It will also test other invariants, e.g. that KVM clears the BUSY
      flag on ICR writes, that the reserved bits in ICR2 are dropped on writes
      from the guest, etc...
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220204214205.3306634-12-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Make kvm_lapic_set_reg() a "private" xAPIC helper · b9964ee3
      Sean Christopherson authored
      Hide the lapic's "raw" write helper inside lapic.c to force non-APIC code
      to go through proper helpers when modifying the vAPIC state.  Keep the
      read helper visible to outsiders for now; refactoring KVM to hide it too
      is possible, it will just take more work to do so.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220204214205.3306634-11-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Treat x2APIC's ICR as a 64-bit register, not two 32-bit regs · a57a3168
      Sean Christopherson authored
      Emulate the x2APIC ICR as a single 64-bit register, as opposed to forking
      it across ICR and ICR2 as two 32-bit registers.  This mirrors hardware
      behavior for Intel's upcoming IPI virtualization support, which does not
      split the access.
      
      Previous versions of Intel's SDM and AMD's APM don't explicitly state
      exactly how ICR is reflected in the vAPIC page for x2APIC, KVM just
      happened to speculate incorrectly.
      
      Handling the upcoming behavior is necessary in order to maintain
      backwards compatibility with KVM_{G,S}ET_LAPIC, e.g. failure to shuffle
      the 64-bit ICR to ICR+ICR2 and vice versa would break live migration if
      IPI virtualization support isn't symmetrical across the source and dest.
      
      Cc: Zeng Guang <guang.zeng@intel.com>
      Cc: Chao Gao <chao.gao@intel.com>
      Cc: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220204214205.3306634-10-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
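
      The compatibility shuffle itself is just bit movement; a sketch (helper
      names are illustrative) of how the 64-bit ICR maps to the legacy ICR/ICR2
      pair for KVM_{G,S}ET_LAPIC:

          static void icr64_to_icr_pair(u64 icr, u32 *icr_lo, u32 *icr2)
          {
                  *icr_lo = (u32)icr;             /* vector, delivery mode, BUSY, ... */
                  *icr2   = (u32)(icr >> 32);     /* destination field */
          }

          static u64 icr_pair_to_icr64(u32 icr_lo, u32 icr2)
          {
                  return ((u64)icr2 << 32) | icr_lo;
          }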
    • KVM: x86: Add helpers to handle 64-bit APIC MSR read/writes · 5429478d
      Sean Christopherson authored
      Add helpers to handle 64-bit APIC read/writes via MSRs to deduplicate the
      x2APIC and Hyper-V code needed to service reads/writes to ICR.  Future
      support for IPI virtualization will add yet another path where KVM must
      handle 64-bit APIC MSR reads/writes (to ICR).
      
      Opportunistically fix the comment in the write path; ICR2 holds the
      destination (if there's no shorthand), not the vector.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220204214205.3306634-9-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Make kvm_lapic_reg_{read,write}() static · 70180052
      Sean Christopherson authored
      Make the low level read/write lapic helpers static; any accesses to the
      local APIC from vendor code or non-APIC code should be routed through
      proper helpers.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220204214205.3306634-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: WARN if KVM emulates an IPI without clearing the BUSY flag · bd17f417
      Sean Christopherson authored
      WARN if KVM emulates an IPI without clearing the BUSY flag; failure to do
      so could hang the guest if it waits for the IPI to be sent.
      
      Opportunistically use APIC_ICR_BUSY macro instead of open coding the
      magic number, and add a comment to clarify why kvm_recalculate_apic_map()
      is unconditionally invoked (it's really, really confusing for IPIs due to
      the existence of fast paths that don't trigger a potential recalc).
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220204214205.3306634-7-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Don't rewrite guest ICR on AVIC IPI virtualization failure · b51818af
      Sean Christopherson authored
      Don't bother rewriting the ICR value into the vAPIC page on an AVIC IPI
      virtualization failure, the access is a trap, i.e. the value has already
      been written to the vAPIC page.  The one caveat is if hardware left the
      BUSY flag set (which appears to happen somewhat arbitrarily), in which
      case go through the "nodecode" APIC-write path in order to clear the BUSY
      flag.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220204214205.3306634-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Use common kvm_apic_write_nodecode() for AVIC write traps · ed60920e
      Sean Christopherson authored
      Use the common kvm_apic_write_nodecode() to handle AVIC/APIC-write traps
      instead of open coding the same exact code.  This will allow making the
      low level lapic helpers inaccessible outside of lapic.c code.
      
      Opportunistically clean up the params to eliminate a bunch of svm=>vcpu
      reflection.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220204214205.3306634-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Use "raw" APIC register read for handling APIC-write VM-Exit · b031f104
      Sean Christopherson authored
      Use the "raw" helper to read the vAPIC register after an APIC-write trap
      VM-Exit.  Hardware is responsible for vetting the write, and the caller
      is responsible for sanitizing the offset.  This is a functional change,
      as it means KVM will consume whatever happens to be in the vAPIC page if
      the write was dropped by hardware.  But, unless userspace deliberately
      wrote garbage into the vAPIC page via KVM_SET_LAPIC, the value should be
      zero since it's not writable by the guest.
      
      This aligns common x86 with SVM's AVIC logic, i.e. paves the way for
      using the nodecode path to handle APIC-write traps when AVIC is enabled.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220204214205.3306634-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Handle APIC-write offset wrangling in VMX code · b5ede3df
      Sean Christopherson authored
      Move the vAPIC offset adjustments done in the APIC-write trap path from
      common x86 to VMX in anticipation of using the nodecode path for SVM's
      AVIC.  The adjustment reflects hardware behavior, i.e. it's technically a
      property of VMX, not common x86.  SVM's AVIC behavior is identical, so
      it's a bit of a moot point, the goal is purely to make it easier to
      understand why the adjustment is ok.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220204214205.3306634-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Do not change ICR on write to APIC_SELF_IPI · d22a81b3
      Paolo Bonzini authored
      Emulating writes to SELF_IPI with a write to ICR has an unwanted side
      effect: the value of ICR in the vAPIC page gets changed.  The SDM lists
      SELF_IPI as write-only, with no associated MMIO offset, so any write
      should have no visible side effect in the vAPIC page.
      Reported-by: Chao Gao <chao.gao@intel.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
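
      A sketch of the write handler after the change (simplified from the
      description above): the self-IPI is delivered directly and nothing is
      written into ICR.

          case APIC_SELF_IPI:
                  /* Valid only in x2APIC mode; deliver directly, leave ICR alone. */
                  if (apic_x2apic_mode(apic))
                          kvm_apic_send_ipi(apic,
                                            APIC_DEST_SELF | (val & APIC_VECTOR_MASK), 0);
                  else
                          ret = 1;
                  break;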
    • KVM: x86: Fix emulation in writing cr8 · f66af9f2
      Zhenzhong Duan authored
      In the emulation of writes to CR8, one of the lowest four bits of TPR
      (TPR[3:0]) is kept rather than cleared.
      
      According to Intel SDM 10.8.6.1 (bare metal scenario):
      "APIC.TPR[bits 7:4] = CR8[bits 3:0], APIC.TPR[bits 3:0] = 0";
      
      and SDM 28.3 (use TPR shadow):
      "MOV to CR8. The instruction stores bits 3:0 of its source operand into
      bits 7:4 of VTPR; the remainder of VTPR (bits 3:0 and bits 31:8) are
      cleared.";
      
      and AMD's APM 16.6.4:
      "Task Priority Sub-class (TPS)-Bits 3 : 0. The TPS field indicates the
      current sub-priority to be used when arbitrating lowest-priority messages.
      This field is written with zero when TPR is written using the architectural
      CR8 register.";
      
      so in the KVM emulated scenario, clear TPR[3:0] to make the behavior
      consistent with the other scenarios.

      This doesn't impact evaluation and delivery of pending virtual interrupts
      because the processor does not use the processor-priority sub-class to
      determine which interrupts to deliver and which to inhibit.

      The sub-class is used by hardware to arbitrate lowest-priority interrupts,
      but KVM just does a round-robin style delivery.
      
      Fixes: b93463aa ("KVM: Accelerated apic support")
      Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220210094506.20181-1-zhenzhong.duan@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
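
      A standalone illustration of the TPR update described above: CR8[3:0]
      lands in TPR[7:4] and TPR[3:0] is cleared, matching the SDM/APM text
      quoted in the message. This compiles with any C compiler.

          #include <assert.h>
          #include <stdint.h>

          static uint8_t cr8_to_tpr(uint64_t cr8)
          {
                  return (uint8_t)((cr8 & 0x0f) << 4);    /* TPR[3:0] forced to zero */
          }

          int main(void)
          {
                  assert(cr8_to_tpr(0x3) == 0x30);
                  assert(cr8_to_tpr(0xf) == 0xf0);
                  return 0;
          }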
    • KVM: x86: flush TLB separately from MMU reset · b5f61c03
      Paolo Bonzini authored
      For both CR0 and CR4, disassociate the TLB flush logic from the
      MMU role logic.  Instead of relying on kvm_mmu_reset_context() being
      a superset of various TLB flushes (which is not necessarily going to
      be the case in the future), always call it if the role changes
      but also set the various TLB flush requests according to what is
      in the manual.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  5. 25 Feb, 2022 11 commits
    • KVM: x86: Yield to IPI target vCPU only if it is busy · 9ee83635
      Li RongQing authored
      When sending a call-function IPI to many vCPUs, yield to the
      IPI target vCPU if it is marked as preempted.

      But when emulating HLT, an idling vCPU will be voluntarily
      scheduled out and marked as preempted from the guest kernel's
      perspective. Yielding to an idle vCPU is pointless, increases
      unnecessary vmexits, and may miss the truly preempted vCPU.

      So yield to the IPI target vCPU only if that vCPU is busy and preempted.
      Signed-off-by: Li RongQing <lirongqing@baidu.com>
      Message-Id: <1644380201-29423-1-git-send-email-lirongqing@baidu.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
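
      A sketch of the guest-side logic described above (close to
      arch/x86/kernel/kvm.c, simplified): yield only to a target vCPU that is
      both busy and preempted.

          static void kvm_smp_send_call_func_ipi(const struct cpumask *mask)
          {
                  int cpu;

                  native_send_call_func_ipi(mask);

                  /* Make sure other vCPUs get a chance to run if they need to. */
                  for_each_cpu(cpu, mask) {
                          if (!idle_cpu(cpu) && vcpu_is_preempted(cpu)) {
                                  kvm_hypercall1(KVM_HC_SCHED_YIELD,
                                                 per_cpu(x86_cpu_to_apicid, cpu));
                                  break;
                          }
                  }
          }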
    • x86/kvmclock: Fix Hyper-V Isolated VM's boot issue when vCPUs > 64 · 92e68cc5
      Dexuan Cui authored
      When Linux runs as an Isolated VM on Hyper-V, it supports AMD SEV-SNP
      but it's partially enlightened, i.e. cc_platform_has(
      CC_ATTR_GUEST_MEM_ENCRYPT) is true but sev_active() is false.
      
      Commit 4d96f910 per se is good, but with it now
      kvm_setup_vsyscall_timeinfo() -> kvmclock_init_mem() calls
      set_memory_decrypted(), and later gets stuck when trying to zero out
      the pages pointed to by 'hvclock_mem', if Linux runs as an Isolated VM
      on Hyper-V. The cause is that here now the Linux VM should no longer
      access the original guest physical address (GPA); instead the VM should
      do memremap() and access the original GPA + ms_hyperv.shared_gpa_boundary:
      see the example code in drivers/hv/connection.c: vmbus_connect() or
      drivers/hv/ring_buffer.c: hv_ringbuffer_init(). If the VM tries to
      access the original GPA, it keeps getting a fault injected by Hyper-V
      and gets stuck there.
      
      Here the issue happens only when the VM has >= 65 vCPUs, because the
      global static array hv_clock_boot[] can hold 64 "struct
      pvclock_vsyscall_time_info" entries (the size of the struct is 64 bytes),
      so kvmclock_init_mem() only allocates memory in the case of vCPUs > 64.
      
      Since the 'hvclock_mem' pages are only useful when the kvm clock is
      supported by the underlying hypervisor, fix the issue by returning
      early when the Linux VM runs on Hyper-V, which doesn't support the kvm clock.
      
      Fixes: 4d96f910 ("x86/sev: Replace occurrences of sev_active() with cc_platform_has()")
      Tested-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com>
      Signed-off-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com>
      Signed-off-by: Dexuan Cui <decui@microsoft.com>
      Message-Id: <20220225084600.17817-1-decui@microsoft.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
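
      A sketch of the early return described above (the exact guard condition
      here is illustrative, not the upstream one): skip hvclock_mem setup when
      the kvm clock isn't actually provided by the hypervisor, e.g. when Linux
      runs as a Hyper-V Isolated VM.

          static int __init kvm_setup_vsyscall_timeinfo(void)
          {
                  if (!kvm_para_available() || !kvmclock)
                          return 0;

                  kvmclock_init_mem();
                  /* ... vDSO clock-mode setup elided ... */
                  return 0;
          }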
    • x86/kvm: Don't waste memory if kvmclock is disabled · 3c51d0a6
      Wanpeng Li authored
      Even if "no-kvmclock" is passed in cmdline parameter, the guest kernel
      still allocates hvclock_mem which is scaled by the number of vCPUs,
      let's check kvmclock enable in advance to avoid this memory waste.
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1645520523-30814-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/kvm: Don't use PV TLB/yield when mwait is advertised · 40cd58db
      Wanpeng Li authored
      MWAIT is advertised when the host is not overcommitted; however, PV
      TLB/sched yield should be enabled in the host-overcommitted scenario.
      Add an MWAIT check when enabling PV TLB/sched yield.
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1645777780-2581-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
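
      A sketch of the gate described above (simplified from the guest-side
      setup code): treat an exposed MWAIT as a hint that the host is not
      overcommitted and skip the PV optimizations.

          static bool pv_tlb_flush_supported(void)
          {
                  return kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
                         !kvm_para_has_hint(KVM_HINTS_REALTIME) &&
                         kvm_para_has_feature(KVM_FEATURE_STEAL_TIME) &&
                         !boot_cpu_has(X86_FEATURE_MWAIT);
          }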
    • Merge tag 'kvmarm-fixes-5.17-4' of... · ece32a75
      Paolo Bonzini authored
      Merge tag 'kvmarm-fixes-5.17-4' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
      
      KVM/arm64 fixes for 5.17, take #4
      
      - Correctly synchronise PMR and co on PSCI CPU_SUSPEND
      
      - Skip tests that depend on GICv3 when the HW isn't available
    • KVM: x86/mmu: clear MMIO cache when unloading the MMU · 6d58f275
      Paolo Bonzini authored
      For cleanliness, do not leave a stale GVA in the cache after all the roots are
      cleared.  In practice, kvm_mmu_load will go through kvm_mmu_sync_roots if
      paging is on, and will not use vcpu_match_mmio_gva at all if paging is off.
      However, leaving data in the cache might cause bugs in the future.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Always use current mmu's role when loading new PGD · d2e5f333
      Paolo Bonzini authored
      Since the guest PGD is now loaded after the MMU has been set up
      completely, the desired role for a cache hit is simply the current
      mmu_role.  There is no need to compute it again, so __kvm_mmu_new_pgd
      can be folded in kvm_mmu_new_pgd.
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: load new PGD after the shadow MMU is initialized · 3cffc89d
      Paolo Bonzini authored
      Now that __kvm_mmu_new_pgd does not look at the MMU's root_level and
      shadow_root_level anymore, pull the PGD load after the initialization of
      the shadow MMUs.
      
      Besides being more intuitive, this enables future simplifications
      and optimizations because it's not necessary anymore to compute the
      role outside kvm_init_mmu.  In particular, kvm_mmu_reset_context was not
      attempting to use a cached PGD to avoid having to figure out the new role.
      With this change, it could follow what nested_{vmx,svm}_load_cr3 are doing,
      and avoid unloading all the cached roots.
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: look for a cached PGD when going from 32-bit to 64-bit · 5499ea73
      Paolo Bonzini authored
      Right now, PGD caching avoids placing a PAE root in the cache by using the
      old value of mmu->root_level and mmu->shadow_root_level; it does not look
      for a cached PGD if the old root is a PAE one, and then frees it using
      kvm_mmu_free_roots.
      
      Change the logic instead to free the uncacheable root early.
      This way, __kvm_mmu_new_pgd is able to look up the cache when going from
      32-bit to 64-bit (if there is a hit, the invalid root becomes the least
      recently used).  An example of this is nested virtualization with shadow
      paging, when a 64-bit L1 runs a 32-bit L2.
      
      As a side effect (which is actually the reason why this patch was
      written), PGD caching does not use the old value of mmu->root_level
      and mmu->shadow_root_level anymore.
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: do not pass vcpu to root freeing functions · 0c1c92f1
      Paolo Bonzini authored
      These functions only operate on a given MMU, of which there is more
      than one in a vCPU (we care about two, because the third does not have
      any roots and is only used to walk guest page tables).  They do need a
      struct kvm in order to lock the mmu_lock, but they do not need anything
      else in the struct kvm_vcpu.  So, pass the vcpu->kvm directly to them.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: do not consult levels when freeing roots · 594bef79
      Paolo Bonzini authored
      Right now, PGD caching requires a complicated dance of first computing
      the MMU role and passing it to __kvm_mmu_new_pgd(), and then separately calling
      kvm_init_mmu().
      
      Part of this is due to kvm_mmu_free_roots using mmu->root_level and
      mmu->shadow_root_level to distinguish whether the page table uses a single
      root or 4 PAE roots.  Because kvm_init_mmu() can overwrite mmu->root_level,
      kvm_mmu_free_roots() must be called before kvm_init_mmu().
      
      However, even after kvm_init_mmu() there is a way to detect whether the
      page table may hold PAE roots, as root.hpa isn't backed by a shadow when
      it points at PAE roots.  Using this method results in simpler code, and
      is one less obstacle in moving all calls to __kvm_mmu_new_pgd() after the
      MMU has been initialized.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>