1. 18 Nov, 2022 27 commits
  2. 17 Nov, 2022 13 commits
    • Merge branch 'kvm-svm-harden' into HEAD · 771a579c
      Paolo Bonzini authored
      This fixes three issues in nested SVM:
      
      1) In the shutdown_interception() vmexit handler we call kvm_vcpu_reset().
      However, if running nested and L1 doesn't intercept shutdown, the function
      resets vcpu->arch.hflags without properly leaving the nested state.
      This leaves the vCPU in an inconsistent state and later triggers a kernel
      panic in SVM code.  The same bug can likely be triggered by sending INIT
      via the local APIC to a vCPU which runs a nested guest.
      
      On VMX we are lucky that the issue can't happen because VMX always
      intercepts triple faults, thus a triple fault in L2 will always be
      redirected to L1.  Plus, handle_triple_fault() doesn't reset the vCPU.
      An INIT IPI can't happen on VMX either because INIT events are masked
      while in VMX mode.
      
      Second, KVM doesn't honour the SHUTDOWN intercept bit of L1 on SVM.
      A normal hypervisor should always intercept SHUTDOWN; a unit test, on
      the other hand, might want not to do so.
      
      Finally, on SVM the guest can trigger a non-rate-limited kernel printk,
      which is fixed as well.
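
      For reference, the SVM handler at the centre of issue 1 looks roughly
      like this after the series (a sketch only; the in-tree
      shutdown_interception() may differ in detail):

        static int shutdown_interception(struct kvm_vcpu *vcpu)
        {
                struct vcpu_svm *svm = to_svm(vcpu);

                /*
                 * The VMCB is undefined after a SHUTDOWN intercept, so
                 * reinitialize the vCPU; kvm_vcpu_reset() now forcibly
                 * leaves nested mode first (see the commits below).
                 */
                clear_page(svm->vmcb);
                kvm_vcpu_reset(vcpu, true);

                vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
                return 0;
        }
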
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: remove exit_int_info warning in svm_handle_exit · 05311ce9
      Maxim Levitsky authored
      It is valid to receive an external interrupt while having a broken IDT
      entry; this will lead to a #GP whose exit_int_info contains the index
      of the IDT entry (i.e. any value).

      Other exceptions can happen as well, like #NP or #SS
      (if the stack switch fails).

      Thus this warning can be user-triggered and has very little value.
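
      The removed check had roughly the following shape (quoted from memory
      as an illustration, not the exact deleted hunk; is_external_interrupt()
      tests the interrupt-type bits of exit_int_info):

        if (is_external_interrupt(svm->vmcb->control.exit_int_info) &&
            exit_code != SVM_EXIT_EXCP_BASE + PF_VECTOR)
                printk(KERN_ERR "%s: unexpected exit_int_info 0x%x exit_code 0x%x\n",
                       __func__, svm->vmcb->control.exit_int_info, exit_code);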
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20221103141351.50662-10-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: add svm part to triple_fault_test · 8357b9e1
      Maxim Levitsky authored
      Add an SVM implementation to triple_fault_test to test that
      emulated/injected shutdown works.

      Since, unlike VMX, SVM allows the hypervisor to avoid intercepting
      shutdown in the guest, don't intercept shutdown, to test that
      KVM supports this correctly.
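
      A minimal sketch of the L1 side, assuming the selftest SVM helpers from
      svm_util.h and the ucall API (the in-tree test may differ):

        static void l2_guest_code(void)
        {
                /* Exit to the host, which then injects the triple fault. */
                GUEST_SYNC(0);
        }

        static void l1_guest_code(struct svm_test_data *svm)
        {
        #define L2_GUEST_STACK_SIZE 64
                unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];

                generic_svm_setup(svm, l2_guest_code,
                                  &l2_guest_stack[L2_GUEST_STACK_SIZE]);

                /* Unlike the VMX variant, deliberately leave
                 * INTERCEPT_SHUTDOWN clear: the injected shutdown must
                 * then take down L1 itself. */
                run_guest(svm->vmcb, svm->vmcb_gpa);
                GUEST_ASSERT(0); /* unreachable: L1 dies on the shutdown */
        }
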
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20221103141351.50662-9-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: allow L1 to not intercept triple fault · 92e7d5c8
      Maxim Levitsky authored
      This is an SVM correctness fix: although a sane L1 would intercept the
      SHUTDOWN event, it doesn't have to, so we have to honour this.
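
      A sketch of the idea on the SVM side, using the existing nested-SVM
      helpers vmcb12_is_intercept() and nested_svm_vmexit() (close in spirit
      to, but not necessarily identical with, the actual patch):

        static bool nested_svm_triple_fault(struct kvm_vcpu *vcpu)
        {
                struct vcpu_svm *svm = to_svm(vcpu);

                /* L1 didn't ask for SHUTDOWN vmexits: don't synthesize
                 * one; the caller resets L1 itself instead. */
                if (!vmcb12_is_intercept(&svm->nested.ctl, INTERCEPT_SHUTDOWN))
                        return false;

                svm->vmcb->control.exit_code = SVM_EXIT_SHUTDOWN;
                svm->vmcb->control.exit_info_1 = 0;
                svm->vmcb->control.exit_info_2 = 0;
                nested_svm_vmexit(svm);
                return true;
        }
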
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20221103141351.50662-8-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: selftests: add svm nested shutdown test · 0bd2d3f4
      Maxim Levitsky authored
      Add a test verifying that on SVM, if L1 doesn't intercept SHUTDOWN,
      then a shutdown raised in L2 crashes L1 and doesn't just crash L2.
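
      The L2 part of such a test can look roughly like this (a sketch assuming
      the selftest processor.h definitions, not the exact in-tree test):

        static void l2_guest_code(void)
        {
                /* Load an empty IDT so the next exception cannot be
                 * delivered and escalates #UD -> #DF -> SHUTDOWN. */
                struct desc_ptr idt = { .size = 0, .address = 0 };

                __asm__ __volatile__("lidt %0" :: "m"(idt));
                __asm__ __volatile__("ud2");
        }
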
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20221103141351.50662-7-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: move idt_entry to header · fc6392d5
      Maxim Levitsky authored
      struct idt_entry will be used by a test which will break the IDT on purpose.
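
      For reference, this is the 64-bit IDT gate layout behind struct
      idt_entry (field names approximate the selftests header):

        struct idt_entry {
                uint16_t offset0;
                uint16_t selector;
                uint16_t ist : 3, zero : 5, type : 5, dpl : 2, p : 1;
                uint16_t offset1;
                uint32_t offset2;
                uint32_t reserved;
        };
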
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20221103141351.50662-6-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: forcibly leave nested mode on vCPU reset · ed129ec9
      Maxim Levitsky authored
      While not obvious, kvm_vcpu_reset() leaves nested mode by clearing
      'vcpu->arch.hflags', but it does so without all the required housekeeping.

      On SVM, it is possible to have a vCPU reset while in guest mode because,
      unlike on VMX, INITs are not latched in SVM non-root mode, and in
      addition L1 doesn't have to intercept triple fault, which should
      also trigger L1's reset if it happens in L2 while L1 didn't intercept it.

      If one of the above conditions happens, KVM will continue to use vmcb02
      while no longer being in guest mode.

      Later IA32_EFER will be cleared, which will lead to freeing of the
      nested guest state, which will (correctly) free the vmcb02; but since
      KVM still uses it (incorrectly), this will lead to a use-after-free
      and a kernel crash.

      This issue is assigned CVE-2022-3344.
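
      Conceptually, the fix makes the reset path leave nested mode with full
      housekeeping before hflags is cleared, along these lines (a sketch of
      the idea only; the actual hunk in kvm_vcpu_reset() may differ):

        void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
        {
                /*
                 * Leave nested mode properly (on SVM this switches back
                 * to vmcb01) instead of merely clearing hflags later.
                 */
                if (is_guest_mode(vcpu))
                        kvm_leave_nested(vcpu);

                /* ... existing reset code ... */
        }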
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20221103141351.50662-5-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: add kvm_leave_nested · f9697df2
      Maxim Levitsky authored
      Add kvm_leave_nested(), which wraps a call to nested_ops->leave_nested()
      in a function.
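
      The wrapper is presumably as small as this (sketch):

        static void kvm_leave_nested(struct kvm_vcpu *vcpu)
        {
                kvm_x86_ops.nested_ops->leave_nested(vcpu);
        }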
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20221103141351.50662-4-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: nSVM: harden svm_free_nested against freeing vmcb02 while still in use · 16ae56d7
      Maxim Levitsky authored
      Make sure that KVM uses vmcb01 before freeing the nested state, and warn
      if that is not the case.

      This is a minimal fix for CVE-2022-3344, making the kernel print a
      warning instead of panicking.
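
      A sketch of the hardening, using names from arch/x86/kvm/svm (the exact
      hunk may differ):

        static void svm_free_nested(struct vcpu_svm *svm)
        {
                /*
                 * If this fires, vmcb02 is about to be freed while KVM is
                 * still running on it; switch back to vmcb01 so the worst
                 * case is a warning rather than a use-after-free.
                 */
                if (WARN_ON_ONCE(svm->vmcb == svm->nested.vmcb02.ptr))
                        svm_switch_vmcb(svm, &svm->vmcb01);

                /* ... free vmcb02 and the rest of the nested state ... */
        }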
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20221103141351.50662-3-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: nSVM: leave nested mode on vCPU free · 917401f2
      Maxim Levitsky authored
      If the VM was terminated while nested, we free the nested state
      while the vCPU is still in nested mode.

      Soon a warning will be added for this condition.
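
      The fix is conceptually a one-liner in the vCPU teardown path (a sketch,
      not the exact hunk):

        static void svm_vcpu_free(struct kvm_vcpu *vcpu)
        {
                /* Leave nested mode before svm_free_nested() runs, so
                 * vmcb01 is current when the nested state is freed. */
                if (is_guest_mode(vcpu))
                        svm_leave_nested(vcpu);

                /* ... existing teardown, including svm_free_nested() ... */
        }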
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20221103141351.50662-2-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Do not recover dirty-tracked NX Huge Pages · eb298605
      David Matlack authored
      Do not recover (i.e. zap) an NX Huge Page that is being dirty tracked,
      as it will just be faulted back in at the same 4KiB granularity when
      accessed by a vCPU. This may need to be changed if KVM ever supports
      2MiB (or larger) dirty tracking granularity, or faulting huge pages
      during dirty tracking for reads/executes. However, for now, these zaps
      are entirely wasteful.
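
      The check amounts to skipping shadow pages whose memslot has dirty
      tracking enabled, roughly as follows (a sketch using existing helpers;
      the actual patch may differ):

        slot = gfn_to_memslot(kvm, sp->gfn);
        if (slot && kvm_slot_dirty_track_enabled(slot))
                continue;       /* would come straight back as 4KiB SPTEs */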
      
      In order to check whether this commit increases the CPU usage of the NX
      recovery worker thread, I used a modified version of execute_perf_test
      [1] that supports splitting guest memory into multiple slots and reports
      /proc/pid/schedstat:se.sum_exec_runtime for the NX recovery worker just
      before tearing down the VM. The goal was to force a large number of NX
      Huge Page recoveries and see if the recovery worker used any more CPU.
      
      Test Setup:
      
        echo 1000 > /sys/module/kvm/parameters/nx_huge_pages_recovery_period_ms
        echo 10 > /sys/module/kvm/parameters/nx_huge_pages_recovery_ratio
      
      Test Command:
      
        ./execute_perf_test -v64 -s anonymous_hugetlb_1gb -x 16 -o
      
              | kvm-nx-lpage-re:se.sum_exec_runtime      |
              | ---------------------------------------- |
      Run     | Before             | After               |
      ------- | ------------------ | ------------------- |
      1       | 730.084105         | 724.375314          |
      2       | 728.751339         | 740.581988          |
      3       | 736.264720         | 757.078163          |
      
      Comparing the median results, this commit results in about a 1% increase
      in CPU usage of the NX recovery worker when testing a VM with 16 slots.
      However, the effect is negligible with the default halving time of NX
      pages, which is 1 hour rather than the 10 seconds given by
      period_ms = 1000, ratio = 10.
      
      [1] https://lore.kernel.org/kvm/20221019234050.3919566-2-dmatlack@google.com/

      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20221103204421.1146958-1-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: simplify kvm_tdp_mmu_map flow when guest has to retry · 63d28a25
      Paolo Bonzini authored
      A removed SPTE is never present, hence the "if" in kvm_tdp_mmu_map
      only fails in the exact same conditions that the earlier loop
      tested in order to issue a "break". So, instead of checking the
      condition twice (an upper-level SPTE could not be created or was
      frozen), just exit the loop with a goto, the usual poor man's C
      replacement for RAII early returns.

      While at it, do not use the "ret" variable for return values of
      functions that do not return a RET_PF_* enum.  This is clearer
      and also makes it possible to initialize ret to RET_PF_RETRY.
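
      The resulting shape is roughly the following (an illustrative sketch,
      not the exact kernel code):

        int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
        {
                struct kvm_mmu *mmu = vcpu->arch.mmu;
                struct tdp_iter iter;
                int ret = RET_PF_RETRY;

                rcu_read_lock();

                tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) {
                        /* ... allocate/link non-leaf SPTEs as needed ... */

                        /* The SPTE was frozen or could not be created:
                         * bail out here, once, instead of re-checking
                         * the same condition after the loop. */
                        if (is_removed_spte(iter.old_spte))
                                goto retry;
                }

                ret = tdp_mmu_map_handle_target_level(vcpu, fault, &iter);

        retry:
                rcu_read_unlock();
                return ret;
        }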
      Suggested-by: Robert Hoo <robert.hu@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault · c4b33d28
      David Matlack authored
      Now that the TDP MMU has a mechanism to split huge pages, use it in the
      fault path when a huge page needs to be replaced with a mapping at a
      lower level.
      
      This change reduces the negative performance impact of NX HugePages.
      Prior to this change, if a vCPU executed from a huge page and NX
      HugePages was enabled, the vCPU would take a fault, zap the huge page,
      and map the faulting address at 4KiB with execute permissions
      enabled. The rest of the memory would be left *unmapped* and have to be
      faulted back in by the guest upon access (read, write, or execute). If
      the guest is backed by 1GiB pages, a single executed instruction can
      therefore zap an entire GiB of its physical address space.
      
      For example, it can take a VM longer to execute from its memory than to
      populate that memory in the first place:
      
      $ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
      
      Populating memory             : 2.748378795s
      Executing from memory         : 2.899670885s
      
      With this change, such faults split the huge page instead of zapping it,
      which avoids the non-present faults on the rest of the huge page:
      
      $ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
      
      Populating memory             : 2.729544474s
      Executing from memory         : 0.111965688s   <---
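
      In the TDP MMU fault path the change looks roughly like this (a sketch;
      the helper names exist in arch/x86/kvm/mmu/tdp_mmu.c but the exact hunk
      may differ):

        if (is_shadow_present_pte(iter.old_spte) && is_large_pte(iter.old_spte)) {
                /* Previously the huge SPTE was zapped here, leaving the
                 * whole range non-present.  Split it to the next lower
                 * level instead, so all other entries stay mapped. */
                if (tdp_mmu_split_huge_page(kvm, &iter, sp, true))
                        goto retry;
        }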
      
      This change also reduces the performance impact of dirty logging when
      eager_page_split=N. eager_page_split=N (abbreviated "eps=N" below) can
      be desirable for read-heavy workloads, as it avoids allocating memory to
      split huge pages that are never written and avoids increasing the TLB
      miss cost on reads of those pages.
      
                   | Config: ept=Y, tdp_mmu=Y, 5% writes           |
                   | Iteration 1 dirty memory time                 |
                   | --------------------------------------------- |
      vCPU Count   | eps=N (Before) | eps=N (After) | eps=Y        |
      ------------ | -------------- | ------------- | ------------ |
      2            | 0.332305091s   | 0.019615027s  | 0.006108211s |
      4            | 0.353096020s   | 0.019452131s  | 0.006214670s |
      8            | 0.453938562s   | 0.019748246s  | 0.006610997s |
      16           | 0.719095024s   | 0.019972171s  | 0.007757889s |
      32           | 1.698727124s   | 0.021361615s  | 0.012274432s |
      64           | 2.630673582s   | 0.031122014s  | 0.016994683s |
      96           | 3.016535213s   | 0.062608739s  | 0.044760838s |
      
      Eager page splitting remains beneficial for write-heavy workloads, but
      the gap is now reduced.
      
                   | Config: ept=Y, tdp_mmu=Y, 100% writes         |
                   | Iteration 1 dirty memory time                 |
                   | --------------------------------------------- |
      vCPU Count   | eps=N (Before) | eps=N (After) | eps=Y        |
      ------------ | -------------- | ------------- | ------------ |
      2            | 0.317710329s   | 0.296204596s  | 0.058689782s |
      4            | 0.337102375s   | 0.299841017s  | 0.060343076s |
      8            | 0.386025681s   | 0.297274460s  | 0.060399702s |
      16           | 0.791462524s   | 0.298942578s  | 0.062508699s |
      32           | 1.719646014s   | 0.313101996s  | 0.075984855s |
      64           | 2.527973150s   | 0.455779206s  | 0.079789363s |
      96           | 2.681123208s   | 0.673778787s  | 0.165386739s |
      
      Further study is needed to determine if the remaining gap is acceptable
      for customer workloads or if eager_page_split=N still requires a priori
      knowledge of the VM workload, especially when considering these costs
      extrapolated out to large VMs with e.g. 416 vCPUs and 12TB RAM.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Reviewed-by: Mingwei Zhang <mizhang@google.com>
      Message-Id: <20221109185905.486172-3-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>