1. 14 Dec, 2020 3 commits
    • KVM: SVM: Remove the call to sev_platform_status() during setup · 9d4747d0
      Tom Lendacky authored
      When both KVM support and the CCP driver are built into the kernel instead
      of as modules, KVM initialization can happen before CCP initialization. As
      a result, sev_platform_status() will return a failure when it is called
      from sev_hardware_setup(), when this isn't really an error condition.
      
      Since sev_platform_status() doesn't need to be called at this time anyway,
      remove the invocation from sev_hardware_setup().
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <618380488358b56af558f2682203786f09a49483.1607620209.git.thomas.lendacky@amd.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      9d4747d0
    • x86/cpu: Add VM page flush MSR availablility as a CPUID feature · 69372cf0
      Tom Lendacky authored
      On systems that do not have hardware enforced cache coherency between
      encrypted and unencrypted mappings of the same physical page, the
      hypervisor can use the VM page flush MSR (0xc001011e) to flush the cache
      contents of an SEV guest page. When a small number of pages are being
      flushed, this can be used in place of issuing a WBINVD across all CPUs.
      
      CPUID 0x8000001f_eax[2] is used to determine if the VM page flush MSR is
      available. Add a CPUID feature to indicate it is supported and define the
      MSR.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <f1966379e31f9b208db5257509c4a089a87d33d0.1607620209.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      69372cf0
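The feature test described in commit 69372cf0 amounts to checking one bit of a CPUID register. The following is a minimal sketch, assuming the caller has already obtained the raw EAX value of CPUID leaf 0x8000001f; the macro and function names are hypothetical and are not the identifiers the kernel patch adds.

```c
#include <stdbool.h>
#include <stdint.h>

/* MSR index quoted in the commit message; the macro name is hypothetical. */
#define EXAMPLE_MSR_VM_PAGE_FLUSH 0xc001011eu

/* Bit position within CPUID 0x8000001f EAX, as quoted in the commit message. */
#define EXAMPLE_VM_PAGE_FLUSH_BIT 2

/* Return true if the raw EAX value reports that the VM page flush MSR exists. */
static bool example_has_vm_page_flush(uint32_t cpuid_8000001f_eax)
{
    return ((cpuid_8000001f_eax >> EXAMPLE_VM_PAGE_FLUSH_BIT) & 1u) != 0;
}
```

A caller would fall back to the WBINVD path whenever this returns false.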
    • KVM/VMX/SVM: Move kvm_machine_check function to x86.h · 3f1a18b9
      Uros Bizjak authored
      Move kvm_machine_check to x86.h to avoid two exact copies
      of the same function in kvm.c and svm.c.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
      Message-Id: <20201029135600.122392-1-ubizjak@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3f1a18b9
  2. 12 Dec, 2020 7 commits
    • Merge tag 'kvm-s390-next-5.11-1' of... · e8614e5e
      Paolo Bonzini authored
      Merge tag 'kvm-s390-next-5.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
      
      KVM: s390: Features and Test for 5.11
      
      - memcg accounting for s390 specific parts of kvm and gmap
      - selftest for diag318
      - new kvm_stat for when async_pf falls back to sync
      
      The selftest even triggers a non-critical bug that is unrelated
      to diag318, fix will follow later.
      e8614e5e
    • KVM: x86: reinstate vendor-agnostic check on SPEC_CTRL cpuid bits · 39485ed9
      Paolo Bonzini authored
      Until commit e7c587da ("x86/speculation: Use synthetic bits for
      IBRS/IBPB/STIBP"), KVM was testing both Intel and AMD CPUID bits before
      allowing the guest to write MSR_IA32_SPEC_CTRL and MSR_IA32_PRED_CMD.
      Testing only Intel bits on VMX processors, or only AMD bits on SVM
      processors, fails if the guests are created with the "opposite" vendor
      as the host.
      
      While at it, also tweak the host CPU check to use the vendor-agnostic
      feature bit X86_FEATURE_IBPB, since we only care about the availability
      of the MSR on the host here and not about specific CPUID bits.
      
      Fixes: e7c587da ("x86/speculation: Use synthetic bits for IBRS/IBPB/STIBP")
      Cc: stable@vger.kernel.org
      Reported-by: Denis V. Lunev <den@openvz.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      39485ed9
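The vendor-agnostic check in commit 39485ed9 can be pictured as an OR over the Intel-defined and AMD-defined guest CPUID bits. This is a minimal sketch only: the struct, its field names, and the particular set of bits are assumptions made for illustration and do not reproduce KVM's data structures or helpers.

```c
#include <stdbool.h>

/* Hypothetical flattened view of a guest's CPUID bits; not a KVM structure. */
struct example_guest_cpuid {
    bool intel_spec_ctrl; /* Intel-defined enumeration (assumed) */
    bool amd_ibrs;        /* AMD-defined enumerations (assumed) */
    bool amd_stibp;
    bool amd_ssbd;
};

/*
 * Allow the guest to write the SPEC_CTRL MSR if either vendor's CPUID
 * bits advertise it, regardless of the host vendor.
 */
static bool example_guest_may_write_spec_ctrl(const struct example_guest_cpuid *g)
{
    return g->intel_spec_ctrl || g->amd_ibrs || g->amd_stibp || g->amd_ssbd;
}
```

With a check of this shape, a guest configured with AMD-style bits on an Intel host (or the reverse) still passes.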
    • KVM: x86: Expose AVX512_FP16 for supported CPUID · 2224fc9e
      Cathy Zhang authored
      AVX512_FP16 is supported by Intel processors such as Sapphire Rapids.
      It can improve performance because it is faster than FP32 when the
      precision and magnitude requirements are met. Its availability
      is indicated by CPUID.(EAX=7,ECX=0):EDX[bit 23].
      
      Expose it in KVM's supported CPUID so that guests can make use of it; no
      new registers are used, only new instructions.
      Signed-off-by: Cathy Zhang <cathy.zhang@intel.com>
      Signed-off-by: Kyung Min Park <kyung.min.park@intel.com>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Reviewed-by: Tony Luck <tony.luck@intel.com>
      Message-Id: <20201208033441.28207-3-kyung.min.park@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2224fc9e
    • x86: Enumerate AVX512 FP16 CPUID feature flag · e1b35da5
      Kyung Min Park authored
      Enumerate the AVX512 half-precision floating point (FP16) CPUID feature
      flag. Compared with using FP32, using FP16 cuts the number of bits
      required for storage in half, reducing the exponent from 8 bits to 5,
      and the mantissa from 23 bits to 10. Using FP16 also enables developers
      to train and run inference on deep learning models faster when the full
      precision or magnitude of FP32 is not needed.
      
      A processor supports AVX512 FP16 if CPUID.(EAX=7,ECX=0):EDX[bit 23]
      is set. AVX512 FP16 requires the AVX512BW feature to be implemented,
      since the instructions for manipulating 32-bit masks are associated with
      AVX512BW.
      
      The only in-kernel usage of this is kvm passthrough. The CPU feature
      flag is shown as "avx512_fp16" in /proc/cpuinfo.
      Signed-off-by: Kyung Min Park <kyung.min.park@intel.com>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Reviewed-by: Tony Luck <tony.luck@intel.com>
      Message-Id: <20201208033441.28207-2-kyung.min.park@intel.com>
      Acked-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e1b35da5
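The bit layout mentioned in commit e1b35da5 (1 sign bit, 5 exponent bits, 10 mantissa bits) can be made concrete with a small software decoder. This is an illustrative sketch of the IEEE 754 binary16 encoding only; it does not use AVX512 FP16 instructions, and the function names are hypothetical.

```c
#include <math.h>   /* only for the NAN and INFINITY macros */
#include <stdint.h>

/* Return 2^e as a float using exact repeated scaling (small |e| only). */
static float example_pow2(int e)
{
    float r = 1.0f;

    while (e > 0) { r *= 2.0f; e--; }
    while (e < 0) { r *= 0.5f; e++; }
    return r;
}

/* Decode a binary16 value: 1 sign bit, 5 exponent bits, 10 mantissa bits. */
static float example_fp16_to_float(uint16_t h)
{
    int sign = (h >> 15) & 0x1;
    int exp = (h >> 10) & 0x1f;   /* biased by 15 */
    int mant = h & 0x3ff;
    float value;

    if (exp == 0x1f)              /* all-ones exponent: infinity or NaN */
        value = mant ? NAN : INFINITY;
    else if (exp == 0)            /* zero or subnormal: mant * 2^-24 */
        value = (float)mant * example_pow2(-24);
    else                          /* normal: (1024 + mant) * 2^(exp - 15 - 10) */
        value = (float)(mant + 1024) * example_pow2(exp - 25);

    return sign ? -value : value;
}
```

The largest finite input, 0x7bff, decodes to 65504, which shows how much magnitude is given up relative to FP32.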
    • selftests: kvm: Merge user_msr_test into userspace_msr_exit_test · fb636053
      Aaron Lewis authored
      Both user_msr_test and userspace_msr_exit_test test the functionality
      of kvm_msr_filter.  Instead of testing this feature in two tests, merge
      them together, so there is only one test for this feature.
      Signed-off-by: Aaron Lewis <aaronlewis@google.com>
      Message-Id: <20201204172530.2958493-1-aaronlewis@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fb636053
    • selftests: kvm: Test MSR exiting to userspace · 3cea1891
      Aaron Lewis authored
      Add a selftest to test that when the ioctl KVM_X86_SET_MSR_FILTER is
      called with an MSR list, those MSRs exit to userspace.
      
      This test uses 3 MSRs to test this:
        1. MSR_IA32_XSS, an MSR the kernel knows about.
        2. MSR_IA32_FLUSH_CMD, an MSR the kernel does not know about.
        3. MSR_NON_EXISTENT, an MSR invented in this test for the purposes of
           passing a fake MSR from the guest to userspace.  KVM just acts as a
           pass through.
      
      Userspace is also able to inject a #GP.  This is demonstrated when
      MSR_IA32_XSS and MSR_IA32_FLUSH_CMD are misused in the test.  When this
      happens a #GP is initiated in userspace to be thrown in the guest which is
      handled gracefully by the exception handling framework introduced earlier
      in this series.
      
      Tests for the generic instruction emulator were also added.  For this to
      work the module parameter kvm.force_emulation_prefix=1 has to be enabled.
      If it isn't enabled the tests will be skipped.
      
      A test was also added to ensure the MSR permission bitmap is being set
      correctly by executing reads and writes of MSR_FS_BASE and MSR_GS_BASE
      in the guest while alternating which MSR userspace should intercept.  If
      the permission bitmap is being set correctly only one of the MSRs should
      be coming through at a time, and the guest should be able to read and
      write the other one directly.
      Signed-off-by: Aaron Lewis <aaronlewis@google.com>
      Reviewed-by: Alexander Graf <graf@amazon.com>
      Message-Id: <20201012194716.3950330-5-aaronlewis@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3cea1891
    • KVM/VMX: Use TEST %REG,%REG instead of CMP $0,%REG in vmenter.S · 6c44221b
      Uros Bizjak authored
      Saves one byte in __vmx_vcpu_run for the same functionality.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
      Message-Id: <20201029140457.126965-1-ubizjak@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6c44221b
  3. 10 Dec, 2020 4 commits
  4. 09 Dec, 2020 1 commit
  5. 03 Dec, 2020 1 commit
  6. 27 Nov, 2020 1 commit
  7. 19 Nov, 2020 2 commits
  8. 16 Nov, 2020 5 commits
    • KVM: SVM: check CR4 changes against vcpu->arch · dc924b06
      Paolo Bonzini authored
      Similarly to what vmx/vmx.c does, use vcpu->arch.cr4 to check if CR4
      bits PGE, PKE and OSXSAVE have changed.  When switching between VMCB01
      and VMCB02, CPUID has to be adjusted every time if CR4.PKE or CR4.OSXSAVE
      change; without this patch, instead, CR4 would be checked against the
      previous value for L2 on vmentry, and against the previous value for
      L1 on vmexit, and CPUID would not be updated.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      dc924b06
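The comparison described in commit dc924b06 reduces to XOR-ing the old and new CR4 values and masking the bits of interest. Below is a minimal sketch, assuming the caller supplies the previously cached CR4 (the role vcpu->arch.cr4 plays in the commit message) and the new value; the macro and function names are hypothetical, while the bit positions are the architectural ones.

```c
#include <stdbool.h>

/* Architectural CR4 bit positions; the macro names are hypothetical. */
#define EXAMPLE_CR4_PGE     (1ul << 7)
#define EXAMPLE_CR4_OSXSAVE (1ul << 18)
#define EXAMPLE_CR4_PKE     (1ul << 22)

/*
 * Return true when a CR4 write flips OSXSAVE or PKE, the two bits the
 * commit message says must trigger a guest CPUID update.
 */
static bool example_cr4_needs_cpuid_update(unsigned long old_cr4,
                                           unsigned long new_cr4)
{
    unsigned long changed = old_cr4 ^ new_cr4;

    return (changed & (EXAMPLE_CR4_OSXSAVE | EXAMPLE_CR4_PKE)) != 0;
}
```

The result is only meaningful if old_cr4 is the value that was last in effect for the vCPU, which is the point the commit makes about comparing against the wrong VMCB's copy.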
    • KVM: SVM: Move asid to vcpu_svm · 7e8e6eed
      Cathy Avery authored
      KVM does not have separate ASIDs for L1 and L2; either the nested
      hypervisor and nested guests share a single ASID, or on older processors
      the ASID is used only to implement TLB flushing.
      
      Either way, ASIDs are handled at the VM level.  In preparation
      for having different VMCBs passed to VMLOAD/VMRUN/VMSAVE for L1 and
      L2, store the current ASID to struct vcpu_svm and only move it to
      the VMCB in svm_vcpu_run.  This way, TLB flushes can be applied
      no matter which VMCB will be active during the next svm_vcpu_run.
      Signed-off-by: Cathy Avery <cavery@redhat.com>
      Message-Id: <20201011184818.3609-2-cavery@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7e8e6eed
    • x86/kvm: remove unused macro HV_CLOCK_SIZE · 789f52c0
      Alex Shi authored
      This macro is unused and can cause a gcc warning:
      arch/x86/kernel/kvmclock.c:47:0: warning: macro "HV_CLOCK_SIZE" is not
      used [-Wunused-macros]
      Let's remove it.
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: x86@kernel.org
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Message-Id: <1604651963-10067-1-git-send-email-alex.shi@linux.alibaba.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      789f52c0
    • KVM: selftests: x86: Set supported CPUIDs on default VM · 22f232d1
      Andrew Jones authored
      Almost all tests do this anyway and the ones that don't don't
      appear to care. Only vmx_set_nested_state_test assumes that
      a feature (VMX) is disabled until later setting the supported
      CPUIDs. It's better to disable that explicitly anyway.
      Signed-off-by: Andrew Jones <drjones@redhat.com>
      Message-Id: <20201111122636.73346-11-drjones@redhat.com>
      [Restore CPUID_VMX, or vmx_set_nested_state breaks. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      22f232d1
    • KVM: selftests: Make test skipping consistent · 08d3e277
      Andrew Jones authored
      Signed-off-by: Andrew Jones <drjones@redhat.com>
      Message-Id: <20201111122636.73346-12-drjones@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      08d3e277
  9. 15 Nov, 2020 16 commits
    • KVM: selftests: Also build dirty_log_perf_test on AArch64 · 87c5f35e
      Andrew Jones authored
      Signed-off-by: Andrew Jones <drjones@redhat.com>
      Message-Id: <20201111122636.73346-10-drjones@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      87c5f35e
    • KVM: selftests: Introduce vm_create_[default_]_with_vcpus · 0aa9ec45
      Andrew Jones authored
      Introduce new vm_create variants that also take a number of vcpus,
      an amount of per-vcpu pages, and optionally a list of vcpuids. These
      variants will create default VMs with enough additional pages to
      cover the vcpu stacks, per-vcpu pages, and pagetable pages for all.
      The new 'default' variant uses VM_MODE_DEFAULT, whereas the other
      new variant accepts the mode as a parameter.
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: Andrew Jones <drjones@redhat.com>
      Message-Id: <20201111122636.73346-6-drjones@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0aa9ec45
    • KVM: selftests: Make vm_create_default common · ec2f18bb
      Andrew Jones authored
      The code is almost 100% the same anyway. Just move it to common
      and add a few arch-specific macros.
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: Andrew Jones <drjones@redhat.com>
      Message-Id: <20201111122636.73346-5-drjones@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ec2f18bb
    • KVM: selftests: always use manual clear in dirty_log_perf_test · f63f0b68
      Paolo Bonzini authored
      Nothing sets USE_CLEAR_DIRTY_LOG anymore, so anything it surrounds
      is dead code.
      
      However, it is the recommended way to use the dirty page bitmap
      for new enough kernels, so use it whenever KVM has the
      KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 capability.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f63f0b68
    • kvm: x86: Sink cpuid update into vendor-specific set_cr4 functions · 2259c17f
      Jim Mattson authored
      On emulated VM-entry and VM-exit, update the CPUID bits that reflect
      CR4.OSXSAVE and CR4.PKE.
      
      This fixes a bug where the CPUID bits could continue to reflect L2 CR4
      values after emulated VM-exit to L1. It also fixes a related bug where
      the CPUID bits could continue to reflect L1 CR4 values after emulated
      VM-entry to L2. The latter bug is mainly relevant to SVM, wherein
      CPUID is not a required intercept. However, it could also be relevant
      to VMX, because the code to conditionally update these CPUID bits
      assumes that the guest CPUID and the guest CR4 are always in sync.
      
      Fixes: 8eb3f87d ("KVM: nVMX: fix guest CR4 loading when emulating L2 to L1 exit")
      Fixes: 2acf923e ("KVM: VMX: Enable XSAVE/XRSTOR for guest")
      Fixes: b9baba86 ("KVM, pkeys: expose CPUID/CR4 to guest")
      Reported-by: Abhiroop Dabral <adabral@paloaltonetworks.com>
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Ricardo Koller <ricarkol@google.com>
      Reviewed-by: Peter Shier <pshier@google.com>
      Cc: Haozhong Zhang <haozhong.zhang@intel.com>
      Cc: Dexuan Cui <dexuan.cui@intel.com>
      Cc: Huaitong Han <huaitong.han@intel.com>
      Message-Id: <20201029170648.483210-1-jmattson@google.com>
      2259c17f
    • selftests: kvm: keep .gitignore add to date · 8aa426e8
      Paolo Bonzini authored
      Add tsc_msrs_test, remove clear_dirty_log_test and alphabetize
      everything.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8aa426e8
    • KVM: selftests: Add "-c" parameter to dirty log test · edd3de6f
      Peter Xu authored
      It's only used to override the existing dirty ring size/count.  With a
      bigger ring count, we test the async path of the dirty ring.  With a
      smaller ring count, we test the ring-full code path.  Async is the default.
      
      It has no use for non-dirty-ring tests.
      Reviewed-by: Andrew Jones <drjones@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20201001012241.6208-1-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      edd3de6f
    • KVM: selftests: Run dirty ring test asynchronously · 019d321a
      Peter Xu authored
      Previously the dirty ring test worked synchronously, because
      only after a vmexit (here, the ring full event) do we know that
      the hardware dirty bits have been flushed to the dirty ring.
      
      With this patch we first introduce a vcpu kick mechanism using SIGUSR1,
      which guarantees a vmexit and therefore also the flushing of hardware
      dirty bits.  Once this is in place, we can keep the vcpu dirty work
      asynchronous to the whole collection procedure.  Still, we need
      to be very careful: when reaching the ring buffer soft limit
      (KVM_EXIT_DIRTY_RING_FULL) we must collect the dirty bits before
      continuing the vcpu.
      
      Further increase the dirty ring size to the current maximum to make sure
      we torture the no-ring-full case more, which should be the major
      scenario when hypervisors like QEMU use this feature.
      Reviewed-by: Andrew Jones <drjones@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20201001012239.6159-1-peterx@redhat.com>
      [Use KVM_SET_SIGNAL_MASK+sigwait instead of a signal handler. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      019d321a
    • KVM: selftests: Add dirty ring buffer test · 84292e56
      Peter Xu authored
      Add the initial dirty ring buffer test.
      
      The current test implements the userspace dirty ring collection, by
      only reaping the dirty ring when the ring is full.
      
      So it's still running synchronously like this:
      
                  vcpu                             main thread
      
        1. vcpu dirties pages
        2. vcpu gets dirty ring full
           (userspace exit)
      
                                             3. main thread waits until full
                                                (so hardware buffers flushed)
                                             4. main thread collects
                                             5. main thread continues vcpu
      
        6. vcpu continues, goes back to 1
      
      We can't directly collect dirty bits during vcpu execution because
      otherwise we can't guarantee the hardware dirty bits were flushed when
      we collect, and since we're very strict on the dirty bits, the
      later verify procedure could fail.  A follow-up patch will make this
      test support async just like the existing dirty log test, by adding
      a vcpu kick mechanism.
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20201001012237.6111-1-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      84292e56
    • KVM: selftests: Introduce after_vcpu_run hook for dirty log test · 60f644fb
      Peter Xu authored
      Provide a hook for the checks after vcpu_run() completes.  Preparation
      for the dirty ring test because we'll need to take care of another
      exit reason.
      Reviewed-by: Andrew Jones <drjones@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20201001012235.6063-1-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      60f644fb
    • KVM: Don't allocate dirty bitmap if dirty ring is enabled · 044c59c4
      Peter Xu authored
      Because kvm dirty rings and the kvm dirty log are mutually exclusive,
      let's avoid creating the dirty_bitmap when the kvm dirty ring is enabled.
      Meanwhile, since the dirty_bitmap will now be created conditionally,
      we can't use it as a sign of "whether this memory slot enabled
      dirty tracking".  Change such users to check against the kvm
      memory slot flags.
      
      Note that a kvm memory slot can still get its dirty_bitmap
      allocated: _if_ the memory slots are created before the dirty
      rings are enabled and with the dirty tracking capability
      enabled, they'll still have the dirty_bitmap.
      However it should not hurt much (e.g., the bitmaps will always be
      freed if they are there), and real users normally won't trigger
      this because the dirty bit tracking flag should in most cases be
      applied to kvm slots only just before migration starts, which should
      be far later than kvm initialization (VM start).
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20201001012226.5868-1-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      044c59c4
    • KVM: Make dirty ring exclusive to dirty bitmap log · b2cc64c4
      Peter Xu authored
      There's no good reason to use both the dirty bitmap logging and the
      new dirty ring buffer to track dirty bits.  We could even support
      both of them at the same time, but that would complicate things
      for little benefit.  Let's simply make it the rule,
      before we enable dirty ring on any arch, that we don't allow these two
      interfaces to be used together.
      
      The global switch is the KVM_CAP_DIRTY_LOG_RING capability
      enablement.  That's where we'll switch from the default dirty logging
      way to the dirty ring way.  As long as kvm->dirty_ring_size is set up
      correctly, we'll once and for all switch to the dirty ring buffer mode
      for the current virtual machine.
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20201001012224.5818-1-peterx@redhat.com>
      [Change errno from EINVAL to ENXIO. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b2cc64c4
    • KVM: X86: Implement ring-based dirty memory tracking · fb04a1ed
      Peter Xu authored
      This patch is heavily based on previous work from Lei Cao
      <lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]
      
      KVM currently uses large bitmaps to track dirty memory.  These bitmaps
      are copied to userspace when userspace queries KVM for its dirty page
      information.  The use of bitmaps is mostly sufficient for live
      migration, as large parts of memory are dirtied from one log-dirty
      pass to another.  However, in a checkpointing system, the number of
      dirty pages is small and in fact it is often bounded---the VM is
      paused when it has dirtied a pre-defined number of pages. Traversing a
      large, sparsely populated bitmap to find set bits is time-consuming,
      as is copying the bitmap to user-space.
      
      A similar issue arises for live migration when the guest memory
      is huge but only a few pages are being dirtied.  In that case, for
      each dirty sync we need to pull the whole dirty bitmap to userspace
      and analyse every bit even if it's mostly zeros.
      
      The preferred data structure for the above scenarios is a dense list of
      guest frame numbers (GFN).  This patch series stores the dirty list in
      kernel memory that can be memory mapped into userspace to allow speedy
      harvesting.
      
      This patch enables dirty ring for X86 only.  However it should be
      easily extended to other archs as well.
      
      [1] https://patchwork.kernel.org/patch/10471409/
      
      Signed-off-by: Lei Cao <lei.cao@stratus.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20201001012222.5767-1-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fb04a1ed
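The "dense list of guest frame numbers" idea in commit fb04a1ed can be illustrated with a toy single-producer, single-consumer ring. This is a sketch under stated assumptions: the struct layout, the ring size, and every name below are hypothetical, and it does not model the kernel's actual entry format, memory mapping, or soft-limit handling.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_RING_SIZE 8u /* hypothetical; must be a power of two */

/* Toy dirty ring: a fixed array of GFNs plus free-running indices. */
struct example_dirty_ring {
    uint64_t gfns[EXAMPLE_RING_SIZE];
    uint32_t head; /* next entry the consumer will read */
    uint32_t tail; /* next entry the producer will write */
};

static bool example_ring_full(const struct example_dirty_ring *r)
{
    return r->tail - r->head == EXAMPLE_RING_SIZE;
}

/* Producer side: record a dirtied GFN; returns false if the ring is full. */
static bool example_ring_push(struct example_dirty_ring *r, uint64_t gfn)
{
    if (example_ring_full(r))
        return false;
    r->gfns[r->tail++ % EXAMPLE_RING_SIZE] = gfn;
    return true;
}

/* Consumer side: copy out up to max pending GFNs and return the count. */
static size_t example_ring_harvest(struct example_dirty_ring *r,
                                   uint64_t *out, size_t max)
{
    size_t n = 0;

    while (r->head != r->tail && n < max)
        out[n++] = r->gfns[r->head++ % EXAMPLE_RING_SIZE];
    return n;
}
```

Harvesting costs time proportional to the number of dirtied pages rather than to the size of guest memory, which is the advantage over scanning a sparse bitmap; a full ring corresponds to the point where the producer has to stop until the consumer has collected.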
    • KVM: Pass in kvm pointer into mark_page_dirty_in_slot() · 28bd726a
      Peter Xu authored
      The context will be needed to implement the kvm dirty ring.
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20201001012044.5151-5-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      28bd726a
    • KVM: remove kvm_clear_guest_page · 2f541442
      Paolo Bonzini authored
      kvm_clear_guest_page is not used anymore after "KVM: X86: Don't track dirty
      for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]", except from kvm_clear_guest.
      We can just inline it in its sole user.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2f541442
    • KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR] · ff5a983c
      Peter Xu authored
      Originally, we have three code paths that can dirty a page without
      vcpu context for X86:
      
        - init_rmode_identity_map
        - init_rmode_tss
        - kvmgt_rw_gpa
      
      init_rmode_identity_map and init_rmode_tss will be set up on the
      destination VM no matter what (and the guest cannot even see them), so
      it does not make sense to track them at all.
      
      To do this, allow __x86_set_memory_region() to return the userspace
      address that was just allocated to the caller.  Then in both of the
      functions we directly write to the userspace address instead of
      calling kvm_write_*() APIs.
      
      Another trivial change is that we don't need to explicitly clear the
      identity page table root in init_rmode_identity_map() because no
      matter what we'll write to the whole page with 4M huge page entries.
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20201001012044.5151-4-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ff5a983c