1. 06 Aug, 2018 12 commits
    • kvm: nVMX: Introduce KVM_CAP_NESTED_STATE · 8fcc4b59
      Jim Mattson authored
      For nested virtualization, L0 KVM manages a bit of state for L2 guests
      that cannot be captured through the currently available IOCTLs. In
      fact, the state captured through all of these IOCTLs is usually a mix of L1
      and L2 state. It also depends on whether the L2 guest was running at
      the moment when the process was interrupted to save its state.
      
      With this capability, there are two new vcpu ioctls: KVM_GET_NESTED_STATE
      and KVM_SET_NESTED_STATE. These can be used for saving and restoring a VM
      that is in VMX operation.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Jim Mattson <jmattson@google.com>
      [karahmed@ - rename structs and functions and make them ready for AMD and
                   address previous comments.
                 - handle nested.smm state.
                 - rebase & a bit of refactoring.
                 - Merge 7/8 and 8/8 into one patch. ]
      Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: do not load vmcs12 pages while still in SMM · 7f7f1ba3
      Paolo Bonzini authored
      If the vCPU enters system management mode while running a nested guest,
      RSM starts processing the vmentry while still in SMM.  In that case,
      however, the pages pointed to by the vmcs12 might be incorrectly
      loaded from SMRAM.  To avoid this, delay the handling of the pages
      until just before the next vmentry.  This is done with a new request
      and a new entry in kvm_x86_ops, which we will be able to reuse for
      nested VMX state migration.
      
      Extracted from a patch by Jim Mattson and KarimAllah Ahmed.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
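
The deferral described in this message can be sketched as a pending-request bit that is set while emulating RSM and serviced just before the next guest entry. This is a minimal illustration, not the patch itself: `toy_vcpu`, `toy_rsm`, `toy_pre_vmentry` and `REQ_LOAD_NESTED_PAGES` are hypothetical names, and the page loading is reduced to a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical request bit; the message does not name the real request. */
#define REQ_LOAD_NESTED_PAGES (UINT64_C(1) << 0)

struct toy_vcpu {
    uint64_t requests;  /* pending work, one bit per request */
    bool in_smm;        /* true while the vCPU is in system management mode */
    int pages_loaded;   /* stands in for mapping the vmcs12 pages */
};

/* Emulating RSM: only record that the pages still have to be loaded. */
static void toy_rsm(struct toy_vcpu *v)
{
    v->requests |= REQ_LOAD_NESTED_PAGES;
    v->in_smm = false;  /* RSM completes and the vCPU leaves SMM */
}

/* Just before guest entry: service the deferred request, if any. */
static void toy_pre_vmentry(struct toy_vcpu *v)
{
    if (v->requests & REQ_LOAD_NESTED_PAGES) {
        assert(!v->in_smm);  /* no longer at risk of reading from SMRAM */
        v->requests &= ~REQ_LOAD_NESTED_PAGES;
        v->pages_loaded++;
    }
}
```

Because the request is cleared once it has been serviced, repeated guest entries do not reload the pages.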
    • kvm: selftests: add basic test for state save and restore · fa3899ad
      Paolo Bonzini authored
      The test calls KVM_RUN repeatedly, and creates an entirely new VM with the
      old memory and vCPU state on every exit to userspace.  The kvm_util API is
      expanded with two functions that manage the lifetime of a kvm_vm struct:
      the first closes the file descriptors and leaves the memory allocated,
      and the second opens the file descriptors and reuses the memory from
      the previous incarnation of the kvm_vm struct.
      
      For now the test is very basic, as it does not test, for example, XSAVE or
      vCPU events.  However, it will test nested virtualization state starting
      with the next patch.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: selftests: ensure vcpu file is released · 0a505fe6
      Paolo Bonzini authored
      The selftests were not munmap-ing the kvm_run area from the vcpu file descriptor.
      The result was that kvm_vcpu_release was not called and a reference was left in the
      parent "struct kvm".  Ultimately this was visible in the upcoming state save/restore
      test as an error when KVM attempted to create a duplicate debugfs entry.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: selftests: actually use all of lib/vmx.c · 87ccb7db
      Paolo Bonzini authored
      The allocation of the VMXON and VMCS is currently done twice, in
      lib/vmx.c and in vmx_tsc_adjust_test.c.  Reorganize the code to
      provide a cleaner and easier to use API to the tests.  lib/vmx.c
      now does the complete setup of the VMX data structures, but does not
      create the VM or set CPUID.  This has to be done by the caller.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: selftests: create a GDT and TSS · 2305339e
      Paolo Bonzini authored
      The GDT and the TSS base were left at zero, and this has interesting effects
      when the TSS descriptor is later read to set up a VMCS's TR_BASE.  Basically
      it worked by chance, and this patch fixes it by setting up all the protected
      mode data structures properly.
      
      Because the GDT and TSS addresses are virtual, the page tables now always
      exist at the time of vcpu setup.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: ensure all MSRs can always be KVM_GET/SET_MSR'd · 44883f01
      Paolo Bonzini authored
      Some of the MSRs returned by GET_MSR_INDEX_LIST currently cannot be sent back
      to KVM_GET_MSR and/or KVM_SET_MSR; either they can never be sent back, or
      they are only accepted under special conditions.  This makes the API a pain to
      use.
      
      To avoid this pain, this patch makes it so that the result of the get-list
      ioctl can always be used for host-initiated get and set.  Since we don't have
      a separate way to check for read-only MSRs, this means some Hyper-V MSRs are
      ignored when written.  Arguably they should not even be in the result of
      GET_MSR_INDEX_LIST, but I am leaving them there in case userspace is using the
      outcome of GET_MSR_INDEX_LIST to derive the support for the corresponding
      Hyper-V feature.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: vmx: remove save/restore of host BNDCGFS MSR · cf81a7e5
      Sean Christopherson authored
      Linux does not support Memory Protection Extensions (MPX) in the
      kernel itself, thus the BNDCFGS (Bound Config Supervisor) MSR will
      always be zero in the KVM host, i.e. RDMSR in vmx_save_host_state()
      is superfluous.  KVM unconditionally sets VM_EXIT_CLEAR_BNDCFGS,
      i.e. BNDCFGS will always be zero after VMEXIT, thus manually loading
      BNDCFGS is also superfluous.
      
      And in the event that MPX kernel support is added (unlikely given
      that MPX for userspace is in its death throes[1]), BNDCFGS will
      likely be common across all CPUs[2], and at the least shouldn't
      change on a regular basis, i.e. saving the MSR on every VMENTRY is
      completely unnecessary.
      
      WARN_ONCE in hardware_setup() if the host's BNDCFGS is non-zero to
      document that KVM does not preserve BNDCFGS and to serve as a hint
      as to how BNDCFGS likely should be handled if MPX is used in the
      kernel, e.g. BNDCFGS should be saved once during KVM setup.
      
      [1] https://lkml.org/lkml/2018/4/27/1046
      [2] http://www.openwall.com/lists/kernel-hardening/2017/07/24/28

      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Switch 'requests' to be 64-bit (explicitly) · 86dafed5
      KarimAllah Ahmed authored
      Switch 'requests' to be explicitly 64-bit and update BUILD_BUG_ON check to
      use the size of "requests" instead of the hard-coded '32'.
      
      That gives us a bit more room again for arch-specific requests as we
      already ran out of space for x86 due to the hard-coded check.
      
      The only exception here is ARM32 as it is still 32-bits.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
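
The idea of deriving the compile-time check from the field's size, rather than from a hard-coded 32, can be sketched in plain C11 as follows. This is an illustration under assumptions, not the kernel code: `toy_vcpu64`, `TOY_REQ_BITS`, `TOY_REQ_MAX` and the helper functions are hypothetical, and `static_assert` stands in for the kernel's BUILD_BUG_ON.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_vcpu64 {
    uint64_t requests;  /* explicitly 64-bit, independent of 'long' */
};

/* Number of request bits, derived from the field itself. */
#define TOY_REQ_BITS (sizeof(((struct toy_vcpu64 *)0)->requests) * CHAR_BIT)

/* Highest request number used by this sketch (hypothetical value). */
#define TOY_REQ_MAX 40

/* Fails to compile if a request number no longer fits in 'requests'. */
static_assert(TOY_REQ_MAX < TOY_REQ_BITS,
              "request number does not fit in 'requests'");

/* Mark request 'req' as pending. */
static void toy_make_request(struct toy_vcpu64 *v, unsigned int req)
{
    v->requests |= UINT64_C(1) << req;
}

/* Return true and clear the bit if request 'req' was pending. */
static bool toy_check_request(struct toy_vcpu64 *v, unsigned int req)
{
    uint64_t mask = UINT64_C(1) << req;

    if (!(v->requests & mask))
        return false;
    v->requests &= ~mask;
    return true;
}
```

With a 32-bit field, the assertion above would reject `TOY_REQ_MAX` at build time, which is the kind of limit the message says x86 had already run into.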
    • kvm: selftests: add cr4_cpuid_sync_test · ca359066
      Wei Huang authored
      KVM is supposed to update some guest VM's CPUID bits (e.g. OSXSAVE) when
      CR4 is changed. A bug was found in KVM recently and it was fixed by
      Commit c4d21882 ("KVM: x86: Update cpuid properly when CR4.OSXAVE or
      CR4.PKE is changed"). This patch adds a test to verify the synchronization
      between guest VM's CR4 and CPUID bits.
      Signed-off-by: Wei Huang <wei@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
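
The synchronization that the test verifies can be sketched as a pure function from CR4 to the affected CPUID bits. The bit positions are the architectural ones (CR4.OSXSAVE is bit 18, CR4.PKE is bit 22, CPUID.01H:ECX.OSXSAVE is bit 27, CPUID.(EAX=07H,ECX=0):ECX.OSPKE is bit 4); the struct and function names are hypothetical, and this is not the selftest's or KVM's actual code.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_CR4_OSXSAVE        (UINT32_C(1) << 18)  /* CR4.OSXSAVE */
#define TOY_CR4_PKE            (UINT32_C(1) << 22)  /* CR4.PKE */
#define TOY_CPUID_1_ECX_OSXSAVE (UINT32_C(1) << 27) /* CPUID.01H:ECX.OSXSAVE */
#define TOY_CPUID_7_ECX_OSPKE   (UINT32_C(1) << 4)  /* CPUID.07H:ECX.OSPKE */

/* Only the two registers that mirror CR4 bits are modeled here. */
struct toy_cpuid {
    uint32_t leaf1_ecx;
    uint32_t leaf7_ecx;
};

/* Recompute the CPUID bits that must track the guest's CR4. */
static void toy_sync_cpuid(struct toy_cpuid *c, uint64_t cr4)
{
    c->leaf1_ecx &= ~TOY_CPUID_1_ECX_OSXSAVE;
    if (cr4 & TOY_CR4_OSXSAVE)
        c->leaf1_ecx |= TOY_CPUID_1_ECX_OSXSAVE;

    c->leaf7_ecx &= ~TOY_CPUID_7_ECX_OSPKE;
    if (cr4 & TOY_CR4_PKE)
        c->leaf7_ecx |= TOY_CPUID_7_ECX_OSPKE;
}
```

A test in this style writes CR4 and then checks that the CPUID bit follows in both directions, set and clear.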
    • Merge tag 'v4.18-rc6' into HEAD · d2ce98ca
      Paolo Bonzini authored
      Pull bug fixes into the KVM development tree to avoid nasty conflicts.
  2. 02 Aug, 2018 2 commits
  3. 30 Jul, 2018 15 commits
  4. 26 Jul, 2018 3 commits
    • KVM: PPC: Book3S HV: Read kvm->arch.emul_smt_mode under kvm->lock · b5c6f760
      Paul Mackerras authored
      Commit 1e175d2e ("KVM: PPC: Book3S HV: Pack VCORE IDs to access full
      VCPU ID space", 2018-07-25) added code that uses kvm->arch.emul_smt_mode
      before any VCPUs are created.  However, userspace can change
      kvm->arch.emul_smt_mode at any time up until the first VCPU is created.
      Hence it is (theoretically) possible for the check in
      kvmppc_core_vcpu_create_hv() to race with another userspace thread
      changing kvm->arch.emul_smt_mode.
      
      This fixes it by moving the test that uses kvm->arch.emul_smt_mode into
      the block where kvm->lock is held.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Allow creating max number of VCPUs on POWER9 · 1ebe6b81
      Paul Mackerras authored
      Commit 1e175d2e ("KVM: PPC: Book3S HV: Pack VCORE IDs to access full
      VCPU ID space", 2018-07-25) allowed use of VCPU IDs up to
      KVM_MAX_VCPU_ID on POWER9 in all guest SMT modes and guest emulated
      hardware SMT modes.  However, with the current definition of
      KVM_MAX_VCPU_ID, a guest SMT mode of 1 and an emulated SMT mode of 8,
      it is only possible to create KVM_MAX_VCPUS / 2 VCPUs, because
      threads_per_subcore is 4 on POWER9 CPUs.  (Using an emulated SMT mode
      of 8 is useful when migrating VMs to or from POWER8 hosts.)
      
      This increases KVM_MAX_VCPU_ID to 8 * KVM_MAX_VCPUS when HV KVM is
      configured in, so that a full complement of KVM_MAX_VCPUS VCPUs can
      be created on POWER9 in all guest SMT modes and emulated hardware
      SMT modes.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
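
The arithmetic behind the KVM_MAX_VCPUS / 2 limit can be checked with a small counting sketch. It assumes that VCPU IDs are laid out as core index times the emulated SMT mode plus thread index, that an ID is accepted only if it is strictly below the ID limit, and that the previous limit was 4 * KVM_MAX_VCPUS (inferred from the reference to threads_per_subcore being 4). The function name and the value 2048 used in the usage note are hypothetical.

```c
#include <assert.h>

/*
 * Count how many of 'max_vcpus' VCPUs get an ID below 'max_vcpu_id'
 * when the guest uses 'guest_smt' threads per core and cores are
 * spaced 'emul_smt' IDs apart.
 */
static unsigned int toy_creatable_vcpus(unsigned int max_vcpus,
                                        unsigned int max_vcpu_id,
                                        unsigned int guest_smt,
                                        unsigned int emul_smt)
{
    unsigned int count = 0;

    assert(guest_smt >= 1 && guest_smt <= emul_smt);
    for (unsigned int i = 0; i < max_vcpus; i++) {
        /* Assumed layout: core index * stride + thread index. */
        unsigned int id = (i / guest_smt) * emul_smt + (i % guest_smt);

        if (id < max_vcpu_id)
            count++;
    }
    return count;
}
```

For example, with 2048 VCPUs, guest SMT mode 1 and emulated SMT mode 8, a limit of 4 * 2048 admits 1024 VCPUs, while a limit of 8 * 2048 admits all 2048.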
    • KVM: PPC: Book3S HV: Pack VCORE IDs to access full VCPU ID space · 1e175d2e
      Sam Bobroff authored
      It is not currently possible to create the full number of possible
      VCPUs (KVM_MAX_VCPUS) on Power9 with KVM-HV when the guest uses fewer
      threads per core than its core stride (or "VSMT mode"). This is
      because the VCORE ID and XIVE offsets grow beyond KVM_MAX_VCPUS
      even though the VCPU ID is less than KVM_MAX_VCPU_ID.
      
      To address this, "pack" the VCORE ID and XIVE offsets by using
      knowledge of the way the VCPU IDs will be used when there are fewer
      guest threads per core than the core stride. The primary thread of
      each core will always be used first. Then, if the guest uses more than
      one thread per core, these secondary threads will sequentially follow
      the primary in each core.
      
      So, the only way an ID above KVM_MAX_VCPUS can be seen is if the
      VCPUs are being spaced apart, so at least half of each core is empty,
      and IDs between KVM_MAX_VCPUS and (KVM_MAX_VCPUS * 2) can be mapped
      into the second half of each core (4..7, in an 8-thread core).
      
      Similarly, if IDs above KVM_MAX_VCPUS * 2 are seen, at least 3/4 of
      each core is being left empty, and we can map down into the second and
      third quarters of each core (2, 3 and 5, 6 in an 8-thread core).
      
      Lastly, if IDs above KVM_MAX_VCPUS * 4 are seen, only the primary
      threads are being used and 7/8 of the core is empty, allowing use of
      the 1, 5, 3 and 7 thread slots.
      
      (Strides less than 8 are handled similarly.)
      
      This allows the VCORE ID or offset to be calculated quickly from the
      VCPU ID or XIVE server numbers, without access to the VCPU structure.
      
      [paulus@ozlabs.org - tidied up comment a little, changed some WARN_ONCE
       to pr_devel, wrapped line, fixed id check.]
      Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
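
The packing idea can be sketched as a table lookup: the ID is split into a "block" (how many multiples of the VCPU limit it lies above) and a remainder, and each block is mapped to an otherwise unused thread slot. This is an illustration only. The offset table {0, 4, 2, 6, 1, 5, 3, 7} is an assumption of this sketch (the message does not list the table, and the slots used by the patch may differ), `TOY_MAX_VCPUS` is deliberately tiny, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_SMT_THREADS 8
#define TOY_MAX_VCPUS 16  /* small hypothetical limit, for illustration */

/*
 * Map a sparse VCPU ID into [0, TOY_MAX_VCPUS) for a given core stride,
 * using only the ID itself (no per-VCPU structure is consulted).
 */
static uint32_t toy_pack_vcpu_id(uint32_t id, unsigned int stride)
{
    /* Assumed slot order: each block lands in a slot left empty so far. */
    static const uint32_t block_offsets[TOY_MAX_SMT_THREADS] = {
        0, 4, 2, 6, 1, 5, 3, 7
    };
    uint32_t block;

    assert(stride >= 1 && stride <= TOY_MAX_SMT_THREADS);
    block = (id / TOY_MAX_VCPUS) * (TOY_MAX_SMT_THREADS / stride);
    assert(block < TOY_MAX_SMT_THREADS);  /* ID too large to pack */

    return (id % TOY_MAX_VCPUS) + block_offsets[block];
}
```

With a stride of 8 and one guest thread per core, the sixteen IDs 0, 8, ..., 120 map to sixteen distinct packed values below `TOY_MAX_VCPUS`, which is the property the packing is meant to provide.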
  5. 22 Jul, 2018 8 commits