1. 04 Dec, 2019 2 commits
  2. 29 Nov, 2019 2 commits
  3. 28 Nov, 2019 7 commits
    • powerpc: Ultravisor: Add PPC_UV config option · 013a53f2
      Anshuman Khandual authored
      CONFIG_PPC_UV adds support for the Ultravisor.
      Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
      Signed-off-by: Ram Pai <linuxram@us.ibm.com>
      [ Update config help and commit message ]
      Signed-off-by: Claudio Carvalho <cclaudio@linux.ibm.com>
      Reviewed-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Support reset of secure guest · 22945688
      Bharata B Rao authored
      Add support for reset of secure guest via a new ioctl KVM_PPC_SVM_OFF.
      This ioctl will be issued by QEMU during reset and includes the
      following steps:
      
      - Release all device pages of the secure guest.
      - Ask UV to terminate the guest via UV_SVM_TERMINATE ucall
      - Unpin the VPA pages so that they can be migrated back to secure
        side when guest becomes secure again. This is required because
        pinned pages can't be migrated.
      - Reinit the partition scoped page tables
      
      After these steps, the guest is ready to issue the UV_ESM call once
      again to switch to secure mode.
      Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
      Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      	[Implementation of uv_svm_terminate() and its call from
      	guest shutdown path]
      Signed-off-by: Ram Pai <linuxram@us.ibm.com>
      	[Unpinning of VPA pages]
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
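      As a hedged illustration of the userspace half, QEMU would issue
      something like the following during machine reset (KVM_PPC_SVM_OFF is
      a VM ioctl and takes no payload; the function name here is made up):

          #include <sys/ioctl.h>
          #include <linux/kvm.h>

          /* Sketch: move the VM out of secure mode so the guest can
           * transition again later via UV_ESM. vm_fd is the usual KVM VM
           * file descriptor. Returns 0 on success. */
          static int reset_secure_vm(int vm_fd)
          {
                  return ioctl(vm_fd, KVM_PPC_SVM_OFF, 0);
          }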
    • KVM: PPC: Book3S HV: Handle memory plug/unplug to secure VM · c3262257
      Bharata B Rao authored
      Register the new memslot with UV during plug and unregister
      the memslot during unplug. In addition, release all the
      device pages during unplug.
      Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
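      A hedged sketch of the shape of this change, assuming the
      uv_register_mem_slot()/uv_unregister_mem_slot() ucall wrappers and the
      kvmppc_uvmem_drop_pages() page-release helper used elsewhere in this
      series (a paraphrase of the memslot commit path, not the exact hunk):

          /* Sketch: mirror memslot plug/unplug to the Ultravisor for a
           * secure VM, releasing device pages before a slot goes away. */
          switch (change) {
          case KVM_MR_CREATE:
                  uv_register_mem_slot(kvm->arch.lpid,
                                       new->base_gfn << PAGE_SHIFT,
                                       new->npages * PAGE_SIZE,
                                       0, new->id);
                  break;
          case KVM_MR_DELETE:
                  /* release the guest's device pages for this slot */
                  kvmppc_uvmem_drop_pages(old, kvm);
                  uv_unregister_mem_slot(kvm->arch.lpid, old->id);
                  break;
          }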
    • KVM: PPC: Book3S HV: Radix changes for secure guest · 008e359c
      Bharata B Rao authored
      - After the guest becomes secure, when we handle a page fault of a page
        belonging to SVM in HV, send that page to UV via UV_PAGE_IN.
      - Whenever a page is unmapped on the HV side, inform UV via UV_PAGE_INVAL.
      - Ensure that all routines that walk the secondary page tables of
        the guest don't do so for a secure VM. For a secure guest, the
        active secondary page tables are in secure memory, and the secondary
        page tables in HV are freed when the guest becomes secure.
      Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
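      As a hedged illustration, the unmap half looks roughly like this
      (uv_page_inval() is the ucall wrapper from asm/ultravisor.h; the
      guard condition is paraphrased):

          /* Sketch: for a secure guest the live translations are in the
           * UV's page tables, so an HV-side unmap must be propagated to
           * the UV as an invalidation. */
          if (kvm->arch.secure_guest) {
                  uv_page_inval(kvm->arch.lpid, gpa, PAGE_SHIFT);
                  return;
          }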
    • KVM: PPC: Book3S HV: Shared pages support for secure guests · 60f0a643
      Bharata B Rao authored
      A secure guest will share some of its pages with the hypervisor (e.g.
      virtio bounce buffers). Add support for sharing pages between the
      hypervisor and the ultravisor.
      
      Shared page is reachable via both HV and UV side page tables. Once a
      secure page is converted to shared page, the device page that represents
      the secure page is unmapped from the HV side page tables.
      Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
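      A hedged sketch of the sharing path, assuming the uv_page_in() ucall
      wrapper used by this series (flags and most error handling elided):

          /* Sketch: for a page the guest marks shared, look up the
           * ordinary normal-memory page backing the gfn and hand its real
           * address to the UV; no device page is needed and both sides
           * can then access it. */
          pfn = gfn_to_pfn(kvm, gfn);
          if (is_error_noslot_pfn(pfn))
                  return H_PARAMETER;
          ret = uv_page_in(kvm->arch.lpid, pfn << PAGE_SHIFT, gpa,
                           0 /* flags */, page_shift);
          kvm_release_pfn_clean(pfn);   /* drop our temporary reference */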
    • KVM: PPC: Book3S HV: Support for running secure guests · ca9f4942
      Bharata B Rao authored
      A pseries guest can be run as secure guest on Ultravisor-enabled
      POWER platforms. On such platforms, this driver will be used to manage
      the movement of guest pages between the normal memory managed by
      hypervisor (HV) and secure memory managed by Ultravisor (UV).
      
      HV is informed about the guest's transition to secure mode via hcalls:
      
      H_SVM_INIT_START: Initiate securing a VM
      H_SVM_INIT_DONE: Conclude securing a VM
      
      As part of H_SVM_INIT_START, register all existing memslots with
      the UV. H_SVM_INIT_DONE call by UV informs HV that transition of
      the guest to secure mode is complete.
      
      These two states (transition to secure mode STARTED and transition
      to secure mode COMPLETED) are recorded in kvm->arch.secure_guest.
      Setting these states will cause the assembly code that enters the
      guest to call the UV_RETURN ucall instead of trying to enter the
      guest directly.
      
      Migration of pages between normal and secure memory of the secure
      guest is implemented by the H_SVM_PAGE_IN and H_SVM_PAGE_OUT hcalls.
      
      H_SVM_PAGE_IN: Move the content of a normal page to secure page
      H_SVM_PAGE_OUT: Move the content of a secure page to normal page
      
      Private ZONE_DEVICE memory equal to the amount of secure memory
      available in the platform for running secure guests is created.
      Whenever a page belonging to the guest becomes secure, a page from
      this private device memory is used to represent and track that secure
      page on the HV side. The movement of pages between normal and secure
      memory is done via migrate_vma_pages() using UV_PAGE_IN and
      UV_PAGE_OUT ucalls.
      
      In order to prevent the device private pages (that correspond to pages
      of the secure guest) from participating in KSM merging, H_SVM_PAGE_IN
      calls ksm_madvise(), which was originally done under the read version
      of mmap_sem. However, ksm_madvise() needs to be called with mmap_sem
      held for writing. Hence we call kvmppc_svm_page_in() with mmap_sem
      held for writing, and it then downgrades to a read lock after calling
      ksm_madvise().
      
      [paulus@ozlabs.org - roll in patch "KVM: PPC: Book3S HV: Take write
       mmap_sem when calling ksm_madvise"]
      Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
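      A hedged sketch of the locking pattern from the last paragraph
      (ksm_madvise() has the signature exported by the next commit; the
      actual migration sequence is elided):

          /* Sketch: take mmap_sem for writing so that ksm_madvise() may
           * update vma->vm_flags, then downgrade for the page-migration
           * sequence, which only needs the lock for reading. */
          down_write(&kvm->mm->mmap_sem);
          vma = find_vma_intersection(kvm->mm, addr, addr + PAGE_SIZE);
          if (vma)
                  ksm_madvise(vma, vma->vm_start, vma->vm_end,
                              MADV_UNMERGEABLE, &vma->vm_flags);
          downgrade_write(&kvm->mm->mmap_sem);
          /* ... migrate_vma_setup()/migrate_vma_pages() + UV_PAGE_IN ... */
          up_read(&kvm->mm->mmap_sem);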
    • mm: ksm: Export ksm_madvise() · 33cf1707
      Bharata B Rao authored
      On PEF-enabled POWER platforms that support running of secure guests,
      secure pages of the guest are represented by device private pages
      in the host. Such pages shouldn't participate in KSM merging. This is
      achieved by calling ksm_madvise(), which therefore needs to be
      exported, since KVM PPC can be built as a kernel module.
      Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
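      The change itself is a one-liner following the existing definition of
      ksm_madvise() in mm/ksm.c:

          EXPORT_SYMBOL_GPL(ksm_madvise);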
  4. 27 Nov, 2019 1 commit
  5. 25 Nov, 2019 1 commit
  6. 23 Nov, 2019 5 commits
    • kvm: nVMX: Relax guest IA32_FEATURE_CONTROL constraints · 85c9aae9
      Jim Mattson authored
      Commit 37e4c997 ("KVM: VMX: validate individual bits of guest
      MSR_IA32_FEATURE_CONTROL") broke the KVM_SET_MSRS ABI by instituting
      new constraints on the data values that kvm would accept for the guest
      MSR, IA32_FEATURE_CONTROL. Perhaps these constraints should have been
      opt-in via a new KVM capability, but they were applied
      indiscriminately, breaking at least one existing hypervisor.
      
      Relax the constraints to allow either or both of
      FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX and
      FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX to be set when nVMX is
      enabled. This change is sufficient to fix the aforementioned breakage.
      
      Fixes: 37e4c997 ("KVM: VMX: validate individual bits of guest MSR_IA32_FEATURE_CONTROL")
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
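      A hedged paraphrase of the relaxed check (the FEATURE_CONTROL_*
      defines and nested_vmx_allowed() are real; treat the placement and
      function name as approximate):

          /* Sketch: with nVMX enabled, tolerate either VMXON-enable bit
           * instead of deriving validity from host SMX support. */
          static u64 vmx_feature_control_valid_bits(struct kvm_vcpu *vcpu)
          {
                  u64 valid_bits = FEATURE_CONTROL_LOCKED;

                  if (nested_vmx_allowed(vcpu))
                          valid_bits |= FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX |
                                        FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX;
                  return valid_bits;
          }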
    • KVM: x86: Grab KVM's srcu lock when setting nested state · ad5996d9
      Sean Christopherson authored
      Acquire kvm->srcu for the duration of ->set_nested_state() to fix a bug
      where nVMX dereferences ->memslots without holding ->srcu or ->slots_lock.
      
      The other half of nested migration, ->get_nested_state(), does not need
      to acquire ->srcu as it is purely a dump of internal KVM (and CPU)
      state to userspace.
      
      Detected as an RCU lockdep splat that is 100% reproducible by running
      KVM's state_test selftest with CONFIG_PROVE_LOCKING=y.  Note that the
      failing function, kvm_is_visible_gfn(), is only checking the validity of
      a gfn, it's not actually accessing guest memory (which is more or less
      unsupported during vmx_set_nested_state() due to incorrect MMU state),
      i.e. vmx_set_nested_state() itself isn't fundamentally broken.  In any
      case, setting nested state isn't a fast path so there's no reason to go
      out of our way to avoid taking ->srcu.
      
        =============================
        WARNING: suspicious RCU usage
        5.4.0-rc7+ #94 Not tainted
        -----------------------------
        include/linux/kvm_host.h:626 suspicious rcu_dereference_check() usage!
      
                     other info that might help us debug this:
      
        rcu_scheduler_active = 2, debug_locks = 1
        1 lock held by evmcs_test/10939:
         #0: ffff88826ffcb800 (&vcpu->mutex){+.+.}, at: kvm_vcpu_ioctl+0x85/0x630 [kvm]
      
        stack backtrace:
        CPU: 1 PID: 10939 Comm: evmcs_test Not tainted 5.4.0-rc7+ #94
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        Call Trace:
         dump_stack+0x68/0x9b
         kvm_is_visible_gfn+0x179/0x180 [kvm]
         mmu_check_root+0x11/0x30 [kvm]
         fast_cr3_switch+0x40/0x120 [kvm]
         kvm_mmu_new_cr3+0x34/0x60 [kvm]
         nested_vmx_load_cr3+0xbd/0x1f0 [kvm_intel]
         nested_vmx_enter_non_root_mode+0xab8/0x1d60 [kvm_intel]
         vmx_set_nested_state+0x256/0x340 [kvm_intel]
         kvm_arch_vcpu_ioctl+0x491/0x11a0 [kvm]
         kvm_vcpu_ioctl+0xde/0x630 [kvm]
         do_vfs_ioctl+0xa2/0x6c0
         ksys_ioctl+0x66/0x70
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x54/0x200
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
        RIP: 0033:0x7f59a2b95f47
      
      Fixes: 8fcc4b59 ("kvm: nVMX: Introduce KVM_CAP_NESTED_STATE")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
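      The fix itself is small; roughly (paraphrasing the
      KVM_SET_NESTED_STATE ioctl path in arch/x86/kvm/x86.c):

          /* Sketch: hold kvm->srcu across ->set_nested_state() so nVMX
           * can safely dereference memslots while loading L2 state. */
          idx = srcu_read_lock(&vcpu->kvm->srcu);
          r = kvm_x86_ops->set_nested_state(vcpu, user_kvm_nested_state,
                                            &kvm_state);
          srcu_read_unlock(&vcpu->kvm->srcu, idx);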
    • KVM: x86: Open code shared_msr_update() in its only caller · 05c19c2f
      Sean Christopherson authored
      Fold shared_msr_update() into its sole user to eliminate its pointless
      bounds check, its godawful printk, its misleading comment (it's called
      under a global lock), and its woefully inaccurate name.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Fix jump label out_free_* in kvm_init() · faf0be22
      Miaohe Lin authored
      The jump labels out_free_1 and out_free_2 deal with
      the same stuff, so get rid of one and rename the
      label out_free_0a to retain the label name order.
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Remove a spurious export of a static function · 24885d1d
      Sean Christopherson authored
      A recent change inadvertently exported a static function, which results
      in modpost throwing a warning.  Fix it.
      
      Fixes: cbbaa272 ("KVM: x86: fix presentation of TSX feature in ARCH_CAPABILITIES")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  7. 21 Nov, 2019 13 commits
    • KVM: x86: create mmu/ subdirectory · c50d8ae3
      Paolo Bonzini authored
      Preparatory work for shattering mmu.c into multiple files.  Besides making it easier
      to follow, this will also make it possible to write unit tests for various parts.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Remove unnecessary TLB flushes on L1<->L2 switches when L1 use apic-access-page · 0155b2b9
      Liran Alon authored
      According to Intel SDM section 28.3.3.3/28.3.3.4 Guidelines for Use
      of the INVVPID/INVEPT Instruction, the hypervisor needs to execute
      INVVPID/INVEPT X in case CPU executes VMEntry with VPID/EPTP X and
      either: "Virtualize APIC accesses" VM-execution control was changed
      from 0 to 1, OR the value of apic_access_page was changed.
      
      In the nested case, the burden falls on L1, unless L0 enables EPT in
      vmcs02 but L1 enables neither EPT nor VPID in vmcs12.  For this reason
      prepare_vmcs02() and load_vmcs12_host_state() have special code to
      request a TLB flush in case L1 does not use EPT but it uses
      "virtualize APIC accesses".
      
      This special case however is not necessary. On a nested vmentry the
      physical TLB will already be flushed except if all the following apply:
      
      * L0 uses VPID
      
      * L1 uses VPID
      
      * L0 can guarantee TLB entries populated while running L1 are tagged
      differently than TLB entries populated while running L2.
      
      If the first condition is false, the processor will flush the TLB
      on vmentry to L2.  If the second or third condition is false,
      prepare_vmcs02() will request KVM_REQ_TLB_FLUSH.  However, even
      if both of them are true, no extra TLB flush is needed to handle the APIC
      access page:
      
      * if L1 doesn't use VPID, the second condition doesn't hold and the
      TLB will be flushed anyway.
      
      * if L1 uses VPID, it has to flush the TLB itself with INVVPID and
      section 28.3.3.3 doesn't apply to L0.
      
      * even INVEPT is not needed because, if L0 uses EPT, it uses different
      EPTP when running L2 than L1 (because guest_mode is part of mmu-role).
      In this case SDM section 28.3.3.4 doesn't apply.
      
      Similarly, examining nested_vmx_vmexit()->load_vmcs12_host_state(),
      one could note that L0 won't flush TLB only in cases where SDM sections
      28.3.3.3 and 28.3.3.4 don't apply.  In particular, if L0 uses different
      VPIDs for L1 and L2 (i.e. vmx->vpid != vmx->nested.vpid02), section
      28.3.3.3 doesn't apply.
      
      Thus, remove this flush from prepare_vmcs02() and nested_vmx_vmexit().
      
      Side-note: This patch can be viewed as removing parts of commit
      fb6c8198 ("kvm: vmx: Flush TLB when the APIC-access address changes")
      that are no longer relevant since commit
      1313cc2b ("kvm: mmu: Add guest_mode to kvm_mmu_page_role").
      That is, the first commit assumes that if L0 uses EPT and L1 doesn't
      use EPT, then L0 will use the same EPTP for both L0 and L1, which
      indeed required L0 to execute INVEPT before entering the L2 guest.
      This assumption no longer holds since guest_mode was added to the
      mmu-role.
      Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
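      For orientation, the special case being removed looked roughly like
      this (a paraphrase, not the exact hunk):

          /* Sketch: the now-unneeded flush requested by prepare_vmcs02()
           * and load_vmcs12_host_state() when L1 used "virtualize APIC
           * accesses" without EPT. */
          if (nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES) &&
              !nested_cpu_has_ept(vmcs12))
                  kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);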
    • KVM: x86: remove set but not used variable 'called' · db5a95ec
      Mao Wenan authored
      Fixes gcc '-Wunused-but-set-variable' warning:
      
      arch/x86/kvm/x86.c: In function 'kvm_make_scan_ioapic_request_mask':
      arch/x86/kvm/x86.c:7911:7: warning: variable 'called' set but not
      used [-Wunused-but-set-variable]
      
      It is not used since commit 7ee30bc1 ("KVM: x86: deliver KVM
      IOAPIC scan request to target vCPUs")
      Signed-off-by: Mao Wenan <maowenan@huawei.com>
      Fixes: 7ee30bc1 ("KVM: x86: deliver KVM IOAPIC scan request to target vCPUs")
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Do not mark vmcs02->apic_access_page as dirty when unpinning · b11494bc
      Liran Alon authored
      vmcs->apic_access_page is simply a token that the hypervisor puts into
      the PFN of a 4KB EPTE (or PTE if using shadow-paging) that triggers an
      APIC-access VMExit or APIC virtualization logic whenever a CPU running
      in VMX non-root mode reads from or writes to this PFN.
      
      As every write either triggers an APIC-access VMExit or is performed
      on vmcs->virtual_apic_page, the PFN pointed to by
      vmcs->apic_access_page should never actually be touched by the CPU.
      
      Therefore, there is no need to mark vmcs02->apic_access_page as dirty
      after unpinning it on an emulated L2->L1 VMExit or when L1 exits VMX
      operation.
      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Merge branch 'kvm-tsx-ctrl' into HEAD · 46f4f0aa
      Paolo Bonzini authored
      Conflicts:
      	arch/x86/kvm/vmx/vmx.c
    • KVM: vmx: use MSR_IA32_TSX_CTRL to hard-disable TSX on guest that lack it · b07a5c53
      Paolo Bonzini authored
      If X86_FEATURE_RTM is disabled, the guest should not be able to access
      MSR_IA32_TSX_CTRL.  We can therefore use it in KVM to force all
      transactions from the guest to abort.
      Tested-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
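      A hedged sketch of the idea (TSX_CTRL_RTM_DISABLE and the feature
      bits are the real defines; the shared-MSR lookup helper here is
      hypothetical):

          /* Sketch: if the guest has no RTM, run it with RTM_DISABLE set
           * in TSX_CTRL so every guest transaction aborts; otherwise let
           * the guest control the MSR. */
          int i = find_guest_shared_msr(vmx, MSR_IA32_TSX_CTRL); /* hypothetical */

          if (i >= 0)
                  vmx->guest_msrs[i].data =
                          guest_cpuid_has(vcpu, X86_FEATURE_RTM) ?
                                  0 : TSX_CTRL_RTM_DISABLE;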
    • KVM: vmx: implement MSR_IA32_TSX_CTRL disable RTM functionality · c11f83e0
      Paolo Bonzini authored
      The current guest mitigation of TAA is both too heavy and not really
      sufficient.  It is too heavy because it will cause some affected CPUs
      (those that have MDS_NO but lack TAA_NO) to fall back to VERW and
      get the corresponding slowdown.  It is not really sufficient because
      it will cause the MDS_NO bit to disappear upon microcode update, so
      that VMs started before the microcode update will not be runnable
      anymore afterwards, even with tsx=on.
      
      Instead, if tsx=on on the host, we can emulate MSR_IA32_TSX_CTRL for
      the guest and let it run without the VERW mitigation.  Writing
      MSR_IA32_TSX_CTRL is quite heavyweight and we do not want to do it on
      every vmentry; we can use the shared MSR functionality because the
      host kernel need not protect itself from TSX-based side-channels.
      Tested-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
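      A hedged sketch of the shared-MSR registration implied here (mask
      semantics per the "do not modify masked bits of shared MSRs" fix in
      this series; values paraphrased from the patch):

          /* Sketch: expose TSX_CTRL through the shared-MSR machinery, but
           * mask out CPUID_CLEAR: that bit only affects CPUID, which KVM
           * emulates itself, so it is never written to the hardware MSR. */
          vmx->guest_msrs[j].index = MSR_IA32_TSX_CTRL;
          vmx->guest_msrs[j].data  = 0;
          vmx->guest_msrs[j].mask  = ~(u64)TSX_CTRL_CPUID_CLEAR;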
    • KVM: x86: implement MSR_IA32_TSX_CTRL effect on CPUID · edef5c36
      Paolo Bonzini authored
      Because KVM always emulates CPUID, the CPUID clear bit
      (bit 1) of MSR_IA32_TSX_CTRL must be emulated "manually"
      by the hypervisor when performing said emulation.
      
      Right now neither kvm-intel.ko nor kvm-amd.ko implement
      MSR_IA32_TSX_CTRL but this will change in the next patch.
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Tested-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
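      A hedged sketch of that manual emulation, close in spirit to the
      kvm_cpuid() hunk (F() is cpuid.c-local shorthand for feature-bit
      masks):

          /* Sketch: if the guest set TSX_CTRL.CPUID_CLEAR, hide HLE and
           * RTM from CPUID leaf 7 before returning it. */
          if (function == 7 && index == 0) {
                  u64 data;

                  if (!__kvm_get_msr(vcpu, MSR_IA32_TSX_CTRL, &data, true) &&
                      (data & TSX_CTRL_CPUID_CLEAR))
                          *ebx &= ~(F(HLE) | F(RTM));
          }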
    • KVM: x86: do not modify masked bits of shared MSRs · de1fca5d
      Paolo Bonzini authored
      "Shared MSRs" are guest MSRs that are written to the host MSRs but
      keep their value until the next return to userspace.  They support
      a mask, so that some bits keep the host value, but this mask is
      only used to skip an unnecessary MSR write and the value written
      to the MSR is always the guest MSR.
      
      Fix this and, while at it, do not update smsr->values[slot].curr if
      for whatever reason the wrmsr fails.  This should only happen due to
      reserved bits, in which case the stale smsr->values[slot].curr would
      not match the actual MSR contents when the user-return notifier runs,
      and the host value will always be restored anyway.  Still, it is
      untidy, and in rare cases skipping the bogus update actually avoids
      spurious WRMSRs on return to userspace.
      
      Cc: stable@vger.kernel.org
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Tested-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
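      A sketch of the corrected write path in kvm_set_shared_msr(), very
      close to the actual hunk:

          /* Honor the mask when computing the value to write, and only
           * record it as current once the WRMSR has succeeded. */
          value = (value & mask) | (smsr->values[slot].host & ~mask);
          if (value == smsr->values[slot].curr)
                  return 0;
          err = wrmsrl_safe(shared_msrs_global.msrs[slot], value);
          if (err)
                  return 1;
          smsr->values[slot].curr = value;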
    • KVM: x86: fix presentation of TSX feature in ARCH_CAPABILITIES · cbbaa272
      Paolo Bonzini authored
      KVM does not implement MSR_IA32_TSX_CTRL, so it must not be presented
      to the guests.  It is also confusing to have !ARCH_CAP_TSX_CTRL_MSR &&
      !RTM && ARCH_CAP_TAA_NO: lack of MSR_IA32_TSX_CTRL suggests TSX was not
      hidden (it actually was), yet the value says that TSX is not vulnerable
      to microarchitectural data sampling.  Fix both.
      
      Cc: stable@vger.kernel.org
      Tested-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
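      A hedged sketch of the fix in kvm_get_arch_capabilities() (the bit
      names are the real msr-index.h defines; logic paraphrased):

          /* Sketch: never advertise TSX_CTRL, which KVM does not emulate
           * here; and if RTM was hidden, do not claim TAA_NO either, so a
           * hidden TSX is never presented as "not vulnerable yet not
           * hidden". */
          data &= ~ARCH_CAP_TSX_CTRL_MSR;
          if (!boot_cpu_has(X86_FEATURE_RTM))
                  data &= ~ARCH_CAP_TAA_NO;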
    • Merge tag 'kvmarm-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD · 14edff88
      Paolo Bonzini authored
      KVM/arm updates for Linux 5.5:
      
      - Allow non-ISV data aborts to be reported to userspace
      - Allow injection of data aborts from userspace
      - Expose stolen time to guests
      - GICv4 performance improvements
      - vgic ITS emulation fixes
      - Simplify FWB handling
      - Enable halt polling counters
      - Make the emulated timer PREEMPT_RT compliant
      
      Conflicts:
      	include/uapi/linux/kvm.h
    • KVM: PPC: Book3S HV: XIVE: Fix potential page leak on error path · 30486e72
      Greg Kurz authored
      We need to check that the host page size is big enough to accommodate
      the EQ. Let's do this before taking a reference on the EQ page to avoid
      a potential leak if the check fails.
      
      Cc: stable@vger.kernel.org # v5.2
      Fixes: 13ce3297 ("KVM: PPC: Book3S HV: XIVE: Add controls for the EQ configuration")
      Signed-off-by: Greg Kurz <groug@kaod.org>
      Reviewed-by: Cédric Le Goater <clg@kaod.org>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: XIVE: Free previous EQ page when setting up a new one · 31a88c82
      Greg Kurz authored
      The EQ page is allocated by the guest and then passed to the hypervisor
      with the H_INT_SET_QUEUE_CONFIG hcall. A reference is taken on the page
      before handing it over to the HW. This reference is dropped either when
      the guest issues the H_INT_RESET hcall or when the KVM device is released.
      But, the guest can legitimately call H_INT_SET_QUEUE_CONFIG several times,
      either to reset the EQ (vCPU hot unplug) or to set a new EQ (guest reboot).
      In both cases the existing EQ page reference is leaked because we simply
      overwrite it in the XIVE queue structure without calling put_page().
      
      This is especially visible when the guest memory is backed with huge pages:
      start a VM up to the guest userspace, either reboot it or unplug a vCPU,
      quit QEMU. The leak is observed by comparing the value of HugePages_Free in
      /proc/meminfo before and after the VM is run.
      
      Ideally we'd want the XIVE code to handle the EQ page de-allocation at the
      platform level. This isn't the case right now because the various XIVE
      drivers have different allocation needs. It might be worth introducing
      hooks for this purpose instead of exposing XIVE internals to the
      drivers, but that is certainly a larger piece of work for later.
      
      In the meantime, for easier backport, fix both vCPU unplug and guest reboot
      leaks by introducing a wrapper around xive_native_configure_queue() that
      does the necessary cleanup.
      Reported-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
      Cc: stable@vger.kernel.org # v5.2
      Fixes: 13ce3297 ("KVM: PPC: Book3S HV: XIVE: Add controls for the EQ configuration")
      Signed-off-by: Cédric Le Goater <clg@kaod.org>
      Signed-off-by: Greg Kurz <groug@kaod.org>
      Tested-by: Lijun Pan <ljp@linux.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
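      A hedged sketch of the wrapper described above
      (xive_native_configure_queue() is the real platform call; the wrapper
      is paraphrased from the commit description):

          /* Sketch: configure the new EQ, then drop the reference on
           * whatever page was previously hooked up to this queue. */
          static int kvmppc_xive_native_configure_queue(u32 vp_id,
                          struct xive_q *q, u8 prio, __be32 *qpage,
                          u32 order, bool can_escalate)
          {
                  __be32 *qpage_prev = q->qpage;
                  int rc;

                  rc = xive_native_configure_queue(vp_id, q, prio, qpage,
                                                   order, can_escalate);
                  if (!rc && qpage_prev)
                          put_page(virt_to_page(qpage_prev));

                  return rc;
          }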
  8. 20 Nov, 2019 5 commits
  9. 18 Nov, 2019 1 commit
  10. 15 Nov, 2019 3 commits
    • KVM: x86: deliver KVM IOAPIC scan request to target vCPUs · 7ee30bc1
      Nitesh Narayan Lal authored
      In IOAPIC fixed delivery mode, instead of flushing the scan
      requests to all vCPUs, we should only send the requests to the
      vCPUs specified within the destination field.
      
      This patch introduces the kvm_get_dest_vcpus_mask() API, which
      retrieves an array of target vCPUs using
      kvm_apic_map_get_dest_lapic() and then, based on each vcpus_idx,
      sets the corresponding bit in a bitmap. If that lookup fails,
      kvm_get_dest_vcpus_mask() instead finds the target vCPUs by
      traversing all available vCPUs and setting their bits in the
      bitmap.
      
      If the previous request for the same redirection table entry
      targeted different vCPUs, the bits corresponding to those vCPUs
      are also set. This is done to keep ioapic_handled_vectors
      synchronized.
      
      This bitmap is then passed on to kvm_make_vcpus_request_mask()
      to generate a masked request only for the target vCPUs.
      
      This reduces the latency overhead on isolated vCPUs caused by
      processing the IPI for KVM_REQ_SCAN_IOAPIC.
      Suggested-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
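      A hedged sketch of the delivery path (kvm_make_vcpus_request_mask()
      is the existing helper; the new API's exact signature is
      paraphrased):

          /* Sketch: compute the destination-vCPU bitmap for this
           * redirection table entry, then send KVM_REQ_SCAN_IOAPIC only
           * to those vCPUs ('cpus' is a scratch cpumask for the IPIs). */
          DECLARE_BITMAP(vcpu_bitmap, KVM_MAX_VCPUS);

          bitmap_zero(vcpu_bitmap, KVM_MAX_VCPUS);
          kvm_get_dest_vcpus_mask(kvm, &irq, vcpu_bitmap);
          kvm_make_vcpus_request_mask(kvm, KVM_REQ_SCAN_IOAPIC,
                                      vcpu_bitmap, cpus);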
    • KVM: remember position in kvm->vcpus array · 8750e72a
      Radim Krčmář authored
      Fetching the index of any vcpu in the kvm->vcpus array by traversing
      the entire array every time is costly.
      This patch remembers the position of each vcpu in kvm->vcpus by
      storing it in vcpus_idx within the kvm_vcpu structure.
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
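      Roughly, the change amounts to the following (paraphrased):

          /* Sketch: record the position once, at vCPU creation time... */
          vcpu->vcpus_idx = atomic_read(&kvm->online_vcpus);

          /* ...so lookups become O(1) instead of scanning kvm->vcpus. */
          static inline int kvm_vcpu_get_idx(struct kvm_vcpu *vcpu)
          {
                  return vcpu->vcpus_idx;
          }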
    • KVM: nVMX: Add support for capturing highest observable L2 TSC · 662f1d1d
      Aaron Lewis authored
      The L1 hypervisor may include the IA32_TIME_STAMP_COUNTER MSR in the
      vmcs12 MSR VM-exit MSR-store area as a way of determining the highest
      TSC value that might have been observed by L2 prior to VM-exit. The
      current implementation does not capture a very tight bound on this
      value.  To tighten the bound, add the IA32_TIME_STAMP_COUNTER MSR to the
      vmcs02 VM-exit MSR-store area whenever it appears in the vmcs12 VM-exit
      MSR-store area.  When L0 processes the vmcs12 VM-exit MSR-store area
      during the emulation of an L2->L1 VM-exit, special-case the
      IA32_TIME_STAMP_COUNTER MSR, using the value stored in the vmcs02
      VM-exit MSR-store area to derive the value to be stored in the vmcs12
      VM-exit MSR-store area.
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Aaron Lewis <aaronlewis@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
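      A hedged sketch of the special case while emulating the MSR-store
      area on an L2->L1 exit (the helper reading the vmcs02 slot is
      hypothetical; kvm_get_msr() is the normal fallback):

          /* Sketch: for IA32_TIME_STAMP_COUNTER, report the value that
           * hardware saved in the vmcs02 VM-exit MSR-store area instead
           * of reading the MSR now, giving L1 a tighter bound. */
          if (e.index == MSR_IA32_TSC)
                  e.value = vmcs02_autostore_msr(vmx, MSR_IA32_TSC); /* hypothetical */
          else if (kvm_get_msr(vcpu, e.index, &e.value))
                  return -EINVAL;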