1. 30 Apr, 2019 17 commits
    • KVM: PPC: Book3S HV: XIVE: Add controls for the EQ configuration · 13ce3297
      Cédric Le Goater authored
      These controls will be used by the H_INT_SET_QUEUE_CONFIG and
      H_INT_GET_QUEUE_CONFIG hcalls from QEMU to configure the underlying
      Event Queue in the XIVE IC. They will also be used to restore the
      configuration of the XIVE EQs and to capture the internal run-time
      state of the EQs. Both 'get' and 'set' rely on an OPAL call to access
      the EQ toggle bit and EQ index which are updated by the XIVE IC when
      event notifications are enqueued in the EQ.
      
      The value of the guest physical address of the event queue is saved in
      the XIVE internal xive_q structure for later use, namely when
      migration needs to mark the EQ pages dirty to capture a consistent
      memory state of the VM.
      
      Note that H_INT_SET_QUEUE_CONFIG does not require the extra
      OPAL call setting the EQ toggle bit and EQ index to configure the EQ,
      but restoring the EQ state will.
      Signed-off-by: Cédric Le Goater <clg@kaod.org>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
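The role of the EQ toggle bit and EQ index mentioned in the commit above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: it assumes a ring of 32-bit entries whose top bit carries a generation flag that flips each time the producer wraps, and every name in it (`eq_model`, `eq_push`, `eq_pop`, `EQ_ENTRIES`) is hypothetical rather than taken from the kernel or from OPAL. Event number 0 is reserved to mean "queue empty" in this model, and overflow is not handled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space model of an event queue; not the kernel's xive_q. */
#define EQ_ENTRIES 4u            /* must be a power of two */
#define EQ_GEN_BIT 0x80000000u   /* top bit of an entry carries the generation */

struct eq_model {
    uint32_t page[EQ_ENTRIES];   /* queue memory, zeroed at init */
    uint32_t prod_idx;           /* producer index (the IC in real hardware) */
    uint32_t prod_gen;           /* generation the producer writes, starts at 1 */
    uint32_t idx;                /* consumer index */
    uint32_t toggle;             /* consumer toggle bit, starts at 0 */
};

static void eq_init(struct eq_model *q)
{
    for (uint32_t i = 0; i < EQ_ENTRIES; i++)
        q->page[i] = 0;
    q->prod_idx = 0;
    q->prod_gen = 1;
    q->idx = 0;
    q->toggle = 0;
}

/* Producer side: store a nonzero event number with the current generation. */
static void eq_push(struct eq_model *q, uint32_t irq)
{
    assert(irq != 0 && (irq & EQ_GEN_BIT) == 0);
    q->page[q->prod_idx] = irq | (q->prod_gen ? EQ_GEN_BIT : 0);
    q->prod_idx = (q->prod_idx + 1) & (EQ_ENTRIES - 1);
    if (q->prod_idx == 0)
        q->prod_gen ^= 1;        /* flip the generation on wrap */
}

/* Consumer side: returns 0 when the queue is empty, else the event number. */
static uint32_t eq_pop(struct eq_model *q)
{
    uint32_t cur = q->page[q->idx];

    if ((cur >> 31) == q->toggle)
        return 0;                /* generation matches toggle: nothing new */
    q->idx = (q->idx + 1) & (EQ_ENTRIES - 1);
    if (q->idx == 0)
        q->toggle ^= 1;          /* flip the toggle on wrap */
    return cur & ~EQ_GEN_BIT;
}
```

In this model the pair (`idx`, `toggle`) is exactly the run-time state a migration would have to capture and restore alongside the queue page, which is why restoring an EQ needs more than its static configuration.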
    • KVM: PPC: Book3S HV: XIVE: Add a control to configure a source · e8676ce5
      Cédric Le Goater authored
      This control will be used by the H_INT_SET_SOURCE_CONFIG hcall from
      QEMU to configure the target of a source and also to restore the
      configuration of a source when migrating the VM.
      
      The XIVE source interrupt structure is extended with the value of the
      Effective Interrupt Source Number. The EISN is the interrupt number
      pushed in the event queue that the guest OS will use to dispatch
      events internally. Caching the EISN value in KVM eases the test when
      checking if a reconfiguration is indeed needed.
      Signed-off-by: Cédric Le Goater <clg@kaod.org>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: XIVE: add a control to initialize a source · 4131f83c
      Cédric Le Goater authored
      The XIVE KVM device maintains a list of interrupt sources for the VM
      which are allocated in the pool of generic interrupts (IPIs) of the
      main XIVE IC controller. These are used for the CPU IPIs as well as
      for virtual device interrupts. The IRQ number space is defined by
      QEMU.
      
      The XIVE device reuses the source structures of the XICS-on-XIVE
      device for the source blocks (2-level tree) and for the source
      interrupts. Under XIVE native, the source interrupt caches mostly
      configuration information and is less used than under the XICS-on-XIVE
      device in which hcalls are still necessary at run-time.
      
      When a source is initialized in KVM, an IPI interrupt source is simply
      allocated at the OPAL level and then MASKED. KVM only needs to know
      about its type: LSI or MSI.
      Signed-off-by: Cédric Le Goater <clg@kaod.org>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: XIVE: Introduce a new capability KVM_CAP_PPC_IRQ_XIVE · eacc56bb
      Cédric Le Goater authored
      The user interface exposes a new capability KVM_CAP_PPC_IRQ_XIVE to
      let QEMU connect the vCPU presenters to the XIVE KVM device if
      required. The capability is not advertised for now as the full support
      for the XIVE native exploitation mode is not yet available. When this
      is the case, the capability will be advertised on PowerNV Hypervisors
      only. Nested guests (pseries KVM Hypervisor) are not supported.
      
      Internally, the interface to the new KVM device is protected with a
      new interrupt mode: KVMPPC_IRQ_XIVE.
      Signed-off-by: Cédric Le Goater <clg@kaod.org>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Add a new KVM device for the XIVE native exploitation mode · 90c73795
      Cédric Le Goater authored
      This is the basic framework for the new KVM device supporting the XIVE
      native exploitation mode. The user interface exposes a new KVM device
      to be created by QEMU, only available when running on a L0 hypervisor.
      Support for nested guests is not available yet.
      
      The XIVE device reuses the device structure of the XICS-on-XIVE device
      as they have a lot in common. That could possibly change in the future
      if the need arises.
      Signed-off-by: Cédric Le Goater <clg@kaod.org>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into kvm-ppc-next · a878957a
      Paul Mackerras authored
      This merges in the ppc-kvm topic branch from the powerpc tree to get
      patches which touch both general powerpc code and KVM code, one of
      which is a prerequisite for following patches.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Save/restore vrsave register in kvmhv_p9_guest_entry() · 44b198ae
      Suraj Jitindar Singh authored
      On POWER9 and later processors where the host can schedule vcpus on a
      per thread basis, there is a streamlined entry path used when the guest
      is radix. This entry path saves/restores the fp and vr state in
      kvmhv_p9_guest_entry() by calling store_[fp/vr]_state() and
      load_[fp/vr]_state(). This is the same as the old entry path however the
      old entry path also saved/restored the VRSAVE register, which isn't done
      in the new entry path.
      
      This means that the vrsave register is now volatile across guest exit,
      which is an incorrect change in behaviour.
      
      Fix this by saving/restoring the vrsave register in kvmhv_p9_guest_entry().
      This restores the old, correct, behaviour.
      
      Fixes: 95a6432c ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Flush TLB on secondary radix threads · 70ea13f6
      Paul Mackerras authored
      When running on POWER9 with kvm_hv.indep_threads_mode = N and the host
      in SMT1 mode, KVM will run guest VCPUs on offline secondary threads.
      If those guests are in radix mode, we fail to load the LPID and flush
      the TLB if necessary, leading to the guest crashing with an
      unsupported MMU fault.  This arises from commit 9a4506e1 ("KVM:
      PPC: Book3S HV: Make radix handle process scoped LPID flush in C,
      with relocation on", 2018-05-17), which didn't consider the case
      where indep_threads_mode = N.
      
      For simplicity, this makes the real-mode guest entry path flush the
      TLB in the same place for both radix and hash guests, as we did before
      9a4506e1, though the code is now C code rather than assembly code.
      We also have the radix TLB flush open-coded rather than calling
      radix__local_flush_tlb_lpid_guest(), because the TLB flush can be
      called in real mode, and in real mode we don't want to invoke the
      tracepoint code.
      
      Fixes: 9a4506e1 ("KVM: PPC: Book3S HV: Make radix handle process scoped LPID flush in C, with relocation on")
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Move HPT guest TLB flushing to C code · 2940ba0c
      Paul Mackerras authored
      This replaces assembler code in book3s_hv_rmhandlers.S that checks
      the kvm->arch.need_tlb_flush cpumask and optionally does a TLB flush
      with C code in book3s_hv_builtin.c.  Note that unlike the radix
      version, the hash version doesn't do an explicit ERAT invalidation
      because we will invalidate and load up the SLB before entering the
      guest, and that will invalidate the ERAT.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Handle virtual mode in XIVE VCPU push code · 7ae9bda7
      Suraj Jitindar Singh authored
      The code in book3s_hv_rmhandlers.S that pushes the XIVE virtual CPU
      context to the hardware currently assumes it is being called in real
      mode, which is usually true.  There is however a path by which it can
      be executed in virtual mode, in the case where indep_threads_mode = N.
      A virtual CPU executing on an offline secondary thread can take a
      hypervisor interrupt in virtual mode and return from the
      kvmppc_hv_entry() call after the kvm_secondary_got_guest label.
      It is possible for it to be given another vcpu to execute before it
      gets to execute the stop instruction.  In that case it will call
      kvmppc_hv_entry() for the second VCPU in virtual mode, and the XIVE
      vCPU push code will be executed in virtual mode.  The result in that
      case will be a host crash due to an unexpected data storage interrupt
      caused by executing the stdcix instruction in virtual mode.
      
      This fixes it by adding a code path for virtual mode, which uses the
      virtual TIMA pointer and normal load/store instructions.
      
      [paulus@ozlabs.org - wrote patch description]
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Fix XICS-on-XIVE H_IPI when priority = 0 · 1f80ba3d
      Paul Mackerras authored
      This fixes a bug in the XICS emulation on POWER9 machines which is
      triggered by the guest doing a H_IPI with priority = 0 (the highest
      priority).  What happens is that the notification interrupt arrives
      at the destination at priority zero.  The loop in scan_interrupts()
      sees that a priority 0 interrupt is pending, but because xc->mfrr is
      zero, we break out of the loop before taking the notification
      interrupt out of the queue and EOI-ing it.  (This doesn't happen
      when xc->mfrr != 0; in that case we process the priority-0 notification
      interrupt on the first iteration of the loop, and then break out of
      a subsequent iteration of the loop with hirq == XICS_IPI.)
      
      To fix this, we move the prio >= xc->mfrr check down to near the end
      of the loop.  However, there are then some other things that need to
      be adjusted.  Since we are potentially handling the notification
      interrupt and also delivering an IPI to the guest in the same loop
      iteration, we need to update pending and handle any q->pending_count
      value before the xc->mfrr check, rather than at the end of the loop.
      Also, we need to update the queue pointers when we have processed and
      EOI-ed the notification interrupt, since we may not do it later.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: smb->smp comment fixup · 6fabc9f2
      Palmer Dabbelt authored
      I made the same typo when trying to grep for uses of smp_wmb and figured
      I might as well fix it.
      Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S: Allocate guest TCEs on demand too · e1a1ef84
      Alexey Kardashevskiy authored
      We already allocate hardware TCE tables in multiple levels and skip
      intermediate levels when we can; now it is the turn of the KVM TCE tables.
      Thankfully these are allocated already in 2 levels.
      
      This moves the table's last level allocation from the creating helper to
      kvmppc_tce_put() and kvm_spapr_tce_fault(). Since such allocation cannot
      be done in real mode, this creates a virtual mode version of
      kvmppc_tce_put() which handles allocations.
      
      This adds kvmppc_rm_ioba_validate() to additionally test whether
      the subsequent kvmppc_tce_put() needs a page which has not been allocated;
      if this is the case, we bail out to virtual mode handlers.
      
      The allocations are protected by a new mutex as kvm->lock is not suitable
      for the task because the fault handler is called with the mmap_sem held
      but kvmhv_setup_mmu() locks kvm->lock and mmap_sem in the reverse order.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
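The on-demand allocation described in the commit above can be sketched as a two-level table whose last-level pages are created on first use. This is a minimal user-space model, not the kernel code: `tce_table_model`, `tce_put`, `tce_put_needs_alloc` and `TCES_PER_PAGE` are hypothetical names, `calloc()` stands in for page allocation, the new mutex is left out, and skipping the allocation when a zero entry is written to an absent page is an assumption of this sketch.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define TCES_PER_PAGE 512u   /* hypothetical: 4 KiB page / 8-byte entry */

struct tce_table_model {
    uint64_t **pages;        /* first level: one pointer per last-level page */
    size_t npages;
};

/* Create only the first level; last-level pages stay unallocated. */
static int tce_table_init(struct tce_table_model *t, size_t ntces)
{
    t->npages = (ntces + TCES_PER_PAGE - 1) / TCES_PER_PAGE;
    t->pages = calloc(t->npages, sizeof(*t->pages));
    return t->pages ? 0 : -ENOMEM;
}

/* Check a fast path could make: would this put have to allocate a page? */
static int tce_put_needs_alloc(const struct tce_table_model *t, size_t idx)
{
    size_t pg = idx / TCES_PER_PAGE;

    return pg < t->npages && t->pages[pg] == NULL;
}

/* Slow-path put: allocates the last-level page on first non-zero write. */
static int tce_put(struct tce_table_model *t, size_t idx, uint64_t tce)
{
    size_t pg = idx / TCES_PER_PAGE;

    if (pg >= t->npages)
        return -EINVAL;
    if (!t->pages[pg]) {
        if (tce == 0)
            return 0;        /* clearing an entry in an absent page: no-op */
        t->pages[pg] = calloc(TCES_PER_PAGE, sizeof(uint64_t));
        if (!t->pages[pg])
            return -ENOMEM;
    }
    t->pages[pg][idx % TCES_PER_PAGE] = tce;
    return 0;
}

/* Entries in pages that were never allocated read back as zero. */
static uint64_t tce_get(const struct tce_table_model *t, size_t idx)
{
    size_t pg = idx / TCES_PER_PAGE;

    if (pg >= t->npages || !t->pages[pg])
        return 0;
    return t->pages[pg][idx % TCES_PER_PAGE];
}

static void tce_table_free(struct tce_table_model *t)
{
    for (size_t i = 0; i < t->npages; i++)
        free(t->pages[i]);
    free(t->pages);
    t->pages = NULL;
    t->npages = 0;
}
```

A caller that must not allocate (the real mode handlers in the commit) would consult something like `tce_put_needs_alloc()` first and hand the request to the slower path when it returns non-zero.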
    • KVM: PPC: Book3S HV: Avoid lockdep debugging in TCE realmode handlers · 2001825e
      Alexey Kardashevskiy authored
      The kvmppc_tce_to_ua() helper is called from real and virtual modes
      and it works fine as long as CONFIG_DEBUG_LOCKDEP is not enabled.
      However, if lockdep debugging is on, lockdep will most likely break
      in kvm_memslots() because of srcu_dereference_check(), so we need to use
      the PPC-specific kvm_memslots_raw(), which uses the realmode-safe
      rcu_dereference_raw_notrace().
      
      This creates a realmode copy of kvmppc_tce_to_ua() which replaces
      kvm_memslots() with kvm_memslots_raw().
      
      Since kvmppc_rm_tce_to_ua() becomes static and can only be used inside
      HV KVM, this moves it earlier under CONFIG_KVM_BOOK3S_HV_POSSIBLE.
      
      This moves truly virtual-mode kvmppc_tce_to_ua() to where it belongs and
      drops the prmap parameter which was never used in the virtual mode.
      
      Fixes: d3695aa4 ("KVM: PPC: Add support for multiple-TCE hcalls", 2016-02-15)
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Fix lockdep warning when entering the guest · 3309bec8
      Alexey Kardashevskiy authored
      trace_hardirqs_on() sets current->hardirqs_enabled, and from then on
      lockdep assumes interrupts are enabled although they remain
      disabled until the context switches to the guest. A subsequent
      srcu_read_lock() checks the flags in rcu_lock_acquire(), observes
      disabled interrupts and prints a warning (see below).
      
      This moves trace_hardirqs_on/off closer to __kvmppc_vcore_entry to
      prevent lockdep from being confused.
      
      DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled)
      WARNING: CPU: 16 PID: 8038 at kernel/locking/lockdep.c:4128 check_flags.part.25+0x224/0x280
      [...]
      NIP [c000000000185b84] check_flags.part.25+0x224/0x280
      LR [c000000000185b80] check_flags.part.25+0x220/0x280
      Call Trace:
      [c000003fec253710] [c000000000185b80] check_flags.part.25+0x220/0x280 (unreliable)
      [c000003fec253780] [c000000000187ea4] lock_acquire+0x94/0x260
      [c000003fec253840] [c00800001a1e9768] kvmppc_run_core+0xa60/0x1ab0 [kvm_hv]
      [c000003fec253a10] [c00800001a1ed944] kvmppc_vcpu_run_hv+0x73c/0xec0 [kvm_hv]
      [c000003fec253ae0] [c00800001a1095dc] kvmppc_vcpu_run+0x34/0x48 [kvm]
      [c000003fec253b00] [c00800001a1056bc] kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm]
      [c000003fec253b90] [c00800001a0f3618] kvm_vcpu_ioctl+0x460/0x850 [kvm]
      [c000003fec253d00] [c00000000041c4f4] do_vfs_ioctl+0xe4/0x930
      [c000003fec253db0] [c00000000041ce04] ksys_ioctl+0xc4/0x110
      [c000003fec253e00] [c00000000041ce78] sys_ioctl+0x28/0x80
      [c000003fec253e20] [c00000000000b5a4] system_call+0x5c/0x70
      Instruction dump:
      419e0034 3d220004 39291730 81290000 2f890000 409e0020 3c82ffc6 3c62ffc5
      3884be70 386329c0 4bf6ea71 60000000 <0fe00000> 3c62ffc6 3863be90 4801273d
      irq event stamp: 1025
      hardirqs last  enabled at (1025): [<c00800001a1e9728>] kvmppc_run_core+0xa20/0x1ab0 [kvm_hv]
      hardirqs last disabled at (1024): [<c00800001a1e9358>] kvmppc_run_core+0x650/0x1ab0 [kvm_hv]
      softirqs last  enabled at (0): [<c0000000000f1210>] copy_process.isra.4.part.5+0x5f0/0x1d00
      softirqs last disabled at (0): [<0000000000000000>]           (null)
      ---[ end trace 31180adcc848993e ]---
      possible reason: unannotated irqs-off.
      irq event stamp: 1025
      hardirqs last  enabled at (1025): [<c00800001a1e9728>] kvmppc_run_core+0xa20/0x1ab0 [kvm_hv]
      hardirqs last disabled at (1024): [<c00800001a1e9358>] kvmppc_run_core+0x650/0x1ab0 [kvm_hv]
      softirqs last  enabled at (0): [<c0000000000f1210>] copy_process.isra.4.part.5+0x5f0/0x1d00
      softirqs last disabled at (0): [<0000000000000000>]           (null)
      
      Fixes: 8b24e69f ("KVM: PPC: Book3S HV: Close race with testing for signals on guest entry", 2017-06-26)
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Implement real mode H_PAGE_INIT handler · eadfb1c5
      Suraj Jitindar Singh authored
      Implement a real mode handler for the H_CALL H_PAGE_INIT which can be
      used to zero or copy a guest page. The page is defined to be 4k and must
      be 4k aligned.
      
      The in-kernel real mode handler halves the time to handle this H_CALL
      compared to handling it in userspace for a hash guest.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Implement virtual mode H_PAGE_INIT handler · 2d34d1c3
      Suraj Jitindar Singh authored
      Implement a virtual mode handler for the H_CALL H_PAGE_INIT which can be
      used to zero or copy a guest page. The page is defined to be 4k and must
      be 4k aligned.
      
      The in-kernel handler halves the time to handle this H_CALL compared to
      handling it in userspace for a radix guest.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
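The behaviour described in the two H_PAGE_INIT commits above (zero or copy one 4k page, with 4k alignment required) can be modelled in a few lines of user-space C. This is a hedged sketch, not the kernel handler: guest memory is a plain byte array, the flag bits and return codes (`PI_FLAG_ZERO`, `PI_FLAG_COPY`, `PI_SUCCESS`, `PI_PARAMETER`) are hypothetical values chosen for illustration, and giving the copy flag precedence over the zero flag is an assumption of this model.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PI_PAGE_SIZE 4096u

/* Hypothetical flag bits and return codes, chosen for this sketch only. */
#define PI_FLAG_ZERO 0x1u
#define PI_FLAG_COPY 0x2u
#define PI_SUCCESS   0
#define PI_PARAMETER (-1)

/* A page address is usable if it is 4k aligned and lies fully inside mem. */
static int pi_page_ok(size_t mem_size, uint64_t addr)
{
    if (addr % PI_PAGE_SIZE != 0)
        return 0;
    return mem_size >= PI_PAGE_SIZE && addr <= mem_size - PI_PAGE_SIZE;
}

/* Zero the page at dest, or copy the page at src to dest. */
static int page_init_model(uint8_t *mem, size_t mem_size, unsigned int flags,
                           uint64_t dest, uint64_t src)
{
    if (!pi_page_ok(mem_size, dest))
        return PI_PARAMETER;

    if (flags & PI_FLAG_COPY) {
        if (!pi_page_ok(mem_size, src))
            return PI_PARAMETER;
        /* memmove tolerates src == dest, which aligned pages allow. */
        memmove(mem + dest, mem + src, PI_PAGE_SIZE);
    } else if (flags & PI_FLAG_ZERO) {
        memset(mem + dest, 0, PI_PAGE_SIZE);
    }
    return PI_SUCCESS;
}
```

The alignment and bounds checks come first so that a bad address is rejected before any memory is touched, which mirrors the requirement that the page "must be 4k aligned".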
  2. 20 Apr, 2019 1 commit
    • powerpc: Add force enable of DAWR on P9 option · c1fe190c
      Michael Neuling authored
      This adds a flag so that the DAWR can be enabled on P9 via:
        echo Y > /sys/kernel/debug/powerpc/dawr_enable_dangerous
      
      The DAWR was previously force disabled on POWER9 in:
        96541531 powerpc: Disable DAWR in the base POWER9 CPU features
      Also see Documentation/powerpc/DAWR-POWER9.txt
      
      This is a dangerous setting, USE AT YOUR OWN RISK.
      
      Some users may not care about a bad user crashing their box
      (ie. single user/desktop systems) and really want the DAWR.  This
      allows them to force enable DAWR.
      
      This flag can also be used to disable DAWR access. Once this is
      cleared, all DAWR access should be cleared immediately and your
      machine is once again safe from crashing.
      
      Userspace may get confused by toggling this. If DAWR is force
      enabled/disabled between getting the number of breakpoints (via
      PTRACE_GETHWDBGINFO) and setting the breakpoint, userspace will get an
      inconsistent view of what's available. Similarly for guests.
      
      For the DAWR to be enabled in a KVM guest, the DAWR needs to be force
      enabled in the host AND the guest. For this reason, this won't work on
      POWERVM as it doesn't allow the HCALL to work. Writes of 'Y' to the
      dawr_enable_dangerous file will fail if the hypervisor doesn't support
      writing the DAWR.
      
      To double check the DAWR is working, run this kernel selftest:
        tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
      Any errors/failures/skips mean something is wrong.
      Signed-off-by: Michael Neuling <mikey@neuling.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  3. 11 Apr, 2019 1 commit
  4. 05 Apr, 2019 2 commits
    • KVM: PPC: Book3S: Protect memslots while validating user address · 345077c8
      Alexey Kardashevskiy authored
      Guest physical to user address translation uses KVM memslots and reading
      these requires holding the kvm->srcu lock. However recently introduced
      kvmppc_tce_validate() broke the rule (see the lockdep warning below).
      
      This moves srcu_read_lock(&vcpu->kvm->srcu) earlier to protect
      kvmppc_tce_validate() as well.
      
      =============================
      WARNING: suspicious RCU usage
      5.1.0-rc2-le_nv2_aikATfstn1-p1 #380 Not tainted
      -----------------------------
      include/linux/kvm_host.h:605 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 2, debug_locks = 1
      1 lock held by qemu-system-ppc/8020:
       #0: 0000000094972fe9 (&vcpu->mutex){+.+.}, at: kvm_vcpu_ioctl+0xdc/0x850 [kvm]
      
      stack backtrace:
      CPU: 44 PID: 8020 Comm: qemu-system-ppc Not tainted 5.1.0-rc2-le_nv2_aikATfstn1-p1 #380
      Call Trace:
      [c000003fece8f740] [c000000000bcc134] dump_stack+0xe8/0x164 (unreliable)
      [c000003fece8f790] [c000000000181be0] lockdep_rcu_suspicious+0x130/0x170
      [c000003fece8f810] [c0000000000d5f50] kvmppc_tce_to_ua+0x280/0x290
      [c000003fece8f870] [c00800001a7e2c78] kvmppc_tce_validate+0x80/0x1b0 [kvm]
      [c000003fece8f8e0] [c00800001a7e3fac] kvmppc_h_put_tce+0x94/0x3e4 [kvm]
      [c000003fece8f9a0] [c00800001a8baac4] kvmppc_pseries_do_hcall+0x30c/0xce0 [kvm_hv]
      [c000003fece8fa10] [c00800001a8bd89c] kvmppc_vcpu_run_hv+0x694/0xec0 [kvm_hv]
      [c000003fece8fae0] [c00800001a7d95dc] kvmppc_vcpu_run+0x34/0x48 [kvm]
      [c000003fece8fb00] [c00800001a7d56bc] kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm]
      [c000003fece8fb90] [c00800001a7c3618] kvm_vcpu_ioctl+0x460/0x850 [kvm]
      [c000003fece8fd00] [c00000000041c4f4] do_vfs_ioctl+0xe4/0x930
      [c000003fece8fdb0] [c00000000041ce04] ksys_ioctl+0xc4/0x110
      [c000003fece8fe00] [c00000000041ce78] sys_ioctl+0x28/0x80
      [c000003fece8fe20] [c00000000000b5a4] system_call+0x5c/0x70
      
      Fixes: 42de7b9e ("KVM: PPC: Validate TCEs against preregistered memory page sizes", 2018-09-10)
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Preserve PSSCR FAKE_SUSPEND bit on guest exit · 7cb9eb10
      Suraj Jitindar Singh authored
      There is a hardware bug in some POWER9 processors where a treclaim in
      fake suspend mode can cause an inconsistency in the XER[SO] bit across
      the threads of a core, the workaround being to force the core into SMT4
      when doing the treclaim.
      
      The FAKE_SUSPEND bit (bit 10) in the PSSCR is used to control whether a
      thread is in fake suspend or real suspend. The important difference here
      being that thread reconfiguration is blocked in real suspend but not
      fake suspend mode.
      
      When we exit a guest which was in fake suspend mode, we force the core
      into SMT4 while we do the treclaim in kvmppc_save_tm_hv().
      However on the new exit path introduced with the function
      kvmhv_run_single_vcpu() we restore the host PSSCR before calling
      kvmppc_save_tm_hv() which means that if we were in fake suspend mode we
      put the thread into real suspend mode when we clear the
      PSSCR[FAKE_SUSPEND] bit. This means that we block thread reconfiguration
      and the thread which is trying to get the core into SMT4 before it can
      do the treclaim spins forever since it itself is blocking thread
      reconfiguration. The result is that that core is essentially lost.
      
      This results in a trace such as:
      [   93.512904] CPU: 7 PID: 13352 Comm: qemu-system-ppc Not tainted 5.0.0 #4
      [   93.512905] NIP:  c000000000098a04 LR: c0000000000cc59c CTR: 0000000000000000
      [   93.512908] REGS: c000003fffd2bd70 TRAP: 0100   Not tainted  (5.0.0)
      [   93.512908] MSR:  9000000302883033 <SF,HV,VEC,VSX,FP,ME,IR,DR,RI,LE,TM[SE]>  CR: 22222444  XER: 00000000
      [   93.512914] CFAR: c000000000098a5c IRQMASK: 3
      [   93.512915] PACATMSCRATCH: 0000000000000001
      [   93.512916] GPR00: 0000000000000001 c000003f6cc1b830 c000000001033100 0000000000000004
      [   93.512928] GPR04: 0000000000000004 0000000000000002 0000000000000004 0000000000000007
      [   93.512930] GPR08: 0000000000000000 0000000000000004 0000000000000000 0000000000000004
      [   93.512932] GPR12: c000203fff7fc000 c000003fffff9500 0000000000000000 0000000000000000
      [   93.512935] GPR16: 2000000000300375 000000000000059f 0000000000000000 0000000000000000
      [   93.512951] GPR20: 0000000000000000 0000000000080053 004000000256f41f c000003f6aa88ef0
      [   93.512953] GPR24: c000003f6aa89100 0000000000000010 0000000000000000 0000000000000000
      [   93.512956] GPR28: c000003f9e9a0800 0000000000000000 0000000000000001 c000203fff7fc000
      [   93.512959] NIP [c000000000098a04] pnv_power9_force_smt4_catch+0x1b4/0x2c0
      [   93.512960] LR [c0000000000cc59c] kvmppc_save_tm_hv+0x40/0x88
      [   93.512960] Call Trace:
      [   93.512961] [c000003f6cc1b830] [0000000000080053] 0x80053 (unreliable)
      [   93.512965] [c000003f6cc1b8a0] [c00800001e9cb030] kvmhv_p9_guest_entry+0x508/0x6b0 [kvm_hv]
      [   93.512967] [c000003f6cc1b940] [c00800001e9cba44] kvmhv_run_single_vcpu+0x2dc/0xb90 [kvm_hv]
      [   93.512968] [c000003f6cc1ba10] [c00800001e9cc948] kvmppc_vcpu_run_hv+0x650/0xb90 [kvm_hv]
      [   93.512969] [c000003f6cc1bae0] [c00800001e8f620c] kvmppc_vcpu_run+0x34/0x48 [kvm]
      [   93.512971] [c000003f6cc1bb00] [c00800001e8f2d4c] kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm]
      [   93.512972] [c000003f6cc1bb90] [c00800001e8e3918] kvm_vcpu_ioctl+0x460/0x7d0 [kvm]
      [   93.512974] [c000003f6cc1bd00] [c0000000003ae2c0] do_vfs_ioctl+0xe0/0x8e0
      [   93.512975] [c000003f6cc1bdb0] [c0000000003aeb24] ksys_ioctl+0x64/0xe0
      [   93.512978] [c000003f6cc1be00] [c0000000003aebc8] sys_ioctl+0x28/0x80
      [   93.512981] [c000003f6cc1be20] [c00000000000b3a4] system_call+0x5c/0x70
      [   93.512983] Instruction dump:
      [   93.512986] 419dffbc e98c0000 2e8b0000 38000001 60000000 60000000 60000000 40950068
      [   93.512993] 392bffff 39400000 79290020 39290001 <7d2903a6> 60000000 60000000 7d235214
      
      To fix this we preserve the PSSCR[FAKE_SUSPEND] bit until we call
      kvmppc_save_tm_hv() which will mean the core can get into SMT4 and
      perform the treclaim. Note kvmppc_save_tm_hv() clears the
      PSSCR[FAKE_SUSPEND] bit again so there is no need to explicitly do that.
      
      Fixes: 95a6432c ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  5. 28 Mar, 2019 19 commits
    • Merge tag 'kvmarm-fixes-for-5.1' of... · 690edec5
      Paolo Bonzini authored
      Merge tag 'kvmarm-fixes-for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master
      
      KVM/ARM fixes for 5.1
      
      - Fix THP handling in the presence of pre-existing PTEs
      - Honor request for PTE mappings even when THPs are available
      - GICv4 performance improvement
      - Take the srcu lock when writing to guest-controlled ITS data structures
      - Reset the virtual PMU in preemptible context
      - Various cleanups
    • Documentation: kvm: clarify KVM_SET_USER_MEMORY_REGION · e2788c4a
      Paolo Bonzini authored
      The documentation does not mention how to delete a slot; add the
      information.
      Reported-by: Nathaniel McCallum <npmccallum@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: doc: Document the life cycle of a VM and its resources · 919f6cd8
      Sean Christopherson authored
      The series to add memcg accounting to KVM allocations[1] states:
      
        There are many KVM kernel memory allocations which are tied to the
        life of the VM process and should be charged to the VM process's
        cgroup.
      
      While it is correct to account KVM kernel allocations to the cgroup of
      the process that created the VM, it's technically incorrect to state
      that the KVM kernel memory allocations are tied to the life of the VM
      process.  This is because the VM itself, i.e. struct kvm, is not tied to
      the life of the process which created it, rather it is tied to the life
      of its associated file descriptor.  In other words, kvm_destroy_vm() is
      not invoked until fput() decrements its associated file's refcount to
      zero.  A simple example is to fork() in Qemu and have the child sleep
      indefinitely; kvm_destroy_vm() isn't called until Qemu closes its file
      descriptor *and* the rogue child is killed.
      
      The allocations are guaranteed to be *accounted* to the process which
      created the VM, but only because KVM's per-{VM,vCPU} ioctls reject the
      ioctl() with -EIO if kvm->mm != current->mm.  I.e. the child can keep
      the VM "alive" but can't do anything useful with its reference.
      
      Note that because 'struct kvm' also holds a reference to the mm_struct
      of its owner, the above behavior also applies to userspace allocations.
      
      Given that mucking with a VM's file descriptor can lead to subtle and
      undesirable behavior, e.g. memcg charges persisting after a VM is shut
      down, explicitly document a VM's lifecycle and its impact on the VM's
      resources.
      
      Alternatively, KVM could aggressively free resources when the creating
      process exits, e.g. via mmu_notifier->release().  However, mmu_notifier
      isn't guaranteed to be available, and freeing resources when the creator
      exits is likely to be error prone and fragile as KVM would need to
      ensure that it only freed resources that are truly out of reach. In
      practice, the existing behavior shouldn't be problematic as a properly
      configured system will prevent a child process from being moved out of
      the appropriate cgroup hierarchy, i.e. prevent hiding the process from
      the OOM killer, and will prevent an unprivileged user from being able to
      hold a reference to struct kvm via another method, e.g. debugfs.
      
      [1] https://patchwork.kernel.org/patch/10806707/

      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Sean Christopherson's avatar
      KVM: selftests: complete IO before migrating guest state · 0f73bbc8
      Sean Christopherson authored
      Documentation/virtual/kvm/api.txt states:
      
        NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_OSI, KVM_EXIT_PAPR and
              KVM_EXIT_EPR the corresponding operations are complete (and guest
              state is consistent) only after userspace has re-entered the
              kernel with KVM_RUN.  The kernel side will first finish incomplete
              operations and then check for pending signals.  Userspace can
              re-enter the guest with an unmasked signal pending to complete
              pending operations.
      
      Because guest state may be inconsistent, starting state migration after
      an IO exit without first completing IO may result in test failures, e.g.
      a proposed change to KVM's handling of %rip in its fast PIO handling[1]
      will cause the new VM, i.e. the post-migration VM, to have its %rip set
      to the IN instruction that triggered KVM_EXIT_IO, leading to a test
      assertion due to a stage mismatch.
      
      For simplicity, require KVM_CAP_IMMEDIATE_EXIT to complete IO and skip
      the test if it's not available.  The addition of KVM_CAP_IMMEDIATE_EXIT
      predates the state selftest by more than a year.
      
      [1] https://patchwork.kernel.org/patch/10848545/
      
      Fixes: fa3899ad ("kvm: selftests: add basic test for state save and restore")
      Reported-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0f73bbc8
    • Sean Christopherson's avatar
      KVM: selftests: disable stack protector for all KVM tests · ffac839d
      Sean Christopherson authored
      Since 4.8.3, gcc has enabled -fstack-protector by default.  This is
      problematic for the KVM selftests as they do not configure fs or gs
      segments (the stack canary is pulled from fs:0x28).  With the default
      behavior, gcc will insert a stack canary on any function that creates
      buffers of 8 bytes or more.  As a result, ucall() will hit a triple
      fault shutdown due to reading a bad fs segment when inserting its
      stack canary, i.e. every test fails with an unexpected SHUTDOWN.
      
      Fixes: 14c47b75 ("kvm: selftests: introduce ucall")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ffac839d
    • Sean Christopherson's avatar
      KVM: selftests: explicitly disable PIE for tests · 0a3f29b5
      Sean Christopherson authored
      KVM selftests embed the guest "image" as a function in the test itself
      and extract the guest code at runtime by manually parsing the elf
      headers.  The parsing is very simple and doesn't support fancy things
      like position independent executables.  Recent versions of gcc enable
      pie by default, which results in triple fault shutdowns in the guest due
      to the virtual address in the headers not matching up with the virtual
      address retrieved from the function pointer.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0a3f29b5
    • Sean Christopherson's avatar
      KVM: selftests: assert on exit reason in CR4/cpuid sync test · 8df98ae0
      Sean Christopherson authored
      ...so that the test doesn't end up in an infinite loop if it fails for
      whatever reason, e.g. SHUTDOWN due to gcc inserting stack canary code
      into ucall() and attempting to dereference a null segment.
      
      Fixes: ca359066 ("kvm: selftests: add cr4_cpuid_sync_test")
      Cc: Wei Huang <wei@redhat.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8df98ae0
    • Sean Christopherson's avatar
      KVM: x86: update %rip after emulating IO · 45def77e
      Sean Christopherson authored
      Most (all?) x86 platforms provide a port IO based reset mechanism, e.g.
      OUT 92h or CF9h.  Userspace may emulate said mechanism, i.e. reset a
      vCPU in response to KVM_EXIT_IO, without explicitly announcing to KVM
      that it is doing a reset, e.g. Qemu jams vCPU state and resumes running.
      
      To avoid corrupting %rip after such a reset, commit 0967b7bf ("KVM:
      Skip pio instruction when it is emulated, not executed") changed the
      behavior of PIO handlers, i.e. today's "fast" PIO handling, to skip the
      instruction prior to exiting to userspace.  Full emulation doesn't need
      such tricks because re-emulating the instruction will naturally handle
      %rip being changed to point at the reset vector.
      
      Updating %rip prior to exiting to userspace has several drawbacks:
      
        - Userspace sees the wrong %rip on the exit, e.g. if PIO emulation
          fails it will likely yell about the wrong address.
        - Single-step exits to userspace are effectively dropped as
          KVM_EXIT_DEBUG is overwritten with KVM_EXIT_IO.
        - Behavior of PIO emulation is different depending on whether it
          goes down the fast path or the slow path.
      
      Rather than skip the PIO instruction before exiting to userspace,
      snapshot the linear %rip and cancel PIO completion if the current
      value does not match the snapshot.  For a 64-bit vCPU, i.e. the most
      common scenario, the snapshot and comparison has negligible overhead
      as VMCS.GUEST_RIP will be cached regardless, i.e. there is no extra
      VMREAD in this case.
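
      The snapshot-and-compare approach can be illustrated with a simplified
      userspace model.  All names below (struct vcpu_model, pio_begin(),
      pio_complete()) are hypothetical and only mirror the idea; they are not
      KVM's actual structures or functions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified vCPU state; not KVM's real structures. */
struct vcpu_model {
    uint64_t cs_base;         /* code segment base */
    uint64_t rip;             /* instruction pointer */
    uint64_t pio_linear_rip;  /* snapshot taken before exiting to userspace */
};

static uint64_t linear_rip(const struct vcpu_model *v)
{
    return v->cs_base + v->rip;
}

/* Before exiting with an IO exit: remember where the PIO instruction lives. */
static void pio_begin(struct vcpu_model *v)
{
    v->pio_linear_rip = linear_rip(v);
}

/*
 * On re-entry: skip the instruction only if the linear %rip still matches
 * the snapshot.  Returns true if the instruction was skipped, false if
 * completion was cancelled because userspace changed %rip (e.g. a reset).
 */
static bool pio_complete(struct vcpu_model *v, unsigned int insn_len)
{
    if (linear_rip(v) != v->pio_linear_rip)
        return false;
    v->rip += insn_len;
    return true;
}
```

      Because the skip is deferred until re-entry, userspace sees the %rip of
      the PIO instruction itself on the exit, and a reset that rewrites %rip
      is left untouched.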
      
      All other alternatives to snapshotting the linear %rip that don't
      rely on an explicit reset announcement suffer from one corner case
      or another.  For example, canceling PIO completion on any write to
      %rip fails if userspace does a save/restore of %rip, and attempting to
      avoid that issue by canceling PIO only if %rip changed then fails if PIO
      collides with the reset %rip.  Attempting to zero in on the exact reset
      vector won't work for APs, which means adding more hooks such as the
      vCPU's MP_STATE, and so on and so forth.
      
      Checking for a linear %rip match technically suffers from corner cases,
      e.g. userspace could theoretically rewrite the underlying code page and
      expect a different instruction to execute, or the guest hardcodes a PIO
      reset at 0xfffffff0, but those are far, far outside of what can be
      considered normal operation.
      
      Fixes: 432baf60 ("KVM: VMX: use kvm_fast_pio_in for handling IN I/O")
      Cc: <stable@vger.kernel.org>
      Reported-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      45def77e
    • Vitaly Kuznetsov's avatar
      x86/kvm/hyper-v: avoid spurious pending stimer on vCPU init · 013cc6eb
      Vitaly Kuznetsov authored
      When userspace initializes guest vCPUs it may want to zero all supported
      MSRs, including Hyper-V related ones such as HV_X64_MSR_STIMERn_CONFIG/
      HV_X64_MSR_STIMERn_COUNT. With commit f3b138c5 ("kvm/x86: Update SynIC
      timers on guest entry only") we began doing stimer_mark_pending()
      unconditionally on every config change.
      
      The issue I'm observing manifests itself as follows:
      - Qemu writes 0 to STIMERn_{CONFIG,COUNT} MSRs and marks all stimers as
        pending in stimer_pending_bitmap, arms KVM_REQ_HV_STIMER;
      - kvm_hv_has_stimer_pending() starts returning true;
      - kvm_vcpu_has_events() starts returning true;
      - kvm_arch_vcpu_runnable() starts returning true;
      - when kvm_arch_vcpu_ioctl_run() gets into
        (vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED) case:
        - kvm_vcpu_block() gets in 'kvm_vcpu_check_block(vcpu) < 0' and returns
          immediately, avoiding normal wait path;
        - -EAGAIN is returned from kvm_arch_vcpu_ioctl_run() immediately forcing
          userspace to retry.
      
      So instead of the normal wait path we get a busy loop on all secondary
      vCPUs before they get an INIT signal. This seems undesirable, especially
      given that this happens even when Hyper-V extensions are not used.
      
      Generally, it seems to be pointless to mark an stimer as pending in
      stimer_pending_bitmap and arm KVM_REQ_HV_STIMER as the only thing
      kvm_hv_process_stimers() will do is clear the corresponding bit. We may
      just not mark disabled timers as pending instead.
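
      A minimal model of the proposed behavior, assuming the enable flag is
      bit 0 of the config value.  The struct, the macro names and the helper
      functions are illustrative only and do not match KVM's definitions.

```c
#include <stdbool.h>
#include <stdint.h>

#define STIMER_COUNT   4
#define STIMER_ENABLE  (1ULL << 0)  /* assumed position of the enable bit */

/* Hypothetical per-vCPU synthetic timer state. */
struct stimer_model {
    uint64_t config[STIMER_COUNT];
    uint32_t pending_bitmap;  /* one bit per timer */
};

/* Config write: only an enabled timer is marked as pending. */
static void stimer_set_config(struct stimer_model *s, unsigned int idx,
                              uint64_t config)
{
    s->config[idx] = config;
    if (config & STIMER_ENABLE)
        s->pending_bitmap |= 1u << idx;
}

/* Stands in for the "has a pending stimer" check that makes a vCPU runnable. */
static bool stimer_has_pending(const struct stimer_model *s)
{
    return s->pending_bitmap != 0;
}
```

      With this check in place, zeroing every config value leaves the bitmap
      empty, so an uninitialized vCPU is not reported as having pending events.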
      
      Fixes: f3b138c5 ("kvm/x86: Update SynIC timers on guest entry only")
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      013cc6eb
    • Xiaoyao Li's avatar
      kvm/x86: Move MSR_IA32_ARCH_CAPABILITIES to array emulated_msrs · 2bdb76c0
      Xiaoyao Li authored
      MSR_IA32_ARCH_CAPABILITIES is emulated unconditionally, even if the
      host doesn't support it, so move it from the msrs_to_save array to the
      emulated_msrs array to report to userspace that the guest supports
      this MSR.
      Signed-off-by: Xiaoyao Li <xiaoyao.li@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2bdb76c0
    • Sean Christopherson's avatar
      KVM: x86: Emulate MSR_IA32_ARCH_CAPABILITIES on AMD hosts · 0cf9135b
      Sean Christopherson authored
      The CPUID flag ARCH_CAPABILITIES is unconditionally exposed to host
      userspace for all x86 hosts, i.e. KVM advertises ARCH_CAPABILITIES
      regardless of hardware support under the pretense that KVM fully
      emulates MSR_IA32_ARCH_CAPABILITIES.  Unfortunately, only VMX hosts
      handle accesses to MSR_IA32_ARCH_CAPABILITIES (despite KVM_GET_MSRS
      also reporting MSR_IA32_ARCH_CAPABILITIES for all hosts).
      
      Move the MSR_IA32_ARCH_CAPABILITIES handling to common x86 code so
      that it's emulated on AMD hosts.
      
      Fixes: 1eaafe91 ("kvm: x86: IA32_ARCH_CAPABILITIES is always supported")
      Cc: stable@vger.kernel.org
      Reported-by: Xiaoyao Li <xiaoyao.li@linux.intel.com>
      Cc: Jim Mattson <jmattson@google.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0cf9135b
    • Sebastian Andrzej Siewior's avatar
      kvm: don't redefine flags as something else · ca0488aa
      Sebastian Andrzej Siewior authored
      The function irqfd_wakeup() has flags defined as __poll_t and then it
      has an additional flags variable which is used for irqflags.
      
      Redefine the inner flags variable as iflags so it does not shadow the
      outer flags.
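
      The naming issue can be shown with a small standalone function.  The
      event bits and wakeup_model() are hypothetical and only mimic the shape
      of a wakeup handler; iflags stands in for the saved IRQ state.

```c
#include <stdint.h>

#define EV_IN   0x01u  /* hypothetical "data ready" event bit */
#define EV_HUP  0x10u  /* hypothetical "hang up" event bit */

/*
 * Returns a bitmask of the actions a wakeup handler would take:
 * bit 0 = inject an interrupt, bit 1 = tear down.
 */
static unsigned int wakeup_model(uint32_t flags)
{
    unsigned int actions = 0;

    if (flags & EV_IN)
        actions |= 1u;

    if (flags & EV_HUP) {
        /*
         * Distinct name on purpose: declaring another "flags" here would
         * shadow the parameter inside this block.
         */
        unsigned long iflags = 0;  /* stands in for saved IRQ state */

        (void)iflags;
        actions |= 2u;
    }

    return actions;
}
```

      Compilers can flag this pattern when built with -Wshadow, which is one
      reason to keep the two names distinct.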
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: kvm@vger.kernel.org
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ca0488aa
    • Ben Gardon's avatar
      kvm: mmu: Used range based flushing in slot_handle_level_range · f285c633
      Ben Gardon authored
      Replace kvm_flush_remote_tlbs with kvm_flush_remote_tlbs_with_address
      in slot_handle_level_range. When range based flushes are not enabled
      kvm_flush_remote_tlbs_with_address falls back to kvm_flush_remote_tlbs.
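
      The fallback can be pictured with the following sketch, which assumes
      the range-based flush is an optional hook that may be absent or may
      fail.  struct flush_stats, range_flush_fn and flush_with_address() are
      made-up names for illustration, not the kernel's interfaces.

```c
#include <stddef.h>
#include <stdint.h>

/* Counters so the model's behavior can be observed. */
struct flush_stats {
    unsigned int ranged;  /* range-based flushes issued */
    unsigned int full;    /* full flushes issued */
};

/* Optional hook; NULL when range-based flushing is not enabled. */
typedef int (*range_flush_fn)(uint64_t start_gfn, uint64_t pages);

/* Example hook that always succeeds. */
static int range_flush_ok(uint64_t start_gfn, uint64_t pages)
{
    (void)start_gfn;
    (void)pages;
    return 0;
}

/* Try the range-based flush first, otherwise fall back to a full flush. */
static void flush_with_address(struct flush_stats *st, range_flush_fn hook,
                               uint64_t start_gfn, uint64_t pages)
{
    if (hook && hook(start_gfn, pages) == 0) {
        st->ranged++;
        return;
    }
    st->full++;
}
```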
      
      This changes the behavior of many functions that indirectly use
      slot_handle_level_range, iff the range based flushes are enabled. The
      only potential problem I see with this is that kvm->tlbs_dirty will be
      cleared less often, however the only caller of slot_handle_level_range that
      checks tlbs_dirty is kvm_mmu_notifier_invalidate_range_start which
      checks it and does a kvm_flush_remote_tlbs after calling
      kvm_unmap_hva_range anyway.
      
      Tested: Ran all kvm-unit-tests on an Intel Haswell machine with and
      without this patch. The patch introduced no new failures.
      Signed-off-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f285c633
    • Masahiro Yamada's avatar
      KVM: export <linux/kvm_para.h> and <asm/kvm_para.h> iif KVM is supported · 3d9683cf
      Masahiro Yamada authored
      I do not see any consistency about headers_install of <linux/kvm_para.h>
      and <asm/kvm_para.h>.
      
      According to my analysis of Linux 5.1-rc1, there are 3 groups:
      
       [1] Both <linux/kvm_para.h> and <asm/kvm_para.h> are exported
      
          alpha, arm, hexagon, mips, powerpc, s390, sparc, x86
      
       [2] <asm/kvm_para.h> is exported, but <linux/kvm_para.h> is not
      
          arc, arm64, c6x, h8300, ia64, m68k, microblaze, nios2, openrisc,
          parisc, sh, unicore32, xtensa
      
       [3] Neither <linux/kvm_para.h> nor <asm/kvm_para.h> is exported
      
          csky, nds32, riscv
      
      This does not match the actual KVM support. At least, [2] is
      half-baked.
      
      Nor do arch maintainers look like they care about this. For example,
      commit 0add5371 ("microblaze: Add missing kvm_para.h to Kbuild")
      exported <asm/kvm_para.h> to user-space in order to fix an in-kernel
      build error.
      
      We have two ways to make this consistent:
      
       [A] export both <linux/kvm_para.h> and <asm/kvm_para.h> for all
           architectures, irrespective of the KVM support
      
       [B] Match the header export of <linux/kvm_para.h> and <asm/kvm_para.h>
           to the KVM support
      
      My first attempt was [A] because the code looks cleaner, but Paolo
      suggested [B].
      
      So, this commit goes with [B].
      
      For most architectures, <asm/kvm_para.h> was moved to the kernel-space.
      I changed include/uapi/linux/Kbuild so that it checks generated
      asm/kvm_para.h as well as checked-in ones.
      
      After this commit, there will be two groups:
      
       [1] Both <linux/kvm_para.h> and <asm/kvm_para.h> are exported
      
          arm, arm64, mips, powerpc, s390, x86
      
       [2] Neither <linux/kvm_para.h> nor <asm/kvm_para.h> is exported
      
          alpha, arc, c6x, csky, h8300, hexagon, ia64, m68k, microblaze,
          nds32, nios2, openrisc, parisc, riscv, sh, sparc, unicore32, xtensa
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Acked-by: Cornelia Huck <cohuck@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3d9683cf
    • Wei Yang's avatar
      KVM: x86: remove check on nr_mmu_pages in kvm_arch_commit_memory_region() · 4d66623c
      Wei Yang authored
      * nr_mmu_pages would be non-zero only if kvm->arch.n_requested_mmu_pages is
        non-zero.
      
      * nr_mmu_pages is always non-zero, since kvm_mmu_calculate_mmu_pages()
        never returns zero.
      
      For these two reasons, we can merge the two *if* clauses and use the
      return value from kvm_mmu_calculate_mmu_pages() directly. This simplifies
      the code and also eliminates the possibility of a reader believing that
      nr_mmu_pages could be zero.
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4d66623c
    • Krish Sadhukhan's avatar
      kvm: nVMX: Add a vmentry check for HOST_SYSENTER_ESP and HOST_SYSENTER_EIP fields · 711eff3a
      Krish Sadhukhan authored
      According to section "Checks on VMX Controls" in Intel SDM vol 3C, the
      following check is performed on vmentry of L2 guests:
      
          On processors that support Intel 64 architecture, the IA32_SYSENTER_ESP
          field and the IA32_SYSENTER_EIP field must each contain a canonical
          address.
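      
      For reference, a canonical-address test can be sketched as below,
      assuming the caller passes the number of implemented virtual address
      bits (e.g. 48).  is_canonical() is an illustrative helper written for
      this note, not the function KVM uses.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * An address is canonical when bits 63..(vaddr_bits - 1) are either all
 * zeros or all ones, i.e. the upper bits are a sign extension of the top
 * implemented bit.  vaddr_bits is assumed to be in the range 2..64.
 */
static bool is_canonical(uint64_t addr, unsigned int vaddr_bits)
{
    uint64_t upper = addr >> (vaddr_bits - 1);
    uint64_t all_ones = UINT64_MAX >> (vaddr_bits - 1);

    return upper == 0 || upper == all_ones;
}
```

      With 48 implemented bits, 0x00007fffffffffff and 0xffff800000000000 pass
      the check, while 0x0000800000000000 does not.
      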
      Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      711eff3a
    • Singh, Brijesh's avatar
      KVM: SVM: Workaround errata#1096 (insn_len maybe zero on SMAP violation) · 05d5a486
      Singh, Brijesh authored
      Errata#1096:
      
      On a nested data page fault when CR4.SMAP=1 and the guest data read
      generates a SMAP violation, the GuestInstrBytes field of the VMCB on a
      VMEXIT will incorrectly return 0h instead of the correct guest
      instruction bytes.
      
      Recommended Workaround:
      
      To determine what instruction the guest was executing, the hypervisor
      will have to decode the instruction at the instruction pointer.
      
      The recommended workaround cannot be implemented for an SEV
      guest because guest memory is encrypted with the guest-specific key,
      and the instruction decoder will not be able to decode the instruction
      bytes. If we hit this erratum in an SEV guest, log a message
      and request a guest shutdown.
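      
      The decision described above can be summarized by a simplified model.
      The enum values and choose_fault_action() are hypothetical, and the
      conditions are reduced to the three inputs mentioned in this message;
      the real check may consider more state.

```c
#include <stdbool.h>

/* Hypothetical outcomes for a nested page fault that needs emulation. */
enum fault_action {
    ERRATUM_NOT_HIT,           /* instruction bytes are present or SMAP is off */
    DECODE_FROM_GUEST_MEMORY,  /* recommended workaround */
    REQUEST_SHUTDOWN,          /* SEV guest: memory cannot be decoded */
};

/*
 * insn_len:     number of instruction bytes reported on the VMEXIT
 * smap_enabled: whether the guest runs with SMAP enabled
 * sev_guest:    whether guest memory is encrypted with a guest key
 */
static enum fault_action choose_fault_action(unsigned int insn_len,
                                             bool smap_enabled,
                                             bool sev_guest)
{
    if (insn_len != 0 || !smap_enabled)
        return ERRATUM_NOT_HIT;
    if (sev_guest)
        return REQUEST_SHUTDOWN;
    return DECODE_FROM_GUEST_MEMORY;
}
```
      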
      Reported-by: Venkatesh Srinivas <venkateshs@google.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      05d5a486
    • Sean Christopherson's avatar
      KVM: Reject device ioctls from processes other than the VM's creator · ddba9180
      Sean Christopherson authored
      KVM's API requires that ioctls be issued from the same process
      that created the VM.  In other words, userspace can play games with a
      VM's file descriptors, e.g. fork(), SCM_RIGHTS, etc., but only the
      creator can do anything useful.  Explicitly reject device ioctls that
      are issued by a process other than the VM's creator, and update KVM's
      API documentation to extend its requirements to device ioctls.
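
      The ownership check can be modeled in a few lines, assuming that a
      mismatch is reported as -EIO as is done for VM and vCPU ioctls.  The
      struct names and device_ioctl_model() are placeholders for illustration
      and are not the kernel's types.

```c
#include <errno.h>

/* Hypothetical stand-in for a process address space. */
struct mm_model {
    int id;
};

/* Hypothetical VM that remembers the address space of its creator. */
struct vm_model {
    const struct mm_model *mm;
};

/* Reject the request unless the caller shares the creator's address space. */
static int device_ioctl_model(const struct vm_model *vm,
                              const struct mm_model *current_mm)
{
    if (vm->mm != current_mm)
        return -EIO;
    return 0;  /* the real handler would go on to dispatch the ioctl */
}
```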
      
      Fixes: 852b6d57 ("kvm: add device control API")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ddba9180
    • Sean Christopherson's avatar
      KVM: doc: Fix incorrect word ordering regarding supported use of APIs · 5e124900
      Sean Christopherson authored
      Per Paolo[1], instantiating multiple VMs in a single process is legal;
      but this conflicts with KVM's API documentation, which states:
      
        The only supported use is one virtual machine per process, and one
        vcpu per thread.
      
      However, an earlier section in the documentation states:
      
         Only run VM ioctls from the same process (address space) that was used
         to create the VM.
      
      and:
      
         Only run vcpu ioctls from the same thread that was used to create the
         vcpu.
      
      This suggests that the conflicting documentation is simply an incorrect
      ordering of words, i.e. what's really meant is that a virtual machine
      can't be shared across multiple processes and a vCPU can't be shared
      across multiple threads.
      
      Tweak the blurb on issuing ioctls to use a more assertive tone, and
      rewrite the "supported use" sentence to reference said blurb instead of
      poorly restating it in different terms.
      
      Opportunistically add missing punctuation.
      
      [1] https://lkml.kernel.org/r/f23265d4-528e-3bd4-011f-4d7b8f3281db@redhat.com
      
      Fixes: 9c1b96e3 ("KVM: Document basic API")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      [Improve notes on asynchronous ioctl]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5e124900