1. 30 Apr, 2019 27 commits
    • Cédric Le Goater's avatar
      KVM: Introduce a 'release' method for KVM devices · 2bde9b3e
      Cédric Le Goater authored
      When a P9 sPAPR VM boots, the CAS negotiation process determines which
      interrupt mode to use (XICS legacy or XIVE native) and invokes a
      machine reset to activate the chosen mode.
      
      To be able to switch from one interrupt mode to another, we introduce
      the capability to release a KVM device without destroying the VM. The
      KVM device interface is extended with a new 'release' method which is
      called when the file descriptor of the device is closed.
      
      Once 'release' is called, the 'destroy' method will not be called
      anymore as the device is removed from the device list of the VM.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarCédric Le Goater <clg@kaod.org>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      2bde9b3e
    • Cédric Le Goater's avatar
      KVM: PPC: Book3S HV: XIVE: Activate XIVE exploitation mode · 3fab2d10
      Cédric Le Goater authored
      Full support for the XIVE native exploitation mode is now available,
      advertise the capability KVM_CAP_PPC_IRQ_XIVE for guests running on
      PowerNV KVM Hypervisors only. Support for nested guests (pseries KVM
      Hypervisor) is not yet available. XIVE should also have been activated
      which is default setting on POWER9 systems running a recent Linux
      kernel.
      Signed-off-by: default avatarCédric Le Goater <clg@kaod.org>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      3fab2d10
    • Cédric Le Goater's avatar
      KVM: PPC: Book3S HV: XIVE: Add passthrough support · 232b984b
      Cédric Le Goater authored
      The KVM XICS-over-XIVE device and the proposed KVM XIVE native device
      implement an IRQ space for the guest using the generic IPI interrupts
      of the XIVE IC controller. These interrupts are allocated at the OPAL
      level and "mapped" into the guest IRQ number space in the range 0-0x1FFF.
      Interrupt management is performed in the XIVE way: using loads and
      stores on the addresses of the XIVE IPI interrupt ESB pages.
      
      Both KVM devices share the same internal structure caching information
      on the interrupts, among which the xive_irq_data struct containing the
      addresses of the IPI ESB pages and an extra one in case of pass-through.
      The later contains the addresses of the ESB pages of the underlying HW
      controller interrupts, PHB4 in all cases for now.
      
      A guest, when running in the XICS legacy interrupt mode, lets the KVM
      XICS-over-XIVE device "handle" interrupt management, that is to
      perform the loads and stores on the addresses of the ESB pages of the
      guest interrupts. However, when running in XIVE native exploitation
      mode, the KVM XIVE native device exposes the interrupt ESB pages to
      the guest and lets the guest perform directly the loads and stores.
      
      The VMA exposing the ESB pages make use of a custom VM fault handler
      which role is to populate the VMA with appropriate pages. When a fault
      occurs, the guest IRQ number is deduced from the offset, and the ESB
      pages of associated XIVE IPI interrupt are inserted in the VMA (using
      the internal structure caching information on the interrupts).
      
      Supporting device passthrough in the guest running in XIVE native
      exploitation mode adds some extra refinements because the ESB pages
      of a different HW controller (PHB4) need to be exposed to the guest
      along with the initial IPI ESB pages of the XIVE IC controller. But
      the overall mechanic is the same.
      
      When the device HW irqs are mapped into or unmapped from the guest
      IRQ number space, the passthru_irq helpers, kvmppc_xive_set_mapped()
      and kvmppc_xive_clr_mapped(), are called to record or clear the
      passthrough interrupt information and to perform the switch.
      
      The approach taken by this patch is to clear the ESB pages of the
      guest IRQ number being mapped and let the VM fault handler repopulate.
      The handler will insert the ESB page corresponding to the HW interrupt
      of the device being passed-through or the initial IPI ESB page if the
      device is being removed.
      Signed-off-by: default avatarCédric Le Goater <clg@kaod.org>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      232b984b
    • Cédric Le Goater's avatar
      KVM: PPC: Book3S HV: XIVE: Add a mapping for the source ESB pages · 6520ca64
      Cédric Le Goater authored
      Each source is associated with an Event State Buffer (ESB) with a
      even/odd pair of pages which provides commands to manage the source:
      to trigger, to EOI, to turn off the source for instance.
      
      The custom VM fault handler will deduce the guest IRQ number from the
      offset of the fault, and the ESB page of the associated XIVE interrupt
      will be inserted into the VMA using the internal structure caching
      information on the interrupts.
      Signed-off-by: default avatarCédric Le Goater <clg@kaod.org>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      6520ca64
    • Cédric Le Goater's avatar
      KVM: PPC: Book3S HV: XIVE: Add a TIMA mapping · 39e9af3d
      Cédric Le Goater authored
      Each thread has an associated Thread Interrupt Management context
      composed of a set of registers. These registers let the thread handle
      priority management and interrupt acknowledgment. The most important
      are :
      
          - Interrupt Pending Buffer     (IPB)
          - Current Processor Priority   (CPPR)
          - Notification Source Register (NSR)
      
      They are exposed to software in four different pages each proposing a
      view with a different privilege. The first page is for the physical
      thread context and the second for the hypervisor. Only the third
      (operating system) and the fourth (user level) are exposed the guest.
      
      A custom VM fault handler will populate the VMA with the appropriate
      pages, which should only be the OS page for now.
      Signed-off-by: default avatarCédric Le Goater <clg@kaod.org>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      39e9af3d
    • Cédric Le Goater's avatar
      KVM: Introduce a 'mmap' method for KVM devices · a1cd3f08
      Cédric Le Goater authored
      Some KVM devices will want to handle special mappings related to the
      underlying HW. For instance, the XIVE interrupt controller of the
      POWER9 processor has MMIO pages for thread interrupt management and
      for interrupt source control that need to be exposed to the guest when
      the OS has the required support.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarCédric Le Goater <clg@kaod.org>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      a1cd3f08
    • Cédric Le Goater's avatar
      KVM: PPC: Book3S HV: XIVE: Add get/set accessors for the VP XIVE state · e4945b9d
      Cédric Le Goater authored
      The state of the thread interrupt management registers needs to be
      collected for migration. These registers are cached under the
      'xive_saved_state.w01' field of the VCPU when the VPCU context is
      pulled from the HW thread. An OPAL call retrieves the backup of the
      IPB register in the underlying XIVE NVT structure and merges it in the
      KVM state.
      Signed-off-by: default avatarCédric Le Goater <clg@kaod.org>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      e4945b9d
    • Cédric Le Goater's avatar
      KVM: PPC: Book3S HV: XIVE: Add a control to dirty the XIVE EQ pages · e6714bd1
      Cédric Le Goater authored
      When migration of a VM is initiated, a first copy of the RAM is
      transferred to the destination before the VM is stopped, but there is
      no guarantee that the EQ pages in which the event notifications are
      queued have not been modified.
      
      To make sure migration will capture a consistent memory state, the
      XIVE device should perform a XIVE quiesce sequence to stop the flow of
      event notifications and stabilize the EQs. This is the purpose of the
      KVM_DEV_XIVE_EQ_SYNC control which will also marks the EQ pages dirty
      to force their transfer.
      Signed-off-by: default avatarCédric Le Goater <clg@kaod.org>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      e6714bd1
    • Cédric Le Goater's avatar
      KVM: PPC: Book3S HV: XIVE: Add a control to sync the sources · 7b46b616
      Cédric Le Goater authored
      This control will be used by the H_INT_SYNC hcall from QEMU to flush
      event notifications on the XIVE IC owning the source.
      Signed-off-by: default avatarCédric Le Goater <clg@kaod.org>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      7b46b616
    • Cédric Le Goater's avatar
      KVM: PPC: Book3S HV: XIVE: Add a global reset control · 5ca80647
      Cédric Le Goater authored
      This control is to be used by the H_INT_RESET hcall from QEMU. Its
      purpose is to clear all configuration of the sources and EQs. This is
      necessary in case of a kexec (for a kdump kernel for instance) to make
      sure that no remaining configuration is left from the previous boot
      setup so that the new kernel can start safely from a clean state.
      
      The queue 7 is ignored when the XIVE device is configured to run in
      single escalation mode. Prio 7 is used by escalations.
      
      The XIVE VP is kept enabled as the vCPU is still active and connected
      to the XIVE device.
      Signed-off-by: default avatarCédric Le Goater <clg@kaod.org>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      5ca80647
    • Cédric Le Goater's avatar
      KVM: PPC: Book3S HV: XIVE: Add controls for the EQ configuration · 13ce3297
      Cédric Le Goater authored
      These controls will be used by the H_INT_SET_QUEUE_CONFIG and
      H_INT_GET_QUEUE_CONFIG hcalls from QEMU to configure the underlying
      Event Queue in the XIVE IC. They will also be used to restore the
      configuration of the XIVE EQs and to capture the internal run-time
      state of the EQs. Both 'get' and 'set' rely on an OPAL call to access
      the EQ toggle bit and EQ index which are updated by the XIVE IC when
      event notifications are enqueued in the EQ.
      
      The value of the guest physical address of the event queue is saved in
      the XIVE internal xive_q structure for later use. That is when
      migration needs to mark the EQ pages dirty to capture a consistent
      memory state of the VM.
      
      To be noted that H_INT_SET_QUEUE_CONFIG does not require the extra
      OPAL call setting the EQ toggle bit and EQ index to configure the EQ,
      but restoring the EQ state will.
      Signed-off-by: default avatarCédric Le Goater <clg@kaod.org>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      13ce3297
    • Cédric Le Goater's avatar
      KVM: PPC: Book3S HV: XIVE: Add a control to configure a source · e8676ce5
      Cédric Le Goater authored
      This control will be used by the H_INT_SET_SOURCE_CONFIG hcall from
      QEMU to configure the target of a source and also to restore the
      configuration of a source when migrating the VM.
      
      The XIVE source interrupt structure is extended with the value of the
      Effective Interrupt Source Number. The EISN is the interrupt number
      pushed in the event queue that the guest OS will use to dispatch
      events internally. Caching the EISN value in KVM eases the test when
      checking if a reconfiguration is indeed needed.
      Signed-off-by: default avatarCédric Le Goater <clg@kaod.org>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      e8676ce5
    • Cédric Le Goater's avatar
      KVM: PPC: Book3S HV: XIVE: add a control to initialize a source · 4131f83c
      Cédric Le Goater authored
      The XIVE KVM device maintains a list of interrupt sources for the VM
      which are allocated in the pool of generic interrupts (IPIs) of the
      main XIVE IC controller. These are used for the CPU IPIs as well as
      for virtual device interrupts. The IRQ number space is defined by
      QEMU.
      
      The XIVE device reuses the source structures of the XICS-on-XIVE
      device for the source blocks (2-level tree) and for the source
      interrupts. Under XIVE native, the source interrupt caches mostly
      configuration information and is less used than under the XICS-on-XIVE
      device in which hcalls are still necessary at run-time.
      
      When a source is initialized in KVM, an IPI interrupt source is simply
      allocated at the OPAL level and then MASKED. KVM only needs to know
      about its type: LSI or MSI.
      Signed-off-by: default avatarCédric Le Goater <clg@kaod.org>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      4131f83c
    • Cédric Le Goater's avatar
      KVM: PPC: Book3S HV: XIVE: Introduce a new capability KVM_CAP_PPC_IRQ_XIVE · eacc56bb
      Cédric Le Goater authored
      The user interface exposes a new capability KVM_CAP_PPC_IRQ_XIVE to
      let QEMU connect the vCPU presenters to the XIVE KVM device if
      required. The capability is not advertised for now as the full support
      for the XIVE native exploitation mode is not yet available. When this
      is case, the capability will be advertised on PowerNV Hypervisors
      only. Nested guests (pseries KVM Hypervisor) are not supported.
      
      Internally, the interface to the new KVM device is protected with a
      new interrupt mode: KVMPPC_IRQ_XIVE.
      Signed-off-by: default avatarCédric Le Goater <clg@kaod.org>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      eacc56bb
    • Cédric Le Goater's avatar
      KVM: PPC: Book3S HV: Add a new KVM device for the XIVE native exploitation mode · 90c73795
      Cédric Le Goater authored
      This is the basic framework for the new KVM device supporting the XIVE
      native exploitation mode. The user interface exposes a new KVM device
      to be created by QEMU, only available when running on a L0 hypervisor.
      Support for nested guests is not available yet.
      
      The XIVE device reuses the device structure of the XICS-on-XIVE device
      as they have a lot in common. That could possibly change in the future
      if the need arise.
      Signed-off-by: default avatarCédric Le Goater <clg@kaod.org>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      90c73795
    • Paul Mackerras's avatar
      Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into kvm-ppc-next · a878957a
      Paul Mackerras authored
      This merges in the ppc-kvm topic branch from the powerpc tree to get
      patches which touch both general powerpc code and KVM code, one of
      which is a prerequisite for following patches.
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      a878957a
    • Suraj Jitindar Singh's avatar
      KVM: PPC: Book3S HV: Save/restore vrsave register in kvmhv_p9_guest_entry() · 44b198ae
      Suraj Jitindar Singh authored
      On POWER9 and later processors where the host can schedule vcpus on a
      per thread basis, there is a streamlined entry path used when the guest
      is radix. This entry path saves/restores the fp and vr state in
      kvmhv_p9_guest_entry() by calling store_[fp/vr]_state() and
      load_[fp/vr]_state(). This is the same as the old entry path however the
      old entry path also saved/restored the VRSAVE register, which isn't done
      in the new entry path.
      
      This means that the vrsave register is now volatile across guest exit,
      which is an incorrect change in behaviour.
      
      Fix this by saving/restoring the vrsave register in kvmhv_p9_guest_entry().
      This restores the old, correct, behaviour.
      
      Fixes: 95a6432c ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")
      Signed-off-by: default avatarSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      44b198ae
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Flush TLB on secondary radix threads · 70ea13f6
      Paul Mackerras authored
      When running on POWER9 with kvm_hv.indep_threads_mode = N and the host
      in SMT1 mode, KVM will run guest VCPUs on offline secondary threads.
      If those guests are in radix mode, we fail to load the LPID and flush
      the TLB if necessary, leading to the guest crashing with an
      unsupported MMU fault.  This arises from commit 9a4506e1 ("KVM:
      PPC: Book3S HV: Make radix handle process scoped LPID flush in C,
      with relocation on", 2018-05-17), which didn't consider the case
      where indep_threads_mode = N.
      
      For simplicity, this makes the real-mode guest entry path flush the
      TLB in the same place for both radix and hash guests, as we did before
      9a4506e1, though the code is now C code rather than assembly code.
      We also have the radix TLB flush open-coded rather than calling
      radix__local_flush_tlb_lpid_guest(), because the TLB flush can be
      called in real mode, and in real mode we don't want to invoke the
      tracepoint code.
      
      Fixes: 9a4506e1 ("KVM: PPC: Book3S HV: Make radix handle process scoped LPID flush in C, with relocation on")
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      70ea13f6
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Move HPT guest TLB flushing to C code · 2940ba0c
      Paul Mackerras authored
      This replaces assembler code in book3s_hv_rmhandlers.S that checks
      the kvm->arch.need_tlb_flush cpumask and optionally does a TLB flush
      with C code in book3s_hv_builtin.c.  Note that unlike the radix
      version, the hash version doesn't do an explicit ERAT invalidation
      because we will invalidate and load up the SLB before entering the
      guest, and that will invalidate the ERAT.
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      2940ba0c
    • Suraj Jitindar Singh's avatar
      KVM: PPC: Book3S HV: Handle virtual mode in XIVE VCPU push code · 7ae9bda7
      Suraj Jitindar Singh authored
      The code in book3s_hv_rmhandlers.S that pushes the XIVE virtual CPU
      context to the hardware currently assumes it is being called in real
      mode, which is usually true.  There is however a path by which it can
      be executed in virtual mode, in the case where indep_threads_mode = N.
      A virtual CPU executing on an offline secondary thread can take a
      hypervisor interrupt in virtual mode and return from the
      kvmppc_hv_entry() call after the kvm_secondary_got_guest label.
      It is possible for it to be given another vcpu to execute before it
      gets to execute the stop instruction.  In that case it will call
      kvmppc_hv_entry() for the second VCPU in virtual mode, and the XIVE
      vCPU push code will be executed in virtual mode.  The result in that
      case will be a host crash due to an unexpected data storage interrupt
      caused by executing the stdcix instruction in virtual mode.
      
      This fixes it by adding a code path for virtual mode, which uses the
      virtual TIMA pointer and normal load/store instructions.
      
      [paulus@ozlabs.org - wrote patch description]
      Signed-off-by: default avatarSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      7ae9bda7
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Fix XICS-on-XIVE H_IPI when priority = 0 · 1f80ba3d
      Paul Mackerras authored
      This fixes a bug in the XICS emulation on POWER9 machines which is
      triggered by the guest doing a H_IPI with priority = 0 (the highest
      priority).  What happens is that the notification interrupt arrives
      at the destination at priority zero.  The loop in scan_interrupts()
      sees that a priority 0 interrupt is pending, but because xc->mfrr is
      zero, we break out of the loop before taking the notification
      interrupt out of the queue and EOI-ing it.  (This doesn't happen
      when xc->mfrr != 0; in that case we process the priority-0 notification
      interrupt on the first iteration of the loop, and then break out of
      a subsequent iteration of the loop with hirq == XICS_IPI.)
      
      To fix this, we move the prio >= xc->mfrr check down to near the end
      of the loop.  However, there are then some other things that need to
      be adjusted.  Since we are potentially handling the notification
      interrupt and also delivering an IPI to the guest in the same loop
      iteration, we need to update pending and handle any q->pending_count
      value before the xc->mfrr check, rather than at the end of the loop.
      Also, we need to update the queue pointers when we have processed and
      EOI-ed the notification interrupt, since we may not do it later.
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      1f80ba3d
    • Palmer Dabbelt's avatar
      KVM: PPC: Book3S HV: smb->smp comment fixup · 6fabc9f2
      Palmer Dabbelt authored
      I made the same typo when trying to grep for uses of smp_wmb and figured
      I might as well fix it.
      Signed-off-by: default avatarPalmer Dabbelt <palmer@sifive.com>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      6fabc9f2
    • Alexey Kardashevskiy's avatar
      KVM: PPC: Book3S: Allocate guest TCEs on demand too · e1a1ef84
      Alexey Kardashevskiy authored
      We already allocate hardware TCE tables in multiple levels and skip
      intermediate levels when we can, now it is a turn of the KVM TCE tables.
      Thankfully these are allocated already in 2 levels.
      
      This moves the table's last level allocation from the creating helper to
      kvmppc_tce_put() and kvm_spapr_tce_fault(). Since such allocation cannot
      be done in real mode, this creates a virtual mode version of
      kvmppc_tce_put() which handles allocations.
      
      This adds kvmppc_rm_ioba_validate() to do an additional test if
      the consequent kvmppc_tce_put() needs a page which has not been allocated;
      if this is the case, we bail out to virtual mode handlers.
      
      The allocations are protected by a new mutex as kvm->lock is not suitable
      for the task because the fault handler is called with the mmap_sem held
      but kvmhv_setup_mmu() locks kvm->lock and mmap_sem in the reverse order.
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      e1a1ef84
    • Alexey Kardashevskiy's avatar
      KVM: PPC: Book3S HV: Avoid lockdep debugging in TCE realmode handlers · 2001825e
      Alexey Kardashevskiy authored
      The kvmppc_tce_to_ua() helper is called from real and virtual modes
      and it works fine as long as CONFIG_DEBUG_LOCKDEP is not enabled.
      However if the lockdep debugging is on, the lockdep will most likely break
      in kvm_memslots() because of srcu_dereference_check() so we need to use
      PPC-own kvm_memslots_raw() which uses realmode safe
      rcu_dereference_raw_notrace().
      
      This creates a realmode copy of kvmppc_tce_to_ua() which replaces
      kvm_memslots() with kvm_memslots_raw().
      
      Since kvmppc_rm_tce_to_ua() becomes static and can only be used inside
      HV KVM, this moves it earlier under CONFIG_KVM_BOOK3S_HV_POSSIBLE.
      
      This moves truly virtual-mode kvmppc_tce_to_ua() to where it belongs and
      drops the prmap parameter which was never used in the virtual mode.
      
      Fixes: d3695aa4 ("KVM: PPC: Add support for multiple-TCE hcalls", 2016-02-15)
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      2001825e
    • Alexey Kardashevskiy's avatar
      KVM: PPC: Book3S HV: Fix lockdep warning when entering the guest · 3309bec8
      Alexey Kardashevskiy authored
      The trace_hardirqs_on() sets current->hardirqs_enabled and from here
      the lockdep assumes interrupts are enabled although they are remain
      disabled until the context switches to the guest. Consequent
      srcu_read_lock() checks the flags in rcu_lock_acquire(), observes
      disabled interrupts and prints a warning (see below).
      
      This moves trace_hardirqs_on/off closer to __kvmppc_vcore_entry to
      prevent lockdep from being confused.
      
      DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled)
      WARNING: CPU: 16 PID: 8038 at kernel/locking/lockdep.c:4128 check_flags.part.25+0x224/0x280
      [...]
      NIP [c000000000185b84] check_flags.part.25+0x224/0x280
      LR [c000000000185b80] check_flags.part.25+0x220/0x280
      Call Trace:
      [c000003fec253710] [c000000000185b80] check_flags.part.25+0x220/0x280 (unreliable)
      [c000003fec253780] [c000000000187ea4] lock_acquire+0x94/0x260
      [c000003fec253840] [c00800001a1e9768] kvmppc_run_core+0xa60/0x1ab0 [kvm_hv]
      [c000003fec253a10] [c00800001a1ed944] kvmppc_vcpu_run_hv+0x73c/0xec0 [kvm_hv]
      [c000003fec253ae0] [c00800001a1095dc] kvmppc_vcpu_run+0x34/0x48 [kvm]
      [c000003fec253b00] [c00800001a1056bc] kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm]
      [c000003fec253b90] [c00800001a0f3618] kvm_vcpu_ioctl+0x460/0x850 [kvm]
      [c000003fec253d00] [c00000000041c4f4] do_vfs_ioctl+0xe4/0x930
      [c000003fec253db0] [c00000000041ce04] ksys_ioctl+0xc4/0x110
      [c000003fec253e00] [c00000000041ce78] sys_ioctl+0x28/0x80
      [c000003fec253e20] [c00000000000b5a4] system_call+0x5c/0x70
      Instruction dump:
      419e0034 3d220004 39291730 81290000 2f890000 409e0020 3c82ffc6 3c62ffc5
      3884be70 386329c0 4bf6ea71 60000000 <0fe00000> 3c62ffc6 3863be90 4801273d
      irq event stamp: 1025
      hardirqs last  enabled at (1025): [<c00800001a1e9728>] kvmppc_run_core+0xa20/0x1ab0 [kvm_hv]
      hardirqs last disabled at (1024): [<c00800001a1e9358>] kvmppc_run_core+0x650/0x1ab0 [kvm_hv]
      softirqs last  enabled at (0): [<c0000000000f1210>] copy_process.isra.4.part.5+0x5f0/0x1d00
      softirqs last disabled at (0): [<0000000000000000>]           (null)
      ---[ end trace 31180adcc848993e ]---
      possible reason: unannotated irqs-off.
      irq event stamp: 1025
      hardirqs last  enabled at (1025): [<c00800001a1e9728>] kvmppc_run_core+0xa20/0x1ab0 [kvm_hv]
      hardirqs last disabled at (1024): [<c00800001a1e9358>] kvmppc_run_core+0x650/0x1ab0 [kvm_hv]
      softirqs last  enabled at (0): [<c0000000000f1210>] copy_process.isra.4.part.5+0x5f0/0x1d00
      softirqs last disabled at (0): [<0000000000000000>]           (null)
      
      Fixes: 8b24e69f ("KVM: PPC: Book3S HV: Close race with testing for signals on guest entry", 2017-06-26)
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      3309bec8
    • Suraj Jitindar Singh's avatar
      KVM: PPC: Book3S HV: Implement real mode H_PAGE_INIT handler · eadfb1c5
      Suraj Jitindar Singh authored
      Implement a real mode handler for the H_CALL H_PAGE_INIT which can be
      used to zero or copy a guest page. The page is defined to be 4k and must
      be 4k aligned.
      
      The in-kernel real mode handler halves the time to handle this H_CALL
      compared to handling it in userspace for a hash guest.
      Signed-off-by: default avatarSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      eadfb1c5
    • Suraj Jitindar Singh's avatar
      KVM: PPC: Book3S HV: Implement virtual mode H_PAGE_INIT handler · 2d34d1c3
      Suraj Jitindar Singh authored
      Implement a virtual mode handler for the H_CALL H_PAGE_INIT which can be
      used to zero or copy a guest page. The page is defined to be 4k and must
      be 4k aligned.
      
      The in-kernel handler halves the time to handle this H_CALL compared to
      handling it in userspace for a radix guest.
      Signed-off-by: default avatarSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      2d34d1c3
  2. 20 Apr, 2019 1 commit
    • Michael Neuling's avatar
      powerpc: Add force enable of DAWR on P9 option · c1fe190c
      Michael Neuling authored
      This adds a flag so that the DAWR can be enabled on P9 via:
        echo Y > /sys/kernel/debug/powerpc/dawr_enable_dangerous
      
      The DAWR was previously force disabled on POWER9 in:
        96541531 powerpc: Disable DAWR in the base POWER9 CPU features
      Also see Documentation/powerpc/DAWR-POWER9.txt
      
      This is a dangerous setting, USE AT YOUR OWN RISK.
      
      Some users may not care about a bad user crashing their box
      (ie. single user/desktop systems) and really want the DAWR.  This
      allows them to force enable DAWR.
      
      This flag can also be used to disable DAWR access. Once this is
      cleared, all DAWR access should be cleared immediately and your
      machine once again safe from crashing.
      
      Userspace may get confused by toggling this. If DAWR is force
      enabled/disabled between getting the number of breakpoints (via
      PTRACE_GETHWDBGINFO) and setting the breakpoint, userspace will get an
      inconsistent view of what's available. Similarly for guests.
      
      For the DAWR to be enabled in a KVM guest, the DAWR needs to be force
      enabled in the host AND the guest. For this reason, this won't work on
      POWERVM as it doesn't allow the HCALL to work. Writes of 'Y' to the
      dawr_enable_dangerous file will fail if the hypervisor doesn't support
      writing the DAWR.
      
      To double check the DAWR is working, run this kernel selftest:
        tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
      Any errors/failures/skips mean something is wrong.
      Signed-off-by: default avatarMichael Neuling <mikey@neuling.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      c1fe190c
  3. 11 Apr, 2019 1 commit
  4. 05 Apr, 2019 2 commits
    • Alexey Kardashevskiy's avatar
      KVM: PPC: Book3S: Protect memslots while validating user address · 345077c8
      Alexey Kardashevskiy authored
      Guest physical to user address translation uses KVM memslots and reading
      these requires holding the kvm->srcu lock. However recently introduced
      kvmppc_tce_validate() broke the rule (see the lockdep warning below).
      
      This moves srcu_read_lock(&vcpu->kvm->srcu) earlier to protect
      kvmppc_tce_validate() as well.
      
      =============================
      WARNING: suspicious RCU usage
      5.1.0-rc2-le_nv2_aikATfstn1-p1 #380 Not tainted
      -----------------------------
      include/linux/kvm_host.h:605 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 2, debug_locks = 1
      1 lock held by qemu-system-ppc/8020:
       #0: 0000000094972fe9 (&vcpu->mutex){+.+.}, at: kvm_vcpu_ioctl+0xdc/0x850 [kvm]
      
      stack backtrace:
      CPU: 44 PID: 8020 Comm: qemu-system-ppc Not tainted 5.1.0-rc2-le_nv2_aikATfstn1-p1 #380
      Call Trace:
      [c000003fece8f740] [c000000000bcc134] dump_stack+0xe8/0x164 (unreliable)
      [c000003fece8f790] [c000000000181be0] lockdep_rcu_suspicious+0x130/0x170
      [c000003fece8f810] [c0000000000d5f50] kvmppc_tce_to_ua+0x280/0x290
      [c000003fece8f870] [c00800001a7e2c78] kvmppc_tce_validate+0x80/0x1b0 [kvm]
      [c000003fece8f8e0] [c00800001a7e3fac] kvmppc_h_put_tce+0x94/0x3e4 [kvm]
      [c000003fece8f9a0] [c00800001a8baac4] kvmppc_pseries_do_hcall+0x30c/0xce0 [kvm_hv]
      [c000003fece8fa10] [c00800001a8bd89c] kvmppc_vcpu_run_hv+0x694/0xec0 [kvm_hv]
      [c000003fece8fae0] [c00800001a7d95dc] kvmppc_vcpu_run+0x34/0x48 [kvm]
      [c000003fece8fb00] [c00800001a7d56bc] kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm]
      [c000003fece8fb90] [c00800001a7c3618] kvm_vcpu_ioctl+0x460/0x850 [kvm]
      [c000003fece8fd00] [c00000000041c4f4] do_vfs_ioctl+0xe4/0x930
      [c000003fece8fdb0] [c00000000041ce04] ksys_ioctl+0xc4/0x110
      [c000003fece8fe00] [c00000000041ce78] sys_ioctl+0x28/0x80
      [c000003fece8fe20] [c00000000000b5a4] system_call+0x5c/0x70
      
      Fixes: 42de7b9e ("KVM: PPC: Validate TCEs against preregistered memory page sizes", 2018-09-10)
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      345077c8
    • Suraj Jitindar Singh's avatar
      KVM: PPC: Book3S HV: Perserve PSSCR FAKE_SUSPEND bit on guest exit · 7cb9eb10
      Suraj Jitindar Singh authored
      There is a hardware bug in some POWER9 processors where a treclaim in
      fake suspend mode can cause an inconsistency in the XER[SO] bit across
      the threads of a core, the workaround being to force the core into SMT4
      when doing the treclaim.
      
      The FAKE_SUSPEND bit (bit 10) in the PSSCR is used to control whether a
      thread is in fake suspend or real suspend. The important difference here
      being that thread reconfiguration is blocked in real suspend but not
      fake suspend mode.
      
      When we exit a guest which was in fake suspend mode, we force the core
      into SMT4 while we do the treclaim in kvmppc_save_tm_hv().
      However on the new exit path introduced with the function
      kvmhv_run_single_vcpu() we restore the host PSSCR before calling
      kvmppc_save_tm_hv() which means that if we were in fake suspend mode we
      put the thread into real suspend mode when we clear the
      PSSCR[FAKE_SUSPEND] bit. This means that we block thread reconfiguration
      and the thread which is trying to get the core into SMT4 before it can
      do the treclaim spins forever since it itself is blocking thread
      reconfiguration. The result is that that core is essentially lost.
      
      This results in a trace such as:
      [   93.512904] CPU: 7 PID: 13352 Comm: qemu-system-ppc Not tainted 5.0.0 #4
      [   93.512905] NIP:  c000000000098a04 LR: c0000000000cc59c CTR: 0000000000000000
      [   93.512908] REGS: c000003fffd2bd70 TRAP: 0100   Not tainted  (5.0.0)
      [   93.512908] MSR:  9000000302883033 <SF,HV,VEC,VSX,FP,ME,IR,DR,RI,LE,TM[SE]>  CR: 22222444  XER: 00000000
      [   93.512914] CFAR: c000000000098a5c IRQMASK: 3
      [   93.512915] PACATMSCRATCH: 0000000000000001
      [   93.512916] GPR00: 0000000000000001 c000003f6cc1b830 c000000001033100 0000000000000004
      [   93.512928] GPR04: 0000000000000004 0000000000000002 0000000000000004 0000000000000007
      [   93.512930] GPR08: 0000000000000000 0000000000000004 0000000000000000 0000000000000004
      [   93.512932] GPR12: c000203fff7fc000 c000003fffff9500 0000000000000000 0000000000000000
      [   93.512935] GPR16: 2000000000300375 000000000000059f 0000000000000000 0000000000000000
      [   93.512951] GPR20: 0000000000000000 0000000000080053 004000000256f41f c000003f6aa88ef0
      [   93.512953] GPR24: c000003f6aa89100 0000000000000010 0000000000000000 0000000000000000
      [   93.512956] GPR28: c000003f9e9a0800 0000000000000000 0000000000000001 c000203fff7fc000
      [   93.512959] NIP [c000000000098a04] pnv_power9_force_smt4_catch+0x1b4/0x2c0
      [   93.512960] LR [c0000000000cc59c] kvmppc_save_tm_hv+0x40/0x88
      [   93.512960] Call Trace:
      [   93.512961] [c000003f6cc1b830] [0000000000080053] 0x80053 (unreliable)
      [   93.512965] [c000003f6cc1b8a0] [c00800001e9cb030] kvmhv_p9_guest_entry+0x508/0x6b0 [kvm_hv]
      [   93.512967] [c000003f6cc1b940] [c00800001e9cba44] kvmhv_run_single_vcpu+0x2dc/0xb90 [kvm_hv]
      [   93.512968] [c000003f6cc1ba10] [c00800001e9cc948] kvmppc_vcpu_run_hv+0x650/0xb90 [kvm_hv]
      [   93.512969] [c000003f6cc1bae0] [c00800001e8f620c] kvmppc_vcpu_run+0x34/0x48 [kvm]
      [   93.512971] [c000003f6cc1bb00] [c00800001e8f2d4c] kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm]
      [   93.512972] [c000003f6cc1bb90] [c00800001e8e3918] kvm_vcpu_ioctl+0x460/0x7d0 [kvm]
      [   93.512974] [c000003f6cc1bd00] [c0000000003ae2c0] do_vfs_ioctl+0xe0/0x8e0
      [   93.512975] [c000003f6cc1bdb0] [c0000000003aeb24] ksys_ioctl+0x64/0xe0
      [   93.512978] [c000003f6cc1be00] [c0000000003aebc8] sys_ioctl+0x28/0x80
      [   93.512981] [c000003f6cc1be20] [c00000000000b3a4] system_call+0x5c/0x70
      [   93.512983] Instruction dump:
      [   93.512986] 419dffbc e98c0000 2e8b0000 38000001 60000000 60000000 60000000 40950068
      [   93.512993] 392bffff 39400000 79290020 39290001 <7d2903a6> 60000000 60000000 7d235214
      
      To fix this we preserve the PSSCR[FAKE_SUSPEND] bit until we call
      kvmppc_save_tm_hv() which will mean the core can get into SMT4 and
      perform the treclaim. Note kvmppc_save_tm_hv() clears the
      PSSCR[FAKE_SUSPEND] bit again so there is no need to explicitly do that.
      
      Fixes: 95a6432c ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")
      Signed-off-by: default avatarSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      7cb9eb10
  5. 28 Mar, 2019 9 commits
    • Paolo Bonzini's avatar
      Merge tag 'kvmarm-fixes-for-5.1' of... · 690edec5
      Paolo Bonzini authored
      Merge tag 'kvmarm-fixes-for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master
      
      KVM/ARM fixes for 5.1
      
      - Fix THP handling in the presence of pre-existing PTEs
      - Honor request for PTE mappings even when THPs are available
      - GICv4 performance improvement
      - Take the srcu lock when writing to guest-controlled ITS data structures
      - Reset the virtual PMU in preemptible context
      - Various cleanups
      690edec5
    • Paolo Bonzini's avatar
      Documentation: kvm: clarify KVM_SET_USER_MEMORY_REGION · e2788c4a
      Paolo Bonzini authored
      The documentation does not mention how to delete a slot, add the
      information.
      Reported-by: default avatarNathaniel McCallum <npmccallum@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      e2788c4a
    • Sean Christopherson's avatar
      KVM: doc: Document the life cycle of a VM and its resources · 919f6cd8
      Sean Christopherson authored
      The series to add memcg accounting to KVM allocations[1] states:
      
        There are many KVM kernel memory allocations which are tied to the
        life of the VM process and should be charged to the VM process's
        cgroup.
      
      While it is correct to account KVM kernel allocations to the cgroup of
      the process that created the VM, it's technically incorrect to state
      that the KVM kernel memory allocations are tied to the life of the VM
      process.  This is because the VM itself, i.e. struct kvm, is not tied to
      the life of the process which created it, rather it is tied to the life
      of its associated file descriptor.  In other words, kvm_destroy_vm() is
      not invoked until fput() decrements its associated file's refcount to
      zero.  A simple example is to fork() in Qemu and have the child sleep
      indefinitely; kvm_destroy_vm() isn't called until Qemu closes its file
      descriptor *and* the rogue child is killed.
      
      The allocations are guaranteed to be *accounted* to the process which
      created the VM, but only because KVM's per-{VM,vCPU} ioctls reject the
      ioctl() with -EIO if kvm->mm != current->mm.  I.e. the child can keep
      the VM "alive" but can't do anything useful with its reference.
      
      Note that because 'struct kvm' also holds a reference to the mm_struct
      of its owner, the above behavior also applies to userspace allocations.
      
      Given that mucking with a VM's file descriptor can lead to subtle and
      undesirable behavior, e.g. memcg charges persisting after a VM is shut
      down, explicitly document a VM's lifecycle and its impact on the VM's
      resources.
      
      Alternatively, KVM could aggressively free resources when the creating
      process exits, e.g. via mmu_notifier->release().  However, mmu_notifier
      isn't guaranteed to be available, and freeing resources when the creator
      exits is likely to be error prone and fragile as KVM would need to
      ensure that it only freed resources that are truly out of reach. In
      practice, the existing behavior shouldn't be problematic as a properly
      configured system will prevent a child process from being moved out of
      the appropriate cgroup hierarchy, i.e. prevent hiding the process from
      the OOM killer, and will prevent an unprivileged user from being able to
      to hold a reference to struct kvm via another method, e.g. debugfs.
      
      [1]https://patchwork.kernel.org/patch/10806707/Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      919f6cd8
    • Sean Christopherson's avatar
      KVM: selftests: complete IO before migrating guest state · 0f73bbc8
      Sean Christopherson authored
      Documentation/virtual/kvm/api.txt states:
      
        NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_OSI, KVM_EXIT_PAPR and
              KVM_EXIT_EPR the corresponding operations are complete (and guest
              state is consistent) only after userspace has re-entered the
              kernel with KVM_RUN.  The kernel side will first finish incomplete
              operations and then check for pending signals.  Userspace can
              re-enter the guest with an unmasked signal pending to complete
              pending operations.
      
      Because guest state may be inconsistent, starting state migration after
      an IO exit without first completing IO may result in test failures, e.g.
      a proposed change to KVM's handling of %rip in its fast PIO handling[1]
      will cause the new VM, i.e. the post-migration VM, to have its %rip set
      to the IN instruction that triggered KVM_EXIT_IO, leading to a test
      assertion due to a stage mismatch.
      
      For simplicitly, require KVM_CAP_IMMEDIATE_EXIT to complete IO and skip
      the test if it's not available.  The addition of KVM_CAP_IMMEDIATE_EXIT
      predates the state selftest by more than a year.
      
      [1] https://patchwork.kernel.org/patch/10848545/
      
      Fixes: fa3899ad ("kvm: selftests: add basic test for state save and restore")
      Reported-by: default avatarJim Mattson <jmattson@google.com>
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      0f73bbc8
    • Sean Christopherson's avatar
      KVM: selftests: disable stack protector for all KVM tests · ffac839d
      Sean Christopherson authored
      Since 4.8.3, gcc has enabled -fstack-protector by default.  This is
      problematic for the KVM selftests as they do not configure fs or gs
      segments (the stack canary is pulled from fs:0x28).  With the default
      behavior, gcc will insert a stack canary on any function that creates
      buffers of 8 bytes or more.  As a result, ucall() will hit a triple
      fault shutdown due to reading a bad fs segment when inserting its
      stack canary, i.e. every test fails with an unexpected SHUTDOWN.
      
      Fixes: 14c47b75 ("kvm: selftests: introduce ucall")
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      ffac839d
    • Sean Christopherson's avatar
      KVM: selftests: explicitly disable PIE for tests · 0a3f29b5
      Sean Christopherson authored
      KVM selftests embed the guest "image" as a function in the test itself
      and extract the guest code at runtime by manually parsing the elf
      headers.  The parsing is very simple and doesn't supporting fancy things
      like position independent executables.  Recent versions of gcc enable
      pie by default, which results in triple fault shutdowns in the guest due
      to the virtual address in the headers not matching up with the virtual
      address retrieved from the function pointer.
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      0a3f29b5
    • Sean Christopherson's avatar
      KVM: selftests: assert on exit reason in CR4/cpuid sync test · 8df98ae0
      Sean Christopherson authored
      ...so that the test doesn't end up in an infinite loop if it fails for
      whatever reason, e.g. SHUTDOWN due to gcc inserting stack canary code
      into ucall() and attempting to derefence a null segment.
      
      Fixes: ca359066 ("kvm: selftests: add cr4_cpuid_sync_test")
      Cc: Wei Huang <wei@redhat.com>
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      8df98ae0
    • Sean Christopherson's avatar
      KVM: x86: update %rip after emulating IO · 45def77e
      Sean Christopherson authored
      Most (all?) x86 platforms provide a port IO based reset mechanism, e.g.
      OUT 92h or CF9h.  Userspace may emulate said mechanism, i.e. reset a
      vCPU in response to KVM_EXIT_IO, without explicitly announcing to KVM
      that it is doing a reset, e.g. Qemu jams vCPU state and resumes running.
      
      To avoid corruping %rip after such a reset, commit 0967b7bf ("KVM:
      Skip pio instruction when it is emulated, not executed") changed the
      behavior of PIO handlers, i.e. today's "fast" PIO handling to skip the
      instruction prior to exiting to userspace.  Full emulation doesn't need
      such tricks becase re-emulating the instruction will naturally handle
      %rip being changed to point at the reset vector.
      
      Updating %rip prior to executing to userspace has several drawbacks:
      
        - Userspace sees the wrong %rip on the exit, e.g. if PIO emulation
          fails it will likely yell about the wrong address.
        - Single step exits to userspace for are effectively dropped as
          KVM_EXIT_DEBUG is overwritten with KVM_EXIT_IO.
        - Behavior of PIO emulation is different depending on whether it
          goes down the fast path or the slow path.
      
      Rather than skip the PIO instruction before exiting to userspace,
      snapshot the linear %rip and cancel PIO completion if the current
      value does not match the snapshot.  For a 64-bit vCPU, i.e. the most
      common scenario, the snapshot and comparison has negligible overhead
      as VMCS.GUEST_RIP will be cached regardless, i.e. there is no extra
      VMREAD in this case.
      
      All other alternatives to snapshotting the linear %rip that don't
      rely on an explicit reset announcenment suffer from one corner case
      or another.  For example, canceling PIO completion on any write to
      %rip fails if userspace does a save/restore of %rip, and attempting to
      avoid that issue by canceling PIO only if %rip changed then fails if PIO
      collides with the reset %rip.  Attempting to zero in on the exact reset
      vector won't work for APs, which means adding more hooks such as the
      vCPU's MP_STATE, and so on and so forth.
      
      Checking for a linear %rip match technically suffers from corner cases,
      e.g. userspace could theoretically rewrite the underlying code page and
      expect a different instruction to execute, or the guest hardcodes a PIO
      reset at 0xfffffff0, but those are far, far outside of what can be
      considered normal operation.
      
      Fixes: 432baf60 ("KVM: VMX: use kvm_fast_pio_in for handling IN I/O")
      Cc: <stable@vger.kernel.org>
      Reported-by: default avatarJim Mattson <jmattson@google.com>
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      45def77e
    • Vitaly Kuznetsov's avatar
      x86/kvm/hyper-v: avoid spurious pending stimer on vCPU init · 013cc6eb
      Vitaly Kuznetsov authored
      When userspace initializes guest vCPUs it may want to zero all supported
      MSRs including Hyper-V related ones including HV_X64_MSR_STIMERn_CONFIG/
      HV_X64_MSR_STIMERn_COUNT. With commit f3b138c5 ("kvm/x86: Update SynIC
      timers on guest entry only") we began doing stimer_mark_pending()
      unconditionally on every config change.
      
      The issue I'm observing manifests itself as following:
      - Qemu writes 0 to STIMERn_{CONFIG,COUNT} MSRs and marks all stimers as
        pending in stimer_pending_bitmap, arms KVM_REQ_HV_STIMER;
      - kvm_hv_has_stimer_pending() starts returning true;
      - kvm_vcpu_has_events() starts returning true;
      - kvm_arch_vcpu_runnable() starts returning true;
      - when kvm_arch_vcpu_ioctl_run() gets into
        (vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED) case:
        - kvm_vcpu_block() gets in 'kvm_vcpu_check_block(vcpu) < 0' and returns
          immediately, avoiding normal wait path;
        - -EAGAIN is returned from kvm_arch_vcpu_ioctl_run() immediately forcing
          userspace to retry.
      
      So instead of normal wait path we get a busy loop on all secondary vCPUs
      before they get INIT signal. This seems to be undesirable, especially given
      that this happens even when Hyper-V extensions are not used.
      
      Generally, it seems to be pointless to mark an stimer as pending in
      stimer_pending_bitmap and arm KVM_REQ_HV_STIMER as the only thing
      kvm_hv_process_stimers() will do is clear the corresponding bit. We may
      just not mark disabled timers as pending instead.
      
      Fixes: f3b138c5 ("kvm/x86: Update SynIC timers on guest entry only")
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      013cc6eb