1. 22 Oct, 2015 7 commits
    • Christoffer Dall's avatar
      arm/arm64: KVM: Rework the arch timer to use level-triggered semantics · 4b4b4512
      Christoffer Dall authored
      The arch timer currently uses edge-triggered semantics in the sense that
      the line is never sampled by the vgic and lowering the line from the
      timer to the vgic doesn't have any effect on the pending state of
      virtual interrupts in the vgic.  This means that we do not support a
      guest with the otherwise valid behavior of (1) disable interrupts (2)
      enable the timer (3) disable the timer (4) enable interrupts.  Such a
      guest would validly not expect to see any interrupts on real hardware,
      but will see interrupts on KVM.
      
      This patch fixes this shortcoming through the following series of
      changes.
      
      First, we change the flow of the timer/vgic sync/flush operations.  Now
      the timer is always flushed/synced before the vgic, because the vgic
      samples the state of the timer output.  This has the implication that we
      move the timer operations in to non-preempible sections, but that is
      fine after the previous commit getting rid of hrtimer schedules on every
      entry/exit.
      
      Second, we change the internal behavior of the timer, letting the timer
      keep track of its previous output state, and only lower/raise the line
      to the vgic when the state changes.  Note that in theory this could have
      been accomplished more simply by signalling the vgic every time the
      state *potentially* changed, but we don't want to be hitting the vgic
      more often than necessary.
      
      Third, we get rid of the use of the map->active field in the vgic and
      instead simply set the interrupt as active on the physical distributor
      whenever the input to the GIC is asserted and conversely clear the
      physical active state when the input to the GIC is deasserted.
      
      Fourth, and finally, we now initialize the timer PPIs (and all the other
      unused PPIs for now), to be level-triggered, and modify the sync code to
      sample the line state on HW sync and re-inject a new interrupt if it is
      still pending at that time.
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      4b4b4512
    • Christoffer Dall's avatar
      arm/arm64: KVM: Add forwarded physical interrupts documentation · 4cf1bc4c
      Christoffer Dall authored
      Forwarded physical interrupts on arm/arm64 is a tricky concept and the
      way we deal with them is not apparently easy to understand by reading
      various specs.
      
      Therefore, add a proper documentation file explaining the flow and
      rationale of the behavior of the vgic.
      
      Some of this text was contributed by Marc Zyngier and edited by me.
      Omissions and errors are all mine.
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      4cf1bc4c
    • Christoffer Dall's avatar
      arm/arm64: KVM: Use appropriate define in VGIC reset code · 54723bb3
      Christoffer Dall authored
      We currently initialize the SGIs to be enabled in the VGIC code, but we
      use the VGIC_NR_PPIS define for this purpose, instead of the the more
      natural VGIC_NR_SGIS.  Change this slightly confusing use of the
      defines.
      
      Note: This should have no functional change, as both names are defined
      to the number 16.
      Acked-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      54723bb3
    • Christoffer Dall's avatar
      arm/arm64: KVM: Implement GICD_ICFGR as RO for PPIs · 8bf9a701
      Christoffer Dall authored
      The GICD_ICFGR allows the bits for the SGIs and PPIs to be read only.
      We currently simulate this behavior by writing a hardcoded value to the
      register for the SGIs and PPIs on every write of these bits to the
      register (ignoring what the guest actually wrote), and by writing the
      same value as the reset value to the register.
      
      This is a bit counter-intuitive, as the register is RO for these bits,
      and we can just implement it that way, allowing us to control the value
      of the bits purely in the reset code.
      Reviewed-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      8bf9a701
    • Christoffer Dall's avatar
      arm/arm64: KVM: vgic: Factor out level irq processing on guest exit · 9103617d
      Christoffer Dall authored
      Currently vgic_process_maintenance() processes dealing with a completed
      level-triggered interrupt directly, but we are soon going to reuse this
      logic for level-triggered mapped interrupts with the HW bit set, so
      move this logic into a separate static function.
      
      Probably the most scary part of this commit is convincing yourself that
      the current flow is safe compared to the old one.  In the following I
      try to list the changes and why they are harmless:
      
        Move vgic_irq_clear_queued after kvm_notify_acked_irq:
          Harmless because the only potential effect of clearing the queued
          flag wrt.  kvm_set_irq is that vgic_update_irq_pending does not set
          the pending bit on the emulated CPU interface or in the
          pending_on_cpu bitmask if the function is called with level=1.
          However, the point of kvm_notify_acked_irq is to call kvm_set_irq
          with level=0, and we set the queued flag again in
          __kvm_vgic_sync_hwstate later on if the level is stil high.
      
        Move vgic_set_lr before kvm_notify_acked_irq:
          Also, harmless because the LR are cpu-local operations and
          kvm_notify_acked only affects the dist
      
        Move vgic_dist_irq_clear_soft_pend after kvm_notify_acked_irq:
          Also harmless, because now we check the level state in the
          clear_soft_pend function and lower the pending bits if the level is
          low.
      Reviewed-by: default avatarEric Auger <eric.auger@linaro.org>
      Reviewed-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      9103617d
    • Christoffer Dall's avatar
      arm/arm64: KVM: arch_timer: Only schedule soft timer on vcpu_block · d35268da
      Christoffer Dall authored
      We currently schedule a soft timer every time we exit the guest if the
      timer did not expire while running the guest.  This is really not
      necessary, because the only work we do in the timer work function is to
      kick the vcpu.
      
      Kicking the vcpu does two things:
      (1) If the vpcu thread is on a waitqueue, make it runnable and remove it
      from the waitqueue.
      (2) If the vcpu is running on a different physical CPU from the one
      doing the kick, it sends a reschedule IPI.
      
      The second case cannot happen, because the soft timer is only ever
      scheduled when the vcpu is not running.  The first case is only relevant
      when the vcpu thread is on a waitqueue, which is only the case when the
      vcpu thread has called kvm_vcpu_block().
      
      Therefore, we only need to make sure a timer is scheduled for
      kvm_vcpu_block(), which we do by encapsulating all calls to
      kvm_vcpu_block() with kvm_timer_{un}schedule calls.
      
      Additionally, we only schedule a soft timer if the timer is enabled and
      unmasked, since it is useless otherwise.
      
      Note that theoretically userspace can use the SET_ONE_REG interface to
      change registers that should cause the timer to fire, even if the vcpu
      is blocked without a scheduled timer, but this case was not supported
      before this patch and we leave it for future work for now.
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      d35268da
    • Christoffer Dall's avatar
      KVM: Add kvm_arch_vcpu_{un}blocking callbacks · 3217f7c2
      Christoffer Dall authored
      Some times it is useful for architecture implementations of KVM to know
      when the VCPU thread is about to block or when it comes back from
      blocking (arm/arm64 needs to know this to properly implement timers, for
      example).
      
      Therefore provide a generic architecture callback function in line with
      what we do elsewhere for KVM generic-arch interactions.
      Reviewed-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      3217f7c2
  2. 20 Oct, 2015 6 commits
    • Christoffer Dall's avatar
      arm/arm64: KVM: Fix disabled distributor operation · 0d997491
      Christoffer Dall authored
      We currently do a single update of the vgic state when the distributor
      enable/disable control register is accessed and then bypass updating the
      state for as long as the distributor remains disabled.
      
      This is incorrect, because updating the state does not consider the
      distributor enable bit, and this you can end up in a situation where an
      interrupt is marked as pending on the CPU interface, but not pending on
      the distributor, which is an impossible state to be in, and triggers a
      warning.  Consider for example the following sequence of events:
      
      1. An interrupt is marked as pending on the distributor
         - the interrupt is also forwarded to the CPU interface
      2. The guest turns off the distributor (it's about to do a reboot)
         - we stop updating the CPU interface state from now on
      3. The guest disables the pending interrupt
         - we remove the pending state from the distributor, but don't touch
           the CPU interface, see point 2.
      
      Since the distributor disable bit really means that no interrupts should
      be forwarded to the CPU interface, we modify the code to keep updating
      the internal VGIC state, but always set the CPU interface pending bits
      to zero when the distributor is disabled.
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      0d997491
    • Christoffer Dall's avatar
      arm/arm64: KVM: Clear map->active on pend/active clear · 544c572e
      Christoffer Dall authored
      When a guest reboots or offlines/onlines CPUs, it is not uncommon for it
      to clear the pending and active states of an interrupt through the
      emulated VGIC distributor.  However, since the architected timers are
      defined by the architecture to be level triggered and the guest
      rightfully expects them to be that, but we emulate them as
      edge-triggered, we have to mimic level-triggered behavior for an
      edge-triggered virtual implementation.
      
      We currently do not signal the VGIC when the map->active field is true,
      because it indicates that the guest has already been signalled of the
      interrupt as required.  Normally this field is set to false when the
      guest deactivates the virtual interrupt through the sync path.
      
      We also need to catch the case where the guest deactivates the interrupt
      through the emulated distributor, again allowing guests to boot even if
      the original virtual timer signal hit before the guest's GIC
      initialization sequence is run.
      Reviewed-by: default avatarEric Auger <eric.auger@linaro.org>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      544c572e
    • Christoffer Dall's avatar
      arm/arm64: KVM: Fix arch timer behavior for disabled interrupts · cff9211e
      Christoffer Dall authored
      We have an interesting issue when the guest disables the timer interrupt
      on the VGIC, which happens when turning VCPUs off using PSCI, for
      example.
      
      The problem is that because the guest disables the virtual interrupt at
      the VGIC level, we never inject interrupts to the guest and therefore
      never mark the interrupt as active on the physical distributor.  The
      host also never takes the timer interrupt (we only use the timer device
      to trigger a guest exit and everything else is done in software), so the
      interrupt does not become active through normal means.
      
      The result is that we keep entering the guest with a programmed timer
      that will always fire as soon as we context switch the hardware timer
      state and run the guest, preventing forward progress for the VCPU.
      
      Since the active state on the physical distributor is really part of the
      timer logic, it is the job of our virtual arch timer driver to manage
      this state.
      
      The timer->map->active boolean field indicates whether we have signalled
      this interrupt to the vgic and if that interrupt is still pending or
      active.  As long as that is the case, the hardware doesn't have to
      generate physical interrupts and therefore we mark the interrupt as
      active on the physical distributor.
      
      We also have to restore the pending state of an interrupt that was
      queued to an LR but was retired from the LR for some reason, while
      remaining pending in the LR.
      
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Reported-by: default avatarLorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      cff9211e
    • Arnd Bergmann's avatar
      KVM: arm: use GIC support unconditionally · 4a5d69b7
      Arnd Bergmann authored
      The vgic code on ARM is built for all configurations that enable KVM,
      but the parent_data field that it references is only present when
      CONFIG_IRQ_DOMAIN_HIERARCHY is set:
      
      virt/kvm/arm/vgic.c: In function 'kvm_vgic_map_phys_irq':
      virt/kvm/arm/vgic.c:1781:13: error: 'struct irq_data' has no member named 'parent_data'
      
      This flag is implied by the GIC driver, and indeed the VGIC code only
      makes sense if a GIC is present. This changes the CONFIG_KVM symbol
      to always select GIC, which avoids the issue.
      
      Fixes: 662d9715 ("arm/arm64: KVM: Kill CONFIG_KVM_ARM_{VGIC,TIMER}")
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Acked-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      4a5d69b7
    • Pavel Fedin's avatar
      KVM: arm/arm64: Fix memory leak if timer initialization fails · 399ea0f6
      Pavel Fedin authored
      Jump to correct label and free kvm_host_cpu_state
      Reviewed-by: default avatarWei Huang <wei@redhat.com>
      Signed-off-by: default avatarPavel Fedin <p.fedin@samsung.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      399ea0f6
    • Pavel Fedin's avatar
      KVM: arm/arm64: Do not inject spurious interrupts · 437f9963
      Pavel Fedin authored
      When lowering a level-triggered line from userspace, we forgot to lower
      the pending bit on the emulated CPU interface and we also did not
      re-compute the pending_on_cpu bitmap for the CPU affected by the change.
      
      Update vgic_update_irq_pending() to fix the two issues above and also
      raise a warning in vgic_quue_irq_to_lr if we encounter an interrupt
      pending on a CPU which is neither marked active nor pending.
      
        [ Commit text reworked completely - Christoffer ]
      Signed-off-by: default avatarPavel Fedin <p.fedin@samsung.com>
      Signed-off-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      437f9963
  3. 25 Sep, 2015 4 commits
    • David Hildenbrand's avatar
      KVM: disable halt_poll_ns as default for s390x · 920552b2
      David Hildenbrand authored
      We observed some performance degradation on s390x with dynamic
      halt polling. Until we can provide a proper fix, let's enable
      halt_poll_ns as default only for supported architectures.
      
      Architectures are now free to set their own halt_poll_ns
      default value.
      Signed-off-by: default avatarDavid Hildenbrand <dahi@linux.vnet.ibm.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      920552b2
    • Paolo Bonzini's avatar
      KVM: x86: fix off-by-one in reserved bits check · 58c95070
      Paolo Bonzini authored
      29ecd660 ("KVM: x86: avoid uninitialized variable warning",
      2015-09-06) introduced a not-so-subtle problem, which probably
      escaped review because it was not part of the patch context.
      
      Before the patch, leaf was always equal to iterator.level.  After,
      it is equal to iterator.level - 1 in the call to is_shadow_zero_bits_set,
      and when is_shadow_zero_bits_set does another "-1" the check on
      reserved bits becomes incorrect.  Using "iterator.level" in the call
      fixes this call trace:
      
      WARNING: CPU: 2 PID: 17000 at arch/x86/kvm/mmu.c:3385 handle_mmio_page_fault.part.93+0x1a/0x20 [kvm]()
      Modules linked in: tun sha256_ssse3 sha256_generic drbg binfmt_misc ipv6 vfat fat fuse dm_crypt dm_mod kvm_amd kvm crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd fam15h_power amd64_edac_mod k10temp edac_core amdkfd amd_iommu_v2 radeon acpi_cpufreq
      [...]
      Call Trace:
        dump_stack+0x4e/0x84
        warn_slowpath_common+0x95/0xe0
        warn_slowpath_null+0x1a/0x20
        handle_mmio_page_fault.part.93+0x1a/0x20 [kvm]
        tdp_page_fault+0x231/0x290 [kvm]
        ? emulator_pio_in_out+0x6e/0xf0 [kvm]
        kvm_mmu_page_fault+0x36/0x240 [kvm]
        ? svm_set_cr0+0x95/0xc0 [kvm_amd]
        pf_interception+0xde/0x1d0 [kvm_amd]
        handle_exit+0x181/0xa70 [kvm_amd]
        ? kvm_arch_vcpu_ioctl_run+0x68b/0x1730 [kvm]
        kvm_arch_vcpu_ioctl_run+0x6f6/0x1730 [kvm]
        ? kvm_arch_vcpu_ioctl_run+0x68b/0x1730 [kvm]
        ? preempt_count_sub+0x9b/0xf0
        ? mutex_lock_killable_nested+0x26f/0x490
        ? preempt_count_sub+0x9b/0xf0
        kvm_vcpu_ioctl+0x358/0x710 [kvm]
        ? __fget+0x5/0x210
        ? __fget+0x101/0x210
        do_vfs_ioctl+0x2f4/0x560
        ? __fget_light+0x29/0x90
        SyS_ioctl+0x4c/0x90
        entry_SYSCALL_64_fastpath+0x16/0x73
      ---[ end trace 37901c8686d84de6 ]---
      Reported-by: default avatarBorislav Petkov <bp@alien8.de>
      Tested-by: default avatarBorislav Petkov <bp@alien8.de>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      58c95070
    • Paolo Bonzini's avatar
      KVM: x86: use correct page table format to check nested page table reserved bits · 6fec2144
      Paolo Bonzini authored
      Intel CPUID on AMD host or vice versa is a weird case, but it can
      happen.  Handle it by checking the host CPU vendor instead of the
      guest's in reset_tdp_shadow_zero_bits_mask.  For speed, the
      check uses the fact that Intel EPT has an X (executable) bit while
      AMD NPT has NX.
      Reported-by: default avatarBorislav Petkov <bp@alien8.de>
      Tested-by: default avatarBorislav Petkov <bp@alien8.de>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      6fec2144
    • Paolo Bonzini's avatar
      KVM: svm: do not call kvm_set_cr0 from init_vmcb · 79a8059d
      Paolo Bonzini authored
      kvm_set_cr0 may want to call kvm_zap_gfn_range and thus access the
      memslots array (SRCU protected).  Using a mini SRCU critical section
      is ugly, and adding it to kvm_arch_vcpu_create doesn't work because
      the VMX vcpu_create callback calls synchronize_srcu.
      
      Fixes this lockdep splat:
      
      ===============================
      [ INFO: suspicious RCU usage. ]
      4.3.0-rc1+ #1 Not tainted
      -------------------------------
      include/linux/kvm_host.h:488 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      rcu_scheduler_active = 1, debug_locks = 0
      1 lock held by qemu-system-i38/17000:
       #0:  (&(&kvm->mmu_lock)->rlock){+.+...}, at: kvm_zap_gfn_range+0x24/0x1a0 [kvm]
      
      [...]
      Call Trace:
       dump_stack+0x4e/0x84
       lockdep_rcu_suspicious+0xfd/0x130
       kvm_zap_gfn_range+0x188/0x1a0 [kvm]
       kvm_set_cr0+0xde/0x1e0 [kvm]
       init_vmcb+0x760/0xad0 [kvm_amd]
       svm_create_vcpu+0x197/0x250 [kvm_amd]
       kvm_arch_vcpu_create+0x47/0x70 [kvm]
       kvm_vm_ioctl+0x302/0x7e0 [kvm]
       ? __lock_is_held+0x51/0x70
       ? __fget+0x101/0x210
       do_vfs_ioctl+0x2f4/0x560
       ? __fget_light+0x29/0x90
       SyS_ioctl+0x4c/0x90
       entry_SYSCALL_64_fastpath+0x16/0x73
      Reported-by: default avatarBorislav Petkov <bp@alien8.de>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      79a8059d
  4. 22 Sep, 2015 1 commit
  5. 21 Sep, 2015 1 commit
  6. 20 Sep, 2015 3 commits
    • Thomas Huth's avatar
      KVM: PPC: Book3S: Take the kvm->srcu lock in kvmppc_h_logical_ci_load/store() · 3eb4ee68
      Thomas Huth authored
      Access to the kvm->buses (like with the kvm_io_bus_read() and -write()
      functions) has to be protected via the kvm->srcu lock.
      The kvmppc_h_logical_ci_load() and -store() functions are missing
      this lock so far, so let's add it there, too.
      This fixes the problem that the kernel reports "suspicious RCU usage"
      when lock debugging is enabled.
      
      Cc: stable@vger.kernel.org # v4.1+
      Fixes: 99342cf8Signed-off-by: default avatarThomas Huth <thuth@redhat.com>
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      3eb4ee68
    • Gautham R. Shenoy's avatar
      KVM: PPC: Book3S HV: Pass the correct trap argument to kvmhv_commence_exit · 7e022e71
      Gautham R. Shenoy authored
      In guest_exit_cont we call kvmhv_commence_exit which expects the trap
      number as the argument. However r3 doesn't contain the trap number at
      this point and as a result we would be calling the function with a
      spurious trap number.
      
      Fix this by copying r12 into r3 before calling kvmhv_commence_exit as
      r12 contains the trap number.
      
      Cc: stable@vger.kernel.org # v4.1+
      Fixes: eddb60fbSigned-off-by: default avatarGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      7e022e71
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Fix handling of interrupted VCPUs · 5fc3e64f
      Paul Mackerras authored
      This fixes a bug which results in stale vcore pointers being left in
      the per-cpu preempted vcore lists when a VM is destroyed.  The result
      of the stale vcore pointers is usually either a crash or a lockup
      inside collect_piggybacks() when another VM is run.  A typical
      lockup message looks like:
      
      [  472.161074] NMI watchdog: BUG: soft lockup - CPU#24 stuck for 22s! [qemu-system-ppc:7039]
      [  472.161204] Modules linked in: kvm_hv kvm_pr kvm xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ses enclosure shpchp rtc_opal i2c_opal powernv_rng binfmt_misc dm_service_time scsi_dh_alua radeon i2c_algo_bit drm_kms_helper ttm drm tg3 ptp pps_core cxgb3 ipr i2c_core mdio dm_multipath [last unloaded: kvm_hv]
      [  472.162111] CPU: 24 PID: 7039 Comm: qemu-system-ppc Not tainted 4.2.0-kvm+ #49
      [  472.162187] task: c000001e38512750 ti: c000001e41bfc000 task.ti: c000001e41bfc000
      [  472.162262] NIP: c00000000096b094 LR: c00000000096b08c CTR: c000000000111130
      [  472.162337] REGS: c000001e41bff520 TRAP: 0901   Not tainted  (4.2.0-kvm+)
      [  472.162399] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24848844  XER: 00000000
      [  472.162588] CFAR: c00000000096b0ac SOFTE: 1
      GPR00: c000000000111170 c000001e41bff7a0 c00000000127df00 0000000000000001
      GPR04: 0000000000000003 0000000000000001 0000000000000000 0000000000874821
      GPR08: c000001e41bff8e0 0000000000000001 0000000000000000 d00000000efde740
      GPR12: c000000000111130 c00000000fdae400
      [  472.163053] NIP [c00000000096b094] _raw_spin_lock_irqsave+0xa4/0x130
      [  472.163117] LR [c00000000096b08c] _raw_spin_lock_irqsave+0x9c/0x130
      [  472.163179] Call Trace:
      [  472.163206] [c000001e41bff7a0] [c000001e41bff7f0] 0xc000001e41bff7f0 (unreliable)
      [  472.163295] [c000001e41bff7e0] [c000000000111170] __wake_up+0x40/0x90
      [  472.163375] [c000001e41bff830] [d00000000efd6fc0] kvmppc_run_core+0x1240/0x1950 [kvm_hv]
      [  472.163465] [c000001e41bffa30] [d00000000efd8510] kvmppc_vcpu_run_hv+0x5a0/0xd90 [kvm_hv]
      [  472.163559] [c000001e41bffb70] [d00000000e9318a4] kvmppc_vcpu_run+0x44/0x60 [kvm]
      [  472.163653] [c000001e41bffba0] [d00000000e92e674] kvm_arch_vcpu_ioctl_run+0x64/0x170 [kvm]
      [  472.163745] [c000001e41bffbe0] [d00000000e9263a8] kvm_vcpu_ioctl+0x538/0x7b0 [kvm]
      [  472.163834] [c000001e41bffd40] [c0000000002d0f50] do_vfs_ioctl+0x480/0x7c0
      [  472.163910] [c000001e41bffde0] [c0000000002d1364] SyS_ioctl+0xd4/0xf0
      [  472.163986] [c000001e41bffe30] [c000000000009260] system_call+0x38/0xd0
      [  472.164060] Instruction dump:
      [  472.164098] ebc1fff0 ebe1fff8 7c0803a6 4e800020 60000000 60000000 60420000 8bad02e2
      [  472.164224] 7fc3f378 4b6a57c1 60000000 7c210b78 <e92d0000> 89290009 792affe3 40820070
      
      The bug is that kvmppc_run_vcpu does not correctly handle the case
      where a vcpu task receives a signal while its guest vcpu is executing
      in the guest as a result of being piggy-backed onto the execution of
      another vcore.  In that case we need to wait for the vcpu to finish
      executing inside the guest, and then remove this vcore from the
      preempted vcores list.  That way, we avoid leaving this vcpu's vcore
      on the preempted vcores list when the vcpu gets interrupted.
      
      Fixes: ec257165Reported-by: default avatarThomas Huth <thuth@redhat.com>
      Tested-by: default avatarThomas Huth <thuth@redhat.com>
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      5fc3e64f
  7. 18 Sep, 2015 2 commits
    • Igor Mammedov's avatar
      kvm: svm: reset mmu on VCPU reset · ebae871a
      Igor Mammedov authored
      When INIT/SIPI sequence is sent to VCPU which before that
      was in use by OS, VMRUN might fail with:
      
       KVM: entry failed, hardware error 0xffffffff
       EAX=00000000 EBX=00000000 ECX=00000000 EDX=000006d3
       ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
       EIP=00000000 EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
       ES =0000 00000000 0000ffff 00009300
       CS =9a00 0009a000 0000ffff 00009a00
       [...]
       CR0=60000010 CR2=b6f3e000 CR3=01942000 CR4=000007e0
       [...]
       EFER=0000000000000000
      
      with corresponding SVM error:
       KVM: FAILED VMRUN WITH VMCB:
       [...]
       cpl:            0                efer:         0000000000001000
       cr0:            0000000080010010 cr2:          00007fd7fe85bf90
       cr3:            0000000187d0c000 cr4:          0000000000000020
       [...]
      
      What happens is that VCPU state right after offlinig:
      CR0: 0x80050033  EFER: 0xd01  CR4: 0x7e0
        -> long mode with CR3 pointing to longmode page tables
      
      and when VCPU gets INIT/SIPI following transition happens
      CR0: 0 -> 0x60000010 EFER: 0x0  CR4: 0x7e0
        -> paging disabled with stale CR3
      
      However SVM under the hood puts VCPU in Paged Real Mode*
      which effectively translates CR0 0x60000010 -> 80010010 after
      
         svm_vcpu_reset()
             -> init_vmcb()
                 -> kvm_set_cr0()
                     -> svm_set_cr0()
      
      but from  kvm_set_cr0() perspective CR0: 0 -> 0x60000010
      only caching bits are changed and
      commit d81135a5
       ("KVM: x86: do not reset mmu if CR0.CD and CR0.NW are changed")'
      regressed svm_vcpu_reset() which relied on MMU being reset.
      
      As result VMRUN after svm_vcpu_reset() tries to run
      VCPU in Paged Real Mode with stale MMU context (longmode page tables),
      which causes some AMD CPUs** to bail out with VMEXIT_INVALID.
      
      Fix issue by unconditionally resetting MMU context
      at init_vmcb() time.
      
      	* AMD64 Architecture Programmerâ€s Manual,
      	    Volume 2: System Programming, rev: 3.25
      	      15.19 Paged Real Mode
      	** Opteron 1216
      Signed-off-by: default avatarIgor Mammedov <imammedo@redhat.com>
      Fixes: d81135a5
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      ebae871a
    • Dominik Dingel's avatar
      sched: access local runqueue directly in single_task_running · 00cc1633
      Dominik Dingel authored
      Commit 2ee507c4 ("sched: Add function single_task_running to let a task
      check if it is the only task running on a cpu") referenced the current
      runqueue with the smp_processor_id.  When CONFIG_DEBUG_PREEMPT is enabled,
      that is only allowed if preemption is disabled or the currrent task is
      bound to the local cpu (e.g. kernel worker).
      
      With commit f7819512 ("kvm: add halt_poll_ns module parameter") KVM
      calls single_task_running. If CONFIG_DEBUG_PREEMPT is enabled that
      generates a lot of kernel messages.
      
      To avoid adding preemption in that cases, as it would limit the usefulness,
      we change single_task_running to access directly the cpu local runqueue.
      
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Suggested-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Acked-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Fixes: 2ee507c4Signed-off-by: default avatarDominik Dingel <dingel@linux.vnet.ibm.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      00cc1633
  8. 17 Sep, 2015 5 commits
    • Paolo Bonzini's avatar
      Merge tag 'kvm-arm-for-4.3-rc2-2' of... · efe4d36a
      Paolo Bonzini authored
      Merge tag 'kvm-arm-for-4.3-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master
      
      Second set of KVM/ARM changes for 4.3-rc2
      
      - Workaround for a Cortex-A57 erratum
      - Bug fix for the debugging infrastructure
      - Fix for 32bit guests with more than 4GB of address space
        on a 32bit host
      - A number of fixes for the (unusual) case when we don't use
        the in-kernel GIC emulation
      - Removal of ThumbEE handling on arm64, since these have been
        dropped from the architecture before anyone actually ever
        built a CPU
      - Remove the KVM_ARM_MAX_VCPUS limitation which has become
        fairly pointless
      efe4d36a
    • Ming Lei's avatar
      arm/arm64: KVM: Remove 'config KVM_ARM_MAX_VCPUS' · ef748917
      Ming Lei authored
      This patch removes config option of KVM_ARM_MAX_VCPUS,
      and like other ARCHs, just choose the maximum allowed
      value from hardware, and follows the reasons:
      
      1) from distribution view, the option has to be
      defined as the max allowed value because it need to
      meet all kinds of virtulization applications and
      need to support most of SoCs;
      
      2) using a bigger value doesn't introduce extra memory
      consumption, and the help text in Kconfig isn't accurate
      because kvm_vpu structure isn't allocated until request
      of creating VCPU is sent from QEMU;
      
      3) the main effect is that the field of vcpus[] in 'struct kvm'
      becomes a bit bigger(sizeof(void *) per vcpu) and need more cache
      lines to hold the structure, but 'struct kvm' is one generic struct,
      and it has worked well on other ARCHs already in this way. Also,
      the world switch frequecy is often low, for example, it is ~2000
      when running kernel building load in VM from APM xgene KVM host,
      so the effect is very small, and the difference can't be observed
      in my test at all.
      
      Cc: Dann Frazier <dann.frazier@canonical.com>
      Signed-off-by: default avatarMing Lei <ming.lei@canonical.com>
      Reviewed-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      ef748917
    • Will Deacon's avatar
      arm64: KVM: Remove all traces of the ThumbEE registers · 34c3faa3
      Will Deacon authored
      Although the ThumbEE registers and traps were present in earlier
      versions of the v8 architecture, it was retrospectively removed and so
      we can do the same.
      
      Whilst this breaks migrating a guest started on a previous version of
      the kernel, it is much better to kill these (non existent) registers
      as soon as possible.
      Reviewed-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      [maz: added commend about migration]
      Signed-off-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      34c3faa3
    • Marc Zyngier's avatar
      arm: KVM: Disable virtual timer even if the guest is not using it · 688bc577
      Marc Zyngier authored
      When running a guest with the architected timer disabled (with QEMU and
      the kernel_irqchip=off option, for example), it is important to make
      sure the timer gets turned off. Otherwise, the guest may try to
      enable it anyway, leading to a screaming HW interrupt.
      
      The fix is to unconditionally turn off the virtual timer on guest
      exit.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      688bc577
    • Marc Zyngier's avatar
      arm64: KVM: Disable virtual timer even if the guest is not using it · c4cbba9f
      Marc Zyngier authored
      When running a guest with the architected timer disabled (with QEMU and
      the kernel_irqchip=off option, for example), it is important to make
      sure the timer gets turned off. Otherwise, the guest may try to
      enable it anyway, leading to a screaming HW interrupt.
      
      The fix is to unconditionally turn off the virtual timer on guest
      exit.
      
      Cc: stable@vger.kernel.org
      Reviewed-by: default avatarChristoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      c4cbba9f
  9. 16 Sep, 2015 6 commits
  10. 15 Sep, 2015 4 commits
    • Jason Wang's avatar
      kvm: fix zero length mmio searching · 8f4216c7
      Jason Wang authored
      Currently, if we had a zero length mmio eventfd assigned on
      KVM_MMIO_BUS. It will never be found by kvm_io_bus_cmp() since it
      always compares the kvm_io_range() with the length that guest
      wrote. This will cause e.g for vhost, kick will be trapped by qemu
      userspace instead of vhost. Fixing this by using zero length if an
      iodevice is zero length.
      
      Cc: stable@vger.kernel.org
      Cc: Gleb Natapov <gleb@kernel.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarJason Wang <jasowang@redhat.com>
      Reviewed-by: default avatarCornelia Huck <cornelia.huck@de.ibm.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      8f4216c7
    • Jason Wang's avatar
      kvm: fix double free for fast mmio eventfd · eefd6b06
      Jason Wang authored
      We register wildcard mmio eventfd on two buses, once for KVM_MMIO_BUS
      and once on KVM_FAST_MMIO_BUS but with a single iodev
      instance. This will lead to an issue: kvm_io_bus_destroy() knows
      nothing about the devices on two buses pointing to a single dev. Which
      will lead to double free[1] during exit. Fix this by allocating two
      instances of iodevs then registering one on KVM_MMIO_BUS and another
      on KVM_FAST_MMIO_BUS.
      
      CPU: 1 PID: 2894 Comm: qemu-system-x86 Not tainted 3.19.0-26-generic #28-Ubuntu
      Hardware name: LENOVO 2356BG6/2356BG6, BIOS G7ET96WW (2.56 ) 09/12/2013
      task: ffff88009ae0c4b0 ti: ffff88020e7f0000 task.ti: ffff88020e7f0000
      RIP: 0010:[<ffffffffc07e25d8>]  [<ffffffffc07e25d8>] ioeventfd_release+0x28/0x60 [kvm]
      RSP: 0018:ffff88020e7f3bc8  EFLAGS: 00010292
      RAX: dead000000200200 RBX: ffff8801ec19c900 RCX: 000000018200016d
      RDX: ffff8801ec19cf80 RSI: ffffea0008bf1d40 RDI: ffff8801ec19c900
      RBP: ffff88020e7f3bd8 R08: 000000002fc75a01 R09: 000000018200016d
      R10: ffffffffc07df6ae R11: ffff88022fc75a98 R12: ffff88021e7cc000
      R13: ffff88021e7cca48 R14: ffff88021e7cca50 R15: ffff8801ec19c880
      FS:  00007fc1ee3e6700(0000) GS:ffff88023e240000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f8f389d8000 CR3: 000000023dc13000 CR4: 00000000001427e0
      Stack:
      ffff88021e7cc000 0000000000000000 ffff88020e7f3be8 ffffffffc07e2622
      ffff88020e7f3c38 ffffffffc07df69a ffff880232524160 ffff88020e792d80
       0000000000000000 ffff880219b78c00 0000000000000008 ffff8802321686a8
      Call Trace:
      [<ffffffffc07e2622>] ioeventfd_destructor+0x12/0x20 [kvm]
      [<ffffffffc07df69a>] kvm_put_kvm+0xca/0x210 [kvm]
      [<ffffffffc07df818>] kvm_vcpu_release+0x18/0x20 [kvm]
      [<ffffffff811f69f7>] __fput+0xe7/0x250
      [<ffffffff811f6bae>] ____fput+0xe/0x10
      [<ffffffff81093f04>] task_work_run+0xd4/0xf0
      [<ffffffff81079358>] do_exit+0x368/0xa50
      [<ffffffff81082c8f>] ? recalc_sigpending+0x1f/0x60
      [<ffffffff81079ad5>] do_group_exit+0x45/0xb0
      [<ffffffff81085c71>] get_signal+0x291/0x750
      [<ffffffff810144d8>] do_signal+0x28/0xab0
      [<ffffffff810f3a3b>] ? do_futex+0xdb/0x5d0
      [<ffffffff810b7028>] ? __wake_up_locked_key+0x18/0x20
      [<ffffffff810f3fa6>] ? SyS_futex+0x76/0x170
      [<ffffffff81014fc9>] do_notify_resume+0x69/0xb0
      [<ffffffff817cb9af>] int_signal+0x12/0x17
      Code: 5d c3 90 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 8b 7f 20 e8 06 d6 a5 c0 48 8b 43 08 48 8b 13 48 89 df 48 89 42 08 <48> 89 10 48 b8 00 01 10 00 00
       RIP  [<ffffffffc07e25d8>] ioeventfd_release+0x28/0x60 [kvm]
       RSP <ffff88020e7f3bc8>
      
      Cc: stable@vger.kernel.org
      Cc: Gleb Natapov <gleb@kernel.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarJason Wang <jasowang@redhat.com>
      Reviewed-by: default avatarCornelia Huck <cornelia.huck@de.ibm.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      eefd6b06
    • Jason Wang's avatar
      kvm: factor out core eventfd assign/deassign logic · 85da11ca
      Jason Wang authored
      This patch factors out core eventfd assign/deassign logic and leaves
      the argument checking and bus index selection to callers.
      
      Cc: stable@vger.kernel.org
      Cc: Gleb Natapov <gleb@kernel.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarJason Wang <jasowang@redhat.com>
      Reviewed-by: default avatarCornelia Huck <cornelia.huck@de.ibm.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      85da11ca
    • Jason Wang's avatar
      kvm: don't try to register to KVM_FAST_MMIO_BUS for non mmio eventfd · 8453fecb
      Jason Wang authored
      We only want zero length mmio eventfd to be registered on
      KVM_FAST_MMIO_BUS. So check this explicitly when arg->len is zero to
      make sure this.
      
      Cc: stable@vger.kernel.org
      Cc: Gleb Natapov <gleb@kernel.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarJason Wang <jasowang@redhat.com>
      Reviewed-by: default avatarCornelia Huck <cornelia.huck@de.ibm.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      8453fecb
  11. 14 Sep, 2015 1 commit