Commits · 3cfc3092f40bc37c57ba556cfd8de4218f2135ab · Kirill Smelkov / linux

03 Dec, 2009 40 commits

KVM: x86: Add KVM_GET/SET_VCPU_EVENTS · 3cfc3092

Jan Kiszka authored Nov 12, 2009

This new IOCTL exports all yet user-invisible states related to
exceptions, interrupts, and NMIs. Together with appropriate user space
changes, this fixes sporadic problems of vmsave/restore, live migration
and system reset.

[avi: future-proof abi by adding a flags field]
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

3cfc3092

KVM: VMX: Report unexpected simultaneous exceptions as internal errors · 65ac7264

Avi Kivity authored Nov 04, 2009

These happen when we trap an exception when another exception is being
delivered; we only expect these with MCEs and page faults.  If something
unexpected happens, things probably went south and we're better off reporting
an internal error and freezing.
Signed-off-by: Avi Kivity <avi@redhat.com>

65ac7264

KVM: Allow internal errors reported to userspace to carry extra data · a9c7399d

Avi Kivity authored Nov 04, 2009

Usually userspace will freeze the guest so we can inspect it, but some
internal state is not available.  Add extra data to internal error
reporting so we can expose it to the debugger.  Extra data is specific
to the suberror.
Signed-off-by: Avi Kivity <avi@redhat.com>

a9c7399d

KVM: Reorder IOCTLs in main kvm.h · c54d2aba

Jan Kiszka authored Nov 02, 2009

Obviously, people tend to extend this header at the bottom - more or
less blindly. Ensure that deprecated stuff gets its own corner again by
moving things to the top. Also add some comments and reindent IOCTLs to
make them more readable and reduce the risk of number collisions.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

c54d2aba

KVM: x86: Polish exception injection via KVM_SET_GUEST_DEBUG · 4f926bf2

Jan Kiszka authored Oct 30, 2009

Decouple KVM_GUESTDBG_INJECT_DB and KVM_GUESTDBG_INJECT_BP from
KVM_GUESTDBG_ENABLE, their are actually orthogonal. At this chance,
avoid triggering the WARN_ON in kvm_queue_exception if there is already
an exception pending and reject such invalid requests.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

4f926bf2

KVM: only clear irq_source_id if irqchip is present · e50212bb

Marcelo Tosatti authored Oct 29, 2009

Otherwise kvm might attempt to dereference a NULL pointer.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

e50212bb

KVM: x86: disallow KVM_{SET,GET}_LAPIC without allocated in-kernel lapic · 2204ae3c

Marcelo Tosatti authored Oct 29, 2009

Otherwise kvm might attempt to dereference a NULL pointer.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

2204ae3c

KVM: x86: disallow multiple KVM_CREATE_IRQCHIP · 3ddea128

Marcelo Tosatti authored Oct 29, 2009

Otherwise kvm will leak memory on multiple KVM_CREATE_IRQCHIP.
Also serialize multiple accesses with kvm->lock.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

3ddea128

KVM: VMX: Remove vmx->msr_offset_efer · 92c0d900

Avi Kivity authored Oct 29, 2009

This variable is used to communicate between a caller and a callee; switch
to a function argument instead.
Signed-off-by: Avi Kivity <avi@redhat.com>

92c0d900

KVM: MMU: update invlpg handler comment · 5f5c35aa

Marcelo Tosatti authored Oct 26, 2009

Large page translations are always synchronized (either in level 3
or level 2), so its not necessary to properly deal with them
in the invlpg handler.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

5f5c35aa

KVM: VMX: move CR3/PDPTR update to vmx_set_cr3 · 7c93be44

Marcelo Tosatti authored Oct 26, 2009

GUEST_CR3 is updated via kvm_set_cr3 whenever CR3 is modified from
outside guest context. Similarly pdptrs are updated via load_pdptrs.

Let kvm_set_cr3 perform the update, removing it from the vcpu_run
fast path.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Acked-by: Acked-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

7c93be44

KVM: remove duplicated task_switch check · 1655e3a3

Gleb Natapov authored Oct 25, 2009

Probably introduced by a bad merge.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

1655e3a3

KVM: powerpc: Fix BUILD_BUG_ON condition · c0a187e1

Hollis Blanchard authored Oct 23, 2009

The old BUILD_BUG_ON implementation didn't work with __builtin_constant_p().
Fixing that revealed this test had been inverted for a long time without
anybody noticing...
Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

c0a187e1

KVM: VMX: Use shared msr infrastructure · 26bb0981

Avi Kivity authored Sep 07, 2009

Instead of reloading syscall MSRs on every preemption, use the new shared
msr infrastructure to reload them at the last possible minute (just before
exit to userspace).

Improves vcpu/idle/vcpu switches by about 2000 cycles (when EFER needs to be
reloaded as well).

[jan: fix slot index missing indirection]
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

26bb0981

KVM: x86 shared msr infrastructure · 18863bdd

Avi Kivity authored Sep 07, 2009

The various syscall-related MSRs are fairly expensive to switch.  Currently
we switch them on every vcpu preemption, which is far too often:

- if we're switching to a kernel thread (idle task, threaded interrupt,
  kernel-mode virtio server (vhost-net), for example) and back, then
  there's no need to switch those MSRs since kernel threasd won't
  be exiting to userspace.

- if we're switching to another guest running an identical OS, most likely
  those MSRs will have the same value, so there's little point in reloading
  them.

- if we're running the same OS on the guest and host, the MSRs will have
  identical values and reloading is unnecessary.

This patch uses the new user return notifiers to implement last-minute
switching, and checks the msr values to avoid unnecessary reloading.
Signed-off-by: Avi Kivity <avi@redhat.com>

18863bdd

KVM: VMX: Move MSR_KERNEL_GS_BASE out of the vmx autoload msr area · 44ea2b17

Avi Kivity authored Sep 06, 2009

Currently MSR_KERNEL_GS_BASE is saved and restored as part of the
guest/host msr reloading.  Since we wish to lazy-restore all the other
msrs, save and reload MSR_KERNEL_GS_BASE explicitly instead of using
the common code.
Signed-off-by: Avi Kivity <avi@redhat.com>

44ea2b17

KVM: SVM: init_vmcb(): remove redundant save->cr0 initialization · 3ce672d4

Eduardo Habkost authored Oct 24, 2009

The svm_set_cr0() call will initialize save->cr0 properly even when npt is
enabled, clearing the NW and CD bits as expected, so we don't need to
initialize it manually for npt_enabled anymore.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

3ce672d4

KVM: SVM: Reset cr0 properly on vcpu reset · 18fa000a

Eduardo Habkost authored Oct 24, 2009

svm_vcpu_reset() was not properly resetting the contents of the guest-visible
cr0 register, causing the following issue:
https://bugzilla.redhat.com/show_bug.cgi?id=525699

Without resetting cr0 properly, the vcpu was running the SIPI bootstrap routine
with paging enabled, making the vcpu get a pagefault exception while trying to
run it.

Instead of setting vmcb->save.cr0 directly, the new code just resets
kvm->arch.cr0 and calls kvm_set_cr0(). The bits that were set/cleared on
vmcb->save.cr0 (PG, WP, !CD, !NW) will be set properly by svm_set_cr0().

kvm_set_cr0() is used instead of calling svm_set_cr0() directly to make sure
kvm_mmu_reset_context() is called to reset the mmu to nonpaging mode.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

18fa000a

KVM: VMX: Use macros instead of hex value on cr0 initialization · fa40052c

Eduardo Habkost authored Oct 24, 2009

This should have no effect, it is just to make the code clearer.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

fa40052c

KVM: Enable 32bit dirty log pointers on 64bit host · 6ff5894c

Arnd Bergmann authored Oct 22, 2009

With big endian userspace, we can't quite figure out if a pointer
is 32 bit (shifted >> 32) or 64 bit when we read a 64 bit pointer.

This is what happens with dirty logging. To get the pointer interpreted
correctly, we thus need Arnd's patch to implement a compat layer for
the ioctl:

A better way to do this is to add a separate compat_ioctl() method that
converts this for you.

Based on initial patch from Arnd Bergmann.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

6ff5894c

KVM: allow userspace to adjust kvmclock offset · afbcf7ab

Glauber Costa authored Oct 16, 2009

When we migrate a kvm guest that uses pvclock between two hosts, we may
suffer a large skew. This is because there can be significant differences
between the monotonic clock of the hosts involved. When a new host with
a much larger monotonic time starts running the guest, the view of time
will be significantly impacted.

Situation is much worse when we do the opposite, and migrate to a host with
a smaller monotonic clock.

This proposed ioctl will allow userspace to inform us what is the monotonic
clock value in the source host, so we can keep the time skew short, and
more importantly, never goes backwards. Userspace may also need to trigger
the current data, since from the first migration onwards, it won't be
reflected by a simple call to clock_gettime() anymore.

[marcelo: future-proof abi with a flags field]
[jan: fix KVM_GET_CLOCK by clearing flags field instead of checking it]
Signed-off-by: Glauber Costa <glommer@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

afbcf7ab

KVM: fix irq_source_id size verification · cd5a2685

Marcelo Tosatti authored Oct 17, 2009

find_first_zero_bit works with bit numbers, not bytes.

Fixes

https://sourceforge.net/tracker/?func=detail&aid=2847560&group_id=180599&atid=893831Reported-by: "Xu, Jiajun" <jiajun.xu@intel.com>
Cc: stable@kernel.org
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

cd5a2685

KVM: SVM: Cleanup NMI singlestep · 6be7d306

Jan Kiszka authored Oct 18, 2009

Push the NMI-related singlestep variable into vcpu_svm. It's dealing
with an AMD-specific deficit, nothing generic for x86.
Acked-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>

 arch/x86/include/asm/kvm_host.h |    1 -
 arch/x86/kvm/svm.c              |   12 +++++++-----
 2 files changed, 7 insertions(+), 6 deletions(-)
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

6be7d306

KVM: x86: Fix guest single-stepping while interruptible · 94fe45da

Jan Kiszka authored Oct 18, 2009

Commit 705c5323 opened the doors of hell by unconditionally injecting
single-step flags as long as guest_debug signaled this. This doesn't
work when the guest branches into some interrupt or exception handler
and triggers a vmexit with flag reloading.

Fix it by saving cs:rip when user space requests single-stepping and
restricting the trace flag injection to this guest code position.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

94fe45da

KVM: Xen PV-on-HVM guest support · ffde22ac

Ed Swierk authored Oct 15, 2009

Support for Xen PV-on-HVM guests can be implemented almost entirely in
userspace, except for handling one annoying MSR that maps a Xen
hypercall blob into guest address space.

A generic mechanism to delegate MSR writes to userspace seems overkill
and risks encouraging similar MSR abuse in the future.  Thus this patch
adds special support for the Xen HVM MSR.

I implemented a new ioctl, KVM_XEN_HVM_CONFIG, that lets userspace tell
KVM which MSR the guest will write to, as well as the starting address
and size of the hypercall blobs (one each for 32-bit and 64-bit) that
userspace has loaded from files.  When the guest writes to the MSR, KVM
copies one page of the blob from userspace to the guest.

I've tested this patch with a hacked-up version of Gerd's userspace
code, booting a number of guests (CentOS 5.3 i386 and x86_64, and
FreeBSD 8.0-RC1 amd64) and exercising PV network and block devices.

[jan: fix i386 build warning]
[avi: future proof abi with a flags field]
Signed-off-by: Ed Swierk <eswierk@aristanetworks.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

ffde22ac

KVM: x86: Drop unneeded CONFIG_HAS_IOMEM check · 94c30d9c

Jan Kiszka authored Oct 12, 2009

This (broken) check dates back to the days when this code was shared
across architectures. x86 has IOMEM, so drop it.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

94c30d9c

KVM: VMX: fix handle_pause declaration · 9fb41ba8
Marcelo Tosatti authored Oct 12, 2009
```
There's no kvm_run argument anymore.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
```
9fb41ba8

KVM: x86: Harden against cpufreq · 6b7d7e76

Zachary Amsden authored Oct 09, 2009

If cpufreq can't determine the CPU khz, or cpufreq is not compiled in,
we should fallback to the measured TSC khz.
Signed-off-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

6b7d7e76

KVM: SVM: Support Pause Filter in AMD processors · 565d0998

Mark Langsdorf authored Oct 06, 2009

New AMD processors (Family 0x10 models 8+) support the Pause
Filter Feature.  This feature creates a new field in the VMCB
called Pause Filter Count.  If Pause Filter Count is greater
than 0 and intercepting PAUSEs is enabled, the processor will
increment an internal counter when a PAUSE instruction occurs
instead of intercepting.  When the internal counter reaches the
Pause Filter Count value, a PAUSE intercept will occur.

This feature can be used to detect contended spinlocks,
especially when the lock holding VCPU is not scheduled.
Rescheduling another VCPU prevents the VCPU seeking the
lock from wasting its quantum by spinning idly.

Experimental results show that most spinlocks are held
for less than 1000 PAUSE cycles or more than a few
thousand.  Default the Pause Filter Counter to 3000 to
detect the contended spinlocks.

Processor support for this feature is indicated by a CPUID
bit.

On a 24 core system running 4 guests each with 16 VCPUs,
this patch improved overall performance of each guest's
32 job kernbench by approximately 3-5% when combined
with a scheduler algorithm thati caused the VCPU to
sleep for a brief period. Further performance improvement
may be possible with a more sophisticated yield algorithm.
Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

565d0998

KVM: VMX: Add support for Pause-Loop Exiting · 4b8d54f9

Zhai, Edwin authored Oct 09, 2009

New NHM processors will support Pause-Loop Exiting by adding 2 VM-execution
control fields:
PLE_Gap    - upper bound on the amount of time between two successive
             executions of PAUSE in a loop.
PLE_Window - upper bound on the amount of time a guest is allowed to execute in
             a PAUSE loop

If the time, between this execution of PAUSE and previous one, exceeds the
PLE_Gap, processor consider this PAUSE belongs to a new loop.
Otherwise, processor determins the the total execution time of this loop(since
1st PAUSE in this loop), and triggers a VM exit if total time exceeds the
PLE_Window.
* Refer SDM volume 3b section 21.6.13 & 22.1.3.

Pause-Loop Exiting can be used to detect Lock-Holder Preemption, where one VP
is sched-out after hold a spinlock, then other VPs for same lock are sched-in
to waste the CPU time.

Our tests indicate that most spinlocks are held for less than 212 cycles.
Performance tests show that with 2X LP over-commitment we can get +2% perf
improvement for kernel build(Even more perf gain with more LPs).
Signed-off-by: Zhai Edwin <edwin.zhai@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

4b8d54f9

KVM: introduce kvm_vcpu_on_spin · d255f4f2

Zhai, Edwin authored Oct 09, 2009

Introduce kvm_vcpu_on_spin, to be used by VMX/SVM to yield processing
once the cpu detects pause-based looping.
Signed-off-by: "Zhai, Edwin" <edwin.zhai@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

d255f4f2

KVM: SVM: Remove nsvm_printk debugging code · d36f19e9

Joerg Roedel authored Oct 09, 2009

With all important informations now delivered through
tracepoints we can savely remove the nsvm_printk debugging
code for nested svm.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

d36f19e9

KVM: SVM: Add tracepoint for skinit instruction · 532a46b9

Joerg Roedel authored Oct 09, 2009

This patch adds a tracepoint for the event that the guest
executed the SKINIT instruction. This information is
important because SKINIT is an SVM extenstion not yet
implemented by nested SVM and we may need this information
for debugging hypervisors that do not yet run on nested SVM.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

532a46b9

KVM: SVM: Add tracepoint for invlpga instruction · ec1ff790

Joerg Roedel authored Oct 09, 2009

This patch adds a tracepoint for the event that the guest
executed the INVLPGA instruction.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

ec1ff790

KVM: SVM: Add tracepoint for #vmexit because intr pending · 236649de

Joerg Roedel authored Oct 09, 2009

This patch adds a special tracepoint for the event that a
nested #vmexit is injected because kvm wants to inject an
interrupt into the guest.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

236649de

KVM: SVM: Add tracepoint for injected #vmexit · 17897f36

Joerg Roedel authored Oct 09, 2009

This patch adds a tracepoint for a nested #vmexit that gets
re-injected to the guest.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

17897f36

KVM: SVM: Add tracepoint for nested #vmexit · d8cabddf

Joerg Roedel authored Oct 09, 2009

This patch adds a tracepoint for every #vmexit we get from a
nested guest.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

d8cabddf

KVM: SVM: Add tracepoint for nested vmrun · 0ac406de

Joerg Roedel authored Oct 09, 2009

This patch adds a dedicated kvm tracepoint for a nested
vmrun.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

0ac406de

KVM: SVM: Move INTR vmexit out of atomic code · cd3ff653

Joerg Roedel authored Oct 09, 2009

The nested SVM code emulates a #vmexit caused by a request
to open the irq window right in the request function. This
is a bug because the request function runs with preemption
and interrupts disabled but the #vmexit emulation might
sleep. This can cause a schedule()-while-atomic bug and is
fixed with this patch.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

cd3ff653

KVM: SVM: Notify nested hypervisor of lost event injections · 8d23c466

Alexander Graf authored Oct 09, 2009

If event_inj is valid on a #vmexit the host CPU would write
the contents to exit_int_info, so the hypervisor knows that
the event wasn't injected.

We don't do this in nested SVM by now which is a bug and
fixed by this patch.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

8d23c466