Commit 5dfd486c authored by Dave Hansen, committed by H. Peter Anvin

x86, kvm: Fix kvm's use of __pa() on percpu areas

In short, it is illegal to call __pa() on an address holding
a percpu variable.  This replaces those __pa() calls with
slow_virt_to_phys().  All of the cases in this patch are
in boot time (or CPU hotplug time at worst) code, so the
slow pagetable walking in slow_virt_to_phys() is not expected
to have a performance impact.
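
The difference between the two translations can be sketched in user space. This is only a toy model, assuming (as the message implies) that a percpu area can live outside the kernel's linear mapping. Every name and constant here (toy_pa, toy_slow_virt_to_phys, TOY_PAGE_OFFSET, TOY_VMALLOC_START, the toy_pfn table) is invented for illustration; the real slow_virt_to_phys() walks the actual kernel page tables.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT    12
#define TOY_PAGE_SIZE     (1u << TOY_PAGE_SHIFT)
#define TOY_PAGE_OFFSET   0xC0000000u /* hypothetical start of the linear map */
#define TOY_VMALLOC_START 0xF0000000u /* hypothetical start of the table-mapped area */
#define TOY_VMALLOC_PAGES 4

/* Toy "page table": physical frame number backing each table-mapped page. */
static const uint32_t toy_pfn[TOY_VMALLOC_PAGES] = { 0x1234, 0x0042, 0x0777, 0x0100 };

/* Linear translation: a fixed subtraction, correct only inside the linear map. */
static uint64_t toy_pa(uint32_t vaddr)
{
	return (uint64_t)vaddr - TOY_PAGE_OFFSET;
}

/* Table walk: slower, but correct for both kinds of address. */
static uint64_t toy_slow_virt_to_phys(uint32_t vaddr)
{
	if (vaddr >= TOY_VMALLOC_START) {
		uint32_t idx = (vaddr - TOY_VMALLOC_START) >> TOY_PAGE_SHIFT;
		uint32_t off = vaddr & (TOY_PAGE_SIZE - 1);

		assert(idx < TOY_VMALLOC_PAGES);
		return ((uint64_t)toy_pfn[idx] << TOY_PAGE_SHIFT) | off;
	}
	return toy_pa(vaddr);
}
```

For an address inside the toy linear map both functions agree. For an address in the table-mapped area, toy_pa() still returns a number, but it names memory that has nothing to do with the variable.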

The times when this actually matters are pretty obscure
(certain 32-bit NUMA systems), but it _does_ happen.  It is
important to keep KVM guests working on these systems because
the real hardware is getting harder and harder to find.

This bug manifested first by me seeing a plain hang at boot
after this message:

	CPU 0 irqstacks, hard=f3018000 soft=f301a000

or, sometimes, it would actually make it out to the console:

[    0.000000] BUG: unable to handle kernel paging request at ffffffff

I eventually traced it down to the KVM async pagefault code.
This can be worked around by disabling that code either at
compile-time, or on the kernel command-line.

The kvm async pagefault code was injecting page faults into
the guest, which the guest misinterpreted because the
"reason" was not being properly sent from the host.

The guest passes the physical address of a per-cpu async page
fault structure via an MSR to the host.  Since __pa() is
broken on percpu data, the physical address it sent was
basically bogus and the host went scribbling on random data.
The guest never saw the real reason for the page fault (it
was injected by the host), assumed that the kernel had taken
a _real_ page fault, and panic()'d.  The behavior varied,
though, depending on what got corrupted by the bad write.
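
The way a physical address and its enable bit share one MSR value can be sketched as follows. This is a minimal illustration, not the kernel's code: toy_msr_value, toy_split_msr and TOY_MSR_ENABLED are hypothetical stand-ins, and the 4-byte alignment requirement is an assumption borrowed from the BUILD_BUG_ON() visible in the diff.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MSR_ENABLED 1ull /* hypothetical stand-in for KVM_MSR_ENABLED */

/*
 * Combine a physical address with flag bits. The address must be aligned
 * so that its low bits are free to carry the flags.
 */
static uint64_t toy_msr_value(uint64_t phys, uint64_t flags)
{
	assert((phys & 0x3) == 0); /* low two bits reserved for flags */
	return phys | flags;
}

/* Split a 64-bit MSR value into the low/high halves a write expects. */
static void toy_split_msr(uint64_t val, uint32_t *low, uint32_t *high)
{
	*low = (uint32_t)val;
	*high = (uint32_t)(val >> 32);
}
```

If the address fed into such a value is wrong, the flag bits are still set correctly, so the host has no way to notice that the location it was told to write to is bogus.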

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20130122212435.4905663F@kernel.stglabs.ibm.com
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
parent d7656534
@@ -297,9 +297,9 @@ static void kvm_register_steal_time(void)
 	memset(st, 0, sizeof(*st));
-	wrmsrl(MSR_KVM_STEAL_TIME, (__pa(st) | KVM_MSR_ENABLED));
+	wrmsrl(MSR_KVM_STEAL_TIME, (slow_virt_to_phys(st) | KVM_MSR_ENABLED));
 	printk(KERN_INFO "kvm-stealtime: cpu %d, msr %lx\n",
-		cpu, __pa(st));
+		cpu, slow_virt_to_phys(st));
 }
 static DEFINE_PER_CPU(unsigned long, kvm_apic_eoi) = KVM_PV_EOI_DISABLED;
@@ -324,7 +324,7 @@ void __cpuinit kvm_guest_cpu_init(void)
 		return;
 	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf) {
-		u64 pa = __pa(&__get_cpu_var(apf_reason));
+		u64 pa = slow_virt_to_phys(&__get_cpu_var(apf_reason));
 #ifdef CONFIG_PREEMPT
 		pa |= KVM_ASYNC_PF_SEND_ALWAYS;
@@ -340,7 +340,8 @@ void __cpuinit kvm_guest_cpu_init(void)
 		/* Size alignment is implied but just to make it explicit. */
 		BUILD_BUG_ON(__alignof__(kvm_apic_eoi) < 4);
 		__get_cpu_var(kvm_apic_eoi) = 0;
-		pa = __pa(&__get_cpu_var(kvm_apic_eoi)) | KVM_MSR_ENABLED;
+		pa = slow_virt_to_phys(&__get_cpu_var(kvm_apic_eoi))
+			| KVM_MSR_ENABLED;
 		wrmsrl(MSR_KVM_PV_EOI_EN, pa);
 	}
...
@@ -162,8 +162,8 @@ int kvm_register_clock(char *txt)
 	int low, high, ret;
 	struct pvclock_vcpu_time_info *src = &hv_clock[cpu].pvti;
-	low = (int)__pa(src) | 1;
-	high = ((u64)__pa(src) >> 32);
+	low = (int)slow_virt_to_phys(src) | 1;
+	high = ((u64)slow_virt_to_phys(src) >> 32);
 	ret = native_write_msr_safe(msr_kvm_system_time, low, high);
 	printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
 	       cpu, high, low, txt);
...