1. 27 Nov, 2018 12 commits
    • Liran Alon's avatar
      KVM: nVMX: vmcs12 revision_id is always VMCS12_REVISION even when copied from eVMCS · 52ad7eb3
      Liran Alon authored
      vmcs12 represents the per-CPU cache of L1 active vmcs12.
      
      This cache can be loaded by one of the following:
      1) Guest making a vmcs12 active by exeucting VMPTRLD
      2) Guest specifying eVMCS in VP assist page and executing
      VMLAUNCH/VMRESUME.
      
      Either way, vmcs12 should have revision_id of VMCS12_REVISION.
      Which is not equal to eVMCS revision_id which specifies used
      VersionNumber of eVMCS struct (e.g. KVM_EVMCS_VERSION).
      
      Specifically, this causes an issue in restoring a nested VM state
      because vmx_set_nested_state() verifies that vmcs12->revision_id
      is equal to VMCS12_REVISION which was not true in case vmcs12
      was populated from an eVMCS by vmx_get_nested_state() which calls
      copy_enlightened_to_vmcs12().
      Reviewed-by: default avatarDarren Kenny <darren.kenny@oracle.com>
      Signed-off-by: default avatarLiran Alon <liran.alon@oracle.com>
      Reviewed-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      52ad7eb3
    • Liran Alon's avatar
      KVM: nVMX: Verify eVMCS revision id match supported eVMCS version on eVMCS VMPTRLD · 72aeb60c
      Liran Alon authored
      According to TLFS section 16.11.2 Enlightened VMCS, the first u32
      field of eVMCS should specify eVMCS VersionNumber.
      
      This version should be in the range of supported eVMCS versions exposed
      to guest via CPUID.0x4000000A.EAX[0:15].
      The range which KVM expose to guest in this CPUID field should be the
      same as the value returned in vmcs_version by nested_enable_evmcs().
      
      According to the above, eVMCS VMPTRLD should verify that version specified
      in given eVMCS is in the supported range. However, current code
      mistakenly verfies this field against VMCS12_REVISION.
      
      One can also see that when KVM use eVMCS, it makes sure that
      alloc_vmcs_cpu() sets allocated eVMCS revision_id to KVM_EVMCS_VERSION.
      
      Obvious fix should just change eVMCS VMPTRLD to verify first u32 field
      of eVMCS is equal to KVM_EVMCS_VERSION.
      However, it turns out that Microsoft Hyper-V fails to comply to their
      own invented interface: When Hyper-V use eVMCS, it just sets first u32
      field of eVMCS to revision_id specified in MSR_IA32_VMX_BASIC (In our
      case: VMCS12_REVISION). Instead of used eVMCS version number which is
      one of the supported versions specified in CPUID.0x4000000A.EAX[0:15].
      To overcome Hyper-V bug, we accept either a supported eVMCS version
      or VMCS12_REVISION as valid values for first u32 field of eVMCS.
      
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: default avatarNikita Leshenko <nikita.leshchenko@oracle.com>
      Reviewed-by: default avatarMark Kanda <mark.kanda@oracle.com>
      Signed-off-by: default avatarLiran Alon <liran.alon@oracle.com>
      Reviewed-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      72aeb60c
    • Leonid Shatz's avatar
      KVM: nVMX/nSVM: Fix bug which sets vcpu->arch.tsc_offset to L1 tsc_offset · 326e7425
      Leonid Shatz authored
      Since commit e79f245d ("X86/KVM: Properly update 'tsc_offset' to
      represent the running guest"), vcpu->arch.tsc_offset meaning was
      changed to always reflect the tsc_offset value set on active VMCS.
      Regardless if vCPU is currently running L1 or L2.
      
      However, above mentioned commit failed to also change
      kvm_vcpu_write_tsc_offset() to set vcpu->arch.tsc_offset correctly.
      This is because vmx_write_tsc_offset() could set the tsc_offset value
      in active VMCS to given offset parameter *plus vmcs12->tsc_offset*.
      However, kvm_vcpu_write_tsc_offset() just sets vcpu->arch.tsc_offset
      to given offset parameter. Without taking into account the possible
      addition of vmcs12->tsc_offset. (Same is true for SVM case).
      
      Fix this issue by changing kvm_x86_ops->write_tsc_offset() to return
      actually set tsc_offset in active VMCS and modify
      kvm_vcpu_write_tsc_offset() to set returned value in
      vcpu->arch.tsc_offset.
      In addition, rename write_tsc_offset() callback to write_l1_tsc_offset()
      to make it clear that it is meant to set L1 TSC offset.
      
      Fixes: e79f245d ("X86/KVM: Properly update 'tsc_offset' to represent the running guest")
      Reviewed-by: default avatarLiran Alon <liran.alon@oracle.com>
      Reviewed-by: default avatarMihai Carabas <mihai.carabas@oracle.com>
      Reviewed-by: default avatarKrish Sadhukhan <krish.sadhukhan@oracle.com>
      Signed-off-by: default avatarLeonid Shatz <leonid.shatz@oracle.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      326e7425
    • Yi Wang's avatar
      x86/kvm/vmx: fix old-style function declaration · 1e4329ee
      Yi Wang authored
      The inline keyword which is not at the beginning of the function
      declaration may trigger the following build warnings, so let's fix it:
      
      arch/x86/kvm/vmx.c:1309:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]
      arch/x86/kvm/vmx.c:5947:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]
      arch/x86/kvm/vmx.c:5985:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]
      arch/x86/kvm/vmx.c:6023:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]
      Signed-off-by: default avatarYi Wang <wang.yi59@zte.com.cn>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      1e4329ee
    • Yi Wang's avatar
      KVM: x86: fix empty-body warnings · 354cb410
      Yi Wang authored
      We get the following warnings about empty statements when building
      with 'W=1':
      
      arch/x86/kvm/lapic.c:632:53: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
      arch/x86/kvm/lapic.c:1907:42: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
      arch/x86/kvm/lapic.c:1936:65: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
      arch/x86/kvm/lapic.c:1975:44: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
      
      Rework the debug helper macro to get rid of these warnings.
      Signed-off-by: default avatarYi Wang <wang.yi59@zte.com.cn>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      354cb410
    • Liran Alon's avatar
      KVM: VMX: Update shared MSRs to be saved/restored on MSR_EFER.LMA changes · f48b4711
      Liran Alon authored
      When guest transitions from/to long-mode by modifying MSR_EFER.LMA,
      the list of shared MSRs to be saved/restored on guest<->host
      transitions is updated (See vmx_set_efer() call to setup_msrs()).
      
      On every entry to guest, vcpu_enter_guest() calls
      vmx_prepare_switch_to_guest(). This function should also take care
      of setting the shared MSRs to be saved/restored. However, the
      function does nothing in case we are already running with loaded
      guest state (vmx->loaded_cpu_state != NULL).
      
      This means that even when guest modifies MSR_EFER.LMA which results
      in updating the list of shared MSRs, it isn't being taken into account
      by vmx_prepare_switch_to_guest() because it happens while we are
      running with loaded guest state.
      
      To fix above mentioned issue, add a flag to mark that the list of
      shared MSRs has been updated and modify vmx_prepare_switch_to_guest()
      to set shared MSRs when running with host state *OR* list of shared
      MSRs has been updated.
      
      Note that this issue was mistakenly introduced by commit
      678e315e ("KVM: vmx: add dedicated utility to access guest's
      kernel_gs_base") because previously vmx_set_efer() always called
      vmx_load_host_state() which resulted in vmx_prepare_switch_to_guest() to
      set shared MSRs.
      
      Fixes: 678e315e ("KVM: vmx: add dedicated utility to access guest's kernel_gs_base")
      Reported-by: default avatarEyal Moscovici <eyal.moscovici@oracle.com>
      Reviewed-by: default avatarMihai Carabas <mihai.carabas@oracle.com>
      Reviewed-by: default avatarLiam Merwick <liam.merwick@oracle.com>
      Reviewed-by: default avatarJim Mattson <jmattson@google.com>
      Signed-off-by: default avatarLiran Alon <liran.alon@oracle.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      f48b4711
    • Liran Alon's avatar
      KVM: x86: Fix kernel info-leak in KVM_HC_CLOCK_PAIRING hypercall · bcbfbd8e
      Liran Alon authored
      kvm_pv_clock_pairing() allocates local var
      "struct kvm_clock_pairing clock_pairing" on stack and initializes
      all it's fields besides padding (clock_pairing.pad[]).
      
      Because clock_pairing var is written completely (including padding)
      to guest memory, failure to init struct padding results in kernel
      info-leak.
      
      Fix the issue by making sure to also init the padding with zeroes.
      
      Fixes: 55dd00a7 ("KVM: x86: add KVM_HC_CLOCK_PAIRING hypercall")
      Reported-by: syzbot+a8ef68d71211ba264f56@syzkaller.appspotmail.com
      Reviewed-by: default avatarMark Kanda <mark.kanda@oracle.com>
      Signed-off-by: default avatarLiran Alon <liran.alon@oracle.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      bcbfbd8e
    • Liran Alon's avatar
      KVM: nVMX: Fix kernel info-leak when enabling KVM_CAP_HYPERV_ENLIGHTENED_VMCS more than once · 7f9ad1df
      Liran Alon authored
      Consider the case that userspace enables KVM_CAP_HYPERV_ENLIGHTENED_VMCS twice:
      1) kvm_vcpu_ioctl_enable_cap() is called to enable
      KVM_CAP_HYPERV_ENLIGHTENED_VMCS which calls nested_enable_evmcs().
      2) nested_enable_evmcs() sets enlightened_vmcs_enabled to true and fills
      vmcs_version which is then copied to userspace.
      3) kvm_vcpu_ioctl_enable_cap() is called again to enable
      KVM_CAP_HYPERV_ENLIGHTENED_VMCS which calls nested_enable_evmcs().
      4) This time nested_enable_evmcs() just returns 0 as
      enlightened_vmcs_enabled is already true. *Without filling
      vmcs_version*.
      5) kvm_vcpu_ioctl_enable_cap() continues as usual and copies
      *uninitialized* vmcs_version to userspace which leads to kernel info-leak.
      
      Fix this issue by simply changing nested_enable_evmcs() to always fill
      vmcs_version output argument. Even when enlightened_vmcs_enabled is
      already set to true.
      
      Note that SVM's nested_enable_evmcs() should not be modified because it
      always returns a non-zero value (-ENODEV) which results in
      kvm_vcpu_ioctl_enable_cap() skipping the copy of vmcs_version to
      userspace (as it should).
      
      Fixes: 57b119da ("KVM: nVMX: add KVM_CAP_HYPERV_ENLIGHTENED_VMCS capability")
      Reported-by: syzbot+cfbc368e283d381f8cef@syzkaller.appspotmail.com
      Reviewed-by: default avatarKrish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: default avatarLiran Alon <liran.alon@oracle.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      7f9ad1df
    • Wei Wang's avatar
      svm: Add mutex_lock to protect apic_access_page_done on AMD systems · 30510387
      Wei Wang authored
      There is a race condition when accessing kvm->arch.apic_access_page_done.
      Due to it, x86_set_memory_region will fail when creating the second vcpu
      for a svm guest.
      
      Add a mutex_lock to serialize the accesses to apic_access_page_done.
      This lock is also used by vmx for the same purpose.
      Signed-off-by: default avatarWei Wang <wawei@amazon.de>
      Signed-off-by: default avatarAmadeusz Juskowiak <ajusk@amazon.de>
      Signed-off-by: default avatarJulian Stecklina <jsteckli@amazon.de>
      Signed-off-by: default avatarSuravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Reviewed-by: default avatarJoerg Roedel <jroedel@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      30510387
    • Wanpeng Li's avatar
      KVM: X86: Fix scan ioapic use-before-initialization · e97f852f
      Wanpeng Li authored
      Reported by syzkaller:
      
       BUG: unable to handle kernel NULL pointer dereference at 00000000000001c8
       PGD 80000003ec4da067 P4D 80000003ec4da067 PUD 3f7bfa067 PMD 0
       Oops: 0000 [#1] PREEMPT SMP PTI
       CPU: 7 PID: 5059 Comm: debug Tainted: G           OE     4.19.0-rc5 #16
       RIP: 0010:__lock_acquire+0x1a6/0x1990
       Call Trace:
        lock_acquire+0xdb/0x210
        _raw_spin_lock+0x38/0x70
        kvm_ioapic_scan_entry+0x3e/0x110 [kvm]
        vcpu_enter_guest+0x167e/0x1910 [kvm]
        kvm_arch_vcpu_ioctl_run+0x35c/0x610 [kvm]
        kvm_vcpu_ioctl+0x3e9/0x6d0 [kvm]
        do_vfs_ioctl+0xa5/0x690
        ksys_ioctl+0x6d/0x80
        __x64_sys_ioctl+0x1a/0x20
        do_syscall_64+0x83/0x6e0
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The reason is that the testcase writes hyperv synic HV_X64_MSR_SINT6 msr
      and triggers scan ioapic logic to load synic vectors into EOI exit bitmap.
      However, irqchip is not initialized by this simple testcase, ioapic/apic
      objects should not be accessed.
      This can be triggered by the following program:
      
          #define _GNU_SOURCE
      
          #include <endian.h>
          #include <stdint.h>
          #include <stdio.h>
          #include <stdlib.h>
          #include <string.h>
          #include <sys/syscall.h>
          #include <sys/types.h>
          #include <unistd.h>
      
          uint64_t r[3] = {0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff};
      
          int main(void)
          {
          	syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
          	long res = 0;
          	memcpy((void*)0x20000040, "/dev/kvm", 9);
          	res = syscall(__NR_openat, 0xffffffffffffff9c, 0x20000040, 0, 0);
          	if (res != -1)
          		r[0] = res;
          	res = syscall(__NR_ioctl, r[0], 0xae01, 0);
          	if (res != -1)
          		r[1] = res;
          	res = syscall(__NR_ioctl, r[1], 0xae41, 0);
          	if (res != -1)
          		r[2] = res;
          	memcpy(
          			(void*)0x20000080,
          			"\x01\x00\x00\x00\x00\x5b\x61\xbb\x96\x00\x00\x40\x00\x00\x00\x00\x01\x00"
          			"\x08\x00\x00\x00\x00\x00\x0b\x77\xd1\x78\x4d\xd8\x3a\xed\xb1\x5c\x2e\x43"
          			"\xaa\x43\x39\xd6\xff\xf5\xf0\xa8\x98\xf2\x3e\x37\x29\x89\xde\x88\xc6\x33"
          			"\xfc\x2a\xdb\xb7\xe1\x4c\xac\x28\x61\x7b\x9c\xa9\xbc\x0d\xa0\x63\xfe\xfe"
          			"\xe8\x75\xde\xdd\x19\x38\xdc\x34\xf5\xec\x05\xfd\xeb\x5d\xed\x2e\xaf\x22"
          			"\xfa\xab\xb7\xe4\x42\x67\xd0\xaf\x06\x1c\x6a\x35\x67\x10\x55\xcb",
          			106);
          	syscall(__NR_ioctl, r[2], 0x4008ae89, 0x20000080);
          	syscall(__NR_ioctl, r[2], 0xae80, 0);
          	return 0;
          }
      
      This patch fixes it by bailing out scan ioapic if ioapic is not initialized in
      kernel.
      Reported-by: default avatarWei Wu <ww9210@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Wei Wu <ww9210@gmail.com>
      Signed-off-by: default avatarWanpeng Li <wanpengli@tencent.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      e97f852f
    • Wanpeng Li's avatar
      KVM: LAPIC: Fix pv ipis use-before-initialization · 38ab012f
      Wanpeng Li authored
      Reported by syzkaller:
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000014
       PGD 800000040410c067 P4D 800000040410c067 PUD 40410d067 PMD 0
       Oops: 0000 [#1] PREEMPT SMP PTI
       CPU: 3 PID: 2567 Comm: poc Tainted: G           OE     4.19.0-rc5 #16
       RIP: 0010:kvm_pv_send_ipi+0x94/0x350 [kvm]
       Call Trace:
        kvm_emulate_hypercall+0x3cc/0x700 [kvm]
        handle_vmcall+0xe/0x10 [kvm_intel]
        vmx_handle_exit+0xc1/0x11b0 [kvm_intel]
        vcpu_enter_guest+0x9fb/0x1910 [kvm]
        kvm_arch_vcpu_ioctl_run+0x35c/0x610 [kvm]
        kvm_vcpu_ioctl+0x3e9/0x6d0 [kvm]
        do_vfs_ioctl+0xa5/0x690
        ksys_ioctl+0x6d/0x80
        __x64_sys_ioctl+0x1a/0x20
        do_syscall_64+0x83/0x6e0
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The reason is that the apic map has not yet been initialized, the testcase
      triggers pv_send_ipi interface by vmcall which results in kvm->arch.apic_map
      is dereferenced. This patch fixes it by checking whether or not apic map is
      NULL and bailing out immediately if that is the case.
      
      Fixes: 4180bf1b (KVM: X86: Implement "send IPI" hypercall)
      Reported-by: default avatarWei Wu <ww9210@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Wei Wu <ww9210@gmail.com>
      Signed-off-by: default avatarWanpeng Li <wanpengli@tencent.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      38ab012f
    • Luiz Capitulino's avatar
      KVM: VMX: re-add ple_gap module parameter · a87c99e6
      Luiz Capitulino authored
      Apparently, the ple_gap parameter was accidentally removed
      by commit c8e88717. Add it
      back.
      Signed-off-by: default avatarLuiz Capitulino <lcapitulino@redhat.com>
      Cc: stable@vger.kernel.org
      Fixes: c8e88717Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      a87c99e6
  2. 25 Nov, 2018 1 commit
  3. 18 Nov, 2018 23 commits
  4. 16 Nov, 2018 4 commits