1. 29 Dec, 2022 40 commits
    • KVM: x86: Do compatibility checks when onlining CPU · c82a5c5c
      Chao Gao authored
      Do compatibility checks when enabling hardware to effectively add
      compatibility checks when onlining a CPU.  Abort enabling, i.e. the
      online process, if the (hotplugged) CPU is incompatible with the known
      good setup.
      
      At init time, KVM does compatibility checks to ensure that all online
      CPUs support hardware virtualization and a common set of features. But
      KVM uses hotplugged CPUs without such compatibility checks. On Intel
      CPUs, this leads to #GP if the hotplugged CPU doesn't support VMX, or
      VM-Entry failure if the hotplugged CPU doesn't support all features
      enabled by KVM.
      
      Note, this is little more than a NOP on SVM, as SVM already checks for
      full SVM support during hardware enabling.
      
      Opportunistically add a pr_err() if setup_vmcs_config() fails, and
      tweak all error messages to output which CPU failed.
      Signed-off-by: Chao Gao <chao.gao@intel.com>
      Co-developed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Acked-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <20221130230934.1014142-41-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
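A minimal userspace sketch of the idea, assuming a CPU's relevant capabilities can be reduced to a bitmask. The configuration captured at init is the known-good baseline, and onlining a CPU fails if it lacks VMX or any baseline bit, with the error naming the failing CPU. All names (`known_good_caps`, `check_cpu_compat()`, `hardware_enable_on_online()`, the `CAP_*` bits) and the `-EIO` return value are hypothetical; real KVM compares the full VMCS configuration, not a bitmask.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical capability bits standing in for a full VMCS config. */
#define CAP_VMX          (1u << 0)
#define CAP_EPT          (1u << 1)
#define CAP_UNRESTRICTED (1u << 2)

struct cpu_caps {
	int cpu;
	uint32_t caps;
};

/* Baseline captured once at init from the CPUs that were online then. */
static uint32_t known_good_caps = CAP_VMX | CAP_EPT | CAP_UNRESTRICTED;

/* Returns 0 if @c supports everything in the baseline, -EIO otherwise. */
static int check_cpu_compat(const struct cpu_caps *c)
{
	uint32_t missing;

	assert(c != NULL);
	if (!(c->caps & CAP_VMX)) {
		fprintf(stderr, "CPU %d: VMX not supported\n", c->cpu);
		return -EIO;
	}
	missing = known_good_caps & ~c->caps;
	if (missing) {
		fprintf(stderr, "CPU %d: missing caps 0x%x\n", c->cpu,
			(unsigned int)missing);
		return -EIO;
	}
	return 0;
}

/* Hypothetical online callback: abort onlining if the CPU is incompatible. */
static int hardware_enable_on_online(const struct cpu_caps *c)
{
	int r = check_cpu_compat(c);

	if (r)
		return r;	/* the hotplug core would then fail the online */
	/* ... enable virtualization (e.g. VMXON) on this CPU ... */
	return 0;
}
```

Failing in the enable path means an incompatible CPU never reaches the state where it could #GP on VMXON or fail VM-Entry.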
    • KVM: x86: Move CPU compat checks hook to kvm_x86_ops (from kvm_x86_init_ops) · d83420c2
      Sean Christopherson authored
      Move the .check_processor_compatibility() callback from kvm_x86_init_ops
      to kvm_x86_ops to allow a future patch to do compatibility checks during
      CPU hotplug.
      
      Do kvm_ops_update() before compat checks so that static_call() can be
      used during compat checks.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <20221130230934.1014142-40-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Check for SVM support in CPU compatibility checks · 325fc957
      Sean Christopherson authored
      Check that SVM is supported and enabled in the processor compatibility
      checks.  SVM already checks for support during hardware enabling,
      i.e. this doesn't really add new functionality.  The net effect is that
      KVM will refuse to load if a CPU doesn't have SVM fully enabled, as
      opposed to failing KVM_CREATE_VM.
      
      Opportunistically move svm_check_processor_compat() up in svm.c so that
      it can be invoked during hardware enabling in a future patch.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-39-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Shuffle support checks and hardware enabling code around · 8504ef21
      Sean Christopherson authored
      Reorder code in vmx.c so that the VMX support check helpers reside above
      the hardware enabling helpers, which will allow KVM to perform support
      checks during hardware enabling (in a future patch).
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-38-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Do VMX/SVM support checks directly in vendor code · d4193132
      Sean Christopherson authored
      Do basic VMX/SVM support checks directly in vendor code instead of
      implementing them via kvm_x86_ops hooks.  Beyond the superficial benefit
      of providing common messages, which isn't even clearly a net positive
      since vendor code can provide more precise/detailed messages, there's
      zero advantage to bouncing through common x86 code.
      
      Consolidating the checks will also simplify performing the checks
      across all CPUs (in a future patch).
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-37-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Use current CPU's info to perform "disabled by BIOS?" checks · 462689b3
      Sean Christopherson authored
      Use this_cpu_has() instead of boot_cpu_has() to perform the effective
      "disabled by BIOS?" checks for VMX.  This will allow consolidating code
      between vmx_disabled_by_bios() and vmx_check_processor_compat().
      
      Checking the boot CPU isn't a strict requirement as any divergence in VMX
      enabling between the boot CPU and other CPUs will result in KVM refusing
      to load thanks to the aforementioned vmx_check_processor_compat().
      
      Furthermore, using the boot CPU was an unintentional change introduced by
      commit a4d0b2fd ("KVM: VMX: Use VMX feature flag to query BIOS
      enabling").  Prior to using the feature flags, KVM checked the raw MSR
      value from the current CPU.
      Reported-by: Kai Huang <kai.huang@intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <20221130230934.1014142-36-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
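A small model of the difference, assuming the "VMX enabled by firmware" state can be represented as one flag per CPU. All names are hypothetical, not KVM's helpers. A check keyed to CPU 0 can report VMX as usable while the CPU actually running the check has it disabled; keying the check to the current CPU catches that case.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_SKETCH 4

/* Hypothetical per-CPU "VMX enabled by firmware" flags. */
static bool vmx_enabled_by_fw[NR_CPUS_SKETCH];

/* Analogue of boot_cpu_has(): always consults CPU 0. */
static bool boot_cpu_has_vmx(void)
{
	return vmx_enabled_by_fw[0];
}

/* Analogue of this_cpu_has(): consults the CPU the check runs on. */
static bool this_cpu_has_vmx(int this_cpu)
{
	assert(this_cpu >= 0 && this_cpu < NR_CPUS_SKETCH);
	return vmx_enabled_by_fw[this_cpu];
}

/* "Disabled by BIOS?" check evaluated for the current CPU. */
static bool vmx_disabled_by_bios_sketch(int this_cpu)
{
	return !this_cpu_has_vmx(this_cpu);
}
```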
    • KVM: x86: Unify pr_fmt to use module name for all KVM modules · 8d20bd63
      Sean Christopherson authored
      Define pr_fmt using KBUILD_MODNAME for all KVM x86 code so that printks
      use consistent formatting across common x86, Intel, and AMD code.  In
      addition to providing consistent print formatting, using KBUILD_MODNAME,
      e.g. kvm_amd and kvm_intel, allows referencing SVM and VMX (and SEV and
      SGX and ...) as technologies without generating weird messages, and
      without causing naming conflicts with other kernel code, e.g. "SEV: ",
      "tdx: ", "sgx: ", etc. are all used by the kernel for non-KVM subsystems.
      
      Opportunistically move away from printk() for prints that need to be
      modified anyways, e.g. to drop a manual "kvm: " prefix.
      
      Opportunistically convert a few SGX WARNs that are similarly modified to
      WARN_ONCE; in the very unlikely event that the WARNs fire, odds are good
      that they would fire repeatedly and spam the kernel log without providing
      unique information in each print.
      
      Note, defining pr_fmt yields undesirable results for code that uses KVM's
      printk wrappers, e.g. vcpu_unimpl().  But, that's a pre-existing problem
      as SVM/kvm_amd already defines a pr_fmt, and thankfully use of KVM's
      wrappers is relatively limited in KVM x86 code.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Paul Durrant <paul@xen.org>
      Message-Id: <20221130230934.1014142-35-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
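The pr_fmt idiom described above can be sketched in userspace. In the kernel, `pr_err(fmt, ...)` expands to `printk(KERN_ERR pr_fmt(fmt), ...)`, so defining `pr_fmt` before the printk helpers are used prefixes every message with the module name. Here `KBUILD_MODNAME` is defaulted by hand (Kbuild normally passes it on the compiler command line), and `pr_err_buf()` is a hypothetical stand-in that formats into a buffer instead of the kernel log.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Kbuild normally defines this as a string literal; default it for the sketch. */
#ifndef KBUILD_MODNAME
#define KBUILD_MODNAME "kvm_intel"
#endif

/* Must be defined before any printk helper expands it. */
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

/*
 * Hypothetical pr_err() stand-in: string-literal concatenation glues the
 * module prefix onto the caller's format. Needs at least one argument
 * after fmt in this portable sketch.
 */
#define pr_err_buf(buf, len, fmt, ...) \
	snprintf((buf), (len), pr_fmt(fmt), __VA_ARGS__)
```

Because the prefix is a compile-time string concatenation, kvm_intel and kvm_amd get distinct prefixes from the same source without any runtime cost.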
    • KVM: x86: Use KBUILD_MODNAME to specify vendor module name · 08a9d59c
      Sean Christopherson authored
      Use KBUILD_MODNAME to specify the vendor module name instead of manually
      writing out the name to make it a bit more obvious that the name isn't
      completely arbitrary.  A future patch will also use KBUILD_MODNAME to
      define pr_fmt, at which point using KBUILD_MODNAME for kvm_x86_ops.name
      further reinforces the intended usage of kvm_x86_ops.name.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-34-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Drop kvm_arch_check_processor_compat() hook · 81a1cf9f
      Sean Christopherson authored
      Drop kvm_arch_check_processor_compat() and its support code now that all
      architecture implementations are nops.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
      Reviewed-by: Eric Farman <farman@linux.ibm.com>	# s390
      Acked-by: Anup Patel <anup@brainfault.org>
      Reviewed-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <20221130230934.1014142-33-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Do CPU compatibility checks in x86 code · 3045c483
      Sean Christopherson authored
      Move the CPU compatibility checks to pure x86 code, i.e. drop x86's use
      of the common kvm_arch_check_processor_compat() arch hook.  x86 is the only
      architecture that "needs" to do per-CPU compatibility checks, moving
      the logic to x86 will allow dropping the common code, and will also
      give x86 more control over when/how the compatibility checks are
      performed, e.g. TDX will need to enable hardware (do VMXON) in order to
      perform compatibility checks.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
      Reviewed-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <20221130230934.1014142-32-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Make VMCS configuration/capabilities structs read-only after init · 58ca1930
      Sean Christopherson authored
      Tag the vmcs_config and vmx_capability structs as __ro_after_init; the
      canonical configuration is generated during hardware_setup() and must
      never be modified after that point.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-31-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Drop kvm_arch_{init,exit}() hooks · a578a0a9
      Sean Christopherson authored
      Drop kvm_arch_init() and kvm_arch_exit() now that all implementations
      are nops.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Eric Farman <farman@linux.ibm.com>	# s390
      Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
      Acked-by: Anup Patel <anup@brainfault.org>
      Message-Id: <20221130230934.1014142-30-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: s390: Mark __kvm_s390_init() and its descendants as __init · 6c30cd2e
      Sean Christopherson authored
      Tag __kvm_s390_init() and its unique helpers as __init.  These functions
      are only ever called during module_init(), but could not be tagged
      accordingly while they were invoked from the common kvm_arch_init(),
      which is not __init because of x86.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Eric Farman <farman@linux.ibm.com>
      Message-Id: <20221130230934.1014142-29-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: s390: Do s390 specific init without bouncing through kvm_init() · b8449265
      Sean Christopherson authored
      Move the guts of kvm_arch_init() into a new helper, __kvm_s390_init(),
      and invoke the new helper directly from kvm_s390_init() instead of
      bouncing through kvm_init().  Invoking kvm_arch_init() is the very
      first action performed by kvm_init(), i.e. this is a glorified nop.
      
      Moving setup to __kvm_s390_init() will allow tagging more functions as
      __init, and emptying kvm_arch_init() will allow dropping the hook
      entirely once all architecture implementations are nops.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Eric Farman <farman@linux.ibm.com>
      Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
      Message-Id: <20221130230934.1014142-28-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: PPC: Move processor compatibility check to module init · ae19b15d
      Sean Christopherson authored
      Move KVM PPC's compatibility checks to their respective module_init()
      hooks; there's no need to wait until KVM's common compat check, nor is
      there a need to perform the check on every CPU (provided by common KVM's
      hook), as the compatibility checks operate on global data.
      
        arch/powerpc/include/asm/cputable.h: extern struct cpu_spec *cur_cpu_spec;
        arch/powerpc/kvm/book3s.c: return 0
        arch/powerpc/kvm/e500.c: strcmp(cur_cpu_spec->cpu_name, "e500v2")
        arch/powerpc/kvm/e500mc.c: strcmp(cur_cpu_spec->cpu_name, "e500mc")
                                   strcmp(cur_cpu_spec->cpu_name, "e5500")
                                   strcmp(cur_cpu_spec->cpu_name, "e6500")
      
      Cc: Fabiano Rosas <farosas@linux.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Message-Id: <20221130230934.1014142-27-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
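A sketch of a module-init-time check in the spirit of the e500mc listing above, assuming a single global CPU descriptor as on PPC. `struct cpu_spec_sketch`, `e500mc_check_compat()` and `kvmppc_e500mc_init_sketch()` are hypothetical names, and the userspace `-ENOTSUP` is only an illustrative error code. Because the data is global, one check in module init replaces a per-CPU callback.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical stand-in for the global cur_cpu_spec. */
struct cpu_spec_sketch {
	const char *cpu_name;
};

/* Accept only the core names listed for e500mc.c above. */
static int e500mc_check_compat(const struct cpu_spec_sketch *spec)
{
	static const char *const supported[] = { "e500mc", "e5500", "e6500" };
	size_t i;

	for (i = 0; i < sizeof(supported) / sizeof(supported[0]); i++)
		if (strcmp(spec->cpu_name, supported[i]) == 0)
			return 0;
	return -ENOTSUP;
}

/* Hypothetical module_init(): refuse to load before doing other setup. */
static int kvmppc_e500mc_init_sketch(const struct cpu_spec_sketch *spec)
{
	int r = e500mc_check_compat(spec);

	if (r)
		return r;
	/* ... remaining setup, then register with common KVM ... */
	return 0;
}
```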
    • KVM: RISC-V: Tag init functions and data with __init, __ro_after_init · 45b66dc1
      Sean Christopherson authored
      Now that KVM setup is handled directly in riscv_kvm_init(), tag functions
      and data that are used/set only during init with __init/__ro_after_init.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Acked-by: Anup Patel <anup@brainfault.org>
      Message-Id: <20221130230934.1014142-26-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: RISC-V: Do arch init directly in riscv_kvm_init() · 20deee32
      Sean Christopherson authored
      Fold the guts of kvm_arch_init() into riscv_kvm_init() instead of
      bouncing through kvm_init()=>kvm_arch_init().  Functionally, this is a
      glorified nop as invoking kvm_arch_init() is the very first action
      performed by kvm_init().
      
      Moving setup to riscv_kvm_init(), which is tagged __init, will allow
      tagging more functions and data with __init and __ro_after_init.  And
      emptying kvm_arch_init() will allow dropping the hook entirely once all
      architecture implementations are nops.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
      Acked-by: Anup Patel <anup@brainfault.org>
      Message-Id: <20221130230934.1014142-25-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: MIPS: Register die notifier prior to kvm_init() · eed9fcdf
      Sean Christopherson authored
      Call kvm_init() only after _all_ setup is complete, as kvm_init() exposes
      /dev/kvm to userspace and thus allows userspace to create VMs (and call
      other ioctls).
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
      Message-Id: <20221130230934.1014142-24-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: MIPS: Setup VZ emulation? directly from kvm_mips_init() · 3fb8e89a
      Sean Christopherson authored
      Invoke kvm_mips_emulation_init() directly from kvm_mips_init() instead
      of bouncing through kvm_init()=>kvm_arch_init().  Functionally, this is
      a glorified nop as invoking kvm_arch_init() is the very first action
      performed by kvm_init().
      
      Emptying kvm_arch_init() will allow dropping the hook entirely once all
      architecture implementations are nops.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
      Message-Id: <20221130230934.1014142-23-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: MIPS: Hardcode callbacks to hardware virtualization extensions · 1cfc1c7b
      Sean Christopherson authored
      Now that KVM no longer supports trap-and-emulate (see commit 45c7e8af
      "MIPS: Remove KVM_TE support"), hardcode the MIPS callbacks to the
      virtualization callbacks.
      
      Hardcoding the callbacks eliminates the technically-unnecessary check on
      non-NULL kvm_mips_callbacks in kvm_arch_init().  MIPS has never supported
      multiple in-tree modules, i.e. barring an out-of-tree module, where
      copying and renaming kvm.ko counts as "out-of-tree", KVM could never
      encounter a non-NULL set of callbacks during module init.
      
      The callback check is also subtly broken, as it is not thread safe,
      i.e. if there were multiple modules, loading both concurrently would
      create a race between checking and setting kvm_mips_callbacks.
      
      Given that out-of-tree shenanigans are not the kernel's responsibility,
      hardcode the callbacks to simplify the code.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
      Message-Id: <20221130230934.1014142-22-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: arm64: Mark kvm_arm_init() and its unique descendants as __init · 53bf620a
      Sean Christopherson authored
      Tag kvm_arm_init() and its unique helper as __init, and tag data that is
      only ever modified under the kvm_arm_init() umbrella as read-only after
      init.
      
      Opportunistically name the boolean param in kvm_timer_hyp_init()'s
      prototype to match its definition.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-21-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: arm64: Do arm/arch initialization without bouncing through kvm_init() · 1dc0f02d
      Sean Christopherson authored
      Do arm/arch specific initialization directly in arm's module_init(), now
      called kvm_arm_init(), instead of bouncing through kvm_init() to reach
      kvm_arch_init().  Invoking kvm_arch_init() is the very first action
      performed by kvm_init(), so from an initialization perspective this is a
      glorified nop.
      
      Avoiding kvm_arch_init() also fixes a mostly benign bug as kvm_arch_exit()
      doesn't properly unwind if a later stage of kvm_init() fails.  While the
      soon-to-be-deleted comment about compiling as a module being unsupported
      is correct, kvm_arch_exit() can still be called by kvm_init() if any step
      after a successful call to kvm_arch_init() fails.
      
      Add a FIXME to call out that pKVM initialization isn't unwound if
      kvm_init() fails, which is a pre-existing problem inherited from
      kvm_arch_exit().
      
      Making kvm_arch_init() a nop will also allow dropping kvm_arch_init() and
      kvm_arch_exit() entirely once all other architectures follow suit.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-20-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: arm64: Unregister perf callbacks if hypervisor finalization fails · 78b3bf48
      Sean Christopherson authored
      Undo everything done by init_subsystems() if a later initialization step
      fails, i.e. unregister perf callbacks in addition to unregistering the
      power management notifier.
      
      Fixes: bfa79a80 ("KVM: arm64: Elevate hypervisor mappings creation at EL2")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-19-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
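The unwind pattern these arm64 fixes restore is the usual kernel goto ladder: each successful step gets a label that undoes it, and a later failure jumps to the label that unwinds everything done so far, in reverse order. A minimal userspace sketch, with hypothetical stand-ins (all `*_sketch` names and the flags) for the PM notifier, the perf callbacks, and hypervisor finalization:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical flags standing in for real registrations. */
static bool pm_notifier_registered, perf_callbacks_registered;
static bool fail_finalize;	/* knob to make the late step fail */

static void register_pm_notifier(void)      { pm_notifier_registered = true; }
static void unregister_pm_notifier(void)    { pm_notifier_registered = false; }
static void register_perf_callbacks(void)   { perf_callbacks_registered = true; }
static void unregister_perf_callbacks(void) { perf_callbacks_registered = false; }

static int init_subsystems_sketch(void)
{
	register_pm_notifier();
	register_perf_callbacks();
	return 0;
}

/* Undo everything init_subsystems_sketch() did, in reverse order. */
static void teardown_subsystems_sketch(void)
{
	unregister_perf_callbacks();
	unregister_pm_notifier();
}

static int finalize_hyp_sketch(void)
{
	return fail_finalize ? -EINVAL : 0;
}

static int arm_init_sketch(void)
{
	int err;

	err = init_subsystems_sketch();
	if (err)
		goto out;

	err = finalize_hyp_sketch();
	if (err)
		goto out_subs;

	return 0;

out_subs:
	teardown_subsystems_sketch();
out:
	return err;
}
```

Keeping teardown in a single helper that mirrors the init helper makes it harder for a newly added registration to be forgotten on the error path.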
    • KVM: arm64: Free hypervisor allocations if vector slot init fails · 6baaeda8
      Sean Christopherson authored
      Tear down hypervisor mode if vector slot setup fails in order to avoid
      leaking any allocations done by init_hyp_mode().
      
      Fixes: b881cdce ("KVM: arm64: Allocate hyp vectors statically")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-18-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: arm64: Simplify the CPUHP logic · 466d27e4
      Marc Zyngier authored
      For a number of historical reasons, the KVM/arm64 hotplug setup is pretty
      complicated, and we have two extra CPUHP notifiers for vGIC and timers.
      
      It looks pretty pointless, and gets in the way of further changes.
      So let's just expose some helpers that can be called from the core
      CPUHP callback, and get rid of everything else.
      
      This gives us the opportunity to drop a useless notifier entry,
      as well as tidy up the timer enable/disable, which was a bit odd.
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-17-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
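A sketch of the consolidated shape, assuming the per-subsystem work can be exposed as plain helpers. The `*_sketch` names and per-CPU flags are hypothetical. One core CPUHP online/offline pair calls the vGIC and timer helpers directly, instead of each subsystem registering its own notifier.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_SKETCH 2

/* Hypothetical per-CPU state touched by the vGIC and timer helpers. */
static bool vgic_enabled[NR_CPUS_SKETCH], timer_enabled[NR_CPUS_SKETCH];

static void kvm_vgic_cpu_up_sketch(int cpu)    { vgic_enabled[cpu] = true; }
static void kvm_vgic_cpu_down_sketch(int cpu)  { vgic_enabled[cpu] = false; }
static void kvm_timer_cpu_up_sketch(int cpu)   { timer_enabled[cpu] = true; }
static void kvm_timer_cpu_down_sketch(int cpu) { timer_enabled[cpu] = false; }

/* One core online callback replaces the separate vGIC and timer notifiers. */
static int kvm_arch_cpu_online_sketch(int cpu)
{
	kvm_vgic_cpu_up_sketch(cpu);
	kvm_timer_cpu_up_sketch(cpu);
	return 0;
}

/* The offline path tears down in the reverse order of online. */
static int kvm_arch_cpu_offline_sketch(int cpu)
{
	kvm_timer_cpu_down_sketch(cpu);
	kvm_vgic_cpu_down_sketch(cpu);
	return 0;
}
```

With a single callback, the relative ordering of vGIC and timer setup is explicit in code rather than implied by the ordering of separate hotplug states.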
    • KVM: x86: Serialize vendor module initialization (hardware setup) · 3af4a9e6
      Sean Christopherson authored
      Acquire a new mutex, vendor_module_lock, in kvm_x86_vendor_init() while
      doing hardware setup to ensure that concurrent calls are fully serialized.
      KVM rejects attempts to load vendor modules if a different module has
      already been loaded, but doesn't handle the case where multiple vendor
      modules are loaded at the same time, and module_init() doesn't run under
      the global module_mutex.
      
      Note, in practice, this is likely a benign bug as no platform exists that
      supports both SVM and VMX, i.e. barring a weird VM setup, one of the
      vendor modules is guaranteed to fail a support check before modifying
      common KVM state.
      
      Alternatively, KVM could perform an atomic CMPXCHG on .hardware_enable,
      but that comes with its own ugliness as it would require setting
      .hardware_enable before success is guaranteed, e.g. attempting to load
      the "wrong" module could result in a spurious failure to load the "right" module.
      
      Introduce a new mutex as using kvm_lock is extremely deadlock prone due
      to kvm_lock being taken under cpus_write_lock(), and in the future, under
      cpus_read_lock().  Any operation that takes cpus_read_lock() while
      holding kvm_lock would potentially deadlock, e.g. kvm_timer_init() takes
      cpus_read_lock() to register a callback.  In theory, KVM could avoid
      such problematic paths, i.e. do less setup under kvm_lock, but avoiding
      all calls to cpus_read_lock() is subtly difficult and thus fragile.  E.g.
      updating static calls also acquires cpus_read_lock().
      
      Inverting the lock ordering, i.e. always taking kvm_lock outside
      cpus_read_lock(), is not a viable option as kvm_lock is taken in various
      callbacks that may be invoked under cpus_read_lock(), e.g. x86's
      kvmclock_cpufreq_notifier().
      
      The lockdep splat below is dependent on future patches to take
      cpus_read_lock() in hardware_enable_all(), but as above, deadlock is
      already possible.
      
        ======================================================
        WARNING: possible circular locking dependency detected
        6.0.0-smp--7ec93244f194-init2 #27 Tainted: G           O
        ------------------------------------------------------
        stable/251833 is trying to acquire lock:
        ffffffffc097ea28 (kvm_lock){+.+.}-{3:3}, at: hardware_enable_all+0x1f/0xc0 [kvm]
      
                     but task is already holding lock:
        ffffffffa2456828 (cpu_hotplug_lock){++++}-{0:0}, at: hardware_enable_all+0xf/0xc0 [kvm]
      
                     which lock already depends on the new lock.
      
                     the existing dependency chain (in reverse order) is:
      
                     -> #1 (cpu_hotplug_lock){++++}-{0:0}:
               cpus_read_lock+0x2a/0xa0
               __cpuhp_setup_state+0x2b/0x60
               __kvm_x86_vendor_init+0x16a/0x1870 [kvm]
               kvm_x86_vendor_init+0x23/0x40 [kvm]
               0xffffffffc0a4d02b
               do_one_initcall+0x110/0x200
               do_init_module+0x4f/0x250
               load_module+0x1730/0x18f0
               __se_sys_finit_module+0xca/0x100
               __x64_sys_finit_module+0x1d/0x20
               do_syscall_64+0x3d/0x80
               entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
                     -> #0 (kvm_lock){+.+.}-{3:3}:
               __lock_acquire+0x16f4/0x30d0
               lock_acquire+0xb2/0x190
               __mutex_lock+0x98/0x6f0
               mutex_lock_nested+0x1b/0x20
               hardware_enable_all+0x1f/0xc0 [kvm]
               kvm_dev_ioctl+0x45e/0x930 [kvm]
               __se_sys_ioctl+0x77/0xc0
               __x64_sys_ioctl+0x1d/0x20
               do_syscall_64+0x3d/0x80
               entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
                     other info that might help us debug this:
      
         Possible unsafe locking scenario:
      
               CPU0                    CPU1
               ----                    ----
          lock(cpu_hotplug_lock);
                                       lock(kvm_lock);
                                       lock(cpu_hotplug_lock);
          lock(kvm_lock);
      
                      *** DEADLOCK ***
      
        1 lock held by stable/251833:
         #0: ffffffffa2456828 (cpu_hotplug_lock){++++}-{0:0}, at: hardware_enable_all+0xf/0xc0 [kvm]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-16-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
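A userspace model of the serialization, using a pthread mutex in place of the kernel mutex and a string in place of `kvm_x86_ops`. `vendor_init_sketch()`, `loaded_vendor` and `struct load_req` are hypothetical, and returning `-EEXIST` is an assumption of the sketch. Because the "already loaded?" check and the update happen under one lock, two concurrent loads cannot both observe an empty slot. The lock is a dedicated one rather than kvm_lock, for the lock-ordering reasons given above.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for kvm_x86_ops: owned by the first vendor to load. */
static const char *loaded_vendor;

/* Serializes vendor init, analogous to vendor_module_lock. */
static pthread_mutex_t vendor_module_lock = PTHREAD_MUTEX_INITIALIZER;

static int vendor_init_sketch(const char *name)
{
	int r = 0;

	pthread_mutex_lock(&vendor_module_lock);
	if (loaded_vendor) {
		/* Another vendor module already owns the common state. */
		r = -EEXIST;
		goto out;
	}
	/* ... hardware setup that may fail, done before publishing ops ... */
	loaded_vendor = name;
out:
	pthread_mutex_unlock(&vendor_module_lock);
	return r;
}

/* Thread argument: which "module" to load and where to store the result. */
struct load_req {
	const char *name;
	int result;
};

static void *load_thread(void *arg)
{
	struct load_req *req = arg;

	req->result = vendor_init_sketch(req->name);
	return NULL;
}
```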
    • KVM: VMX: Do _all_ initialization before exposing /dev/kvm to userspace · e32b1200
      Sean Christopherson authored
      Call kvm_init() only after _all_ setup is complete, as kvm_init() exposes
      /dev/kvm to userspace and thus allows userspace to create VMs (and call
      other ioctls).  E.g. KVM will encounter a NULL pointer when attempting to
      add a vCPU to the per-CPU loaded_vmcss_on_cpu list if userspace is able to
      create a VM before vmx_init() configures said list.
      
       BUG: kernel NULL pointer dereference, address: 0000000000000008
       #PF: supervisor write access in kernel mode
       #PF: error_code(0x0002) - not-present page
       PGD 0 P4D 0
       Oops: 0002 [#1] SMP
       CPU: 6 PID: 1143 Comm: stable Not tainted 6.0.0-rc7+ #988
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
       RIP: 0010:vmx_vcpu_load_vmcs+0x68/0x230 [kvm_intel]
        <TASK>
        vmx_vcpu_load+0x16/0x60 [kvm_intel]
        kvm_arch_vcpu_load+0x32/0x1f0 [kvm]
        vcpu_load+0x2f/0x40 [kvm]
        kvm_arch_vcpu_create+0x231/0x310 [kvm]
        kvm_vm_ioctl+0x79f/0xe10 [kvm]
        ? handle_mm_fault+0xb1/0x220
        __x64_sys_ioctl+0x80/0xb0
        do_syscall_64+0x2b/0x50
        entry_SYSCALL_64_after_hwframe+0x46/0xb0
       RIP: 0033:0x7f5a6b05743b
        </TASK>
       Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel(+) kvm irqbypass
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-15-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
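A sketch of the ordering rule, assuming a per-CPU list analogous to `loaded_vmcss_on_cpu` and a flag standing in for /dev/kvm being visible. All names here are hypothetical. Every piece of state a vCPU load touches is initialized before the step that exposes the device to userspace, so no ioctl can observe an uninitialized list head.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS_SKETCH 4

struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *l)
{
	l->next = l;
	l->prev = l;
}

/* Hypothetical per-CPU list, analogous to loaded_vmcss_on_cpu. */
static struct list_head loaded_vmcss_on_cpu[NR_CPUS_SKETCH];
static bool dev_kvm_exposed;

/* Stand-in for kvm_init(): after this, userspace can create VMs and vCPUs. */
static int kvm_init_sketch(void)
{
	dev_kvm_exposed = true;
	return 0;
}

static int vmx_init_sketch(void)
{
	int cpu;

	/* 1. Do all arch and vendor setup first ... */
	for (cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		init_list_head(&loaded_vmcss_on_cpu[cpu]);

	/* 2. ... and only then expose /dev/kvm, as the very last step. */
	return kvm_init_sketch();
}

/* A vCPU load is only safe if the device is visible and the list is valid. */
static bool vcpu_load_safe(int cpu)
{
	return dev_kvm_exposed && loaded_vmcss_on_cpu[cpu].next != NULL;
}
```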
    • KVM: x86: Move guts of kvm_arch_init() to standalone helper · 4f8396b9
      Sean Christopherson authored
      Move the guts of kvm_arch_init() to a new helper, kvm_x86_vendor_init(),
      so that VMX can do _all_ arch and vendor initialization before calling
      kvm_init().  Calling kvm_init() must be the _very_ last step during init,
      as kvm_init() exposes /dev/kvm to userspace, i.e. allows creating VMs.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-14-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Move Hyper-V eVMCS initialization to helper · 451d39e8
      Sean Christopherson authored
      Move Hyper-V's eVMCS initialization to a dedicated helper to clean up
      vmx_init(), and add a comment to call out that the Hyper-V init code
      doesn't need to be unwound if vmx_init() ultimately fails.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20221130230934.1014142-13-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Don't bother disabling eVMCS static key on module exit · da66de44
      Sean Christopherson authored
      Don't disable the eVMCS static key on module exit; kvm_intel.ko owns the
      key, so there can't possibly be users after kvm_intel.ko is unloaded,
      at least not without much bigger issues.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-12-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Reset eVMCS controls in VP assist page during hardware disabling · 2916b70f
      Sean Christopherson authored
      Reset the eVMCS controls in the per-CPU VP assist page during hardware
      disabling instead of waiting until kvm-intel's module exit.  The controls
      are activated if and only if KVM creates a VM, i.e. don't need to be
      reset if hardware is never enabled.
      
      Doing the reset during hardware disabling will naturally fix a potential
      NULL pointer deref bug once KVM disables CPU hotplug while enabling and
      disabling hardware (which is necessary to fix a variety of bugs).  If the
      kernel is running as the root partition, the VP assist page is unmapped
      during CPU hot unplug, and so KVM's clearing of the eVMCS controls needs
      to occur with CPU hot(un)plug disabled, otherwise KVM could attempt to
      write to a CPU's VP assist page after it's unmapped.
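      The ordering argument above can be illustrated with a small, sequential
      user-space model. It is only a sketch under stated assumptions: the
      struct, the evmcs_controls flag and every function name are hypothetical
      (they are not KVM's or Hyper-V's real definitions), and the model assumes
      hot unplug always runs the hardware-disable step on a CPU before that
      CPU's VP assist page is unmapped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 4

/* Hypothetical stand-in for the Hyper-V VP assist page; only a single
 * "eVMCS controls enabled" flag is modeled. */
struct vp_assist_page {
	bool evmcs_controls;
};

static struct vp_assist_page pages[NR_CPUS];

/* Per-CPU mapping; NULL while the CPU is offline (page unmapped). */
static struct vp_assist_page *vp_assist[NR_CPUS];

static void cpu_online(int cpu)
{
	vp_assist[cpu] = &pages[cpu];
}

/* Activate the controls when hardware is enabled on this CPU. */
static void hardware_enable(int cpu)
{
	assert(vp_assist[cpu] != NULL);
	vp_assist[cpu]->evmcs_controls = true;
}

/* Reset the controls on the same CPU, while its page is still mapped.
 * The assert stands in for the NULL deref a late reset would hit. */
static void hardware_disable(int cpu)
{
	assert(vp_assist[cpu] != NULL);
	vp_assist[cpu]->evmcs_controls = false;
}

/* Hypothetical hot-unplug path: hardware is disabled on the CPU before
 * its page is unmapped, so the reset never targets an unmapped page. */
static void cpu_offline(int cpu)
{
	hardware_disable(cpu);
	vp_assist[cpu] = NULL;
}
```

      A reset deferred to module exit would instead have to walk every CPU,
      including ones whose page (vp_assist[cpu] here) is already NULL.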
      Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20221130230934.1014142-11-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Drop arch hardware (un)setup hooks · 63a1bd8a
      Sean Christopherson authored
      Drop kvm_arch_hardware_setup() and kvm_arch_hardware_unsetup() now that
      all implementations are nops.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Eric Farman <farman@linux.ibm.com>  # s390
      Acked-by: Anup Patel <anup@brainfault.org>
      Message-Id: <20221130230934.1014142-10-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Move hardware setup/unsetup to init/exit · b7483387
      Sean Christopherson authored
      Now that kvm_arch_hardware_setup() is called immediately after
      kvm_arch_init(), fold the guts of kvm_arch_hardware_(un)setup() into
      kvm_arch_{init,exit}() as a step towards dropping one of the hooks.
      
      To avoid having to unwind various setup, e.g. registration of several
      notifiers, slot in the vendor hardware setup before the registration of
      said notifiers and callbacks.  Introducing a functional change while
      moving code is less than ideal, but the alternative is adding a pile of
      unwinding code, which is much more error prone, e.g. several attempts to
      move the setup code verbatim all introduced bugs.
      
      Add a comment to document that kvm_ops_update() is effectively the point
      of no return, e.g. it sets the kvm_x86_ops.hardware_enable canary and so
      needs to be unwound.
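      The "fallible step first, point of no return after" ordering described
      above can be sketched as follows. This is a simplified model, not the
      real kvm_arch_init(): vendor_hardware_setup(), ops_updated (standing in
      for the kvm_ops_update() canary) and the notifier counter are all
      hypothetical, and only the relative order of the steps is meant to match.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the init steps named in the commit message. */
static bool vendor_setup_should_fail;	/* test hook */
static int notifiers_registered;
static bool ops_updated;		/* models the hardware_enable canary */

static int vendor_hardware_setup(void)
{
	return vendor_setup_should_fail ? -1 : 0;
}

static void register_notifiers(void)
{
	notifiers_registered++;
}

static void unregister_notifiers(void)
{
	assert(notifiers_registered > 0);
	notifiers_registered--;
}

static int arch_init(void)
{
	int r;

	/* The fallible vendor setup runs first, so a failure leaves nothing
	 * (no notifiers, no updated ops) that needs to be unwound. */
	r = vendor_hardware_setup();
	if (r)
		return r;

	/* Point of no return: from here on, exit must undo the ops update. */
	ops_updated = true;

	/* Registrations are slotted in after the fallible step. */
	register_notifiers();
	return 0;
}

static void arch_exit(void)
{
	unregister_notifiers();
	ops_updated = false;
}
```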
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-9-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Do timer initialization after XCR0 configuration · 1935542a
      Sean Christopherson authored
      Move kvm_arch_init()'s call to kvm_timer_init() down a few lines below
      the XCR0 configuration code.  A future patch will move hardware setup
      into kvm_arch_init() and slot in vendor hardware setup before the call
      to kvm_timer_init() so that timer initialization (among other stuff)
      doesn't need to be unwound if vendor setup fails.  XCR0 setup on the
      other hand needs to happen before vendor hardware setup.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: s390: Move hardware setup/unsetup to init/exit · e43f5762
      Sean Christopherson authored
      Now that kvm_arch_hardware_setup() is called immediately after
      kvm_arch_init(), fold the guts of kvm_arch_hardware_(un)setup() into
      kvm_arch_{init,exit}() as a step towards dropping one of the hooks.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Eric Farman <farman@linux.ibm.com>
      Message-Id: <20221130230934.1014142-7-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: s390: Unwind kvm_arch_init() piece-by-piece() if a step fails · b801ef42
      Sean Christopherson authored
      In preparation for folding kvm_arch_hardware_setup() into kvm_arch_init(),
      unwind initialization one step at a time instead of simply calling
      kvm_arch_exit().  Using kvm_arch_exit() regardless of which initialization
      step failed relies on all affected state playing nice with being undone
      even if said state wasn't first set up.  That holds true for state that is
      currently configured by kvm_arch_init(), but not for state that's handled
      by kvm_arch_hardware_setup(), e.g. calling gmap_unregister_pte_notifier()
      without first registering a notifier would result in list corruption due
      to attempting to delete an entry that was never added to the list.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Eric Farman <farman@linux.ibm.com>
      Message-Id: <20221130230934.1014142-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Teardown VFIO ops earlier in kvm_exit() · 73b8dc04
      Sean Christopherson authored
      Move the call to kvm_vfio_ops_exit() further up kvm_exit() to try and
      bring some amount of symmetry to the setup order in kvm_init(), and more
      importantly so that the arch hooks are invoked dead last by kvm_exit().
      This will allow arch code to move away from the arch hooks without any
      change in ordering between arch code and common code in kvm_exit().
      
      That kvm_vfio_ops_exit() is called last appears to be 100% arbitrary.  It
      was bolted on after the fact by commit 571ee1b6 ("kvm: vfio: fix
      unregister kvm_device_ops of vfio").  The nullified kvm_device_ops_table
      is also local to kvm_main.c and is used only when there are active VMs,
      so unless arch code is doing something truly bizarre, nullifying the
      table earlier in kvm_exit() is little more than a nop.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Cornelia Huck <cohuck@redhat.com>
      Reviewed-by: Eric Farman <farman@linux.ibm.com>
      Message-Id: <20221130230934.1014142-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Allocate cpus_hardware_enabled after arch hardware setup · c9650228
      Sean Christopherson authored
      Allocate cpus_hardware_enabled after arch hardware setup so that arch
      "init" and "hardware setup" are called back-to-back and thus can be
      combined in a future patch.  cpus_hardware_enabled is never used before
      kvm_create_vm(), i.e. doesn't have a dependency on hardware setup and
      only needs to be allocated before /dev/kvm is exposed to userspace.
      
      Free the object before the arch hooks are invoked to maintain symmetry,
      and so that arch code can move away from the hooks without having to
      worry about ordering changes.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Yuan Yao <yuan.yao@intel.com>
      Message-Id: <20221130230934.1014142-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Initialize IRQ FD after arch hardware setup · 5910ccf0
      Sean Christopherson authored
      Move initialization of KVM's IRQ FD workqueue below arch hardware setup
      as a step towards consolidating arch "init" and "hardware setup", and
      eventually towards dropping the hooks entirely.  There is no dependency
      on the workqueue being created before hardware setup; the workqueue is
      used only when destroying VMs, i.e. only needs to be created before
      /dev/kvm is exposed to userspace.
      
      Move the destruction of the workqueue before the arch hooks to maintain
      symmetry, and so that arch code can move away from the hooks without
      having to worry about ordering changes.
      
      Reword the comment about kvm_irqfd_init() needing to come after
      kvm_arch_init() to call out that kvm_arch_init() must come before common
      KVM does _anything_, as x86 very subtly relies on that behavior to deal
      with multiple calls to kvm_init(), e.g. if userspace attempts to load
      kvm_amd.ko and kvm_intel.ko.  Tag the code with a FIXME, as x86's subtle
      requirement is gross, and invoking an arch callback as the very first
      action in a helper that is called only from arch code is silly.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Register /dev/kvm as the _very_ last thing during initialization · 2b012812
      Sean Christopherson authored
      Register /dev/kvm, i.e. expose KVM to userspace, only after all other
      setup has completed.  Once /dev/kvm is exposed, userspace can start
      invoking KVM ioctls, creating VMs, etc...  If userspace creates a VM
      before KVM is done with its configuration, bad things may happen, e.g.
      KVM will fail to properly migrate vCPU state if a VM is created before
      KVM has registered preemption notifiers.
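      The "expose last, hide first" rule above can be modeled with a few flags.
      This is a hedged sketch, not kvm_init() itself: dev_kvm_registered only
      stands in for registering the /dev/kvm misc device, create_vm() stands in
      for a userspace KVM_CREATE_VM, and the other flags are hypothetical
      placeholders for common KVM state such as preemption notifiers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flags standing in for pieces of common KVM state. */
static bool arch_ready, irqfd_ready, preempt_notifiers_ready;
static bool dev_kvm_registered;	/* models exposing /dev/kvm */

/* Models a userspace VM creation: possible only once /dev/kvm exists, and
 * correct only if everything it relies on is already set up. */
static int create_vm(void)
{
	if (!dev_kvm_registered)
		return -1;	/* nothing for userspace to open */
	assert(arch_ready && irqfd_ready && preempt_notifiers_ready);
	return 0;
}

static int kvm_init_sketch(void)
{
	arch_ready = true;
	irqfd_ready = true;
	preempt_notifiers_ready = true;

	/* Expose KVM to userspace only after all other setup has completed. */
	dev_kvm_registered = true;
	return 0;
}

static void kvm_exit_sketch(void)
{
	/* Mirror image of init: hide /dev/kvm first, then tear down. */
	dev_kvm_registered = false;
	preempt_notifiers_ready = false;
	irqfd_ready = false;
	arch_ready = false;
}
```

      Because registration is the final step, there is no window in which
      create_vm() can observe partially initialized state.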
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221130230934.1014142-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>