1. 04 Sep, 2024 10 commits
    • KVM: x86: Register "emergency disable" callbacks when virt is enabled · 590b09b1
      Sean Christopherson authored
      Register the "disable virtualization in an emergency" callback just
      before KVM enables virtualization in hardware, as there is no functional
      need to keep the callbacks registered while KVM happens to be loaded, but
      is inactive, i.e. if KVM hasn't enabled virtualization.
      
      Note, unregistering the callback every time the last VM is destroyed could
      have measurable latency due to the synchronize_rcu() needed to ensure all
      references to the callback are dropped before KVM is unloaded.  But the
      latency should be a small fraction of the total latency of disabling
      virtualization across all CPUs, and userspace can set enable_virt_at_load
      to completely eliminate the runtime overhead.
      
      Add a pointer in kvm_x86_ops to allow vendor code to provide its callback.
      There is no reason to force vendor code to do the registration, and either
      way KVM would need a new kvm_x86_ops hook.
      Suggested-by: Kai Huang <kai.huang@intel.com>
      Reviewed-by: Chao Gao <chao.gao@intel.com>
      Reviewed-by: Kai Huang <kai.huang@intel.com>
      Acked-by: Kai Huang <kai.huang@intel.com>
      Tested-by: Farrah Chen <farrah.chen@intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-ID: <20240830043600.127750-11-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      590b09b1
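The registration scheme described in the commit above can be modelled in userspace roughly as follows. This is a minimal sketch, not kernel code: every name in it (`model_enable_virtualization()`, `struct vendor_ops`, and so on) is hypothetical, and the single global pointer merely stands in for whatever registration mechanism the kernel actually uses.

```c
#include <stddef.h>

/* Hypothetical userspace model; none of these names are the kernel's. */
typedef void (*emergency_cb_t)(void);

/* Slot an emergency reboot path would consult; NULL means unregistered. */
static emergency_cb_t registered_cb = NULL;

/* Stand-in for an ops table through which vendor code provides its callback. */
struct vendor_ops {
    emergency_cb_t emergency_disable;
};

static int emergency_calls;

static void vendor_emergency_disable(void)
{
    emergency_calls++;
}

static const struct vendor_ops ops = {
    .emergency_disable = vendor_emergency_disable,
};

/* Register the callback just before virtualization is enabled. */
static void model_enable_virtualization(void)
{
    registered_cb = ops.emergency_disable;
    /* ... enabling virtualization on every CPU would happen here ... */
}

/* Drop the callback once virtualization is disabled again. */
static void model_disable_virtualization(void)
{
    /* ... disabling virtualization on every CPU would happen here ... */
    registered_cb = NULL;
}

/* Returns 1 if a callback was registered and invoked, 0 otherwise. */
static int model_emergency_reboot(void)
{
    if (!registered_cb)
        return 0;
    registered_cb();
    return 1;
}
```

In this model the emergency path finds a callback only between enable and disable, which mirrors the stated goal of not keeping it registered while the module is loaded but inactive.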
    • x86/reboot: Unconditionally define cpu_emergency_virt_cb typedef · 6d55a942
      Sean Christopherson authored
      Define cpu_emergency_virt_cb even if the kernel is being built without KVM
      support so that KVM can reference the typedef in asm/kvm_host.h without
      needing yet more #ifdefs.
      
      No functional change intended.
      Acked-by: Kai Huang <kai.huang@intel.com>
      Reviewed-by: Chao Gao <chao.gao@intel.com>
      Reviewed-by: Kai Huang <kai.huang@intel.com>
      Tested-by: Farrah Chen <farrah.chen@intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-ID: <20240830043600.127750-10-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6d55a942
    • KVM: Add arch hooks for enabling/disabling virtualization · b67107a2
      Sean Christopherson authored
      Add arch hooks that are invoked when KVM enables/disables virtualization.
      x86 will use the hooks to register an "emergency disable" callback, which
      is essentially an x86-specific shutdown notifier that is used when the
      kernel is doing an emergency reboot/shutdown/kexec.
      
      Add comments for the declarations to help arch code understand exactly
      when the callbacks are invoked.  Alternatively, the APIs themselves could
      communicate most of the same info, but kvm_arch_pre_enable_virtualization()
      and kvm_arch_post_disable_virtualization() are a bit cumbersome, and make
      it a bit less obvious that they are intended to be implemented as a pair.
      Reviewed-by: Chao Gao <chao.gao@intel.com>
      Reviewed-by: Kai Huang <kai.huang@intel.com>
      Acked-by: Kai Huang <kai.huang@intel.com>
      Tested-by: Farrah Chen <farrah.chen@intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-ID: <20240830043600.127750-9-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b67107a2
    • KVM: Add a module param to allow enabling virtualization when KVM is loaded · b4886fab
      Sean Christopherson authored
      Add an on-by-default module param, enable_virt_at_load, to let userspace
      force virtualization to be enabled in hardware when KVM is initialized,
      i.e. just before /dev/kvm is exposed to userspace.  Enabling virtualization
      during KVM initialization allows userspace to avoid the additional latency
      when creating/destroying the first/last VM (or more specifically, on the
      0=>1 and 1=>0 edges of creation/destruction).
      
      Now that KVM uses the cpuhp framework to do per-CPU enabling, the latency
      could be non-trivial as the cpuhp bringup/teardown is serialized across
      CPUs, e.g. the latency could be problematic for use cases that need to spin
      up VMs quickly.
      
      Prior to commit 10474ae8 ("KVM: Activate Virtualization On Demand"),
      KVM _unconditionally_ enabled virtualization during load, i.e. there's no
      fundamental reason KVM needs to dynamically toggle virtualization.  These
      days, the only known argument for not enabling virtualization is to allow
      KVM to be autoloaded without blocking other out-of-tree hypervisors, and
      such use cases can simply change the module param, e.g. via command line.
      
      Note, the aforementioned commit also mentioned that enabling SVM (AMD's
      virtualization extensions) can result in "using invalid TLB entries".
      It's not clear whether the changelog was referring to a KVM bug, a CPU
      bug, or something else entirely.  Regardless, leaving virtualization off
      by default is not a robust "fix", as any protection provided is lost the
      instant userspace creates the first VM.
      Reviewed-by: Chao Gao <chao.gao@intel.com>
      Acked-by: Kai Huang <kai.huang@intel.com>
      Reviewed-by: Kai Huang <kai.huang@intel.com>
      Tested-by: Farrah Chen <farrah.chen@intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-ID: <20240830043600.127750-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b4886fab
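The 0=>1 and 1=>0 edges mentioned in the commit above, and the way an enable-at-load option removes them from the VM creation path, can be illustrated with a small reference-count model. This is a sketch under stated assumptions: the variable `enable_virt_at_load` borrows the module param's name, but every function here is hypothetical and `hw_toggles` simply counts how often the expensive all-CPU enable/disable pass would run.

```c
#include <stdbool.h>

static bool enable_virt_at_load = true; /* models the module param */
static int usage_count;                 /* models a usage counter */
static int hw_toggles;                  /* costly enable/disable passes */

/* Take a reference; only the 0=>1 edge touches "hardware". */
static void get_virt(void)
{
    if (usage_count++ == 0)
        hw_toggles++;
}

/* Drop a reference; only the 1=>0 edge touches "hardware". */
static void put_virt(void)
{
    if (--usage_count == 0)
        hw_toggles++;
}

/* Module load/unload: hold one reference for the module's lifetime. */
static void model_init(void)
{
    if (enable_virt_at_load)
        get_virt();
}

static void model_exit(void)
{
    if (enable_virt_at_load)
        put_virt();
}

/* VM creation and destruction each take and drop one reference. */
static void model_create_vm(void)
{
    get_virt();
}

static void model_destroy_vm(void)
{
    put_virt();
}
```

With the option set, the load-time reference means creating and destroying VMs never crosses an edge; with it cleared, every first creation and last destruction does.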
    • KVM: x86: Rename virtualization {en,dis}abling APIs to match common KVM · 0617a769
      Sean Christopherson authored
      Rename x86's per-CPU vendor hooks used to enable virtualization in
      hardware to align with the recently renamed arch hooks.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Kai Huang <kai.huang@intel.com>
      Message-ID: <20240830043600.127750-7-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0617a769
    • KVM: MIPS: Rename virtualization {en,dis}abling APIs to match common KVM · 5381eca1
      Sean Christopherson authored
      Rename MIPS's trampoline hooks for virtualization enabling to match the
      recently renamed arch hooks.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-ID: <20240830043600.127750-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5381eca1
    • KVM: Rename arch hooks related to per-CPU virtualization enabling · 071f24ad
      Sean Christopherson authored
      Rename the per-CPU hooks used to enable virtualization in hardware to
      align with the KVM-wide helpers in kvm_main.c, and to better capture that
      the callbacks are invoked on every online CPU.
      
      No functional change intended.
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Kai Huang <kai.huang@intel.com>
      Message-ID: <20240830043600.127750-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      071f24ad
    • KVM: Rename symbols related to enabling virtualization hardware · 70c01943
      Sean Christopherson authored
      Rename the various functions (and a variable) that enable virtualization
      to prepare for upcoming changes, and to clean up artifacts of KVM's
      previous behavior, which required manually juggling locks around
      kvm_usage_count.
      
      Drop the "nolock" qualifier from per-CPU functions now that there are no
      "nolock" implementations of the "all" variants, i.e. now that calling a
      non-nolock function from a nolock function isn't confusing (unlike this
      sentence).
      
      Drop "all" from the outer helpers as they no longer manually iterate
      over all CPUs, and because it might not be obvious what "all" refers to.
      
      In lieu of the above qualifiers, append "_cpu" to the end of the functions
      that are per-CPU helpers for the outer APIs.
      
      Opportunistically prepend "kvm" to all functions to help make it clear
      that they are KVM helpers, but mostly because there's no reason not to.
      
      Lastly, use "virtualization" instead of "hardware", because while the
      functions do enable virtualization in hardware, there are a _lot_ of
      things that KVM enables in hardware.
      
      Defer renaming the arch hooks to future patches, purely to reduce the
      amount of churn in a single commit.
      Reviewed-by: Chao Gao <chao.gao@intel.com>
      Reviewed-by: Kai Huang <kai.huang@intel.com>
      Acked-by: Kai Huang <kai.huang@intel.com>
      Tested-by: Farrah Chen <farrah.chen@intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-ID: <20240830043600.127750-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      70c01943
    • KVM: Register cpuhp and syscore callbacks when enabling hardware · 9a798b13
      Sean Christopherson authored
      Register KVM's cpuhp and syscore callbacks when enabling virtualization
      in hardware instead of registering the callbacks during initialization,
      and let the CPU up/down framework invoke the inner enable/disable
      functions.  Registering the callbacks during initialization makes things
      more complex than they need to be, as KVM needs to be very careful about
      handling races between CPUs being onlined/offlined and hardware being
      enabled/disabled.
      
      Intel TDX support will require KVM to enable virtualization during KVM
      initialization, i.e. will add another wrinkle to things, at which point
      sorting out the potential races with kvm_usage_count would become even
      more complex.
      
      Note, using the cpuhp framework has a subtle behavioral change: enabling
      will be done serially across all CPUs, whereas KVM currently sends an IPI
      to all CPUs in parallel.  While serializing virtualization enabling could
      create undesirable latency, the issue is limited to the 0=>1 transition of
      VM creation.  And even that can be mitigated, e.g. by letting userspace
      force virtualization to be enabled when KVM is initialized.
      
      Cc: Chao Gao <chao.gao@intel.com>
      Reviewed-by: Kai Huang <kai.huang@intel.com>
      Acked-by: Kai Huang <kai.huang@intel.com>
      Tested-by: Farrah Chen <farrah.chen@intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-ID: <20240830043600.127750-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      9a798b13
    • KVM: Use dedicated mutex to protect kvm_usage_count to avoid deadlock · 44d17459
      Sean Christopherson authored
      Use a dedicated mutex to guard kvm_usage_count to fix a potential deadlock
      on x86 due to a chain of locks and SRCU synchronizations.  Translating the
      below lockdep splat, CPU1 #6 will wait on CPU0 #1, CPU0 #8 will wait on
      CPU2 #3, and CPU2 #7 will wait on CPU1 #4 (if there's a writer, due to the
      fairness of r/w semaphores).
      
          CPU0                     CPU1                     CPU2
      1   lock(&kvm->slots_lock);
      2                                                     lock(&vcpu->mutex);
      3                                                     lock(&kvm->srcu);
      4                            lock(cpu_hotplug_lock);
      5                            lock(kvm_lock);
      6                            lock(&kvm->slots_lock);
      7                                                     lock(cpu_hotplug_lock);
      8   sync(&kvm->srcu);
      
      Note, there are likely more potential deadlocks in KVM x86, e.g. the same
      pattern of taking cpu_hotplug_lock outside of kvm_lock likely exists with
      __kvmclock_cpufreq_notifier():
      
        cpuhp_cpufreq_online()
        |
        -> cpufreq_online()
           |
           -> cpufreq_gov_performance_limits()
              |
              -> __cpufreq_driver_target()
                 |
                 -> __target_index()
                    |
                    -> cpufreq_freq_transition_begin()
                       |
                       -> cpufreq_notify_transition()
                          |
                          -> ... __kvmclock_cpufreq_notifier()
      
      But, actually triggering such deadlocks is beyond rare due to the
      combination of dependencies and timings involved.  E.g. the cpufreq
      notifier is only used on older CPUs without a constant TSC, mucking with
      the NX hugepage mitigation while VMs are running is very uncommon, and
      doing so while also onlining/offlining a CPU (necessary to generate
      contention on cpu_hotplug_lock) would be even more unusual.
      
      The most robust solution to the general cpu_hotplug_lock issue is likely
      to switch vm_list to be an RCU-protected list, e.g. so that x86's cpufreq
      notifier doesn't need to take kvm_lock.  For now, settle for fixing the most
      blatant deadlock, as switching to an RCU-protected list is a much more
      involved change, but add a comment in locking.rst to call out that care
      needs to be taken when holding kvm_lock and walking vm_list.
      
        ======================================================
        WARNING: possible circular locking dependency detected
        6.10.0-smp--c257535a0c9d-pip #330 Tainted: G S         O
        ------------------------------------------------------
        tee/35048 is trying to acquire lock:
        ff6a80eced71e0a8 (&kvm->slots_lock){+.+.}-{3:3}, at: set_nx_huge_pages+0x179/0x1e0 [kvm]
      
        but task is already holding lock:
        ffffffffc07abb08 (kvm_lock){+.+.}-{3:3}, at: set_nx_huge_pages+0x14a/0x1e0 [kvm]
      
        which lock already depends on the new lock.
      
         the existing dependency chain (in reverse order) is:
      
        -> #3 (kvm_lock){+.+.}-{3:3}:
               __mutex_lock+0x6a/0xb40
               mutex_lock_nested+0x1f/0x30
               kvm_dev_ioctl+0x4fb/0xe50 [kvm]
               __se_sys_ioctl+0x7b/0xd0
               __x64_sys_ioctl+0x21/0x30
               x64_sys_call+0x15d0/0x2e60
               do_syscall_64+0x83/0x160
               entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
        -> #2 (cpu_hotplug_lock){++++}-{0:0}:
               cpus_read_lock+0x2e/0xb0
               static_key_slow_inc+0x16/0x30
               kvm_lapic_set_base+0x6a/0x1c0 [kvm]
               kvm_set_apic_base+0x8f/0xe0 [kvm]
               kvm_set_msr_common+0x9ae/0xf80 [kvm]
               vmx_set_msr+0xa54/0xbe0 [kvm_intel]
               __kvm_set_msr+0xb6/0x1a0 [kvm]
               kvm_arch_vcpu_ioctl+0xeca/0x10c0 [kvm]
               kvm_vcpu_ioctl+0x485/0x5b0 [kvm]
               __se_sys_ioctl+0x7b/0xd0
               __x64_sys_ioctl+0x21/0x30
               x64_sys_call+0x15d0/0x2e60
               do_syscall_64+0x83/0x160
               entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
        -> #1 (&kvm->srcu){.+.+}-{0:0}:
               __synchronize_srcu+0x44/0x1a0
               synchronize_srcu_expedited+0x21/0x30
               kvm_swap_active_memslots+0x110/0x1c0 [kvm]
               kvm_set_memslot+0x360/0x620 [kvm]
               __kvm_set_memory_region+0x27b/0x300 [kvm]
               kvm_vm_ioctl_set_memory_region+0x43/0x60 [kvm]
               kvm_vm_ioctl+0x295/0x650 [kvm]
               __se_sys_ioctl+0x7b/0xd0
               __x64_sys_ioctl+0x21/0x30
               x64_sys_call+0x15d0/0x2e60
               do_syscall_64+0x83/0x160
               entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
        -> #0 (&kvm->slots_lock){+.+.}-{3:3}:
               __lock_acquire+0x15ef/0x2e30
               lock_acquire+0xe0/0x260
               __mutex_lock+0x6a/0xb40
               mutex_lock_nested+0x1f/0x30
               set_nx_huge_pages+0x179/0x1e0 [kvm]
               param_attr_store+0x93/0x100
               module_attr_store+0x22/0x40
               sysfs_kf_write+0x81/0xb0
               kernfs_fop_write_iter+0x133/0x1d0
               vfs_write+0x28d/0x380
               ksys_write+0x70/0xe0
               __x64_sys_write+0x1f/0x30
               x64_sys_call+0x281b/0x2e60
               do_syscall_64+0x83/0x160
               entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
      Cc: Chao Gao <chao.gao@intel.com>
      Fixes: 0bf50497 ("KVM: Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock")
      Cc: stable@vger.kernel.org
      Reviewed-by: Kai Huang <kai.huang@intel.com>
      Acked-by: Kai Huang <kai.huang@intel.com>
      Tested-by: Farrah Chen <farrah.chen@intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-ID: <20240830043600.127750-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      44d17459
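The idea of guarding the usage counter with its own mutex, rather than reusing the lock that protects the VM list, can be sketched in userspace with pthreads. This is an illustrative model only: the names are hypothetical, the actual enable/disable work is omitted, and the real lock ordering in KVM involves more locks than the two shown.

```c
#include <pthread.h>

/* Dedicated lock for the counter, separate from the list lock. */
static pthread_mutex_t model_usage_lock = PTHREAD_MUTEX_INITIALIZER;
/* Stands in for the lock protecting the global VM list. */
static pthread_mutex_t model_vm_list_lock = PTHREAD_MUTEX_INITIALIZER;

static int model_usage_count;
static int model_vm_count;

/* Returns 1 on the 0=>1 edge, i.e. when enabling would be needed. */
static int model_usage_get(void)
{
    int first;

    pthread_mutex_lock(&model_usage_lock);
    first = (model_usage_count++ == 0);
    pthread_mutex_unlock(&model_usage_lock);
    return first;
}

/* Returns 1 on the 1=>0 edge, i.e. when disabling would be needed. */
static int model_usage_put(void)
{
    int last;

    pthread_mutex_lock(&model_usage_lock);
    last = (--model_usage_count == 0);
    pthread_mutex_unlock(&model_usage_lock);
    return last;
}

/* List updates take only the list lock, never nested with the usage lock. */
static void model_vm_list_add(void)
{
    pthread_mutex_lock(&model_vm_list_lock);
    model_vm_count++;
    pthread_mutex_unlock(&model_vm_list_lock);
}

/* Creating a VM touches both locks, one after the other. */
static int model_create_vm(void)
{
    int enabled_hw = model_usage_get();

    model_vm_list_add();
    return enabled_hw;
}
```

Because neither lock is ever held while acquiring the other, code that walks the list under the list lock adds no ordering dependency on whatever the counter path needs to take.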
  2. 14 Aug, 2024 5 commits
    • KVM: selftests: Test memslot move in memslot_perf_test with quirk disabled · 61de4c34
      Yan Zhao authored
      Add a new user option to memslot_perf_test to allow testing memslot move
      with quirk KVM_X86_QUIRK_SLOT_ZAP_ALL disabled.
      Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
      Message-ID: <20240703021219.13939-1-yan.y.zhao@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      61de4c34
    • KVM: selftests: Allow slot modification stress test with quirk disabled · 218f6415
      Yan Zhao authored
      Add a new user option to memslot_modification_stress_test to allow testing
      with slot zap quirk KVM_X86_QUIRK_SLOT_ZAP_ALL disabled.
      Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
      Message-ID: <20240703021206.13923-1-yan.y.zhao@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      218f6415
    • KVM: selftests: Test slot move/delete with slot zap quirk enabled/disabled · b4ed2c67
      Yan Zhao authored
      Update set_memory_region_test to make sure memslot move and deletion
      function correctly both when slot zap quirk KVM_X86_QUIRK_SLOT_ZAP_ALL is
      enabled and disabled.
      Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
      Message-ID: <20240703021119.13904-1-yan.y.zhao@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b4ed2c67
    • KVM: x86/mmu: Introduce a quirk to control memslot zap behavior · aa8d1f48
      Yan Zhao authored
      Introduce the quirk KVM_X86_QUIRK_SLOT_ZAP_ALL to allow users to select
      KVM's behavior when a memslot is moved or deleted for KVM_X86_DEFAULT_VM
      VMs. Make sure KVM behaves as if the quirk is always disabled for
      non-KVM_X86_DEFAULT_VM VMs.
      
      The KVM_X86_QUIRK_SLOT_ZAP_ALL quirk offers two behavior options:
      - when enabled:  Invalidate/zap all SPTEs ("zap-all"),
      - when disabled: Precisely zap only the leaf SPTEs within the range of the
                       moving/deleting memory slot ("zap-slot-leafs-only").
      
      "zap-all" is today's KVM behavior to work around a bug [1] where the
      change to the zapping behavior of memslot move/deletion would cause VM
      instability for VMs with an Nvidia GPU assigned; while
      "zap-slot-leafs-only" allows for more precise zapping of SPTEs within the
      memory slot range, improving performance in certain scenarios [2], and
      meeting the functional requirements for TDX.
      
      Previous attempts to select "zap-slot-leafs-only" include a per-VM
      capability approach [3] (which was not preferred because the root cause of
      the bug remained unidentified) and a per-memslot flag approach [4]. Sean
      and Paolo finally recommended the implementation of this quirk and
      explained that it's the least bad option [5].
      
      By default, the quirk is enabled on KVM_X86_DEFAULT_VM VMs to use
      "zap-all". Users have the option to disable the quirk to select
      "zap-slot-leafs-only" for specific KVM_X86_DEFAULT_VM VMs that are
      unaffected by this bug.
      
      For non-KVM_X86_DEFAULT_VM VMs, the "zap-slot-leafs-only" behavior is
      always selected without the user's opt-in, regardless of whether the user
      opts for "zap-all".
      This is because it is assumed until proven otherwise that non-
      KVM_X86_DEFAULT_VM VMs will not be exposed to the bug [1], and most
      importantly, it's because TDX must have "zap-slot-leafs-only" always
      selected. In TDX's case a memslot's GPA range can be a mixture of "private"
      or "shared" memory. Shared is roughly analogous to how EPT is handled for
      normal VMs, but private GPAs need lots of special treatment:
      1) "zap-all" would require zapping the private root page or non-leaf
         entries, or at least leaf entries beyond the scope of the memslot being
         deleted. However, TDX demands that the root page of the private page
         table remains unchanged, with leaf entries being zapped before non-leaf
         entries, and any dropped private guest pages must be re-accepted by the
         guest.
      2) if "zap-all" zaps only shared page tables, it would result in private
         pages still being mapped when the memslot is gone. This may affect even
         other processes if the gmem fd is later hole punched, causing the
         pages to be freed on the host while still mapped in the TD, because
         there's no pgoff-to-gfn information to zap the private page table
         after the memslot is gone.
      
      So, simply go "zap-slot-leafs-only" as if the quirk is always disabled for
      non-KVM_X86_DEFAULT_VM VMs to avoid manual opt-in for every VM type [6] or
      complicating quirk disabling interface (current quirk disabling interface
      is limited, no way to query quirks, or force them to be disabled).
      
      Add a new function kvm_mmu_zap_memslot_leafs() to implement
      "zap-slot-leafs-only". This function does not call kvm_unmap_gfn_range(),
      bypassing special handling to APIC_ACCESS_PAGE_PRIVATE_MEMSLOT, as
      1) The APIC_ACCESS_PAGE_PRIVATE_MEMSLOT cannot be created by users, nor can
         it be moved. It is only deleted by KVM when APICv is permanently
         inhibited.
      2) kvm_vcpu_reload_apic_access_page() effectively does nothing when
         APIC_ACCESS_PAGE_PRIVATE_MEMSLOT is deleted.
      3) Not making a KVM_REQ_APIC_PAGE_RELOAD request to all CPUs saves
         costly IPIs.
      Suggested-by: Kai Huang <kai.huang@intel.com>
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Link: https://patchwork.kernel.org/project/kvm/patch/20190205210137.1377-11-sean.j.christopherson@intel.com [1]
      Link: https://patchwork.kernel.org/project/kvm/patch/20190205210137.1377-11-sean.j.christopherson@intel.com/#25054908 [2]
      Link: https://lore.kernel.org/kvm/20200713190649.GE29725@linux.intel.com/T/#mabc0119583dacf621025e9d873c85f4fbaa66d5c [3]
      Link: https://lore.kernel.org/all/20240515005952.3410568-3-rick.p.edgecombe@intel.com [4]
      Link: https://lore.kernel.org/all/7df9032d-83e4-46a1-ab29-6c7973a2ab0b@redhat.com [5]
      Link: https://lore.kernel.org/all/ZnGa550k46ow2N3L@google.com [6]
      Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
      Message-ID: <20240703021043.13881-1-yan.y.zhao@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      aa8d1f48
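The selection rule described in the commit above (default-type VMs follow the quirk, every other VM type always gets "zap-slot-leafs-only") can be written down as a small decision function. This is a minimal sketch with hypothetical names; the bit position chosen for the quirk flag is an assumption for illustration, and the "disabled quirks" mask is just one plausible way to represent the user's choice.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the VM type and the quirk flag. */
enum model_vm_type { MODEL_DEFAULT_VM, MODEL_OTHER_VM };

/* Bit position is assumed for illustration only. */
#define MODEL_QUIRK_SLOT_ZAP_ALL (UINT64_C(1) << 7)

enum model_zap { MODEL_ZAP_ALL, MODEL_ZAP_SLOT_LEAFS_ONLY };

/*
 * Pick the zap behavior for a memslot move/delete.
 * disabled_quirks holds the quirk bits userspace has turned off.
 */
static enum model_zap model_pick_zap(enum model_vm_type type,
                                     uint64_t disabled_quirks)
{
    bool quirk_enabled = !(disabled_quirks & MODEL_QUIRK_SLOT_ZAP_ALL);

    /* Only default-type VMs honor the quirk. */
    if (type == MODEL_DEFAULT_VM && quirk_enabled)
        return MODEL_ZAP_ALL;

    /* Quirk disabled, or a VM type that never uses "zap-all". */
    return MODEL_ZAP_SLOT_LEAFS_ONLY;
}
```

Note that for the non-default type the mask is ignored entirely, which matches the statement that such VMs behave as if the quirk were always disabled.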
    • KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX) · 66155de9
      Sean Christopherson authored
      Disallow read-only memslots for SEV-{ES,SNP} VM types, as KVM can't
      directly emulate instructions for ES/SNP, and instead the guest must
      explicitly request emulation.  Unless the guest explicitly requests
      emulation without accessing memory, ES/SNP relies on KVM creating an MMIO
      SPTE, with the subsequent #NPF being reflected into the guest as a #VC.
      
      But for read-only memslots, KVM deliberately doesn't create MMIO SPTEs,
      because except for ES/SNP, doing so requires setting reserved bits in the
      SPTE, i.e. the SPTE can't be readable while also generating a #VC on
      writes.  Because KVM never creates MMIO SPTEs and jumps directly to
      emulation, the guest never gets a #VC.  And since KVM simply resumes the
      guest if ES/SNP guests trigger emulation, KVM effectively puts the vCPU
      into an infinite #NPF loop if the vCPU attempts to write read-only memory.
      
      Disallow read-only memory for all VMs with protected state, i.e. for
      upcoming TDX VMs as well as ES/SNP VMs.  For TDX, it's actually possible
      to support read-only memory, as TDX uses EPT Violation #VE to reflect the
      fault into the guest, e.g. KVM could configure read-only SPTEs with RX
      protections and SUPPRESS_VE=0.  But there is no strong use case for
      supporting read-only memslots on TDX, e.g. the main historical usage is
      to emulate option ROMs, but TDX disallows executing from shared memory.
      And if someone comes along with a legitimate, strong use case, the
      restriction can always be lifted for TDX.
      
      Don't bother trying to retroactively apply the restriction to SEV-ES
      VMs that are created as type KVM_X86_DEFAULT_VM.  Read-only memslots can't
      possibly work for SEV-ES, i.e. disallowing such memslots really just
      means reporting an error to userspace instead of silently hanging vCPUs.
      Trying to deal with the ordering between KVM_SEV_INIT and memslot creation
      isn't worth the marginal benefit it would provide userspace.
      
      Fixes: 26c44aa9 ("KVM: SEV: define VM types for SEV and SEV-ES")
      Fixes: 1dfe571c ("KVM: SEV: Add initial SEV-SNP support")
      Cc: Peter Gonda <pgonda@google.com>
      Cc: Michael Roth <michael.roth@amd.com>
      Cc: Vishal Annapurve <vannapurve@google.com>
      Cc: Ackerly Tng <ackerleytng@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-ID: <20240809190319.1710470-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      66155de9
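The restriction in the commit above amounts to rejecting a read-only flag at memslot creation time for any VM with protected state. A minimal sketch of such a check follows; the struct, the flag value, and the function name are all hypothetical, and the real validation path in KVM checks considerably more than this.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag; the value is an assumption, not taken from the source. */
#define MODEL_MEM_READONLY (UINT32_C(1) << 1)

/* Hypothetical per-VM state: true for ES/SNP/TDX-like VM types. */
struct model_vm {
    bool has_protected_state;
};

/*
 * Validate memslot flags at creation time.
 * Returns 0 if acceptable, -EINVAL if a read-only slot is requested
 * for a VM whose state is protected from the hypervisor.
 */
static int model_check_memslot_flags(const struct model_vm *vm, uint32_t flags)
{
    if ((flags & MODEL_MEM_READONLY) && vm->has_protected_state)
        return -EINVAL;
    return 0;
}
```

Failing early like this turns what would otherwise be a silently spinning vCPU into an error that userspace can see.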
  3. 13 Aug, 2024 9 commits
  4. 11 Aug, 2024 9 commits
    • Linux 6.11-rc3 · 7c626ce4
      Linus Torvalds authored
      7c626ce4
    • Merge tag 'x86-urgent-2024-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 7006fe2f
      Linus Torvalds authored
      Pull x86 fixes from Thomas Gleixner:
      
       - Fix 32-bit PTI for real.
      
         pti_clone_entry_text() is called twice, once before initcalls so that
         initcalls can use the user-mode helper and then again after text is
         set read only. Setting read only on 32-bit might break up the PMD
         mapping, which makes the second invocation of pti_clone_entry_text()
         find the mappings out of sync and fail.
      
         Allow the second call to split the existing PMDs in the user mapping
         and synchronize with the kernel mapping.
      
       - Don't make acpi_mp_wake_mailbox read-only after init as the mail box
         must be writable in the case that CPU hotplug operations happen after
         boot. Otherwise the attempt to start a CPU crashes with a write to
         read only memory.
      
       - Add a missing sanity check in mtrr_save_state() to ensure that the
         fixed MTRR MSRs are supported.
      
         Otherwise mtrr_save_state() ends up in a #GP, which is fixed up, but
         the WARN_ON() can bring systems down when panic on warn is set.
      
      * tag 'x86-urgent-2024-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/mtrr: Check if fixed MTRRs exist before saving them
        x86/paravirt: Fix incorrect virt spinlock setting on bare metal
        x86/acpi: Remove __ro_after_init from acpi_mp_wake_mailbox
        x86/mm: Fix PTI for i386 some more
      7006fe2f
    • Merge tag 'timers-urgent-2024-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 7270e931
      Linus Torvalds authored
      Pull time keeping fixes from Thomas Gleixner:
      
       - Fix a couple of issues in the NTP code where user supplied values are
         neither sanity checked nor clamped to the operating range. This
         results in integer overflows and eventually NTP getting out of sync.
      
         According to the history the sanity checks had been removed in favor
         of clamping the values, but the clamping never worked correctly under
         all circumstances. The NTP people asked to not bring the sanity
         checks back as it might break existing applications.
      
         Make the clamping work correctly and add it where it's missing
      
       - If adjtimex() sets the clock it has to trigger the hrtimer subsystem
         so it can adjust and if the clock was set into the future expire
         timers if needed. The caller should provide a bitmask to tell
         hrtimers which clocks have been adjusted.
      
         adjtimex() does not use the proper constant and uses CLOCK_REALTIME
         instead, which is 0. So hrtimers adjusts only the clocks, but does
         not check for expired timers, which might make them expire really
         late. Use the proper bitmask constant instead.
      
      * tag 'timers-urgent-2024-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        timekeeping: Fix bogus clock_was_set() invocation in do_adjtimex()
        ntp: Safeguard against time_constant overflow
        ntp: Clamp maxerror and esterror to operating range
      7270e931
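Both timekeeping fixes summarized above come down to small, easy-to-miss patterns: clamping a user-supplied value into its operating range, and passing a bitmask where a bitmask is expected instead of an ID that happens to be 0. The sketch below models both with hypothetical names; the base indices and the mask layout are assumptions made for illustration and are not taken from the kernel.

```c
/* Clamp v into [lo, hi]; the bounds are placeholders chosen by the caller. */
static long model_clamp(long v, long lo, long hi)
{
    if (v < lo)
        return lo;
    if (v > hi)
        return hi;
    return v;
}

/* Hypothetical timer bases that may need an expiry re-check. */
enum model_base { MODEL_BASE_REALTIME, MODEL_BASE_TAI, MODEL_BASE_COUNT };

/* A clock ID with value 0, which is not a bitmask. */
#define MODEL_CLOCK_REALTIME_ID 0

/* A proper bitmask naming the bases affected by setting wall time. */
#define MODEL_SET_WALL ((1u << MODEL_BASE_REALTIME) | (1u << MODEL_BASE_TAI))

/* Count how many bases the given mask asks to be re-checked. */
static unsigned int model_bases_to_check(unsigned int bases_mask)
{
    unsigned int n = 0;

    for (unsigned int i = 0; i < (unsigned int)MODEL_BASE_COUNT; i++) {
        if (bases_mask & (1u << i))
            n++;
    }
    return n;
}
```

Passing the ID where the mask belongs selects no bases at all in this model, which is the shape of the bug described: the call succeeds, but nothing gets re-checked for expired timers.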
    • Merge tag 'irq-urgent-2024-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 56fe0a6a
      Linus Torvalds authored
      Pull irq fixes from Thomas Gleixner:
       "Three small fixes for interrupt core and drivers:
      
         - The interrupt core fails to honor caller supplied affinity hints
           for non-managed interrupts and uses the system default affinity on
           startup instead. Set the missing flag in the descriptor to tell the
           core to use the provided affinity.
      
         - Fix a shift out of bounds error in the Xilinx driver
      
         - Handle switching to level trigger correctly in the RISCV APLIC
           driver. It failed to retrigger the interrupt, which causes it to
           become stale"
      
      * tag 'irq-urgent-2024-08-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        irqchip/riscv-aplic: Retrigger MSI interrupt on source configuration
        irqchip/xilinx: Fix shift out of bounds
        genirq/irqdesc: Honor caller provided affinity in alloc_desc()
      56fe0a6a
    • Merge tag 'usb-6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · cb2e5ee8
      Linus Torvalds authored
      Pull USB fixes from Greg KH:
       "Here are a number of small USB driver fixes for reported issues for
        6.11-rc3. Included in here are:
      
         - usb serial driver MODULE_DESCRIPTION() updates
      
         - usb serial driver fixes
      
         - typec driver fixes
      
         - usb-ip driver fix
      
         - gadget driver fixes
      
         - dt binding update
      
        All of these have been in linux-next with no reported issues"
      
      * tag 'usb-6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
        usb: typec: ucsi: Fix a deadlock in ucsi_send_command_common()
        usb: typec: tcpm: avoid sink goto SNK_UNATTACHED state if not received source capability message
        usb: gadget: f_fs: pull out f->disable() from ffs_func_set_alt()
        usb: gadget: f_fs: restore ffs_func_disable() functionality
        USB: serial: debug: do not echo input by default
        usb: typec: tipd: Delete extra semi-colon
        usb: typec: tipd: Fix dereferencing freeing memory in tps6598x_apply_patch()
        usb: gadget: u_serial: Set start_delayed during suspend
        usb: typec: tcpci: Fix error code in tcpci_check_std_output_cap()
        usb: typec: fsa4480: Check if the chip is really there
        usb: gadget: core: Check for unset descriptor
        usb: vhci-hcd: Do not drop references before new references are gained
        usb: gadget: u_audio: Check return codes from usb_ep_enable and config_ep_by_speed.
        usb: gadget: midi2: Fix the response for FB info with block 0xff
        dt-bindings: usb: microchip,usb2514: Add USB2517 compatible
        USB: serial: garmin_gps: use struct_size() to allocate pkt
        USB: serial: garmin_gps: annotate struct garmin_packet with __counted_by
        USB: serial: add missing MODULE_DESCRIPTION() macros
        USB: serial: spcp8x5: remove unused struct 'spcp8x5_usb_ctrl_arg'
      cb2e5ee8
    • Merge tag 'tty-6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty · 42b34a8d
      Linus Torvalds authored
      Pull tty / serial driver fixes from Greg KH:
       "Here are some small tty and serial driver fixes for reported problems
        for 6.11-rc3. Included in here are:
      
         - sc16is7xx serial driver fixes
      
         - uartclk bugfix for a divide by zero issue
      
         - conmakehash userspace build issue fix
      
        All of these have been in linux-next for a while with no reported
        issues"
      
      * tag 'tty-6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
        tty: vt: conmakehash: cope with abs_srctree no longer in env
        serial: sc16is7xx: fix invalid FIFO access with special register set
        serial: sc16is7xx: fix TX fifo corruption
        serial: core: check uartclk for zero to avoid divide by zero
      42b34a8d
    • Merge tag 'driver-core-6.11-rc3' of... · 84e6da57
      Linus Torvalds authored
      Merge tag 'driver-core-6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
      
      Pull driver core / documentation fixes from Greg KH:
       "Here are some small fixes, and some documentation updates for
        6.11-rc3. Included in here are:
      
         - embargoed hardware documentation updates based on a lot of review by
           legal-types in lots of companies to try to make the process a _bit_
           easier for us to manage over time.
      
         - rust firmware documentation fix
      
         - driver detach race fix for the fix that went into 6.11-rc1
      
        All of these have been in linux-next for a while with no reported
        issues"
      
      * tag 'driver-core-6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
        driver core: Fix uevent_show() vs driver detach race
        Documentation: embargoed-hardware-issues.rst: add a section documenting the "early access" process
        Documentation: embargoed-hardware-issues.rst: minor cleanups and fixes
        rust: firmware: fix invalid rustdoc link
      84e6da57
    • Merge tag 'char-misc-6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc · 9221afb2
      Linus Torvalds authored
      Pull char/misc fixes from Greg KH:
       "Here are some small char/misc/other driver fixes for 6.11-rc3 for
        reported issues. Included in here are:
      
         - binder driver fixes
      
         - fsi MODULE_DESCRIPTION() additions (people seem to love them...)
      
         - eeprom driver fix
      
         - Kconfig dependency fix to resolve build issues
      
         - spmi driver fixes
      
        All of these have been in linux-next for a while with no reported
        problems"
      
      * tag 'char-misc-6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
        spmi: pmic-arb: add missing newline in dev_err format strings
        spmi: pmic-arb: Pass the correct of_node to irq_domain_add_tree
        binder_alloc: Fix sleeping function called from invalid context
        binder: fix descriptor lookup for context manager
        char: add missing NetWinder MODULE_DESCRIPTION() macros
        misc: mrvl-cn10k-dpi: add PCI_IOV dependency
        eeprom: ee1004: Fix locking issues in ee1004_probe()
        fsi: add missing MODULE_DESCRIPTION() macros
      9221afb2
    • Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 04cc50c2
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "Two core fixes: one to prevent discard type changes (seen on iSCSI)
        during intermittent errors and the other is fixing a lockdep problem
        caused by the queue limits change.
      
        And one driver fix in ufs"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: sd: Keep the discard mode stable
        scsi: sd: Move sd_read_cpr() out of the q->limits_lock region
        scsi: ufs: core: Fix hba->last_dme_cmd_tstamp timestamp updating logic
      04cc50c2
  5. 10 Aug, 2024 7 commits