An error occurred fetching the project authors.
- 11 Apr, 2024 2 commits
-
-
Paolo Bonzini authored
The only user was kvm_mmu_notifier_change_pte(), which is now gone. Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com> Reviewed-by:
Philippe Mathieu-Daudé <philmd@linaro.org> Message-ID: <20240405115815.3226315-3-pbonzini@redhat.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
The .change_pte() MMU notifier callback was intended as an optimization. The original point of it was that KSM could tell KVM to flip its secondary PTE to a new location without having to first zap it. At the time there was also an .invalidate_page() callback; both of them were *not* bracketed by calls to mmu_notifier_invalidate_range_{start,end}(), and .invalidate_page() also doubled as a fallback implementation of .change_pte(). Later on, however, both callbacks were changed to occur within an invalidate_range_start/end() block. In the case of .change_pte(), commit 6bdb913f ("mm: wrap calls to set_pte_at_notify with invalidate_range_start and invalidate_range_end", 2012-10-09) did so to remove the fallback from .invalidate_page() to .change_pte() and allow sleepable .invalidate_page() hooks. This however made KVM's usage of the .change_pte() callback completely moot, because KVM unmaps the sPTEs during .invalidate_range_start() and therefore .change_pte() has no hope of finding a sPTE to change. Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as well as all the architecture specific implementations. Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com> Acked-by:
Anup Patel <anup@brainfault.org> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Reviewed-by:
Bibo Mao <maobibo@loongson.cn> Message-ID: <20240405115815.3226315-2-pbonzini@redhat.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- 23 Feb, 2024 5 commits
-
-
Oliver Upton authored
The general expectation with debugfs is that any initialization failure is nonfatal. Nevertheless, kvm_arch_create_vm_debugfs() allows implementations to return an error and kvm_create_vm_debugfs() allows that to fail VM creation. Change to a void return to discourage architectures from making debugfs failures fatal for the VM. Seems like everyone already had the right idea, as all implementations already return 0 unconditionally. Acked-by:
Marc Zyngier <maz@kernel.org> Acked-by:
Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20240216155941.2029458-1-oliver.upton@linux.devSigned-off-by:
Oliver Upton <oliver.upton@linux.dev>
-
Sean Christopherson authored
Disallow creating read-only memslots that support GUEST_MEMFD, as GUEST_MEMFD is fundamentally incompatible with KVM's semantics for read-only memslots. Read-only memslots allow the userspace VMM to emulate option ROMs by filling the backing memory with readable, executable code and data, while triggering emulated MMIO on writes. GUEST_MEMFD doesn't currently support writes from userspace and KVM doesn't support emulated MMIO on private accesses, i.e. the guest can only ever read zeros, and writes will always be treated as errors. Cc: Fuad Tabba <tabba@google.com> Cc: Michael Roth <michael.roth@amd.com> Cc: Isaku Yamahata <isaku.yamahata@gmail.com> Cc: Yu Zhang <yu.c.zhang@linux.intel.com> Cc: Chao Peng <chao.p.peng@linux.intel.com> Fixes: a7800aa8 ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory") Link: https://lore.kernel.org/r/20240222190612.2942589-2-seanjc@google.comSigned-off-by:
Sean Christopherson <seanjc@google.com>
-
Arnd Bergmann authored
gcc-14 notices that the arguments to kvmalloc_array() are mixed up: arch/x86/kvm/../../../virt/kvm/kvm_main.c: In function '__kvm_mmu_topup_memory_cache': arch/x86/kvm/../../../virt/kvm/kvm_main.c:424:53: error: 'kvmalloc_array' sizes specified with 'sizeof' in the earlier argument and not in the later argument [-Werror=calloc-transposed-args] 424 | mc->objects = kvmalloc_array(sizeof(void *), capacity, gfp); | ^~~~ arch/x86/kvm/../../../virt/kvm/kvm_main.c:424:53: note: earlier argument should specify number of elements, later size of each element The code still works correctly, but the incorrect order prevents the compiler from properly tracking the object sizes. Fixes: 837f66c7 ("KVM: Allow for different capacities in kvm_mmu_memory_cache structs") Signed-off-by:
Arnd Bergmann <arnd@arndb.de> Reviewed-by:
Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20240212112419.1186065-1-arnd@kernel.orgSigned-off-by:
Sean Christopherson <seanjc@google.com>
-
Sean Christopherson authored
Add a comment to explain why KVM treats vCPUs with pending interrupts as in-kernel when a vCPU wants to yield to a vCPU that was preempted while running in kernel mode. Link: https://lore.kernel.org/r/20240110003938.490206-5-seanjc@google.comSigned-off-by:
Sean Christopherson <seanjc@google.com>
-
Sean Christopherson authored
Plumb in a dedicated hook for querying whether or not a vCPU was preempted in-kernel. Unlike literally every other architecture, x86's VMX can check if a vCPU is in kernel context if and only if the vCPU is loaded on the current pCPU. x86's kvm_arch_vcpu_in_kernel() works around the limitation by querying kvm_get_running_vcpu() and redirecting to vcpu->arch.preempted_in_kernel as needed. But that's unnecessary, confusing, and fragile, e.g. x86 has had at least one bug where KVM incorrectly used a stale preempted_in_kernel. No functional change intended. Reviewed-by:
Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240110003938.490206-2-seanjc@google.comSigned-off-by:
Sean Christopherson <seanjc@google.com>
-
- 08 Feb, 2024 1 commit
-
-
Paolo Bonzini authored
KVM uses __KVM_HAVE_* symbols in the architecture-dependent uapi/asm/kvm.h to mask unused definitions in include/uapi/linux/kvm.h. __KVM_HAVE_READONLY_MEM however was nothing but a misguided attempt to define KVM_CAP_READONLY_MEM only on architectures where KVM_CHECK_EXTENSION(KVM_CAP_READONLY_MEM) could possibly return nonzero. This however does not make sense, and it prevented userspace from supporting this architecture-independent feature without recompilation. Therefore, these days __KVM_HAVE_READONLY_MEM does not mask anything and is only used in virt/kvm/kvm_main.c. Userspace does not need to test it and there should be no need for it to exist. Remove it and replace it with a Kconfig symbol within Linux source code. Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- 29 Jan, 2024 1 commit
-
-
Sean Christopherson authored
When handling the end of an mmu_notifier invalidation, WARN if mn_active_invalidate_count is already 0 do not decrement it further, i.e. avoid causing mn_active_invalidate_count to underflow/wrap. In the worst case scenario, effectively corrupting mn_active_invalidate_count could cause kvm_swap_active_memslots() to hang indefinitely. end() calls are *supposed* to be paired with start(), i.e. underflow can only happen if there is a bug elsewhere in the kernel, but due to lack of lockdep assertions in the mmu_notifier helpers, it's all too easy for a bug to go unnoticed for some time, e.g. see the recently introduced PAGEMAP_SCAN ioctl(). Ideally, mmu_notifiers would incorporate lockdep assertions, but users of mmu_notifiers aren't required to hold any one specific lock, i.e. adding the necessary annotations to make lockdep aware of all locks that are mutally exclusive with mm_take_all_locks() isn't trivial. Link: https://lore.kernel.org/all/000000000000f6d051060c6785bc@google.com Link: https://lore.kernel.org/r/20240110004239.491290-1-seanjc@google.comSigned-off-by:
Sean Christopherson <seanjc@google.com>
-
- 12 Dec, 2023 1 commit
-
-
Marc Zyngier authored
Instead of having a comment indicating the need to hold slots_lock when calling kvm_io_bus_register_dev(), make it explicit with a lockdep assertion. Signed-off-by:
Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20231207151201.3028710-6-maz@kernel.orgSigned-off-by:
Oliver Upton <oliver.upton@linux.dev>
-
- 08 Dec, 2023 2 commits
-
-
Paolo Bonzini authored
The deprecated interfaces were removed 15 years ago. KVM's device assignment was deprecated in 4.2 and removed 6.5 years ago; the only interest might be in compiling ancient versions of QEMU, but QEMU has been using its own imported copy of the kernel headers since June 2011. So again we go into archaeology territory; just remove the cruft. Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
All platforms with a kernel irqchip have support for irqfd. Unify the two configuration items so that userspace can expect to use irqfd to inject interrupts into the irqchip. Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- 01 Dec, 2023 3 commits
-
-
Sean Christopherson authored
Revert KVM's misguided attempt to "fix" a use-after-module-unload bug that was actually due to failure to flush a workqueue, not a lack of module refcounting. Pinning the KVM module until kvm_vm_destroy() doesn't prevent use-after-free due to the module being unloaded, as userspace can invoke delete_module() the instant the last reference to KVM is put, i.e. can cause all KVM code to be unmapped while KVM is actively executing said code. Generally speaking, the many instances of module_put(THIS_MODULE) notwithstanding, outside of a few special paths, a module can never safely put the last reference to itself without creating deadlock, i.e. something external to the module *must* put the last reference. In other words, having VMs grab a reference to the KVM module is futile, pointless, and as evidenced by the now-reverted commit 70375c2d ("Revert "KVM: set owner of cpu and vm file operations""), actively dangerous. This reverts commit 405294f2 and commit 5f6de5cb. Fixes: 405294f2 ("KVM: Unconditionally get a ref to /dev/kvm module when creating a VM") Fixes: 5f6de5cb ("KVM: Prevent module exit until all VMs are freed") Link: https://lore.kernel.org/r/20231018204624.1905300-4-seanjc@google.comSigned-off-by:
Sean Christopherson <seanjc@google.com>
-
Sean Christopherson authored
Set .owner for all KVM-owned filed types so that the KVM module is pinned until any files with callbacks back into KVM are completely freed. Using "struct kvm" as a proxy for the module, i.e. keeping KVM-the-module alive while there are active VMs, doesn't provide full protection. Userspace can invoke delete_module() the instant the last reference to KVM is put. If KVM itself puts the last reference, e.g. via kvm_destroy_vm(), then it's possible for KVM to be preempted and deleted/unloaded before KVM fully exits, e.g. when the task running kvm_destroy_vm() is scheduled back in, it will jump to a code page that is no longer mapped. Note, file types that can call into sub-module code, e.g. kvm-intel.ko or kvm-amd.ko on x86, must use the module pointer passed to kvm_init(), not THIS_MODULE (which points at kvm.ko). KVM assumes that if /dev/kvm is reachable, e.g. VMs are active, then the vendor module is loaded. To reduce the probability of forgetting to set .owner entirely, use THIS_MODULE for stats files where KVM does not call back into vendor code. This reverts commit 70375c2d, and fixes several other file types that have been buggy since their introduction. Fixes: 70375c2d ("Revert "KVM: set owner of cpu and vm file operations"") Fixes: 3bcd0662 ("KVM: X86: Introduce mmu_rmaps_stat per-vm debugfs file") Reported-by:
Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/all/20231010003746.GN800259@ZenIV Link: https://lore.kernel.org/r/20231018204624.1905300-2-seanjc@google.comSigned-off-by:
Sean Christopherson <seanjc@google.com>
-
Philipp Stanner authored
kvm_main.c utilizes vmemdup_user() and array_size() to copy a userspace array. Currently, this does not check for an overflow. Use the new wrapper vmemdup_array_user() to copy the array more safely. Note, KVM explicitly checks the number of entries before duplicating the array, i.e. adding the overflow check should be a glorified nop. Suggested-by:
Dave Airlie <airlied@redhat.com> Signed-off-by:
Philipp Stanner <pstanner@redhat.com> Link: https://lore.kernel.org/r/20231102181526.43279-4-pstanner@redhat.com [sean: call out that KVM pre-checks the number of entries] Signed-off-by:
Sean Christopherson <seanjc@google.com>
-
- 30 Nov, 2023 1 commit
-
-
Wei Wang authored
KVM_CAP_DEVICE_CTRL allows userspace to check if the kvm_device framework (e.g. KVM_CREATE_DEVICE) is supported by KVM. Move KVM_CAP_DEVICE_CTRL to the generic check for the two reasons: 1) it already supports arch agnostic usages (i.e. KVM_DEV_TYPE_VFIO). For example, userspace VFIO implementation may needs to create KVM_DEV_TYPE_VFIO on x86, riscv, or arm etc. It is simpler to have it checked at the generic code than at each arch's code. 2) KVM_CREATE_DEVICE has been added to the generic code. Link: https://lore.kernel.org/all/20221215115207.14784-1-wei.w.wang@intel.comSigned-off-by:
Wei Wang <wei.w.wang@intel.com> Reviewed-by:
Sean Christopherson <seanjc@google.com> Acked-by: Anup Patel <anup@brainfault.org> (riscv) Reviewed-by:
Oliver Upton <oliver.upton@linux.dev> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Link: https://lore.kernel.org/r/20230315101606.10636-1-wei.w.wang@intel.comSigned-off-by:
Sean Christopherson <seanjc@google.com>
-
- 14 Nov, 2023 2 commits
-
-
Sean Christopherson authored
Let x86 track the number of address spaces on a per-VM basis so that KVM can disallow SMM memslots for confidential VMs. Confidentials VMs are fundamentally incompatible with emulating SMM, which as the name suggests requires being able to read and write guest memory and register state. Disallowing SMM will simplify support for guest private memory, as KVM will not need to worry about tracking memory attributes for multiple address spaces (SMM is the only "non-default" address space across all architectures). Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Paolo Bonzini <pbonzini@redhat.com> Reviewed-by:
Fuad Tabba <tabba@google.com> Tested-by:
Fuad Tabba <tabba@google.com> Message-Id: <20231027182217.3615211-23-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based memory that is tied to a specific KVM virtual machine and whose primary purpose is to serve guest memory. A guest-first memory subsystem allows for optimizations and enhancements that are kludgy or outright infeasible to implement/support in a generic memory subsystem. With guest_memfd, guest protections and mapping sizes are fully decoupled from host userspace mappings. E.g. KVM currently doesn't support mapping memory as writable in the guest without it also being writable in host userspace, as KVM's ABI uses VMA protections to define the allow guest protection. Userspace can fudge this by establishing two mappings, a writable mapping for the guest and readable one for itself, but that’s suboptimal on multiple fronts. Similarly, KVM currently requires the guest mapping size to be a strict subset of the host userspace mapping size, e.g. KVM doesn’t support creating a 1GiB guest mapping unless userspace also has a 1GiB guest mapping. Decoupling the mappings sizes would allow userspace to precisely map only what is needed without impacting guest performance, e.g. to harden against unintentional accesses to guest memory. Decoupling guest and userspace mappings may also allow for a cleaner alternative to high-granularity mappings for HugeTLB, which has reached a bit of an impasse and is unlikely to ever be merged. A guest-first memory subsystem also provides clearer line of sight to things like a dedicated memory pool (for slice-of-hardware VMs) and elimination of "struct page" (for offload setups where userspace _never_ needs to mmap() guest memory). More immediately, being able to map memory into KVM guests without mapping said memory into the host is critical for Confidential VMs (CoCo VMs), the initial use case for guest_memfd. While AMD's SEV and Intel's TDX prevent untrusted software from reading guest private data by encrypting guest memory with a key that isn't usable by the untrusted host, projects such as Protected KVM (pKVM) provide confidentiality and integrity *without* relying on memory encryption. And with SEV-SNP and TDX, accessing guest private memory can be fatal to the host, i.e. KVM must be prevent host userspace from accessing guest memory irrespective of hardware behavior. Attempt #1 to support CoCo VMs was to add a VMA flag to mark memory as being mappable only by KVM (or a similarly enlightened kernel subsystem). That approach was abandoned largely due to it needing to play games with PROT_NONE to prevent userspace from accessing guest memory. Attempt #2 to was to usurp PG_hwpoison to prevent the host from mapping guest private memory into userspace, but that approach failed to meet several requirements for software-based CoCo VMs, e.g. pKVM, as the kernel wouldn't easily be able to enforce a 1:1 page:guest association, let alone a 1:1 pfn:gfn mapping. And using PG_hwpoison does not work for memory that isn't backed by 'struct page', e.g. if devices gain support for exposing encrypted memory regions to guests. Attempt #3 was to extend the memfd() syscall and wrap shmem to provide dedicated file-based guest memory. That approach made it as far as v10 before feedback from Hugh Dickins and Christian Brauner (and others) led to it demise. Hugh's objection was that piggybacking shmem made no sense for KVM's use case as KVM didn't actually *want* the features provided by shmem. I.e. KVM was using memfd() and shmem to avoid having to manage memory directly, not because memfd() and shmem were the optimal solution, e.g. things like read/write/mmap in shmem were dead weight. Christian pointed out flaws with implementing a partial overlay (wrapping only _some_ of shmem), e.g. poking at inode_operations or super_operations would show shmem stuff, but address_space_operations and file_operations would show KVM's overlay. Paraphrashing heavily, Christian suggested KVM stop being lazy and create a proper API. Link: https://lore.kernel.org/all/20201020061859.18385-1-kirill.shutemov@linux.intel.com Link: https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com Link: https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com Link: https://lore.kernel.org/all/20211111141352.26311-1-chao.p.peng@linux.intel.com Link: https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@linux.intel.com Link: https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com Link: https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner Link: https://lore.kernel.org/all/ZEM5Zq8oo+xnApW9@google.com Link: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey Link: https://lore.kernel.org/linux-mm/ZII1p8ZHlHaQ3dDl@casper.infradead.org Cc: Fuad Tabba <tabba@google.com> Cc: Vishal Annapurve <vannapurve@google.com> Cc: Ackerley Tng <ackerleytng@google.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Maciej Szmigiero <mail@maciej.szmigiero.name> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Quentin Perret <qperret@google.com> Cc: Michael Roth <michael.roth@amd.com> Cc: Wang <wei.w.wang@intel.com> Cc: Liam Merwick <liam.merwick@oracle.com> Cc: Isaku Yamahata <isaku.yamahata@gmail.com> Co-developed-by:
Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by:
Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Co-developed-by:
Yu Zhang <yu.c.zhang@linux.intel.com> Signed-off-by:
Yu Zhang <yu.c.zhang@linux.intel.com> Co-developed-by:
Chao Peng <chao.p.peng@linux.intel.com> Signed-off-by:
Chao Peng <chao.p.peng@linux.intel.com> Co-developed-by:
Ackerley Tng <ackerleytng@google.com> Signed-off-by:
Ackerley Tng <ackerleytng@google.com> Co-developed-by:
Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by:
Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by:
Paolo Bonzini <pbonzini@redhat.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com> Co-developed-by:
Michael Roth <michael.roth@amd.com> Signed-off-by:
Michael Roth <michael.roth@amd.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20231027182217.3615211-17-seanjc@google.com> Reviewed-by:
Fuad Tabba <tabba@google.com> Tested-by:
Fuad Tabba <tabba@google.com> Reviewed-by:
Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- 13 Nov, 2023 9 commits
-
-
Chao Peng authored
In confidential computing usages, whether a page is private or shared is necessary information for KVM to perform operations like page fault handling, page zapping etc. There are other potential use cases for per-page memory attributes, e.g. to make memory read-only (or no-exec, or exec-only, etc.) without having to modify memslots. Introduce the KVM_SET_MEMORY_ATTRIBUTES ioctl, advertised by KVM_CAP_MEMORY_ATTRIBUTES, to allow userspace to set the per-page memory attributes to a guest memory range. Use an xarray to store the per-page attributes internally, with a naive, not fully optimized implementation, i.e. prioritize correctness over performance for the initial implementation. Use bit 3 for the PRIVATE attribute so that KVM can use bits 0-2 for RWX attributes/protections in the future, e.g. to give userspace fine-grained control over read, write, and execute protections for guest memory. Provide arch hooks for handling attribute changes before and after common code sets the new attributes, e.g. x86 will use the "pre" hook to zap all relevant mappings, and the "post" hook to track whether or not hugepages can be used to map the range. To simplify the implementation wrap the entire sequence with kvm_mmu_invalidate_{begin,end}() even though the operation isn't strictly guaranteed to be an invalidation. For the initial use case, x86 *will* always invalidate memory, and preventing arch code from creating new mappings while the attributes are in flux makes it much easier to reason about the correctness of consuming attributes. It's possible that future usages may not require an invalidation, e.g. if KVM ends up supporting RWX protections and userspace grants _more_ protections, but again opt for simplicity and punt optimizations to if/when they are needed. Suggested-by:
Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com Cc: Fuad Tabba <tabba@google.com> Cc: Xu Yilun <yilun.xu@intel.com> Cc: Mickaël Salaün <mic@digikod.net> Signed-off-by:
Chao Peng <chao.p.peng@linux.intel.com> Co-developed-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20231027182217.3615211-14-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Drop the .on_unlock() mmu_notifer hook now that it's no longer used for notifying arch code that memory has been reclaimed. Adding .on_unlock() and invoking it *after* dropping mmu_lock was a terrible idea, as doing so resulted in .on_lock() and .on_unlock() having divergent and asymmetric behavior, and set future developers up for failure, i.e. all but asked for bugs where KVM relied on using .on_unlock() to try to run a callback while holding mmu_lock. Opportunistically add a lockdep assertion in kvm_mmu_invalidate_end() to guard against future bugs of this nature. Reported-by:
Isaku Yamahata <isaku.yamahata@intel.com> Link: https://lore.kernel.org/all/20230802203119.GB2021422@ls.amr.corp.intel.comSigned-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Paolo Bonzini <pbonzini@redhat.com> Reviewed-by:
Fuad Tabba <tabba@google.com> Tested-by:
Fuad Tabba <tabba@google.com> Message-Id: <20231027182217.3615211-12-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Handle AMD SEV's kvm_arch_guest_memory_reclaimed() hook by having __kvm_handle_hva_range() return whether or not an overlapping memslot was found, i.e. mmu_lock was acquired. Using the .on_unlock() hook works, but kvm_arch_guest_memory_reclaimed() needs to run after dropping mmu_lock, which makes .on_lock() and .on_unlock() asymmetrical. Use a small struct to return the tuple of the notifier-specific return, plus whether or not overlap was found. Because the iteration helpers are __always_inlined, practically speaking, the struct will never actually be returned from a function call (not to mention the size of the struct will be two bytes in practice). Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Paolo Bonzini <pbonzini@redhat.com> Reviewed-by:
Fuad Tabba <tabba@google.com> Tested-by:
Fuad Tabba <tabba@google.com> Message-Id: <20231027182217.3615211-11-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Introduce a "version 2" of KVM_SET_USER_MEMORY_REGION so that additional information can be supplied without setting userspace up to fail. The padding in the new kvm_userspace_memory_region2 structure will be used to pass a file descriptor in addition to the userspace_addr, i.e. allow userspace to point at a file descriptor and map memory into a guest that is NOT mapped into host userspace. Alternatively, KVM could simply add "struct kvm_userspace_memory_region2" without a new ioctl(), but as Paolo pointed out, adding a new ioctl() makes detection of bad flags a bit more robust, e.g. if the new fd field is guarded only by a flag and not a new ioctl(), then a userspace bug (setting a "bad" flag) would generate out-of-bounds access instead of an -EINVAL error. Cc: Jarkko Sakkinen <jarkko@kernel.org> Reviewed-by:
Paolo Bonzini <pbonzini@redhat.com> Reviewed-by:
Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Fuad Tabba <tabba@google.com> Tested-by:
Fuad Tabba <tabba@google.com> Message-Id: <20231027182217.3615211-9-seanjc@google.com> Acked-by:
Kai Huang <kai.huang@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Convert KVM_ARCH_WANT_MMU_NOTIFIER into a Kconfig and select it where appropriate to effectively maintain existing behavior. Using a proper Kconfig will simplify building more functionality on top of KVM's mmu_notifier infrastructure. Add a forward declaration of kvm_gfn_range to kvm_types.h so that including arch/powerpc/include/asm/kvm_ppc.h's with CONFIG_KVM=n doesn't generate warnings due to kvm_gfn_range being undeclared. PPC defines hooks for PR vs. HV without guarding them via #ifdeffery, e.g. bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range); bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range); bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range); bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range); Alternatively, PPC could forward declare kvm_gfn_range, but there's no good reason not to define it in common KVM. Acked-by:
Anup Patel <anup@brainfault.org> Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Paolo Bonzini <pbonzini@redhat.com> Reviewed-by:
Fuad Tabba <tabba@google.com> Tested-by:
Fuad Tabba <tabba@google.com> Message-Id: <20231027182217.3615211-8-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Add an assertion that there are no in-progress MMU invalidations when a VM is being destroyed, with the exception of the scenario where KVM unregisters its MMU notifier between an .invalidate_range_start() call and the corresponding .invalidate_range_end(). KVM can't detect unpaired calls from the mmu_notifier due to the above exception waiver, but the assertion can detect KVM bugs, e.g. such as the bug that *almost* escaped initial guest_memfd development. Link: https://lore.kernel.org/all/e397d30c-c6af-e68f-d18e-b4e3739c5389@linux.intel.comSigned-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Paolo Bonzini <pbonzini@redhat.com> Reviewed-by:
Fuad Tabba <tabba@google.com> Tested-by:
Fuad Tabba <tabba@google.com> Message-Id: <20231027182217.3615211-5-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Chao Peng authored
Currently in mmu_notifier invalidate path, hva range is recorded and then checked against by mmu_invalidate_retry_hva() in the page fault handling path. However, for the soon-to-be-introduced private memory, a page fault may not have a hva associated, checking gfn(gpa) makes more sense. For existing hva based shared memory, gfn is expected to also work. The only downside is when aliasing multiple gfns to a single hva, the current algorithm of checking multiple ranges could result in a much larger range being rejected. Such aliasing should be uncommon, so the impact is expected small. Suggested-by:
Sean Christopherson <seanjc@google.com> Cc: Xu Yilun <yilun.xu@intel.com> Signed-off-by:
Chao Peng <chao.p.peng@linux.intel.com> Reviewed-by:
Fuad Tabba <tabba@google.com> Tested-by:
Fuad Tabba <tabba@google.com> [sean: convert vmx_set_apic_access_page_addr() to gfn-based API] Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Paolo Bonzini <pbonzini@redhat.com> Reviewed-by:
Xu Yilun <yilun.xu@linux.intel.com> Message-Id: <20231027182217.3615211-4-seanjc@google.com> Reviewed-by:
Kai Huang <kai.huang@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Move the assertion on the in-progress invalidation count from the primary MMU's notifier path to KVM's common notification path, i.e. assert that the count doesn't go negative even when the invalidation is coming from KVM itself. Opportunistically convert the assertion to a KVM_BUG_ON(), i.e. kill only the affected VM, not the entire kernel. A corrupted count is fatal to the VM, e.g. the non-zero (negative) count will cause mmu_invalidate_retry() to block any and all attempts to install new mappings. But it's far from guaranteed that an end() without a start() is fatal or even problematic to anything other than the target VM, e.g. the underlying bug could simply be a duplicate call to end(). And it's much more likely that a missed invalidation, i.e. a potential use-after-free, would manifest as no notification whatsoever, not an end() without a start(). Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Paolo Bonzini <pbonzini@redhat.com> Reviewed-by:
Fuad Tabba <tabba@google.com> Tested-by:
Fuad Tabba <tabba@google.com> Message-Id: <20231027182217.3615211-3-seanjc@google.com> Reviewed-by:
Kai Huang <kai.huang@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Rework and rename "struct kvm_hva_range" into "kvm_mmu_notifier_range" so that the structure can be used to handle notifications that operate on gfn context, i.e. that aren't tied to a host virtual address. Rename the handler typedef too (arguably it should always have been gfn_handler_t). Practically speaking, this is a nop for 64-bit kernels as the only meaningful change is to store start+end as u64s instead of unsigned longs. Reviewed-by:
Paolo Bonzini <pbonzini@redhat.com> Reviewed-by:
Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Fuad Tabba <tabba@google.com> Tested-by:
Fuad Tabba <tabba@google.com> Message-Id: <20231027182217.3615211-2-seanjc@google.com> Reviewed-by:
Kai Huang <kai.huang@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- 21 Aug, 2023 1 commit
-
-
David Hildenbrand authored
KVM is *the* case we know that really wants to honor NUMA hinting falls. As we want to stop setting FOLL_HONOR_NUMA_FAULT implicitly, set FOLL_HONOR_NUMA_FAULT whenever we might obtain pages on behalf of a VCPU to map them into a secondary MMU, and add a comment why. Do that unconditionally in hva_to_pfn_slow() when calling get_user_pages_unlocked(). kvmppc_book3s_instantiate_page(), hva_to_pfn_fast() and gfn_to_page_many_atomic() are similarly used to map pages into a secondary MMU. However, FOLL_WRITE and get_user_page_fast_only() always implicitly honor NUMA hinting faults -- as documented for FOLL_HONOR_NUMA_FAULT -- so we can limit this change to a single location for now. Don't set it in check_user_page_hwpoison(), where we really only want to check if the mapped page is HW-poisoned. We won't set it for other KVM users of get_user_pages()/pin_user_pages() * arch/powerpc/kvm/book3s_64_mmu_hv.c: not used to map pages into a secondary MMU. * arch/powerpc/kvm/e500_mmu.c: only used on shared TLB pages with userspace * arch/s390/kvm/*: s390x only supports a single NUMA node either way * arch/x86/kvm/svm/sev.c: not used to map pages into a secondary MMU. This is a preparation for making FOLL_HONOR_NUMA_FAULT no longer implicitly be set by get_user_pages() and friends. Link: https://lkml.kernel.org/r/20230803143208.383663-4-david@redhat.comSigned-off-by:
David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: liubo <liubo254@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
- 17 Aug, 2023 5 commits
-
-
Sean Christopherson authored
Wrap kvm_{gfn,hva}_range.pte in a union so that future notifier events can pass event specific information up and down the stack without needing to constantly expand and churn the APIs. Lockless aging of SPTEs will pass around a bitmap, and support for memory attributes will pass around the new attributes for the range. Add a "KVM_NO_ARG" placeholder to simplify handling events without an argument (creating a dummy union variable is midly annoying). Opportunstically drop explicit zero-initialization of the "pte" field, as omitting the field (now a union) has the same effect. Cc: Yu Zhao <yuzhao@google.com> Link: https://lore.kernel.org/all/CAOUHufagkd2Jk3_HrVoFFptRXM=hX2CV8f+M-dka-hJU4bP8kw@mail.gmail.comReviewed-by:
Oliver Upton <oliver.upton@linux.dev> Acked-by:
Yu Zhao <yuzhao@google.com> Link: https://lore.kernel.org/r/20230729004144.1054885-1-seanjc@google.comSigned-off-by:
Sean Christopherson <seanjc@google.com>
-
David Matlack authored
Move kvm_arch_flush_remote_tlbs_memslot() to common code and drop "arch_" from the name. kvm_arch_flush_remote_tlbs_memslot() is just a range-based TLB invalidation where the range is defined by the memslot. Now that kvm_flush_remote_tlbs_range() can be called from common code we can just use that and drop a bunch of duplicate code from the arch directories. Note this adds a lockdep assertion for slots_lock being held when calling kvm_flush_remote_tlbs_memslot(), which was previously only asserted on x86. MIPS has calls to kvm_flush_remote_tlbs_memslot(), but they all hold the slots_lock, so the lockdep assertion continues to hold true. Also drop the CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT ifdef gating kvm_flush_remote_tlbs_memslot(), since it is no longer necessary. Signed-off-by:
David Matlack <dmatlack@google.com> Signed-off-by:
Raghavendra Rao Ananta <rananta@google.com> Reviewed-by:
Gavin Shan <gshan@redhat.com> Reviewed-by:
Shaoqin Huang <shahuang@redhat.com> Acked-by:
Anup Patel <anup@brainfault.org> Acked-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230811045127.3308641-7-rananta@google.com
-
David Matlack authored
Make kvm_flush_remote_tlbs_range() visible in common code and create a default implementation that just invalidates the whole TLB. This paves the way for several future features/cleanups: - Introduction of range-based TLBI on ARM. - Eliminating kvm_arch_flush_remote_tlbs_memslot() - Moving the KVM/x86 TDP MMU to common code. No functional change intended. Signed-off-by:
David Matlack <dmatlack@google.com> Signed-off-by:
Raghavendra Rao Ananta <rananta@google.com> Reviewed-by:
Gavin Shan <gshan@redhat.com> Reviewed-by:
Shaoqin Huang <shahuang@redhat.com> Reviewed-by:
Anup Patel <anup@brainfault.org> Acked-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230811045127.3308641-6-rananta@google.com
-
Raghavendra Rao Ananta authored
kvm_arch_flush_remote_tlbs() or CONFIG_HAVE_KVM_ARCH_TLB_FLUSH_ALL are two mechanisms to solve the same problem, allowing architecture-specific code to provide a non-IPI implementation of remote TLB flushing. Dropping CONFIG_HAVE_KVM_ARCH_TLB_FLUSH_ALL allows KVM to standardize all architectures on kvm_arch_flush_remote_tlbs() instead of maintaining two mechanisms. Signed-off-by:
Raghavendra Rao Ananta <rananta@google.com> Reviewed-by:
Shaoqin Huang <shahuang@redhat.com> Reviewed-by:
Gavin Shan <gshan@redhat.com> Reviewed-by:
Philippe Mathieu-Daudé <philmd@linaro.org> Signed-off-by:
Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230811045127.3308641-5-rananta@google.com
-
David Matlack authored
Rename kvm_arch_flush_remote_tlb() and the associated macro __KVM_HAVE_ARCH_FLUSH_REMOTE_TLB to kvm_arch_flush_remote_tlbs() and __KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS respectively. Making the name plural matches kvm_flush_remote_tlbs() and makes it more clear that this function can affect more than one remote TLB. No functional change intended. Signed-off-by:
David Matlack <dmatlack@google.com> Signed-off-by:
Raghavendra Rao Ananta <rananta@google.com> Reviewed-by:
Gavin Shan <gshan@redhat.com> Reviewed-by:
Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by:
Shaoqin Huang <shahuang@redhat.com> Acked-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230811045127.3308641-2-rananta@google.com
-
- 29 Jul, 2023 1 commit
-
-
Sean Christopherson authored
Grab a reference to KVM prior to installing VM and vCPU stats file descriptors to ensure the underlying VM and vCPU objects are not freed until the last reference to any and all stats fds are dropped. Note, the stats paths manually invoke fd_install() and so don't need to grab a reference before creating the file. Fixes: ce55c049 ("KVM: stats: Support binary stats retrieval for a VCPU") Fixes: fcfe1bae ("KVM: stats: Support binary stats retrieval for a VM") Reported-by:
Zheng Zhang <zheng.zhang@email.ucr.edu> Closes: https://lore.kernel.org/all/CAC_GQSr3xzZaeZt85k_RCBd5kfiOve8qXo7a81Cq53LuVQ5r=Q@mail.gmail.com Cc: stable@vger.kernel.org Cc: Kees Cook <keescook@chromium.org> Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Kees Cook <keescook@chromium.org> Message-Id: <20230711230131.648752-2-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- 22 Jun, 2023 1 commit
-
-
Gavin Shan authored
We run into guest hang in edk2 firmware when KSM is kept as running on the host. The edk2 firmware is waiting for status 0x80 from QEMU's pflash device (TYPE_PFLASH_CFI01) during the operation of sector erasing or buffered write. The status is returned by reading the memory region of the pflash device and the read request should have been forwarded to QEMU and emulated by it. Unfortunately, the read request is covered by an illegal stage2 mapping when the guest hang issue occurs. The read request is completed with QEMU bypassed and wrong status is fetched. The edk2 firmware runs into an infinite loop with the wrong status. The illegal stage2 mapping is populated due to same page sharing by KSM at (C) even the associated memory slot has been marked as invalid at (B) when the memory slot is requested to be deleted. It's notable that the active and inactive memory slots can't be swapped when we're in the middle of kvm_mmu_notifier_change_pte() because kvm->mn_active_invalidate_count is elevated, and kvm_swap_active_memslots() will busy loop until it reaches to zero again. Besides, the swapping from the active to the inactive memory slots is also avoided by holding &kvm->srcu in __kvm_handle_hva_range(), corresponding to synchronize_srcu_expedited() in kvm_swap_active_memslots(). CPU-A CPU-B ----- ----- ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION) kvm_vm_ioctl_set_memory_region kvm_set_memory_region __kvm_set_memory_region kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE) kvm_invalidate_memslot kvm_copy_memslot kvm_replace_memslot kvm_swap_active_memslots (A) kvm_arch_flush_shadow_memslot (B) same page sharing by KSM kvm_mmu_notifier_invalidate_range_start : kvm_mmu_notifier_change_pte kvm_handle_hva_range __kvm_handle_hva_range kvm_set_spte_gfn (C) : kvm_mmu_notifier_invalidate_range_end Fix the issue by skipping the invalid memory slot at (C) to avoid the illegal stage2 mapping so that the read request for the pflash's status is forwarded to QEMU and emulated by it. In this way, the correct pflash's status can be returned from QEMU to break the infinite loop in the edk2 firmware. We tried a git-bisect and the first problematic commit is cd4c7183 (" KVM: arm64: Convert to the gfn-based MMU notifier callbacks"). With this, clean_dcache_guest_page() is called after the memory slots are iterated in kvm_mmu_notifier_change_pte(). clean_dcache_guest_page() is called before the iteration on the memory slots before this commit. This change literally enlarges the racy window between kvm_mmu_notifier_change_pte() and memory slot removal so that we're able to reproduce the issue in a practical test case. However, the issue exists since commit d5d8184d ("KVM: ARM: Memory virtualization setup"). Cc: stable@vger.kernel.org # v3.9+ Fixes: d5d8184d ("KVM: ARM: Memory virtualization setup") Reported-by:
Shuai Hu <hshuai@redhat.com> Reported-by:
Zhenyu Zhang <zhenyzha@redhat.com> Signed-off-by:
Gavin Shan <gshan@redhat.com> Reviewed-by:
David Hildenbrand <david@redhat.com> Reviewed-by:
Oliver Upton <oliver.upton@linux.dev> Reviewed-by:
Peter Xu <peterx@redhat.com> Reviewed-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Shaoqin Huang <shahuang@redhat.com> Message-Id: <20230615054259.14911-1-gshan@redhat.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- 19 Jun, 2023 1 commit
-
-
Ryan Roberts authored
Convert all instances of direct pte_t* dereferencing to instead use ptep_get() helper. This means that by default, the accesses change from a C dereference to a READ_ONCE(). This is technically the correct thing to do since where pgtables are modified by HW (for access/dirty) they are volatile and therefore we should always ensure READ_ONCE() semantics. But more importantly, by always using the helper, it can be overridden by the architecture to fully encapsulate the contents of the pte. Arch code is deliberately not converted, as the arch code knows best. It is intended that arch code (arm64) will override the default with its own implementation that can (e.g.) hide certain bits from the core code, or determine young/dirty status by mixing in state from another source. Conversion was done using Coccinelle: ---- // $ make coccicheck \ // COCCI=ptepget.cocci \ // SPFLAGS="--include-headers" \ // MODE=patch virtual patch @ depends on patch @ pte_t *v; @@ - *v + ptep_get(v) ---- Then reviewed and hand-edited to avoid multiple unnecessary calls to ptep_get(), instead opting to store the result of a single call in a variable, where it is correct to do so. This aims to negate any cost of READ_ONCE() and will benefit arch-overrides that may be more complex. Included is a fix for an issue in an earlier version of this patch that was pointed out by kernel test robot. The issue arose because config MMU=n elides definition of the ptep helper functions, including ptep_get(). HUGETLB_PAGE=n configs still define a simple huge_ptep_clear_flush() for linking purposes, which dereferences the ptep. So when both configs are disabled, this caused a build error because ptep_get() is not defined. Fix by continuing to do a direct dereference when MMU=n. This is safe because for this config the arch code cannot be trying to virtualize the ptes because none of the ptep helpers are defined. Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.comReported-by:
kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/Signed-off-by:
Ryan Roberts <ryan.roberts@arm.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Potapenko <glider@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Dave Airlie <airlied@gmail.com> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: SeongJae Park <sj@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
- 13 Jun, 2023 1 commit
-
-
Wei Wang authored
Current usage of kvm_io_device requires users to destruct it with an extra call of kvm_iodevice_destructor after the device gets unregistered from kvm_io_bus. This is not necessary and can cause errors if a user forgot to make the extra call. Simplify the usage by combining kvm_iodevice_destructor into kvm_io_bus_unregister_dev. This reduces LOCs a bit for users and can avoid the leakage of destructing the device explicitly. Signed-off-by:
Wei Wang <wei.w.wang@intel.com> Reviewed-by:
Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20230207123713.3905-2-wei.w.wang@intel.comSigned-off-by:
Sean Christopherson <seanjc@google.com>
-
- 09 Jun, 2023 1 commit
-
-
Lorenzo Stoakes authored
Patch series "remove the vmas parameter from GUP APIs", v6. (pin_/get)_user_pages[_remote]() each provide an optional output parameter for an array of VMA objects associated with each page in the input range. These provide the means for VMAs to be returned, as long as mm->mmap_lock is never released during the GUP operation (i.e. the internal flag FOLL_UNLOCKABLE is not specified). In addition, these VMAs can only be accessed with the mmap_lock held and become invalidated the moment it is released. The vast majority of invocations do not use this functionality and of those that do, all but one case retrieve a single VMA to perform checks upon. It is not egregious in the single VMA cases to simply replace the operation with a vma_lookup(). In these cases we duplicate the (fast) lookup on a slow path already under the mmap_lock, abstracted to a new get_user_page_vma_remote() inline helper function which also performs error checking and reference count maintenance. The special case is io_uring, where io_pin_pages() specifically needs to assert that the VMAs underlying the range do not result in broken long-term GUP file-backed mappings. As GUP now internally asserts that FOLL_LONGTERM mappings are not file-backed in a broken fashion (i.e. requiring dirty tracking) - as implemented in "mm/gup: disallow FOLL_LONGTERM GUP-nonfast writing to file-backed mappings" - this logic is no longer required and so we can simply remove it altogether from io_uring. Eliminating the vmas parameter eliminates an entire class of danging pointer errors that might have occured should the lock have been incorrectly released. In addition, the API is simplified and now clearly expresses what it is intended for - applying the specified GUP flags and (if pinning) returning pinned pages. This change additionally opens the door to further potential improvements in GUP and the possible marrying of disparate code paths. I have run this series against gup_test with no issues. Thanks to Matthew Wilcox for suggesting this refactoring! This patch (of 6): No invocation of get_user_pages() use the vmas parameter, so remove it. The GUP API is confusing and caveated. Recent changes have done much to improve that, however there is more we can do. Exporting vmas is a prime target as the caller has to be extremely careful to preclude their use after the mmap_lock has expired or otherwise be left with dangling pointers. Removing the vmas parameter focuses the GUP functions upon their primary purpose - pinning (and outputting) pages as well as performing the actions implied by the input flags. This is part of a patch series aiming to remove the vmas parameter altogether. Link: https://lkml.kernel.org/r/cover.1684350871.git.lstoakes@gmail.com Link: https://lkml.kernel.org/r/589e0c64794668ffc799651e8d85e703262b1e9d.1684350871.git.lstoakes@gmail.comSigned-off-by:
Lorenzo Stoakes <lstoakes@gmail.com> Suggested-by:
Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by:
David Hildenbrand <david@redhat.com> Reviewed-by:
Jason Gunthorpe <jgg@nvidia.com> Acked-by: Christian König <christian.koenig@amd.com> (for radeon parts) Acked-by:
Jarkko Sakkinen <jarkko@kernel.org> Reviewed-by:
Christoph Hellwig <hch@lst.de> Acked-by: Sean Christopherson <seanjc@google.com> (KVM) Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Sakari Ailus <sakari.ailus@linux.intel.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org>
-
- 06 Jun, 2023 2 commits
-
-
Michal Luczaj authored
Since c9d60154 ("KVM: allow KVM_BUG/KVM_BUG_ON to handle 64-bit cond") 'cond' is internally converted to boolean, so caller's explicit conversion from void* is unnecessary. Remove the double bang. Signed-off-by:
Michal Luczaj <mhal@rbox.co> Reviewed-by:
Yuan Yao <yuan.yao@intel.com> base-commit: 76a17bf03a268bc342e08c05d8ddbe607d294eb4 Link: https://lore.kernel.org/r/20230605114852.288964-1-mhal@rbox.coSigned-off-by:
Sean Christopherson <seanjc@google.com>
-
Sean Christopherson authored
Now that KVM honors past and in-progress mmu_notifier invalidations when reloading the APIC-access page, use KVM's "standard" invalidation hooks to trigger a reload and delete the one-off usage of invalidate_range(). Aside from eliminating one-off code in KVM, dropping KVM's use of invalidate_range() will allow common mmu_notifier to redefine the API to be more strictly focused on invalidating secondary TLBs that share the primary MMU's page tables. Suggested-by:
Jason Gunthorpe <jgg@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Robin Murphy <robin.murphy@arm.com> Reviewed-by:
Alistair Popple <apopple@nvidia.com> Reviewed-by:
Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20230602011518.787006-3-seanjc@google.comSigned-off-by:
Sean Christopherson <seanjc@google.com>
-