1. 02 Apr, 2022 40 commits
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 38904911
      Linus Torvalds authored
      Pull kvm fixes from Paolo Bonzini:
      
       - Only do MSR filtering for MSRs accessed by rdmsr/wrmsr
      
       - Documentation improvements
      
       - Prevent module exit until all VMs are freed
      
       - PMU Virtualization fixes
      
       - Fix for kvm_irq_delivery_to_apic_fast() NULL-pointer dereferences
      
       - Other miscellaneous bugfixes
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (42 commits)
        KVM: x86: fix sending PV IPI
        KVM: x86/mmu: do compare-and-exchange of gPTE via the user address
        KVM: x86: Remove redundant vm_entry_controls_clearbit() call
        KVM: x86: cleanup enter_rmode()
        KVM: x86: SVM: fix tsc scaling when the host doesn't support it
        kvm: x86: SVM: remove unused defines
        KVM: x86: SVM: move tsc ratio definitions to svm.h
        KVM: x86: SVM: fix avic spec based definitions again
        KVM: MIPS: remove reference to trap&emulate virtualization
        KVM: x86: document limitations of MSR filtering
        KVM: x86: Only do MSR filtering when access MSR by rdmsr/wrmsr
        KVM: x86/emulator: Emulate RDPID only if it is enabled in guest
        KVM: x86/pmu: Fix and isolate TSX-specific performance event logic
        KVM: x86: mmu: trace kvm_mmu_set_spte after the new SPTE was set
        KVM: x86/svm: Clear reserved bits written to PerfEvtSeln MSRs
        KVM: x86: Trace all APICv inhibit changes and capture overall status
        KVM: x86: Add wrappers for setting/clearing APICv inhibits
        KVM: x86: Make APICv inhibit reasons an enum and cleanup naming
        KVM: X86: Handle implicit supervisor access with SMAP
        KVM: X86: Rename variable smap to not_smap in permission_fault()
        ...
      38904911
    • Merge tag 'for-5.18/drivers-2022-04-02' of git://git.kernel.dk/linux-block · 6f34f8c3
      Linus Torvalds authored
      Pull block driver fix from Jens Axboe:
       "Got two reports on nbd spewing warnings on load now, which is a
        regression from a commit that went into your tree yesterday.
      
        Revert the problematic change for now"
      
      * tag 'for-5.18/drivers-2022-04-02' of git://git.kernel.dk/linux-block:
        Revert "nbd: fix possible overflow on 'first_minor' in nbd_dev_add()"
      6f34f8c3
    • Merge tag 'pci-v5.18-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci · 9a212aaf
      Linus Torvalds authored
      Pull pci fix from Bjorn Helgaas:
      
       - Fix Hyper-V "defined but not used" build issue added during merge
         window (YueHaibing)
      
      * tag 'pci-v5.18-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
        PCI: hv: Remove unused hv_set_msi_entry_from_desc()
      9a212aaf
    • Merge tag 'tag-chrome-platform-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux · 02d4f8a3
      Linus Torvalds authored
      
      Pull chrome platform updates from Benson Leung:
       "cros_ec_typec:
      
         - Check for EC device - Fix a crash when using the cros_ec_typec
           driver on older hardware not capable of typec commands
      
         - Make try power role optional
      
         - Mux configuration reorganization series from Prashant
      
        cros_ec_debugfs:
      
         - Fix use after free. Thanks Tzung-bi
      
        sensorhub:
      
         - cros_ec_sensorhub fixup - Split trace include file
      
        misc:
      
         - Add new mailing list for chrome-platform development:
      
      	chrome-platform@lists.linux.dev
      
           Now with patchwork!"
      
      * tag 'tag-chrome-platform-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux:
        platform/chrome: cros_ec_debugfs: detach log reader wq from devm
        platform: chrome: Split trace include file
        platform/chrome: cros_ec_typec: Update mux flags during partner removal
        platform/chrome: cros_ec_typec: Configure muxes at start of port update
        platform/chrome: cros_ec_typec: Get mux state inside configure_mux
        platform/chrome: cros_ec_typec: Move mux flag checks
        platform/chrome: cros_ec_typec: Check for EC device
        platform/chrome: cros_ec_typec: Make try power role optional
        MAINTAINERS: platform-chrome: Add new chrome-platform@lists.linux.dev list
      02d4f8a3
    • Revert "nbd: fix possible overflow on 'first_minor' in nbd_dev_add()" · 7198bfc2
      Jens Axboe authored
      This reverts commit 6d35d04a.
      
      Both Gabriel and Borislav report that this commit causes a regression
      with nbd:
      
      sysfs: cannot create duplicate filename '/dev/block/43:0'
      
      Revert it before 5.18-rc1 and we'll investigate this separately in
      due time.
      
      Link: https://lore.kernel.org/all/YkiJTnFOt9bTv6A2@zn.tnic/
      Reported-by: Gabriel L. Somlo <somlo@cmu.edu>
      Reported-by: Borislav Petkov <bp@alien8.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7198bfc2
    • watch_queue: Free the page array when watch_queue is dismantled · b4902070
      Eric Dumazet authored
      Commit 7ea1a012 ("watch_queue: Free the alloc bitmap when the
      watch_queue is torn down") took care of the bitmap, but not the page
      array.
      
        BUG: memory leak
        unreferenced object 0xffff88810d9bc140 (size 32):
        comm "syz-executor335", pid 3603, jiffies 4294946994 (age 12.840s)
        hex dump (first 32 bytes):
          40 a7 40 04 00 ea ff ff 00 00 00 00 00 00 00 00  @.@.............
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
           kmalloc_array include/linux/slab.h:621 [inline]
           kcalloc include/linux/slab.h:652 [inline]
           watch_queue_set_size+0x12f/0x2e0 kernel/watch_queue.c:251
           pipe_ioctl+0x82/0x140 fs/pipe.c:632
           vfs_ioctl fs/ioctl.c:51 [inline]
           __do_sys_ioctl fs/ioctl.c:874 [inline]
           __se_sys_ioctl fs/ioctl.c:860 [inline]
           __x64_sys_ioctl+0xfc/0x140 fs/ioctl.c:860
           do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      
      Reported-by: syzbot+25ea042ae28f3888727a@syzkaller.appspotmail.com
      Fixes: c73be61c ("pipe: Add general notification queue support")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Link: https://lore.kernel.org/r/20220322004654.618274-1-eric.dumazet@gmail.com/
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b4902070
    • tracing: mark user_events as BROKEN · 1cd927ad
      Steven Rostedt (Google) authored
      After being merged, user_events became more visible to a wider audience
      that has concerns with the current API.
      
      It is too late to fix this for this release, but instead of a full
      revert, just mark it as BROKEN (which prevents it from being selected in
      make config).  Then we can work on finding a better API.  If that fails,
      then it will need to be completely reverted.
      
      To not have the code silently bitrot, still allow building it with
      COMPILE_TEST.
      
      And to prevent the uapi header from being installed, then later changed,
      and then have an old distro user space see the old version, move the
      header file out of the uapi directory.
      
      Wrap the include of the header's current location in CONFIG_COMPILE_TEST;
      when the BROKEN tag is taken off, the include will use the uapi directory
      and fail to compile.  This is a good way to remind us to move the header
      back.
      
      Link: https://lore.kernel.org/all/20220330155835.5e1f6669@gandalf.local.home
      Link: https://lkml.kernel.org/r/20220330201755.29319-1-mathieu.desnoyers@efficios.com
      Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1cd927ad
    • KVM: x86: fix sending PV IPI · c15e0ae4
      Li RongQing authored
      If apic_id is less than min, and (max - apic_id) is greater than
      KVM_IPI_CLUSTER_SIZE, then the third check condition is satisfied but
      the new apic_id does not fit the bitmask.  In this case __send_ipi_mask
      should send the IPI.
      
      This is mostly theoretical, but it can happen if the apic_ids on three
      iterations of the loop are for example 1, KVM_IPI_CLUSTER_SIZE, 0.
      
      Fixes: aaffcfd1 ("KVM: X86: Implement PV IPIs in linux guest")
      Signed-off-by: Li RongQing <lirongqing@baidu.com>
      Message-Id: <1646814944-51801-1-git-send-email-lirongqing@baidu.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c15e0ae4
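The PV IPI scenario above can be modelled in user space. The following is a simplified sketch, not the kernel's __send_ipi_mask(): `CLUSTER_SIZE`, `count_flushes()` and the `fixed` switch are hypothetical names, the bitmap is a single 64-bit word, and the extra `id > min` guard in the third check is an assumption about the shape of the fix. A "flush" stands in for sending the accumulated bitmap.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative window size (assumption); chosen to fit one 64-bit word. */
#define CLUSTER_SIZE 64

/*
 * Walk ids[0..n-1], batching them into a bitmap window that starts at `min`.
 * Returns the number of flushes needed, or -1 if an id would be recorded at
 * an offset outside the window (the situation the commit describes).
 * With fixed == false, the third check omits the "id > min" guard.
 */
int count_flushes(const int *ids, size_t n, bool fixed)
{
    unsigned long long bitmap = 0;
    int min = 0, max = 0, flushes = 0;

    for (size_t i = 0; i < n; i++) {
        int id = ids[i];

        if (!bitmap) {
            min = max = id;                     /* start a new window */
        } else if (id < min && max - id < CLUSTER_SIZE) {
            bitmap <<= min - id;                /* grow the window downwards */
            min = id;
        } else if ((!fixed || id > min) && id < min + CLUSTER_SIZE) {
            max = id < max ? max : id;          /* id fits above min */
        } else {
            flushes++;                          /* send what we have */
            min = max = id;
            bitmap = 0;
        }

        if (id - min < 0 || id - min >= CLUSTER_SIZE)
            return -1;                          /* id does not fit the bitmask */
        bitmap |= 1ULL << (id - min);
    }
    return bitmap ? flushes + 1 : flushes;      /* final flush for the remainder */
}
```

With the ids 1, CLUSTER_SIZE, 0 from the commit message, the unguarded check accepts the third id although it lies below `min`, so the model reports -1; with the guard, the window is flushed first and two flushes are needed in total.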
    • KVM: x86/mmu: do compare-and-exchange of gPTE via the user address · 2a8859f3
      Paolo Bonzini authored
      FNAME(cmpxchg_gpte) is an inefficient mess.  It is at least decent if it
      can go through get_user_pages_fast(), but if it cannot then it tries to
      use memremap(); that is not just terribly slow, it is also wrong because
      it assumes that the VM_PFNMAP VMA is contiguous.
      
      The right way to do it would be to do the same thing as
      hva_to_pfn_remapped() does since commit add6a0cd ("KVM: MMU: try to
      fix up page faults before giving up", 2016-07-05), using follow_pte()
      and fixup_user_fault() to determine the correct address to use for
      memremap().  To do this, one could for example extract hva_to_pfn()
      for use outside virt/kvm/kvm_main.c.  But really there is no reason to
      do that either, because there is already a perfectly valid address to
      do the cmpxchg() on, only it is a userspace address.  That means doing
      user_access_begin()/user_access_end() and writing the code in assembly
      to handle exceptions correctly.  Worse, the guest PTE can be 8-byte
      even on i686 so there is the extra complication of using cmpxchg8b to
      account for.  But at least it is an efficient mess.
      
      (Thanks to Linus for suggesting improvement on the inline assembly).
      Reported-by: Qiuhao Li <qiuhao@sysec.org>
      Reported-by: Gaoning Pan <pgn@zju.edu.cn>
      Reported-by: Yongkang Jia <kangel@zju.edu.cn>
      Reported-by: syzbot+6cde2282daa792c49ab8@syzkaller.appspotmail.com
      Debugged-by: Tadeusz Struk <tadeusz.struk@linaro.org>
      Tested-by: Maxim Levitsky <mlevitsk@redhat.com>
      Cc: stable@vger.kernel.org
      Fixes: bd53cb35 ("X86/KVM: Handle PFNs outside of kernel reach when touching GPTEs")
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2a8859f3
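As a rough user-space analogy for the compare-and-exchange on a guest PTE, the sketch below uses C11 atomics on an ordinary 64-bit variable. It does not model the user_access_begin()/user_access_end() bracket or the exception handling the commit talks about; `demo_set_accessed()` and `DEMO_PTE_ACCESSED` are hypothetical names, and using bit 5 as the accessed bit follows the x86 page-table format.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Accessed bit of an x86 page-table entry (bit 5). */
#define DEMO_PTE_ACCESSED (1ULL << 5)

/*
 * Try to set the accessed bit, but only if the entry still holds the value
 * the walker originally read. Returns true when the update was applied and
 * false when the entry changed underneath us, in which case the caller
 * would have to redo the walk.
 */
bool demo_set_accessed(_Atomic uint64_t *pte, uint64_t expected)
{
    return atomic_compare_exchange_strong(pte, &expected,
                                          expected | DEMO_PTE_ACCESSED);
}
```

A second attempt with the same stale `expected` value fails, which is the property that makes the compare-and-exchange safe against a concurrent guest write to the same entry.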
    • KVM: x86: Remove redundant vm_entry_controls_clearbit() call · 4335edbb
      Zhenzhong Duan authored
      When emulating exit from long mode, EFER_LMA is cleared with
      vmx_set_efer().  This will already unset the VM_ENTRY_IA32E_MODE control
      bit as requested by SDM, so there is no need to unset VM_ENTRY_IA32E_MODE
      again in exit_lmode() explicitly.  In case EFER isn't supported by
      hardware, long mode isn't supported, so exit_lmode() cannot be reached.
      
      Note that, thanks to the shadow controls mechanism, this change doesn't
      eliminate vmread or vmwrite.
      Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
      Message-Id: <20220311102643.807507-3-zhenzhong.duan@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4335edbb
    • KVM: x86: cleanup enter_rmode() · b76edfe9
      Zhenzhong Duan authored
      vmx_set_efer() sets uret->data but, in fact, if the value of uret->data
      will be used, vmx_setup_uret_msrs() will have rewritten it with the value
      returned by update_transition_efer().  uret->data is consumed if and only
      if uret->load_into_hardware is true, and vmx_setup_uret_msrs() takes care
      of (a) updating uret->data before setting uret->load_into_hardware to true
      (b) setting uret->load_into_hardware to false if uret->data isn't updated.
      
      Opportunistically use "vmx" directly instead of redoing to_vmx().
      Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
      Message-Id: <20220311102643.807507-2-zhenzhong.duan@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b76edfe9
    • KVM: x86: SVM: fix tsc scaling when the host doesn't support it · 88099313
      Maxim Levitsky authored
      It was decided that when TSC scaling is not supported,
      the virtual MSR_AMD64_TSC_RATIO should still have the default '1.0'
      value.
      
      However in this case kvm_max_tsc_scaling_ratio is not set,
      which breaks various assumptions.
      
      Fix this by always calculating kvm_max_tsc_scaling_ratio regardless of
      host support.  For consistency, do the same for VMX.
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220322172449.235575-8-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      88099313
    • kvm: x86: SVM: remove unused defines · f37b735e
      Maxim Levitsky authored
      Remove some unused #defines from svm.c
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220322172449.235575-7-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f37b735e
    • KVM: x86: SVM: move tsc ratio definitions to svm.h · bb2aa78e
      Maxim Levitsky authored
      Another piece of SVM spec which should be in the header file
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220322172449.235575-6-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      bb2aa78e
    • KVM: x86: SVM: fix avic spec based definitions again · 0dacc3df
      Maxim Levitsky authored
      Due to a wrong rebase, commit
      4a204f78 ("KVM: SVM: Allow AVIC support on system w/ physical APIC ID > 255")
      
      moved avic spec #defines back to avic.c.
      
      Move them back, and while at it extend AVIC_DOORBELL_PHYSICAL_ID_MASK to 12
      bits as well (it will be used in nested avic)
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220322172449.235575-5-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0dacc3df
    • KVM: MIPS: remove reference to trap&emulate virtualization · fe5f6914
      Paolo Bonzini authored
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220313140522.1307751-1-pbonzini@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fe5f6914
    • KVM: x86: document limitations of MSR filtering · ce2f72e2
      Paolo Bonzini authored
      MSR filtering requires an exit to userspace that is hard to implement and
      would be very slow in the case of nested VMX vmexit and vmentry MSR
      accesses.  Document the limitation.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ce2f72e2
    • KVM: x86: Only do MSR filtering when access MSR by rdmsr/wrmsr · ac8d6cad
      Hou Wenlong authored
      If MSR access is rejected by MSR filtering,
      kvm_set_msr()/kvm_get_msr() return KVM_MSR_RET_FILTERED,
      and that return value is only handled well for rdmsr/wrmsr.
      However, some instruction emulation and state transitions also
      use kvm_set_msr()/kvm_get_msr() to access MSRs and may trigger
      unexpected results if the access is rejected.  E.g. RDPID
      emulation would inject a #UD, but RDPID wouldn't cause an exit
      when RDPID is supported in hardware and ENABLE_RDTSCP is set.
      It would also cause failures when loading MSRs at nested entry/exit.
      Since MSR filtering is based on the MSR bitmap, it is better to only
      do MSR filtering for rdmsr/wrmsr.
      Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
      Message-Id: <2b2774154f7532c96a6f04d71c82a8bec7d9e80b.1646655860.git.houwenlong.hwl@antgroup.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ac8d6cad
    • KVM: x86/emulator: Emulate RDPID only if it is enabled in guest · a836839c
      Hou Wenlong authored
      When RDTSCP is supported but RDPID is not supported in host,
      RDPID emulation is available. However, __kvm_get_msr() would
      only fail when RDTSCP/RDPID both are disabled in guest, so
      the emulator wouldn't inject a #UD when RDPID is disabled but
      RDTSCP is enabled in guest.
      
      Fixes: fb6d4d34 ("KVM: x86: emulate RDPID")
      Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
      Message-Id: <1dfd46ae5b76d3ed87bde3154d51c64ea64c99c1.1646226788.git.houwenlong.hwl@antgroup.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a836839c
    • KVM: x86/pmu: Fix and isolate TSX-specific performance event logic · e644896f
      Like Xu authored
      HSW_IN_TX* bits are used in generic code which are not supported on
      AMD. Worse, these bits overlap with AMD EventSelect[11:8] and hence
      using HSW_IN_TX* bits unconditionally in generic code is resulting in
      unintentional pmu behavior on AMD. For example, if EventSelect[11:8]
      is 0x2, pmc_reprogram_counter() wrongly assumes that
      HSW_IN_TX_CHECKPOINTED is set and thus forces sampling period to be 0.
      
      Also per the SDM, both bits 32 and 33 "may only be set if the processor
      supports HLE or RTM" and for "IN_TXCP (bit 33): this bit may only be set
      for IA32_PERFEVTSEL2."
      
      Opportunistically eliminate code redundancy, because if the HSW_IN_TX*
      bit is set in pmc->eventsel, it is already set in attr.config.
      Reported-by: Ravi Bangoria <ravi.bangoria@amd.com>
      Reported-by: Jim Mattson <jmattson@google.com>
      Fixes: 103af0a9 ("perf, kvm: Support the in_tx/in_tx_cp modifiers in KVM arch perfmon emulation v5")
      Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com>
      Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
      Signed-off-by: Like Xu <likexu@tencent.com>
      Message-Id: <20220309084257.88931-1-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e644896f
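The bit overlap described in the TSX commit can be checked with a little arithmetic. This is a minimal sketch with hypothetical `DEMO_*` names and helpers; it assumes the AMD layout in which EventSelect[7:0] occupies bits 7:0 and EventSelect[11:8] occupies bits 35:32 of the event-select MSR, and takes bits 32 and 33 for IN_TX and IN_TXCP from the SDM quote above.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_IN_TX   (1ULL << 32)  /* Intel IN_TX, bit 32 */
#define DEMO_IN_TXCP (1ULL << 33)  /* Intel IN_TXCP, bit 33 */

/* Encode a 12-bit AMD event number into the event-select layout. */
uint64_t demo_amd_eventsel(unsigned int event12)
{
    uint64_t low  = event12 & 0xffu;          /* EventSelect[7:0]  -> bits 7:0   */
    uint64_t high = (event12 >> 8) & 0xfu;    /* EventSelect[11:8] -> bits 35:32 */

    return low | (high << 32);
}

/* What vendor-agnostic code would conclude if it tested the Intel bit. */
bool demo_looks_like_in_txcp(uint64_t eventsel)
{
    return (eventsel & DEMO_IN_TXCP) != 0;
}
```

An AMD event whose EventSelect[11:8] is 0x2 sets bit 33 and is therefore indistinguishable from an Intel event with IN_TXCP set, unless the check is restricted to Intel.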
    • KVM: x86: mmu: trace kvm_mmu_set_spte after the new SPTE was set · 5959ff4a
      Maxim Levitsky authored
      It makes more sense to print the new SPTE value than the
      old value.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220302102457.588450-1-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5959ff4a
    • KVM: x86/svm: Clear reserved bits written to PerfEvtSeln MSRs · 9b026073
      Jim Mattson authored
      AMD EPYC CPUs never raise a #GP for a WRMSR to a PerfEvtSeln MSR. Some
      reserved bits are cleared, and some are not. Specifically, on
      Zen3/Milan, bits 19 and 42 are not cleared.
      
      When emulating such a WRMSR, KVM should not synthesize a #GP,
      regardless of which bits are set. However, undocumented bits should
      not be passed through to the hardware MSR. So, rather than checking
      for reserved bits and synthesizing a #GP, just clear the reserved
      bits.
      
      This may seem pedantic, but since KVM currently does not support the
      "Host/Guest Only" bits (41:40), it is necessary to clear these bits
      rather than synthesizing #GP, because some popular guests (e.g. Linux)
      will set the "Host Only" bit even on CPUs that don't support
      EFER.SVME, and they don't expect a #GP.
      
      For example,
      
      root@Ubuntu1804:~# perf stat -e r26 -a sleep 1
      
       Performance counter stats for 'system wide':
      
                       0      r26
      
             1.001070977 seconds time elapsed
      
      Feb 23 03:59:58 Ubuntu1804 kernel: [  405.379957] unchecked MSR access error: WRMSR to 0xc0010200 (tried to write 0x0000020000130026) at rIP: 0xffffffff9b276a28 (native_write_msr+0x8/0x30)
      Feb 23 03:59:58 Ubuntu1804 kernel: [  405.379958] Call Trace:
      Feb 23 03:59:58 Ubuntu1804 kernel: [  405.379963]  amd_pmu_disable_event+0x27/0x90
      
      Fixes: ca724305 ("KVM: x86/vPMU: Implement AMD vPMU code for KVM")
      Reported-by: Lotus Fenn <lotusf@google.com>
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Like Xu <likexu@tencent.com>
      Reviewed-by: David Dunn <daviddunn@google.com>
      Message-Id: <20220226234131.2167175-1-jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      9b026073
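The difference between rejecting a write and clearing its reserved bits can be sketched as follows. The names and the reserved mask are hypothetical: the mask covers only the "Host/Guest Only" bits (41:40) named in the commit message, not the full set of reserved bits of a real PerfEvtSeln MSR, and a non-zero return stands in for synthesizing a #GP.

```c
#include <stdint.h>

/* Illustrative reserved mask (assumption): just the Host/Guest Only bits 41:40. */
#define DEMO_RESERVED_BITS (3ULL << 40)

/* Strict policy: refuse the write if any reserved bit is set. */
int demo_write_eventsel_strict(uint64_t *msr, uint64_t data)
{
    if (data & DEMO_RESERVED_BITS)
        return 1;                       /* stand-in for injecting #GP */
    *msr = data;
    return 0;
}

/* Lenient policy from the commit: drop reserved bits, accept the write. */
int demo_write_eventsel_lenient(uint64_t *msr, uint64_t data)
{
    *msr = data & ~DEMO_RESERVED_BITS;
    return 0;
}
```

The value 0x0000020000130026 from the log excerpt has bit 41 set. Under the strict policy the write is refused; under the lenient one it succeeds and the stored value is 0x130026.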
    • KVM: x86: Trace all APICv inhibit changes and capture overall status · 4f4c4a3e
      Sean Christopherson authored
      Trace all APICv inhibit changes instead of just those that result in
      APICv being (un)inhibited, and log the current state.  Debugging why
      APICv isn't working is frustrating as it's hard to see why APICv is still
      inhibited, and logging only the first inhibition means unnecessary onion
      peeling.
      
      Opportunistically drop the export of the tracepoint, it is not and should
      not be used by vendor code due to the need to serialize toggling via
      apicv_update_lock.
      
      Note, using the common flow means kvm_apicv_init() switched from atomic
      to non-atomic bitwise operations.  The VM is unreachable at init, so
      non-atomic is perfectly ok.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220311043517.17027-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4f4c4a3e
    • KVM: x86: Add wrappers for setting/clearing APICv inhibits · 320af55a
      Sean Christopherson authored
      Add set/clear wrappers for toggling APICv inhibits to make the call sites
      more readable, and opportunistically rename the inner helpers to align
      with the new wrappers and to make them more readable as well.  Invert the
      flag from "activate" to "set"; activate is painfully ambiguous as it's
      not obvious if the inhibit is being activated, or if APICv is being
      activated, in which case the inhibit is being deactivated.
      
      For the functions that take @set, swap the order of the inhibit reason
      and @set so that the call sites are visually similar to those that bounce
      through the wrapper.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220311043517.17027-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      320af55a
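The wrapper pattern described in the commit above might look roughly like this. It is a sketch under assumptions: every `demo_*` identifier and both reason names are hypothetical, and the locking and tracing of the real code are left out. The inner helper takes the reason first and the `set` flag second, so call sites read the same whether or not they go through a wrapper.

```c
#include <stdbool.h>

/* Hypothetical inhibit reasons; the values only serve as bit positions. */
enum demo_inhibit_reason {
    DEMO_INHIBIT_DISABLED,
    DEMO_INHIBIT_NESTED,
};

/* Inner helper: "set" refers to the inhibit, never to APICv itself. */
static void demo_set_or_clear_inhibit(unsigned long *inhibits,
                                      enum demo_inhibit_reason reason,
                                      bool set)
{
    if (set)
        *inhibits |= 1UL << reason;
    else
        *inhibits &= ~(1UL << reason);
}

void demo_set_inhibit(unsigned long *inhibits, enum demo_inhibit_reason reason)
{
    demo_set_or_clear_inhibit(inhibits, reason, true);
}

void demo_clear_inhibit(unsigned long *inhibits, enum demo_inhibit_reason reason)
{
    demo_set_or_clear_inhibit(inhibits, reason, false);
}

/* APICv can be active only while no inhibit reason is set. */
bool demo_apicv_active(unsigned long inhibits)
{
    return inhibits == 0;
}
```

Clearing one of two pending reasons leaves APICv inhibited, which is why tracing every change, and not only the first inhibition, helps when debugging.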
    • KVM: x86: Make APICv inhibit reasons an enum and cleanup naming · 7491b7b2
      Sean Christopherson authored
      Use an enum for the APICv inhibit reasons, there is no meaning behind
      their values and they most definitely are not "unsigned longs".  Rename
      the various params to "reason" for consistency and clarity (inhibit may
      be confused as a command, i.e. inhibit APICv, instead of the reason that
      is getting toggled/checked).
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220311043517.17027-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7491b7b2
    • KVM: X86: Handle implicit supervisor access with SMAP · 4f4aa80e
      Lai Jiangshan authored
      There are two kinds of implicit supervisor access:
      	implicit supervisor access when CPL = 3
      	implicit supervisor access when CPL < 3
      
      The current permission_fault() handles only the first kind for SMAP.
      
      But if the access is implicit when SMAP is on, data may be neither read
      from nor written to any user-mode address, regardless of the current CPL.
      
      So the second kind should also be supported.
      
      The first kind can be detected via CPL and access mode: if it is a
      supervisor access and CPL = 3, it must be an implicit supervisor access.
      
      But it is not possible to detect the second kind without extra
      information, so this patch adds an artificial PFERR_EXPLICIT_ACCESS
      into @access. This extra information also works for the first kind, so
      the logic is changed to use this information for both cases.
      
      The value of PFERR_EXPLICIT_ACCESS is deliberately chosen to be bit 48,
      which is in the most significant 16 bits of a u64 and is less likely to
      be forced to change because future hardware uses it.
      
      This patch removes the call to ->get_cpl() because the access mode is
      determined by @access.  Not only does this save a function call, it also
      removes confusion when the permission is checked for nested TDP.  Nested
      TDP shouldn't have SMAP checking, nor should L2's CPL have any bearing
      on it.  The original code works only because it is always a user walk for
      NPT and the SMAP fault is not set for EPT in update_permission_bitmask.
      Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
      Message-Id: <20220311070346.45023-5-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4f4aa80e
    • KVM: X86: Rename variable smap to not_smap in permission_fault() · 8873c143
      Lai Jiangshan authored
      The comment above the variable says the bit is set when SMAP is
      overridden or, with the same meaning as in update_permission_bitmask(),
      when the access is not subject to the SMAP restriction.
      
      Rename it to reflect the negative sense and make the code more
      readable.
      Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
      Message-Id: <20220311070346.45023-4-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8873c143
    • KVM: X86: Fix comments in update_permission_bitmask · 94b4a2f1
      Lai Jiangshan authored
      Commit 09f037aa ("KVM: MMU: speedup update_permission_bitmask")
      refactored the code of update_permission_bitmask() and changed the
      comments.  It added a condition to a list to match the new code,
      so the number/order of the conditions in the comments should be updated
      too.
      Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
      Message-Id: <20220311070346.45023-3-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      94b4a2f1
    • KVM: X86: Change the type of access u32 to u64 · 5b22bbe7
      Lai Jiangshan authored
      Change the type of access u32 to u64 for FNAME(walk_addr) and
      ->gva_to_gpa().
      
      The kinds of accesses are usually combinations of UWX, and VMX/SVM's
      nested paging adds a new factor of access: is it an access for a guest
      page table or for a final guest physical address.
      
      And SMAP relies on a factor for supervisor access: explicit or implicit.
      
      So @access in FNAME(walk_addr) and ->gva_to_gpa() is better off including
      all this information to do the walk.
      
      Although @access(u32) has enough bits to encode all the kinds, this
      patch extends it to u64:
      	o Extra bits will be in the higher 32 bits, so that we can
      	  easily obtain the traditional access mode (UWX) by converting
      	  it to u32.
      	o Reuse the value for the access kind defined by SVM's nested
      	  paging (PFERR_GUEST_FINAL_MASK and PFERR_GUEST_PAGE_MASK) as
      	  @error_code in kvm_handle_page_fault().
      Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
      Message-Id: <20220311070346.45023-2-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5b22bbe7
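The idea of keeping extra access kinds in the upper 32 bits of a u64, so that a plain conversion to u32 yields the traditional access mode, can be illustrated as below. All `DEMO_*` names and helpers are hypothetical. The low bits follow the x86 page-fault error code layout (write = bit 1, user = bit 2, fetch = bit 4), and bit 48 for the explicit-access flag is the position given in the SMAP commit of this series.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_WRITE           (1ULL << 1)
#define DEMO_USER            (1ULL << 2)
#define DEMO_FETCH           (1ULL << 4)
#define DEMO_EXPLICIT_ACCESS (1ULL << 48)  /* lives above the low 32 bits */

/* Truncating to 32 bits drops every extended flag and keeps UWX. */
uint32_t demo_traditional_access(uint64_t access)
{
    return (uint32_t)access;
}

/* The extended flag is only visible in the full 64-bit value. */
bool demo_is_explicit(uint64_t access)
{
    return (access & DEMO_EXPLICIT_ACCESS) != 0;
}
```

A user-mode write tagged as explicit, `DEMO_WRITE | DEMO_USER | DEMO_EXPLICIT_ACCESS`, truncates to 0x6, so code that only understands the traditional bits is unaffected by the extra flag.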
    • KVM: Remove dirty handling from gfn_to_pfn_cache completely · cf1d88b3
      David Woodhouse authored
      It isn't OK to cache the dirty status of a page in internal structures
      for an indefinite period of time.
      
      Any time a vCPU exits the run loop to userspace might be its last; the
      VMM might do its final check of the dirty log, flush the last remaining
      dirty pages to the destination and complete a live migration. If we
      have internal 'dirty' state which doesn't get flushed until the vCPU
      is finally destroyed on the source after migration is complete, then
      we have lost data because that will escape the final copy.
      
      This problem already exists with the use of kvm_vcpu_unmap() to mark
      pages dirty in e.g. VMX nesting.
      
      Note that the actual Linux MM already considers the page to be dirty
      since we have a writeable mapping of it. This is just about the KVM
      dirty logging.
      
      For the nesting-style use cases (KVM_GUEST_USES_PFN) we will need to
      track which gfn_to_pfn_caches have been used and explicitly mark the
      corresponding pages dirty before returning to userspace. But we would
      have needed external tracking of that anyway, rather than walking the
      full list of GPCs to find those belonging to this vCPU which are dirty.
      
      So let's rely *solely* on that external tracking, and keep it simple
      rather than laying a tempting trap for callers to fall into.
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220303154127.202856-3-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      cf1d88b3
    • KVM: Use enum to track if cached PFN will be used in guest and/or host · d0d96121
      Sean Christopherson authored
      Replace the guest_uses_pa and kernel_map booleans in the PFN cache code
      with a unified enum/bitmask. Using explicit names makes it easier to
      review and audit call sites.
      
      Opportunistically add a WARN to prevent passing garbage; instantating a
      cache without declaring its usage is either buggy or pointless.
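
      A minimal sketch of the pattern, for illustration only: two booleans
      collapse into one bitmask enum, and the init helper rejects a cache
      whose usage is empty or unknown. All toy_/TOY_ names are hypothetical
      and are not the kernel's identifiers; the -1 return stands in for the
      WARN.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical usage flags: one bit per consumer of the cached PFN. */
enum toy_pfn_usage {
	TOY_GUEST_USES_PFN = 1 << 0,
	TOY_HOST_USES_PFN  = 1 << 1,
	TOY_GUEST_AND_HOST_USE_PFN = TOY_GUEST_USES_PFN | TOY_HOST_USES_PFN,
};

struct toy_cache {
	enum toy_pfn_usage usage;
};

/* Returns 0 on success, -1 if the caller declared no usage or unknown bits. */
int toy_cache_init(struct toy_cache *c, unsigned int usage)
{
	assert(c);
	if (usage == 0 || (usage & ~(unsigned int)TOY_GUEST_AND_HOST_USE_PFN))
		return -1;	/* stands in for the WARN described above */
	c->usage = (enum toy_pfn_usage)usage;
	return 0;
}

/* Call sites test a named flag instead of an anonymous boolean. */
bool toy_cache_guest_uses(const struct toy_cache *c)
{
	return (c->usage & TOY_GUEST_USES_PFN) != 0;
}
```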
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220303154127.202856-2-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d0d96121
    • Peter Gonda's avatar
      KVM: SVM: Fix kvm_cache_regs.h inclusions for is_guest_mode() · 4a9e7b9e
      Peter Gonda authored
      Include kvm_cache_regs.h to pick up the definition of is_guest_mode(),
      which is referenced by nested_svm_virtualize_tpr() in svm.h. Remove the
      include from svm_onhyperv.c, which was added only because svm.h lacked
      it.
      
      Fixes: 883b0a91 ("KVM: SVM: Move Nested SVM Implementation to nested.c")
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Peter Gonda <pgonda@google.com>
      Message-Id: <20220304161032.2270688-1-pgonda@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4a9e7b9e
    • Jim Mattson's avatar
      KVM: x86/pmu: Use different raw event masks for AMD and Intel · 95b065bf
      Jim Mattson authored
      The third nybble of AMD's event select overlaps with Intel's IN_TX and
      IN_TXCP bits. Therefore, we can't use AMD64_RAW_EVENT_MASK on Intel
      platforms that support TSX.
      
      Declare a raw_event_mask in the kvm_pmu structure, initialize it in
      the vendor-specific pmu_refresh() functions, and use that mask for
      PERF_TYPE_RAW configurations in reprogram_gp_counter().
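
      The overlap can be sketched as follows. This is an illustration under
      stated assumptions, not the kernel code: the TOY_ constants restate the
      event-select bit positions (AMD event select bits 35:32, Intel IN_TX
      bit 32, IN_TXCP bit 33), and toy_pmu, toy_pmu_refresh() and
      toy_raw_config() are hypothetical stand-ins for the per-vendor mask
      described above.

```c
#include <assert.h>
#include <stdint.h>

/* Event-select bit layout; the TOY_ names are local to this sketch. */
#define TOY_EVENTSEL_EVENT  0x000000FFULL   /* event select, bits 7:0 */
#define TOY_EVENTSEL_UMASK  0x0000FF00ULL   /* unit mask, bits 15:8 */
#define TOY_EVENTSEL_EDGE   (1ULL << 18)
#define TOY_EVENTSEL_INV    (1ULL << 23)
#define TOY_EVENTSEL_CMASK  0xFF000000ULL   /* counter mask, bits 31:24 */
#define TOY_AMD_EVENT_HI    (0x0FULL << 32) /* AMD event select, bits 35:32 */
#define TOY_INTEL_IN_TX     (1ULL << 32)
#define TOY_INTEL_IN_TXCP   (1ULL << 33)

#define TOY_INTEL_RAW_MASK (TOY_EVENTSEL_EVENT | TOY_EVENTSEL_UMASK | \
			    TOY_EVENTSEL_EDGE | TOY_EVENTSEL_INV | \
			    TOY_EVENTSEL_CMASK)
#define TOY_AMD_RAW_MASK   (TOY_INTEL_RAW_MASK | TOY_AMD_EVENT_HI)

struct toy_pmu {
	uint64_t raw_event_mask;
};

/* Stand-in for the vendor-specific refresh hooks: pick the mask once. */
void toy_pmu_refresh(struct toy_pmu *pmu, int is_amd)
{
	assert(pmu);
	pmu->raw_event_mask = is_amd ? TOY_AMD_RAW_MASK : TOY_INTEL_RAW_MASK;
}

/* Build the raw perf config from a guest-written event select. */
uint64_t toy_raw_config(const struct toy_pmu *pmu, uint64_t eventsel)
{
	return eventsel & pmu->raw_event_mask;
}
```

      With the Intel mask, the IN_TX and IN_TXCP bits no longer leak into the
      raw event encoding; with the AMD mask, bits 35:32 are preserved as part
      of the event select.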
      
      Fixes: 710c4765 ("KVM: x86/pmu: Use AMD64_RAW_EVENT_MASK for PERF_TYPE_RAW")
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Message-Id: <20220308012452.3468611-1-jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      95b065bf
    • Sean Christopherson's avatar
      KVM: Don't actually set a request when evicting vCPUs for GFN cache invd · df06dae3
      Sean Christopherson authored
      Don't actually set a request bit in vcpu->requests when making a request
      purely to force a vCPU to exit the guest.  Logging a request but not
      actually consuming it would cause the vCPU to get stuck in an infinite
      loop during KVM_RUN because KVM would see the pending request and bail
      from VM-Enter to service the request.
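
      A toy model of the hazard, for illustration only (the toy_/TOY_ names
      and the bounded retry count are assumptions of this sketch, not KVM
      code): a request bit that no handler clears keeps the run loop from
      ever entering the guest.

```c
#include <assert.h>

#define TOY_REQ_HANDLED (1u << 0)	/* a request with a handler */
#define TOY_REQ_ORPHAN  (1u << 1)	/* a request nothing ever consumes */

struct toy_vcpu {
	unsigned int requests;
};

/* Only the handled request is consumed; the orphan bit stays set. */
static void toy_service_requests(struct toy_vcpu *v)
{
	v->requests &= ~TOY_REQ_HANDLED;
}

/*
 * Try to enter the guest at most max_tries times.  Returns the attempt on
 * which entry succeeded, or -1 if a pending request blocked every attempt.
 */
int toy_vcpu_run(struct toy_vcpu *v, int max_tries)
{
	assert(v && max_tries > 0);
	for (int i = 1; i <= max_tries; i++) {
		toy_service_requests(v);
		if (v->requests == 0)
			return i;	/* nothing pending: VM-Enter proceeds */
		/* pending request: bail from entry and go around again */
	}
	return -1;
}
```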
      
      Note, it's currently impossible for KVM to set KVM_REQ_GPC_INVALIDATE as
      nothing in KVM is wired up to set guest_uses_pa=true.  But, it'd be all
      too easy for arch code to introduce use of kvm_gfn_to_pfn_cache_init()
      without implementing handling of the request, especially since getting
      test coverage of MMU notifier interaction with specific KVM features
      usually requires a directed test.
      
      Opportunistically rename gfn_to_pfn_cache_invalidate_start()'s wake_vcpus
      to evict_vcpus.  The purpose of the request is to get vCPUs out of guest
      mode; it's supposed to _avoid_ waking vCPUs that are blocking.
      
      Opportunistically rename KVM_REQ_GPC_INVALIDATE to be more specific as to
      what it wants to accomplish, and to genericize the name so that it can be
      used for similar but unrelated scenarios, should they arise in the future.
      Add a comment and documentation to explain why the "no action" request
      exists.
      
      Add compile-time assertions to help detect improper usage.  Use the inner
      assertless helper in the one s390 path that makes requests without a
      hardcoded request.
      
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220223165302.3205276-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      df06dae3
    • David Woodhouse's avatar
      KVM: avoid double put_page with gfn-to-pfn cache · 79593c08
      David Woodhouse authored
      If the cache's user host virtual address becomes invalid, there
      is still a path from kvm_gfn_to_pfn_cache_refresh() where __release_gpc()
      could release the pfn but the gpc->pfn field has not been overwritten
      with an error value.  If this happens, kvm_gfn_to_pfn_cache_unmap will
      call put_page again on the same page.
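
      A minimal sketch of the general remedy, assuming a toy refcounted page
      (toy_page, toy_gpc and the helpers are hypothetical, and NULL stands in
      for the error value): invalidate the cached reference at the point of
      release, so a later unmap has nothing left to put.

```c
#include <assert.h>
#include <stddef.h>

struct toy_page {
	int refcount;
};

struct toy_gpc {
	struct toy_page *page;	/* NULL plays the role of the error value */
};

static void toy_put_page(struct toy_page *p)
{
	assert(p->refcount > 0);	/* a double put would trip this */
	p->refcount--;
}

/* Drop the reference and overwrite the cached field in the same step. */
void toy_gpc_release(struct toy_gpc *gpc)
{
	if (gpc->page) {
		toy_put_page(gpc->page);
		gpc->page = NULL;
	}
}

/* Unmap after an earlier release is now a harmless no-op. */
void toy_gpc_unmap(struct toy_gpc *gpc)
{
	toy_gpc_release(gpc);
}
```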
      
      Cc: stable@vger.kernel.org
      Fixes: 982ed0de ("KVM: Reinstate gfn_to_pfn_cache with invalidation support")
      Signed-off-by: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      79593c08
    • Sean Christopherson's avatar
      KVM: x86/mmu: Zap only TDP MMU leafs in zap range and mmu_notifier unmap · f47e5bbb
      Sean Christopherson authored
      Re-introduce zapping only leaf SPTEs in kvm_zap_gfn_range() and
      kvm_tdp_mmu_unmap_gfn_range(), this time without losing a pending TLB
      flush when processing multiple roots (including nested TDP shadow roots).
      Dropping the TLB flush resulted in random crashes when running Hyper-V
      Server 2019 in a guest with KSM enabled in the host (or any source of
      mmu_notifier invalidations, KSM is just the easiest to force).
      
      This effectively reverts commits 873dd122
      and fcb93eb6, and thus restores commit
      cf3e2642, plus this delta on top:
      
      bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
              struct kvm_mmu_page *root;
      
              for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
      -               flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, false);
      +               flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush);
      
              return flush;
       }
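
      Why the one-word delta matters can be shown with a toy model
      (illustrative only; toy_zap_root() and the boolean array of roots are
      assumptions of this sketch): passing false on every iteration lets a
      later root overwrite the pending flush recorded for an earlier one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Zap one root.  Returns true if a TLB flush is pending, either carried in
 * through 'flush' or newly required because this root had SPTEs to zap.
 */
static bool toy_zap_root(bool root_had_sptes, bool flush)
{
	return flush || root_had_sptes;
}

/* Buggy form: each iteration starts from false and drops earlier results. */
bool toy_zap_all_buggy(const bool *roots, size_t n)
{
	bool flush = false;

	assert(roots || n == 0);
	for (size_t i = 0; i < n; i++)
		flush = toy_zap_root(roots[i], false);
	return flush;
}

/* Fixed form: the accumulated flush is threaded through every root. */
bool toy_zap_all_fixed(const bool *roots, size_t n)
{
	bool flush = false;

	assert(roots || n == 0);
	for (size_t i = 0; i < n; i++)
		flush = toy_zap_root(roots[i], flush);
	return flush;
}
```

      With roots {true, false}, the buggy form reports that no flush is
      needed, while the fixed form keeps the flush pending.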
      
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220325230348.2587437-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f47e5bbb
    • Yi Wang's avatar
      KVM: SVM: fix panic on out-of-bounds guest IRQ · a80ced6e
      Yi Wang authored
      Because guest_irq comes from the KVM_IRQFD API call, it may trigger a
      crash in svm_update_pi_irte() due to an out-of-bounds access:
      
      crash> bt
      PID: 22218  TASK: ffff951a6ad74980  CPU: 73  COMMAND: "vcpu8"
       #0 [ffffb1ba6707fa40] machine_kexec at ffffffff8565b397
       #1 [ffffb1ba6707fa90] __crash_kexec at ffffffff85788a6d
       #2 [ffffb1ba6707fb58] crash_kexec at ffffffff8578995d
       #3 [ffffb1ba6707fb70] oops_end at ffffffff85623c0d
       #4 [ffffb1ba6707fb90] no_context at ffffffff856692c9
       #5 [ffffb1ba6707fbf8] exc_page_fault at ffffffff85f95b51
       #6 [ffffb1ba6707fc50] asm_exc_page_fault at ffffffff86000ace
          [exception RIP: svm_update_pi_irte+227]
          RIP: ffffffffc0761b53  RSP: ffffb1ba6707fd08  RFLAGS: 00010086
          RAX: ffffb1ba6707fd78  RBX: ffffb1ba66d91000  RCX: 0000000000000001
          RDX: 00003c803f63f1c0  RSI: 000000000000019a  RDI: ffffb1ba66db2ab8
          RBP: 000000000000019a   R8: 0000000000000040   R9: ffff94ca41b82200
          R10: ffffffffffffffcf  R11: 0000000000000001  R12: 0000000000000001
          R13: 0000000000000001  R14: ffffffffffffffcf  R15: 000000000000005f
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #7 [ffffb1ba6707fdb8] kvm_irq_routing_update at ffffffffc09f19a1 [kvm]
       #8 [ffffb1ba6707fde0] kvm_set_irq_routing at ffffffffc09f2133 [kvm]
       #9 [ffffb1ba6707fe18] kvm_vm_ioctl at ffffffffc09ef544 [kvm]
          RIP: 00007f143c36488b  RSP: 00007f143a4e04b8  RFLAGS: 00000246
          RAX: ffffffffffffffda  RBX: 00007f05780041d0  RCX: 00007f143c36488b
          RDX: 00007f05780041d0  RSI: 000000004008ae6a  RDI: 0000000000000020
          RBP: 00000000000004e8   R8: 0000000000000008   R9: 00007f05780041e0
          R10: 00007f0578004560  R11: 0000000000000246  R12: 00000000000004e0
          R13: 000000000000001a  R14: 00007f1424001c60  R15: 00007f0578003bc0
          ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b
      
      VMX already fixed this in commit 3a8b0677 ("KVM: VMX: Do not BUG() on
      out-of-bounds guest IRQ"), so apply the same fix here.
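
      A hedged sketch of the kind of guard involved, assuming the fix amounts
      to a range check before the routing table is indexed (toy_routing and
      toy_lookup_irq() are hypothetical; a negative map entry stands in for
      an empty routing slot):

```c
#include <assert.h>

struct toy_routing {
	unsigned int nr_rt_entries;
	const int *map;		/* map[i] < 0 means "no route for IRQ i" */
};

/*
 * Look up a userspace-supplied guest IRQ.  Returns 0 and stores the route
 * in *out on success, or -1 if the IRQ is out of range or unmapped.
 */
int toy_lookup_irq(const struct toy_routing *rt, unsigned int guest_irq,
		   int *out)
{
	assert(rt && out);
	/* Check the bound first so map[] is never indexed out of range. */
	if (guest_irq >= rt->nr_rt_entries || rt->map[guest_irq] < 0)
		return -1;
	*out = rt->map[guest_irq];
	return 0;
}
```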
      Co-developed-by: Yi Liu <liu.yi24@zte.com.cn>
      Signed-off-by: Yi Liu <liu.yi24@zte.com.cn>
      Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
      Message-Id: <20220309113025.44469-1-wang.yi59@zte.com.cn>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a80ced6e
    • Paolo Bonzini's avatar
      KVM: MMU: propagate alloc_workqueue failure · a1a39128
      Paolo Bonzini authored
      If kvm->arch.tdp_mmu_zap_wq cannot be created, the failure has
      to be propagated up to kvm_mmu_init_vm and kvm_arch_init_vm.
      kvm_arch_init_vm also has to undo all the initialization, so
      group all the MMU initialization code at the beginning and
      handle cleaning up of kvm_page_track_init.
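
      The shape of the change can be sketched with the usual goto-based
      unwinding (illustrative only; toy_vm and the toy_ helpers are
      hypothetical, and the wq_alloc_fails flag simulates the workqueue
      allocation failing):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct toy_vm {
	bool page_track_ready;
	bool mmu_ready;
};

static void toy_page_track_init(struct toy_vm *vm)
{
	vm->page_track_ready = true;
}

static void toy_page_track_cleanup(struct toy_vm *vm)
{
	vm->page_track_ready = false;
}

/* wq_alloc_fails simulates the workqueue allocation returning NULL. */
static int toy_mmu_init_vm(struct toy_vm *vm, bool wq_alloc_fails)
{
	if (wq_alloc_fails)
		return -ENOMEM;
	vm->mmu_ready = true;
	return 0;
}

int toy_arch_init_vm(struct toy_vm *vm, bool wq_alloc_fails)
{
	int ret;

	assert(vm);
	/* Fallible MMU setup is grouped first, before anything else. */
	toy_page_track_init(vm);

	ret = toy_mmu_init_vm(vm, wq_alloc_fails);
	if (ret)
		goto out_page_track;

	/* ... the rest of VM initialization would follow here ... */
	return 0;

out_page_track:
	toy_page_track_cleanup(vm);	/* undo what was already set up */
	return ret;
}
```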
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a1a39128
    • Linus Torvalds's avatar
      Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 88e6c020
      Linus Torvalds authored
      Pull vfs updates from Al Viro:
       "Assorted bits and pieces"
      
      * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        aio: drop needless assignment in aio_read()
        clean overflow checks in count_mounts() a bit
        seq_file: fix NULL pointer arithmetic warning
        uml/x86: use x86 load_unaligned_zeropad()
        asm/user.h: killed unused macros
        constify struct path argument of finish_automount()/do_add_mount()
        fs: Remove FIXME comment in generic_write_checks()
      88e6c020
    • Linus Torvalds's avatar
      Merge tag 'vfs-5.18-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux · a4251ab9
      Linus Torvalds authored
      Pull vfs fix from Darrick Wong:
       "The erofs developers felt that FIEMAP should handle ranged requests
        starting at s_maxbytes by returning EFBIG instead of passing the
        filesystem implementation a nonsense 0-byte request.
      
        Not sure why they keep tagging this 'iomap', but the VFS shouldn't be
        asking for information about ranges of a file that the filesystem
        already declared that it does not support.
      
         - Fix a potential infinite loop in FIEMAP by fixing an off by one
           error when comparing the requested range against s_maxbytes"
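
      A minimal sketch of the comparison in question, assuming a prep helper
      that validates the range and clamps the length (toy_fiemap_prep() is
      hypothetical, not the VFS function): with '>' instead of '>=', a
      request starting exactly at the limit would pass through and be
      clamped to zero bytes.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Validate a ranged request against the filesystem's size limit.
 * Returns 0 (with *len clamped to the limit) or a negative errno.
 */
int toy_fiemap_prep(uint64_t start, uint64_t *len, uint64_t maxbytes)
{
	assert(len);
	if (*len == 0)
		return -EINVAL;
	/* '>=' rejects start == maxbytes; '>' would let it through. */
	if (start >= maxbytes)
		return -EFBIG;
	if (*len > maxbytes - start)
		*len = maxbytes - start;
	return 0;
}
```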
      
      * tag 'vfs-5.18-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
        fs: fix an infinite loop in iomap_fiemap
      a4251ab9