1. 27 Dec, 2022 37 commits
  2. 23 Dec, 2022 3 commits
    • Sean Christopherson's avatar
      KVM: x86/mmu: Don't install TDP MMU SPTE if SP has unexpected level · 50a9ac25
      Sean Christopherson authored
      Don't install a leaf TDP MMU SPTE if the parent page's level doesn't
      match the target level of the fault, and instead have the vCPU retry the
      faulting instruction after warning.  Continuing on is completely
      unnecessary as the absolute worst case scenario of retrying is DoSing
      the vCPU, whereas continuing on all but guarantees bigger explosions, e.g.
      
        ------------[ cut here ]------------
        kernel BUG at arch/x86/kvm/mmu/tdp_mmu.c:559!
        invalid opcode: 0000 [#1] SMP
        CPU: 1 PID: 1025 Comm: nx_huge_pages_t Tainted: G        W          6.1.0-rc4+ #64
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:__handle_changed_spte.cold+0x95/0x9c
        RSP: 0018:ffffc9000072faf8 EFLAGS: 00010246
        RAX: 00000000000000c1 RBX: ffffc90000731000 RCX: 0000000000000027
        RDX: 0000000000000000 RSI: 00000000ffffdfff RDI: ffff888277c5b4c8
        RBP: 0600000112400bf3 R08: ffff888277c5b4c0 R09: ffffc9000072f9a0
        R10: 0000000000000001 R11: 0000000000000001 R12: 06000001126009f3
        R13: 0000000000000002 R14: 0000000012600901 R15: 0000000012400b01
        FS:  00007fba9f853740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 000000010aa7a003 CR4: 0000000000172ea0
        Call Trace:
         <TASK>
         kvm_tdp_mmu_map+0x3b0/0x510
         kvm_tdp_page_fault+0x10c/0x130
         kvm_mmu_page_fault+0x103/0x680
         vmx_handle_exit+0x132/0x5a0 [kvm_intel]
         vcpu_enter_guest+0x60c/0x16f0
         kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
         kvm_vcpu_ioctl+0x271/0x660
         __x64_sys_ioctl+0x80/0xb0
         do_syscall_64+0x2b/0x50
         entry_SYSCALL_64_after_hwframe+0x46/0xb0
         </TASK>
        Modules linked in: kvm_intel
        ---[ end trace 0000000000000000 ]---
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20221213033030.83345-5-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      50a9ac25
    • Sean Christopherson's avatar
      KVM: x86/mmu: Re-check under lock that TDP MMU SP hugepage is disallowed · 21a36ac6
      Sean Christopherson authored
      Re-check sp->nx_huge_page_disallowed under the tdp_mmu_pages_lock spinlock
      when adding a new shadow page in the TDP MMU.  To ensure the NX reclaim
      kthread can't see a not-yet-linked shadow page, the page fault path links
      the new page table prior to adding the page to possible_nx_huge_pages.
      
      If the page is zapped by different task, e.g. because dirty logging is
      disabled, between linking the page and adding it to the list, KVM can end
      up triggering use-after-free by adding the zapped SP to the aforementioned
      list, as the zapped SP's memory is scheduled for removal via RCU callback.
      The bug is detected by the sanity checks guarded by CONFIG_DEBUG_LIST=y,
      i.e. the below splat is just one possible signature.
      
        ------------[ cut here ]------------
        list_add corruption. prev->next should be next (ffffc9000071fa70), but was ffff88811125ee38. (prev=ffff88811125ee38).
        WARNING: CPU: 1 PID: 953 at lib/list_debug.c:30 __list_add_valid+0x79/0xa0
        Modules linked in: kvm_intel
        CPU: 1 PID: 953 Comm: nx_huge_pages_t Tainted: G        W          6.1.0-rc4+ #71
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:__list_add_valid+0x79/0xa0
        RSP: 0018:ffffc900006efb68 EFLAGS: 00010286
        RAX: 0000000000000000 RBX: ffff888116cae8a0 RCX: 0000000000000027
        RDX: 0000000000000027 RSI: 0000000100001872 RDI: ffff888277c5b4c8
        RBP: ffffc90000717000 R08: ffff888277c5b4c0 R09: ffffc900006efa08
        R10: 0000000000199998 R11: 0000000000199a20 R12: ffff888116cae930
        R13: ffff88811125ee38 R14: ffffc9000071fa70 R15: ffff88810b794f90
        FS:  00007fc0415d2740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 0000000115201006 CR4: 0000000000172ea0
        Call Trace:
         <TASK>
         track_possible_nx_huge_page+0x53/0x80
         kvm_tdp_mmu_map+0x242/0x2c0
         kvm_tdp_page_fault+0x10c/0x130
         kvm_mmu_page_fault+0x103/0x680
         vmx_handle_exit+0x132/0x5a0 [kvm_intel]
         vcpu_enter_guest+0x60c/0x16f0
         kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
         kvm_vcpu_ioctl+0x271/0x660
         __x64_sys_ioctl+0x80/0xb0
         do_syscall_64+0x2b/0x50
         entry_SYSCALL_64_after_hwframe+0x46/0xb0
         </TASK>
        ---[ end trace 0000000000000000 ]---
      
      Fixes: 61f94478 ("KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting SPTE")
      Reported-by: default avatarGreg Thelen <gthelen@google.com>
      Analyzed-by: default avatarDavid Matlack <dmatlack@google.com>
      Cc: David Matlack <dmatlack@google.com>
      Cc: Ben Gardon <bgardon@google.com>
      Cc: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20221213033030.83345-4-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      21a36ac6
    • Sean Christopherson's avatar
      KVM: x86/mmu: Map TDP MMU leaf SPTE iff target level is reached · 80a3e4ae
      Sean Christopherson authored
      Map the leaf SPTE when handling a TDP MMU page fault if and only if the
      target level is reached.  A recent commit reworked the retry logic and
      incorrectly assumed that walking SPTEs would never "fail", as the loop
      either bails (retries) or installs parent SPs.  However, the iterator
      itself will bail early if it detects a frozen (REMOVED) SPTE when
      stepping down.   The TDP iterator also rereads the current SPTE before
      stepping down specifically to avoid walking into a part of the tree that
      is being removed, which means it's possible to terminate the loop without
      the guts of the loop observing the frozen SPTE, e.g. if a different task
      zaps a parent SPTE between the initial read and try_step_down()'s refresh.
      
      Mapping a leaf SPTE at the wrong level results in all kinds of badness as
      page table walkers interpret the SPTE as a page table, not a leaf, and
      walk into the weeds.
      
        ------------[ cut here ]------------
        WARNING: CPU: 1 PID: 1025 at arch/x86/kvm/mmu/tdp_mmu.c:1070 kvm_tdp_mmu_map+0x481/0x510
        Modules linked in: kvm_intel
        CPU: 1 PID: 1025 Comm: nx_huge_pages_t Tainted: G        W          6.1.0-rc4+ #64
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:kvm_tdp_mmu_map+0x481/0x510
        RSP: 0018:ffffc9000072fba8 EFLAGS: 00010286
        RAX: 0000000000000000 RBX: ffffc9000072fcc0 RCX: 0000000000000027
        RDX: 0000000000000027 RSI: 00000000ffffdfff RDI: ffff888277c5b4c8
        RBP: ffff888107d45a10 R08: ffff888277c5b4c0 R09: ffffc9000072fa48
        R10: 0000000000000001 R11: 0000000000000001 R12: ffffc9000073a0e0
        R13: ffff88810fc54800 R14: ffff888107d1ae60 R15: ffff88810fc54f90
        FS:  00007fba9f853740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 000000010aa7a003 CR4: 0000000000172ea0
        Call Trace:
         <TASK>
         kvm_tdp_page_fault+0x10c/0x130
         kvm_mmu_page_fault+0x103/0x680
         vmx_handle_exit+0x132/0x5a0 [kvm_intel]
         vcpu_enter_guest+0x60c/0x16f0
         kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
         kvm_vcpu_ioctl+0x271/0x660
         __x64_sys_ioctl+0x80/0xb0
         do_syscall_64+0x2b/0x50
         entry_SYSCALL_64_after_hwframe+0x46/0xb0
         </TASK>
        ---[ end trace 0000000000000000 ]---
        Invalid SPTE change: cannot replace a present leaf
        SPTE with another present leaf SPTE mapping a
        different PFN!
        as_id: 0 gfn: 100200 old_spte: 600000112400bf3 new_spte: 6000001126009f3 level: 2
        ------------[ cut here ]------------
        kernel BUG at arch/x86/kvm/mmu/tdp_mmu.c:559!
        invalid opcode: 0000 [#1] SMP
        CPU: 1 PID: 1025 Comm: nx_huge_pages_t Tainted: G        W          6.1.0-rc4+ #64
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:__handle_changed_spte.cold+0x95/0x9c
        RSP: 0018:ffffc9000072faf8 EFLAGS: 00010246
        RAX: 00000000000000c1 RBX: ffffc90000731000 RCX: 0000000000000027
        RDX: 0000000000000000 RSI: 00000000ffffdfff RDI: ffff888277c5b4c8
        RBP: 0600000112400bf3 R08: ffff888277c5b4c0 R09: ffffc9000072f9a0
        R10: 0000000000000001 R11: 0000000000000001 R12: 06000001126009f3
        R13: 0000000000000002 R14: 0000000012600901 R15: 0000000012400b01
        FS:  00007fba9f853740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 000000010aa7a003 CR4: 0000000000172ea0
        Call Trace:
         <TASK>
         kvm_tdp_mmu_map+0x3b0/0x510
         kvm_tdp_page_fault+0x10c/0x130
         kvm_mmu_page_fault+0x103/0x680
         vmx_handle_exit+0x132/0x5a0 [kvm_intel]
         vcpu_enter_guest+0x60c/0x16f0
         kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
         kvm_vcpu_ioctl+0x271/0x660
         __x64_sys_ioctl+0x80/0xb0
         do_syscall_64+0x2b/0x50
         entry_SYSCALL_64_after_hwframe+0x46/0xb0
         </TASK>
        Modules linked in: kvm_intel
        ---[ end trace 0000000000000000 ]---
      
      Fixes: 63d28a25 ("KVM: x86/mmu: simplify kvm_tdp_mmu_map flow when guest has to retry")
      Cc: Robert Hoo <robert.hu@linux.intel.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20221213033030.83345-3-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      80a3e4ae