    KVM: pfncache: simplify locking and make more self-contained · 6addfcf2
    David Woodhouse authored
    The locking on the gfn_to_pfn_cache is... interesting. And awful.
    
    There is an rwlock, ->lock, which readers take to protect against
    concurrent changes. But __kvm_gpc_refresh() assumes that certain
    fields will not change even while it drops the write lock and
    performs MM operations to revalidate the target PFN and kernel
    mapping.
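    
    As a hedged sketch (not the literal pfncache.c code; the function
    name is made up, and the types come from <linux/kvm_types.h>), the
    shape of that pattern is roughly:
    
        static int refresh_sketch(struct gfn_to_pfn_cache *gpc)
        {
                write_lock_irq(&gpc->lock);
                /* ... conclude the mapping must be revalidated ... */
                write_unlock_irq(&gpc->lock);
    
                /* The MM operations (gup, memremap) may sleep, so
                 * they run without ->lock held; nothing here excludes
                 * a concurrent deactivation of the cache. */
    
                write_lock_irq(&gpc->lock);
                /* Reinstall ->pfn/->khva and set ->valid, assuming
                 * nothing changed in the meantime. */
                write_unlock_irq(&gpc->lock);
                return 0;
        }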
    
    Commit 93984f19 ("KVM: Fully serialize gfn=>pfn cache refresh via
    mutex") partly addressed that — not by fixing it, but by adding a new
    mutex, ->refresh_lock. This prevented concurrent __kvm_gpc_refresh()
    calls on a given gfn_to_pfn_cache, but is still only a partial solution.
    
    There is still a theoretical race where __kvm_gpc_refresh() runs in
    parallel with kvm_gpc_deactivate(). While __kvm_gpc_refresh() has
    dropped the write lock, kvm_gpc_deactivate() clears the ->active flag
    and unmaps ->khva. Then __kvm_gpc_refresh() determines that the previous
    ->pfn and ->khva are still valid, and reinstalls those values into the
    structure. This leaves the gfn_to_pfn_cache with the ->valid bit set,
    but ->active clear. And a ->khva which looks like a reasonable kernel
    address but is actually unmapped.
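    
    One such interleaving, sketched with the pfncache.c field names and
    the call sites simplified:
    
        CPU0: __kvm_gpc_refresh()          CPU1: kvm_gpc_deactivate()
        -------------------------          --------------------------
        write_lock_irq(&gpc->lock)
        /* old pfn/khva look reusable */
        write_unlock_irq(&gpc->lock)
                                           write_lock_irq(&gpc->lock)
                                           gpc->active = false
                                           write_unlock_irq(&gpc->lock)
                                           unmap gpc->khva
        write_lock_irq(&gpc->lock)
        gpc->valid = true
        /* reinstall stale ->pfn/->khva */
        write_unlock_irq(&gpc->lock)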
    
    All it takes is a subsequent reactivation to cause that ->khva to be
    dereferenced. This would theoretically cause an oops which would look
    something like this:
    
    [1724749.564994] BUG: unable to handle page fault for address: ffffaa3540ace0e0
    [1724749.565039] RIP: 0010:__kvm_xen_has_interrupt+0x8b/0xb0
    
    I say "theoretically" because, in theory, the oops that was in fact
    seen in production cannot happen. The code which uses the
    gfn_to_pfn_cache is supposed to have its *own* locking, to further
    paper over the fact that the gfn_to_pfn_cache's own papering-over
    (->refresh_lock) of its own rwlock abuse is not sufficient.
    
    For the Xen vcpu_info that external lock is the vcpu->mutex, and for the
    shared info it's kvm->arch.xen.xen_lock. Those locks ought to protect
    the gfn_to_pfn_cache against concurrent deactivation vs. refresh in all
    but the cases where the vcpu or kvm object is being *destroyed*, in
    which case the subsequent reactivation should never happen.
    
    Theoretically.
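    
    For illustration, that caller-side convention looks roughly like
    this hedged sketch (loosely modelled on arch/x86/kvm/xen.c, not
    copied from it; the function name is made up):
    
        static int refresh_shinfo_sketch(struct kvm *kvm)
        {
                struct gfn_to_pfn_cache *gpc = &kvm->arch.xen.shinfo_cache;
                int ret;
    
                /* The outer mutex, not the gpc's own locking, is what
                 * currently excludes kvm_gpc_deactivate() here. */
                mutex_lock(&kvm->arch.xen.xen_lock);
                ret = kvm_gpc_refresh(gpc, PAGE_SIZE);
                mutex_unlock(&kvm->arch.xen.xen_lock);
                return ret;
        }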
    
    Nevertheless, this locking abuse is awful and should be fixed, even if
    no clear explanation can be found for how the oops happened. So expand
    the use of the ->refresh_lock mutex to ensure serialization of
    activate/deactivate vs. refresh and make the pfncache locking entirely
    self-sufficient.
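    
    A minimal sketch of the resulting shape (simplified; the real code
    is in virt/kvm/pfncache.c, guard(mutex) comes from the kernel's
    <linux/cleanup.h> machinery, and the function name is made up):
    
        static void deactivate_sketch(struct gfn_to_pfn_cache *gpc)
        {
                /* Held for the whole operation; refresh and activate
                 * now hold it for their whole duration too, so a
                 * refresh can no longer reinstall stale values after
                 * this point. */
                guard(mutex)(&gpc->refresh_lock);
    
                write_lock_irq(&gpc->lock);
                gpc->active = false;
                gpc->valid = false;
                write_unlock_irq(&gpc->lock);
    
                /* ... unmap the old ->khva / release the ->pfn ... */
        }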
    
    This means that a future commit can simplify the locking in the
    callers, such as the Xen emulation code, which has an outstanding
    problem with recursive locking of kvm->arch.xen.xen_lock; that
    locking will no longer be necessary.
    
    The rwlock abuse described above is still not best practice, although
    it's harmless now that the ->refresh_lock is held for the entire duration
    while the offending code drops the write lock, does some other stuff,
    then takes the write lock again and assumes nothing changed. That can
    also be fixed^W cleaned up in a subsequent commit, but this commit is
    a simpler basis for the Xen deadlock fix mentioned above.
    
    Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
    Reviewed-by: Paul Durrant <paul@xen.org>
    Link: https://lore.kernel.org/r/20240227115648.3104-5-dwmw2@infradead.org
    [sean: use guard(mutex) to fix a missed unlock]
    Signed-off-by: Sean Christopherson <seanjc@google.com>