    KVM: x86: Don't sync user-written TSC against startup values · bf328e22
    Like Xu authored
    The legacy API for setting the TSC is fundamentally broken, and only
    allows userspace to set a TSC "now", without any way to account for
    time lost between the calculation of the value, and the kernel eventually
    handling the ioctl.
    
    To work around this, KVM has a hack which, if a TSC is set with a value
    which is within a second's worth of the last TSC "written" to any vCPU in
    the VM, assumes that userspace actually intended the two TSC values to be
    in sync and adjusts the newly-written TSC value accordingly.
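    
    As a rough illustration of that window check, here is a minimal
    standalone sketch. It is not the kernel's code: the helper names
    (ns_to_cycles, within_one_second) and the parameter layout are
    assumptions made for this example, and counter wraparound is ignored.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert elapsed nanoseconds to guest TSC cycles
 * at a guest frequency given in kHz. Assumes the product does not
 * overflow 64 bits (true for realistic frequencies and short intervals).
 */
static uint64_t ns_to_cycles(uint64_t ns, uint64_t tsc_khz)
{
    return ns * tsc_khz / 1000000ULL;
}

/*
 * Hypothetical predicate: is the newly written value within one second
 * (in guest cycles) of where the last written TSC would be by now?
 */
static bool within_one_second(uint64_t data, uint64_t last_write,
                              uint64_t elapsed_ns, uint64_t tsc_khz)
{
    /* Where the previously written TSC is expected to be right now. */
    uint64_t tsc_exp = last_write + ns_to_cycles(elapsed_ns, tsc_khz);
    /* One second of guest time, expressed in cycles. */
    uint64_t tsc_hz = tsc_khz * 1000ULL;

    return data < tsc_exp + tsc_hz && data + tsc_hz > tsc_exp;
}
```

    With a 2 GHz guest, a last written value of 0 and 100 ms elapsed, a
    write of 1,000,000,000 cycles (0.5 s of guest time) lands inside the
    window, while a write of 10,000,000,000 cycles (5 s) does not.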
    
    Thus, when a VMM restores a guest after suspend or migration using the
    legacy API, the TSCs aren't necessarily *right*, but at least they're
    in sync.
    
    This trick falls down when restoring a guest which genuinely has been
    running for less time than the 1 second of imprecision KVM allows for
    in the legacy API.  On *creation*, the first vCPU starts its TSC counting
    from zero, and the subsequent vCPUs synchronize to that.  But then when
    the VMM tries to restore a vCPU's intended TSC, because the VM has been
    alive for less than 1 second and KVM's default TSC value for new vCPUs is
    '0', the intended TSC is within a second of the last "written" TSC and KVM
    incorrectly adjusts the intended TSC in an attempt to synchronize.
    
    But further hacks can be piled onto KVM's existing hackish ABI, and
    declare that the *first* value written by *userspace* (on any vCPU)
    should not be subject to this "correction", i.e. KVM can assume that the
    first write from userspace is not an attempt to sync up with TSC values
    that only come from the kernel's default vCPU creation.
    
    To that end: Add a flag, kvm->arch.user_set_tsc, protected by
    kvm->arch.tsc_write_lock, to record that a TSC for at least one vCPU in
    the VM *has* been set by userspace, and make the 1-second slop hack only
    trigger if user_set_tsc is already set.
    
    Note that userspace can explicitly request a *synchronization* of the
    TSC by writing zero. For the purpose of user_set_tsc, an explicit
    synchronization counts as "setting" the TSC, i.e. if userspace then
    subsequently writes an explicit non-zero value which happens to be within
    1 second of the previous value, the new value will be "corrected".  This
    behavior is deliberate, as treating explicit synchronization as "setting"
    the TSC preserves KVM's existing behavior inasmuch as possible (KVM
    always applied the 1-second "correction" regardless of whether the write
    came from userspace vs. the kernel).
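    
    The rules above can be modeled in a few lines. This is a simplified
    sketch under stated assumptions, not the patch itself: struct vm_model
    and tsc_write_is_sync are hypothetical names, a NULL user_value stands
    in for a kernel-initiated write at vCPU creation, the caller supplies
    the expected TSC, and the locking by kvm->arch.tsc_write_lock is
    omitted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-VM state the commit describes. */
struct vm_model {
    bool user_set_tsc;  /* has userspace written a TSC on any vCPU? */
};

/*
 * Returns true if the write is treated as a synchronization attempt.
 * user_value == NULL models the kernel's default write at vCPU creation.
 * tsc_exp is the expected current value of the last written TSC, and
 * tsc_hz is one second of guest time in cycles.
 */
static bool tsc_write_is_sync(struct vm_model *vm, const uint64_t *user_value,
                              uint64_t tsc_exp, uint64_t tsc_hz)
{
    uint64_t data = user_value ? *user_value : 0;
    bool synchronizing = false;

    if (data == 0) {
        /* vCPU creation, or an explicit sync request from userspace. */
        synchronizing = true;
    } else if (vm->user_set_tsc) {
        /* The 1-second slop applies only after a prior userspace write. */
        synchronizing = data < tsc_exp + tsc_hz && data + tsc_hz > tsc_exp;
    }

    /* Any userspace write, including an explicit zero, sets the flag. */
    if (user_value)
        vm->user_set_tsc = true;

    return synchronizing;
}
```

    In this model, the first non-zero userspace write on a freshly created
    VM is taken at face value even if it falls inside the window, whereas
    the same write issued after an explicit zero write is "corrected".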
    
    Reported-by: Yong He <alexyonghe@tencent.com>
    Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217423
    Suggested-by: Oliver Upton <oliver.upton@linux.dev>
    Original-by: Oliver Upton <oliver.upton@linux.dev>
    Original-by: Sean Christopherson <seanjc@google.com>
    Signed-off-by: Like Xu <likexu@tencent.com>
    Tested-by: Yong He <alexyonghe@tencent.com>
    Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
    Link: https://lore.kernel.org/r/20231008025335.7419-1-likexu@tencent.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>