• David Woodhouse's avatar
    KVM: x86/xen: improve accuracy of Xen timers · 451a7078
    David Woodhouse authored
    A test program such as http://david.woodhou.se/timerlat.c confirms user
    reports that timers are increasingly inaccurate as the lifetime of a
    guest increases. Reporting the actual delay observed when asking for
    100µs of sleep, it starts off OK on a newly-launched guest but gets
    worse over time, giving incorrect sleep times:
    
    root@ip-10-0-193-21:~# ./timerlat -c -n 5
    00000000 latency 103243/100000 (3.2430%)
    00000001 latency 103243/100000 (3.2430%)
    00000002 latency 103242/100000 (3.2420%)
    00000003 latency 103245/100000 (3.2450%)
    00000004 latency 103245/100000 (3.2450%)
    
    The biggest problem is that get_kvmclock_ns() returns inaccurate values
    when the guest TSC is scaled. The guest sees a TSC value scaled from the
    host TSC by a mul/shift conversion (hopefully done in hardware). The
    guest then converts that guest TSC value into nanoseconds using the
    mul/shift conversion given to it by the KVM pvclock information.
    
    But get_kvmclock_ns() performs only a single conversion directly from
    host TSC to nanoseconds, giving a different result. A test program at
    http://david.woodhou.se/tsdrift.c demonstrates the cumulative error
    over a day.
    
    It's non-trivial to fix get_kvmclock_ns(), although I'll come back to
    that. The actual guest hv_clock is per-CPU, and *theoretically* each
    vCPU could be running at a *different* frequency. But this patch is
    needed anyway because...
    
    The other issue with Xen timers was that the code would snapshot the
    host CLOCK_MONOTONIC at some point in time, and then... after a few
    interrupts may have occurred, some preemption perhaps... would also read
    the guest's kvmclock. Then it would proceed under the false assumption
    that those two happened at the *same* time. Any time which *actually*
    elapsed between reading the two clocks was introduced as inaccuracies
    in the time at which the timer fired.
    
    Fix it to use a variant of kvm_get_time_and_clockread(), which reads the
    host TSC just *once*, then use the returned TSC value to calculate the
    kvmclock (making sure to do that the way the guest would instead of
    making the same mistake get_kvmclock_ns() does).
    
    Sadly, hrtimers based on CLOCK_MONOTONIC_RAW are not supported, so Xen
    timers still have to use CLOCK_MONOTONIC. In practice the difference
    between the two won't matter over the timescales involved, as the
    *absolute* values don't matter; just the delta.
    
    This does mean a new variant of kvm_get_time_and_clockread() is needed;
    called kvm_get_monotonic_and_clockread() because that's what it does.
    
    Fixes: 53639526 ("KVM: x86/xen: handle PV timers oneshot mode")
    Signed-off-by: default avatarDavid Woodhouse <dwmw@amazon.co.uk>
    Reviewed-by: default avatarPaul Durrant <paul@xen.org>
    Link: https://lore.kernel.org/r/20240227115648.3104-2-dwmw2@infradead.org
    [sean: massage moved comment, tweak if statement formatting]
    Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
    451a7078
x86.c 372 KB