1. 08 Mar, 2012 10 commits
    • Takuya Yoshikawa's avatar
      KVM: Fix write protection race during dirty logging · 6dbf79e7
      Takuya Yoshikawa authored
      This patch fixes a race introduced by:
      
        commit 95d4c16c
        KVM: Optimize dirty logging by rmap_write_protect()
      
      During protecting pages for dirty logging, other threads may also try
      to protect a page in mmu_sync_children() or kvm_mmu_get_page().
      
      In such a case, because get_dirty_log releases mmu_lock before flushing
      TLB's, the following race condition can happen:
      
        A (get_dirty_log)     B (another thread)
      
        lock(mmu_lock)
        clear pte.w
        unlock(mmu_lock)
                              lock(mmu_lock)
                              pte.w is already cleared
                              unlock(mmu_lock)
                              skip TLB flush
                              return
        ...
        TLB flush
      
      Though thread B assumes the page has already been protected when it
      returns, the remaining TLB entry will break that assumption.
      
      This patch fixes this problem by making get_dirty_log hold the mmu_lock
      until it flushes the TLB's.
      Signed-off-by: default avatarTakuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: default avatarAvi Kivity <avi@redhat.com>
      6dbf79e7
    • Raghavendra K T's avatar
      KVM: VMX: remove yield_on_hlt · 10166744
      Raghavendra K T authored
      yield_on_hlt was introduced for CPU bandwidth capping. Now it is
      redundant with CFS hardlimit.
      
      yield_on_hlt also complicates the scenario in paravirtual environment,
      that needs to trap halt. for e.g. paravirtualized ticket spinlocks.
      Acked-by: default avatarAnthony Liguori <aliguori@us.ibm.com>
      Signed-off-by: default avatarRaghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: default avatarAvi Kivity <avi@redhat.com>
      10166744
    • Zachary Amsden's avatar
      KVM: Track TSC synchronization in generations · e26101b1
      Zachary Amsden authored
      This allows us to track the original nanosecond and counter values
      at each phase of TSC writing by the guest.  This gets us perfect
      offset matching for stable TSC systems, and perfect software
      computed TSC matching for machines with unstable TSC.
      Signed-off-by: default avatarZachary Amsden <zamsden@gmail.com>
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: default avatarAvi Kivity <avi@redhat.com>
      e26101b1
    • Zachary Amsden's avatar
      KVM: Dont mark TSC unstable due to S4 suspend · 0dd6a6ed
      Zachary Amsden authored
      During a host suspend, TSC may go backwards, which KVM interprets
      as an unstable TSC.  Technically, KVM should not be marking the
      TSC unstable, which causes the TSC clocksource to go bad, but we
      need to be adjusting the TSC offsets in such a case.
      
      Dealing with this issue is a little tricky as the only place we
      can reliably do it is before much of the timekeeping infrastructure
      is up and running.  On top of this, we are not in a KVM thread
      context, so we may not be able to safely access VCPU fields.
      Instead, we compute our best known hardware offset at power-up and
      stash it to be applied to all VCPUs when they actually start running.
      Signed-off-by: default avatarZachary Amsden <zamsden@gmail.com>
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: default avatarAvi Kivity <avi@redhat.com>
      0dd6a6ed
    • Marcelo Tosatti's avatar
      KVM: Allow adjust_tsc_offset to be in host or guest cycles · f1e2b260
      Marcelo Tosatti authored
      Redefine the API to take a parameter indicating whether an
      adjustment is in host or guest cycles.
      Signed-off-by: default avatarZachary Amsden <zamsden@gmail.com>
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: default avatarAvi Kivity <avi@redhat.com>
      f1e2b260
    • Zachary Amsden's avatar
      KVM: Add last_host_tsc tracking back to KVM · 6f526ec5
      Zachary Amsden authored
      The variable last_host_tsc was removed from upstream code.  I am adding
      it back for two reasons.  First, it is unnecessary to use guest TSC
      computation to conclude information about the host TSC.  The guest may
      set the TSC backwards (this case handled by the previous patch), but
      the computation of guest TSC (and fetching an MSR) is significanlty more
      work and complexity than simply reading the hardware counter.  In addition,
      we don't actually need the guest TSC for any part of the computation,
      by always recomputing the offset, we can eliminate the need to deal with
      the current offset and any scaling factors that may apply.
      
      The second reason is that later on, we are going to be using the host
      TSC value to restore TSC offsets after a host S4 suspend, so we need to
      be reading the host values, not the guest values here.
      Signed-off-by: default avatarZachary Amsden <zamsden@gmail.com>
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: default avatarAvi Kivity <avi@redhat.com>
      6f526ec5
    • Zachary Amsden's avatar
      KVM: Fix last_guest_tsc / tsc_offset semantics · b183aa58
      Zachary Amsden authored
      The variable last_guest_tsc was being used as an ad-hoc indicator
      that guest TSC has been initialized and recorded correctly.  However,
      it may not have been, it could be that guest TSC has been set to some
      large value, the back to a small value (by, say, a software reboot).
      
      This defeats the logic and causes KVM to falsely assume that the
      guest TSC has gone backwards, marking the host TSC unstable, which
      is undesirable behavior.
      
      In addition, rather than try to compute an offset adjustment for the
      TSC on unstable platforms, just recompute the whole offset.  This
      allows us to get rid of one callsite for adjust_tsc_offset, which
      is problematic because the units it takes are in guest units, but
      here, the computation was originally being done in host units.
      
      Doing this, and also recording last_guest_tsc when the TSC is written
      allow us to remove the tricky logic which depended on last_guest_tsc
      being zero to indicate a reset of uninitialized value.
      
      Instead, we now have the guarantee that the guest TSC offset is
      always at least something which will get us last_guest_tsc.
      Signed-off-by: default avatarZachary Amsden <zamsden@gmail.com>
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: default avatarAvi Kivity <avi@redhat.com>
      b183aa58
    • Zachary Amsden's avatar
      KVM: Leave TSC synchronization window open with each new sync · 4dd7980b
      Zachary Amsden authored
      Currently, when the TSC is written by the guest, the variable
      ns is updated to force the current write to appear to have taken
      place at the time of the first write in this sync phase.  This
      leaves a cliff at the end of the match window where updates will
      fall of the end.  There are two scenarios where this can be a
      problem in practe - first, on a system with a large number of
      VCPUs, the sync period may last for an extended period of time.
      
      The second way this can happen is if the VM reboots very rapidly
      and we catch a VCPU TSC synchronization just around the edge.
      We may be unaware of the reboot, and thus the first VCPU might
      synchronize with an old set of the timer (at, say 0.97 seconds
      ago, when first powered on).  The second VCPU can come in 0.04
      seconds later to try to synchronize, but it misses the window
      because it is just over the threshold.
      
      Instead, stop doing this artificial setback of the ns variable
      and just update it with every write of the TSC.
      
      It may be observed that doing so causes values computed by
      compute_guest_tsc to diverge slightly across CPUs - note that
      the last_tsc_ns and last_tsc_write variable are used here, and
      now they last_tsc_ns will be different for each VCPU, reflecting
      the actual time of the update.
      
      However, compute_guest_tsc is used only for guests which already
      have TSC stability issues, and further, note that the previous
      patch has caused last_tsc_write to be incremented by the difference
      in nanoseconds, converted back into guest cycles.  As such, only
      boundary rounding errors should be visible, which given the
      resolution in nanoseconds, is going to only be a few cycles and
      only visible in cross-CPU consistency tests.  The problem can be
      fixed by adding a new set of variables to track the start offset
      and start write value for the current sync cycle.
      Signed-off-by: default avatarZachary Amsden <zamsden@gmail.com>
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: default avatarAvi Kivity <avi@redhat.com>
      4dd7980b
    • Zachary Amsden's avatar
      KVM: Improve TSC offset matching · 5d3cb0f6
      Zachary Amsden authored
      There are a few improvements that can be made to the TSC offset
      matching code.  First, we don't need to call the 128-bit multiply
      (especially on a constant number), the code works much nicer to
      do computation in nanosecond units.
      
      Second, the way everything is setup with software TSC rate scaling,
      we currently have per-cpu rates.  Obviously this isn't too desirable
      to use in practice, but if for some reason we do change the rate of
      all VCPUs at runtime, then reset the TSCs, we will only want to
      match offsets for VCPUs running at the same rate.
      
      Finally, for the case where we have an unstable host TSC, but
      rate scaling is being done in hardware, we should call the platform
      code to compute the TSC offset, so the math is reorganized to recompute
      the base instead, then transform the base into an offset using the
      existing API.
      
      [avi: fix 64-bit division on i386]
      Signed-off-by: default avatarZachary Amsden <zamsden@gmail.com>
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
      
      KVM: Fix 64-bit division in kvm_write_tsc()
      
      Breaks i386 build.
      Signed-off-by: default avatarAvi Kivity <avi@redhat.com>
      5d3cb0f6
    • Zachary Amsden's avatar
      KVM: Infrastructure for software and hardware based TSC rate scaling · cc578287
      Zachary Amsden authored
      This requires some restructuring; rather than use 'virtual_tsc_khz'
      to indicate whether hardware rate scaling is in effect, we consider
      each VCPU to always have a virtual TSC rate.  Instead, there is new
      logic above the vendor-specific hardware scaling that decides whether
      it is even necessary to use and updates all rate variables used by
      common code.  This means we can simply query the virtual rate at
      any point, which is needed for software rate scaling.
      
      There is also now a threshold added to the TSC rate scaling; minor
      differences and variations of measured TSC rate can accidentally
      provoke rate scaling to be used when it is not needed.  Instead,
      we have a tolerance variable called tsc_tolerance_ppm, which is
      the maximum variation from user requested rate at which scaling
      will be used.  The default is 250ppm, which is the half the
      threshold for NTP adjustment, allowing for some hardware variation.
      
      In the event that hardware rate scaling is not available, we can
      kludge a bit by forcing TSC catchup to turn on when a faster than
      hardware speed has been requested, but there is nothing available
      yet for the reverse case; this requires a trap and emulate software
      implementation for RDTSC, which is still forthcoming.
      
      [avi: fix 64-bit division on i386]
      Signed-off-by: default avatarZachary Amsden <zamsden@gmail.com>
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: default avatarAvi Kivity <avi@redhat.com>
      cc578287
  2. 05 Mar, 2012 30 commits