1. 21 Feb, 2017 9 commits
    • x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64 · dd0fd8bc
      Waiman Long authored
      When running a fio sequential write test against an XFS ramdisk on a
      KVM guest hosted on a 2-socket x86-64 system, the %CPU times reported
      by perf were as follows:
      
       69.75%  0.59%  fio  [k] down_write
       69.15%  0.01%  fio  [k] call_rwsem_down_write_failed
       67.12%  1.12%  fio  [k] rwsem_down_write_failed
       63.48% 52.77%  fio  [k] osq_lock
        9.46%  7.88%  fio  [k] __raw_callee_save___kvm_vcpu_is_preempt
        3.93%  3.93%  fio  [k] __kvm_vcpu_is_preempted
      
      Making vcpu_is_preempted() a callee-save function has a relatively
      high cost on x86-64, primarily because saving and restoring the eight
      registers to and from the stack touches at least one more cacheline
      and adds one more level of function call.
      
      To reduce this performance overhead, an optimized assembly version
      of the __raw_callee_save___kvm_vcpu_is_preempt() function is
      provided for x86-64.
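
      A minimal sketch of what such a hand-written stub can look like is shown
      below. The per-cpu offset expression and the STEAL_TIME_preempted constant
      are illustrative placeholders, not copied verbatim from the patch; the
      point is that the stub clobbers only %rax, so callers no longer spill and
      reload eight registers around the call:

       #ifdef CONFIG_X86_64
       asm(
       ".pushsection .text\n"
       ".global __raw_callee_save___kvm_vcpu_is_preempt\n"
       ".type __raw_callee_save___kvm_vcpu_is_preempt, @function\n"
       "__raw_callee_save___kvm_vcpu_is_preempt:\n"
       /* %rdi holds the vCPU number; load that CPU's per-cpu base. */
       "movq  __per_cpu_offset(,%rdi,8), %rax\n"
       /* Test steal_time.preempted for that CPU (offset is schematic). */
       "cmpb  $0, steal_time+STEAL_TIME_preempted(%rax)\n"
       "setne %al\n"
       "ret\n"
       ".popsection");
       #endif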
      
      With this patch applied, on a KVM guest running on a 2-socket 16-core
      32-thread system with 16 parallel fio jobs (8 on each socket), the
      aggregate bandwidth of the fio test on an XFS ramdisk was as follows:
      
         I/O Type      w/o patch    with patch
         --------      ---------    ----------
         random read   8141.2 MB/s  8497.1 MB/s
         seq read      8229.4 MB/s  8304.2 MB/s
         random write  1675.5 MB/s  1701.5 MB/s
         seq write     1681.3 MB/s  1699.9 MB/s
      
      The patch produces a modest increase in aggregate bandwidth for every
      I/O type.
      
      The perf data now became:
      
       70.78%  0.58%  fio  [k] down_write
       70.20%  0.01%  fio  [k] call_rwsem_down_write_failed
       69.70%  1.17%  fio  [k] rwsem_down_write_failed
       59.91% 55.42%  fio  [k] osq_lock
       10.14% 10.14%  fio  [k] __kvm_vcpu_is_preempted
      
      The assembly code was verified with a test kernel module that compares
      the output of the C __kvm_vcpu_is_preempted() function with that of the
      assembly __raw_callee_save___kvm_vcpu_is_preempt() and confirms that
      they match.
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/paravirt: Change vcpu_is_preempted() arg type to long · 6c62985d
      Waiman Long authored
      The cpu argument in the function prototype of vcpu_is_preempted() is
      changed from int to long, which makes it easier to provide an optimized
      assembly version of that function.
      
      For Xen, vcpu_is_preempted(long) calls xen_vcpu_stolen(int); the
      downcast from long to int is not a problem because the vCPU number
      will not exceed 32 bits.
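
      As a sketch of the interface change (the surrounding paravirt plumbing is
      omitted, and the Xen declaration is shown only to illustrate the downcast):

       /* With a 'long' argument, the vCPU number arrives full-width in %rdi
        * under the x86-64 ABI, so hand-written assembly can use it directly
        * as an index instead of first widening a 32-bit 'int'. */
       bool vcpu_is_preempted(long cpu);   /* was: bool vcpu_is_preempted(int cpu) */

       /* On Xen the value is simply narrowed again; vCPU numbers fit in 32 bits. */
       bool xen_vcpu_stolen(int vcpu);
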
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: use correct vmcs_read/write for guest segment selector/base · 96794e4e
      Chao Peng authored
      The guest segment selector is a 16-bit field and the guest segment base
      is a natural-width field. Fix two incorrect accessor invocations
      accordingly.
      
      Without this patch, the build fails when aggressive inlining is used with ICC.
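
      An illustrative example of the mismatch (the field names are from vmx.h;
      the exact call sites touched by the patch may differ):

       /* Selector fields are 16 bits; base fields are natural width. */
       vmcs_write16(GUEST_CS_SELECTOR, var->selector);  /* not vmcs_write32() */
       base = vmcs_readl(GUEST_CS_BASE);                /* not vmcs_read32() */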
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/kvm/vmx: Defer TR reload after VM exit · b7ffc44d
      Andy Lutomirski authored
      Intel's VMX is daft and resets the hidden TSS limit register to 0x67
      on VMX reload, and the 0x67 is not configurable.  KVM currently
      reloads TR using the LTR instruction on every exit, but this is quite
      slow because LTR is serializing.
      
      The 0x67 limit is entirely harmless unless ioperm() is in use, so
      defer the reload until a task using ioperm() is actually running.
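
      A minimal sketch of the deferral scheme (the helper and flag names follow
      the patch description, but the details here are illustrative): a VM exit
      only marks the TSS limit as stale, and the serializing LTR is executed
      later, and only if a task that uses ioperm() actually needs the io bitmap.

       static inline void invalidate_tss_limit(void)
       {
               if (unlikely(test_thread_flag(TIF_IO_BITMAP)))
                       force_reload_TR();          /* current task needs the io bitmap now */
               else
                       this_cpu_write(__tss_limit_invalid, true);  /* defer the LTR */
       }

       static inline void refresh_tss_limit(void)
       {
               if (unlikely(this_cpu_read(__tss_limit_invalid)))
                       force_reload_TR();          /* pay the serializing cost only here */
       }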
      
      Here's some poorly done benchmarking using kvm-unit-tests
      (lower is better):

         Test                Before    After
         ----                ------    -----
         cpuid                 1313     1291
         vmcall                1195     1181
         mov_from_cr8            11       11
         mov_to_cr8              17       16
         inl_from_pmtimer      6770     6457
         inl_from_qemu         6856     6209
         inl_from_kernel       2435     2339
         outl_to_kernel        1402     1391

      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      [Force-reload TR in invalidate_tss_limit. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/asm/64: Drop __cacheline_aligned from struct x86_hw_tss · d3273dea
      Andy Lutomirski authored
      Historically, the entire TSS + io bitmap structure was cacheline
      aligned, but commit ca241c75 ("x86: unify tss_struct") changed it
      (presumably inadvertently) so that the fixed-layout hardware part is
      cacheline-aligned and the io bitmap is after the padding.  This wastes
      24 bytes (the hardware part should be 104 bytes, but the alignment pads
      it to 128 bytes), serves no purpose, and causes sizeof(struct
      x86_hw_tss) to have a confusing value.
      
      Drop the pointless alignment.
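
      A standalone illustration of the padding effect (plain user-space C, not
      kernel code): forcing 64-byte alignment on a 104-byte structure grows its
      size to 128 bytes, which is exactly the 24 wasted bytes mentioned above.

       #include <stdio.h>

       struct hw_part_aligned {
               unsigned char bytes[104];
       } __attribute__((aligned(64)));

       struct hw_part_plain {
               unsigned char bytes[104];
       };

       int main(void)
       {
               printf("aligned: %zu\n", sizeof(struct hw_part_aligned)); /* 128 */
               printf("plain:   %zu\n", sizeof(struct hw_part_plain));   /* 104 */
               return 0;
       }
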
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/kvm/vmx: Simplify segment_base() · 8c2e41f7
      Andy Lutomirski authored
      Use actual pointer types for pointers (instead of unsigned long) and
      replace hardcoded constants with the appropriate self-documenting
      macros.
      
      The function is still a bit messy, but this seems a lot better than
      before to me.
      
      This is mostly borrowed from a patch by Thomas Garnier.
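
      An illustrative before/after contrast (not the exact KVM code): the
      rewrite replaces unsigned long arithmetic and hardcoded masks with typed
      pointers and the existing segment macros such as SEGMENT_RPL_MASK from
      <asm/segment.h>.

       /* before-style: everything is an unsigned long and 7 is a magic mask */
       d = (struct desc_struct *)(gdt_base + (selector & ~7));

       /* after-style: index a typed table and let the names document intent */
       struct desc_struct *table = (struct desc_struct *)gdt->address;
       struct desc_struct *d = &table[selector >> 3];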
      
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/kvm/vmx: Get rid of segment_base() on 64-bit kernels · e28baead
      Andy Lutomirski authored
      segment_base() was a bit buggy (it didn't list all segment types that
      needed 64-bit fixups), but the bug was irrelevant because it wasn't
      called in any interesting context on 64-bit kernels and was only used
      for data segments on 32-bit kernels.
      
      To avoid confusion, make it explicitly 32-bit only.
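
      A schematic sketch of the resulting split (the call sites and variable
      names are illustrative): segment_base() is now compiled only for 32-bit
      kernels, which still derive the host FS/GS bases from the descriptor
      tables, while 64-bit kernels read the bases from MSRs.

       #ifdef CONFIG_X86_32
               vmcs_writel(HOST_FS_BASE, segment_base(fs_sel));
               vmcs_writel(HOST_GS_BASE, segment_base(gs_sel));
       #else
               rdmsrl(MSR_FS_BASE, base);
               vmcs_writel(HOST_FS_BASE, base);
               rdmsrl(MSR_GS_BASE, base);
               vmcs_writel(HOST_GS_BASE, base);
       #endif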
      
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/kvm/vmx: Don't fetch the TSS base from the GDT · e0c23063
      Andy Lutomirski authored
      The current CPU's TSS base is a foregone conclusion, so there's no need
      to parse it out of the segment tables.  This should save a couple cycles
      (as STR is surely microcoded and poorly optimized) but, more importantly,
      it's a cleanup and it means that segment_base() will never be called on
      64-bit kernels.
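
      A sketch of the change (the per-cpu symbol name reflects this kernel
      generation and is an assumption here): rather than executing STR and
      walking the GDT, the host TR base is written directly from the per-cpu
      TSS pointer.

       vmcs_writel(HOST_TR_BASE, (unsigned long)this_cpu_ptr(&cpu_tss));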
      
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/asm: Define the kernel TSS limit in a macro · 4f53ab14
      Andy Lutomirski authored
      Rather than open-coding the kernel TSS limit in set_tss_desc(), make
      it a real macro near the TSS layout definition.
      
      This is purely a cleanup.
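
      A sketch of the cleanup (the exact expression may differ from the patch):
      the limit becomes a named constant next to the TSS layout, and
      set_tss_desc() passes __KERNEL_TSS_LIMIT instead of open-coding the sum.

       #define __KERNEL_TSS_LIMIT \
               (IO_BITMAP_OFFSET + IO_BITMAP_BYTES + sizeof(unsigned long) - 1)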
      
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Jim Mattson <jmattson@google.com>
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 20 Feb, 2017 2 commits
  3. 18 Feb, 2017 1 commit
  4. 17 Feb, 2017 7 commits
  5. 16 Feb, 2017 6 commits
  6. 15 Feb, 2017 15 commits