1. 11 Mar, 2011 4 commits
    • Christoph Lameter's avatar
      Lockless (and preemptless) fastpaths for slub · 8a5ec0ba
      Christoph Lameter authored
      Use the this_cpu_cmpxchg_double functionality to implement a lockless
      allocation algorithm on arches that support fast this_cpu_ops.
      
      Each of the per cpu pointers is paired with a transaction id that ensures
      that updates of the per cpu information can only occur in sequence on
      a certain cpu.
      
      A transaction id is a "long" integer that is comprised of an event number
      and the cpu number. The event number is incremented for every change to the
      per cpu state. This means that the cmpxchg instruction can verify for an
      update that nothing interfered and that we are updating the percpu structure
      for the processor where we picked up the information and that we are also
      currently on that processor when we update the information.
      
      This results in a significant decrease of the overhead in the fastpaths. It
      also makes it easy to adopt the fast path for realtime kernels since this
      is lockless and does not require the use of the current per cpu area
      over the critical section. It is only important that the per cpu area is
      current at the beginning of the critical section and at the end.
      
      So there is no need even to disable preemption.
      
      Test results show that the fastpath cycle count is reduced by up to ~ 40%
      (alloc/free test goes from ~140 cycles down to ~80). The slowpath for kfree
      adds a few cycles.
      
      Sadly this does nothing for the slowpath which is where the main issues with
      performance in slub are but the best case performance rises significantly.
      (For that see the more complex slub patches that require cmpxchg_double)
      
      Kmalloc: alloc/free test
      
      Before:
      
      10000 times kmalloc(8)/kfree -> 134 cycles
      10000 times kmalloc(16)/kfree -> 152 cycles
      10000 times kmalloc(32)/kfree -> 144 cycles
      10000 times kmalloc(64)/kfree -> 142 cycles
      10000 times kmalloc(128)/kfree -> 142 cycles
      10000 times kmalloc(256)/kfree -> 132 cycles
      10000 times kmalloc(512)/kfree -> 132 cycles
      10000 times kmalloc(1024)/kfree -> 135 cycles
      10000 times kmalloc(2048)/kfree -> 135 cycles
      10000 times kmalloc(4096)/kfree -> 135 cycles
      10000 times kmalloc(8192)/kfree -> 144 cycles
      10000 times kmalloc(16384)/kfree -> 754 cycles
      
      After:
      
      10000 times kmalloc(8)/kfree -> 78 cycles
      10000 times kmalloc(16)/kfree -> 78 cycles
      10000 times kmalloc(32)/kfree -> 82 cycles
      10000 times kmalloc(64)/kfree -> 88 cycles
      10000 times kmalloc(128)/kfree -> 79 cycles
      10000 times kmalloc(256)/kfree -> 79 cycles
      10000 times kmalloc(512)/kfree -> 85 cycles
      10000 times kmalloc(1024)/kfree -> 82 cycles
      10000 times kmalloc(2048)/kfree -> 82 cycles
      10000 times kmalloc(4096)/kfree -> 85 cycles
      10000 times kmalloc(8192)/kfree -> 82 cycles
      10000 times kmalloc(16384)/kfree -> 706 cycles
      
      Kmalloc: Repeatedly allocate then free test
      
      Before:
      
      10000 times kmalloc(8) -> 211 cycles kfree -> 113 cycles
      10000 times kmalloc(16) -> 174 cycles kfree -> 115 cycles
      10000 times kmalloc(32) -> 235 cycles kfree -> 129 cycles
      10000 times kmalloc(64) -> 222 cycles kfree -> 120 cycles
      10000 times kmalloc(128) -> 343 cycles kfree -> 139 cycles
      10000 times kmalloc(256) -> 827 cycles kfree -> 147 cycles
      10000 times kmalloc(512) -> 1048 cycles kfree -> 272 cycles
      10000 times kmalloc(1024) -> 2043 cycles kfree -> 528 cycles
      10000 times kmalloc(2048) -> 4002 cycles kfree -> 571 cycles
      10000 times kmalloc(4096) -> 7740 cycles kfree -> 628 cycles
      10000 times kmalloc(8192) -> 8062 cycles kfree -> 850 cycles
      10000 times kmalloc(16384) -> 8895 cycles kfree -> 1249 cycles
      
      After:
      
      10000 times kmalloc(8) -> 190 cycles kfree -> 129 cycles
      10000 times kmalloc(16) -> 76 cycles kfree -> 123 cycles
      10000 times kmalloc(32) -> 126 cycles kfree -> 124 cycles
      10000 times kmalloc(64) -> 181 cycles kfree -> 128 cycles
      10000 times kmalloc(128) -> 310 cycles kfree -> 140 cycles
      10000 times kmalloc(256) -> 809 cycles kfree -> 165 cycles
      10000 times kmalloc(512) -> 1005 cycles kfree -> 269 cycles
      10000 times kmalloc(1024) -> 1999 cycles kfree -> 527 cycles
      10000 times kmalloc(2048) -> 3967 cycles kfree -> 570 cycles
      10000 times kmalloc(4096) -> 7658 cycles kfree -> 637 cycles
      10000 times kmalloc(8192) -> 8111 cycles kfree -> 859 cycles
      10000 times kmalloc(16384) -> 8791 cycles kfree -> 1173 cycles
      Signed-off-by: default avatarChristoph Lameter <cl@linux.com>
      Signed-off-by: default avatarPekka Enberg <penberg@kernel.org>
      8a5ec0ba
    • Christoph Lameter's avatar
      slub: Get rid of slab_free_hook_irq() · d3f661d6
      Christoph Lameter authored
      The following patch will make the fastpaths lockless and will no longer
      require interrupts to be disabled. Calling the free hook with irq disabled
      will no longer be possible.
      
      Move the slab_free_hook_irq() logic into slab_free_hook. Only disable
      interrupts if the features are selected that require callbacks with
      interrupts off and reenable after calls have been made.
      Signed-off-by: default avatarChristoph Lameter <cl@linux.com>
      Signed-off-by: default avatarPekka Enberg <penberg@kernel.org>
      d3f661d6
    • Christoph Lameter's avatar
      slub: min_partial needs to be in first cacheline · 1a757fe5
      Christoph Lameter authored
      It is used in unfreeze_slab() which is a performance critical
      function.
      Signed-off-by: default avatarChristoph Lameter <cl@linux.com>
      Signed-off-by: default avatarPekka Enberg <penberg@kernel.org>
      1a757fe5
    • Pekka Enberg's avatar
      Merge branch 'for-2.6.39' of... · 2a6c5176
      Pekka Enberg authored
      Merge branch 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu into slub/lockless
      2a6c5176
  2. 08 Mar, 2011 5 commits
  3. 07 Mar, 2011 9 commits
  4. 06 Mar, 2011 5 commits
  5. 05 Mar, 2011 17 commits