    [PATCH] Numa-aware slab allocator V5 · e498be7d
    Christoph Lameter authored
    The NUMA API change that introduced kmalloc_node was accepted for
    2.6.12-rc3.  It is now possible to do slab allocations on a specific
    node to localize memory structures.  This API is used by the pageset
    localization patch and the block layer localization patch, both now in
    mm.  The existing kmalloc_node is slow because it simply searches
    through all pages of the slab to find one that is on the requested
    node.  The two patches do a one-time allocation of slab structures at
    initialization, so the speed of kmalloc_node does not matter for them.
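    To make the cost of the existing approach concrete, here is a minimal
    user-space sketch of a linear search for a slab on a given node.  It
    is hypothetical and heavily simplified: struct slab_page and
    find_slab_on_node are invented names, not the kernel's structures, and
    all slabs of the cache sit on one mixed list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified slab page; NOT the kernel's struct slab. */
struct slab_page {
	int node;               /* NUMA node the page's memory lives on */
	unsigned int free_objs; /* objects still available in this slab */
	struct slab_page *next; /* one cache-wide list, all nodes mixed */
};

/*
 * Linear lookup in the style described above: walk every slab of the
 * cache until one on the requested node with a free object turns up.
 * The cost grows with the total number of slabs in the cache, not with
 * the number of slabs on the target node.  Returns NULL if no suitable
 * slab exists; *scanned reports how many slabs were visited.
 */
static struct slab_page *find_slab_on_node(struct slab_page *head, int node,
					   size_t *scanned)
{
	size_t n = 0;
	struct slab_page *p;

	assert(node >= 0);
	for (p = head; p != NULL; p = p->next) {
		n++;
		if (p->node == node && p->free_objs > 0)
			break;
	}
	if (scanned)
		*scanned = n;
	return p;
}
```

    A request for a node whose only usable slab sits at the end of the
    list visits every slab first, which is the slowness the patch removes.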
    
    This patch makes kmalloc_node as fast as kmalloc by introducing
    node-specific page lists for partial, free and full slabs.  Slab
    allocation on NUMA systems improves as a result; we see an AIM7
    performance gain of about 5% from this patch alone.
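    The per-node lists can be pictured with the following user-space
    sketch.  It is loosely modeled on the per-node kmem_list3 structure
    mentioned in the V2-V3 changelog, but every name here (struct
    node_lists, alloc_on_node, SKETCH_MAX_NODES) is hypothetical, and
    locking and per-CPU array caches are omitted.  It shows why a
    node-specific allocation becomes an array index plus a list-head check
    instead of a scan.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_NODES 4 /* hypothetical stand-in for MAX_NUMNODES */

struct slab {
	unsigned int inuse, total; /* objects handed out / capacity (> 0) */
	struct slab *next;
};

/* Per-node lists, loosely modeled on kmem_list3 (simplified). */
struct node_lists {
	struct slab *partial; /* some objects free */
	struct slab *full;    /* no objects free */
	struct slab *free;    /* all objects free */
};

struct cache {
	struct node_lists nodes[SKETCH_MAX_NODES];
};

static void list_remove(struct slab **head, struct slab *s)
{
	while (*head && *head != s)
		head = &(*head)->next;
	if (*head)
		*head = s->next;
}

static void list_push(struct slab **head, struct slab *s)
{
	s->next = *head;
	*head = s;
}

/* Take one object from the given node's lists: prefer a partial slab,
 * then a free one, and move slabs between lists as they fill up.
 * Returns the slab used, or NULL if the node has nothing available. */
static struct slab *alloc_on_node(struct cache *c, int node)
{
	struct node_lists *l;
	struct slab *s;

	assert(node >= 0 && node < SKETCH_MAX_NODES);
	l = &c->nodes[node];
	if (l->partial) {
		s = l->partial;
	} else if (l->free) {
		s = l->free;
		list_remove(&l->free, s);
		list_push(&l->partial, s);
	} else {
		return NULL; /* a real allocator would grow the cache here */
	}
	s->inuse++;
	if (s->inuse == s->total) {
		list_remove(&l->partial, s);
		list_push(&l->full, s);
	}
	return s;
}
```

    Other nodes' slabs are never touched, so the cost no longer depends on
    how many slabs the cache holds elsewhere.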
    
    More NUMA localizations become possible once kmalloc_node is as fast
    as kmalloc.
    
    Test run on a 32-processor system with 32 GB of RAM.
    
    w/o patch
    Tasks    jobs/min  jti  jobs/min/task      real       cpu
        1      485.36  100       485.3640     11.99      1.91   Sat Apr 30 14:01:51 2005
      100    26582.63   88       265.8263     21.89    144.96   Sat Apr 30 14:02:14 2005
      200    29866.83   81       149.3342     38.97    286.08   Sat Apr 30 14:02:53 2005
      300    33127.16   78       110.4239     52.71    426.54   Sat Apr 30 14:03:46 2005
      400    34889.47   80        87.2237     66.72    568.90   Sat Apr 30 14:04:53 2005
      500    35654.34   76        71.3087     81.62    714.55   Sat Apr 30 14:06:15 2005
      600    36460.83   75        60.7681     95.77    853.42   Sat Apr 30 14:07:51 2005
      700    35957.00   75        51.3671    113.30    990.67   Sat Apr 30 14:09:45 2005
      800    33380.65   73        41.7258    139.48   1140.86   Sat Apr 30 14:12:05 2005
      900    35095.01   76        38.9945    149.25   1281.30   Sat Apr 30 14:14:35 2005
     1000    36094.37   74        36.0944    161.24   1419.66   Sat Apr 30 14:17:17 2005
    
    w/patch
    Tasks    jobs/min  jti  jobs/min/task      real       cpu
        1      484.27  100       484.2736     12.02      1.93   Sat Apr 30 15:59:45 2005
      100    28262.03   90       282.6203     20.59    143.57   Sat Apr 30 16:00:06 2005
      200    32246.45   82       161.2322     36.10    282.89   Sat Apr 30 16:00:42 2005
      300    37945.80   83       126.4860     46.01    418.75   Sat Apr 30 16:01:28 2005
      400    40000.69   81       100.0017     58.20    561.48   Sat Apr 30 16:02:27 2005
      500    40976.10   78        81.9522     71.02    696.95   Sat Apr 30 16:03:38 2005
      600    41121.54   78        68.5359     84.92    834.86   Sat Apr 30 16:05:04 2005
      700    44052.77   78        62.9325     92.48    971.53   Sat Apr 30 16:06:37 2005
      800    41066.89   79        51.3336    113.38   1111.15   Sat Apr 30 16:08:31 2005
      900    38918.77   79        43.2431    134.59   1252.57   Sat Apr 30 16:10:46 2005
     1000    41842.21   76        41.8422    139.09   1392.33   Sat Apr 30 16:13:05 2005
    
    These measurements were taken directly after boot and show an
    improvement greater than 5%.  However, the improvement shrinks when
    the AIM7 runs are repeated and settles at around 5%.
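    The per-row gain in the after-boot tables can be checked with a small
    calculation.  This sketch copies only the jobs/min columns from the two
    tables; the helper names gain_percent and best_row are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* jobs/min columns copied from the two AIM7 tables above. */
static const int tasks[] = {
	1, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000
};
static const double jobs_without[] = {
	485.36, 26582.63, 29866.83, 33127.16, 34889.47, 35654.34,
	36460.83, 35957.00, 33380.65, 35095.01, 36094.37
};
static const double jobs_with[] = {
	484.27, 28262.03, 32246.45, 37945.80, 40000.69, 40976.10,
	41121.54, 44052.77, 41066.89, 38918.77, 41842.21
};
#define NROWS (sizeof(tasks) / sizeof(tasks[0]))

/* Relative throughput change in percent; negative means a slowdown. */
static double gain_percent(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}

/* Index of the row with the largest relative gain. */
static size_t best_row(void)
{
	size_t best = 0;

	for (size_t i = 1; i < NROWS; i++)
		if (gain_percent(jobs_without[i], jobs_with[i]) >
		    gain_percent(jobs_without[best], jobs_with[best]))
			best = i;
	return best;
}
```

    For 100 or more tasks this gives gains between roughly 6% (100 tasks)
    and 23% (800 tasks); the single-task run is essentially unchanged
    (about -0.2%).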
    
    Links to earlier discussions:
    http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2
    http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2
    
    Changelog V4-V5:
    - alloc_arraycache and alloc_aliencache take node parameter instead of cpu
    - fix initialization so that nodes without cpus are properly handled.
    - simplify code in kmem_cache_init
    - patch against Andrew's temp mm3 release
    - Add Shai to credits
    - fall back to __cache_alloc from __cache_alloc_node if the node's cache
      is not available yet.
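    The V5 fallback can be sketched as follows.  All names are
    hypothetical stand-ins: alloc_any plays the role of the node-agnostic
    __cache_alloc path and alloc_on_node the role of __cache_alloc_node;
    no real allocation happens, the functions only report which path
    served the request.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_NODES 4 /* hypothetical stand-in for MAX_NUMNODES */

/* Hypothetical stand-in for one node's slab lists. */
struct node_cache {
	unsigned long allocated;
};

struct cache {
	struct node_cache *nodelists[SKETCH_MAX_NODES]; /* NULL until set up */
	unsigned long fallbacks; /* requests served by the generic path */
};

/* Stand-in for the node-agnostic path; returns -1 for "no particular
 * node". */
static int alloc_any(struct cache *c)
{
	c->fallbacks++;
	return -1;
}

/* Stand-in for the node-specific path with the fallback: if the node's
 * lists have not been initialized yet, use the generic path instead of
 * failing.  Returns the node that served the request, or -1. */
static int alloc_on_node(struct cache *c, int node)
{
	assert(node >= 0 && node < SKETCH_MAX_NODES);
	if (c->nodelists[node] == NULL)
		return alloc_any(c);
	c->nodelists[node]->allocated++;
	return node;
}
```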
    
    Changelog V3-V4:
    - Patch against 2.6.12-rc5-mm1
    - Cleanup patch integrated
    - More and better use of for_each_node and for_each_cpu
    - GCC 2.95 fix (use [0] instead of [])
    - Correct determination of INDEX_AC
    - Remove hack that caused an error on platforms that have nodes but no CONFIG_NUMA.
    - Remove list3_data and list3_data_ptr macros for better readability
    
    Changelog V2-V3:
    - Patch against 2.6.12-rc4-mm1
    - Revised bootstrap mechanism so that larger kmem_list3 structs can be
      supported.  Use a generic solution so that the right slab can be
      found for the internal structs.
    - use for_each_online_node
    
    Changelog V1-V2:
    - Batching for freeing of wrong-node objects (alien caches)
    - Locking changes and NUMA #ifdefs as requested by Manfred
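    The alien-cache batching from the V1-V2 changelog can be illustrated
    with a hypothetical sketch: an object freed on a CPU whose node differs
    from the object's home node is parked in a small per-remote-node array
    and returned to that node only when the array fills, so the remote
    node's list lock is taken once per batch rather than once per object.
    The batch size ALIEN_BATCH and every name below are invented; the local
    free path is reduced to a counter.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_NODES 4
#define ALIEN_BATCH 3 /* hypothetical batch size */

struct node_state {
	size_t freed;      /* objects returned to this node's lists */
	size_t lock_taken; /* times this node's list lock was acquired */
};

struct alien_cache {
	size_t avail;
	void *entries[ALIEN_BATCH];
};

struct cpu_state {
	int node;                                   /* node this CPU is on */
	struct alien_cache alien[SKETCH_MAX_NODES]; /* one per remote node */
};

/* Return a whole batch to its home node under a single lock round trip. */
static void flush_alien(struct alien_cache *ac, struct node_state *target)
{
	if (ac->avail == 0)
		return;
	target->lock_taken++;
	target->freed += ac->avail;
	ac->avail = 0;
}

static void free_obj(struct cpu_state *cpu, struct node_state nodes[],
		     void *obj, int obj_node)
{
	struct alien_cache *ac;

	assert(obj_node >= 0 && obj_node < SKETCH_MAX_NODES);
	if (obj_node == cpu->node) {
		/* Local free path, simplified to a counter. */
		nodes[obj_node].freed++;
		return;
	}
	ac = &cpu->alien[obj_node];
	if (ac->avail == ALIEN_BATCH)
		flush_alien(ac, &nodes[obj_node]);
	ac->entries[ac->avail++] = obj;
}
```

    Seven remote frees with a batch size of three cost two lock
    acquisitions on the remote node, with one object still parked until
    the next flush.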

    Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
    Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com>
    Signed-off-by: Shai Fultheim <Shai@Scalex86.org>
    Signed-off-by: Christoph Lameter <clameter@sgi.com>
    Cc: Manfred Spraul <manfred@colorfullife.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>