    mm/slub: optimize alloc fastpath code layout · 3450a0e5
    Vlastimil Babka authored
    With allocation fastpaths no longer divided between two .c files, we
    have better inlining; however, checking the disassembly of
    kmem_cache_alloc() reveals we can still do better: make the fastpaths
    smaller and move the less common situations out of line or into
    separate functions, to reduce instruction cache pressure.
    
    - split memcg pre/post alloc hooks into inlined checks that use likely()
      to assume there will be no objcg handling necessary, and non-inline
      functions doing the actual handling (a minimal sketch of this pattern
      follows the list)
    
    - add some more likely/unlikely() to pre/post alloc hooks to indicate
      which scenarios should be out of line
    
    - change gfp_allowed_mask handling in slab_post_alloc_hook() so that
      the code can be optimized away when kasan/kmsan/kmemleak is configured
      out (a second sketch follows the list)
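
    The first two points share one pattern: an inlined likely()/unlikely()
    check on the fast path, and a separate non-inline function for the rare
    work. A minimal standalone sketch of that pattern (struct cache,
    account_object() and __account_object() are illustrative names, not the
    kernel's actual hooks):

        #include <stdbool.h>

        #define likely(x)   __builtin_expect(!!(x), 1)
        #define unlikely(x) __builtin_expect(!!(x), 0)

        /* Illustrative stand-in for the relevant kmem_cache state. */
        struct cache { bool needs_accounting; };

        /*
         * Rare path: a separate non-inlined function, so that hot
         * callers stay small.
         */
        __attribute__((noinline))
        static void __account_object(struct cache *c, void *obj)
        {
                (void)c; (void)obj;     /* the actual objcg-style handling */
        }

        /* Fast path: a single inlined, branch-hinted check. */
        static inline void account_object(struct cache *c, void *obj)
        {
                if (unlikely(c->needs_accounting))
                        __account_object(c, obj);
        }

        int main(void)
        {
                struct cache c = { .needs_accounting = false };
                int obj;

                account_object(&c, &obj);       /* common case: check only */
                return 0;
        }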
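
    The third point relies on dead-code elimination: if the
    gfp_allowed_mask adjustment happens only inside a block guarded by a
    compile-time constant, configuring the debugging features out removes
    the load and mask entirely. A hedged sketch of the idea, where
    DEBUG_HOOKS_ENABLED, debug_hook() and post_alloc_hook() are invented
    stand-ins rather than the kernel's code:

        /*
         * Stand-in for the CONFIG_KASAN/CONFIG_KMSAN/CONFIG_DEBUG_KMEMLEAK
         * checks; 0 models all three being configured out.
         */
        #ifndef DEBUG_HOOKS_ENABLED
        #define DEBUG_HOOKS_ENABLED 0
        #endif

        static unsigned int gfp_allowed_mask = ~0u;     /* illustrative */

        static void debug_hook(void *obj, unsigned int flags)
        {
                (void)obj; (void)flags; /* kasan/kmemleak-style work */
        }

        static inline void post_alloc_hook(void *obj, unsigned int flags)
        {
                /*
                 * Apply the mask only where it is consumed: when
                 * DEBUG_HOOKS_ENABLED is a compile-time 0, the whole
                 * block, including the gfp_allowed_mask load, is
                 * eliminated instead of costing the fastpath anything.
                 */
                if (DEBUG_HOOKS_ENABLED) {
                        flags &= gfp_allowed_mask;
                        debug_hook(obj, flags);
                }
        }

        int main(void)
        {
                int obj;

                post_alloc_hook(&obj, 0);
                return 0;
        }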
    
    bloat-o-meter shows:
    add/remove: 4/2 grow/shrink: 1/8 up/down: 521/-2924 (-2403)
    Function                                     old     new   delta
    __memcg_slab_post_alloc_hook                   -     461    +461
    kmem_cache_alloc_bulk                        775     791     +16
    __pfx_should_failslab.constprop                -      16     +16
    __pfx___memcg_slab_post_alloc_hook             -      16     +16
    should_failslab.constprop                      -      12     +12
    __pfx_memcg_slab_post_alloc_hook              16       -     -16
    kmem_cache_alloc_lru                        1295    1023    -272
    kmem_cache_alloc_node                       1118     817    -301
    kmem_cache_alloc                            1076     772    -304
    kmalloc_node_trace                          1149     838    -311
    kmalloc_trace                               1102     789    -313
    __kmalloc_node_track_caller                 1393    1080    -313
    __kmalloc_node                              1397    1082    -315
    __kmalloc                                   1374    1059    -315
    memcg_slab_post_alloc_hook                   464       -    -464
    
    Note that gcc still decided to inline __memcg_pre_alloc_hook(), but the
    code nevertheless ends up out of the hot path, and forcing noinline did
    not improve the results. As a result, the fastpaths are shorter and
    overall code size is reduced.

    Acked-by: David Rientjes <rientjes@google.com>
    Tested-by: David Rientjes <rientjes@google.com>
    Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
    Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
    Signed-off-by: Vlastimil Babka <vbabka@suse.cz>