Commit cc09ee80 authored by Linus Torvalds

Merge tag 'mm-slub-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux

Pull SLUB updates from Vlastimil Babka:
 "SLUB: reduce irq disabled scope and make it RT compatible

  This series was initially inspired by Mel's pcplist local_lock
  rewrite, and also by an interest in better understanding SLUB's
  locking, the new primitives, and their RT variants and implications.
  It makes SLUB
  compatible with PREEMPT_RT and generally more preemption-friendly,
  apparently without significant regressions, as the fast paths are not
  affected.

  The main changes to SLUB by this series:

   - irq disabling is now only done for the minimum amount of time needed to
     protect the strict kmem_cache_cpu fields, and as part of spin lock,
     local lock and bit lock operations to make them irq-safe

   - SLUB is fully PREEMPT_RT compatible

  The series should now be sufficiently tested in both RT and !RT
  configs, mainly thanks to Mike.

  The RFC/v1 version also got basic performance screening by Mel that
  didn't show major regressions. Mike's testing with hackbench of v2 on
  !RT reported negligible differences [6]:

    virgin(ish) tip
    5.13.0.g60ab3ed-tip
              7,320.67 msec task-clock                #    7.792 CPUs utilized            ( +-  0.31% )
               221,215      context-switches          #    0.030 M/sec                    ( +-  3.97% )
                16,234      cpu-migrations            #    0.002 M/sec                    ( +-  4.07% )
                13,233      page-faults               #    0.002 M/sec                    ( +-  0.91% )
        27,592,205,252      cycles                    #    3.769 GHz                      ( +-  0.32% )
         8,309,495,040      instructions              #    0.30  insn per cycle           ( +-  0.37% )
         1,555,210,607      branches                  #  212.441 M/sec                    ( +-  0.42% )
             5,484,209      branch-misses             #    0.35% of all branches          ( +-  2.13% )

               0.93949 +- 0.00423 seconds time elapsed  ( +-  0.45% )
               0.94608 +- 0.00384 seconds time elapsed  ( +-  0.41% ) (repeat)
               0.94422 +- 0.00410 seconds time elapsed  ( +-  0.43% )

    5.13.0.g60ab3ed-tip +slub-local-lock-v2r3
              7,343.57 msec task-clock                #    7.776 CPUs utilized            ( +-  0.44% )
               223,044      context-switches          #    0.030 M/sec                    ( +-  3.02% )
                16,057      cpu-migrations            #    0.002 M/sec                    ( +-  4.03% )
                13,164      page-faults               #    0.002 M/sec                    ( +-  0.97% )
        27,684,906,017      cycles                    #    3.770 GHz                      ( +-  0.45% )
         8,323,273,871      instructions              #    0.30  insn per cycle           ( +-  0.28% )
         1,556,106,680      branches                  #  211.901 M/sec                    ( +-  0.31% )
             5,463,468      branch-misses             #    0.35% of all branches          ( +-  1.33% )

               0.94440 +- 0.00352 seconds time elapsed  ( +-  0.37% )
               0.94830 +- 0.00228 seconds time elapsed  ( +-  0.24% ) (repeat)
               0.93813 +- 0.00440 seconds time elapsed  ( +-  0.47% ) (repeat)

  RT configs showed some throughput regressions, but that's an expected
  tradeoff for the preemption improvements through the RT mutex. It
  didn't prevent v2 from being incorporated into the 5.13 RT tree [7],
  leading to testing exposure and bugfixes.

  Before the series, SLUB is lockless in both allocation and free fast
  paths, but elsewhere, it's disabling irqs for considerable periods of
  time - especially in allocation slowpath and the bulk allocation,
  where IRQs are re-enabled only when a new page from the page allocator
  is needed, and the context allows blocking. The irq disabled sections
  can then include deactivate_slab() which walks a full freelist and
  frees the slab back to page allocator or unfreeze_partials() going
  through a list of percpu partial slabs. The RT tree currently has some
  patches mitigating these, but we can do much better in mainline too.

  Patches 1-6 are straightforward improvements or cleanups that could
  exist outside of this series too, but are prerequisites.

  Patches 7-9 are also preparatory code changes without functional
  changes, but not so useful without the rest of the series.

  Patch 10 simplifies the fast paths on systems with preemption, based
  on the (hopefully correct) observation that the current loops to verify
  tid are unnecessary.

  Patches 11-20 focus on reducing irq disabled scope in the allocation
  slowpath:

   - patch 11 moves disabling of irqs into ___slab_alloc() from its
     callers, which are the allocation slowpath, and bulk allocation.
     Instead these callers only disable preemption to stabilize the cpu.

   - The following patches then gradually reduce the scope of disabled
     irqs in ___slab_alloc() and the functions called from there. As of
     patch 14, the re-enabling of irqs based on gfp flags before calling
     the page allocator is removed from allocate_slab(). As of patch 17,
     it's possible to reach the page allocator (in case of existing
     slabs depleted) without disabling and re-enabling irqs a single
     time.

  Patches 21-26 reduce the scope of disabled irqs in functions related
  to unfreezing percpu partial slab.
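
  The general pattern can be sketched in userspace C: detach the whole
  list inside a short protected section, then walk it afterwards with
  irqs enabled. This is only an illustration under stated assumptions,
  not the kernel's unfreeze_partials(); fake_slab, cpu_partial,
  mock_irq_save() and mock_irq_restore() are hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a slab on a percpu partial list. */
struct fake_slab {
	struct fake_slab *next;
};

/* Mock state: nesting depth of the simulated irq disabled section. */
static int irqs_off;
/* Counts slabs that were processed while the mock section was held. */
static int processed_with_irqs_off;
/* Hypothetical percpu partial list head. */
static struct fake_slab *cpu_partial;

static void mock_irq_save(void)    { irqs_off++; }
static void mock_irq_restore(void) { irqs_off--; }

/* Detach the whole list at once; only this part is "irq disabled". */
static struct fake_slab *detach_partial_list(void)
{
	struct fake_slab *list;

	mock_irq_save();
	list = cpu_partial;
	cpu_partial = NULL;
	mock_irq_restore();
	return list;
}

/* Walk the detached list outside the section; returns slabs visited. */
static int process_detached(struct fake_slab *list)
{
	int n = 0;

	while (list) {
		struct fake_slab *next = list->next;

		if (irqs_off)
			processed_with_irqs_off++;
		list->next = NULL;
		n++;
		list = next;
	}
	return n;
}
```

  The protected section shrinks to two pointer operations, and the
  per-slab work no longer contributes to irq latency.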

  Patch 27 is preparatory. Patch 28 is adopted from the RT tree and
  converts the flushing of percpu slabs on all cpus from using IPI to
  workqueue, so that the processing isn't happening with irqs disabled
  in the IPI handler. The flushing is not performance critical so it
  should be acceptable.

  Patch 29 also comes from RT tree and makes object_map_lock RT
  compatible.

  Patch 30 makes slab_lock irq-safe on RT where we cannot rely on having
  irq disabled from the list_lock spin lock usage.

  Patch 31 changes kmem_cache_cpu->partial handling in put_cpu_partial()
  from cmpxchg loop to a short irq disabled section, which is used by
  all other code modifying the field. This addresses a theoretical race
  scenario pointed out by Jann, and makes the critical section safe wrt
  RT local_lock semantics after the conversion in patch 33.
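
  The two update styles contrasted here can be illustrated with a
  userspace sketch using C11 atomics. It is not the kernel's
  put_cpu_partial(): fake_page, partial_head, push_cmpxchg() and
  push_exclusive() are hypothetical names, and an atomic_flag spin
  section merely stands in for the short irq disabled section.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a page on the percpu partial list. */
struct fake_page {
	struct fake_page *next;
	int id;
};

/* Hypothetical stand-in for the kmem_cache_cpu->partial field. */
static _Atomic(struct fake_page *) partial_head;

/* Stand-in for the short exclusive (irq disabled) section. */
static atomic_flag section = ATOMIC_FLAG_INIT;

/*
 * cmpxchg-loop style: retry until the head did not change between
 * reading it and publishing the new entry. On failure the compare
 * exchange reloads 'old' with the current head.
 */
static void push_cmpxchg(struct fake_page *page)
{
	struct fake_page *old = atomic_load(&partial_head);

	do {
		page->next = old;
	} while (!atomic_compare_exchange_weak(&partial_head, &old, page));
}

/*
 * Exclusive-section style: every updater of the field enters the same
 * short section, so a plain read-modify-write is sufficient.
 */
static void push_exclusive(struct fake_page *page)
{
	while (atomic_flag_test_and_set(&section))
		;	/* spin until the section is free */
	page->next = atomic_load(&partial_head);
	atomic_store(&partial_head, page);
	atomic_flag_clear(&section);
}

/* Count the entries reachable from the current head. */
static int partial_count(void)
{
	int n = 0;

	for (struct fake_page *p = atomic_load(&partial_head); p; p = p->next)
		n++;
	return n;
}
```

  Mixing both styles on one field, as the test code might do
  single-threaded, is exactly what is unsafe under concurrency; the
  point of the change described above is that all updaters use the
  same mechanism.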

  Patch 32 changes preempt disable to migrate disable, so that the
  nested list_lock spinlock is safe to take on RT. Because
  migrate_disable() is a function call even on !RT, a small set of
  private wrappers is introduced to keep using the cheaper
  preempt_disable() on !PREEMPT_RT configurations. As of this patch,
  SLUB should be already compatible with RT's lock semantics.
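
  The idea of a config-selected private wrapper can be sketched as
  follows. This is a userspace mock under stated assumptions, not the
  wrappers added by the patch: MOCK_PREEMPT_RT stands in for
  CONFIG_PREEMPT_RT, pin_to_cpu() is a hypothetical name, and the two
  mock functions only count calls instead of touching the scheduler.

```c
#include <assert.h>

/* Mock counters standing in for the effect of the real primitives. */
static int preempt_disable_calls;
static int migrate_disable_calls;

/* Mock of the cheaper primitive used on !RT configurations. */
void mock_preempt_disable(void) { preempt_disable_calls++; }

/* Mock of the primitive that keeps the task preemptible on RT. */
void mock_migrate_disable(void) { migrate_disable_calls++; }

/*
 * Hypothetical private wrapper: the choice is made at compile time,
 * so callers are written once and neither config pays for the other.
 */
#ifdef MOCK_PREEMPT_RT
#define pin_to_cpu()	mock_migrate_disable()
#else
#define pin_to_cpu()	mock_preempt_disable()
#endif
```

  Built without -DMOCK_PREEMPT_RT, a call to pin_to_cpu() expands to
  the preempt disable mock; built with it, to the migrate disable
  mock.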

  Finally, patch 33 replaces the irq disabled sections that protect
  kmem_cache_cpu fields in the slow paths with a local lock. However on
  PREEMPT_RT it means the lockless fast paths can now preempt slow paths
  which don't expect that, so the local lock has to be taken also in the
  fast paths and they are no longer lockless. RT folks seem to not mind
  this tradeoff. The patch also updates the locking documentation in the
  file's comment"

Mike Galbraith and Mel Gorman verified that their earlier testing
observations still hold for the final series:

Link: https://lore.kernel.org/lkml/89ba4f783114520c167cc915ba949ad2c04d6790.camel@gmx.de/
Link: https://lore.kernel.org/lkml/20210907082010.GB3959@techsingularity.net/

* tag 'mm-slub-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux: (33 commits)
  mm, slub: convert kmem_cpu_slab protection to local_lock
  mm, slub: use migrate_disable() on PREEMPT_RT
  mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg
  mm, slub: make slab_lock() disable irqs with PREEMPT_RT
  mm: slub: make object_map_lock a raw_spinlock_t
  mm: slub: move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context
  mm, slab: split out the cpu offline variant of flush_slab()
  mm, slub: don't disable irqs in slub_cpu_dead()
  mm, slub: only disable irq with spin_lock in __unfreeze_partials()
  mm, slub: separate detaching of partial list in unfreeze_partials() from unfreezing
  mm, slub: detach whole partial list at once in unfreeze_partials()
  mm, slub: discard slabs in unfreeze_partials() without irqs disabled
  mm, slub: move irq control into unfreeze_partials()
  mm, slub: call deactivate_slab() without disabling irqs
  mm, slub: make locking in deactivate_slab() irq-safe
  mm, slub: move reset of c->page and freelist out of deactivate_slab()
  mm, slub: stop disabling irqs around get_partial()
  mm, slub: check new pages with restored irqs
  mm, slub: validate slab from partial list or page allocator before making it cpu slab
  mm, slub: restore irqs around calling new_slab()
  ...
parents 49832c81 bd0e7491
@@ -778,6 +778,15 @@ static inline int PageSlabPfmemalloc(struct page *page)
 	return PageActive(page);
 }

+/*
+ * A version of PageSlabPfmemalloc() for opportunistic checks where the page
+ * might have been freed under us and not be a PageSlab anymore.
+ */
+static inline int __PageSlabPfmemalloc(struct page *page)
+{
+	return PageActive(page);
+}
+
 static inline void SetPageSlabPfmemalloc(struct page *page)
 {
 	VM_BUG_ON_PAGE(!PageSlab(page), page);

@@ -10,6 +10,7 @@
 #include <linux/kfence.h>
 #include <linux/kobject.h>
 #include <linux/reciprocal_div.h>
+#include <linux/local_lock.h>

 enum stat_item {
 	ALLOC_FASTPATH,		/* Allocation from cpu slab */
@@ -40,6 +41,10 @@ enum stat_item {
 	CPU_PARTIAL_DRAIN,	/* Drain cpu partial to node partial */
 	NR_SLUB_STAT_ITEMS };

+/*
+ * When changing the layout, make sure freelist and tid are still compatible
+ * with this_cpu_cmpxchg_double() alignment requirements.
+ */
 struct kmem_cache_cpu {
 	void **freelist;	/* Pointer to next available object */
 	unsigned long tid;	/* Globally unique transaction id */
@@ -47,6 +52,7 @@ struct kmem_cache_cpu {
 #ifdef CONFIG_SLUB_CPU_PARTIAL
 	struct page *partial;	/* Partially allocated frozen slabs */
 #endif
+	local_lock_t lock;	/* Protects the fields above */
 #ifdef CONFIG_SLUB_STATS
 	unsigned stat[NR_SLUB_STAT_ITEMS];
 #endif

@@ -502,6 +502,7 @@ void kmem_cache_destroy(struct kmem_cache *s)
 	if (unlikely(!s))
 		return;

+	cpus_read_lock();
 	mutex_lock(&slab_mutex);

 	s->refcount--;
@@ -516,6 +517,7 @@ void kmem_cache_destroy(struct kmem_cache *s)
 	}
 out_unlock:
 	mutex_unlock(&slab_mutex);
+	cpus_read_unlock();
 }
 EXPORT_SYMBOL(kmem_cache_destroy);