    mm: shrinker: add infrastructure for dynamically allocating shrinker · c42d50ae
    Qi Zheng authored
    Patch series "use refcount+RCU method to implement lockless slab shrink",
    v6.
    
    1. Background
    =============
    
    We previously implemented the lockless slab shrink with SRCU [1], but the kernel
    test robot then reported a -88.8% regression in the
    stress-ng.ramfs.ops_per_sec test case [2], so we reverted it [3].
    
    This patch series aims to re-implement the lockless slab shrink using the
    refcount+RCU method proposed by Dave Chinner [4].
    
    [1]. https://lore.kernel.org/lkml/20230313112819.38938-1-zhengqi.arch@bytedance.com/
    [2]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
    [3]. https://lore.kernel.org/all/20230609081518.3039120-1-qi.zheng@linux.dev/
    [4]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/
    
    2. Implementation
    =================
    
    Currently, the shrinker instances can be divided into the following three types:
    
    a) global shrinker instance statically defined in the kernel, such as
       workingset_shadow_shrinker.
    
    b) global shrinker instance statically defined in the kernel modules, such as
       mmu_shrinker in x86.
    
    c) shrinker instance embedded in other structures.
    
    For case a, the memory of shrinker instance is never freed. For case b, the
    memory of shrinker instance will be freed after synchronize_rcu() when the
    module is unloaded. For case c, the memory of shrinker instance will be freed
    along with the structure it is embedded in.
    
    In preparation for implementing lockless slab shrink, we need to dynamically
    allocate the shrinker instances in case c, so that their memory can be freed
    independently of the containing structure by calling kfree_rcu().
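
To illustrate why case c ties the shrinker's lifetime to its container, here is a minimal userspace sketch of the embedded pattern. All names (mock_shrinker, mock_super, mock_container_of, mock_super_count) are hypothetical stand-ins, not kernel code; the callback recovers the outer structure from the embedded member, so the shrinker memory cannot outlive or be freed separately from that structure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct shrinker; not the kernel type. */
struct mock_shrinker {
    int seeks;
};

/* Hypothetical owner with the shrinker embedded by value (case c). */
struct mock_super {
    unsigned long nr_cached;
    struct mock_shrinker shrink;    /* freed together with mock_super */
};

/* Userspace equivalent of the container_of idiom. */
#define mock_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* The callback only receives the shrinker and walks back to its owner. */
static unsigned long mock_super_count(struct mock_shrinker *s)
{
    struct mock_super *sb;

    assert(s != NULL);
    sb = mock_container_of(s, struct mock_super, shrink);
    return sb->nr_cached;
}
```

Because the shrinker is a member of mock_super here, freeing the owner frees the shrinker at the same moment, which is the coupling that dynamic allocation removes.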
    
    This patchset adds the following new APIs for dynamically allocating shrinkers,
    and adds a private_data field to struct shrinker to record and retrieve the
    structure the shrinker was originally embedded in.
    
    1. shrinker_alloc()
    2. shrinker_register()
    3. shrinker_free()
    
    In order to simplify shrinker-related APIs and make shrinkers more independent
    of other kernel mechanisms, this patchset uses the above APIs to convert all
    shrinkers (including cases a and b) to dynamically allocated ones, and then
    removes all existing APIs. This has another advantage, mentioned by Dave Chinner:
    
    ```
    The other advantage of this is that it will break all the existing out of tree
    code and third party modules using the old API and will no longer work with a
    kernel using lockless slab shrinkers. They need to break (both at the source and
    binary levels) to stop bad things from happening due to using unconverted
    shrinkers in the new setup.
    ```
    
    Then we free the shrinker by calling call_rcu(), and use rcu_read_{lock,unlock}()
    to ensure that the shrinker instance is valid. And the shrinker::refcount
    mechanism ensures that the shrinker instance will not be run again after
    unregistration. So the structure that records the pointer of shrinker instance
    can be safely freed without waiting for the RCU read-side critical section.
    
    In this way, when we implement the lockless slab shrink, we don't need to
    block in unregister_shrinker() waiting for RCU read-side critical sections.
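
The refcount half of this scheme can be sketched in userspace. This is not the kernel implementation: struct mock_shrinker, mock_try_get() and mock_put() are hypothetical names, C11 atomics stand in for the kernel's reference counting, and the RCU-deferred free is left out entirely. The point shown is only that a reference can be taken while the count is non-zero, and that once the count reaches zero the instance can never be picked up again.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct shrinker; not the kernel type. */
struct mock_shrinker {
    atomic_int refcount;    /* 0 means "unregistered, must not be run" */
};

/* Take a reference only if the count is still non-zero. */
static bool mock_try_get(struct mock_shrinker *s)
{
    int old = atomic_load(&s->refcount);

    while (old != 0) {
        /* On failure, old is updated to the current value and we retry. */
        if (atomic_compare_exchange_weak(&s->refcount, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference; returns true when the last reference is gone. */
static bool mock_put(struct mock_shrinker *s)
{
    int old = atomic_fetch_sub(&s->refcount, 1);

    assert(old > 0);
    return old == 1;
}
```

In this sketch, registration would set the count to 1, a reclaim path would bracket each callback with mock_try_get() and mock_put(), and unregistration would drop the initial reference; the caller that sees mock_put() return true knows no further runs can start.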
    
    PATCH 1: introduce new APIs
    PATCH 2~38: convert all shrinkers to use new APIs
    PATCH 39: remove old APIs
    PATCH 40~41: some cleanups and preparations
    PATCH 42~43: implement the lockless slab shrink
    PATCH 44~45: convert shrinker_rwsem to mutex
    
    3. Testing
    ==========
    
    3.1 slab shrink stress test
    ---------------------------
    
    We can reproduce the down_read_trylock() hotspot through the following script:
    
    ```
    #!/bin/bash

    DIR="/root/shrinker/memcg/mnt"
    
    do_create()
    {
        mkdir -p /sys/fs/cgroup/memory/test
        echo 4G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
        for i in `seq 0 $1`;
        do
            mkdir -p /sys/fs/cgroup/memory/test/$i;
            echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
            mkdir -p $DIR/$i;
        done
    }
    
    do_mount()
    {
        for i in `seq $1 $2`;
        do
            mount -t tmpfs $i $DIR/$i;
        done
    }
    
    do_touch()
    {
        for i in `seq $1 $2`;
        do
            echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
            dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
        done
    }
    
    case "$1" in
      touch)
        do_touch $2 $3
        ;;
      test)
        do_create 4000
        do_mount 0 4000
        do_touch 0 3000
        ;;
      *)
        exit 1
        ;;
    esac
    ```
    
    Save the above script, then run it with the test argument, followed by the
    touch argument (which takes a start and an end index). Then we can use the
    following perf command to view hotspots:
    
    perf top -U -F 999
    
    1) Before applying this patchset:
    
      33.15%  [kernel]          [k] down_read_trylock
      25.38%  [kernel]          [k] shrink_slab
      21.75%  [kernel]          [k] up_read
       4.45%  [kernel]          [k] _find_next_bit
       2.27%  [kernel]          [k] do_shrink_slab
       1.80%  [kernel]          [k] intel_idle_irq
       1.79%  [kernel]          [k] shrink_lruvec
       0.67%  [kernel]          [k] xas_descend
       0.41%  [kernel]          [k] mem_cgroup_iter
       0.40%  [kernel]          [k] shrink_node
       0.38%  [kernel]          [k] list_lru_count_one
    
    2) After applying this patchset:
    
      64.56%  [kernel]          [k] shrink_slab
      12.18%  [kernel]          [k] do_shrink_slab
       3.30%  [kernel]          [k] __rcu_read_unlock
       2.61%  [kernel]          [k] shrink_lruvec
       2.49%  [kernel]          [k] __rcu_read_lock
       1.93%  [kernel]          [k] intel_idle_irq
       0.89%  [kernel]          [k] shrink_node
       0.81%  [kernel]          [k] mem_cgroup_iter
       0.77%  [kernel]          [k] mem_cgroup_calculate_protection
       0.66%  [kernel]          [k] list_lru_count_one
    
    We can see that down_read_trylock and up_read no longer appear among the
    hotspots, and that the top perf hotspot becomes shrink_slab, which is what we
    expect.
    
    3.2 registration and unregistration stress test
    -----------------------------------------------
    
    Run the command below to test:
    
    stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
    
    1) Before applying this patchset:
    
    setting to a 60 second run per stressor
    dispatching hogs: 9 ramfs
    stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
                              (secs)    (secs)    (secs)   (real time) (usr+sys time)
    ramfs            473062     60.00      8.00    279.13      7884.12        1647.59
    for a 60.01s run time:
       1440.34s available CPU time
          7.99s user time   (  0.55%)
        279.13s system time ( 19.38%)
        287.12s total time  ( 19.93%)
    load average: 7.12 2.99 1.15
    successful run completed in 60.01s (1 min, 0.01 secs)
    
    2) After applying this patchset:
    
    setting to a 60 second run per stressor
    dispatching hogs: 9 ramfs
    stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
                              (secs)    (secs)    (secs)   (real time) (usr+sys time)
    ramfs            477165     60.00      8.13    281.34      7952.55        1648.40
    for a 60.01s run time:
       1440.33s available CPU time
          8.12s user time   (  0.56%)
        281.34s system time ( 19.53%)
        289.46s total time  ( 20.10%)
    load average: 6.98 3.03 1.19
    successful run completed in 60.01s (1 min, 0.01 secs)
    
    We can see that the ops/s has hardly changed (7884.12 before versus 7952.55
    after, in real time).
    
    
    This patch (of 45):
    
    Currently, the shrinker instances can be divided into the following three
    types:
    
    a) global shrinker instance statically defined in the kernel, such as
       workingset_shadow_shrinker.
    
    b) global shrinker instance statically defined in the kernel modules, such
       as mmu_shrinker in x86.
    
    c) shrinker instance embedded in other structures.
    
    For case a, the memory of shrinker instance is never freed. For case b,
    the memory of shrinker instance will be freed after synchronize_rcu() when
    the module is unloaded. For case c, the memory of shrinker instance will
    be freed along with the structure it is embedded in.
    
    In preparation for implementing lockless slab shrink, we need to
    dynamically allocate the shrinker instances in case c, so that their
    memory can be freed independently of the containing structure by calling
    kfree_rcu().
    
    So this commit adds the following new APIs for dynamically allocating
    shrinkers, and adds a private_data field to struct shrinker to record and
    retrieve the structure the shrinker was originally embedded in.
    
    1. shrinker_alloc()
    
    Used to allocate the shrinker instance itself and its related memory. It
    returns a pointer to the shrinker instance on success and NULL on failure.
    
    2. shrinker_register()
    
    Used to register the shrinker instance, which is the same as the current
    register_shrinker_prepared().
    
    3. shrinker_free()
    
    Used to unregister (if needed) and free the shrinker instance.
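
The lifecycle described above can be sketched with a userspace mock. Everything here is an assumption made for illustration: the mock_ functions are simplified stand-ins that take no flags or name arguments, the registered flag replaces the real registration machinery, and free() stands in for the deferred free. What the sketch does show is the ownership change: the owner holds a pointer to a separately allocated shrinker, and the callback reaches the owner through private_data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace mock of a dynamically allocated shrinker. */
struct mock_shrinker {
    unsigned long (*count_objects)(struct mock_shrinker *s);
    void *private_data;     /* points back at the owning structure */
    bool registered;
};

/* Hypothetical owner: holds a pointer, the shrinker is no longer embedded. */
struct mock_cache {
    unsigned long nr_objects;
    struct mock_shrinker *shrinker;
};

/* Returns a zeroed instance on success and NULL on failure. */
static struct mock_shrinker *mock_shrinker_alloc(void)
{
    return calloc(1, sizeof(struct mock_shrinker));
}

static void mock_shrinker_register(struct mock_shrinker *s)
{
    s->registered = true;
}

/* Unregisters if needed, then frees the instance. */
static void mock_shrinker_free(struct mock_shrinker *s)
{
    if (!s)
        return;
    if (s->registered)
        s->registered = false;
    free(s);
}

/* The callback recovers its owner from private_data. */
static unsigned long mock_cache_count(struct mock_shrinker *s)
{
    struct mock_cache *cache = s->private_data;

    assert(cache != NULL);
    return cache->nr_objects;
}

static int mock_cache_init(struct mock_cache *cache, unsigned long nr)
{
    cache->nr_objects = nr;
    cache->shrinker = mock_shrinker_alloc();
    if (!cache->shrinker)
        return -1;

    /* Fill in the callbacks and back-pointer before registering. */
    cache->shrinker->count_objects = mock_cache_count;
    cache->shrinker->private_data = cache;
    mock_shrinker_register(cache->shrinker);
    return 0;
}

static void mock_cache_exit(struct mock_cache *cache)
{
    mock_shrinker_free(cache->shrinker);
    cache->shrinker = NULL;
}
```

Splitting allocation from registration lets the owner finish setting up callbacks and private_data before the shrinker becomes visible, and a single free call covers both the registered and the never-registered error paths.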
    
    In order to simplify shrinker-related APIs and make shrinkers more
    independent of other kernel mechanisms, subsequent patches will use
    the above APIs to convert all shrinkers (including cases a and b) to
    dynamically allocated ones, and then remove all existing APIs.
    
    This will also have another advantage mentioned by Dave Chinner:
    
    ```
    The other advantage of this is that it will break all the existing
    out of tree code and third party modules using the old API and will
    no longer work with a kernel using lockless slab shrinkers. They
    need to break (both at the source and binary levels) to stop bad
    things from happening due to using unconverted shrinkers in the new
    setup.
    ```
    
    [zhengqi.arch@bytedance.com: mm: shrinker: some cleanup]
      Link: https://lkml.kernel.org/r/20230919024607.65463-1-zhengqi.arch@bytedance.com
    Link: https://lkml.kernel.org/r/20230911094444.68966-1-zhengqi.arch@bytedance.com
    Link: https://lkml.kernel.org/r/20230911094444.68966-2-zhengqi.arch@bytedance.com
    Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Chuck Lever <cel@kernel.org>
    Cc: Darrick J. Wong <djwong@kernel.org>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Kirill Tkhai <tkhai@ya.ru>
    Cc: Paul E. McKenney <paulmck@kernel.org>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
    Cc: Steven Price <steven.price@arm.com>
    Cc: Theodore Ts'o <tytso@mit.edu>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
    Cc: Alasdair Kergon <agk@redhat.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
    Cc: Andreas Dilger <adilger.kernel@dilger.ca>
    Cc: Andreas Gruenbacher <agruenba@redhat.com>
    Cc: Anna Schumaker <anna@kernel.org>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Bob Peterson <rpeterso@redhat.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Carlos Llamas <cmllamas@google.com>
    Cc: Chandan Babu R <chandan.babu@oracle.com>
    Cc: Chao Yu <chao@kernel.org>
    Cc: Chris Mason <clm@fb.com>
    Cc: Christian Koenig <christian.koenig@amd.com>
    Cc: Coly Li <colyli@suse.de>
    Cc: Dai Ngo <Dai.Ngo@oracle.com>
    Cc: Daniel Vetter <daniel@ffwll.ch>
    Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: David Airlie <airlied@gmail.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Sterba <dsterba@suse.com>
    Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
    Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
    Cc: Huang Rui <ray.huang@amd.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jaegeuk Kim <jaegeuk@kernel.org>
    Cc: Jani Nikula <jani.nikula@linux.intel.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jason Wang <jasowang@redhat.com>
    Cc: Jeff Layton <jlayton@kernel.org>
    Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
    Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Cc: Josef Bacik <josef@toxicpanda.com>
    Cc: Juergen Gross <jgross@suse.com>
    Cc: Kent Overstreet <kent.overstreet@gmail.com>
    Cc: Marijn Suijten <marijn.suijten@somainline.org>
    Cc: "Michael S. Tsirkin" <mst@redhat.com>
    Cc: Mike Snitzer <snitzer@kernel.org>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Neil Brown <neilb@suse.de>
    Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
    Cc: Olga Kornievskaia <kolga@netapp.com>
    Cc: Richard Weinberger <richard@nod.at>
    Cc: Rob Clark <robdclark@gmail.com>
    Cc: Rob Herring <robh@kernel.org>
    Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
    Cc: Sean Paul <sean@poorly.run>
    Cc: Song Liu <song@kernel.org>
    Cc: Stefano Stabellini <sstabellini@kernel.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
    Cc: Tom Talpey <tom@talpey.com>
    Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
    Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
    Cc: Yue Hu <huyue2@coolpad.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>