    module: avoid allocation if module is already present and ready · 064f4536
    Luis Chamberlain authored
    The finit_module() system call can create unnecessary virtual memory
    pressure for duplicate modules. This is because load_module() can in
    the worst case allocate more than twice the size of a module in
    virtual memory. By rejecting duplicates as soon as we can validate the
    module name in the read module structure, we save at least one full
    module's worth of wasted vmalloc space.
    
    This can only be an issue if a system is getting hammered with
    userspace module loading. There are typically two ways modules get
    loaded on systems: one is in-kernel module auto-loading
    (*request_module*() calls in-kernel) and the other is things like
    udev. The auto-loading is in-kernel, but it pings back to userspace to
    just call modprobe. We already have a way to restrict the number of
    concurrent kernel auto-loads in a given window of time, however that
    still allows multiple requests for the same module to go through and
    race two userspace threads into calling modprobe for the same exact
    module. Even though libkmod, which both modprobe and udev use, does
    check if a module is already loaded prior to calling finit_module(),
    races are still possible, and this is clearly evident today when you
    have multiple CPUs.
    
    To avoid memory pressure for such stupid cases, put a stopgap in place
    for them. The *earliest* we can detect duplicates from the module
    loader's side of things is once we have validated the module name,
    which sadly happens only after the first vmalloc allocation. We can,
    however, check for the module being present *before* the secondary
    vmalloc() allocation.
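The stopgap amounts to consulting a registry of modules that are already present (or mid-load) once the name is known, before committing to the second, larger allocation. A rough userspace sketch of the idea (illustrative Python only; `loaded_or_loading` and `load_module` are made-up names, not the kernel's):

```python
import threading

# Hypothetical registry of modules already present or currently loading,
# consulted after the module name is known but before the expensive
# second allocation.
loaded_or_loading = set()
lock = threading.Lock()

def load_module(name):
    """Return 'loaded' for the first request, 'busy' for duplicates."""
    # The first allocation already happened: it was needed to read the name.
    with lock:
        if name in loaded_or_loading:
            return "busy"   # analogous to -EBUSY: skip the second allocation
        loaded_or_loading.add(name)
    # ... the second allocation and actual module init would go here ...
    return "loaded"

# Simulate udev racing: many threads requesting the same module.
results = []
threads = [threading.Thread(target=lambda: results.append(load_module("xfs")))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # exactly one 'loaded', the rest 'busy'
```

Only one requester pays for the full load; the duplicates bail out early instead of each allocating their own copy.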
    
    There is a linear relationship between wasted virtual memory bytes and
    the CPU count. The reason is that udev ends up racing to load tons of
    the same modules, once for each of the CPUs.
    
    We can see the different linear relationships between wasted virtual
    memory and CPU count after boot in the following graph:
    
             +----------------------------------------------------------------------------+
        14GB |-+          +            +            +           +           *+          +-|
             |                                                          ****              |
             |                                                       ***                  |
             |                                                     **                     |
        12GB |-+                                                 **                     +-|
             |                                                 **                         |
             |                                               **                           |
             |                                             **                             |
             |                                           **                               |
        10GB |-+                                       **                               +-|
             |                                       **                                   |
             |                                     **                                     |
             |                                   **                                       |
         8GB |-+                               **                                       +-|
    waste    |                               **                             ###           |
             |                             **                           ####              |
             |                           **                      #######                  |
         6GB |-+                     ****                    ####                       +-|
             |                      *                    ####                             |
             |                     *                 ####                                 |
             |                *****              ####                                     |
         4GB |-+            **               ####                                       +-|
             |            **             ####                                             |
             |          **           ####                                                 |
             |        **         ####                                                     |
         2GB |-+    **      #####                                                       +-|
             |     *    ####                                                              |
             |    * ####                                                   Before ******* |
             |  **##      +            +            +           +           After ####### |
             +----------------------------------------------------------------------------+
             0            50          100          150         200          250          300
                                              CPUs count
    
    On the y-axis we can see gigabytes of wasted virtual memory during boot
    due to duplicate module requests which just end up failing. Inferring
    the slope, this ends up being about ~463 MiB per CPU lost prior to
    this patch. After this patch we only lose about ~230 MiB per CPU, for
    a total savings of about ~233 MiB per CPU. This is all *just on bootup*!
    
    On an 8 vCPU, 8 GiB RAM system using kdevops and testing against the
    selftests kmod.sh -t 0008 I see savings in *peak* memory consumption
    of up to ~84 MiB with the Linux kernel selftests kmod test 0008. With
    the new stress-ng module test I see a 145 MiB difference in max memory
    consumption with 100 ops. The stress-ng module ops tests can be pretty
    pathological -- they are not realistic, however they were used to
    finally successfully reproduce issues which had only been reported to
    happen on systems with over 400 CPUs [0], by just using 100 ops on an
    8 vCPU, 8 GiB RAM system. Running out of virtual memory space is no
    surprise given the above graph, since at least on x86_64 we're capped
    at 128 MiB; eventually we'd hit a series of errors, and one can use
    the above graph to guesstimate when. This of course will vary
    depending on the features you have enabled. So for instance, enabling
    KASAN seems to make this much worse.
    
    The results with kmod and stress-ng can be observed and visualized below.
    The time it takes to run the test is also not affected.
    
    The kmod test 0008:
    
    The gnuplot y-range is set from 400000 KiB (~390 MiB) to 580000 KiB
    (~566 MiB) given the tests peak around that range.
    
    cat kmod.plot
    set term dumb
    set output fileout
    set yrange [400000:580000]
    plot filein with linespoints title "Memory usage (KiB)"
    
    Before:
    root@kmod ~ # /data/linux-next/tools/testing/selftests/kmod/kmod.sh -t 0008
    root@kmod ~ # free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > log-0008-before.txt ^C
    root@kmod ~ # sort -n -r log-0008-before.txt | head -1
    528732
    
    So ~516.33 MiB
    
    After:
    
    root@kmod ~ # /data/linux-next/tools/testing/selftests/kmod/kmod.sh -t 0008
    root@kmod ~ # free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > log-0008-after.txt ^C
    
    root@kmod ~ # sort -n -r log-0008-after.txt | head -1
    442516
    
    So ~432.14 MiB
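
The conversions can be sanity-checked with plain arithmetic on the peaks captured above:

```python
# Double-check the KiB -> MiB conversions for the kmod 0008 runs.
before_kib = 528732  # peak from log-0008-before.txt
after_kib = 442516   # peak from log-0008-after.txt

print(round(before_kib / 1024, 2))             # 516.34 MiB before
print(round(after_kib / 1024, 2))              # 432.14 MiB after
print(round((before_kib - after_kib) / 1024))  # ~84 MiB saved
```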
    
    That's about ~84 MiB in savings in the worst case. The graphs:
    
    root@kmod ~ # gnuplot -e "filein='log-0008-before.txt'; fileout='graph-0008-before.txt'" kmod.plot
    root@kmod ~ # gnuplot -e "filein='log-0008-after.txt';  fileout='graph-0008-after.txt'"  kmod.plot
    
    root@kmod ~ # cat graph-0008-before.txt
    
      580000 +-----------------------------------------------------------------+
             |       +        +       +       +       +        +       +       |
      560000 |-+                                    Memory usage (KiB) ***A***-|
             |                                                                 |
      540000 |-+                                                             +-|
             |                                                                 |
             |        *A     *AA*AA*A*AA          *A*AA    A*A*A *AA*A*AA*A  A |
      520000 |-+A*A*AA  *AA*A           *A*AA*A*AA     *A*A     A          *A+-|
             |*A                                                               |
      500000 |-+                                                             +-|
             |                                                                 |
      480000 |-+                                                             +-|
             |                                                                 |
      460000 |-+                                                             +-|
             |                                                                 |
             |                                                                 |
      440000 |-+                                                             +-|
             |                                                                 |
      420000 |-+                                                             +-|
             |       +        +       +       +       +        +       +       |
      400000 +-----------------------------------------------------------------+
             0       5        10      15      20      25       30      35      40
    
    root@kmod ~ # cat graph-0008-after.txt
    
      580000 +-----------------------------------------------------------------+
             |       +        +       +       +       +        +       +       |
      560000 |-+                                    Memory usage (KiB) ***A***-|
             |                                                                 |
      540000 |-+                                                             +-|
             |                                                                 |
             |                                                                 |
      520000 |-+                                                             +-|
             |                                                                 |
      500000 |-+                                                             +-|
             |                                                                 |
      480000 |-+                                                             +-|
             |                                                                 |
      460000 |-+                                                             +-|
             |                                                                 |
             |          *A              *A*A                                   |
      440000 |-+A*A*AA*A  A       A*A*AA    A*A*AA*A*AA*A*AA*A*AA*AA*A*AA*A*AA-|
             |*A           *A*AA*A                                             |
      420000 |-+                                                             +-|
             |       +        +       +       +       +        +       +       |
      400000 +-----------------------------------------------------------------+
             0       5        10      15      20      25       30      35      40
    
    The stress-ng module tests:
    
    This is used to try to reproduce the vmap issues reported by David:
    
      echo 0 > /proc/sys/vm/oom_dump_tasks
      ./stress-ng --module 100 --module-name xfs
    
    Prior to this commit:
    root@kmod ~ # free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > baseline-stress-ng.txt
    root@kmod ~ # sort -n -r baseline-stress-ng.txt | head -1
    5046456
    
    After this commit:
    root@kmod ~ # free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > after-stress-ng.txt
    root@kmod ~ # sort -n -r after-stress-ng.txt | head -1
    4896972
    
    5046456 KiB - 4896972 KiB = 149484 KiB
    149484 KiB / 1024 ≈ 145.98 MiB
    
    So, using stress-ng with 100 module ops, which reproduced the reported
    vmap issue, this commit saves about 145 MiB of memory.
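
The same arithmetic check for the stress-ng peaks:

```python
# Verify the stress-ng savings computed above.
before_kib = 5046456  # peak from baseline-stress-ng.txt
after_kib = 4896972   # peak from after-stress-ng.txt

saved_mib = (before_kib - after_kib) / 1024
print(round(saved_mib, 2))  # 145.98 MiB
```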
    
    cat kmod-simple-stress-ng.plot
    set term dumb
    set output fileout
    set yrange [4700000:5070000]
    plot filein with linespoints title "Memory usage (KiB)"
    
    root@kmod ~ # gnuplot -e "filein='baseline-stress-ng.txt'; fileout='graph-stress-ng-before.txt'"  kmod-simple-stress-ng.plot
    root@kmod ~ # gnuplot -e "filein='after-stress-ng.txt'; fileout='graph-stress-ng-after.txt'"  kmod-simple-stress-ng.plot
    
    root@kmod ~ # cat graph-stress-ng-before.txt
    
               +---------------------------------------------------------------+
      5.05e+06 |-+     + A     +       +       +       +       +       +     +-|
               |         *                          Memory usage (KiB) ***A*** |
               |         *                             A                       |
         5e+06 |-+      **                            **                     +-|
               |        **                            * *    A                 |
      4.95e+06 |-+      * *                          A  *   A*               +-|
               |        * *      A       A           *  *  *  *             A  |
               |       *  *     * *     * *        *A   *  *  *      A      *  |
       4.9e+06 |-+     *  *     * A*A   * A*AA*A  A      *A    **A   **A*A  *+-|
               |       A  A*A  A    *  A       *  *      A     A *  A    * **  |
               |      *      **      **         * *              *  *    * * * |
      4.85e+06 |-+   A       A       A          **               *  *     ** *-|
               |     *                           *               * *      ** * |
               |     *                           A               * *      *  * |
       4.8e+06 |-+   *                                           * *      A  A-|
               |     *                                           * *           |
      4.75e+06 |-+  *                                            * *         +-|
               |    *                                            **            |
               |    *  +       +       +       +       +       + **    +       |
       4.7e+06 +---------------------------------------------------------------+
               0       5       10      15      20      25      30      35      40
    
    root@kmod ~ # cat graph-stress-ng-after.txt
    
               +---------------------------------------------------------------+
      5.05e+06 |-+     +       +       +       +       +       +       +     +-|
               |                                    Memory usage (KiB) ***A*** |
               |                                                               |
         5e+06 |-+                                                           +-|
               |                                                               |
      4.95e+06 |-+                                                           +-|
               |                                                               |
               |                                                               |
       4.9e+06 |-+                                      *AA                  +-|
               |  A*AA*A*A  A  A*AA*AA*A*AA*A  A  A  A*A   *AA*A*A  A  A*AA*AA |
               |  *      * **  *            *  *  ** *            ***  *       |
      4.85e+06 |-+*       ***  *            * * * ***             A *  *     +-|
               |  *       A *  *             ** * * A               *  *       |
               |  *         *  *             *  **                  *  *       |
       4.8e+06 |-+*         *  *             A   *                  *  *     +-|
               | *          * *                  A                  * *        |
      4.75e+06 |-*          * *                                     * *      +-|
               | *          * *                                     * *        |
               | *     +    * *+       +       +       +       +    * *+       |
       4.7e+06 +---------------------------------------------------------------+
               0       5       10      15      20      25      30      35      40
    
    [0] https://lkml.kernel.org/r/20221013180518.217405-1-david@redhat.com
    
    Reported-by: David Hildenbrand <david@redhat.com>
    Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>