1. 06 Oct, 2020 5 commits
  2. 18 Sep, 2020 18 commits
  3. 16 Sep, 2020 17 commits
    • Michael Ellerman's avatar
      Merge coregroup support into next · b5c8a293
      Michael Ellerman authored
      From Srikar's cover letter, with some reformatting:
      
      Cleanup of existing powerpc topologies and add coregroup support on
      powerpc. Coregroup is a group of (subset of) cores of a DIE that share
      a resource.
      
      Summary of some of the testing done with coregroup patchset.
      
      It includes ebizzy, schbench, perf bench sched pipe and topology
      verification. On the left side are results from powerpc/next tree and
      on the right are the results with the patchset applied. Topological
      verification clearly shows that there is no change in topology with
      and without the patches on all the 3 class of systems that were
      tested.
      
      Power 9 PowerNV (2 Node/ 160 Cpu System)
      ----------------------------------------
      
      Baseline                                                                Baseline + Coregroup Support
      
        N      Min       Max    Median       Avg        Stddev                  N      Min       Max    Median       Avg      Stddev
      100   993884   1276090   1173476   1165914     54867.201                100   910470   1279820   1171095   1162091    67363.28
      
      ^ ebizzy (Throughput of 100 iterations of 30 seconds higher throughput is better)
      
      schbench (latency hence lower is better)
      Latency percentiles (usec)                                              Latency percentiles (usec)
              50.0th: 455                                                             50.0th: 454
              75.0th: 533                                                             75.0th: 543
              90.0th: 683                                                             90.0th: 701
              95.0th: 743                                                             95.0th: 737
              *99.0th: 815                                                            *99.0th: 805
              99.5th: 839                                                             99.5th: 835
              99.9th: 913                                                             99.9th: 893
              min=0, max=1011                                                         min=0, max=2833
      
      perf bench sched pipe (lesser time and higher ops/sec is better)
      Running 'sched/pipe' benchmark:                                         Running 'sched/pipe' benchmark:
      Executed 1000000 pipe operations between two processes                  Executed 1000000 pipe operations between two processes
      
           Total time: 6.083 [sec]                                                 Total time: 6.303 [sec]
      
             6.083576 usecs/op                                                       6.303318 usecs/op
               164377 ops/sec                                                          158646 ops/sec
      
      Power 9 LPAR (2 Node/ 128 Cpu System)
      -------------------------------------
      
      Baseline                                                                Baseline + Coregroup Support
      
        N       Min       Max    Median         Avg      Stddev                 N       Min       Max    Median         Avg      Stddev
      100   1058029   1295393   1200414   1188306.7   56786.538               100    943264   1287619   1180522   1168473.2   64469.955
      
      ^ ebizzy (Throughput of 100 iterations of 30 seconds higher throughput is better)
      
      schbench (latency hence lower is better)
      Latency percentiles (usec)                                              Latency percentiles (usec)
              50.0000th: 34                                                           50.0000th: 39
              75.0000th: 46                                                           75.0000th: 52
              90.0000th: 53                                                           90.0000th: 68
              95.0000th: 56                                                           95.0000th: 77
              *99.0000th: 61                                                          *99.0000th: 89
              99.5000th: 63                                                           99.5000th: 94
              99.9000th: 81                                                           99.9000th: 169
              min=0, max=8405                                                         min=0, max=23674
      
      perf bench sched pipe (lesser time and higher ops/sec is better)
      Running 'sched/pipe' benchmark:                                         Running 'sched/pipe' benchmark:
      Executed 1000000 pipe operations between two processes                  Executed 1000000 pipe operations between two processes
      
           Total time: 8.768 [sec]                                                 Total time: 5.217 [sec]
      
             8.768400 usecs/op                                                       5.217625 usecs/op
               114045 ops/sec                                                          191658 ops/sec
      
      Power 8 LPAR (8 Node/ 256 Cpu System)
      -------------------------------------
      
      Baseline                                                                Baseline + Coregroup Support
      
        N       Min       Max    Median         Avg      Stddev                 N      Min      Max   Median        Avg     Stddev
      100   1267615   1965234   1707423   1689137.6   144363.29               100  1175357  1924262  1691104  1664792.1   145876.4
      
      ^ ebizzy (Throughput of 100 iterations of 30 seconds higher throughput is better)
      
      schbench (latency hence lower is better)
      Latency percentiles (usec)                                              Latency percentiles (usec)
              50.0th: 37                                                              50.0th: 36
              75.0th: 51                                                              75.0th: 48
              90.0th: 59                                                              90.0th: 55
              95.0th: 63                                                              95.0th: 59
              *99.0th: 71                                                             *99.0th: 67
              99.5th: 75                                                              99.5th: 72
              99.9th: 105                                                             99.9th: 170
              min=0, max=18560                                                        min=0, max=27031
      
      perf bench sched pipe (lesser time and higher ops/sec is better)
      Running 'sched/pipe' benchmark:                                         Running 'sched/pipe' benchmark:
      Executed 1000000 pipe operations between two processes                  Executed 1000000 pipe operations between two processes
      
           Total time: 6.013 [sec]                                                 Total time: 5.930 [sec]
      
             6.013963 usecs/op                                                       5.930724 usecs/op
               166279 ops/sec                                                          168613 ops/sec
      
      Topology verification on Power9
      Power9 / powernv / SMT4
      
        $ tail /proc/cpuinfo
        cpu             : POWER9, altivec supported
        clock           : 3600.000000MHz
        revision        : 2.2 (pvr 004e 1202)
      
        timebase        : 512000000
        platform        : PowerNV
        model           : 9006-22P
        machine         : PowerNV 9006-22P
        firmware        : OPAL
        MMU             : Radix
      
      Baseline                                                                Baseline + Coregroup Support
      
        lscpu                                                                 lscpu
        ------                                                                ------
        Architecture:        ppc64le                                          Architecture:        ppc64le
        Byte Order:          Little Endian                                    Byte Order:          Little Endian
        CPU(s):              160                                              CPU(s):              160
        On-line CPU(s) list: 0-159                                            On-line CPU(s) list: 0-159
        Thread(s) per core:  4                                                Thread(s) per core:  4
        Core(s) per socket:  20                                               Core(s) per socket:  20
        Socket(s):           2                                                Socket(s):           2
        NUMA node(s):        2                                                NUMA node(s):        2
        Model:               2.2 (pvr 004e 1202)                              Model:               2.2 (pvr 004e 1202)
        Model name:          POWER9, altivec supported                        Model name:          POWER9, altivec supported
        CPU max MHz:         3800.0000                                        CPU max MHz:         3800.0000
        CPU min MHz:         2166.0000                                        CPU min MHz:         2166.0000
        L1d cache:           32K                                              L1d cache:           32K
        L1i cache:           32K                                              L1i cache:           32K
        L2 cache:            512K                                             L2 cache:            512K
        L3 cache:            10240K                                           L3 cache:            10240K
        NUMA node0 CPU(s):   0-79                                             NUMA node0 CPU(s):   0-79
        NUMA node8 CPU(s):   80-159                                           NUMA node8 CPU(s):   80-159
      
        grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name                grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name
        -----------------------------------------------------                 -----------------------------------------------------
        /proc/sys/kernel/sched_domain/cpu0/domain0/name:SMT                   /proc/sys/kernel/sched_domain/cpu0/domain0/name:SMT
        /proc/sys/kernel/sched_domain/cpu0/domain1/name:CACHE                 /proc/sys/kernel/sched_domain/cpu0/domain1/name:CACHE
        /proc/sys/kernel/sched_domain/cpu0/domain2/name:DIE                   /proc/sys/kernel/sched_domain/cpu0/domain2/name:DIE
        /proc/sys/kernel/sched_domain/cpu0/domain3/name:NUMA                  /proc/sys/kernel/sched_domain/cpu0/domain3/name:NUMA
      
        grep . /proc/sys/kernel/sched_domain/cpu0/domain*/flags               grep . /proc/sys/kernel/sched_domain/cpu0/domain*/flags
        ------------------------------------------------------                ------------------------------------------------------
        /proc/sys/kernel/sched_domain/cpu0/domain0/flags:2391                 /proc/sys/kernel/sched_domain/cpu0/domain0/flags:2391
        /proc/sys/kernel/sched_domain/cpu0/domain1/flags:2327                 /proc/sys/kernel/sched_domain/cpu0/domain1/flags:2327
        /proc/sys/kernel/sched_domain/cpu0/domain2/flags:2071                 /proc/sys/kernel/sched_domain/cpu0/domain2/flags:2071
        /proc/sys/kernel/sched_domain/cpu0/domain3/flags:12801                /proc/sys/kernel/sched_domain/cpu0/domain3/flags:12801
      
      Baseline
      
        head /proc/schedstat
        --------------------
        version 15
        timestamp 4295043536
        cpu0 0 0 0 0 0 0 9597119314 2408913694 11897
        domain0 00000000,00000000,00000000,00000000,0000000f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain1 00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain2 00000000,00000000,0000ffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain3 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        cpu1 0 0 0 0 0 0 4941435230 11106132 1583
        domain0 00000000,00000000,00000000,00000000,0000000f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain1 00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      
      Baseline + Coregroup Support
      
        head /proc/schedstat
        --------------------
        version 15
        timestamp 4296311826
        cpu0 0 0 0 0 0 0 3353674045024 3781680865826 297483
        domain0 00000000,00000000,00000000,00000000,0000000f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain1 00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain2 00000000,00000000,0000ffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain3 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        cpu1 0 0 0 0 0 0 3337873293332 4231590033856 229090
        domain0 00000000,00000000,00000000,00000000,0000000f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain1 00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      
        Post sudo ppc64_cpu --smt=1                                           Post sudo ppc64_cpu --smt=1
        ---------------------                                                 ---------------------
        grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name                grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name
        -----------------------------------------------------                 -----------------------------------------------------
        /proc/sys/kernel/sched_domain/cpu0/domain0/name:CACHE                 /proc/sys/kernel/sched_domain/cpu0/domain0/name:CACHE
        /proc/sys/kernel/sched_domain/cpu0/domain1/name:DIE                   /proc/sys/kernel/sched_domain/cpu0/domain1/name:DIE
        /proc/sys/kernel/sched_domain/cpu0/domain2/name:NUMA                  /proc/sys/kernel/sched_domain/cpu0/domain2/name:NUMA
      
        grep . /proc/sys/kernel/sched_domain/cpu0/domain*/flags               grep . /proc/sys/kernel/sched_domain/cpu0/domain*/flags
        ------------------------------------------------------                ------------------------------------------------------
        /proc/sys/kernel/sched_domain/cpu0/domain0/flags:2327                 /proc/sys/kernel/sched_domain/cpu0/domain0/flags:2327
        /proc/sys/kernel/sched_domain/cpu0/domain1/flags:2071                 /proc/sys/kernel/sched_domain/cpu0/domain1/flags:2071
        /proc/sys/kernel/sched_domain/cpu0/domain2/flags:12801                /proc/sys/kernel/sched_domain/cpu0/domain2/flags:12801
      
      Baseline:
      
        head /proc/schedstat
        --------------------
        version 15
        timestamp 4295046242
        cpu0 0 0 0 0 0 0 10978610020 2658997390 13068
        domain0 00000000,00000000,00000000,00000000,00000011 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain1 00000000,00000000,00001111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain2 91111111,11111111,11111111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        cpu4 0 0 0 0 0 0 5408663896 95701034 7697
        domain0 00000000,00000000,00000000,00000000,00000011 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain1 00000000,00000000,00001111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain2 91111111,11111111,11111111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      
      Baseline + Coregroup Support
      
        head /proc/schedstat
        --------------------
        version 15
        timestamp 4296314905
        cpu0 0 0 0 0 0 0 3355392013536 3781975150576 298723
        domain0 00000000,00000000,00000000,00000000,00000011 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain1 00000000,00000000,00001111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain2 91111111,11111111,11111111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        cpu4 0 0 0 0 0 0 3351637920996 4427329763050 256776
        domain0 00000000,00000000,00000000,00000000,00000011 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain1 00000000,00000000,00001111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain2 91111111,11111111,11111111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      
      Similar verification was done on Power 8 (8 Node 256 CPU LPAR) and
      Power 9 (2 node 128 Cpu LPAR) and they showed the topology before and
      after the patch to be identical. If Interested, I could provide the
      same.
      
      On Power 9 (with device-tree enablement to show coregroups):
      
        $ tail /proc/cpuinfo
        processor     : 127
        cpu           : POWER9 (architected), altivec supported
        clock         : 3000.000000MHz
        revision      : 2.2 (pvr 004e 0202)
      
        timebase      : 512000000
        platform      : pSeries
        model         : IBM,9008-22L
        machine       : CHRP IBM,9008-22L
        MMU           : Hash
      
      Before patchset:
      
        $ cat /proc/sys/kernel/sched_domain/cpu0/domain*/name
        SMT
        CACHE
        DIE
        NUMA
      
        $ head /proc/schedstat
        version 15
        timestamp 4318242208
        cpu0 0 0 0 0 0 0 28077107004 4773387362 78205
        domain0 00000000,00000000,00000000,00000055 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain1 00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain2 00000000,00000000,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain3 ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        cpu1 0 0 0 0 0 0 24177439200 413887604 75393
        domain0 00000000,00000000,00000000,000000aa 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain1 00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      
      After patchset:
      
        $ cat /proc/sys/kernel/sched_domain/cpu0/domain*/name
        SMT
        CACHE
        MC
        DIE
        NUMA
      
        $ head /proc/schedstat
        version 15
        timestamp 4318242208
        cpu0 0 0 0 0 0 0 28077107004 4773387362 78205
        domain0 00000000,00000000,00000000,00000055 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain1 00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain2 00000000,00000000,00000000,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain3 00000000,00000000,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain4 ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        cpu1 0 0 0 0 0 0 24177439200 413887604 75393
        domain0 00000000,00000000,00000000,000000aa 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      b5c8a293
    • Srikar Dronamraju's avatar
      powerpc/smp: Implement cpu_to_coregroup_id · fa35e868
      Srikar Dronamraju authored
      Lookup the coregroup id from the associativity array.
      
      If unable to detect the coregroup id, fallback on the core id.
      This way, ensure sched_domain degenerates and an extra sched domain is
      not created.
      
      Ideally this function should have been implemented in
      arch/powerpc/kernel/smp.c. However if its implemented in mm/numa.c, we
      don't need to find the primary domain again.
      
      If the device-tree mentions more than one coregroup, then kernel
      implements only the last or the smallest coregroup, which currently
      corresponds to the penultimate domain in the device-tree.
      Signed-off-by: default avatarSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Reviewed-by: default avatarGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200810071834.92514-11-srikar@linux.vnet.ibm.com
      fa35e868
    • Srikar Dronamraju's avatar
      powerpc/smp: Create coregroup domain · 72730bfc
      Srikar Dronamraju authored
      Add percpu coregroup maps and masks to create coregroup domain.
      If a coregroup doesn't exist, the coregroup domain will be degenerated
      in favour of SMT/CACHE domain. Do note this patch is only creating stubs
      for cpu_to_coregroup_id. The actual cpu_to_coregroup_id implementation
      would be in a subsequent patch.
      Signed-off-by: default avatarSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Reviewed-by: default avatarGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200810071834.92514-10-srikar@linux.vnet.ibm.com
      72730bfc
    • Srikar Dronamraju's avatar
      powerpc/smp: Allocate cpumask only after searching thread group · 6e086302
      Srikar Dronamraju authored
      If allocated earlier and the search fails, then cpu_l1_cache_map cpumask
      is unnecessarily cleared. However cpu_l1_cache_map can be allocated /
      cleared after we search thread group.
      
      Please note CONFIG_CPUMASK_OFFSTACK is not set on Powerpc. Hence cpumask
      allocated by zalloc_cpumask_var_node is never freed.
      Signed-off-by: default avatarSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Reviewed-by: default avatarGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200810071834.92514-9-srikar@linux.vnet.ibm.com
      6e086302
    • Srikar Dronamraju's avatar
      powerpc/numa: Detect support for coregroup · f9f130ff
      Srikar Dronamraju authored
      Add support for grouping cores based on the device-tree classification.
      - The last domain in the associativity domains always refers to the
      core.
      - If primary reference domain happens to be the penultimate domain in
      the associativity domains device-tree property, then there are no
      coregroups. However if its not a penultimate domain, then there are
      coregroups. There can be more than one coregroup. For now we would be
      interested in the last or the smallest coregroups, i.e one sub-group
      per DIE.
      
      Currently there are no firmwares that are exposing this grouping. Hence
      allow the basis for grouping to be abstract.  Once the firmware starts
      using this grouping, code would be added to detect the type of grouping
      and adjust the sd domain flags accordingly.
      Signed-off-by: default avatarSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Reviewed-by: default avatarGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200810071834.92514-8-srikar@linux.vnet.ibm.com
      f9f130ff
    • Srikar Dronamraju's avatar
      powerpc/smp: Optimize start_secondary · caa8e29d
      Srikar Dronamraju authored
      In start_secondary, even if shared_cache was already set, system does a
      redundant match for cpumask. This redundant check can be removed by
      checking if shared_cache is already set.
      
      While here, localize the sibling_mask variable to within the if
      condition.
      Signed-off-by: default avatarSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Reviewed-by: default avatarGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200810071834.92514-7-srikar@linux.vnet.ibm.com
      caa8e29d
    • Srikar Dronamraju's avatar
      powerpc/smp: Dont assume l2-cache to be superset of sibling · f6606cfd
      Srikar Dronamraju authored
      Current code assumes that cpumask of cpus sharing a l2-cache mask will
      always be a superset of cpu_sibling_mask.
      
      Lets stop that assumption. cpu_l2_cache_mask is a superset of
      cpu_sibling_mask if and only if shared_caches is set.
      Reviewed-by: default avatarGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: default avatarSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200913171038.GB11808@linux.vnet.ibm.com
      f6606cfd
    • Srikar Dronamraju's avatar
      powerpc/smp: Move topology fixups into a new function · 3c6032a8
      Srikar Dronamraju authored
      Move topology fixup based on the platform attributes into its own
      function which is called just before set_sched_topology.
      Signed-off-by: default avatarSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Reviewed-by: default avatarGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200810071834.92514-5-srikar@linux.vnet.ibm.com
      3c6032a8
    • Srikar Dronamraju's avatar
      powerpc/smp: Move powerpc_topology above · 5e93f16a
      Srikar Dronamraju authored
      Just moving the powerpc_topology description above.
      This will help in using functions in this file and avoid declarations.
      
      No other functional changes
      Signed-off-by: default avatarSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Reviewed-by: default avatarGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200810071834.92514-4-srikar@linux.vnet.ibm.com
      5e93f16a
    • Srikar Dronamraju's avatar
      powerpc/smp: Merge Power9 topology with Power topology · 2ef0ca54
      Srikar Dronamraju authored
      A new sched_domain_topology_level was added just for Power9. However the
      same can be achieved by merging powerpc_topology with power9_topology
      and makes the code more simpler especially when adding a new sched
      domain.
      Signed-off-by: default avatarSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Reviewed-by: default avatarGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200810071834.92514-3-srikar@linux.vnet.ibm.com
      2ef0ca54
    • Srikar Dronamraju's avatar
      powerpc/smp: Fix a warning under !NEED_MULTIPLE_NODES · d0fd24bb
      Srikar Dronamraju authored
      Fix a build warning in a non CONFIG_NEED_MULTIPLE_NODES
      "error: _numa_cpu_lookup_table_ undeclared"
      Signed-off-by: default avatarSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Reviewed-by: default avatarGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200810071834.92514-2-srikar@linux.vnet.ibm.com
      d0fd24bb
    • Srikar Dronamraju's avatar
      powerpc/numa: Offline memoryless cpuless node 0 · e75130f2
      Srikar Dronamraju authored
      Currently Linux kernel with CONFIG_NUMA on a system with multiple
      possible nodes, marks node 0 as online at boot.  However in practice,
      there are systems which have node 0 as memoryless and cpuless.
      
      This can cause numa_balancing to be enabled on systems with only one node
      with memory and CPUs. The existence of this dummy node which is cpuless and
      memoryless node can confuse users/scripts looking at output of lscpu /
      numactl.
      
      By marking, node 0 as offline, lets stop assuming that node 0 is
      always online. If node 0 has CPU or memory that are online, node 0 will
      again be set as online.
      
      v5.8
       available: 2 nodes (0,2)
       node 0 cpus:
       node 0 size: 0 MB
       node 0 free: 0 MB
       node 2 cpus: 0 1 2 3 4 5 6 7
       node 2 size: 32625 MB
       node 2 free: 31490 MB
       node distances:
       node   0   2
         0:  10  20
         2:  20  10
      
      proc and sys files
      ------------------
       /sys/devices/system/node/online:            0,2
       /proc/sys/kernel/numa_balancing:            1
       /sys/devices/system/node/has_cpu:           2
       /sys/devices/system/node/has_memory:        2
       /sys/devices/system/node/has_normal_memory: 2
       /sys/devices/system/node/possible:          0-31
      
      v5.8 + patch
      ------------------
       available: 1 nodes (2)
       node 2 cpus: 0 1 2 3 4 5 6 7
       node 2 size: 32625 MB
       node 2 free: 31487 MB
       node distances:
       node   2
         2:  10
      
      proc and sys files
      ------------------
      /sys/devices/system/node/online:            2
      /proc/sys/kernel/numa_balancing:            0
      /sys/devices/system/node/has_cpu:           2
      /sys/devices/system/node/has_memory:        2
      /sys/devices/system/node/has_normal_memory: 2
      /sys/devices/system/node/possible:          0-31
      
      Example of a node with online CPUs/memory on node 0.
      (Same o/p with and without patch)
      numactl -H
      available: 4 nodes (0-3)
      node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
      node 0 size: 32482 MB
      node 0 free: 22994 MB
      node 1 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
      node 1 size: 0 MB
      node 1 free: 0 MB
      node 2 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
      node 2 size: 0 MB
      node 2 free: 0 MB
      node 3 cpus: 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 node 3 size: 0 MB
      node 3 free: 0 MB
      node distances:
      node   0   1   2   3
        0:  10  20  40  40
        1:  20  10  40  40
        2:  40  40  10  20
        3:  40  40  20  10
      
      Note: On Powerpc, cpu_to_node of possible but not present cpus would
      previously return 0. Hence this commit depends on commit ("powerpc/numa: Set
      numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id
      queried from vphn"). Without the 2 commits, Powerpc system might crash.
      
      1. User space applications like Numactl, lscpu, that parse the sysfs tend to
      believe there is an extra online node. This tends to confuse users and
      applications. Other user space applications start believing that system was
      not able to use all the resources (i.e missing resources) or the system was
      not setup correctly.
      
      2. Also existence of dummy node also leads to inconsistent information. The
      number of online nodes is inconsistent with the information in the
      device-tree and resource-dump
      
      3. When the dummy node is present, single node non-Numa systems end up showing
      up as NUMA systems and numa_balancing gets enabled. This will mean we take
      the hit from the unnecessary numa hinting faults.
      Signed-off-by: default avatarSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200818081104.57888-4-srikar@linux.vnet.ibm.com
      e75130f2
    • Srikar Dronamraju's avatar
      powerpc/numa: Prefer node id queried from vphn · 6398eaa2
      Srikar Dronamraju authored
      Node id queried from the static device tree may not
      be correct. For example: it may always show 0 on a shared processor.
      Hence prefer the node id queried from vphn and fallback on the device tree
      based node id if vphn query fails.
      Signed-off-by: default avatarSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200818081104.57888-3-srikar@linux.vnet.ibm.com
      6398eaa2
    • Srikar Dronamraju's avatar
      powerpc/numa: Set numa_node for all possible cpus · a874f100
      Srikar Dronamraju authored
      A Powerpc system with multiple possible nodes and with CONFIG_NUMA
      enabled always used to have a node 0, even if node 0 does not any cpus
      or memory attached to it. As per PAPR, node affinity of a cpu is only
      available once its present / online. For all cpus that are possible but
      not present, cpu_to_node() would point to node 0.
      
      To ensure a cpuless, memoryless dummy node is not online, powerpc need
      to make sure all possible but not present cpu_to_node are set to a
      proper node.
      Signed-off-by: default avatarSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200818081104.57888-2-srikar@linux.vnet.ibm.com
      a874f100
    • Srikar Dronamraju's avatar
      powerpc/numa: Restrict possible nodes based on platform · 67df7784
      Srikar Dronamraju authored
      As per draft LoPAPR (Revision 2.9_pre7), section B.5.3 "Run Time
      Abstraction Services (RTAS) Node" available at:
        https://openpowerfoundation.org/wp-content/uploads/2020/07/LoPAR-20200611.pdf
      
      ... there are 2 device tree properties:
      
        "ibm,max-associativity-domains"
         which defines the maximum number of domains that the firmware i.e
         PowerVM can support.
      
      and:
      
        "ibm,current-associativity-domains"
         which defines the maximum number of domains that the current
         platform can support.
      
      The value of "ibm,max-associativity-domains" is always greater than or
      equal to "ibm,current-associativity-domains" property. If the latter
      property is not available, use "ibm,max-associativity-domain" as a
      fallback. In this yet to be released LoPAPR, "ibm,current-associativity-domains"
      is mentioned in page 833 / B.5.3 which is covered under under
      "Appendix B. System Binding" section
      
      Currently powerpc uses the "ibm,max-associativity-domains" property
      while setting the possible number of nodes. This is currently set at
      32. However the possible number of nodes for a platform may be
      significantly less. Hence set the possible number of nodes based on
      "ibm,current-associativity-domains" property.
      
      Nathan Lynch had raised a valid concern that post LPM (Live Partition
      Migration), a user could DLPAR add processors and memory after LPM
      with "new" associativity properties:
        https://lore.kernel.org/linuxppc-dev/871rljfet9.fsf@linux.ibm.com/t/#u
      
      He also pointed out that "ibm,max-associativity-domains" has the same
      contents on all currently available PowerVM systems, unlike
      "ibm,current-associativity-domains" and hence may be better able to
      handle the new NUMA associativity properties.
      
      However with the recent commit dbce4562 ("powerpc/numa: Limit
      possible nodes to within num_possible_nodes"), all new NUMA
      associativity properties are capped to initially set nr_node_ids.
      Hence this commit should be safe with any new DLPAR add post LPM.
      
        $ lsprop /proc/device-tree/rtas/ibm,*associ*-domains
        /proc/device-tree/rtas/ibm,current-associativity-domains
        		 00000005 00000001 00000002 00000002 00000002 00000010
        /proc/device-tree/rtas/ibm,max-associativity-domains
        		 00000005 00000001 00000008 00000020 00000020 00000100
      
        $ cat /sys/devices/system/node/possible ##Before patch
        0-31
      
        $ cat /sys/devices/system/node/possible ##After patch
        0-1
      
      Note the maximum nodes this platform can support is only 2 but the
      possible nodes is set to 32.
      
      This is important because lot of kernel and user space code allocate
      structures for all possible nodes leading to a lot of memory that is
      allocated but not used.
      
      I ran a simple experiment to create and destroy 100 memory cgroups on
      boot on a 8 node machine (Power8 Alpine).
      
      Before patch:
        free -k at boot
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4106816   518820608       22272      570752   516606720
        Swap:       4194240           0     4194240
      
        free -k after creating 100 memory cgroups
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4628416   518246464       22336      623296   516058688
        Swap:       4194240           0     4194240
      
        free -k after destroying 100 memory cgroups
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4697408   518173760       22400      627008   515987904
        Swap:       4194240           0     4194240
      
      After patch:
        free -k at boot
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     3969472   518933888       22272      594816   516731776
        Swap:       4194240           0     4194240
      
        free -k after creating 100 memory cgroups
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4181888   518676096       22208      640192   516496448
        Swap:       4194240           0     4194240
      
        free -k after destroying 100 memory cgroups
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4232320   518619904       22272      645952   516443264
        Swap:       4194240           0     4194240
      
      Observations:
        Fixed kernel takes 137344 kb (4106816-3969472) less to boot.
        Fixed kernel takes 309184 kb (4628416-4181888-137344) less to create 100 memcgs.
      Signed-off-by: default avatarSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      [mpe: Reformat change log a bit for readability]
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200817055257.110873-1-srikar@linux.vnet.ibm.com
      67df7784
    • Srikar Dronamraju's avatar
      powerpc/topology: Override cpu_smt_mask · f3232321
      Srikar Dronamraju authored
      On Power9, a pair of SMT4 cores can be presented by the firmware as a SMT8
      core for backward compatibility reasons, with the fusion of two SMT4 cores.
      Powerpc allows LPARs to be live migrated from Power8 to Power9.  Existing
      software developed/configured for Power8, expects to see a SMT8 core.
      
      In order to maintain userspace backward compatibility (with Power8 chips in
      case of Power9) in enterprise Linux systems, the topology_sibling_cpumask
      has to be set to SMT8 core.
      
      cpu_smt_mask() should generally point to the cpu mask of the SMT4 core.
      Hence override the default cpu_smt_mask() to be powerpc specific
      allowing for better scheduling behaviour on Power.
      
      schbench
      (latency measured in usecs, so lesser is better)
      Without patch                   With patch
      Latency percentiles (usec)	Latency percentiles (usec)
      	50.0000th: 34           	50.0000th: 38
      	75.0000th: 47           	75.0000th: 52
      	90.0000th: 54           	90.0000th: 60
      	95.0000th: 57           	95.0000th: 64
      	*99.0000th: 62          	*99.0000th: 72
      	99.5000th: 65           	99.5000th: 75
      	99.9000th: 76           	99.9000th: 3452
      	min=0, max=9205         	min=0, max=9344
      
      schbench (With Cede disabled)
      Without patch                   With patch
      Latency percentiles (usec) 	Latency percentiles (usec)
      	50.0000th: 20           	50.0000th: 21
      	75.0000th: 28           	75.0000th: 29
      	90.0000th: 33           	90.0000th: 34
      	95.0000th: 35           	95.0000th: 37
      	*99.0000th: 40          	*99.0000th: 40
      	99.5000th: 48           	99.5000th: 42
      	99.9000th: 94           	99.9000th: 79
      	min=0, max=791          	min=0, max=791
      
      perf bench sched pipe
      usec/ops : lesser is better
      Without patch
        N           Min           Max        Median           Avg        Stddev
      101      5.095113      5.595269      5.204842     5.2298776    0.10762713
      
      5.10 - 5.15 : ##################################################   23% (24)
      5.15 - 5.20 : #############################################        21% (22)
      5.20 - 5.25 : ##################################################   23% (24)
      5.25 - 5.30 : #########################                            11% (12)
      5.30 - 5.35 : ##########                                            4% (5)
      5.35 - 5.40 : ########                                              3% (4)
      5.40 - 5.45 : ########                                              3% (4)
      5.45 - 5.50 : ####                                                  1% (2)
      5.50 - 5.55 : ##                                                    0% (1)
      5.55 - 5.60 : ####                                                  1% (2)
      
      With patch
        N           Min           Max        Median           Avg        Stddev
      101      5.134675      8.524719      5.207658     5.2780985    0.34911969
      
      5.1 - 5.5 : ##################################################   94% (95)
      5.5 - 5.8 : ##                                                    3% (4)
      5.8 - 6.2 :                                                       0% (1)
      6.2 - 6.5 :
      6.5 - 6.8 :
      6.8 - 7.2 :
      7.2 - 7.5 :
      7.5 - 7.8 :
      7.8 - 8.2 :
      8.2 - 8.5 :
      
      perf bench sched pipe (cede disabled)
      usec/ops : lesser is better
      Without patch
        N           Min           Max        Median           Avg        Stddev
      101      7.884227     12.576538      7.956474     8.0170722    0.46159054
      
      7.9 - 8.4 : ##################################################   99% (100)
      8.4 - 8.8 :
      8.8 - 9.3 :
      9.3 - 9.8 :
      9.8 - 10.2 :
      10.2 - 10.7 :
      10.7 - 11.2 :
      11.2 - 11.6 :
      11.6 - 12.1 :
      12.1 - 12.6 :
      
      With patch
        N           Min           Max        Median           Avg        Stddev
      101      7.956021      8.217284      8.015615     8.0283866   0.049844967
      
      7.96 - 7.98 : ######################                               12% (13)
      7.98 - 8.01 : ##################################################   28% (29)
      8.01 - 8.03 : ####################################                 20% (21)
      8.03 - 8.06 : #########################                            14% (15)
      8.06 - 8.09 : ######################                               12% (13)
      8.09 - 8.11 : ######                                                3% (4)
      8.11 - 8.14 : ###                                                   1% (2)
      8.14 - 8.17 : ###                                                   1% (2)
      8.17 - 8.19 :
      8.19 - 8.22 : #                                                     0% (1)
      
      Observations: With the patch, the initial run/iteration takes a slight
      longer time. This can be attributed to the fact that now we pick a CPU
      from a idle core which could be sleep mode. Once we remove the cede,
      state the numbers improve in favour of the patch.
      
      ebizzy:
      transactions per second (higher is better)
      without patch
        N           Min           Max        Median           Avg        Stddev
      100       1018433       1304470       1193208     1182315.7     60018.733
      
      1018433 - 1047037 : ######                                                3% (3)
      1047037 - 1075640 : ########                                              4% (4)
      1075640 - 1104244 : ########                                              4% (4)
      1104244 - 1132848 : ###############                                       7% (7)
      1132848 - 1161452 : ####################################                 17% (17)
      1161452 - 1190055 : ##########################                           12% (12)
      1190055 - 1218659 : #############################################        21% (21)
      1218659 - 1247263 : ##################################################   23% (23)
      1247263 - 1275866 : ########                                              4% (4)
      1275866 - 1304470 : ########                                              4% (4)
      
      with patch
        N           Min           Max        Median           Avg        Stddev
      100        967014       1292938       1208819     1185281.8     69815.851
      
       967014 - 999606  : ##                                                    1% (1)
       999606 - 1032199 : ##                                                    1% (1)
      1032199 - 1064791 : ############                                          6% (6)
      1064791 - 1097384 : ##########                                            5% (5)
      1097384 - 1129976 : ##################                                    9% (9)
      1129976 - 1162568 : ####################                                 10% (10)
      1162568 - 1195161 : ##########################                           13% (13)
      1195161 - 1227753 : ############################################         22% (22)
      1227753 - 1260346 : ##################################################   25% (25)
      1260346 - 1292938 : ##############                                        7% (7)
      
      Observations: Not much changes, ebizzy is not much impacted.
      Signed-off-by: default avatarSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200807074517.27957-2-srikar@linux.vnet.ibm.com
      f3232321
    • Srikar Dronamraju's avatar
      sched/topology: Allow archs to override cpu_smt_mask · 3babbe44
      Srikar Dronamraju authored
      cpu_smt_mask tracks topology_sibling_cpumask. This would be good for
      most architectures. One of the users of cpu_smt_mask(), would be to
      identify idle-cores. On Power9, a pair of SMT4 cores can be presented
      by the firmware as a SMT8 core for backward compatibility reasons.
      
      powerpc allows LPARs to be live migrated from Power8 to Power9. Do
      note Power8 had only SMT8 cores. Existing software which has been
      developed/configured for Power8 would expect to see SMT8 core.
      Maintaining the illusion of SMT8 core is a requirement to make that
      work.
      
      In order to maintain above userspace backward compatibility with
      previous versions of processor, Power9 onwards there is option to the
      firmware to advertise a pair of SMT4 cores as a fused cores aka SMT8
      core. On Power9 this pair shares the L2 cache as well. However, from
      the scheduler's point of view, a core should be determined by SMT4,
      since its a completely independent unit of compute. Hence allow
      powerpc architecture to override the default cpu_smt_mask() to point
      to the SMT4 cores in a SMT8 mode.
      
      This will ensure the scheduler is always given the right information.
      Acked-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: default avatarSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200807074517.27957-1-srikar@linux.vnet.ibm.com
      3babbe44