1. 16 Sep, 2020 6 commits
    • powerpc/numa: Set numa_node for all possible cpus · a874f100
      Srikar Dronamraju authored
      A powerpc system with multiple possible nodes and with CONFIG_NUMA
      enabled always used to have a node 0, even if node 0 had no CPUs or
      memory attached to it. As per PAPR, the node affinity of a CPU is only
      available once it is present / online. For all CPUs that are possible but
      not present, cpu_to_node() would point to node 0.
      
      To ensure a cpuless, memoryless dummy node is not brought online, powerpc
      needs to make sure cpu_to_node() for all possible but not present CPUs is
      set to a proper node.
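
      A minimal sketch of the idea, assuming a hypothetical helper name and the
      choice of first_online_node as the target node (the actual patch may
      differ):

        /* Sketch: keep possible-but-not-present CPUs off the dummy node 0. */
        static void __init set_node_for_possible_cpus(void)
        {
                unsigned int cpu;

                for_each_possible_cpu(cpu) {
                        /* Present CPUs get their affinity from PAPR as usual. */
                        if (cpu_present(cpu))
                                continue;
                        set_cpu_numa_node(cpu, first_online_node);
                }
        }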
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200818081104.57888-2-srikar@linux.vnet.ibm.com
    • powerpc/numa: Restrict possible nodes based on platform · 67df7784
      Srikar Dronamraju authored
      As per draft LoPAPR (Revision 2.9_pre7), section B.5.3 "Run Time
      Abstraction Services (RTAS) Node" available at:
        https://openpowerfoundation.org/wp-content/uploads/2020/07/LoPAR-20200611.pdf
      
      ... there are 2 device tree properties:
      
        "ibm,max-associativity-domains"
         which defines the maximum number of domains that the firmware,
         i.e. PowerVM, can support.
      
      and:
      
        "ibm,current-associativity-domains"
         which defines the maximum number of domains that the current
         platform can support.
      
      The value of the "ibm,max-associativity-domains" property is always
      greater than or equal to that of "ibm,current-associativity-domains". If
      the latter property is not available, use "ibm,max-associativity-domains"
      as a fallback. In this yet-to-be-released LoPAPR, "ibm,current-associativity-domains"
      is mentioned on page 833 / B.5.3, which is covered under the
      "Appendix B. System Binding" section.
      
      Currently powerpc uses the "ibm,max-associativity-domains" property
      when setting the possible number of nodes, which ends up being 32.
      However, the possible number of nodes for a platform may be
      significantly less. Hence set the possible number of nodes based on the
      "ibm,current-associativity-domains" property.
      
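      A rough sketch of the fallback, assuming a hypothetical helper built on
      of_get_property() (error handling and the caller are omitted):

        /* Sketch: prefer the platform's current limits over the firmware maximum. */
        static const __be32 *read_domains_property(struct device_node *rtas, int *len)
        {
                const __be32 *prop;

                prop = of_get_property(rtas, "ibm,current-associativity-domains", len);
                if (!prop)
                        prop = of_get_property(rtas, "ibm,max-associativity-domains", len);
                return prop;
        }
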
      Nathan Lynch had raised a valid concern that post LPM (Live Partition
      Migration), a user could DLPAR add processors and memory after LPM
      with "new" associativity properties:
        https://lore.kernel.org/linuxppc-dev/871rljfet9.fsf@linux.ibm.com/t/#u
      
      He also pointed out that "ibm,max-associativity-domains" has the same
      contents on all currently available PowerVM systems, unlike
      "ibm,current-associativity-domains", and hence may be better able to
      handle the new NUMA associativity properties.
      
      However, with the recent commit dbce4562 ("powerpc/numa: Limit
      possible nodes to within num_possible_nodes"), all new NUMA
      associativity properties are capped to the initially set nr_node_ids.
      Hence this commit should be safe with any new DLPAR add post LPM.
      
        $ lsprop /proc/device-tree/rtas/ibm,*associ*-domains
        /proc/device-tree/rtas/ibm,current-associativity-domains
        		 00000005 00000001 00000002 00000002 00000002 00000010
        /proc/device-tree/rtas/ibm,max-associativity-domains
        		 00000005 00000001 00000008 00000020 00000020 00000100
      
        $ cat /sys/devices/system/node/possible ##Before patch
        0-31
      
        $ cat /sys/devices/system/node/possible ##After patch
        0-1
      
      Note that the maximum number of nodes this platform can support is only
      2, but the possible nodes were being set to 32.
      
      This is important because a lot of kernel and user space code allocates
      structures for all possible nodes, leading to a lot of memory that is
      allocated but never used.
      
      I ran a simple experiment of creating and destroying 100 memory cgroups
      at boot on an 8-node machine (Power8 Alpine).
      
      Before patch:
        free -k at boot
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4106816   518820608       22272      570752   516606720
        Swap:       4194240           0     4194240
      
        free -k after creating 100 memory cgroups
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4628416   518246464       22336      623296   516058688
        Swap:       4194240           0     4194240
      
        free -k after destroying 100 memory cgroups
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4697408   518173760       22400      627008   515987904
        Swap:       4194240           0     4194240
      
      After patch:
        free -k at boot
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     3969472   518933888       22272      594816   516731776
        Swap:       4194240           0     4194240
      
        free -k after creating 100 memory cgroups
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4181888   518676096       22208      640192   516496448
        Swap:       4194240           0     4194240
      
        free -k after destroying 100 memory cgroups
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4232320   518619904       22272      645952   516443264
        Swap:       4194240           0     4194240
      
      Observations:
        The fixed kernel takes 137344 KB (4106816-3969472) less memory to boot.
        The fixed kernel takes 309184 KB (4628416-4181888-137344) less memory to create 100 memcgs.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      [mpe: Reformat change log a bit for readability]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200817055257.110873-1-srikar@linux.vnet.ibm.com
    • powerpc/topology: Override cpu_smt_mask · f3232321
      Srikar Dronamraju authored
      On Power9, a pair of SMT4 cores can be fused and presented by the
      firmware as an SMT8 core for backward compatibility reasons. Powerpc
      allows LPARs to be live migrated from Power8 to Power9. Existing
      software developed/configured for Power8 expects to see an SMT8 core.
      
      In order to maintain userspace backward compatibility (with Power8 chips
      in the case of Power9) in enterprise Linux systems, topology_sibling_cpumask
      has to span the SMT8 core.
      
      However, cpu_smt_mask() should generally point to the cpu mask of the
      SMT4 core. Hence override the default cpu_smt_mask() with a
      powerpc-specific implementation, allowing for better scheduling
      behaviour on Power.
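
      A sketch of the override, assuming the existing powerpc helper
      cpu_smallcore_mask(), which already falls back to the sibling mask when
      the core is not in SMT8/big-core mode (the exact guards in the real
      patch may differ):

        /* arch/powerpc (sketch): let the scheduler treat the SMT4 sub-core as the SMT unit. */
        #ifdef CONFIG_PPC64
        static inline const struct cpumask *cpu_smt_mask(int cpu)
        {
                return cpu_smallcore_mask(cpu);
        }
        #define cpu_smt_mask cpu_smt_mask
        #endif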
      
      schbench
      (latency measured in usecs, so lower is better)
      Without patch                   With patch
      Latency percentiles (usec)	Latency percentiles (usec)
      	50.0000th: 34           	50.0000th: 38
      	75.0000th: 47           	75.0000th: 52
      	90.0000th: 54           	90.0000th: 60
      	95.0000th: 57           	95.0000th: 64
      	*99.0000th: 62          	*99.0000th: 72
      	99.5000th: 65           	99.5000th: 75
      	99.9000th: 76           	99.9000th: 3452
      	min=0, max=9205         	min=0, max=9344
      
      schbench (With Cede disabled)
      Without patch                   With patch
      Latency percentiles (usec) 	Latency percentiles (usec)
      	50.0000th: 20           	50.0000th: 21
      	75.0000th: 28           	75.0000th: 29
      	90.0000th: 33           	90.0000th: 34
      	95.0000th: 35           	95.0000th: 37
      	*99.0000th: 40          	*99.0000th: 40
      	99.5000th: 48           	99.5000th: 42
      	99.9000th: 94           	99.9000th: 79
      	min=0, max=791          	min=0, max=791
      
      perf bench sched pipe
      usec/ops : lower is better
      Without patch
        N           Min           Max        Median           Avg        Stddev
      101      5.095113      5.595269      5.204842     5.2298776    0.10762713
      
      5.10 - 5.15 : ##################################################   23% (24)
      5.15 - 5.20 : #############################################        21% (22)
      5.20 - 5.25 : ##################################################   23% (24)
      5.25 - 5.30 : #########################                            11% (12)
      5.30 - 5.35 : ##########                                            4% (5)
      5.35 - 5.40 : ########                                              3% (4)
      5.40 - 5.45 : ########                                              3% (4)
      5.45 - 5.50 : ####                                                  1% (2)
      5.50 - 5.55 : ##                                                    0% (1)
      5.55 - 5.60 : ####                                                  1% (2)
      
      With patch
        N           Min           Max        Median           Avg        Stddev
      101      5.134675      8.524719      5.207658     5.2780985    0.34911969
      
      5.1 - 5.5 : ##################################################   94% (95)
      5.5 - 5.8 : ##                                                    3% (4)
      5.8 - 6.2 :                                                       0% (1)
      6.2 - 6.5 :
      6.5 - 6.8 :
      6.8 - 7.2 :
      7.2 - 7.5 :
      7.5 - 7.8 :
      7.8 - 8.2 :
      8.2 - 8.5 :
      
      perf bench sched pipe (cede disabled)
      usec/ops : lower is better
      Without patch
        N           Min           Max        Median           Avg        Stddev
      101      7.884227     12.576538      7.956474     8.0170722    0.46159054
      
      7.9 - 8.4 : ##################################################   99% (100)
      8.4 - 8.8 :
      8.8 - 9.3 :
      9.3 - 9.8 :
      9.8 - 10.2 :
      10.2 - 10.7 :
      10.7 - 11.2 :
      11.2 - 11.6 :
      11.6 - 12.1 :
      12.1 - 12.6 :
      
      With patch
        N           Min           Max        Median           Avg        Stddev
      101      7.956021      8.217284      8.015615     8.0283866   0.049844967
      
      7.96 - 7.98 : ######################                               12% (13)
      7.98 - 8.01 : ##################################################   28% (29)
      8.01 - 8.03 : ####################################                 20% (21)
      8.03 - 8.06 : #########################                            14% (15)
      8.06 - 8.09 : ######################                               12% (13)
      8.09 - 8.11 : ######                                                3% (4)
      8.11 - 8.14 : ###                                                   1% (2)
      8.14 - 8.17 : ###                                                   1% (2)
      8.17 - 8.19 :
      8.19 - 8.22 : #                                                     0% (1)
      
      Observations: With the patch, the initial run/iteration takes slightly
      longer. This can be attributed to the fact that we now pick a CPU from an
      idle core, which could be in cede (sleep) state. Once the cede state is
      removed, the numbers improve in favour of the patch.
      
      ebizzy:
      transactions per second (higher is better)
      without patch
        N           Min           Max        Median           Avg        Stddev
      100       1018433       1304470       1193208     1182315.7     60018.733
      
      1018433 - 1047037 : ######                                                3% (3)
      1047037 - 1075640 : ########                                              4% (4)
      1075640 - 1104244 : ########                                              4% (4)
      1104244 - 1132848 : ###############                                       7% (7)
      1132848 - 1161452 : ####################################                 17% (17)
      1161452 - 1190055 : ##########################                           12% (12)
      1190055 - 1218659 : #############################################        21% (21)
      1218659 - 1247263 : ##################################################   23% (23)
      1247263 - 1275866 : ########                                              4% (4)
      1275866 - 1304470 : ########                                              4% (4)
      
      with patch
        N           Min           Max        Median           Avg        Stddev
      100        967014       1292938       1208819     1185281.8     69815.851
      
       967014 - 999606  : ##                                                    1% (1)
       999606 - 1032199 : ##                                                    1% (1)
      1032199 - 1064791 : ############                                          6% (6)
      1064791 - 1097384 : ##########                                            5% (5)
      1097384 - 1129976 : ##################                                    9% (9)
      1129976 - 1162568 : ####################                                 10% (10)
      1162568 - 1195161 : ##########################                           13% (13)
      1195161 - 1227753 : ############################################         22% (22)
      1227753 - 1260346 : ##################################################   25% (25)
      1260346 - 1292938 : ##############                                        7% (7)
      
      Observations: Not much change; ebizzy is not significantly impacted.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200807074517.27957-2-srikar@linux.vnet.ibm.com
    • sched/topology: Allow archs to override cpu_smt_mask · 3babbe44
      Srikar Dronamraju authored
      cpu_smt_mask() tracks topology_sibling_cpumask(), which is fine for most
      architectures. One of the users of cpu_smt_mask() is the scheduler's
      idle-core detection. On Power9, a pair of SMT4 cores can be presented
      by the firmware as an SMT8 core for backward compatibility reasons.
      
      powerpc allows LPARs to be live migrated from Power8 to Power9. Note
      that Power8 had only SMT8 cores. Existing software which has been
      developed/configured for Power8 would expect to see an SMT8 core, so
      maintaining the illusion of an SMT8 core is a requirement to make that
      work.
      
      In order to maintain the above userspace backward compatibility with
      previous processor versions, from Power9 onwards the firmware has the
      option to advertise a pair of SMT4 cores as fused cores, aka an SMT8
      core. On Power9 this pair shares the L2 cache as well. However, from
      the scheduler's point of view, a core should be determined by SMT4,
      since it is a completely independent unit of compute. Hence allow the
      powerpc architecture to override the default cpu_smt_mask() to point
      to the SMT4 cores when in SMT8 mode.
      
      This will ensure the scheduler is always given the right information.
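
      The generic side presumably just makes the default definition
      overridable; a simplified sketch along the lines of
      include/linux/topology.h:

        #ifdef CONFIG_SCHED_SMT
        /* An architecture may provide its own cpu_smt_mask(); otherwise the
         * default continues to track the sibling cpumask. */
        #ifndef cpu_smt_mask
        static inline const struct cpumask *cpu_smt_mask(int cpu)
        {
                return topology_sibling_cpumask(cpu);
        }
        #endif
        #endif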
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200807074517.27957-1-srikar@linux.vnet.ibm.com
    • drivers/macintosh/smu.c: Fix undeclared symbol warning · 3db8715e
      Wang Wensheng authored
      Building the kernel with `C=2` (sparse) produces the following warning:
      drivers/macintosh/smu.c:1018:30: warning: symbol
      '__smu_get_sdb_partition' was not declared. Should it be static?
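
      The fix sparse suggests is to give the symbol internal linkage; a sketch
      (the signature is inferred from context and may not match the file
      exactly):

        /* Presumed fix: the helper is only used within smu.c, so make it static. */
        static struct smu_sdbp_header *__smu_get_sdb_partition(int id,
                                unsigned int *size, int interruptible);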
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200914122615.65669-1-wangwensheng4@huawei.com
    • powerpc/papr_scm: Fix warning triggered by perf_stats_show() · ca78ef2f
      Vaibhav Jain authored
      A warning is reported by the kernel when perf_stats_show() returns
      an error code. The warning is of the form below:
      
       papr_scm ibm,persistent-memory:ibm,pmemory@44100001:
       	  Failed to query performance stats, Err:-10
       dev_attr_show: perf_stats_show+0x0/0x1c0 [papr_scm] returned bad count
       fill_read_buffer: dev_attr_show+0x0/0xb0 returned bad count
      
      On investigation it looks like the compiler is silently truncating the
      return value of drc_pmem_query_stats() from 'long' to 'int', since the
      variable used to store the return code, 'rc', is an 'int'. This
      truncated value is then returned as a 'ssize_t' from perf_stats_show()
      to dev_attr_show(), which treats it as a large unsigned number and
      triggers this warning.
      
      To fix this, update the type of the variable 'rc' from 'int' to
      'ssize_t'. That prevents the compiler from truncating the return value
      of drc_pmem_query_stats() and returns the correct signed value from
      perf_stats_show().
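
      In essence, a simplified sketch (argument names are illustrative; only
      the type of 'rc' changes):

        ssize_t rc;     /* was 'int', which truncated the 'long' result */

        rc = drc_pmem_query_stats(p, stats, num_stats);
        if (rc < 0)
                return rc;      /* propagated unmodified as ssize_t */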
      
      Fixes: 2d02bf83 ("powerpc/papr_scm: Fetch nvdimm performance stats from PHYP")
      Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200912081451.66225-1-vaibhav@linux.ibm.com
  2. 15 Sep, 2020 34 commits