    powerpc/numa: Restrict possible nodes based on platform · 67df7784
    Srikar Dronamraju authored
    As per draft LoPAPR (Revision 2.9_pre7), section B.5.3 "Run Time
    Abstraction Services (RTAS) Node" available at:
      https://openpowerfoundation.org/wp-content/uploads/2020/07/LoPAR-20200611.pdf
    
    ... there are 2 device tree properties:
    
      "ibm,max-associativity-domains"
       which defines the maximum number of domains that the firmware,
       i.e. PowerVM, can support.
    
    and:
    
      "ibm,current-associativity-domains"
       which defines the maximum number of domains that the current
       platform can support.
    
    The value of "ibm,max-associativity-domains" is always greater than or
    equal to that of the "ibm,current-associativity-domains" property. If
    the latter property is not available, use "ibm,max-associativity-domains"
    as a fallback. In this yet-to-be-released LoPAPR,
    "ibm,current-associativity-domains" is mentioned on page 833 / B.5.3,
    which is covered under the "Appendix B. System Binding" section.
    
    Currently powerpc uses the "ibm,max-associativity-domains" property
    while setting the possible number of nodes. This is currently set at
    32. However the possible number of nodes for a platform may be
    significantly less. Hence set the possible number of nodes based on
    "ibm,current-associativity-domains" property.
    
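The selection rule described above can be sketched in userspace terms. This is a minimal illustration, not the kernel code: the function name `possible_node_count`, the dictionary layout, and the choice of cell index 4 as the node-level entry are all assumptions made for the example. The cell values are the ones shown in the lsprop output later in this message.

```python
CURRENT = "ibm,current-associativity-domains"
MAX = "ibm,max-associativity-domains"


def possible_node_count(rtas_props, depth=4):
    """Return the number of possible NUMA nodes, or None if unknown.

    rtas_props maps a property name to its list of 32-bit cells.
    depth is the index of the cell holding the node-level domain count;
    the value 4 is an assumption made for this illustration.
    """
    # Prefer the platform-specific property, fall back to the firmware-wide one.
    cells = rtas_props.get(CURRENT) or rtas_props.get(MAX)
    if not cells or depth >= len(cells):
        return None
    return cells[depth]


props = {
    MAX: [5, 1, 8, 0x20, 0x20, 0x100],
    CURRENT: [5, 1, 2, 2, 2, 0x10],
}
print(possible_node_count(props))              # 2
print(possible_node_count({MAX: props[MAX]}))  # 32
```

When only the "max" property is present, the sketch returns the larger firmware-wide limit, which matches the fallback behaviour described above.
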
    Nathan Lynch had raised a valid concern that post LPM (Live Partition
    Migration), a user could DLPAR add processors and memory after LPM
    with "new" associativity properties:
      https://lore.kernel.org/linuxppc-dev/871rljfet9.fsf@linux.ibm.com/t/#u
    
    He also pointed out that "ibm,max-associativity-domains" has the same
    contents on all currently available PowerVM systems, unlike
    "ibm,current-associativity-domains" and hence may be better able to
    handle the new NUMA associativity properties.
    
    However with the recent commit dbce4562 ("powerpc/numa: Limit
    possible nodes to within num_possible_nodes"), all new NUMA
    associativity properties are capped to the initially set nr_node_ids.
    Hence this commit should be safe with any new DLPAR add post LPM.
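
A rough model of why the cap makes a late DLPAR add safe is sketched below. It assumes that a node id outside the range fixed at boot is rejected rather than clamped; the referenced commit is not quoted here, so that behaviour and the helper name `cap_node_id` are assumptions for illustration only.

```python
NUMA_NO_NODE = -1  # sentinel for "no valid node", same value as the kernel constant


def cap_node_id(nid, nr_node_ids):
    """Reject node ids outside the range fixed at boot (assumed behaviour)."""
    if nid < 0 or nid >= nr_node_ids:
        return NUMA_NO_NODE
    return nid


# With two possible nodes, ids 0 and 1 pass through unchanged.
print(cap_node_id(1, 2))  # 1
# A "new" associativity that names node 5 after LPM is not accepted as-is.
print(cap_node_id(5, 2))  # -1
```

Either way, no node id at or beyond nr_node_ids reaches code that sized its structures for the smaller possible-node count.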
    
      $ lsprop /proc/device-tree/rtas/ibm,*associ*-domains
      /proc/device-tree/rtas/ibm,current-associativity-domains
      		 00000005 00000001 00000002 00000002 00000002 00000010
      /proc/device-tree/rtas/ibm,max-associativity-domains
      		 00000005 00000001 00000008 00000020 00000020 00000100
    
      $ cat /sys/devices/system/node/possible ##Before patch
      0-31
    
      $ cat /sys/devices/system/node/possible ##After patch
      0-1
    
    Note that this platform can support at most 2 nodes, yet before the
    patch the number of possible nodes was set to 32.
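
The relationship between the property cells and the sysfs ranges shown above can be checked with a small script. This is a sketch under assumptions: the helper names `parse_lsprop` and `node_range` are made up for the example, and `DEPTH = 4` (the cell index read as the node-level domain count) is assumed rather than taken from this message.

```python
LSPROP_OUTPUT = """\
/proc/device-tree/rtas/ibm,current-associativity-domains
    00000005 00000001 00000002 00000002 00000002 00000010
/proc/device-tree/rtas/ibm,max-associativity-domains
    00000005 00000001 00000008 00000020 00000020 00000100
"""

DEPTH = 4  # assumed index of the node-level cell


def parse_lsprop(text):
    """Map each property name to its list of integer cells."""
    props, name = {}, None
    for line in text.splitlines():
        if line.startswith("/"):
            # A path line starts a new property; keep only the last component.
            name = line.rsplit("/", 1)[-1]
            props[name] = []
        elif name and line.strip():
            # Value lines hold whitespace-separated hexadecimal cells.
            props[name].extend(int(cell, 16) for cell in line.split())
    return props


def node_range(count):
    """Format a node count the way /sys/devices/system/node/possible does."""
    return "0" if count == 1 else f"0-{count - 1}"


props = parse_lsprop(LSPROP_OUTPUT)
print(node_range(props["ibm,max-associativity-domains"][DEPTH]))      # 0-31
print(node_range(props["ibm,current-associativity-domains"][DEPTH]))  # 0-1
```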
    
    This is important because a lot of kernel and user space code allocates
    structures for all possible nodes, leading to a lot of memory that is
    allocated but never used.
    
    I ran a simple experiment to create and destroy 100 memory cgroups on
    boot on an 8-node machine (Power8 Alpine).
    
    Before patch:
      free -k at boot
                    total        used        free      shared  buff/cache   available
      Mem:      523498176     4106816   518820608       22272      570752   516606720
      Swap:       4194240           0     4194240
    
      free -k after creating 100 memory cgroups
                    total        used        free      shared  buff/cache   available
      Mem:      523498176     4628416   518246464       22336      623296   516058688
      Swap:       4194240           0     4194240
    
      free -k after destroying 100 memory cgroups
                    total        used        free      shared  buff/cache   available
      Mem:      523498176     4697408   518173760       22400      627008   515987904
      Swap:       4194240           0     4194240
    
    After patch:
      free -k at boot
                    total        used        free      shared  buff/cache   available
      Mem:      523498176     3969472   518933888       22272      594816   516731776
      Swap:       4194240           0     4194240
    
      free -k after creating 100 memory cgroups
                    total        used        free      shared  buff/cache   available
      Mem:      523498176     4181888   518676096       22208      640192   516496448
      Swap:       4194240           0     4194240
    
      free -k after destroying 100 memory cgroups
                    total        used        free      shared  buff/cache   available
      Mem:      523498176     4232320   518619904       22272      645952   516443264
      Swap:       4194240           0     4194240
    
    Observations:
      Fixed kernel takes 137344 KB (4106816-3969472) less to boot.
      Fixed kernel takes 309184 KB (4628416-4181888-137344) less to create 100 memcgs.
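
The two figures can be reproduced from the "used" column of the free -k output above. The snippet below only restates that arithmetic; the variable names are illustrative.

```python
# "used" column from the free -k output above.
before = {"boot": 4106816, "created": 4628416, "destroyed": 4697408}
after = {"boot": 3969472, "created": 4181888, "destroyed": 4232320}

# Saving already present at boot.
boot_saving = before["boot"] - after["boot"]

# Saving attributable to creating 100 memcgs, net of the boot-time saving.
create_saving = (before["created"] - after["created"]) - boot_saving

# Equivalent view: growth in "used" caused by creating the cgroups on each kernel.
growth_before = before["created"] - before["boot"]
growth_after = after["created"] - after["boot"]

print(boot_saving)                   # 137344
print(create_saving)                 # 309184
print(growth_before - growth_after)  # 309184
```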

    Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
    [mpe: Reformat change log a bit for readability]
    Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    Link: https://lore.kernel.org/r/20200817055257.110873-1-srikar@linux.vnet.ibm.com