    powerpc/numa: Offline memoryless cpuless node 0 · e75130f2
    Srikar Dronamraju authored
    Currently, a Linux kernel built with CONFIG_NUMA marks node 0 as online at
    boot on a system with multiple possible nodes. In practice, however, some
    systems have a node 0 that is both memoryless and cpuless.
    
    This can cause numa_balancing to be enabled on systems with only one node
    that has memory and CPUs. The existence of this dummy node, which is cpuless
    and memoryless, can also confuse users and scripts that look at the output
    of lscpu / numactl.
    
    Mark node 0 as offline and stop assuming that node 0 is always online. If
    node 0 has CPUs or memory that are online, node 0 will be set online
    again.
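
The rule described above can be modelled in user space as follows. This is
only a sketch of the intended behaviour, not the kernel implementation; the
function name online_nodes() and its dictionary inputs are hypothetical and
chosen for illustration.

```python
def online_nodes(node_cpus, node_mem_mb):
    """Return the sorted node ids that count as online under the rule above.

    node_cpus:   dict of node id -> list of online CPU ids (hypothetical input)
    node_mem_mb: dict of node id -> online memory in MB    (hypothetical input)
    """
    nodes = set(node_cpus) | set(node_mem_mb)
    # A node is online only if it has at least one online CPU or some memory;
    # node 0 gets no special treatment.
    return sorted(
        n for n in nodes
        if node_cpus.get(n) or node_mem_mb.get(n, 0) > 0
    )


# Node 0 without CPUs and memory, node 2 with 8 CPUs and 32625 MB.
print(online_nodes({0: [], 2: list(range(8))}, {0: 0, 2: 32625}))  # prints [2]
```

With inputs where node 0 does have CPUs or memory, node 0 appears in the
result again, matching the last sentence of the paragraph above.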
    
    v5.8
    ------------------
     available: 2 nodes (0,2)
     node 0 cpus:
     node 0 size: 0 MB
     node 0 free: 0 MB
     node 2 cpus: 0 1 2 3 4 5 6 7
     node 2 size: 32625 MB
     node 2 free: 31490 MB
     node distances:
     node   0   2
       0:  10  20
       2:  20  10
    
    proc and sys files
    ------------------
     /sys/devices/system/node/online:            0,2
     /proc/sys/kernel/numa_balancing:            1
     /sys/devices/system/node/has_cpu:           2
     /sys/devices/system/node/has_memory:        2
     /sys/devices/system/node/has_normal_memory: 2
     /sys/devices/system/node/possible:          0-31
    
    v5.8 + patch
    ------------------
     available: 1 nodes (2)
     node 2 cpus: 0 1 2 3 4 5 6 7
     node 2 size: 32625 MB
     node 2 free: 31487 MB
     node distances:
     node   2
       2:  10
    
    proc and sys files
    ------------------
     /sys/devices/system/node/online:            2
     /proc/sys/kernel/numa_balancing:            0
     /sys/devices/system/node/has_cpu:           2
     /sys/devices/system/node/has_memory:        2
     /sys/devices/system/node/has_normal_memory: 2
     /sys/devices/system/node/possible:          0-31
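
A script could detect the dummy node from the sysfs values listed above
roughly as follows. This is a minimal sketch; the helper names
parse_node_list() and dummy_nodes() are hypothetical, and the strings passed
in are the file contents shown in the two listings rather than values read
from a live system.

```python
def parse_node_list(text):
    """Expand a sysfs node list such as "0,2" or "0-31" into a set of ints."""
    nodes = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file means no nodes
        lo, _, hi = part.partition("-")
        nodes.update(range(int(lo), int(hi or lo) + 1))
    return nodes


def dummy_nodes(online, has_cpu, has_memory):
    """Return online nodes that have neither CPUs nor memory."""
    return (parse_node_list(online)
            - parse_node_list(has_cpu)
            - parse_node_list(has_memory))


print(sorted(dummy_nodes("0,2", "2", "2")))  # v5.8 values: prints [0]
print(sorted(dummy_nodes("2", "2", "2")))    # v5.8 + patch values: prints []
```

On a running system the three arguments would come from the online, has_cpu
and has_memory files under /sys/devices/system/node/.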
    
    Example of a system with online CPUs/memory on node 0.
    (Same output with and without the patch)
    numactl -H
    available: 4 nodes (0-3)
    node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
    node 0 size: 32482 MB
    node 0 free: 22994 MB
    node 1 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
    node 1 size: 0 MB
    node 1 free: 0 MB
    node 2 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
    node 2 size: 0 MB
    node 2 free: 0 MB
    node 3 cpus: 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
    node 3 size: 0 MB
    node 3 free: 0 MB
    node distances:
    node   0   1   2   3
      0:  10  20  40  40
      1:  20  10  40  40
      2:  40  40  10  20
      3:  40  40  20  10
    
    Note: On powerpc, cpu_to_node() of possible but not present CPUs would
    previously return 0. Hence this commit depends on commit ("powerpc/numa: Set
    numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id
    queried from vphn"). Without these two commits, a powerpc system might crash.
    
    1. User space applications like numactl and lscpu that parse sysfs tend to
    believe there is an extra online node. This tends to confuse users and
    applications. Other user space applications start believing that the system
    was not able to use all the resources (i.e. missing resources) or that the
    system was not set up correctly.
    
    2. The existence of the dummy node also leads to inconsistent information.
    The number of online nodes is inconsistent with the information in the
    device-tree and resource-dump.
    
    3. When the dummy node is present, single node non-NUMA systems end up
    showing up as NUMA systems and numa_balancing gets enabled. This means we
    take the hit from unnecessary NUMA hinting faults.
    
    Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
    Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    Link: https://lore.kernel.org/r/20200818081104.57888-4-srikar@linux.vnet.ibm.com
numa.c 29.7 KB