• Valentin Schneider's avatar
    arm64: topology: Stop using MPIDR for topology information · 3102bc0e
    Valentin Schneider authored
    In the absence of ACPI or DT topology data, we fallback to haphazardly
    decoding *something* out of MPIDR. Sadly, the contents of that register are
    mostly unusable due to the implementation leniancy and things like Aff0
    having to be capped to 15 (despite being encoded on 8 bits).
    
    Consider a simple system with a single package of 32 cores, all under the
    same LLC. We ought to be shoving them in the same core_sibling mask, but
    MPIDR is going to look like:
    
      | CPU  | 0 | ... | 15 | 16 | ... | 31 |
      |------+---+-----+----+----+-----+----+
      | Aff0 | 0 | ... | 15 |  0 | ... | 15 |
      | Aff1 | 0 | ... |  0 |  1 | ... |  1 |
      | Aff2 | 0 | ... |  0 |  0 | ... |  0 |
    
    Which will eventually yield
    
      core_sibling(0-15)  == 0-15
      core_sibling(16-31) == 16-31
    
    NUMA woes
    =========
    
    If we try to play games with this and set up NUMA boundaries within those
    groups of 16 cores via e.g. QEMU:
    
      # Node0: 0-9; Node1: 10-19
      $ qemu-system-aarch64 <blah> \
        -smp 20 -numa node,cpus=0-9,nodeid=0 -numa node,cpus=10-19,nodeid=1
    
    The scheduler's MC domain (all CPUs with same LLC) is going to be built via
    
      arch_topology.c::cpu_coregroup_mask()
    
    In there we try to figure out a sensible mask out of the topology
    information we have. In short, here we'll pick the smallest of NUMA or
    core sibling mask.
    
      node_mask(CPU9)    == 0-9
      core_sibling(CPU9) == 0-15
    
    MC mask for CPU9 will thus be 0-9, not a problem.
    
      node_mask(CPU10)    == 10-19
      core_sibling(CPU10) == 0-15
    
    MC mask for CPU10 will thus be 10-19, not a problem.
    
      node_mask(CPU16)    == 10-19
      core_sibling(CPU16) == 16-19
    
    MC mask for CPU16 will thus be 16-19... Uh oh. CPUs 16-19 are in two
    different unique MC spans, and the scheduler has no idea what to make of
    that. That triggers the WARN_ON() added by commit
    
      ccf74128 ("sched/topology: Assert non-NUMA topology masks don't (partially) overlap")
    
    Fixing MPIDR-derived topology
    =============================
    
    We could try to come up with some cleverer scheme to figure out which of
    the available masks to pick, but really if one of those masks resulted from
    MPIDR then it should be discarded because it's bound to be bogus.
    
    I was hoping to give MPIDR a chance for SMT, to figure out which threads are
    in the same core using Aff1-3 as core ID, but Sudeep and Robin pointed out
    to me that there are systems out there where *all* cores have non-zero
    values in their higher affinity fields (e.g. RK3288 has "5" in all of its
    cores' MPIDR.Aff1), which would expose a bogus core ID to userspace.
    
    Stop using MPIDR for topology information. When no other source of topology
    information is available, mark each CPU as its own core and its NUMA node
    as its LLC domain.
    Signed-off-by: default avatarValentin Schneider <valentin.schneider@arm.com>
    Reviewed-by: default avatarSudeep Holla <sudeep.holla@arm.com>
    Link: https://lore.kernel.org/r/20200829130016.26106-1-valentin.schneider@arm.comSigned-off-by: default avatarWill Deacon <will@kernel.org>
    3102bc0e
topology.c 8.09 KB