    EDAC/amd64: Cache and use GPU node map · 4251566e
    Yazen Ghannam authored
    AMD systems have historically provided an "AMD Node ID" that is a unique
    identifier for each die in a multi-die package. On legacy systems it was
    associated with a unique instance of the AMD Northbridge; on modern
    systems it is associated with a unique instance of the AMD Data Fabric.
    Each instance is referred to as a "Node"; this is an
    AMD-specific term not to be confused with NUMA nodes.
    
    The data fabric provides a number of interfaces accessible through a set
    of functions in a single PCI device. There is one PCI device per Data
    Fabric (AMD Node), and multi-die systems will see multiple such PCI
    devices. The AMD Node ID matches a Node's position in the PCI hierarchy.
    For example, Node 0 is accessed using the first PCI device, Node 1
    is accessed using the second, and so on. A logical CPU can find its AMD
    Node ID using CPUID. Furthermore, the AMD Node ID is used within the
    hardware fabric, so it is not purely a logical value.
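
    As an illustration of the CPUID lookup mentioned above, the sketch
    below decodes an AMD Node ID from a raw register value. It assumes the
    Node ID is reported in ECX[7:0] of CPUID leaf 0x8000001E, as on recent
    AMD families; the macro and helper names are hypothetical and are not
    part of this patch.

```c
#include <stdint.h>

/* Assumption: CPUID leaf 0x8000001E reports the AMD Node ID in ECX[7:0]. */
#define NODE_ID_MASK 0xFFu

/* Hypothetical helper: decode the AMD Node ID from a raw ECX value. */
static inline unsigned int amd_node_id_from_ecx(uint32_t ecx)
{
	return ecx & NODE_ID_MASK;
}
```

    The helper takes the register value as an argument so the decode step
    can be checked on any machine, independent of how CPUID is executed.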
    
    Heterogeneous AMD systems, with a CPU Data Fabric connected to GPU data
    fabrics, follow a similar convention. Each CPU and GPU die has a unique
    AMD Node ID value, and each Node ID corresponds to PCI devices in
    sequential order.
    
    However, there are two caveats:
    1) GPUs are not x86, and they don't have CPUID to read their AMD Node ID
    like on CPUs. This means the value is more implicit and based on PCI
    enumeration and hardware-specifics.
    2) There is a gap in the hardware values for AMD Node IDs. Values 0-7
    are for CPUs and values 8-15 are for GPUs.
    
    For example, a system with one CPU die and two GPU dies will have the
    following values:
      CPU0 -> AMD Node 0
      GPU0 -> AMD Node 8
      GPU1 -> AMD Node 9
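
    The numbering above can be sketched as a small helper. The base values
    come from the ranges described in this message (0-7 for CPUs, 8-15 for
    GPUs); the enum, macros, and function name are hypothetical and only
    serve the illustration.

```c
#define CPU_NODE_BASE 0u	/* hardware values 0-7 are CPU dies */
#define GPU_NODE_BASE 8u	/* hardware values 8-15 are GPU dies */

enum die_type { DIE_CPU, DIE_GPU };

/* Hypothetical helper: hardware AMD Node ID of the index-th die of a type. */
static unsigned int hw_node_id(enum die_type type, unsigned int index)
{
	unsigned int base = (type == DIE_GPU) ? GPU_NODE_BASE : CPU_NODE_BASE;

	return base + index;
}
```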
    
    EDAC is the only subsystem where this has a practical effect. Memory
    errors on AMD systems are commonly reported through MCA to a CPU on the
    local AMD Node. The error information is passed along to EDAC where the
    AMD EDAC modules use the AMD Node ID of the reporting logical CPU to
    access AMD Node information.
    
    However, memory errors from a GPU die will be reported to the CPU die.
    Therefore, the logical CPU's AMD Node ID can't be used since it won't
    match the AMD Node ID of the GPU die. The AMD Node ID of the GPU die is
    provided as part of the MCA information, and the value will match the
    hardware enumeration (e.g. 8-15).
    
    Handle this situation by discovering GPU dies the same way as CPU dies
    in the AMD NB code. But do a "node id" fixup in AMD64 EDAC where it's
    needed.
    
    The GPU data fabrics provide a register with the base AMD Node ID for
    their local "type", i.e. GPU data fabric. This value is the same for all
    fabrics of the same type in a system.
    
    Read and cache the base AMD Node ID from one of the GPU devices during
    module initialization. Use this to fix up the "node id" when reporting
    memory errors at runtime.
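
    A minimal user-space sketch of such a fixup follows. It is not the code
    in this patch: the struct, its fields, and the function name are
    hypothetical, and it assumes that GPU nodes are enumerated directly
    after the CPU nodes, so the logical index of a GPU node is its offset
    from the GPU base plus the number of CPU nodes.

```c
/* Hypothetical cache, filled once at init from a GPU data fabric register. */
struct gpu_node_info {
	unsigned int base_node_id;	/* first hardware Node ID used by GPUs */
	unsigned int first_index;	/* assumption: count of CPU nodes ahead of the GPUs */
};

/* Convert a hardware AMD Node ID to a sequential (PCI-order) node index. */
static unsigned int fixup_node(const struct gpu_node_info *map,
			       unsigned int hw_node_id)
{
	/* IDs below the GPU base belong to CPU nodes and need no fixup. */
	if (hw_node_id < map->base_node_id)
		return hw_node_id;

	return hw_node_id - map->base_node_id + map->first_index;
}
```

    With one CPU die and a GPU base of 8, hardware Node IDs 8 and 9 map to
    indexes 1 and 2, matching the sequential PCI device order described
    earlier.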
    
      [ bp: Squash a fix making gpu_node_map static as reported by
            Tom Rix <trix@redhat.com>.
        Link: https://lore.kernel.org/r/20230610210930.174074-1-trix@redhat.com ]
    Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
    Co-developed-by: Muralidhara M K <muralidhara.mk@amd.com>
    Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
    Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
    Link: https://lore.kernel.org/r/20230515113537.1052146-6-muralimk@amd.com