    sb_edac: Fix discovery of top-of-low-memory for Haswell · f7cf2a22
    Tony Luck authored
    Haswell moved the TOLM/TOHM registers to a different device and offset.
    The sb_edac driver accounted for the change of device, but not for the
    new offset.  There was also a typo in the constant to fill in the low
    26 bits (was 0x1ffffff, should be 0x3ffffff).
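
As a rough illustration of the mask typo, the sketch below rebuilds a top-of-low-memory value from a register field plus a fill constant. It is not the driver's code: the helper names `bits` and `tolm_from_reg` are hypothetical, the placement of the field in bits 26..31 is an assumption inferred from "low 26 bits", and the register value 0x7c000000 is an assumption chosen only because it reproduces the logged result.

```c
#include <stdint.h>

/* Hypothetical helper: extract bits lo..hi (inclusive) of a 32-bit value.
 * Only valid for fields narrower than 32 bits. */
static uint32_t bits(uint32_t v, unsigned lo, unsigned hi)
{
    return (v >> lo) & ((1u << (hi - lo + 1)) - 1u);
}

/* Hypothetical helper: keep the upper field (assumed to be bits 26..31)
 * and fill the bits below it with the caller-supplied constant. */
static uint64_t tolm_from_reg(uint32_t reg, uint64_t fill)
{
    return ((uint64_t)bits(reg, 26, 31) << 26) | fill;
}
```

Under these assumptions, `tolm_from_reg(0, 0x1ffffff)` gives 0x1ffffff, matching the bogus log line if the read at the wrong offset returned a zero field. `tolm_from_reg(0x7c000000, 0x3ffffff)` gives 0x7fffffff, while the same register with the old constant gives 0x7dffffff, because 0x1ffffff leaves bit 25 clear.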
    
    This resulted in a bogus value for the top of low memory:
    
      EDAC DEBUG: get_memory_layout: TOLM: 0.032 GB (0x0000000001ffffff)
    
    which would result in EDAC refusing to translate addresses for
    errors above the bogus value and below 4GB:
    
       sbridge MC3: HANDLING MCE MEMORY ERROR
       sbridge MC3: CPU 0: Machine Check Event: 0 Bank 7: 8c00004000010090
       sbridge MC3: TSC 0
       sbridge MC3: ADDR 2000000
       sbridge MC3: MISC 523eac86
       sbridge MC3: PROCESSOR 0:306f3 TIME 1414600951 SOCKET 0 APIC 0
       MC3: 1 CE Error at TOLM area, on addr 0x02000000 on any memory ( page:0x0 offset:0x0 grain:32 syndrome:0x0)
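
The rejection above can be pictured as a simple range test. The sketch below is an assumption about the shape of that check, based only on the description "above the bogus value and below 4GB"; the name `in_tolm_gap` is hypothetical and is not taken from the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical predicate: true when addr lies strictly above the
 * top of low memory and below 4 GB, i.e. in the range the driver
 * is described as refusing to translate. */
static bool in_tolm_gap(uint64_t addr, uint64_t tolm)
{
    return addr > tolm && addr < (1ULL << 32);
}
```

With the bogus TOLM of 0x1ffffff, `in_tolm_gap(0x2000000, 0x1ffffff)` is true, so the error at ADDR 2000000 is rejected. With a TOLM of 0x7fffffff the same address falls below the limit and the predicate is false.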
    
    With the fix we see the correct TOLM value:
    
       DEBUG: get_memory_layout: TOLM: 2.048 GB (0x000000007fffffff)
    
    and we decode address 2000000 correctly:
    
       sbridge MC3: HANDLING MCE MEMORY ERROR
       sbridge MC3: CPU 0: Machine Check Event: 0 Bank 7: 8c00004000010090
       sbridge MC3: TSC 0
       sbridge MC3: ADDR 2000000
       sbridge MC3: MISC 523e1086
       sbridge MC3: PROCESSOR 0:306f3 TIME 1414601319 SOCKET 0 APIC 0
       DEBUG: get_memory_error_data: SAD interleave package: 0 = CPU socket 0, HA 0, shiftup: 0
       DEBUG: get_memory_error_data: TAD#0: address 0x0000000002000000 < 0x000000007fffffff, socket interleave 1, channel interleave 4 (offset 0x00000000), index 0, base ch: 0, ch mask: 0x01
       DEBUG: get_memory_error_data: RIR#0, limit: 4.095 GB (0x00000000ffffffff), way: 1
       DEBUG: get_memory_error_data: RIR#0: channel address 0x00200000 < 0xffffffff, RIR interleave 0, index 0
       DEBUG: sbridge_mce_output_error:  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0
       MC3: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x2000 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
    
    Signed-off-by: Tony Luck <tony.luck@intel.com>
    Acked-by: Aristeu Rozanski <aris@redhat.com>
    Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
sb_edac.c 61.1 KB