    x86/mm: Print likely CPU at segfault time · c926087e
    Rik van Riel authored
    In a large enough fleet of computers, it is common to have a few bad CPUs.
    Those can often be identified by seeing that some commonly run kernel code,
    which runs fine everywhere else, keeps crashing on the same CPU core on one
    particular bad system.
    
    However, the failure modes in CPUs that have gone bad over the years are
    often oddly specific, and the only bad behavior seen might be segfaults
    in programs like bash, python, or various system daemons that run fine
    everywhere else.
    
    Add a printk() to show_signal_msg() to print the CPU, core, and socket
    at segfault time.
    
    This is not perfect, since the task might get rescheduled on another
    CPU between when the fault hit and when the message is printed, but in
    practice this has been good enough to help people identify several bad
    CPU cores.
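
The same lookup (current CPU, then its core and socket) can be approximated
from userspace. The sketch below is only an analog of the kernel change, not
the patch itself. It assumes glibc's sched_getcpu() and the sysfs topology
files core_id and physical_package_id, and it assumes those files report the
same values the kernel message prints. The helper names are hypothetical.
Like the printk, the result can be stale if the thread migrates between the
lookup and the print.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/*
 * Hypothetical helper: read one integer from
 * /sys/devices/system/cpu/cpu<N>/topology/<name>.
 * Returns -1 if the file is missing or unreadable.
 */
static int read_topology_id(int cpu, const char *name)
{
	char path[128];
	int value = -1;
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%d/topology/%s", cpu, name);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%d", &value) != 1)
		value = -1;
	fclose(f);
	return value;
}

/* Format a suffix shaped like the one in the commit's example line. */
static int format_likely_cpu(char *buf, size_t len, int cpu, int core,
			     int socket)
{
	return snprintf(buf, len, " likely on CPU %d (core %d, socket %d)",
			cpu, core, socket);
}

/*
 * Describe the CPU the calling thread is running on right now.
 * Returns -1 if the CPU number cannot be determined.
 */
static int describe_current_cpu(char *buf, size_t len)
{
	int cpu = sched_getcpu();	/* may be stale by the time it is used */

	if (cpu < 0)
		return -1;
	return format_likely_cpu(buf, len, cpu,
				 read_topology_id(cpu, "core_id"),
				 read_topology_id(cpu, "physical_package_id"));
}
```

A caller would pass a buffer of 64 bytes or so to describe_current_cpu() and
print the result. A core or socket of -1 means the sysfs file was unavailable.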
    
    For example:
    
      segfault[1349]: segfault at 0 ip 000000000040113a sp 00007ffc6d32e360 error 4 in \
    	  segfault[401000+1000] likely on CPU 0 (core 0, socket 0)
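
To spot a core that keeps showing up, the suffix can be pulled out of
collected log lines and tallied per CPU. The following is a minimal sketch
that assumes the suffix has exactly the shape shown in the example above and
that each message sits on one line in the log (the backslash is only wrapping
for this commit message). The struct and function names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the three numbers in the suffix. */
struct likely_cpu {
	int cpu;
	int core;
	int socket;
};

/*
 * Find "likely on CPU N (core N, socket N)" in a log line and extract the
 * three numbers. Returns 0 on success, -1 if the suffix is absent or
 * malformed.
 */
static int parse_likely_cpu(const char *line, struct likely_cpu *out)
{
	const char *p = strstr(line, "likely on CPU ");

	if (!p)
		return -1;
	if (sscanf(p, "likely on CPU %d (core %d, socket %d)",
		   &out->cpu, &out->core, &out->socket) != 3)
		return -1;
	return 0;
}
```

Lines from kernels without this change have no such suffix and make the
function return -1, so they can be skipped rather than miscounted.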
    
    This printk can be controlled through /proc/sys/debug/exception-trace.
    
      [ bp: Massage a bit, add "likely" to the printed line to denote that
        the CPU number is not always reliable. ]
    Signed-off-by: Rik van Riel <riel@surriel.com>
    Signed-off-by: Borislav Petkov <bp@suse.de>
    Link: https://lore.kernel.org/r/20220805101644.2e674553@imladris.surriel.com