Commit 9f3b1300 authored by Zhiquan Li's avatar Zhiquan Li Committed by Borislav Petkov (AMD)

x86/mce: Mark fatal MCE's page as poison to avoid panic in the kdump kernel

Memory errors don't happen very often, especially fatal ones. However,
in large-scale scenarios such as data centers, that probability
increases with the amount of machines present.

When a fatal machine check happens, mce_panic() is called based on the
severity grading of that error. The page containing the error is not
marked as poison.

However, when kexec is enabled, tools like makedumpfile understand when
pages are marked as poison and do not touch them so as not to cause
a fatal machine check exception again while dumping the previous
kernel's memory.

Therefore, mark the page containing the error as poisoned so that the
kexec'ed kernel can avoid accessing the page.

  [ bp: Rewrite commit message and comment. ]
Co-developed-by: default avatarYouquan Song <youquan.song@intel.com>
Signed-off-by: default avatarYouquan Song <youquan.song@intel.com>
Signed-off-by: default avatarZhiquan Li <zhiquan1.li@intel.com>
Signed-off-by: default avatarBorislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: default avatarNaoya Horiguchi <naoya.horiguchi@nec.com>
Link: https://lore.kernel.org/r/20231014051754.3759099-1-zhiquan1.li@intel.com
parent b85ea95d
......@@ -44,6 +44,7 @@
#include <linux/sync_core.h>
#include <linux/task_work.h>
#include <linux/hardirq.h>
#include <linux/kexec.h>
#include <asm/intel-family.h>
#include <asm/processor.h>
......@@ -233,6 +234,7 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
struct llist_node *pending;
struct mce_evt_llist *l;
int apei_err = 0;
struct page *p;
/*
* Allow instrumentation around external facilities usage. Not that it
......@@ -286,6 +288,20 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
if (!fake_panic) {
if (panic_timeout == 0)
panic_timeout = mca_cfg.panic_timeout;
/*
* Kdump skips the poisoned page in order to avoid
* touching the error bits again. Poison the page even
* if the error is fatal and the machine is about to
* panic.
*/
if (kexec_crash_loaded()) {
if (final && (final->status & MCI_STATUS_ADDRV)) {
p = pfn_to_online_page(final->addr >> PAGE_SHIFT);
if (p)
SetPageHWPoison(p);
}
}
panic(msg);
} else
pr_emerg(HW_ERR "Fake kernel panic: %s\n", msg);
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment