Commit 043b4318 authored by Borislav Petkov

EDAC: Update Documentation/edac.txt

Do some initial cleanup, more probably will come.

- Move credits section to the end
- Update maintainers
- Drop sourceforge reference - project is long upstream now
- Reformat sections
- Reformat paragraphs
- Clarify text
- Bring it up-to-date
- Drop useless "future hardware scanning" section
Signed-off-by: Borislav Petkov <bp@suse.de>
parent 3aae9edd
 EDAC - Error Detection And Correction
 =====================================
-Written by Doug Thompson <dougthompson@xmission.com>
-7 Dec 2005
-17 Jul 2007 Updated
-(c) Mauro Carvalho Chehab
-05 Aug 2009 Nehalem interface
-EDAC is maintained and written by:
-Doug Thompson, Dave Jiang, Dave Peterson et al,
-original author: Thayne Harbaugh,
-Contact:
-website: bluesmoke.sourceforge.net
-mailing list: bluesmoke-devel@lists.sourceforge.net
 "bluesmoke" was the name for this device driver when it was "out-of-tree"
 and maintained at sourceforge.net. When it was pushed into 2.6.16 for the
 first time, it was renamed to 'EDAC'.
-The bluesmoke project at sourceforge.net is now utilized as a 'staging area'
-for EDAC development, before it is sent upstream to kernel.org
-At the bluesmoke/EDAC project site, there is a series of quilt patches against
-recent kernels, stored in a SVN repository. For easier downloading, there
-is also a tarball snapshot available.
-============================================================================
-EDAC PURPOSE
-The 'edac' kernel module goal is to detect and report errors that occur
-within the computer system running under linux.
+PURPOSE
+-------
+The 'edac' kernel module's goal is to detect and report hardware errors
+that occur within the computer system running under linux.
 MEMORY
+------
-In the initial release, memory Correctable Errors (CE) and Uncorrectable
-Errors (UE) are the primary errors being harvested. These types of errors
-are harvested by the 'edac_mc' class of device.
+Memory Correctable Errors (CE) and Uncorrectable Errors (UE) are the
+primary errors being harvested. These types of errors are harvested by
+the 'edac_mc' device.
 Detecting CE events, then harvesting those events and reporting them,
-CAN be a predictor of future UE events. With CE events, the system can
-continue to operate, but with less safety. Preventive maintenance and
-proactive part replacement of memory DIMMs exhibiting CEs can reduce
-the likelihood of the dreaded UE events and system 'panics'.
+*can* but must not necessarily be a predictor of future UE events. With
+CE events only, the system can and will continue to operate as no data
+has been damaged yet.
+However, preventive maintenance and proactive part replacement of memory
+DIMMs exhibiting CEs can reduce the likelihood of the dreaded UE events
+and system panics.
-NON-MEMORY
+OTHER HARDWARE ELEMENTS
+-----------------------
 A new feature for EDAC, the edac_device class of device, was added in
 the 2.6.23 version of the kernel.
@@ -56,70 +37,57 @@ This new device type allows for non-memory type of ECC hardware detectors
 to have their states harvested and presented to userspace via the sysfs
 interface.
-Some architectures have ECC detectors for L1, L2 and L3 caches, along with DMA
-engines, fabric switches, main data path switches, interconnections,
-and various other hardware data paths. If the hardware reports it, then
-a edac_device device probably can be constructed to harvest and present
-that to userspace.
+Some architectures have ECC detectors for L1, L2 and L3 caches,
+along with DMA engines, fabric switches, main data path switches,
+interconnections, and various other hardware data paths. If the hardware
+reports it, then a edac_device device probably can be constructed to
+harvest and present that to userspace.
 PCI BUS SCANNING
+----------------
-In addition, PCI Bus Parity and SERR Errors are scanned for on PCI devices
-in order to determine if errors are occurring on data transfers.
+In addition, PCI devices are scanned for PCI Bus Parity and SERR Errors
+in order to determine if errors are occurring during data transfers.
 The presence of PCI Parity errors must be examined with a grain of salt.
-There are several add-in adapters that do NOT follow the PCI specification
+There are several add-in adapters that do *not* follow the PCI specification
 with regards to Parity generation and reporting. The specification says
 the vendor should tie the parity status bits to 0 if they do not intend
 to generate parity. Some vendors do not do this, and thus the parity bit
 can "float" giving false positives.
-In the kernel there is a PCI device attribute located in sysfs that is
-checked by the EDAC PCI scanning code. If that attribute is set,
-PCI parity/error scanning is skipped for that device. The attribute
-is:
+There is a PCI device attribute located in sysfs that is checked by
+the EDAC PCI scanning code. If that attribute is set, PCI parity/error
+scanning is skipped for that device. The attribute is:
 broken_parity_status
-as is located in /sys/devices/pci<XXX>/0000:XX:YY.Z directories for
+and is located in /sys/devices/pci<XXX>/0000:XX:YY.Z directories for
 PCI devices.
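Reviewer note: the broken_parity_status paragraph above can be illustrated
with a small userspace sketch. It is not part of the patch. It assumes the
per-device directories are also reachable through /sys/bus/pci/devices and
that the attribute reads as "1" when set; the function name and the
pci_root parameter are hypothetical and exist only for this example.

```python
from pathlib import Path


def devices_with_broken_parity(pci_root="/sys/bus/pci/devices"):
    """Return sorted PCI device names whose broken_parity_status reads '1'.

    pci_root is an assumed location; pass another directory to test the
    function against a fake tree.
    """
    flagged = []
    root = Path(pci_root)
    if not root.is_dir():
        return flagged
    for dev in sorted(root.iterdir()):
        attr = dev / "broken_parity_status"
        try:
            value = attr.read_text().strip()
        except OSError:
            continue  # attribute missing or unreadable for this device
        if value == "1":
            flagged.append(dev.name)
    return flagged
```

Calling devices_with_broken_parity() with no argument lists the devices
for which EDAC would skip PCI parity/error scanning, under the assumptions
stated above.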
-FUTURE HARDWARE SCANNING
-EDAC will have future error detectors that will be integrated with
-EDAC or added to it, in the following list:
-MCE Machine Check Exception
-MCA Machine Check Architecture
-NMI NMI notification of ECC errors
-MSRs Machine Specific Register error cases
-and other mechanisms.
-These errors are usually bus errors, ECC errors, thermal throttling
-and the like.
-============================================================================
-EDAC VERSIONING
+VERSIONING
+----------
 EDAC is composed of a "core" module (edac_core.ko) and several Memory
-Controller (MC) driver modules. On a given system, the CORE
-is loaded and one MC driver will be loaded. Both the CORE and
-the MC driver (or edac_device driver) have individual versions that reflect
-current release level of their respective modules.
-Thus, to "report" on what version a system is running, one must report both
-the CORE's and the MC driver's versions.
+Controller (MC) driver modules. On a given system, the CORE is loaded
+and one MC driver will be loaded. Both the CORE and the MC driver (or
+edac_device driver) have individual versions that reflect current
+release level of their respective modules.
+Thus, to "report" on what version a system is running, one must report
+both the CORE's and the MC driver's versions.
 LOADING
+-------
-If 'edac' was statically linked with the kernel then no loading is
-necessary. If 'edac' was built as modules then simply modprobe the
-'edac' pieces that you need. You should be able to modprobe
-hardware-specific modules and have the dependencies load the necessary core
-modules.
+If 'edac' was statically linked with the kernel then no loading
+is necessary. If 'edac' was built as modules then simply modprobe
+the 'edac' pieces that you need. You should be able to modprobe
+hardware-specific modules and have the dependencies load the necessary
+core modules.
 Example:
@@ -129,35 +97,33 @@ loads both the amd76x_edac.ko memory controller module and the edac_mc.ko
 core module.
-============================================================================
-EDAC sysfs INTERFACE
-EDAC presents a 'sysfs' interface for control, reporting and attribute
-reporting purposes.
-EDAC lives in the /sys/devices/system/edac directory.
+SYSFS INTERFACE
+---------------
+EDAC presents a 'sysfs' interface for control and reporting purposes. It
+lives in the /sys/devices/system/edac directory.
-Within this directory there currently reside 2 'edac' components:
+Within this directory there currently reside 2 components:
 mc memory controller(s) system
 pci PCI control and status system
-============================================================================
 Memory Controller (mc) Model
+----------------------------
-First a background on the memory controller's model abstracted in EDAC.
-Each 'mc' device controls a set of DIMM memory modules. These modules are
-laid out in a Chip-Select Row (csrowX) and Channel table (chX). There can
-be multiple csrows and multiple channels.
+Each 'mc' device controls a set of DIMM memory modules. These modules
+are laid out in a Chip-Select Row (csrowX) and Channel table (chX).
+There can be multiple csrows and multiple channels.
-Memory controllers allow for several csrows, with 8 csrows being a typical value.
-Yet, the actual number of csrows depends on the electrical "loading"
-of a given motherboard, memory controller and DIMM characteristics.
+Memory controllers allow for several csrows, with 8 csrows being a
+typical value. Yet, the actual number of csrows depends on the layout of
+a given motherboard, memory controller and DIMM characteristics.
-Dual channels allows for 128 bit data transfers to the CPU from memory.
-Some newer chipsets allow for more than 2 channels, like Fully Buffered DIMMs
-(FB-DIMMs). The following example will assume 2 channels:
+Dual channels allows for 128 bit data transfers to/from the CPU from/to
+memory. Some newer chipsets allow for more than 2 channels, like Fully
+Buffered DIMMs (FB-DIMMs). The following example will assume 2 channels:
 Channel 0 Channel 1
@@ -179,12 +145,12 @@ for memory DIMMs:
 DIMM_A1
 DIMM_B1
-Labels for these slots are usually silk screened on the motherboard. Slots
-labeled 'A' are channel 0 in this example. Slots labeled 'B'
-are channel 1. Notice that there are two csrows possible on a
-physical DIMM. These csrows are allocated their csrow assignment
-based on the slot into which the memory DIMM is placed. Thus, when 1 DIMM
-is placed in each Channel, the csrows cross both DIMMs.
+Labels for these slots are usually silk-screened on the motherboard.
+Slots labeled 'A' are channel 0 in this example. Slots labeled 'B' are
+channel 1. Notice that there are two csrows possible on a physical DIMM.
+These csrows are allocated their csrow assignment based on the slot into
+which the memory DIMM is placed. Thus, when 1 DIMM is placed in each
+Channel, the csrows cross both DIMMs.
 Memory DIMMs come single or dual "ranked". A rank is a populated csrow.
 Thus, 2 single ranked DIMMs, placed in slots DIMM_A0 and DIMM_B0 above
@@ -193,8 +159,8 @@ when 2 dual ranked DIMMs are similarly placed, then both csrow0 and
 csrow1 will be populated. The pattern repeats itself for csrow2 and
 csrow3.
-The representation of the above is reflected in the directory tree
-in EDAC's sysfs interface. Starting in directory
+The representation of the above is reflected in the directory
+tree in EDAC's sysfs interface. Starting in directory
 /sys/devices/system/edac/mc each memory controller will be represented
 by its own 'mcX' directory, where 'X' is the index of the MC.
@@ -217,19 +183,19 @@ Under each 'mcX' directory each 'csrowX' is again represented by a
 |->csrow3
 ....
-Notice that there is no csrow1, which indicates that csrow0 is
-composed of a single ranked DIMMs. This should also apply in both
-Channels, in order to have dual-channel mode be operational. Since
-both csrow2 and csrow3 are populated, this indicates a dual ranked
-set of DIMMs for channels 0 and 1.
+Notice that there is no csrow1, which indicates that csrow0 is composed
+of a single ranked DIMMs. This should also apply in both Channels, in
+order to have dual-channel mode be operational. Since both csrow2 and
+csrow3 are populated, this indicates a dual ranked set of DIMMs for
+channels 0 and 1.
-Within each of the 'mcX' and 'csrowX' directories are several
-EDAC control and attribute files.
+Within each of the 'mcX' and 'csrowX' directories are several EDAC
+control and attribute files.
-============================================================================
-'mcX' DIRECTORIES
+'mcX' directories
+-----------------
 In 'mcX' directories are EDAC control and attribute files for
 this 'X' instance of the memory controllers.
@@ -238,13 +204,14 @@ For a description of the sysfs API, please see:
 Documentation/ABI/testing/sysfs-devices-edac
-============================================================================
-'csrowX' DIRECTORIES
-When CONFIG_EDAC_LEGACY_SYSFS is enabled, the sysfs will contain the
-csrowX directories. As this API doesn't work properly for Rambus, FB-DIMMs
-and modern Intel Memory Controllers, this is being deprecated in favor
-of dimmX directories.
+'csrowX' directories
+--------------------
+When CONFIG_EDAC_LEGACY_SYSFS is enabled, sysfs will contain the csrowX
+directories. As this API doesn't work properly for Rambus, FB-DIMMs and
+modern Intel Memory Controllers, this is being deprecated in favor of
+dimmX directories.
 In the 'csrowX' directories are EDAC control and attribute files for
 this 'X' instance of csrow:
@@ -265,11 +232,11 @@ Total Correctable Errors count attribute file:
 'ce_count'
 This attribute file displays the total count of correctable
-errors that have occurred on this csrow. This
-count is very important to examine. CEs provide early
-indications that a DIMM is beginning to fail. This count
-field should be monitored for non-zero values and report
-such information to the system administrator.
+errors that have occurred on this csrow. This count is very
+important to examine. CEs provide early indications that a
+DIMM is beginning to fail. This count field should be
+monitored for non-zero values and report such information
+to the system administrator.
 Total memory managed by this csrow attribute file:
@@ -377,11 +344,13 @@ Channel 1 DIMM Label control file:
 motherboard specific and determination of this information
 must occur in userland at this time.
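Reviewer note: the csrowX section recommends monitoring 'ce_count' for
non-zero values. A minimal userspace sketch of that check follows. It is
not part of the patch, and it assumes the legacy csrowX layout
(CONFIG_EDAC_LEGACY_SYSFS) under /sys/devices/system/edac/mc. The function
name and its parameter are hypothetical.

```python
from pathlib import Path


def nonzero_ce_counts(edac_mc_root="/sys/devices/system/edac/mc"):
    """Map 'mcX/csrowY' to its ce_count for every csrow with a count > 0.

    Assumes the legacy csrowX directories are present; returns an empty
    dict when the tree does not exist.
    """
    results = {}
    root = Path(edac_mc_root)
    if not root.is_dir():
        return results
    for counter in sorted(root.glob("mc*/csrow*/ce_count")):
        try:
            count = int(counter.read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable or non-numeric attribute, skip it
        if count > 0:
            csrow_dir = counter.parent
            results[f"{csrow_dir.parent.name}/{csrow_dir.name}"] = count
    return results
```

A monitoring job could call this periodically and notify the system
administrator whenever the returned dict is non-empty.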
-============================================================================
 SYSTEM LOGGING
+--------------
-If logging for UEs and CEs are enabled then system logs will have
-error notices indicating errors that have been detected:
+If logging for UEs and CEs is enabled, then system logs will contain
+information indicating that errors have been detected:
 EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0,
 channel 1 "DIMM_B1": amd76x_edac
@@ -404,24 +373,23 @@ The structure of the message is:
 and then an optional, driver-specific message that may
 have additional information.
-Both UEs and CEs with no info will lack all but memory controller,
-error type, a notice of "no info" and then an optional,
-driver-specific error message.
+Both UEs and CEs with no info will lack all but memory controller, error
+type, a notice of "no info" and then an optional, driver-specific error
+message.
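Reviewer note: the log message structure described in this section can be
picked apart in userspace. The sketch below is not part of the patch. It
only handles the field layout of the CE example line quoted in this
section; UE lines and "no info" lines may differ and simply return None
here. The regular expression and the function name are hypothetical.

```python
import re

# Pattern for the field layout of the sample line in this section only.
EDAC_LINE = re.compile(
    r'EDAC MC(?P<mc>\d+): (?P<type>CE|UE) '
    r'page (?P<page>0x[0-9a-fA-F]+), offset (?P<offset>0x[0-9a-fA-F]+), '
    r'grain (?P<grain>\d+), syndrome (?P<syndrome>0x[0-9a-fA-F]+), '
    r'row (?P<row>\d+), channel (?P<channel>\d+) '
    r'"(?P<label>[^"]*)": (?P<driver>\S+)'
)


def parse_edac_line(text):
    """Return the fields of an EDAC log message as a dict, or None."""
    # Collapse whitespace so a message wrapped over two lines still matches.
    match = EDAC_LINE.search(" ".join(text.split()))
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("mc", "grain", "row", "channel"):
        fields[key] = int(fields[key])
    for key in ("page", "offset", "syndrome"):
        fields[key] = int(fields[key], 16)
    return fields
```

Fed the sample message above, the function yields the memory controller
index, error type, page, offset, grain, syndrome, row, channel, DIMM label
and driver name as separate fields.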
-============================================================================
 PCI Bus Parity Detection
+------------------------
-On Header Type 00 devices the primary status is looked at
-for any parity error regardless of whether Parity is enabled on the
-device. (The spec indicates parity is generated in some cases).
-On Header Type 01 bridges, the secondary status register is also
-looked at to see if parity occurred on the bus on the other side of
-the bridge.
+On Header Type 00 devices, the primary status is looked at for any
+parity error regardless of whether parity is enabled on the device or
+not. (The spec indicates parity is generated in some cases). On Header
+Type 01 bridges, the secondary status register is also looked at to see
+if parity occurred on the bus on the other side of the bridge.
 SYSFS CONFIGURATION
+-------------------
 Under /sys/devices/system/edac/pci are control and attribute files as follows:
@@ -450,8 +418,9 @@ Parity Count:
 have been detected.
-============================================================================
 MODULE PARAMETERS
+-----------------
 Panic on UE control file:
@@ -530,10 +499,8 @@ Panic on PCI PARITY Error:
-=======================================================================
-EDAC_DEVICE type of device
+EDAC device type
+----------------
 In the header file, edac_core.h, there is a series of edac_device structures
 and APIs for the EDAC_DEVICE.
@@ -573,6 +540,7 @@ The test_device_edac device adds at least one of its own custom control:
 The symlink points to the 'struct dev' that is registered for this edac_device.
 INSTANCES
+---------
 One or more instance directories are present. For the 'test_device_edac' case:
@@ -586,6 +554,7 @@ counter in deeper subdirectories.
 ue_count total of UE events of subdirectories
 BLOCKS
+------
 At the lowest directory level is the 'block' directory. There can be 0, 1
 or more blocks specified in each instance.
@@ -623,8 +592,9 @@ unique drivers for their hardware systems.
 The 'test_device_edac' sample driver is located at the
 bluesmoke.sourceforge.net project site for EDAC.
-=======================================================================
 NEHALEM USAGE OF EDAC APIs
+--------------------------
 This chapter documents some EXPERIMENTAL mappings for EDAC API to handle
 Nehalem EDAC driver. They will likely be changed on future versions
@@ -773,3 +743,20 @@ exports one
 by the driver. Since, with udimm, this is counted by software, it is
 possible that some errors could be lost. With rdimm's, they display the
 contents of the registers
+CREDITS:
+========
+Written by Doug Thompson <dougthompson@xmission.com>
+7 Dec 2005
+17 Jul 2007 Updated
+(c) Mauro Carvalho Chehab
+05 Aug 2009 Nehalem interface
+EDAC authors/maintainers:
+Doug Thompson, Dave Jiang, Dave Peterson et al,
+Mauro Carvalho Chehab
+Borislav Petkov
+original author: Thayne Harbaugh