-
Gavin Shan authored
The EEH core is talking with the PCI device driver to determine the action (purely reset, or PCI device removal). During the period, the driver might be unloaded and in turn causes kernel crash as follows: EEH: Detected PCI bus error on PHB#4-PE#10000 EEH: This PCI device has failed 3 times in the last hour lpfc 0004:01:00.0: 0:2710 PCI channel disable preparing for reset Unable to handle kernel paging request for data at address 0x00000490 Faulting instruction address: 0xd00000000e682c90 cpu 0x1: Vector: 300 (Data Access) at [c000000fc75ffa20] pc: d00000000e682c90: .lpfc_io_error_detected+0x30/0x240 [lpfc] lr: d00000000e682c8c: .lpfc_io_error_detected+0x2c/0x240 [lpfc] sp: c000000fc75ffca0 msr: 8000000000009032 dar: 490 dsisr: 40000000 current = 0xc000000fc79b88b0 paca = 0xc00000000edb0380 softe: 0 irq_happened: 0x00 pid = 3386, comm = eehd enter ? for help [c000000fc75ffca0] c000000fc75ffd30 (unreliable) [c000000fc75ffd30] c00000000004fd3c .eeh_report_error+0x7c/0xf0 [c000000fc75ffdc0] c00000000004ee00 .eeh_pe_dev_traverse+0xa0/0x180 [c000000fc75ffe70] c00000000004ffd8 .eeh_handle_event+0x68/0x300 [c000000fc75fff00] c0000000000503a0 .eeh_event_handler+0x130/0x1a0 [c000000fc75fff90] c000000000020138 .kernel_thread+0x54/0x70 1:mon> The patch increases the reference of the corresponding driver modules while EEH core does the negotiation with PCI device driver so that the corresponding driver modules can't be unloaded during the period and we're safe to refer the callbacks. Cc: stable@vger.kernel.org Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
feadf7c0