    iommu/vt-d: Flush old iommu caches for kdump when the device gets context mapped · aec0e861
    Xunlei Pang authored
    We hit a DMAR fault on both hpsa P420i and P421 SmartArray controllers
    under kdump. It reproduces reliably on several different machines, and
    the dmesg log looks like this:
    HP HPSA Driver (v 3.4.16-0)
    hpsa 0000:02:00.0: using doorbell to reset controller
    hpsa 0000:02:00.0: board ready after hard reset.
    hpsa 0000:02:00.0: Waiting for controller to respond to no-op
    DMAR: Setting identity map for device 0000:02:00.0 [0xe8000 - 0xe8fff]
    DMAR: Setting identity map for device 0000:02:00.0 [0xf4000 - 0xf4fff]
    DMAR: Setting identity map for device 0000:02:00.0 [0xbdf6e000 - 0xbdf6efff]
    DMAR: Setting identity map for device 0000:02:00.0 [0xbdf6f000 - 0xbdf7efff]
    DMAR: Setting identity map for device 0000:02:00.0 [0xbdf7f000 - 0xbdf82fff]
    DMAR: Setting identity map for device 0000:02:00.0 [0xbdf83000 - 0xbdf84fff]
    DMAR: DRHD: handling fault status reg 2
    DMAR: [DMA Read] Request device [02:00.0] fault addr fffff000 [fault reason 06] PTE Read access is not set
    hpsa 0000:02:00.0: controller message 03:00 timed out
    hpsa 0000:02:00.0: no-op failed; re-trying
    
    After some debugging, we found that the faulting address comes from DMA
    initiated at the driver probe stage after the reset (not from in-flight
    DMA), and that the corresponding PTE value is correct. The fault is
    therefore likely caused by stale iommu cache entries left over from the
    in-flight DMA that preceded it.
    
    Thus we need to flush the old caches once context mapping is set up for
    the device. By then the device is supposed to have finished its reset at
    the driver probe stage, and no in-flight DMA exists from that point on.
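
The idea above can be sketched as a small user-space model. This is not the
kernel code: every name here (model_iommu, model_context_entry,
model_context_map, MODEL_NDOMS) is hypothetical, the flushes are only counted
rather than issued to hardware, and the ordering of the flush relative to
writing the new entry is an assumption of the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified model; none of these names are the kernel's. */
#define MODEL_NDOMS 256u            /* assumed number of supported domain-ids */

struct model_context_entry {
    bool     copied;                /* entry was inherited from the old kernel */
    uint16_t did;                   /* domain-id currently stored in the entry */
};

struct model_iommu {
    unsigned context_flushes;       /* device-selective context-cache flushes */
    unsigned iotlb_flushes;         /* domain-selective IOTLB flushes */
    uint16_t last_flushed_did;      /* domain-id used by the last flush */
    uint16_t last_source_id;        /* bus/devfn used by the last flush */
};

/* Build the 16-bit source-id: bus number in the high byte, devfn in the low. */
static uint16_t model_source_id(uint8_t bus, uint8_t devfn)
{
    return (uint16_t)(((uint16_t)bus << 8) | devfn);
}

/*
 * Map a device's context entry to a new domain-id. If the entry was copied
 * from the old kernel, first flush the caches tagged with the old domain-id,
 * since they may still hold translations for pre-crash in-flight DMA.
 * Returns true when such a flush was performed.
 */
static bool model_context_map(struct model_iommu *iommu,
                              struct model_context_entry *ctx,
                              uint8_t bus, uint8_t devfn, uint16_t new_did)
{
    bool flushed = false;

    if (ctx->copied) {
        uint16_t did_old = ctx->did;

        /* Only flush for a domain-id the (modelled) hardware supports. */
        if (did_old < MODEL_NDOMS) {
            iommu->context_flushes++;
            iommu->iotlb_flushes++;
            iommu->last_flushed_did = did_old;
            iommu->last_source_id = model_source_id(bus, devfn);
            flushed = true;
        }
    }

    /* Install the new mapping; the entry is no longer a copied one. */
    ctx->copied = false;
    ctx->did = new_did;
    return flushed;
}
```

Mapping a copied entry for device 02:00.0 flushes once with the old domain-id.
Mapping it again does not flush, which matches the expectation that this path
is infrequent.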
    
    I'm not sure whether the hardware is responsible for invalidating all the
    related caches previously allocated in the iommu hardware, but that does
    not seem to be the case for hpsa. In fact, many device drivers have
    problems resetting their hardware properly. In any case, flushing (again)
    in software in the kdump kernel when the device gets context mapped, which
    is a fairly infrequent operation, does little harm.
    
    With this patch, the problematic machine can survive the kdump tests.
    
    CC: Myron Stowe <myron.stowe@gmail.com>
    CC: Joseph Szczypek <jszczype@redhat.com>
    CC: Don Brace <don.brace@microsemi.com>
    CC: Baoquan He <bhe@redhat.com>
    CC: Dave Young <dyoung@redhat.com>
    Fixes: 091d42e4 ("iommu/vt-d: Copy translation tables from old kernel")
    Fixes: dbcd861f ("iommu/vt-d: Do not re-use domain-ids from the old kernel")
    Fixes: cf484d0e ("iommu/vt-d: Mark copied context entries")
    Signed-off-by: Xunlei Pang <xlpang@redhat.com>
    Tested-by: Don Brace <don.brace@microsemi.com>
    Signed-off-by: Joerg Roedel <jroedel@suse.de>
intel-iommu.c 136 KB