    EDAC/ghes: Fix locking and memory barrier issues · 23f61b9f
    Robert Richter authored
    The ghes registration and refcount handling are broken in several ways:
    
     * ghes_edac_register() returns success for a second instance even
       if the first instance's registration is still running. This is
       not correct, as the first instance may fail later. A subsequent
       registration must not finish before the first. Parallel
       registrations must be avoided.
    
     * The refcount was increased even if a registration failed. This
       leads to stale counters preventing the device from being released.
    
     * The ghes refcount may not be decremented properly on unregistration.
       Always decrement the refcount once ghes_edac_unregister() is called to
       keep the refcount sane.
    
     * The ghes_pvt pointer is handed to the irq handler before registration
       finished.
    
     * The mci structure could be freed while the irq handler is running.
    
    Fix this by adding a mutex to ghes_edac_register(). This mutex
    serializes the registration and unregistration of instances. The
    refcount is only increased if the registration succeeded. This makes
    sure the refcount is in a consistent state after registering or
    unregistering a device.
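
    The serialization described above can be sketched in userspace C. This
    is only a model under stated assumptions: pthread_mutex_t stands in for
    the kernel mutex, a plain counter stands in for the refcount API, and
    all names (reg_mutex, do_setup(), model_register(), ...) are
    hypothetical and are not the driver's identifiers.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace model; names are illustrative, not the driver's. */
static pthread_mutex_t reg_mutex = PTHREAD_MUTEX_INITIALIZER;
static unsigned int refcount;	/* successfully registered instances */

/* Stand-in for the real setup work, which may sleep and may fail. */
static int do_setup(bool should_fail)
{
	return should_fail ? -1 : 0;
}

/* Register one instance; the whole operation runs under the mutex. */
int model_register(bool should_fail)
{
	int rc = 0;

	pthread_mutex_lock(&reg_mutex);

	if (refcount > 0) {
		/* An earlier instance finished setup: only take a reference. */
		refcount++;
		goto unlock;
	}

	rc = do_setup(should_fail);
	if (rc == 0)
		refcount = 1;	/* count only a successful registration */

unlock:
	pthread_mutex_unlock(&reg_mutex);
	return rc;
}

/* Drop one reference; returns true if this was the last one. */
bool model_unregister(void)
{
	bool last = false;

	pthread_mutex_lock(&reg_mutex);
	if (refcount > 0 && --refcount == 0)
		last = true;	/* a real driver would tear down the device here */
	pthread_mutex_unlock(&reg_mutex);

	return last;
}

/* Read the current count under the same mutex. */
unsigned int model_refcount(void)
{
	unsigned int n;

	pthread_mutex_lock(&reg_mutex);
	n = refcount;
	pthread_mutex_unlock(&reg_mutex);

	return n;
}
```

    Because a second caller blocks on the mutex until the first caller has
    either succeeded or failed, it can never observe a half-finished
    registration, and a failed setup leaves the counter at zero.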
    
    Note: A spinlock cannot be used here as the code section may sleep.
    
    The ghes_pvt pointer is now protected by ghes_lock. This ensures the
    pointer is not updated before registration has finished or while the
    irq handler is running. It is unset before the device is unregistered,
    with the necessary (implicit) memory barriers making the change
    visible to other CPUs. Thus, the device cannot be used by an interrupt
    anymore.
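
    The pointer handling can be modeled the same way. In this minimal
    sketch a pthread mutex stands in for ghes_lock (an assumption made so
    the example runs in userspace), and struct model_pvt and the model_*()
    functions are hypothetical names, not the driver's code. The lock and
    unlock operations supply the ordering that the text refers to as
    implicit memory barriers.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical private data; stands in for the driver's per-device state. */
struct model_pvt {
	int reports;
};

static pthread_mutex_t pvt_lock = PTHREAD_MUTEX_INITIALIZER;
static struct model_pvt *pvt;	/* NULL until registration has finished */

/* Publish the pointer only after setup is complete. */
int model_publish(void)
{
	struct model_pvt *p = calloc(1, sizeof(*p));

	if (!p)
		return -1;

	/* ... all remaining setup on p would happen here, before publishing ... */

	pthread_mutex_lock(&pvt_lock);
	pvt = p;
	pthread_mutex_unlock(&pvt_lock);

	return 0;
}

/* Models the handler: it only touches the data while holding the lock. */
int model_handle_event(void)
{
	int handled = 0;

	pthread_mutex_lock(&pvt_lock);
	if (pvt) {
		pvt->reports++;
		handled = 1;
	}
	pthread_mutex_unlock(&pvt_lock);

	return handled;
}

/* Unset the pointer first, then free once no handler can still see it. */
void model_retract(void)
{
	struct model_pvt *p;

	pthread_mutex_lock(&pvt_lock);
	p = pvt;
	pvt = NULL;
	pthread_mutex_unlock(&pvt_lock);

	free(p);	/* free(NULL) is a no-op */
}
```

    A handler that runs after model_retract() sees NULL and returns without
    touching freed memory; one that runs concurrently holds the lock, so
    the pointer cannot be cleared and freed underneath it.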
    
    Also, rename ghes_init to ghes_refcount for better readability and
    switch to refcount API.
    
    A refcount is needed because there can be multiple GHES structures
    defined (see ACPI 6.3 specification, 18.3.2.7 Generic Hardware Error
    Source, "Some platforms may describe multiple Generic Hardware Error
    Source structures with different notification types, ...").
    
    Another approach, using the mci's device refcount (get_device()) with
    a release function, does not work here. A release function will be
    called only for device_release() with the last put_device() call. The
    device must be deleted *before* that with device_del(). This is only
    possible by maintaining a separate refcount.
    
     [ bp: touchups. ]
    
    Fixes: 0fe5f281 ("EDAC, ghes: Model a single, logical memory controller")
    Fixes: 1e72e673 ("EDAC/ghes: Fix Use after free in ghes_edac remove path")
    Co-developed-by: James Morse <james.morse@arm.com>
    Signed-off-by: James Morse <james.morse@arm.com>
    Co-developed-by: Borislav Petkov <bp@suse.de>
    Signed-off-by: Borislav Petkov <bp@suse.de>
    Signed-off-by: Robert Richter <rrichter@marvell.com>
    Signed-off-by: Borislav Petkov <bp@suse.de>
    Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
    Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
    Cc: Tony Luck <tony.luck@intel.com>
    Link: https://lkml.kernel.org/r/20191105200732.3053-1-rrichter@marvell.com
ghes_edac.c 15 KB