    drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic · 6e4aa08f
    Yunxiang Li authored
    
    
    The retry loop for SRIOV reset has refcount and memory leak issues.
    Depending on which function call fails, it can call
    amdgpu_amdkfd_pre/post_reset a different number of times, which leaves
    the kfd_locked count wrong. This blocks all future attempts at
    opening /dev/kfd. The retry loop also leaks resources by calling
    amdgpu_virt_init_data_exchange multiple times without calling the
    corresponding fini function.
    
    Align with the bare-metal reset path, which doesn't have these issues.
    This means taking the amdgpu_amdkfd_pre/post_reset functions out of the
    reset loop and calling amdgpu_device_pre_asic_reset on each retry, which
    properly frees the resources from the previous try by calling
    amdgpu_virt_fini_data_exchange.
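
    The loop shape described above can be modeled roughly as follows. This
    is a toy user-space sketch, not the kernel code: every function and
    counter below is a hypothetical stand-in named after the role it plays,
    and the assumption that the post-reset call runs once even when all
    retries fail is made only to keep the model balanced.

```c
/* Toy model of a balanced reset-retry loop; NOT the amdgpu code. */

static int kfd_locked;      /* stand-in for the KFD lock count */
static int exchange_live;   /* 1 while data-exchange resources are held */
static int exchange_leaked; /* init calls made without a prior fini */

/* Hypothetical stand-ins for the pre/post reset pair. */
static void kfd_pre_reset(void)  { kfd_locked++; }
static void kfd_post_reset(void) { kfd_locked--; }

/* Hypothetical stand-ins for the data-exchange init/fini pair. */
static void fini_data_exchange(void) { exchange_live = 0; }

static void init_data_exchange(void)
{
    if (exchange_live)
        exchange_leaked++; /* the previous allocation was never freed */
    exchange_live = 1;
}

/* Per-attempt cleanup: releases whatever the previous try left behind. */
static void pre_asic_reset(void) { fini_data_exchange(); }

/* One reset attempt; fails while attempt < fail_until (test knob). */
static int try_reset_once(int attempt, int fail_until)
{
    init_data_exchange();
    return attempt < fail_until ? -1 : 0;
}

/* Pre/post run exactly once; cleanup runs at the top of every try. */
static int reset_with_retry(int fail_until, int max_retry)
{
    int r = -1;

    kfd_pre_reset();
    for (int attempt = 0; attempt <= max_retry; attempt++) {
        pre_asic_reset();
        r = try_reset_once(attempt, fail_until);
        if (r == 0)
            break;
    }
    kfd_post_reset();

    return r;
}
```

    In this model, reset_with_retry(2, 5) fails twice, succeeds on the
    third try and returns 0, with kfd_locked and exchange_leaked both at 0
    afterwards. Moving the pre/post pair inside the loop, or dropping
    pre_asic_reset, would unbalance one of the two counters.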
    
    Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
    Reviewed-by: Emily Deng <Emily.Deng@amd.com>
    Reviewed-by: Zhigang Luo <zhigang.luo@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
amdgpu_device.c 178 KB