• Felix Kuehling's avatar
    drm/amdkfd: Fix lock dependency warning · 47bf0f83
    Felix Kuehling authored
    ======================================================
    WARNING: possible circular locking dependency detected
    6.5.0-kfd-fkuehlin #276 Not tainted
    ------------------------------------------------------
    kworker/8:2/2676 is trying to acquire lock:
    ffff9435aae95c88 ((work_completion)(&svm_bo->eviction_work)){+.+.}-{0:0}, at: __flush_work+0x52/0x550
    
    but task is already holding lock:
    ffff9435cd8e1720 (&svms->lock){+.+.}-{3:3}, at: svm_range_deferred_list_work+0xe8/0x340 [amdgpu]
    
    which lock already depends on the new lock.
    
    the existing dependency chain (in reverse order) is:
    
    -> #2 (&svms->lock){+.+.}-{3:3}:
           __mutex_lock+0x97/0xd30
           kfd_ioctl_alloc_memory_of_gpu+0x6d/0x3c0 [amdgpu]
           kfd_ioctl+0x1b2/0x5d0 [amdgpu]
           __x64_sys_ioctl+0x86/0xc0
           do_syscall_64+0x39/0x80
           entry_SYSCALL_64_after_hwframe+0x63/0xcd
    
    -> #1 (&mm->mmap_lock){++++}-{3:3}:
           down_read+0x42/0x160
           svm_range_evict_svm_bo_worker+0x8b/0x340 [amdgpu]
           process_one_work+0x27a/0x540
           worker_thread+0x53/0x3e0
           kthread+0xeb/0x120
           ret_from_fork+0x31/0x50
           ret_from_fork_asm+0x11/0x20
    
    -> #0 ((work_completion)(&svm_bo->eviction_work)){+.+.}-{0:0}:
           __lock_acquire+0x1426/0x2200
           lock_acquire+0xc1/0x2b0
           __flush_work+0x80/0x550
           __cancel_work_timer+0x109/0x190
           svm_range_bo_release+0xdc/0x1c0 [amdgpu]
           svm_range_free+0x175/0x180 [amdgpu]
           svm_range_deferred_list_work+0x15d/0x340 [amdgpu]
           process_one_work+0x27a/0x540
           worker_thread+0x53/0x3e0
           kthread+0xeb/0x120
           ret_from_fork+0x31/0x50
           ret_from_fork_asm+0x11/0x20
    
    other info that might help us debug this:
    
    Chain exists of:
      (work_completion)(&svm_bo->eviction_work) --> &mm->mmap_lock --> &svms->lock
    
     Possible unsafe locking scenario:
    
           CPU0                    CPU1
           ----                    ----
      lock(&svms->lock);
                                   lock(&mm->mmap_lock);
                                   lock(&svms->lock);
      lock((work_completion)(&svm_bo->eviction_work));
    
    I believe this cannot really lead to a deadlock in practice, because
    svm_range_evict_svm_bo_worker only takes the mmap_read_lock if the BO
    refcount is non-0. That means it's impossible that svm_range_bo_release
    is running concurrently. However, there is no good way to annotate this.
    
    To avoid the problem, take a BO reference in
    svm_range_schedule_evict_svm_bo instead of in the worker. That way it's
    impossible for a BO to get freed while eviction work is pending and the
    cancel_work_sync call in svm_range_bo_release can be eliminated.
    
    v2: Use svm_bo_ref_unless_zero and explained why that's safe. Also
    removed redundant checks that are already done in
    amdkfd_fence_enable_signaling.
    Signed-off-by: default avatarFelix Kuehling <felix.kuehling@amd.com>
    Reviewed-by: default avatarPhilip Yang <philip.yang@amd.com>
    Signed-off-by: default avatarAlex Deucher <alexander.deucher@amd.com>
    47bf0f83
kfd_svm.c 114 KB