1. 02 Jul, 2020 35 commits
  2. 01 Jul, 2020 5 commits
    • drm/amd/powerplay: Fix NULL dereference in lock_bus() on Vega20 w/o RAS · 78083631
      Ivan Mironov authored
      I updated my system with a Radeon VII from kernel 5.6 to kernel 5.7, and
      the following started to happen on each boot:
      
      	...
      	BUG: kernel NULL pointer dereference, address: 0000000000000128
      	...
      	CPU: 9 PID: 1940 Comm: modprobe Tainted: G            E     5.7.2-200.im0.fc32.x86_64 #1
      	Hardware name: System manufacturer System Product Name/PRIME X570-P, BIOS 1407 04/02/2020
      	RIP: 0010:lock_bus+0x42/0x60 [amdgpu]
      	...
      	Call Trace:
      	 i2c_smbus_xfer+0x3d/0xf0
      	 i2c_default_probe+0xf3/0x130
      	 i2c_detect.isra.0+0xfe/0x2b0
      	 ? kfree+0xa3/0x200
      	 ? kobject_uevent_env+0x11f/0x6a0
      	 ? i2c_detect.isra.0+0x2b0/0x2b0
      	 __process_new_driver+0x1b/0x20
      	 bus_for_each_dev+0x64/0x90
      	 ? 0xffffffffc0f34000
      	 i2c_register_driver+0x73/0xc0
      	 do_one_initcall+0x46/0x200
      	 ? _cond_resched+0x16/0x40
      	 ? kmem_cache_alloc_trace+0x167/0x220
      	 ? do_init_module+0x23/0x260
      	 do_init_module+0x5c/0x260
      	 __do_sys_init_module+0x14f/0x170
      	 do_syscall_64+0x5b/0xf0
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      	...
      
      The error appears when an i2c device driver tries to probe for devices
      using the adapter registered by `smu_v11_0_i2c_eeprom_control_init()`.
      The code supporting this adapter requires `adev->psp.ras.ras` to be
      non-NULL, which is true only when `amdgpu_ras_init()` detects HW support
      by calling `amdgpu_ras_check_supported()`.
      
      Before 9015d60c, the adapter was registered by
      
      	-> amdgpu_device_ip_init()
      	  -> amdgpu_ras_recovery_init()
      	    -> amdgpu_ras_eeprom_init()
      	      -> smu_v11_0_i2c_eeprom_control_init()
      
      after verifying that `adev->psp.ras.ras` is not NULL in
      `amdgpu_ras_recovery_init()`. Currently it is registered
      unconditionally by
      
      	-> amdgpu_device_ip_init()
      	  -> pp_sw_init()
      	    -> hwmgr_sw_init()
      	      -> vega20_smu_init()
      	        -> smu_v11_0_i2c_eeprom_control_init()
      
      The fix simply adds a HW support check (ras == NULL => no support) before
      calling `smu_v11_0_i2c_eeprom_control_{init,fini}()`.
      
      Please note that there is a chance that a similar fix is also required for
      CHIP_ARCTURUS. I do not know whether any actual Arcturus hardware without
      RAS exists, or whether calling `smu_i2c_eeprom_init()` makes any sense
      when there is no HW support.
      
      Cc: stable@vger.kernel.org
      Fixes: 9015d60c ("drm/amdgpu: Move EEPROM I2C adapter to amdgpu_device")
      Signed-off-by: Ivan Mironov <mironov.ivan@gmail.com>
      Tested-by: Bjorn Nostvold <bjorn.nostvold@gmail.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
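The guard described in the commit message above can be illustrated with a small userspace sketch. It is not the actual patch: `struct device`, `struct ras_context`, `eeprom_i2c_init()`, `eeprom_i2c_fini()`, `smu_init()` and `smu_fini()` are hypothetical stand-ins for the driver's structures and for `smu_v11_0_i2c_eeprom_control_{init,fini}()`. The only idea taken from the message is that a NULL RAS pointer means "no HW support", so the adapter is neither registered nor torn down in that case.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the real amdgpu types. */
struct ras_context {
	int features;
};

struct device {
	struct ras_context *ras; /* NULL when the hardware has no RAS support */
	bool i2c_registered;     /* whether the EEPROM i2c adapter exists */
};

/* Stand-in for smu_v11_0_i2c_eeprom_control_init(). */
static int eeprom_i2c_init(struct device *dev)
{
	dev->i2c_registered = true;
	return 0;
}

/* Stand-in for smu_v11_0_i2c_eeprom_control_fini(). */
static void eeprom_i2c_fini(struct device *dev)
{
	dev->i2c_registered = false;
}

/* Register the adapter only when RAS support was detected. */
int smu_init(struct device *dev)
{
	if (dev->ras == NULL)
		return 0; /* no RAS: nothing to register, not an error */
	return eeprom_i2c_init(dev);
}

/* Mirror the same check on teardown so init and fini stay paired. */
void smu_fini(struct device *dev)
{
	if (dev->ras == NULL)
		return;
	eeprom_i2c_fini(dev);
}
```

With this shape, a device without RAS never exposes the adapter, so a later i2c probe has nothing to call into.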
    • drm/amdgpu: enable runtime pm on vega10 when noretry=0 · cd527780
      Alex Deucher authored
      The failures with ROCm only happen with noretry=1, so
      enable runtime pm when noretry=0 (the current default).
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Acked-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu: rework runtime pm enablement for BACO · b38c6968
      Alex Deucher authored
      Add a switch statement to simplify ASIC checks. Note
      that BACO is not supported on APUs, so there is no
      need to check them.
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
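A per-ASIC switch of the kind this commit describes might look roughly like the sketch below. Everything here is illustrative: the `enum chip` values, the `baco_runpm_enabled()` helper, the meaning assigned to `runtime_pm` (0 = off, positive = explicitly requested, negative = automatic) and the grouping of chips into cases are assumptions, not the driver's code. The Vega10 case follows the neighbouring cd527780 entry, which enables runtime pm only when noretry=0.

```c
#include <stdbool.h>

/* Hypothetical subset of ASIC identifiers, for illustration only. */
enum chip {
	CHIP_VEGA10,
	CHIP_VEGA20,
	CHIP_ARCTURUS,
	CHIP_OTHER
};

/*
 * Decide whether BACO-based runtime pm is enabled.
 * runtime_pm: 0 = disabled, > 0 = explicitly requested, < 0 = automatic
 * (assumed convention). noretry: 0 or 1.
 */
bool baco_runpm_enabled(enum chip chip, int runtime_pm, int noretry)
{
	if (runtime_pm == 0)
		return false;

	switch (chip) {
	case CHIP_VEGA20:
	case CHIP_ARCTURUS:
		/* assumed: these chips need an explicit opt-in */
		return runtime_pm > 0;
	case CHIP_VEGA10:
		/* per cd527780: enable only when noretry=0 */
		return noretry == 0;
	default:
		/* assumed: everything else is enabled automatically */
		return true;
	}
}
```

Grouping the special cases in one switch keeps each chip's policy on its own line, and the `default` branch covers the rest without listing every part.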
    • drm/amdgpu: call release_firmware() without a NULL check · 75e1658e
      Nirmoy Das authored
      The release_firmware() function is NULL-tolerant, so we do not need
      to check for a NULL parameter before calling it.
      Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
      Reviewed-by: Christian König <christian.koenig@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
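The cleanup above relies on the same convention as `free(NULL)`: the release function itself ignores a NULL argument, so callers can drop their own checks. A minimal userspace analogue is sketched below; `struct firmware_blob`, `fw_load()` and `fw_release()` are hypothetical names and do not reproduce the kernel's firmware API.

```c
#include <stdlib.h>

/* Hypothetical userspace analogue of a firmware image. */
struct firmware_blob {
	size_t size;
	unsigned char *data;
};

/* Allocate a zero-filled blob; returns NULL on allocation failure. */
struct firmware_blob *fw_load(size_t size)
{
	struct firmware_blob *fw = malloc(sizeof(*fw));

	if (fw == NULL)
		return NULL;
	fw->data = calloc(size ? size : 1, 1);
	if (fw->data == NULL) {
		free(fw);
		return NULL;
	}
	fw->size = size;
	return fw;
}

/* NULL-tolerant release: the check lives here, once, not at every call site. */
void fw_release(struct firmware_blob *fw)
{
	if (fw == NULL)
		return;
	free(fw->data);
	free(fw);
}
```

A caller can then write `fw_release(fw);` unconditionally on every error path, whether or not the load succeeded.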
    • drm/amdkfd: Fix circular locking dependency warning · d69fd951
      Mukul Joshi authored
      [  150.887733] ======================================================
      [  150.893903] WARNING: possible circular locking dependency detected
      [  150.905917] ------------------------------------------------------
      [  150.912129] kfdtest/4081 is trying to acquire lock:
      [  150.917002] ffff8f7f3762e118 (&mm->mmap_sem#2){++++}, at:
                                       __might_fault+0x3e/0x90
      [  150.924490]
                     but task is already holding lock:
      [  150.930320] ffff8f7f49d229e8 (&dqm->lock_hidden){+.+.}, at:
                                      destroy_queue_cpsch+0x29/0x210 [amdgpu]
      [  150.939432]
                     which lock already depends on the new lock.
      
      [  150.947603]
                     the existing dependency chain (in reverse order) is:
      [  150.955074]
                     -> #3 (&dqm->lock_hidden){+.+.}:
      [  150.960822]        __mutex_lock+0xa1/0x9f0
      [  150.964996]        evict_process_queues_cpsch+0x22/0x120 [amdgpu]
      [  150.971155]        kfd_process_evict_queues+0x3b/0xc0 [amdgpu]
      [  150.977054]        kgd2kfd_quiesce_mm+0x25/0x60 [amdgpu]
      [  150.982442]        amdgpu_amdkfd_evict_userptr+0x35/0x70 [amdgpu]
      [  150.988615]        amdgpu_mn_invalidate_hsa+0x41/0x60 [amdgpu]
      [  150.994448]        __mmu_notifier_invalidate_range_start+0xa4/0x240
      [  151.000714]        copy_page_range+0xd70/0xd80
      [  151.005159]        dup_mm+0x3ca/0x550
      [  151.008816]        copy_process+0x1bdc/0x1c70
      [  151.013183]        _do_fork+0x76/0x6c0
      [  151.016929]        __x64_sys_clone+0x8c/0xb0
      [  151.021201]        do_syscall_64+0x4a/0x1d0
      [  151.025404]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  151.030977]
                     -> #2 (&adev->notifier_lock){+.+.}:
      [  151.036993]        __mutex_lock+0xa1/0x9f0
      [  151.041168]        amdgpu_mn_invalidate_hsa+0x30/0x60 [amdgpu]
      [  151.047019]        __mmu_notifier_invalidate_range_start+0xa4/0x240
      [  151.053277]        copy_page_range+0xd70/0xd80
      [  151.057722]        dup_mm+0x3ca/0x550
      [  151.061388]        copy_process+0x1bdc/0x1c70
      [  151.065748]        _do_fork+0x76/0x6c0
      [  151.069499]        __x64_sys_clone+0x8c/0xb0
      [  151.073765]        do_syscall_64+0x4a/0x1d0
      [  151.077952]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  151.083523]
                     -> #1 (mmu_notifier_invalidate_range_start){+.+.}:
      [  151.090833]        change_protection+0x802/0xab0
      [  151.095448]        mprotect_fixup+0x187/0x2d0
      [  151.099801]        setup_arg_pages+0x124/0x250
      [  151.104251]        load_elf_binary+0x3a4/0x1464
      [  151.108781]        search_binary_handler+0x6c/0x210
      [  151.113656]        __do_execve_file.isra.40+0x7f7/0xa50
      [  151.118875]        do_execve+0x21/0x30
      [  151.122632]        call_usermodehelper_exec_async+0x17e/0x190
      [  151.128393]        ret_from_fork+0x24/0x30
      [  151.132489]
                     -> #0 (&mm->mmap_sem#2){++++}:
      [  151.138064]        __lock_acquire+0x11a1/0x1490
      [  151.142597]        lock_acquire+0x90/0x180
      [  151.146694]        __might_fault+0x68/0x90
      [  151.150879]        read_sdma_queue_counter+0x5f/0xb0 [amdgpu]
      [  151.156693]        update_sdma_queue_past_activity_stats+0x3b/0x90 [amdgpu]
      [  151.163725]        destroy_queue_cpsch+0x1ae/0x210 [amdgpu]
      [  151.169373]        pqm_destroy_queue+0xf0/0x250 [amdgpu]
      [  151.174762]        kfd_ioctl_destroy_queue+0x32/0x70 [amdgpu]
      [  151.180577]        kfd_ioctl+0x223/0x400 [amdgpu]
      [  151.185284]        ksys_ioctl+0x8f/0xb0
      [  151.189118]        __x64_sys_ioctl+0x16/0x20
      [  151.193389]        do_syscall_64+0x4a/0x1d0
      [  151.197569]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  151.203141]
                     other info that might help us debug this:
      
      [  151.211140] Chain exists of:
                       &mm->mmap_sem#2 --> &adev->notifier_lock --> &dqm->lock_hidden
      
      [  151.222535]  Possible unsafe locking scenario:
      
      [  151.228447]        CPU0                    CPU1
      [  151.232971]        ----                    ----
      [  151.237502]   lock(&dqm->lock_hidden);
      [  151.241254]                                lock(&adev->notifier_lock);
      [  151.247774]                                lock(&dqm->lock_hidden);
      [  151.254038]   lock(&mm->mmap_sem#2);
      
      This commit fixes the warning by ensuring get_user() is not called
      while reading SDMA stats with dqm_lock held, as get_user() could cause a
      page fault, which leads to the circular locking scenario.
      Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
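The general pattern behind this kind of fix is to perform the access that may fault only while the driver lock is not held. The sketch below shows that pattern in userspace with a pthread mutex; it is not the actual patch. `struct queue`, `dqm_lock`, `read_user_counter()` and `destroy_queue()` are hypothetical stand-ins, and `read_user_counter()` merely plays the role of `get_user()`.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical queue state; not the real amdkfd structures. */
struct queue {
	const uint64_t *user_counter; /* counter living in "user" memory */
	uint64_t past_activity;       /* accumulated stats, lock-protected */
	bool active;
};

static pthread_mutex_t dqm_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for get_user(): in the kernel this access may page-fault. */
static uint64_t read_user_counter(const uint64_t *p)
{
	return *p;
}

/* Tear down a queue and fold its counter into the accumulated stats. */
uint64_t destroy_queue(struct queue *q)
{
	const uint64_t *counter;
	uint64_t val;

	/* Step 1: update state and capture the pointer under the lock. */
	pthread_mutex_lock(&dqm_lock);
	q->active = false;
	counter = q->user_counter;
	pthread_mutex_unlock(&dqm_lock);

	/* Step 2: do the possibly faulting read with no lock held. */
	val = read_user_counter(counter);

	/* Step 3: retake the lock only to publish the result. */
	pthread_mutex_lock(&dqm_lock);
	q->past_activity += val;
	pthread_mutex_unlock(&dqm_lock);

	return val;
}
```

Because the read in step 2 happens outside the lock, a fault there can no longer create a dependency from the queue lock to the memory-management lock, which is the edge lockdep flagged in the log above.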