• Wei Li's avatar
    tracing/timerlat: Fix a race during cpuhp processing · 829e0c9f
    Wei Li authored
    There is another found exception that the "timerlat/1" thread was
    scheduled on CPU0, and lead to timer corruption finally:
    
    ```
    ODEBUG: init active (active state 0) object: ffff888237c2e108 object type: hrtimer hint: timerlat_irq+0x0/0x220
    WARNING: CPU: 0 PID: 426 at lib/debugobjects.c:518 debug_print_object+0x7d/0xb0
    Modules linked in:
    CPU: 0 UID: 0 PID: 426 Comm: timerlat/1 Not tainted 6.11.0-rc7+ #45
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
    RIP: 0010:debug_print_object+0x7d/0xb0
    ...
    Call Trace:
     <TASK>
     ? __warn+0x7c/0x110
     ? debug_print_object+0x7d/0xb0
     ? report_bug+0xf1/0x1d0
     ? prb_read_valid+0x17/0x20
     ? handle_bug+0x3f/0x70
     ? exc_invalid_op+0x13/0x60
     ? asm_exc_invalid_op+0x16/0x20
     ? debug_print_object+0x7d/0xb0
     ? debug_print_object+0x7d/0xb0
     ? __pfx_timerlat_irq+0x10/0x10
     __debug_object_init+0x110/0x150
     hrtimer_init+0x1d/0x60
     timerlat_main+0xab/0x2d0
     ? __pfx_timerlat_main+0x10/0x10
     kthread+0xb7/0xe0
     ? __pfx_kthread+0x10/0x10
     ret_from_fork+0x2d/0x40
     ? __pfx_kthread+0x10/0x10
     ret_from_fork_asm+0x1a/0x30
     </TASK>
    ```
    
    After tracing the scheduling event, it was discovered that the migration
    of the "timerlat/1" thread was performed during thread creation. Further
    analysis confirmed that it is because the CPU online processing for
    osnoise is implemented through workers, which is asynchronous with the
    offline processing. When the worker was scheduled to create a thread, the
    CPU may has already been removed from the cpu_online_mask during the offline
    process, resulting in the inability to select the right CPU:
    
    T1                       | T2
    [CPUHP_ONLINE]           | cpu_device_down()
    osnoise_hotplug_workfn() |
                             |     cpus_write_lock()
                             |     takedown_cpu(1)
                             |     cpus_write_unlock()
    [CPUHP_OFFLINE]          |
        cpus_read_lock()     |
        start_kthread(1)     |
        cpus_read_unlock()   |
    
    To fix this, skip online processing if the CPU is already offline.
    
    Cc: stable@vger.kernel.org
    Cc: Masami Hiramatsu <mhiramat@kernel.org>
    Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Link: https://lore.kernel.org/20240924094515.3561410-4-liwei391@huawei.com
    Fixes: c8895e27 ("trace/osnoise: Support hotplug operations")
    Signed-off-by: default avatarWei Li <liwei391@huawei.com>
    Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
    829e0c9f
trace_osnoise.c 76.2 KB