    perf/core: Fix perf_cgroup_switch() · 96492a6c
    Chengming Zhou authored
    There is a race problem that can trigger WARN_ON_ONCE(cpuctx->cgrp)
    in perf_cgroup_switch().
    
    CPU1						CPU2
    perf_cgroup_sched_out(prev, next)
      cgrp1 = perf_cgroup_from_task(prev)
      cgrp2 = perf_cgroup_from_task(next)
      if (cgrp1 != cgrp2)
        perf_cgroup_switch(prev, PERF_CGROUP_SWOUT)
    						cgroup_migrate_execute()
    						  task->cgroups = ?
    						  perf_cgroup_attach()
    						    task_function_call(task, __perf_cgroup_move)
    perf_cgroup_sched_in(prev, next)
      cgrp1 = perf_cgroup_from_task(prev)
      cgrp2 = perf_cgroup_from_task(next)
      if (cgrp1 != cgrp2)
        perf_cgroup_switch(next, PERF_CGROUP_SWIN)
    						__perf_cgroup_move()
    						  perf_cgroup_switch(task, PERF_CGROUP_SWOUT | PERF_CGROUP_SWIN)
    
    Commit a8d757ef ("perf events: Fix slow and broken cgroup
    context switch code") intended to skip perf_cgroup_switch() when the
    perf_cgroup of "prev" and "next" are the same.
    
    But task->cgroups can change concurrently with context_switch(),
    in cgroup_migrate_execute(). If cgrp1 == cgrp2 in sched_out(),
    cpuctx won't be scheduled out. If task->cgroups then changes so that
    cgrp1 != cgrp2 in sched_in(), cpuctx will be scheduled in, which
    triggers WARN_ON_ONCE(cpuctx->cgrp).
    
    Even though __perf_cgroup_move() is serialized against the context
    switch by the disabled interrupts, context_switch() can still observe
    task->cgroups changing in the middle, since task->cgroups is changed
    before the IPI is sent.
    
    So we have to combine perf_cgroup_sched_in() with perf_cgroup_sched_out()
    into a unified perf_cgroup_switch(), to fix the inconsistency between
    perf_cgroup_sched_out() and perf_cgroup_sched_in().
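
    Illustratively (the wrapper name below is a placeholder, not the
    actual call site in the patch), the two per-direction calls collapse
    into one call for the incoming task:

    /*
     * Illustrative sketch only: instead of a SWOUT call for "prev"
     * followed by a SWIN call for "next", the context-switch path
     * makes a single combined call for the incoming task.
     */
    static inline void perf_cgroup_sched_switch(struct task_struct *prev,
                                                struct task_struct *next)
    {
            /* prev is intentionally not consulted; see below */
            perf_cgroup_switch(next);
    }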
    
    But we can't just compare prev->cgroups with next->cgroups to decide
    whether to skip the cpuctx sched_out/in, since prev->cgroups may be
    changing too. For example:
    
    CPU1					CPU2
    					cgroup_migrate_execute()
    					  prev->cgroups = ?
    					  perf_cgroup_attach()
    					    task_function_call(task, __perf_cgroup_move)
    perf_cgroup_switch(task)
      cgrp1 = perf_cgroup_from_task(prev)
      cgrp2 = perf_cgroup_from_task(next)
      if (cgrp1 != cgrp2)
        cpuctx sched_out/in ...
    					task_function_call() will return -ESRCH
    
    In the above example, prev->cgroups changing causes (cgrp1 == cgrp2)
    to be true, so the cpuctx sched_out/in is skipped. Later,
    task_function_call() returns -ESRCH since the prev task is no longer
    running on the CPU. So the perf_events of the old prev->cgroups are
    left scheduled on the CPU, which is wrong.
    
    The solution is to compare cpuctx->cgrp with the next task's
    perf_cgroup instead. Since cpuctx->cgrp can only be changed on the
    local CPU, and interrupts are disabled, we can read cpuctx->cgrp for
    the comparison without holding the ctx lock.
    
    Fixes: a8d757ef ("perf events: Fix slow and broken cgroup context switch code")
    Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20220329154523.86438-4-zhouchengming@bytedance.com