    x86/resctl: fix scheduler confusion with 'current' · 7fef0997
    Linus Torvalds authored
    The implementation of 'current' on x86 is very intentionally special: it
    is a very common thing to look up, and it uses 'this_cpu_read_stable()'
    to get the current thread pointer efficiently from per-cpu storage.
    
    And the keyword in there is 'stable': the current thread pointer never
    changes as far as a single thread is concerned.  Even when a thread is
    preempted, or moved to another CPU, or even across an explicit call to
    'schedule()', that thread will still have the same value for 'current'.
    
    It is, after all, the kernel base pointer to thread-local storage.
    That's why it's stable to begin with, but it's also why it's important
    enough that we have that special 'this_cpu_read_stable()' access for it.
    
    So this is all done very intentionally to allow the compiler to treat
    'current' as a value that never visibly changes, so that the compiler
    can do CSE and combine multiple different 'current' accesses into one.
    
    However, there is obviously one very special situation when the
    currently running thread does actually change: inside the scheduler
    itself.
    
    So the scheduler code paths are special, and do not have a 'current'
    thread at all.  Instead there are _two_ threads: the previous and the
    next thread - typically called 'prev' and 'next' (or prev_p/next_p)
    internally.
    
    So this is all actually quite straightforward and simple, and not all
    that complicated.
    
    Except for when you then have special code that is run in scheduler
    context, that code then has to be aware that 'current' isn't really a
    valid thing.  Did you mean 'prev'? Did you mean 'next'?
    
    In fact, even if you then look at the code, and you use 'current' after the
    new value has been assigned to the percpu variable, we have explicitly
    told the compiler that 'current' is magical and always stable.  So the
    compiler is quite free to use an older (or newer) value of 'current',
    and the actual assignment to the percpu storage is not relevant even if
    it might look that way.
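
    As a rough illustration of that hazard, here is a hand-written userspace
    model, not kernel code.  'struct task', 'current_task_slot' and both
    functions are made-up stand-ins.  The second function simply spells out
    by hand one reordering that a compiler is allowed to perform once the
    read has been declared stable; it does not show what any particular
    compiler actually emits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel task; not the real task_struct. */
struct task {
    int id;
};

/* Hypothetical stand-in for the per-cpu slot holding the running task. */
static struct task *current_task_slot;

/*
 * The code as it reads in the source: store the new task, then read
 * "current" afterwards.
 */
static int id_after_switch_as_written(struct task *next)
{
    assert(next != NULL);
    current_task_slot = next;          /* assignment to the per-cpu slot */
    return current_task_slot->id;      /* read placed after the assignment */
}

/*
 * One permitted outcome when the read is treated as stable: a value
 * loaded before the store is reused after it.
 */
static int id_after_switch_hoisted(struct task *next)
{
    assert(next != NULL && current_task_slot != NULL);
    struct task *cached = current_task_slot;  /* read moved before the store */
    current_task_slot = next;
    return cached->id;                        /* still the old task */
}
```

    With the slot initially pointing at the old task, the first function
    reports the new task's id and the second reports the old one, even
    though both "use 'current' after the assignment" in the source.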
    
    Which is exactly what happened in the resctl code, that blithely used
    'current' in '__resctrl_sched_in()' when it really wanted the new
    process state (as implied by the name: we're scheduling 'into' that new
    resctl state).  And clang would end up just using the old thread pointer
    value at least in some configurations.
    
    This could have happened with gcc too, and purely depends on random
    compiler details.  Clang just seems to have been more aggressive about
    moving the read of the per-cpu current_task pointer around.
    
    The fix is trivial: just make the resctl code adhere to the scheduler
    rules of using the prev/next thread pointer explicitly, instead of using
    'current' in a situation where it just wasn't valid.
    
    That same code is then also used outside of the scheduler context (when
    a thread's resctl state is explicitly changed), and then we will just pass
    in 'current' as that pointer, of course.  There is no ambiguity in that
    case.
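
    The shape of the fix can be sketched in plain userspace C.  This is a
    simplified model under stated assumptions, not the actual patch: the
    '_sketch' functions, 'fake_pqr_assoc', 'current_task_slot' and the
    two-field 'struct task' are all invented for illustration, and the
    register layout is only an approximation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified task: the real task_struct has far more state. */
struct task {
    unsigned int closid;   /* resource-control class of service */
    unsigned int rmid;     /* resource monitoring id */
};

/* Stand-ins for the per-cpu current pointer and the hardware register. */
static struct task *current_task_slot;
static unsigned long long fake_pqr_assoc;

/* Takes the task explicitly, so it never has to guess prev vs next. */
static void resctrl_sched_in_sketch(const struct task *tsk)
{
    assert(tsk != NULL);
    fake_pqr_assoc = ((unsigned long long)tsk->closid << 32) | tsk->rmid;
}

/* Scheduler context: there is no valid 'current', only prev and next. */
static void switch_to_sketch(struct task *prev, struct task *next)
{
    (void)prev;                       /* outgoing state would be saved here */
    current_task_slot = next;
    resctrl_sched_in_sketch(next);    /* explicit: scheduling into 'next' */
}

/* Outside the scheduler 'current' is unambiguous, so the caller passes it. */
static void update_own_resctrl_state_sketch(unsigned int closid,
                                            unsigned int rmid)
{
    struct task *cur = current_task_slot;
    assert(cur != NULL);
    cur->closid = closid;
    cur->rmid = rmid;
    resctrl_sched_in_sketch(cur);
}
```

    Because the hook receives its task as an argument, the result no longer
    depends on where the compiler decides to place the read of the per-cpu
    pointer.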
    
    The fix may be trivial, but noticing and figuring out what went wrong
    was not.  The credit for that goes to Stephane Eranian.

    Reported-by: Stephane Eranian <eranian@google.com>
    Link: https://lore.kernel.org/lkml/20230303231133.1486085-1-eranian@google.com/
    Link: https://lore.kernel.org/lkml/alpine.LFD.2.01.0908011214330.3304@localhost.localdomain/
    Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
    Tested-by: Tony Luck <tony.luck@intel.com>
    Tested-by: Stephane Eranian <eranian@google.com>
    Tested-by: Babu Moger <babu.moger@amd.com>
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
process_64.c 21.9 KB