    sched/fair: Prevent dead task groups from regaining cfs_rq's · b027789e
    Mathias Krause authored
    Kevin is reporting crashes which point to a use-after-free of a cfs_rq
    in update_blocked_averages(). Initial debugging revealed that there are
    live cfs_rq's (on_list=1) in a task group that is about to be kfree()'d
    in free_fair_sched_group(). However, it was unclear how that can happen.
    
    His kernel config happened to lead to a layout of struct sched_entity
    that put the 'my_q' member directly into the middle of the object,
    making it overlap with SLUB's freelist pointer. That, in combination
    with SLAB_FREELIST_HARDENED's freelist pointer mangling, leads to a
    reliable access violation in the form of a #GP, which made the UAF fail
    fast.
    
    Michal seems to have run into the same issue[1]. He already correctly
    diagnosed that commit a7b359fc ("sched/fair: Correctly insert
    cfs_rq's to list on unthrottle") creates the preconditions for the UAF
    by re-adding cfs_rq's also to task groups that have no more running
    tasks, i.e. also to dead ones. His analysis, however, misses the real
    root cause, which cannot be seen from the crash backtrace alone: the
    real offender is tg_unthrottle_up() getting called via
    sched_cfs_period_timer() from the timer interrupt at an inconvenient
    time.
    
    When unregister_fair_sched_group() unlinks all cfs_rq's from the dying
    task group, it doesn't protect itself from getting interrupted. If the
    timer interrupt triggers while we iterate over all CPUs, or after
    unregister_fair_sched_group() has finished but prior to unlinking the
    task group, sched_cfs_period_timer() will execute and walk the list of
    task groups, trying to unthrottle cfs_rq's, i.e. re-add them to the
    dying task group. These will later -- in free_fair_sched_group() -- be
    kfree()'d while still being linked, leading to the fireworks Kevin and
    Michal are seeing.
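
    The sequence can be reproduced in miniature. Below is a self-contained
    userspace sketch -- all names and the list implementation are made up,
    not the kernel's -- of that unlink / re-add / free ordering and the
    dangling leaf list entry it leaves behind:

        /* Illustration only: a toy stand-in for the per-rq leaf cfs_rq list. */
        #include <stdio.h>
        #include <stdlib.h>

        struct fake_cfs_rq {
            int on_list;
            struct fake_cfs_rq *next;  /* stand-in for leaf_cfs_rq_list linkage */
        };

        static struct fake_cfs_rq *leaf_list;

        static void list_add_leaf(struct fake_cfs_rq *cfs_rq)
        {
            if (cfs_rq->on_list)
                return;
            cfs_rq->on_list = 1;
            cfs_rq->next = leaf_list;
            leaf_list = cfs_rq;
        }

        static void list_del_leaf(struct fake_cfs_rq *cfs_rq)
        {
            struct fake_cfs_rq **p = &leaf_list;

            if (!cfs_rq->on_list)
                return;
            while (*p && *p != cfs_rq)
                p = &(*p)->next;
            if (*p)
                *p = cfs_rq->next;
            cfs_rq->on_list = 0;
        }

        int main(void)
        {
            struct fake_cfs_rq *cfs_rq = calloc(1, sizeof(*cfs_rq));

            list_add_leaf(cfs_rq);  /* the group once had load to track        */
            list_del_leaf(cfs_rq);  /* "unregister_fair_sched_group()" unlinks */
            list_add_leaf(cfs_rq);  /* "timer IRQ": tg_unthrottle_up() re-adds */
            free(cfs_rq);           /* "free_fair_sched_group()" kfree()'s it  */

            /* The list head still points at freed memory; the next walk, like
             * update_blocked_averages(), is the use-after-free.  (Only the
             * dangling pointer value is printed here, never dereferenced.)    */
            printf("leaf list head after free: %p\n", (void *)leaf_list);
            return 0;
        }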
    
    To fix this race, ensure the dying task group gets unlinked first.
    However, simply switching the order of unregistering and unlinking the
    task group isn't sufficient, as concurrent RCU walkers might still see
    it, as can be seen below:
    
        CPU1:                                      CPU2:
          :                                        timer IRQ:
          :                                          do_sched_cfs_period_timer():
          :                                            :
          :                                            distribute_cfs_runtime():
          :                                              rcu_read_lock();
          :                                              :
          :                                              unthrottle_cfs_rq():
        sched_offline_group():                             :
          :                                                walk_tg_tree_from(…,tg_unthrottle_up,…):
          list_del_rcu(&tg->list);                           :
     (1)  :                                                  list_for_each_entry_rcu(child, &parent->children, siblings)
          :                                                    :
     (2)  list_del_rcu(&tg->siblings);                         :
          :                                                    tg_unthrottle_up():
          unregister_fair_sched_group():                         struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
            :                                                    :
            list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);               :
            :                                                    :
            :                                                    if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
     (3)    :                                                        list_add_leaf_cfs_rq(cfs_rq);
          :                                                      :
          :                                                    :
          :                                                  :
          :                                                :
          :                                              :
     (4)  :                                              rcu_read_unlock();
    
    CPU 2 walks the task group list in parallel to sched_offline_group();
    specifically, it will read the soon-to-be-unlinked task group entry at
    (1). Unlinking it on CPU 1 at (2) therefore won't prevent CPU 2 from
    still passing it on to tg_unthrottle_up(). CPU 1 now tries to unlink
    all cfs_rq's via list_del_leaf_cfs_rq() in
    unregister_fair_sched_group(). Meanwhile CPU 2 will re-add some of
    these at (3), which is the cause of the UAF later on.
    
    To prevent this additional race from happening, we need to wait until
    walk_tg_tree_from() has finished traversing the task groups, i.e. until
    the RCU read-side critical section ends at (4). Afterwards we're safe
    to call unregister_fair_sched_group(), as each new walk won't see the
    dying task group any more.
    
    On top of that, we need to wait yet another RCU grace period after
    unregister_fair_sched_group() to ensure print_cfs_stats(), which might
    run concurrently, always sees valid objects, i.e. not already freed
    ones.
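
    The resulting teardown ordering -- shown here only as an illustration
    of the changelog text, with userspace stand-ins for the RCU primitives,
    not as the literal patch -- therefore looks like this:

        /* Illustration only: the stand-ins model "a grace period has elapsed". */
        #include <stdio.h>
        #include <stdlib.h>

        struct task_group { int id; };

        static void list_del_rcu_groups(struct task_group *tg)
        {
            printf("tg%d: list_del_rcu(&tg->list) / list_del_rcu(&tg->siblings)\n",
                   tg->id);
        }

        static void synchronize_rcu(void)  /* stand-in for a real grace period */
        {
            printf("  ... RCU grace period elapses ...\n");
        }

        static void unregister_fair_sched_group(struct task_group *tg)
        {
            printf("tg%d: list_del_leaf_cfs_rq() on every CPU\n", tg->id);
        }

        static void free_fair_sched_group(struct task_group *tg)
        {
            printf("tg%d: kfree() the cfs_rq's and sched entities\n", tg->id);
        }

        int main(void)
        {
            struct task_group *tg = calloc(1, sizeof(*tg));
            tg->id = 1;

            /* 1) Unlink the group so new walk_tg_tree_from() walks can't find it. */
            list_del_rcu_groups(tg);

            /* 2) Wait out walkers that already saw it (the rcu_read_unlock() at
             *    (4) above); only then drop the per-CPU cfs_rq's, as nothing can
             *    re-add them any more.                                            */
            synchronize_rcu();
            unregister_fair_sched_group(tg);

            /* 3) One more grace period so a concurrent print_cfs_stats() never
             *    sees already freed objects.                                      */
            synchronize_rcu();
            free_fair_sched_group(tg);

            free(tg);
            return 0;
        }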
    
    This patch survives Michal's reproducer[2] for 8h+ now, whereas the bug
    used to trigger within minutes before.
    
      [1] https://lore.kernel.org/lkml/20211011172236.11223-1-mkoutny@suse.com/
      [2] https://lore.kernel.org/lkml/20211102160228.GA57072@blackbody.suse.cz/
    
    Fixes: a7b359fc ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
    [peterz: shuffle code around a bit]
    Reported-by: Kevin Tanguy <kevin.tanguy@corp.ovh.com>
    Signed-off-by: Mathias Krause <minipli@grsecurity.net>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>