    [PATCH] sched: scheduler domain support · 8c136f71
    Andrew Morton authored
    From: Nick Piggin <piggin@cyberone.com.au>
    
    This is the core sched domains patch.  It can handle any number of levels
    in a scheduling hierarchy, and allows architectures to easily customize
    how the scheduler behaves.  It also provides the progressive balancing
    backoff needed by SGI on their large systems (although they have not yet
    tested it).
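
    As a rough sketch (simplified and illustrative only -- the field names
    below are trimmed down, not a verbatim copy of the structure in the
    patch), each CPU ends up with a chain of domains, e.g. CPU -> node ->
    whole machine on a NUMA box, and each level carries its own balancing
    parameters:

        /* Illustrative sketch of one level in the scheduling hierarchy. */
        struct sched_group;                     /* CPUs inside a domain are grouped */

        struct sched_domain {
                struct sched_domain *parent;    /* next level up; NULL at the top */
                struct sched_group *groups;     /* balancing groups in this domain */
                cpumask_t span;                 /* all CPUs covered by this domain */
                unsigned long min_interval;     /* shortest balance interval (ms) */
                unsigned long max_interval;     /* balancing backs off toward this
                                                 * interval while it keeps failing */
                int flags;                      /* SD_* behaviour flags */
        };

    Architectures customize behaviour by building these per-CPU chains
    differently (for example, adding an SMT level below the physical-CPU
    level).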
    
    It is built on top of (well, uses ideas from) my previous SMP/NUMA work,
    and gets results very similar to it when using the default scheduling
    description.
    
    Benchmarks
    ==========
    
    Martin was seeing, I think, 10-20% better system times in kernbench on the
    32-way.  I was seeing improvements in dbench, tbench, kernbench, reaim and
    hackbench on a 16-way NUMAQ.  Hackbench in fact had a non-linear element
    which is now all but eliminated.  Large improvements in volanomark.
    
    Cross-node task migration was decreased in all of the above benchmarks,
    sometimes by a factor of 100!!  Cross-CPU migration was also generally
    decreased.  See this post:
    http://groups.google.com.au/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&frame=right&th=a406c910b30cbac4&seekm=UAdQ.3hj.5%40gated-at.bofh.it#link2
    
    Results on a hyperthreading P4 are equivalent to Ingo's shared runqueues
    patch (which is a big improvement).
    
    Some examples on the 16-way NUMAQ (this is slightly older sched domain code):
    
     http://www.kerneltrap.org/~npiggin/w26/hbench.png
     http://www.kerneltrap.org/~npiggin/w26/vmark.html
    
    From: Jes Sorensen <jes@wildopensource.com>
    
       Tiny patch to make -mm3 compile on a NUMA box with NR_CPUS >
       BITS_PER_LONG.
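
       (The following is a hypothetical sketch of the general class of
       breakage, not the actual hunk: once NR_CPUS > BITS_PER_LONG, a
       cpumask_t is an array of longs rather than a single long, so plain
       integer bit operations on it no longer compile and the cpumask
       accessors have to be used instead.)

          cpumask_t span = sd->span;            /* 'sd' and 'cpu' assumed in scope */

          /* breaks once cpumask_t is wider than one long: */
          /* if (span & (1UL << cpu)) ... */

          /* works for any NR_CPUS: */
          if (cpu_isset(cpu, span))
                  do_something();               /* placeholder */

          cpus_and(span, span, cpu_online_map); /* intersect with the online CPUs */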
    
    From: "Martin J. Bligh" <mbligh@aracnet.com>
    
       Fix a minor nit with the find_busiest_group code.  No functional change,
       but makes the code simpler and clearer.  This patch does two things ... 
       adds some more expansive comments, and removes this if clause:
    
          if (*imbalance < SCHED_LOAD_SCALE
                          && max_load - this_load > SCHED_LOAD_SCALE)
    		*imbalance = SCHED_LOAD_SCALE;
    
       If we remove the scaling factor, we're basically conditionally doing:
    
    	if (*imbalance < 1)
    		*imbalance = 1;
    
       Which is pointless, as the very next thing we do is strip off the
       scaling factor, rounding up to the nearest integer as we do so:
    
    	*imbalance = (*imbalance + SCHED_LOAD_SCALE - 1) >> SCHED_LOAD_SHIFT;
    
       Thus the if statement is redundant, and only makes the code harder to
       read ;-)
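
       A minimal userspace sketch of that round-up (the SCHED_LOAD_* values
       below are only illustrative; the real constants live in sched.c):

          #include <stdio.h>

          #define SCHED_LOAD_SHIFT 7
          #define SCHED_LOAD_SCALE (1UL << SCHED_LOAD_SHIFT)

          int main(void)
          {
                  /* any non-zero fixed-point imbalance below 1.0 ... */
                  unsigned long imbalance = SCHED_LOAD_SCALE / 4;

                  /* ... already rounds up to 1 when the scale is stripped,
                   * so the removed clamp added nothing */
                  imbalance = (imbalance + SCHED_LOAD_SCALE - 1) >> SCHED_LOAD_SHIFT;

                  printf("%lu\n", imbalance);   /* prints 1 */
                  return 0;
          }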
    
    From: Rick Lindsley <ricklind@us.ibm.com>
    
       In find_busiest_group(), after we exit the do/while, we select our
       imbalance.  But max_load, avg_load, and this_load are all unsigned, so
       min(x,y) will make a bad choice if max_load < avg_load < this_load
       (that is, a choice between two negative differences that wrap around to
       very large unsigned numbers).
    
       There is also a bug when max_load never gets changed from zero (look in
       the loop and think about what happens if the only load on the machine
       is being created by CPU groups of which we are a member).  Put the two
       together and you have a recipe for some really bogus values for
       imbalance.
    
       Even if you fix the max_load == 0 bug, there will still be times when
       avg_load - this_load will be negative (thus very large) and you'll make the
       decision to move stuff when you shouldn't have.
    
       This patch allows for this_load to set max_load, which if I understand
       the logic properly is correct.  With this patch applied, the algorithm is
       *much* more conservative ...  maybe *too* conservative but that's for
       another round of testing ...
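
       A minimal userspace sketch of the underflow described above (the load
       numbers are made up and the min() of the two differences is schematic,
       not the exact expression from find_busiest_group()):

          #include <stdio.h>

          #define min(a, b)       ((a) < (b) ? (a) : (b))

          int main(void)
          {
                  unsigned long max_load = 0;     /* never updated by the loop */
                  unsigned long avg_load = 384;
                  unsigned long this_load = 512;  /* busier than average */

                  /* both differences wrap around to huge positive values ... */
                  unsigned long imbalance = min(max_load - avg_load,
                                                avg_load - this_load);

                  /* ... so the "imbalance" chosen is wildly bogus */
                  printf("imbalance = %lu\n", imbalance);
                  return 0;
          }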
    
    From: Ingo Molnar <mingo@elte.hu>
    
       sched-find-busiest-fix