    [PATCH] sched: scheduler domain support · 8c136f71
    Andrew Morton authored
    From: Nick Piggin <piggin@cyberone.com.au>
    
    This is the core sched domains patch.  It can handle any number of levels
    in a scheduling hierarchy, and allows architectures to easily customize
    how the scheduler behaves.  It also provides the progressive balancing
    backoff needed by SGI on their large systems (although they have not yet
    tested it).
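
    As a rough sketch (simplified and illustrative only -- the field names
    below are trimmed down, not a verbatim copy of the structure in the
    patch), each CPU ends up with a chain of domains, e.g. CPU -> node ->
    whole machine on a NUMA box, and each level carries its own balancing
    parameters:

        /* Illustrative sketch of one level in the scheduling hierarchy. */
        struct sched_group;                     /* CPUs inside a domain are grouped */

        struct sched_domain {
                struct sched_domain *parent;    /* next level up; NULL at the top */
                struct sched_group *groups;     /* balancing groups in this domain */
                cpumask_t span;                 /* all CPUs covered by this domain */
                unsigned long min_interval;     /* shortest balance interval (ms) */
                unsigned long max_interval;     /* balancing backs off toward this
                                                 * interval while it keeps failing */
                int flags;                      /* SD_* behaviour flags */
        };

    Architectures customize behaviour by building these per-CPU chains
    differently (for example, adding an SMT level below the physical-CPU
    level).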
    
    It is built on top of (well, uses ideas from) my previous SMP/NUMA work,
    and gets results very similar to it when using the default scheduling
    description.
    
    Benchmarks
    ==========
    
    Martin was seeing, I think, 10-20% better system times in kernbench on the
    32-way.  I was seeing improvements in dbench, tbench, kernbench, reaim and
    hackbench on a 16-way NUMAQ.  Hackbench in fact had a non-linear element
    which is now all but eliminated.  Large improvements in volanomark.
    
    Cross-node task migration was decreased in all of the above benchmarks,
    sometimes by a factor of 100!!  Cross-CPU migration was also generally
    decreased.  See this post:
    http://groups.google.com.au/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&frame=right&th=a406c910b30cbac4&seekm=UAdQ.3hj.5%40gated-at.bofh.it#link2
    
    Results on a hyperthreading P4 are equivalent to Ingo's shared runqueues
    patch (which is a big improvement).
    
    Some examples on the 16-way NUMAQ (this is slightly older sched domain code):
    
     http://www.kerneltrap.org/~npiggin/w26/hbench.png
     http://www.kerneltrap.org/~npiggin/w26/vmark.html
    
    From: Jes Sorensen <jes@wildopensource.com>
    
       Tiny patch to make -mm3 compile on a NUMA box with NR_CPUS >
       BITS_PER_LONG.
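
       (The following is a hypothetical sketch of the general class of
       breakage, not the actual hunk: once NR_CPUS > BITS_PER_LONG, a
       cpumask_t is an array of longs rather than a single long, so plain
       integer bit operations on it no longer compile and the cpumask
       accessors have to be used instead.)

          cpumask_t span = sd->span;            /* 'sd' and 'cpu' assumed in scope */

          /* breaks once cpumask_t is wider than one long: */
          /* if (span & (1UL << cpu)) ... */

          /* works for any NR_CPUS: */
          if (cpu_isset(cpu, span))
                  do_something();               /* placeholder */

          cpus_and(span, span, cpu_online_map); /* intersect with the online CPUs */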
    
    From: "Martin J. Bligh" <mbligh@aracnet.com>
    
       Fix a minor nit with the find_busiest_group code.  No functional change,
       but makes the code simpler and clearer.  This patch does two things ... 
       adds some more expansive comments, and removes this if clause:
    
          if (*imbalance < SCHED_LOAD_SCALE
                          && max_load - this_load > SCHED_LOAD_SCALE)
    		*imbalance = SCHED_LOAD_SCALE;
    
       If we remove the scaling factor, we're basically conditionally doing:
    
    	if (*imbalance < 1)
    		*imbalance = 1;
    
       Which is pointless, as the very next thing we do is strip off the
       scaling factor, rounding up to the nearest integer as we do so:
    
    	*imbalance = (*imbalance + SCHED_LOAD_SCALE - 1) >> SCHED_LOAD_SHIFT;
    
       Thus the if statement is redundant, and only makes the code harder to
       read ;-)
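
       A minimal userspace sketch of that round-up (the SCHED_LOAD_* values
       below are only illustrative; the real constants live in sched.c):

          #include <stdio.h>

          #define SCHED_LOAD_SHIFT 7
          #define SCHED_LOAD_SCALE (1UL << SCHED_LOAD_SHIFT)

          int main(void)
          {
                  /* any non-zero fixed-point imbalance below 1.0 ... */
                  unsigned long imbalance = SCHED_LOAD_SCALE / 4;

                  /* ... already rounds up to 1 when the scale is stripped,
                   * so the removed clamp added nothing */
                  imbalance = (imbalance + SCHED_LOAD_SCALE - 1) >> SCHED_LOAD_SHIFT;

                  printf("%lu\n", imbalance);   /* prints 1 */
                  return 0;
          }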
    
    From: Rick Lindsley <ricklind@us.ibm.com>
    
       In find_busiest_group(), after we exit the do/while, we select our
       imbalance.  But max_load, avg_load, and this_load are all unsigned, so
       min(x,y) will make a bad choice if max_load < avg_load < this_load
       (that is, a choice between two negative differences that wrap around to
       very large unsigned numbers).
    
       There is also a bug when max_load never gets changed from zero (look in
       the loop and think about what happens if the only load on the machine
       is being created by CPU groups of which we are a member).  Put the two
       together and you have a recipe for some really bogus values for
       imbalance.
    
       Even if you fix the max_load == 0 bug, there will still be times when
       avg_load - this_load will be negative (thus very large) and you'll make the
       decision to move stuff when you shouldn't have.
    
       This patch allows for this_load to set max_load, which if I understand
       the logic properly is correct.  With this patch applied, the algorithm is
       *much* more conservative ...  maybe *too* conservative but that's for
       another round of testing ...
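
       A minimal userspace sketch of the underflow described above (the load
       numbers are made up and the min() of the two differences is schematic,
       not the exact expression from find_busiest_group()):

          #include <stdio.h>

          #define min(a, b)       ((a) < (b) ? (a) : (b))

          int main(void)
          {
                  unsigned long max_load = 0;     /* never updated by the loop */
                  unsigned long avg_load = 384;
                  unsigned long this_load = 512;  /* busier than average */

                  /* both differences wrap around to huge positive values ... */
                  unsigned long imbalance = min(max_load - avg_load,
                                                avg_load - this_load);

                  /* ... so the "imbalance" chosen is wildly bogus */
                  printf("imbalance = %lu\n", imbalance);
                  return 0;
          }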
    
    From: Ingo Molnar <mingo@elte.hu>
    
       sched-find-busiest-fix