Commit 8c136f71 authored by Andrew Morton, committed by Linus Torvalds

[PATCH] sched: scheduler domain support

From: Nick Piggin <piggin@cyberone.com.au>

This is the core sched domains patch.  It can handle any number of levels
in a scheduling hierarchy, and allows architectures to easily customize how
the scheduler behaves.  It also provides progressive balancing backoff
needed by SGI on their large systems (although they have not yet tested
it).

It is built on top of (well, uses ideas from) my previous SMP/NUMA work, and
gets results very similar to them when using the default scheduling
description.

Benchmarks
==========

Martin was seeing, I think, 10-20% better system times in kernbench on the
32-way.  I was seeing improvements in dbench, tbench, kernbench, reaim,
hackbench on a 16-way NUMAQ.  Hackbench in fact had a non linear element
which is all but eliminated.  Large improvements in volanomark.

Cross node task migration was decreased in all above benchmarks, sometimes by
a factor of 100!!  Cross CPU migration was also generally decreased.  See
this post:
http://groups.google.com.au/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&frame=right&th=a406c910b30cbac4&seekm=UAdQ.3hj.5%40gated-at.bofh.it#link2

Results on a hyperthreading P4 are equivalent to Ingo's shared runqueues
patch (which is a big improvement).

Some examples on the 16-way NUMAQ (this is slightly older sched domain code):

 http://www.kerneltrap.org/~npiggin/w26/hbench.png
 http://www.kerneltrap.org/~npiggin/w26/vmark.html

From: Jes Sorensen <jes@wildopensource.com>

   Tiny patch to make -mm3 compile on an NUMA box with NR_CPUS >
   BITS_PER_LONG.

From: "Martin J. Bligh" <mbligh@aracnet.com>

   Fix a minor nit with the find_busiest_group code.  No functional change,
   but makes the code simpler and clearer.  This patch does two things ... 
   adds some more expansive comments, and removes this if clause:

      if (*imbalance < SCHED_LOAD_SCALE
                      && max_load - this_load > SCHED_LOAD_SCALE)
		*imbalance = SCHED_LOAD_SCALE;

   If we remove the scaling factor, we're basically conditionally doing:

	if (*imbalance < 1)
		*imbalance = 1;

   Which is pointless, as the very next thing we do is to remove the
   scaling factor, rounding up to the nearest integer as we do:

	*imbalance = (*imbalance + SCHED_LOAD_SCALE - 1) >> SCHED_LOAD_SHIFT;

   Thus the if statement is redundant, and only makes the code harder to
   read ;-)
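
The redundancy argument above can be checked mechanically. The following is a minimal sketch, not the kernel code: it assumes SCHED_LOAD_SHIFT is 7 (the value is not given in this message), and the function names are hypothetical. It compares the old clamp-then-round path with plain rounding for every nonzero scaled imbalance below two load units.

```c
/* Assumed values for illustration; the real definitions are in kernel/sched.c. */
#define SCHED_LOAD_SHIFT 7
#define SCHED_LOAD_SCALE (1UL << SCHED_LOAD_SHIFT)

/* Remove the scaling factor, rounding up to a whole number of tasks. */
static unsigned long round_up_tasks(unsigned long imbalance)
{
	return (imbalance + SCHED_LOAD_SCALE - 1) >> SCHED_LOAD_SHIFT;
}

/* The old path: clamp a small imbalance to one scaled unit, then round up. */
static unsigned long with_clamp(unsigned long imbalance,
				unsigned long max_load, unsigned long this_load)
{
	if (imbalance < SCHED_LOAD_SCALE
			&& max_load - this_load > SCHED_LOAD_SCALE)
		imbalance = SCHED_LOAD_SCALE;
	return round_up_tasks(imbalance);
}

/* Count nonzero imbalances below two units where the two paths disagree. */
static unsigned long count_mismatches(void)
{
	unsigned long imb, bad = 0;

	for (imb = 1; imb < 2 * SCHED_LOAD_SCALE; imb++)
		if (with_clamp(imb, 4 * SCHED_LOAD_SCALE, 0) != round_up_tasks(imb))
			bad++;
	return bad;
}
```

In this sketch the two paths agree for every nonzero input. They differ only when the scaled imbalance is exactly zero (the clamp yields 1, plain rounding yields 0); whether a zero can reach this point in the real function is not something this message shows.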

From: Rick Lindsley <ricklind@us.ibm.com>

   In find_busiest_group(), after we exit the do/while, we select our
   imbalance.  But max_load, avg_load, and this_load are all unsigned, so
   min(x,y) will make a bad choice if max_load < avg_load < this_load (that
   is, a choice between two negative [very large] numbers).

   Unfortunately, there is a bug when max_load never gets changed from zero
   (look in the loop and think what happens if the only load on the machine is
   being created by cpu groups of which we are a member).  And you have a
   recipe for some really bogus values for imbalance.

   Even if you fix the max_load == 0 bug, there will still be times when
   avg_load - this_load will be negative (thus very large) and you'll make the
   decision to move stuff when you shouldn't have.

   This patch allows for this_load to set max_load, which if I understand
   the logic properly is correct.  With this patch applied, the algorithm is
   *much* more conservative ...  maybe *too* conservative but that's for
   another round of testing ...
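
The unsigned wraparound described above is easy to reproduce in isolation. This is a sketch with hypothetical function names. The guarded variant is one possible defence chosen for illustration; it is not the fix this patch applies (the patch lets this_load set max_load).

```c
/* Naive choice: both differences are unsigned, so a "negative" result
 * wraps around to a very large value before min() ever sees it. */
static unsigned long naive_imbalance(unsigned long max_load,
				     unsigned long avg_load,
				     unsigned long this_load)
{
	unsigned long a = max_load - avg_load;
	unsigned long b = avg_load - this_load;

	return a < b ? a : b;
}

/* Guarded choice (an assumption, not the patch): report no imbalance
 * unless both differences are strictly positive. */
static unsigned long guarded_imbalance(unsigned long max_load,
				       unsigned long avg_load,
				       unsigned long this_load)
{
	if (max_load <= avg_load || avg_load <= this_load)
		return 0;
	return naive_imbalance(max_load, avg_load, this_load);
}
```

With max_load = 0, avg_load = 100 and this_load = 300, the naive version picks the smaller of two wrapped values, which is still enormous, while the guarded version returns 0.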

From: Ingo Molnar <mingo@elte.hu>

   sched-find-busiest-fix
parent 067e0480
Each CPU has a "base" scheduling domain (struct sched_domain). These are
accessed via cpu_sched_domain(i) and this_sched_domain() macros. The domain
hierarchy is built from these base domains via the ->parent pointer. ->parent
MUST be NULL terminated, and domain structures should be per-CPU as they
are locklessly updated.
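
As an illustration of the ->parent chain, here is a minimal user-space sketch. The struct is a hypothetical cut-down stand-in for struct sched_domain, and the helper names are made up for this example.

```c
#include <stddef.h>

/* Hypothetical stand-in: only the field needed to walk the hierarchy. */
struct chain_domain {
	struct chain_domain *parent;	/* NULL at the top of the hierarchy */
};

/* Count the levels from a CPU's base domain up to the top domain.
 * Relies on the chain being NULL terminated, as required above. */
static int domain_depth(const struct chain_domain *base)
{
	const struct chain_domain *sd;
	int depth = 0;

	for (sd = base; sd != NULL; sd = sd->parent)
		depth++;
	return depth;
}

/* Build a three-level chain (e.g. SMT -> SMP -> NUMA) and measure it. */
static int example_depth(void)
{
	struct chain_domain top = { NULL };
	struct chain_domain mid = { &top };
	struct chain_domain base = { &mid };

	return domain_depth(&base);
}
```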
Each scheduling domain spans a number of CPUs (stored in the ->span field).
A domain's span MUST be a superset of its child's span, and a base domain
for CPU i MUST span at least i. The top domain for each CPU will generally
span all CPUs in the system. Strictly it doesn't have to, but a narrower top
domain could lead to a case where some CPUs will never be given tasks to run
unless the CPUs allowed mask is explicitly set. A sched domain's span means
"balance process load among these CPUs".
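
The span rules can be expressed as a small checker. This sketch uses a plain unsigned long as a toy cpumask (bit i set means CPU i) instead of the kernel's cpumask_t; all names here are hypothetical.

```c
#include <stddef.h>

typedef unsigned long span_mask;	/* toy cpumask: bit i means CPU i */

struct span_domain {
	struct span_domain *parent;
	span_mask span;
};

/* Return 1 if every CPU in sub is also in super. */
static int mask_subset(span_mask sub, span_mask super)
{
	return (sub & ~super) == 0;
}

/* Check the span rules along the chain starting at cpu's base domain. */
static int spans_valid(const struct span_domain *base, int cpu)
{
	const struct span_domain *sd;

	if (!(base->span & (1UL << cpu)))
		return 0;	/* base domain must span its own CPU */
	for (sd = base; sd->parent != NULL; sd = sd->parent)
		if (!mask_subset(sd->span, sd->parent->span))
			return 0;	/* parent must cover the child */
	return 1;
}

/* Build a three-level chain from the given spans and validate it. */
static int check_three_level(span_mask smt, span_mask smp, span_mask numa,
			     int cpu)
{
	struct span_domain top = { NULL, numa };
	struct span_domain mid = { &top, smp };
	struct span_domain base = { &mid, smt };

	return spans_valid(&base, cpu);
}
```

For example, check_three_level(0x3, 0xf, 0xff, 1) passes, while a middle span of 0xc fails because it does not cover the base span 0x3.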
Each scheduling domain must have one or more CPU groups (struct sched_group)
which are organised as a circular one way linked list from the ->groups
pointer. The union of cpumasks of these groups MUST be the same as the
domain's span. The intersection of cpumasks from any two of these groups
MUST be the empty set. The group pointed to by the ->groups pointer MUST
contain the CPU to which the domain belongs. Groups may be shared among
CPUs as they contain read only data after they have been set up.
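
The three MUST rules for groups can be checked in one pass over the circular list. A sketch with a toy unsigned long cpumask and hypothetical names; it assumes the list really is circular, otherwise the loop would not terminate.

```c
#include <stddef.h>

typedef unsigned long grp_mask;		/* toy cpumask: bit i means CPU i */

struct toy_group {
	struct toy_group *next;		/* circular, singly linked */
	grp_mask cpumask;
};

/* Validate the group rules for the domain of CPU cpu with the given span. */
static int groups_valid(const struct toy_group *first, grp_mask span, int cpu)
{
	const struct toy_group *g = first;
	grp_mask seen = 0;

	if (first == NULL || !(first->cpumask & (1UL << cpu)))
		return 0;	/* ->groups must point at the group holding cpu */
	do {
		if (g->cpumask & seen)
			return 0;	/* groups must not overlap */
		seen |= g->cpumask;
		g = g->next;
	} while (g != first);
	return seen == span;	/* union of groups must equal the span */
}

/* Link two groups into a ring and validate them. */
static int check_two_groups(grp_mask a, grp_mask b, grp_mask span, int cpu)
{
	struct toy_group g0, g1;

	g0.next = &g1;
	g0.cpumask = a;
	g1.next = &g0;
	g1.cpumask = b;
	return groups_valid(&g0, span, cpu);
}
```

Accumulating the masks seen so far is enough to detect any pairwise overlap, since a group that intersects the running union must intersect one of the earlier groups.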
Balancing within a sched domain occurs between groups. That is, each group
is treated as one entity. The load of a group is defined as the sum of the
load of each of its member CPUs, and only when the load of a group becomes
out of balance are tasks moved between groups.
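
The definition of group load can be sketched as follows. The per-CPU load array, the 8-CPU limit, and the toy unsigned long cpumask are assumptions made for this example only.

```c
#define TOY_NR_CPUS 8

typedef unsigned long load_mask;	/* toy cpumask: bit i means CPU i */

/* Hypothetical per-CPU loads: CPUs 0-3 form one group, CPUs 4-7 another. */
static const unsigned long example_load[TOY_NR_CPUS] = {
	3, 1, 0, 0, 2, 2, 0, 0
};

/* Load of a group: the sum of the loads of its member CPUs. */
static unsigned long group_load(load_mask cpumask,
				const unsigned long cpu_load[TOY_NR_CPUS])
{
	unsigned long sum = 0;
	int cpu;

	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (cpumask & (1UL << cpu))
			sum += cpu_load[cpu];
	return sum;
}
```

With these numbers both four-CPU groups (0x0f and 0xf0) have load 4, so a domain balancing between them sees nothing to move, even though CPU 0 is busier than CPU 1. That imbalance is only visible to a lower-level domain whose groups are smaller.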
In kernel/sched.c, rebalance_tick is run periodically on each CPU. This
function takes its CPU's base sched domain and checks to see if it has reached
its rebalance interval. If so, then it will run load_balance on that domain.
rebalance_tick then checks the parent sched_domain (if it exists), and the
parent of the parent and so forth.
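
The walk described above might look roughly like this in user space. This is only a sketch: the last_balance field, the tick-based interval test, and the toy_ names are assumptions for illustration and are not taken from kernel/sched.c, whose diff is not shown here.

```c
#include <stddef.h>

struct tick_domain {
	struct tick_domain *parent;
	unsigned long balance_interval;	/* ticks between balance attempts */
	unsigned long last_balance;	/* hypothetical bookkeeping field */
	unsigned int balances;		/* calls made to the stand-in balancer */
};

/* Stand-in for load_balance(): just records that it was called. */
static void toy_load_balance(struct tick_domain *sd)
{
	sd->balances++;
}

/* Visit the base domain, then each parent, balancing those that are due. */
static void toy_rebalance_tick(struct tick_domain *base, unsigned long now)
{
	struct tick_domain *sd;

	for (sd = base; sd != NULL; sd = sd->parent) {
		if (now - sd->last_balance >= sd->balance_interval) {
			toy_load_balance(sd);
			sd->last_balance = now;
		}
	}
}

/* Run a two-level hierarchy for some ticks; report balances at a level
 * (0 = base domain, anything else = parent domain). */
static unsigned int run_ticks(unsigned long ticks, unsigned long base_interval,
			      unsigned long parent_interval, int level)
{
	struct tick_domain parent = { NULL, parent_interval, 0, 0 };
	struct tick_domain base = { &parent, base_interval, 0, 0 };
	unsigned long now;

	for (now = 1; now <= ticks; now++)
		toy_rebalance_tick(&base, now);
	return level == 0 ? base.balances : parent.balances;
}
```

Over 8 ticks with intervals of 1 and 4, the base domain is balanced on every tick and the parent only twice, which is the intended effect of giving higher levels longer intervals.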
*** Implementing sched domains ***
The "base" domain will "span" the first level of the hierarchy. In the case
of SMT, you'll span all siblings of the physical CPU, with each group being
a single virtual CPU.
In SMP, the parent of the base domain will span all physical CPUs in the
node, with each group being a single physical CPU. Then with NUMA, the parent
of the SMP domain will span the entire machine, with each group having the
cpumask of a node. Alternatively, you could do multi-level NUMA; Opteron, for
example, might have just one domain covering its one NUMA level.
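
To make the SMT/SMP/NUMA layering concrete, here is a sketch that computes the three spans for a CPU on a made-up machine: 2 nodes, 2 physical CPUs per node, 2 SMT siblings per physical CPU, with siblings and node-mates numbered contiguously. The topology, the numbering, and all names are assumptions for this example; a real architecture would take them from its own topology data.

```c
typedef unsigned long topo_mask;	/* toy cpumask: bit i means CPU i */

#define TOY_SIBLINGS		2	/* virtual CPUs per physical CPU */
#define TOY_CPUS_PER_NODE	4	/* virtual CPUs per NUMA node */
#define TOY_TOTAL_CPUS		8	/* virtual CPUs in the machine */

/* Base (SMT) domain: all siblings of cpu's physical CPU.
 * Its groups would each be a single virtual CPU. */
static topo_mask smt_span(int cpu)
{
	int first = cpu - cpu % TOY_SIBLINGS;

	return ((1UL << TOY_SIBLINGS) - 1) << first;
}

/* SMP domain: every CPU in cpu's node.
 * Its groups would each be one physical CPU, i.e. one SMT span. */
static topo_mask smp_span(int cpu)
{
	int first = cpu - cpu % TOY_CPUS_PER_NODE;

	return ((1UL << TOY_CPUS_PER_NODE) - 1) << first;
}

/* NUMA domain: the entire machine.
 * Its groups would each be the cpumask of one node, i.e. one SMP span. */
static topo_mask numa_span(void)
{
	return (1UL << TOY_TOTAL_CPUS) - 1;
}
```

For CPU 5 this gives an SMT span of 0x30, an SMP span of 0xf0, and a NUMA span of 0xff, so each level's span is a superset of the one below it.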
The implementor should read comments in include/linux/sched.h:
struct sched_domain fields, SD_FLAG_*, SD_*_INIT to get an idea of
the specifics and what to tune.
......@@ -147,6 +147,7 @@ extern spinlock_t mmlist_lock;
typedef struct task_struct task_t;
extern void sched_init(void);
extern void sched_init_smp(void);
extern void init_idle(task_t *idle, int cpu);
extern cpumask_t idle_cpu_mask;
......@@ -542,6 +543,73 @@ do { if (atomic_dec_and_test(&(tsk)->usage)) __put_task_struct(tsk); } while(0)
#define PF_SYNCWRITE 0x00200000 /* I am doing a sync write */
#ifdef CONFIG_SMP
#define SD_FLAG_NEWIDLE 1 /* Balance when about to become idle */
#define SD_FLAG_EXEC 2 /* Balance on exec */
#define SD_FLAG_WAKE 4 /* Balance on task wakeup */
#define SD_FLAG_FASTMIGRATE 8 /* Sync wakes put task on waking CPU */
#define SD_FLAG_IDLE 16 /* Should not have all CPUs idle */
struct sched_group {
struct sched_group *next; /* Must be a circular list */
cpumask_t cpumask;
};
struct sched_domain {
/* These fields must be setup */
struct sched_domain *parent; /* top domain must be null terminated */
struct sched_group *groups; /* the balancing groups of the domain */
cpumask_t span; /* span of all CPUs in this domain */
unsigned long min_interval; /* Minimum balance interval ms */
unsigned long max_interval; /* Maximum balance interval ms */
unsigned int busy_factor; /* less balancing by factor if busy */
unsigned int imbalance_pct; /* No balance until over watermark */
unsigned long long cache_hot_time; /* Task considered cache hot (ns) */
unsigned int cache_nice_tries; /* Leave cache hot tasks for # tries */
int flags; /* See SD_FLAG_* */
/* Runtime fields. */
unsigned int balance_interval; /* initialise to 1. units in ms. */
unsigned int nr_balance_failed; /* initialise to 0 */
};
/* Common values for CPUs */
#define SD_CPU_INIT (struct sched_domain) { \
.span = CPU_MASK_NONE, \
.parent = NULL, \
.groups = NULL, \
.min_interval = 1, \
.max_interval = 4, \
.busy_factor = 64, \
.imbalance_pct = 125, \
.cache_hot_time = (5*1000000/2), \
.cache_nice_tries = 1, \
.flags = SD_FLAG_FASTMIGRATE | SD_FLAG_NEWIDLE,\
.balance_interval = 1, \
.nr_balance_failed = 0, \
}
#ifdef CONFIG_NUMA
/* Common values for NUMA nodes */
#define SD_NODE_INIT (struct sched_domain) { \
.span = CPU_MASK_NONE, \
.parent = NULL, \
.groups = NULL, \
.min_interval = 8, \
.max_interval = 256*fls(num_online_cpus()),\
.busy_factor = 8, \
.imbalance_pct = 125, \
.cache_hot_time = (10*1000000), \
.cache_nice_tries = 1, \
.flags = SD_FLAG_EXEC, \
.balance_interval = 1, \
.nr_balance_failed = 0, \
}
#endif
DECLARE_PER_CPU(struct sched_domain, base_domains);
#define cpu_sched_domain(cpu) (&per_cpu(base_domains, (cpu)))
#define this_sched_domain() (&__get_cpu_var(base_domains))
extern int set_cpus_allowed(task_t *p, cpumask_t new_mask);
#else
static inline int set_cpus_allowed(task_t *p, cpumask_t new_mask)
......@@ -554,10 +622,8 @@ extern unsigned long long sched_clock(void);
#ifdef CONFIG_NUMA
extern void sched_balance_exec(void);
extern void node_nr_running_init(void);
#else
#define sched_balance_exec() {}
#define node_nr_running_init() {}
#endif
/* Move tasks off this (offline) CPU onto another. */
......
......@@ -567,7 +567,6 @@ static void do_pre_smp_initcalls(void)
migration_init();
#endif
node_nr_running_init();
spawn_ksoftirqd();
}
......@@ -596,6 +595,7 @@ static int init(void * unused)
do_pre_smp_initcalls();
smp_init();
sched_init_smp();
/*
* Do this before initcalls, because some drivers want to access
......
This diff is collapsed.