[PATCH] sched: scheduler domain support

From: Nick Piggin <piggin@cyberone.com.au> This is the core sched domains patch. It can handle any number of levels in a scheduling heirachy, and allows architectures to easily customize how the scheduler behaves. It also provides progressive balancing backoff needed by SGI on their large systems (although they have not yet tested it). It is built on top of (well, uses ideas from) my previous SMP/NUMA work, and gets results very similar to them when using the default scheduling description. Benchmarks ========== Martin was seeing I think 10-20% better system times in kernbench on the 32 way. I was seeing improvements in dbench, tbench, kernbench, reaim, hackbench on a 16-way NUMAQ. Hackbench in fact had a non linear element which is all but eliminated. Large improvements in volanomark. Cross node task migration was decreased in all above benchmarks, sometimes by a factor of 100!! Cross CPU migration was also generally decreased. See this post: http://groups.google.com.au/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&frame=right&th=a406c910b30cbac4&seekm=UAdQ.3hj.5%40gated-at.bofh.it#link2 Results on a hyperthreading P4 are equivalent to Ingo's shared runqueues patch (which is a big improvement). Some examples on the 16-way NUMAQ (this is slightly older sched domain code): http://www.kerneltrap.org/~npiggin/w26/hbench.png http://www.kerneltrap.org/~npiggin/w26/vmark.html From: Jes Sorensen <jes@wildopensource.com> Tiny patch to make -mm3 compile on an NUMA box with NR_CPUS > BITS_PER_LONG. From: "Martin J. Bligh" <mbligh@aracnet.com> Fix a minor nit with the find_busiest_group code. No functional change, but makes the code simpler and clearer. This patch does two things ... adds some more expansive comments, and removes this if clause: if (*imbalance < SCHED_LOAD_SCALE && max_load - this_load > SCHED_LOAD_SCALE) *imbalance = SCHED_LOAD_SCALE; If we remove the scaling factor, we're basically conditionally doing: if (*imbalance < 1) *imbalance = 1; Which is pointless, as the very next thing we do is to remove the scaling factor, rounding up to the nearest integer as we do: *imbalance = (*imbalance + SCHED_LOAD_SCALE - 1) >> SCHED_LOAD_SHIFT; Thus the if statement is redundant, and only makes the code harder to read ;-) From: Rick Lindsley <ricklind@us.ibm.com> In find_busiest_group(), after we exit the do/while, we select our imbalance. But max_load, avg_load, and this_load are all unsigned, so min(x,y) will make a bad choice if max_load < avg_load < this_load (that is, a choice between two negative [very large] numbers). Unfortunately, there is a bug when max_load never gets changed from zero (look in the loop and think what happens if the only load on the machine is being created by cpu groups of which we are a member). And you have a recipe for some really bogus values for imbalance. Even if you fix the max_load == 0 bug, there will still be times when avg_load - this_load will be negative (thus very large) and you'll make the decision to move stuff when you shouldn't have. This patch allows for this_load to set max_load, which if I understand the logic properly is correct. With this patch applied, the algorithm is *much* more conservative ... maybe *too* conservative but that's for another round of testing ... From: Ingo Molnar <mingo@elte.hu> sched-find-busiest-fix

[PATCH] sched: scheduler domain support
From: Nick Piggin <piggin@cyberone.com.au> This is the core sched domains patch. It can handle any number of levels in a scheduling heirachy, and allows architectures to easily customize how the scheduler behaves. It also provides progressive balancing backoff needed by SGI on their large systems (although they have not yet tested it). It is built on top of (well, uses ideas from) my previous SMP/NUMA work, and gets results very similar to them when using the default scheduling description. Benchmarks ========== Martin was seeing I think 10-20% better system times in kernbench on the 32 way. I was seeing improvements in dbench, tbench, kernbench, reaim, hackbench on a 16-way NUMAQ. Hackbench in fact had a non linear element which is all but eliminated. Large improvements in volanomark. Cross node task migration was decreased in all above benchmarks, sometimes by a factor of 100!! Cross CPU migration was also generally decreased. See this post: http://groups.google.com.au/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&frame=right&th=a406c910b30cbac4&seekm=UAdQ.3hj.5%40gated-at.bofh.it#link2 Results on a hyperthreading P4 are equivalent to Ingo's shared runqueues patch (which is a big improvement). Some examples on the 16-way NUMAQ (this is slightly older sched domain code): http://www.kerneltrap.org/~npiggin/w26/hbench.png http://www.kerneltrap.org/~npiggin/w26/vmark.html From: Jes Sorensen <jes@wildopensource.com> Tiny patch to make -mm3 compile on an NUMA box with NR_CPUS > BITS_PER_LONG. From: "Martin J. Bligh" <mbligh@aracnet.com> Fix a minor nit with the find_busiest_group code. No functional change, but makes the code simpler and clearer. This patch does two things ... adds some more expansive comments, and removes this if clause: if (*imbalance < SCHED_LOAD_SCALE && max_load - this_load > SCHED_LOAD_SCALE) *imbalance = SCHED_LOAD_SCALE; If we remove the scaling factor, we're basically conditionally doing: if (*imbalance < 1) *imbalance = 1; Which is pointless, as the very next thing we do is to remove the scaling factor, rounding up to the nearest integer as we do: *imbalance = (*imbalance + SCHED_LOAD_SCALE - 1) >> SCHED_LOAD_SHIFT; Thus the if statement is redundant, and only makes the code harder to read ;-) From: Rick Lindsley <ricklind@us.ibm.com> In find_busiest_group(), after we exit the do/while, we select our imbalance. But max_load, avg_load, and this_load are all unsigned, so min(x,y) will make a bad choice if max_load < avg_load < this_load (that is, a choice between two negative [very large] numbers). Unfortunately, there is a bug when max_load never gets changed from zero (look in the loop and think what happens if the only load on the machine is being created by cpu groups of which we are a member). And you have a recipe for some really bogus values for imbalance. Even if you fix the max_load == 0 bug, there will still be times when avg_load - this_load will be negative (thus very large) and you'll make the decision to move stuff when you shouldn't have. This patch allows for this_load to set max_load, which if I understand the logic properly is correct. With this patch applied, the algorithm is *much* more conservative ... maybe *too* conservative but that's for another round of testing ... From: Ingo Molnar <mingo@elte.hu> sched-find-busiest-fix
8c136f71 · Andrew Morton · Linus Torvalds · 067e0480 · 8c136f71 · 8c136f71
Commit 8c136f71 authored May 09, 2004 by Andrew Morton Committed by Linus Torvalds May 09, 2004
Showing with 933 additions and 349 deletions

Documentation/sched-domains.txt Documentation/sched-domains.txt +48 -0

include/linux/sched.h include/linux/sched.h +68 -2

init/main.c init/main.c +1 -1

kernel/sched.c kernel/sched.c +816 -346

No files found.
--- a/Documentation/sched-domains.txt
+++ b/Documentation/sched-domains.txt
+Each CPU has a "base" scheduling domain (struct sched_domain). These are
+accessed via cpu_sched_domain(i) and this_sched_domain() macros. The domain
+hierarchy is built from these base domains via the ->parent pointer. ->parent
+MUST be NULL terminated, and domain structures should be per-CPU as they
+are locklessly updated.
+
+Each scheduling domain spans a number of CPUs (stored in the ->span field).
+A domain's span MUST be a superset of it child's span, and a base domain
+for CPU i MUST span at least i. The top domain for each CPU will generally
+span all CPUs in the system although strictly it doesn't have to, but this
+could lead to a case where some CPUs will never be given tasks to run unless
+the CPUs allowed mask is explicitly set. A sched domain's span means "balance
+process load among these CPUs".
+
+Each scheduling domain must have one or more CPU groups (struct sched_group)
+which are organised as a circular one way linked list from the ->groups
+pointer. The union of cpumasks of these groups MUST be the same as the
+domain's span. The intersection of cpumasks from any two of these groups
+MUST be the empty set. The group pointed to by the ->groups pointer MUST
+contain the CPU to which the domain belongs. Groups may be shared among
+CPUs as they contain read only data after they have been set up.
+
+Balancing within a sched domain occurs between groups. That is, each group
+is treated as one entity. The load of a group is defined as the sum of the
+load of each of its member CPUs, and only when the load of a group becomes
+out of balance are tasks moved between groups.
+
+In kernel/sched.c, rebalance_tick is run periodically on each CPU. This
+function takes its CPU's base sched domain and checks to see if has reached
+its rebalance interval. If so, then it will run load_balance on that domain.
+rebalance_tick then checks the parent sched_domain (if it exists), and the
+parent of the parent and so forth.
+
+*** Implementing sched domains ***
+The "base" domain will "span" the first level of the hierarchy. In the case
+of SMT, you'll span all siblings of the physical CPU, with each group being
+a single virtual CPU.
+
+In SMP, the parent of the base domain will span all physical CPUs in the
+node. Each group being a single physical CPU. Then with NUMA, the parent
+of the SMP domain will span the entire machine, with each group having the
+cpumask of a node. Or, you could do multi-level NUMA or Opteron, for example,
+might have just one domain covering its one NUMA level.
+
+The implementor should read comments in include/linux/sched.h:
+struct sched_domain fields, SD_FLAG_*, SD_*_INIT to get an idea of
+the specifics and what to tune.
+
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -147,6 +147,7 @@ extern spinlock_t mmlist_lock;
 typedef struct task_struct task_t;

 extern void sched_init(void);
+extern void sched_init_smp(void);
 extern void init_idle(task_t *idle, int cpu);

 extern cpumask_t idle_cpu_mask;
@@ -542,6 +543,73 @@ do { if (atomic_dec_and_test(&(tsk)->usage)) __put_task_struct(tsk); } while(0)
 #define PF_SYNCWRITE	0x00200000	/* I am doing a sync write */

 #ifdef CONFIG_SMP
+#define SD_FLAG_NEWIDLE		1	/* Balance when about to become idle */
+#define SD_FLAG_EXEC		2	/* Balance on exec */
+#define SD_FLAG_WAKE		4	/* Balance on task wakeup */
+#define SD_FLAG_FASTMIGRATE	8	/* Sync wakes put task on waking CPU */
+#define SD_FLAG_IDLE		16	/* Should not have all CPUs idle */
+
+struct sched_group {
+	struct sched_group *next;	/* Must be a circular list */
+	cpumask_t cpumask;
+};
+
+struct sched_domain {
+	/* These fields must be setup */
+	struct sched_domain *parent;	/* top domain must be null terminated */
+	struct sched_group *groups;	/* the balancing groups of the domain */
+	cpumask_t span;			/* span of all CPUs in this domain */
+	unsigned long min_interval;	/* Minimum balance interval ms */
+	unsigned long max_interval;	/* Maximum balance interval ms */
+	unsigned int busy_factor;	/* less balancing by factor if busy */
+	unsigned int imbalance_pct;	/* No balance until over watermark */
+	unsigned long long cache_hot_time; /* Task considered cache hot (ns) */
+	unsigned int cache_nice_tries;	/* Leave cache hot tasks for # tries */
+	int flags;			/* See SD_FLAG_* */
+
+	/* Runtime fields. */
+	unsigned int balance_interval;	/* initialise to 1. units in ms. */
+	unsigned int nr_balance_failed; /* initialise to 0 */
+};
+
+/* Common values for CPUs */
+#define SD_CPU_INIT (struct sched_domain) {		\
+	.span			= CPU_MASK_NONE,	\
+	.parent			= NULL,			\
+	.groups			= NULL,			\
+	.min_interval		= 1,			\
+	.max_interval		= 4,			\
+	.busy_factor		= 64,			\
+	.imbalance_pct		= 125,			\
+	.cache_hot_time		= (5*1000000/2),	\
+	.cache_nice_tries	= 1,			\
+	.flags			= SD_FLAG_FASTMIGRATE | SD_FLAG_NEWIDLE,\
+	.balance_interval	= 1,			\
+	.nr_balance_failed	= 0,			\
+}
+
+#ifdef CONFIG_NUMA
+/* Common values for NUMA nodes */
+#define SD_NODE_INIT (struct sched_domain) {		\
+	.span			= CPU_MASK_NONE,	\
+	.parent			= NULL,			\
+	.groups			= NULL,			\
+	.min_interval		= 8,			\
+	.max_interval		= 256*fls(num_online_cpus()),\
+	.busy_factor		= 8,			\
+	.imbalance_pct		= 125,			\
+	.cache_hot_time		= (10*1000000),		\
+	.cache_nice_tries	= 1,			\
+	.flags			= SD_FLAG_EXEC,		\
+	.balance_interval	= 1,			\
+	.nr_balance_failed	= 0,			\
+}
+#endif
+
+DECLARE_PER_CPU(struct sched_domain, base_domains);
+#define cpu_sched_domain(cpu)	(&per_cpu(base_domains, (cpu)))
+#define this_sched_domain()	(&__get_cpu_var(base_domains))
+
 extern int set_cpus_allowed(task_t *p, cpumask_t new_mask);
 #else
 static inline int set_cpus_allowed(task_t *p, cpumask_t new_mask)
@@ -554,10 +622,8 @@ extern unsigned long long sched_clock(void);

 #ifdef CONFIG_NUMA
 extern void sched_balance_exec(void);
-extern void node_nr_running_init(void);
 #else
 #define sched_balance_exec()   {}
-#define node_nr_running_init() {}
 #endif

 /* Move tasks off this (offline) CPU onto another. */

--- a/init/main.c
+++ b/init/main.c
@@ -567,7 +567,6 @@ static void do_pre_smp_initcalls(void)

 	migration_init();
 #endif
-	node_nr_running_init();
 	spawn_ksoftirqd();
 }

@@ -596,6 +595,7 @@ static int init(void * unused)
 	do_pre_smp_initcalls();

 	smp_init();
+	sched_init_smp();

 	/*
 	 * Do this before initcalls, because some drivers want to access

--- a/kernel/sched.c
+++ b/kernel/sched.c