- 10 May, 2004 19 commits
-
-
Andrew Morton authored
From: Nick Piggin <nickpiggin@yahoo.com.au> From: John Hawkes The following brings up performance on a 64-way Altix. Since that system is at the smaller end of the scale, the changes should also be applicable to other NUMA systems.
-
Andrew Morton authored
From: Nick Piggin <nickpiggin@yahoo.com.au> Imbalance calculations were not right. This would cause unneeded migration.
-
Andrew Morton authored
From: Nick Piggin <nickpiggin@yahoo.com.au> Make affine wakeups and "passive load balancing" more conservative. Aggressive affine wakeups were causing huge regressions in dbt3-pgsql on 8-way non-NUMA systems at OSDL's STP.
-
Andrew Morton authored
From: Rusty Russell <rusty@rustcorp.com.au> From: Srivatsa Vaddagiri <vatsa@in.ibm.com> From: Andrew Morton <akpm@osdl.org> From: Rusty Russell <rusty@rustcorp.com.au>

We want to get rid of lock_cpu_hotplug() in sched_migrate_task. I found that lockless migration of an execing task is _extremely_ racy. The races I hit are described below, along with probable solutions. Task migration done elsewhere should be safe (?) since it either holds the lock (sys_sched_setaffinity) or is done entirely with preemption disabled (load_balance).

sched_balance_exec does:
a. disables preemption
b. finds new_cpu for current
c. enables preemption
d. calls sched_migrate_task to migrate current to new_cpu

and sched_migrate_task does:
e. task_rq_lock(p)
f. migrate_task(p, dest_cpu ..)
(if we have to wait for the migration thread)
g. task_rq_unlock()
h. wake_up_process(rq->migration_thread)
i. wait_for_completion()

Several things can happen here:

1. new_cpu can go down after h and before the migration thread has got around to handling the request. ==> We need to add a cpu_is_offline check in __migrate_task.

2. new_cpu can go down between c and d, or before f. ==> Even though this case is automatically handled by the above change (migrate_task being called on a running task, current, will delegate migration to the migration thread), it would be good practice to avoid calling migrate_task in the first place when dest_cpu is offline. This means adding another cpu_is_offline check after e in sched_migrate_task.

3. The 'current' task can get preempted _immediately_ after g, and when it comes back, task_cpu(p) can be dead. In that case it is invalid to do a wake_up on a non-existent migration thread (rq->migration_thread can be NULL). ==> We should disable preemption through g and h.

4. Before the migration thread gets around to handling the request, its CPU goes dead. This will leave unhandled migration requests on the dead CPU. ==> We need to wake up sleeping requestors (if any) in the CPU_DEAD notification.

I really wonder if we can get rid of these issues by avoiding balancing at exec time and instead having it balanced during load_balance. Alternatively, if this is valuable and we want to retain it, I think we still need to consider a read/write sem, with sched_migrate_task doing down_read_trylock. This may eliminate the deadlock I hit between cpu_up and the CPU_UP_PREPARE notification, which had forced me away from the r/w sem. Anyway, the patch below addresses the above races. It's against 2.6.6-rc2-mm1 and has been tested on a 4-way Intel Pentium SMP machine.

Rusty sez: Two other changes:
1) I grabbed a reference to the thread, rather than using preempt_disable(). It's the more obvious way, I think.
2) Why the wait_to_die code? It might be needed if we move tasks after stop_machine, but for now I don't see the problem with the migration thread running on the wrong CPU for a bit: nothing is on this runqueue so active_load_balance is safe, and __migrate_task will be a noop (due to the cpu_is_offline() check). If there is a problem, your fix is racy, because we could be preempted immediately afterwards. So I just stop the kthread then wake up any remaining...
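A minimal sketch of the checks described above. The names (migration_req_t, rq->migration_thread, cpu_is_offline) follow the description and the scheduler of that era; this is an illustration under those assumptions, not the actual patch:

    /* Sketch only: illustrates races 2 and 3 above. */
    static void sched_migrate_task(struct task_struct *p, int dest_cpu)
    {
        migration_req_t req;
        unsigned long flags;
        runqueue_t *rq = task_rq_lock(p, &flags);

        if (cpu_is_offline(dest_cpu))           /* race 2: target went down early */
            goto out;

        if (migrate_task(p, dest_cpu, &req)) {
            /*
             * Pin the migration thread so it cannot disappear once the
             * runqueue lock is dropped (race 3).
             */
            struct task_struct *mt = rq->migration_thread;

            get_task_struct(mt);
            task_rq_unlock(rq, &flags);
            wake_up_process(mt);
            put_task_struct(mt);
            wait_for_completion(&req.done);
            return;
        }
    out:
        task_rq_unlock(rq, &flags);
    }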
-
Andrew Morton authored
From: Ingo Molnar <mingo@elte.hu> The trivial fixes:
- added recent trivial bits from Nick's and my patches
- hotplug CPU fix
- early init cleanup
-
Andrew Morton authored
From: Martin Hicks <mort@wildopensource.com> Another optimization patch from Jack Steiner, intended to reduce TLB flushes during process migration. Most architectures should define tlb_migrate_prepare() to be flush_tlb_mm(), but on i386 that would be a wasted flush, because i386 disconnects previous CPUs from the TLB flush automatically.
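A hedged sketch of the hook being described; the generic fallback name and its placement are assumptions, not the exact patch:

    /*
     * Generic fallback (e.g. in asm-generic/tlb.h): architectures that
     * benefit override this with flush_tlb_mm(); i386 keeps it a no-op
     * because departed CPUs already drop out of the mm's flush set.
     */
    #ifndef tlb_migrate_prepare
    #define tlb_migrate_prepare(mm)    do { } while (0)
    #endif

    /* On an architecture that wants the flush (sketch): */
    /* #define tlb_migrate_prepare(mm) flush_tlb_mm(mm) */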
-
Andrew Morton authored
From: Nick Piggin <piggin@cyberone.com.au> This patch removes the per-runqueue array of NR_CPUS arrays. Each time we want to check a remote CPU's load we check nr_running as well anyway, so introduce a cpu_load which is the load of the local runqueue and is kept updated in the timer tick. Put them in the same cacheline. This has the additional benefits of keeping cpu_load consistent across all CPUs and more up to date. It is sampled better too, being updated once per timer tick. This shouldn't make much difference in scheduling behaviour, but all benchmarks are either as good or better on the 16-way NUMAQ: hackbench, reaim and volanomark are about the same, tbench and dbench are maybe a bit better, and kernbench is about one percent better. John reckons it isn't a big deal, but it does save 4K per CPU or 2MB total on his big systems, so I figure it must be a bit kinder on the caches. I think it is just nicer in general anyway.
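A rough sketch of the idea; the cpu_load field comes from the description above, while the smoothing rule shown is an assumption:

    struct runqueue {
        spinlock_t lock;
        unsigned long nr_running;
        unsigned long cpu_load;     /* same cacheline as nr_running */
        /* ... */
    };

    /* Called once per timer tick on the local CPU. */
    static inline void update_cpu_load(struct runqueue *rq)
    {
        unsigned long now = rq->nr_running * SCHED_LOAD_SCALE;

        /* Simple smoothing so remote readers see a stable, recent sample. */
        rq->cpu_load = (rq->cpu_load + now) / 2;
    }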
-
Andrew Morton authored
From: Con Kolivas <kernel@kolivas.org> This patch provides full per-package priority support for SMT processors (aka Pentium 4 hyperthreading) when combined with CONFIG_SCHED_SMT. It maintains cpu percentage distribution within each physical cpu package by limiting the time a lower priority task can run on a sibling cpu concurrently with a higher priority task. It introduces a new field into the scheduler domain: unsigned int per_cpu_gain; /* CPU % gained by adding domain cpus */ This is empirically set to 15% for Pentium 4 at the moment and can be modified to support different values dynamically as newer processors come out with improved SMT performance. It should not matter how many siblings there are. How it works: it compares tasks running on sibling cpus, and when a lower static priority task is running it will delay it until high_priority_timeslice * (100 - per_cpu_gain) / 100 <= low_prio_timeslice. E.g. a nice 19 task's timeslice is 10ms and a nice 0 timeslice is 102ms. On vanilla, the nice 0 task runs on one logical cpu while the nice 19 task runs unabated on the other logical cpu. With SMT nice, the nice 0 task runs on one logical cpu for 102ms and the nice 19 task sleeps until the nice 0 task has 12ms remaining, and then it will schedule. Real time tasks and kernel threads are not altered by this code, and kernel threads do not delay lower priority user tasks. With lots of thanks to Zwane Mwaikambo and Nick Piggin for help with the coding of this version. If this is merged, it is probably best to delay pushing this upstream into mainline until sched_domains gets tested for at least one major release.
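A sketch of the test described above; the helper name and exact fields are assumptions, with task_timeslice() standing in for the scheduler's timeslice-from-priority helper of that era:

    /*
     * Sketch: should 'low' (lower static priority, wanting to run here) be
     * delayed while 'high' runs on the sibling cpu?  per_cpu_gain is the
     * new sched_domain field (15 for current Pentium 4 SMT).
     */
    static int smt_should_delay(task_t *low, task_t *high,
                                unsigned int per_cpu_gain)
    {
        /*
         * With the example above (nice 0 slice 102ms, nice 19 slice 10ms,
         * gain 15%), the nice 19 task gets to run once the nice 0 task has
         * roughly 12ms of its timeslice left.
         */
        return high->time_slice * (100 - per_cpu_gain) / 100 > task_timeslice(low);
    }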
-
Andrew Morton authored
From: Nick Piggin <piggin@cyberone.com.au> This changes sched domains to contain all possible CPUs, and check for online as needed. It's in order to play nicely with CPU hotplug.
-
Andrew Morton authored
From: Nick Piggin <piggin@cyberone.com.au> The following patch adds a cpu_power member to struct sched_group. This allows the special casing of SMT groups to be removed from the balancing code. It does not take CPU hotplug into account yet, but that shouldn't be too hard. I have tested it on the NUMAQ by pretending it has SMT. Works as expected. Active balances across nodes.
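A sketch of the structure being described; the exact layout is an assumption:

    struct sched_group {
        struct sched_group *next;   /* circular list of groups in this domain */
        cpumask_t cpumask;          /* CPUs covered by this group */
        /*
         * Relative CPU power of the group: SCHED_LOAD_SCALE means "one
         * ordinary CPU", so a package of SMT siblings can be given a bit
         * more than one CPU's worth instead of one full CPU per sibling.
         */
        unsigned long cpu_power;
    };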
-
Andrew Morton authored
From: Rusty Russell <rusty@rustcorp.com.au>, Nick Piggin <piggin@cyberone.com.au> The current sched_balance_exec() sets the task's cpus_allowed mask temporarily to move it to a different CPU. This has several issues, including the fact that a task will see its affinity at a bogus value. So we change the migration_req_t to explicitly specify a destination CPU, rather than the migration thread deriving it from cpus_allowed. If the requested CPU is no longer valid (racing with another set_cpus_allowed, say), it can be ignored: if the task is not allowed on this CPU, there will be another migration request pending. This change allows sched_balance_exec() to tell the migration thread what to do without changing the cpus_allowed mask. So we rename __set_cpus_allowed() to move_task(), as the cpus_allowed mask is now set by the caller. And move_task_away(), which the migration thread uses to actually perform the move, is renamed __move_task(). I also ignore offline CPUs in sched_best_cpu(), so sched_migrate_task() doesn't need to check for offline CPUs. Ulterior motive: this approach also plays well with CPU Hotplug. Previously that patch might have seen a task with cpus_allowed only containing the dying CPU (temporarily due to sched_balance_exec) and forcibly reset it to all cpus, which might be wrong. The other approach is to hold the cpucontrol sem around sched_balance_exec(), which is too much of a bottleneck.
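A sketch of the reworked request; the field layout and helper are assumptions based on the description:

    typedef struct {
        struct list_head list;
        struct task_struct *task;
        int dest_cpu;               /* explicit target, no longer derived
                                       from the task's cpus_allowed mask */
        struct completion done;
    } migration_req_t;

    /* How the migration thread might act on it (sketch): */
    static void handle_request(migration_req_t *req)
    {
        if (!cpu_isset(req->dest_cpu, req->task->cpus_allowed))
            return;     /* stale: a newer set_cpus_allowed() queued its own */
        __move_task(req->task, req->dest_cpu);
    }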
-
Andrew Morton authored
From: Nick Piggin <piggin@cyberone.com.au> John Hawkes described this problem to me: There *is* a small problem in this area, though, that SuSE avoids. "jiffies" gets updated by cpu0. The other CPUs may, over time, get out of sync (and they're initialized on ia64 to start out being out of sync), so there is no guarantee that every CPU will wake up from its timer interrupt and see a "jiffies" value that is exactly last_jiffies+1. Sometimes the jiffies value may be unchanged since the last wakeup. Sometimes the jiffies value may have incremented by 2 (or more, especially if cpu0's interrupts are disabled for long stretches of time). So an algorithm that says, "I'll call load_balance() only when jiffies is *exactly* N" is going to fail on occasion, either by calling load_balance() too often or not often enough. *** I fixed this by adding a last_balance field to struct sched_domain, and working off that.
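A sketch of the per-domain check this describes; last_balance comes from the description, while the interval field and its handling are assumptions:

    /* In the rebalance path of the timer tick, per sched_domain: */
    unsigned long interval = sd->balance_interval;  /* in jiffies (assumed) */

    if (time_after_eq(jiffies, sd->last_balance + interval)) {
        load_balance(this_cpu, this_rq, sd, idle);
        sd->last_balance = jiffies;   /* robust even if jiffies jumped by >1 */
    }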
-
Andrew Morton authored
From: Nick Piggin <piggin@cyberone.com.au> The following patch builds a scheduling description for the i386 architecture using cpu_sibling_map to set up SMT if CONFIG_SCHED_SMT is set. It could be made more fancy and collapse degenerate domains at runtime (ie. 1 sibling per CPU, or 1 NUMA node in the computer). From: Zwane Mwaikambo <zwane@arm.linux.org.uk> This fixes an oops due to cpu_sibling_map being uninitialised when a system with no MP table (most UP boxen) boots a CONFIG_SMT kernel. What also happens is that the cpu_group lists end up not being terminated properly, but this oops kills it first. Patch tested on UP w/o MP table, 2x P2 and UP Xeon w/ no siblings. From: "Martin J. Bligh" <mbligh@aracnet.com>, Nick Piggin <piggin@cyberone.com.au> Change arch_init_sched_domains to use cpu_online_map From: Anton Blanchard <anton@samba.org> Fix build with NR_CPUS > BITS_PER_LONG
-
Andrew Morton authored
From: Nick Piggin <piggin@cyberone.com.au> This is a (somewhat) trivial patch which converts cpu_sibling_map from an array of CPUs to an array of cpumasks. Needed for >2 siblings per package, but it actually can simplify code as it allows the cpu_sibling_map to be set up even when there is 1 sibling per package. Intel wants this; I use it in the next patch to build scheduling domains for the P4 HT. From: Thomas Schlichter <thomas.schlichter@web.de> Build fix From: "Pallipadi, Venkatesh" <venkatesh.pallipadi@intel.com> Fix to handle more than 2 siblings per package.
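The shape of the change, as a sketch (the "before" form is a rough recollection, not quoted from the patch):

    /* Before (roughly): each CPU recorded a single sibling id. */
    /* int cpu_sibling_map[NR_CPUS]; */

    /*
     * After: each CPU carries a full mask of its siblings (itself
     * included), which works for 1, 2 or more siblings per package.
     */
    cpumask_t cpu_sibling_map[NR_CPUS];

    /* e.g. an SMT sched-domain span for a CPU is simply: */
    /* cpumask_t span = cpu_sibling_map[cpu]; */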
-
Andrew Morton authored
From: Nick Piggin <piggin@cyberone.com.au> This patch gets the sched_domain scheduler working better WRT balancing. It's been tested on the NUMAQ. Among other things it changes the way SMT load calculation works so as not to trigger active load balances when it shouldn't. It still has a problem with SMT and NUMA: it will put a task on each sibling in a node before moving tasks to another node. It should probably start moving tasks after each *physical* CPU is filled. To fix that, you need to know "how much CPU power is in this domain?" At the moment we approximate # runqueues == CPU power, and hack around it at the CPU physical domain by counting all sibling runqueues as 1. It isn't hard to work the CPU power out correctly, but once CPU hotplug is in the equation it becomes much more complicated by hotplug events. If anyone is actually interested in getting this fixed, that is.
-
Andrew Morton authored
From: Nick Piggin <piggin@cyberone.com.au> Anton was attempting to make a sched domain topology for his POWER5 and was having some trouble. This patch only includes code which is ifdefed out, but hopefully it will be of some use to implementors.
-
Andrew Morton authored
From: Nick Piggin <piggin@cyberone.com.au>

This is the core sched domains patch. It can handle any number of levels in a scheduling hierarchy, and allows architectures to easily customize how the scheduler behaves. It also provides progressive balancing backoff needed by SGI on their large systems (although they have not yet tested it). It is built on top of (well, uses ideas from) my previous SMP/NUMA work, and gets results very similar to it when using the default scheduling description.

Benchmarks
==========
Martin was seeing, I think, 10-20% better system times in kernbench on the 32-way. I was seeing improvements in dbench, tbench, kernbench, reaim and hackbench on a 16-way NUMAQ. Hackbench in fact had a non-linear element which is all but eliminated. Large improvements in volanomark. Cross-node task migration was decreased in all the above benchmarks, sometimes by a factor of 100!! Cross-CPU migration was also generally decreased. See this post: http://groups.google.com.au/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&frame=right&th=a406c910b30cbac4&seekm=UAdQ.3hj.5%40gated-at.bofh.it#link2

Results on a hyperthreading P4 are equivalent to Ingo's shared runqueues patch (which is a big improvement). Some examples on the 16-way NUMAQ (this is slightly older sched domain code):
http://www.kerneltrap.org/~npiggin/w26/hbench.png
http://www.kerneltrap.org/~npiggin/w26/vmark.html

From: Jes Sorensen <jes@wildopensource.com> Tiny patch to make -mm3 compile on a NUMA box with NR_CPUS > BITS_PER_LONG.

From: "Martin J. Bligh" <mbligh@aracnet.com> Fix a minor nit with the find_busiest_group code. No functional change, but makes the code simpler and clearer. This patch does two things: adds some more expansive comments, and removes this if clause:

    if (*imbalance < SCHED_LOAD_SCALE &&
        max_load - this_load > SCHED_LOAD_SCALE)
            *imbalance = SCHED_LOAD_SCALE;

If we remove the scaling factor, we're basically conditionally doing:

    if (*imbalance < 1)
            *imbalance = 1;

Which is pointless, as the very next thing we do is to remove the scaling factor, rounding up to the nearest integer as we do:

    *imbalance = (*imbalance + SCHED_LOAD_SCALE - 1) >> SCHED_LOAD_SHIFT;

Thus the if statement is redundant, and only makes the code harder to read ;-)

From: Rick Lindsley <ricklind@us.ibm.com> In find_busiest_group(), after we exit the do/while, we select our imbalance. But max_load, avg_load, and this_load are all unsigned, so min(x,y) will make a bad choice if max_load < avg_load < this_load (that is, a choice between two negative [very large] numbers). Unfortunately, there is a bug when max_load never gets changed from zero (look in the loop and think what happens if the only load on the machine is being created by cpu groups of which we are a member). And you have a recipe for some really bogus values for imbalance. Even if you fix the max_load == 0 bug, there will still be times when avg_load - this_load will be negative (thus very large) and you'll make the decision to move stuff when you shouldn't have. This patch allows for this_load to set max_load, which if I understand the logic properly is correct. With this patch applied, the algorithm is *much* more conservative ... maybe *too* conservative, but that's for another round of testing ...

From: Ingo Molnar <mingo@elte.hu> sched-find-busiest-fix
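A sketch of the simplification Martin describes, with the surrounding find_busiest_group() logic omitted:

    /*
     * De-scale the computed imbalance, rounding up.  The round-up already
     * guarantees at least one task will be moved, which is why the extra
     * "if (*imbalance < SCHED_LOAD_SCALE ...)" clamp quoted above is
     * redundant and can be dropped.
     */
    static unsigned long descale_imbalance(unsigned long imbalance)
    {
        return (imbalance + SCHED_LOAD_SCALE - 1) >> SCHED_LOAD_SHIFT;
    }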
-
Andrew Morton authored
From: Nick Piggin <piggin@cyberone.com.au> From: Frank Cornelis <frank.cornelis@elis.ugent.be> In order to get the best possible resolution we need to use NR_CPUS instead of the constant value 10. load is an int, so no need to worry about overflows...
-
Andrew Morton authored
From: Ingo Molnar <mingo@elte.hu> As Nick Piggin <piggin@cyberone.com.au> wrote: it removes the last place where we mess with run_list open-coded.
-
- 09 May, 2004 13 commits
-
-
Linus Torvalds authored
-
Linus Torvalds authored
This should help some laptops where the generic PCI code might otherwise believe that this range is unused. The ACPI IO range is usually not visible as a standard BAR.
-
Andi Kleen authored
Various people hit this in earlier kernels. The x86-64 kernel did not compile without CONFIG_IOMMU_GART in various configurations. Just add the missing symbol and export it. Also export iommu_merge while I am at it.
-
Andi Kleen authored
This fixes a bug in the new machine check handler on x86-64. One nasty part was that when you got an MCE during boot it would not always print it on the screen, but would still panic because it attempted to kill the idle task. This patch does the following:
- Always use KERN_EMERG when printing MCEs.
- Always panic and print on screen before killing the idle loop or init.
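A sketch of the intended behaviour, not the actual handler; the function name and the pid-based idle/init check are assumptions:

    static void mce_report_fatal(unsigned long mcg_status)
    {
        /* Always KERN_EMERG so it reaches the console even during boot. */
        printk(KERN_EMERG "CPU %d: Machine Check Exception: %016lx\n",
               smp_processor_id(), mcg_status);

        /*
         * Never try to kill the idle task (pid 0) or init (pid 1):
         * panic while the message is still on the screen instead.
         */
        if (current->pid <= 1)
            panic("Fatal machine check on idle task or init");
    }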
-
Linus Torvalds authored
Merge bk://bk.arm.linux.org.uk/linux-2.6-rmk into ppc970.osdl.org:/home/torvalds/v2.6/linux
-
Russell King authored
into flint.arm.linux.org.uk:/usr/src/bk/linux-2.6-rmk
-
Marc Singer authored
Patch from Marc Singer. Documentation for the Sharp-LH machines.
-
Marc Singer authored
Patch from Marc Singer. Include files for this updated lh7a40x patch set. The changes in this set from the previous one are mostly cosmetic. The memory macros were reworked in order to be more similar to the other ARM versions. The previous versions produced the same results, but the forms are slightly different.
-
Marc Singer authored
Patch from Marc Singer. Updated change set for the 2.6.5 kernel *and* for the April 8th ARM patch. Also included are changes suggested by Russell that merge several of the files in the mach- directory. I have also endeavored to remove all unnecessary whitespace additions. Note that since I've found the cause of an annoying user-space crash, I believe that this patch is OK. The crash appears to have nothing to do with the system setup.
-
Tony Lindgren authored
Patch from Tony Lindgren. This patch syncs the mainline kernel with the linux-omap tree. The patch contains the following updates:
- Move the virtual IO area to 0xfefb0000 from 0xfffb0000 to fix parts of the IO area overlapping with the ARM Linux reserved memory area
- Add support for the OMAP-730, OMAP-5912 and OMAP-1710 processors
- Reorganize board support
- Add OMAP core detection
This patch requires ARM Linux patch 1844/1 to be applied to compile for OMAP-730 and OMAP-5912.
-
Tony Lindgren authored
Patch from Tony Lindgren. This patch syncs the mainline kernel with the linux-omap tree. The patch contains the following updates:
- Move the virtual IO area to 0xfefb0000 from 0xfffb0000 to fix parts of the IO area overlapping with the ARM Linux reserved memory area
- Add support for the OMAP-730, OMAP-5912 and OMAP-1710 processors
- Reorganize board support
- Add OMAP core detection
This patch requires ARM Linux patch 1844/1 to be applied to compile for OMAP-730 and OMAP-5912.
-
Tony Lindgren authored
Patch from Tony Lindgren. Adds OMAP-730 and OMAP-5910 support.
-
Armin Schindler authored
On IDI module cleanup, the freed card must be removed from the list. Use list_empty() instead of a list_for_each() loop. Thanks, Linus.
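A sketch of the safer pattern; the struct, list and function names here are placeholders, not the driver's actual identifiers:

    struct card {                        /* placeholder for the driver's card type */
        struct list_head list;
        /* ... */
    };
    static LIST_HEAD(card_list);

    static void idi_cleanup_sketch(void)
    {
        /*
         * Pop entries one at a time instead of walking the list with
         * list_for_each() while freeing its members.
         */
        while (!list_empty(&card_list)) {
            struct card *c = list_entry(card_list.next, struct card, list);

            list_del(&c->list);          /* unlink before the card is freed */
            kfree(c);
        }
    }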
-
- 08 May, 2004 8 commits
-
-
Linus Torvalds authored
We don't bother aligning them on a cacheline boundary, since that is totally excessive in some configurations (especially P4s with 128-byte cachelines). Instead, we make the minimum inline string size a bit longer, and re-order a few fields to allow better packing on 64-bit architectures, for better memory utilization.
-
Linus Torvalds authored
Merge bk://kernel.bkbits.net/davem/sparc-2.6 into ppc970.osdl.org:/home/torvalds/v2.6/linux
-
Linus Torvalds authored
Merge bk://kernel.bkbits.net/davem/net-2.6 into ppc970.osdl.org:/home/torvalds/v2.6/linux
-
Andrew Morton authored
I moved this a little too late - we need to run populate_rootfs() before running initcalls because some driver initcalls need to open files for firmware. The populate_rootfs() call is still coming after init_idle(), so it won't knock the scheduler over.
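A simplified sketch of the intended ordering, not the literal init/main.c code:

    static void __init boot_ordering_sketch(void)
    {
        init_idle(current, smp_processor_id()); /* scheduler is usable first    */
        populate_rootfs();                      /* unpack initramfs before ...  */
        do_basic_setup();                       /* ... initcalls run, so driver
                                                   initcalls can open firmware
                                                   files from the rootfs        */
    }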
-
David S. Miller authored
-
David S. Miller authored
-
Stephen Hemminger authored
This is a version of Binary Increase Control (BIC) TCP developed by NCSU. It is yet another TCP congestion control algorithm for handling big fat pipes. For normal-size congestion windows it behaves the same as existing TCP Reno, but when the window is large it uses additive increase to ensure fairness and when the window is small it uses binary search increase. For more details see the BIC TCP web page http://www.csc.ncsu.edu/faculty/rhee/export/bitcp/ The original code was for web100 (2.4); this version is pretty much the same but targeted at 2.6 with fewer sysctl parameters and more constants. I don't have a real high-speed long-haul network to test on, but when running over 1G links with delays, the performance is more stable (i.e. tests are repeatable) and as fast as existing Reno.
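A rough sketch of the BIC window-growth idea described above; the names and constants are illustrative assumptions, not the kernel implementation:

    /* Compute the next congestion window given the last known maximum. */
    static u32 bic_next_cwnd(u32 cwnd, u32 last_max_cwnd, u32 max_increment)
    {
        u32 inc;

        if (cwnd < last_max_cwnd)
            inc = (last_max_cwnd - cwnd) / 2;   /* binary search increase   */
        else
            inc = cwnd - last_max_cwnd;         /* probing beyond the old max */

        if (inc > max_increment)
            inc = max_increment;                /* cap => additive increase */
        if (inc == 0)
            inc = 1;                            /* degenerates to Reno-like +1 */

        return cwnd + inc;
    }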
-
Sridhar Samudrala authored
Avoid the use of sizeof() and pointer arithmetic to get to the end of the sctp_cookie structure. Instead, use the offset of the last element, peer_init, which is a zero-sized array.
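The idea, as a self-contained sketch with a simplified stand-in structure rather than the real sctp_cookie:

    #include <stddef.h>   /* offsetof */

    struct sctp_cookie_sketch {
        unsigned char fixed_part_example[32];   /* stand-in for the fixed fields */
        unsigned char peer_init[0];             /* zero-sized array: the
                                                   variable-length INIT chunk
                                                   begins here */
    };

    /* End of the fixed portion, without sizeof() plus pointer arithmetic: */
    size_t cookie_fixed_len = offsetof(struct sctp_cookie_sketch, peer_init);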
-