kernel/sched.c · a2ea2d4ce970a2a59b3d3a7ef5f69d314db69298 · Kirill Smelkov / linux

[PATCH] sched: improve wakeup-affinity · a2ea2d4c
Nick Piggin authored Jun 04, 2004
David Mosberger noticed bw_pipe was way down on sched-domains kernels on
SMP systems.

That is due to two things: first, the previous wake-affine logic would
*always* move a pipe wakee onto the waker's CPU.  With the scheduler
rework, this was toned down a lot (but extended to all types of wakeups).

One of the ways this was damped was with the logic: don't move the wakee if
its CPU is relatively idle compared to the waker's CPU.  Without this, some
workloads would pile everything up onto a few CPUs and get lots of idle
time.

However, the fix was a bit of a blunt hack: if the wakee runqueue was below
50% busy, and the waker's was above 50% busy, we wouldn't do the move.  I
think a better way to capture it is what this patch does: if the wakee
runqueue is below 100% busy, and the sum of the two runqueues' loads is
above 100% busy, and the wakee runqueue is less busy than the waker
runqueue (i.e. CPU utilisation would drop if we do the move), then we don't
do the move.
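
As a rough illustration of the two checks described above (a sketch only,
not the code in kernel/sched.c: the function names and the 0..100 load
scale are assumptions made for this example), the old and new conditions
for leaving the wakee where it is can be written as:

```c
#include <stdbool.h>

/* Assumed scale for this sketch: 100 means one CPU is "100% busy". */
#define FULL_LOAD 100UL

/*
 * Old heuristic as described in the text: leave the wakee alone when
 * its runqueue is below 50% busy and the waker's is above 50% busy.
 * (Hypothetical helper, not a kernel function.)
 */
static bool old_skip_affine_move(unsigned long wakee_load,
				 unsigned long waker_load)
{
	return wakee_load < FULL_LOAD / 2 && waker_load > FULL_LOAD / 2;
}

/*
 * New heuristic as described in the text: leave the wakee alone when
 * its runqueue is below 100% busy, the two loads together exceed 100%,
 * and the wakee's runqueue is the less busy of the two.
 * (Hypothetical helper, not a kernel function.)
 */
static bool new_skip_affine_move(unsigned long wakee_load,
				 unsigned long waker_load)
{
	return wakee_load < FULL_LOAD &&
	       wakee_load + waker_load > FULL_LOAD &&
	       wakee_load < waker_load;
}
```

For example, with the wakee's runqueue at 40% and the waker's at 60%, the
old check refuses the move, while the new check allows it because the
combined load still fits on one CPU.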

Second, after I fixed this, I found things were still getting bounced around
quite a bit.  The reason is that we were attempting very aggressive idle
balancing in order to cut down idle time in a dbt2-pgsql workload, which is
particularly sensitive to idle.

After having Mark Wong (markw@osdl.org) retest this load with this patch,
it looks like we don't need to be so aggressive.  I'm glad to be rid of
this because it never sat too well with me.  We should see slightly lower
cost of schedule and slightly improved cache impact with this change too.

Mark said:
---
        This looks pretty good:

        metric  kernel
        2334    2.6.7-rc2
        2298    2.6.7-rc2-mm2
        2329    2.6.7-rc2-mm2-sched-more-wakeaffine
---
i.e. within the noise.

David said:
---
        Oooh, me likeee!

        Host                OS  Pipe AF
                                     UNIX
        --------- ------------- ---- ----
        caldera.h   Linux 2.6.6 3424 2057       (plain 2.6.6)
        caldera.h Linux 2.6.7-r 333. 1402       (original 2.6.7-rc1)
        caldera.h Linux 2.6.7-r 3086 4301       (2.6.7-rc1 with your patch)

        Pipe-bandwidth is still down about 10% but that may be due to
        unrelated changes (or perhaps warmup effects?).  The AF UNIX bandwidth
        is just mindboggling.  Moreover, with your patch 2.6.7-rc1 shows
        better context-switch times and lower communication latencies (more
        like the numbers you're getting on UP).

        So it seems like the overall balance of keeping things on the same CPU
        vs. distributing them across CPUs is improved.
---

I also ran some tests on the NUMAQ.  kernbench, dbench, hackbench and reaim
were much the same.  tbench improved, very much so when clients < NR_CPU.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>