    [PATCH] sched: improve wakeup-affinity · a2ea2d4c
    Nick Piggin authored
    David Mosberger noticed bw_pipe was way down on sched-domains kernels on
    SMP systems.
    
    That is due to two things: first, the previous wake-affine logic would
    *always* move a pipe wakee onto the waker's CPU.  With the scheduler
    rework, this was toned down a lot (but extended to all types of wakeups).
    
    One of the ways this was damped was with the logic: don't move the wakee if
    its CPU is relatively idle compared to the waker's CPU.  Without this, some
    workloads would pile everything up onto a few CPUs and get lots of idle
    time.
    
    However, the fix was a bit of a blunt hack: if the wakee runqueue was below
    50% busy, and the waker's was above 50% busy, we wouldn't do the move.  I
    think a better way to capture it is what this patch does: if the wakee
    runqueue is below 100% busy, and the sum of the two runqueues' loads is
    above 100% busy, and the wakee runqueue is less busy than the waker
    runqueue (i.e. CPU utilisation would drop if we do the move), then we don't
    do the move.
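
    The two conditions can be compared side by side in a minimal sketch.
    This is not the code from sched.c: the function names and LOAD_FULL
    (an assumed fixed-point value standing for one 100% busy runqueue) are
    hypothetical and only restate the description above.

```c
#include <stdbool.h>

/* Assumption: load is fixed-point, and LOAD_FULL means "100% busy". */
#define LOAD_FULL 128UL

/*
 * Old heuristic as described: refuse the wake-affine move when the wakee's
 * runqueue is below 50% busy and the waker's is above 50% busy.
 */
static bool old_skip_affine_move(unsigned long wakee_load,
                                 unsigned long waker_load)
{
        return wakee_load < LOAD_FULL / 2 && waker_load > LOAD_FULL / 2;
}

/*
 * New heuristic as described: refuse the move when the wakee's runqueue is
 * below 100% busy, the two loads together exceed 100%, and the wakee's
 * runqueue is the less busy of the two.
 */
static bool new_skip_affine_move(unsigned long wakee_load,
                                 unsigned long waker_load)
{
        return wakee_load < LOAD_FULL &&
               wakee_load + waker_load > LOAD_FULL &&
               wakee_load < waker_load;
}
```

    Under these assumptions, loads of (wakee 10, waker 70) block the move
    with the old test but allow it with the new one, since both tasks fit
    on one CPU.  Loads of (wakee 70, waker 100) are the reverse: the old
    test allows the move, the new one blocks it.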
    
    After I fixed this, I found things were still getting bounced around quite
    a bit.  The reason is that we were attempting very aggressive idle
    balancing in order to cut down idle time in a dbt2-pgsql workload, which is
    particularly sensitive to idle.
    
    After having Mark Wong (markw@osdl.org) retest this load with this patch,
    it looks like we don't need to be so aggressive.  I'm glad to be rid of
    this because it never sat too well with me.  We should see slightly lower
    cost of schedule and slightly improved cache impact with this change too.
    
    Mark said:
    ---
            This looks pretty good:
    
            metric  kernel
            2334    2.6.7-rc2
            2298    2.6.7-rc2-mm2
            2329    2.6.7-rc2-mm2-sched-more-wakeaffine
    ---
    i.e. within the noise.
    
    David said:
    ---
            Oooh, me likeee!
    
            Host                OS  Pipe AF
                                         UNIX
            --------- ------------- ---- ----
            caldera.h   Linux 2.6.6 3424 2057       (plain 2.6.6)
            caldera.h Linux 2.6.7-r 333. 1402       (original 2.6.7-rc1)
            caldera.h Linux 2.6.7-r 3086 4301       (2.6.7-rc1 with your patch)
    
            Pipe-bandwidth is still down about 10% but that may be due to
            unrelated changes (or perhaps warmup effects?).  The AF UNIX bandwidth
            is just mindboggling.  Moreover, with your patch 2.6.7-rc1 shows
            better context-switch times and lower communication latencies (more
            like the numbers you're getting on UP).
    
            So it seems like the overall balance of keeping things on the same CPU
            vs. distributing them across CPUs is improved.
    ---
    
    I also ran some tests on the NUMAQ. kernbench, dbench, hackbench, reaim
    were much the same. tbench was improved, very much so when clients < NR_CPU.
    
    Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>