1. 02 Dec, 2009 5 commits
    • Hidetoshi Seto's avatar
      sched, cputime: Introduce thread_group_times() · 0cf55e1e
      Hidetoshi Seto authored
      This is a real fix for problem of utime/stime values decreasing
      described in the thread:
      
         http://lkml.org/lkml/2009/11/3/522
      
      Now cputime is accounted in the following way:
      
       - {u,s}time in task_struct are increased every time when the thread
         is interrupted by a tick (timer interrupt).
      
       - When a thread exits, its {u,s}time are added to signal->{u,s}time,
         after adjusted by task_times().
      
       - When all threads in a thread_group exits, accumulated {u,s}time
         (and also c{u,s}time) in signal struct are added to c{u,s}time
         in signal struct of the group's parent.
      
      So {u,s}time in task struct are "raw" tick count, while
      {u,s}time and c{u,s}time in signal struct are "adjusted" values.
      
      And accounted values are used by:
      
       - task_times(), to get cputime of a thread:
         This function returns adjusted values that originates from raw
         {u,s}time and scaled by sum_exec_runtime that accounted by CFS.
      
       - thread_group_cputime(), to get cputime of a thread group:
         This function returns sum of all {u,s}time of living threads in
         the group, plus {u,s}time in the signal struct that is sum of
         adjusted cputimes of all exited threads belonged to the group.
      
      The problem is the return value of thread_group_cputime(),
      because it is mixed sum of "raw" value and "adjusted" value:
      
        group's {u,s}time = foreach(thread){{u,s}time} + exited({u,s}time)
      
      This misbehavior can break {u,s}time monotonicity.
      Assume that if there is a thread that have raw values greater
      than adjusted values (e.g. interrupted by 1000Hz ticks 50 times
      but only runs 45ms) and if it exits, cputime will decrease (e.g.
      -5ms).
      
      To fix this, we could do:
      
        group's {u,s}time = foreach(t){task_times(t)} + exited({u,s}time)
      
      But task_times() contains hard divisions, so applying it for
      every thread should be avoided.
      
      This patch fixes the above problem in the following way:
      
       - Modify thread's exit (= __exit_signal()) not to use task_times().
         It means {u,s}time in signal struct accumulates raw values instead
         of adjusted values.  As the result it makes thread_group_cputime()
         to return pure sum of "raw" values.
      
       - Introduce a new function thread_group_times(*task, *utime, *stime)
         that converts "raw" values of thread_group_cputime() to "adjusted"
         values, in same calculation procedure as task_times().
      
       - Modify group's exit (= wait_task_zombie()) to use this introduced
         thread_group_times().  It make c{u,s}time in signal struct to
         have adjusted values like before this patch.
      
       - Replace some thread_group_cputime() by thread_group_times().
         This replacements are only applied where conveys the "adjusted"
         cputime to users, and where already uses task_times() near by it.
         (i.e. sys_times(), getrusage(), and /proc/<PID>/stat.)
      
      This patch have a positive side effect:
      
       - Before this patch, if a group contains many short-life threads
         (e.g. runs 0.9ms and not interrupted by ticks), the group's
         cputime could be invisible since thread's cputime was accumulated
         after adjusted: imagine adjustment function as adj(ticks, runtime),
           {adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0.
         After this patch it will not happen because the adjustment is
         applied after accumulated.
      
      v2:
       - remove if()s, put new variables into signal_struct.
      Signed-off-by: default avatarHidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Acked-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: Spencer Candland <spencer@bluehost.com>
      Cc: Americo Wang <xiyou.wangcong@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      LKML-Reference: <4B162517.8040909@jp.fujitsu.com>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      0cf55e1e
    • Hidetoshi Seto's avatar
      sched, cputime: Cleanups related to task_times() · d99ca3b9
      Hidetoshi Seto authored
      - Remove if({u,s}t)s because no one call it with NULL now.
      - Use cputime_{add,sub}().
      - Add ifndef-endif for prev_{u,s}time since they are used
        only when !VIRT_CPU_ACCOUNTING.
      Signed-off-by: default avatarHidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Spencer Candland <spencer@bluehost.com>
      Cc: Americo Wang <xiyou.wangcong@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      LKML-Reference: <4B1624C7.7040302@jp.fujitsu.com>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      d99ca3b9
    • Tim Blechmann's avatar
      Revert "sched, x86: Optimize branch hint in __switch_to()" · be8147e6
      Tim Blechmann authored
      This reverts commit a3a1de0c.
      
      Commit 8ec6993d cleared the es
      and ds selectors, so the original branch hints are correct now.
      Therefore the branch hint doesn't need to be removed.
      Signed-off-by: default avatarTim Blechmann <tim@klingt.org>
      LKML-Reference: <4B16503A.8030508@klingt.org>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      be8147e6
    • Rusty Russell's avatar
      sched: Fix isolcpus boot option · bdddd296
      Rusty Russell authored
      Anton Blanchard wrote:
      
      > We allocate and zero cpu_isolated_map after the isolcpus
      > __setup option has run. This means cpu_isolated_map always
      > ends up empty and if CPUMASK_OFFSTACK is enabled we write to a
      > cpumask that hasn't been allocated.
      
      I introduced this regression in 49557e62 (sched: Fix
      boot crash by zalloc()ing most of the cpu masks).
      
      Use the bootmem allocator if they set isolcpus=, otherwise
      allocate and zero like normal.
      Reported-by: default avatarAnton Blanchard <anton@samba.org>
      Signed-off-by: default avatarRusty Russell <rusty@rustcorp.com.au>
      Cc: peterz@infradead.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: <stable@kernel.org>
      LKML-Reference: <200912021409.17013.rusty@rustcorp.com.au>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      Tested-by: default avatarAnton Blanchard <anton@samba.org>
      bdddd296
    • Tejun Heo's avatar
      sched: Revert 498657a4 · 8592e648
      Tejun Heo authored
      498657a4 incorrectly assumed
      that preempt wasn't disabled around context_switch() and thus
      was fixing imaginary problem.  It also broke KVM because it
      depended on ->sched_in() to be called with irq enabled so that
      it can do smp calls from there.
      
      Revert the incorrect commit and add comment describing different
      contexts under with the two callbacks are invoked.
      
      Avi: spotted transposed in/out in the added comment.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarAvi Kivity <avi@redhat.com>
      Cc: peterz@infradead.org
      Cc: efault@gmx.de
      Cc: rusty@rustcorp.com.au
      LKML-Reference: <1259726212-30259-2-git-send-email-tj@kernel.org>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      8592e648
  2. 26 Nov, 2009 5 commits
    • Hidetoshi Seto's avatar
      sched, time: Define nsecs_to_jiffies() · b7b20df9
      Hidetoshi Seto authored
      Use of msecs_to_jiffies() for nsecs_to_cputime() have some
      problems:
      
       - The type of msecs_to_jiffies()'s argument is unsigned int, so
         it cannot convert msecs greater than UINT_MAX = about 49.7 days.
      
       - msecs_to_jiffies() returns MAX_JIFFY_OFFSET if MSB of argument
         is set, assuming that input was negative value.  So it cannot
         convert msecs greater than INT_MAX = about 24.8 days too.
      
      This patch defines a new function nsecs_to_jiffies() that can
      deal greater values, and that can deal all incoming values as
      unsigned.
      Signed-off-by: default avatarHidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Acked-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Spencer Candland <spencer@bluehost.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Amrico Wang <xiyou.wangcong@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: John Stultz <johnstul@linux.vnet.ibm.com>
      LKML-Reference: <4B0E16E7.5070307@jp.fujitsu.com>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      b7b20df9
    • Hidetoshi Seto's avatar
      sched: Remove task_{u,s,g}time() · d5b7c78e
      Hidetoshi Seto authored
      Now all task_{u,s}time() pairs are replaced by task_times().
      And task_gtime() is too simple to be an inline function.
      
      Cleanup them all.
      Signed-off-by: default avatarHidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Acked-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Spencer Candland <spencer@bluehost.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Americo Wang <xiyou.wangcong@gmail.com>
      LKML-Reference: <4B0E16D1.70902@jp.fujitsu.com>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      d5b7c78e
    • Hidetoshi Seto's avatar
      sched: Introduce task_times() to replace task_{u,s}time() pair · d180c5bc
      Hidetoshi Seto authored
      Functions task_{u,s}time() are called in pair in almost all
      cases.  However task_stime() is implemented to call task_utime()
      from its inside, so such paired calls run task_utime() twice.
      
      It means we do heavy divisions (div_u64 + do_div) twice to get
      utime and stime which can be obtained at same time by one set
      of divisions.
      
      This patch introduces a function task_times(*tsk, *utime,
      *stime) to retrieve utime and stime at once in better, optimized
      way.
      Signed-off-by: default avatarHidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Acked-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Spencer Candland <spencer@bluehost.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Americo Wang <xiyou.wangcong@gmail.com>
      LKML-Reference: <4B0E16AE.906@jp.fujitsu.com>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      d180c5bc
    • Ingo Molnar's avatar
      Merge branch 'sched/urgent' into sched/core · 16bc67ed
      Ingo Molnar authored
      Merge reason: Pick up fixes that did not make it into .32.0
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      16bc67ed
    • Mike Travis's avatar
      sched: Limit the number of scheduler debug messages · f6630114
      Mike Travis authored
      Remove the verbose scheduler debug messages unless kernel
      parameter "sched_debug" set.  /proc/sched_debug unchanged.
      Signed-off-by: default avatarMike Travis <travis@sgi.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Roland Dreier <rdreier@cisco.com>
      Cc: Randy Dunlap <rdunlap@xenotime.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Greg Kroah-Hartman <gregkh@suse.de>
      Cc: Yinghai Lu <yhlu.kernel@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Cc: Jack Steiner <steiner@sgi.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      LKML-Reference: <20091118002221.489305000@alcatraz.americas.sgi.com>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      f6630114
  3. 25 Nov, 2009 1 commit
  4. 24 Nov, 2009 3 commits
    • Tim Blechmann's avatar
      sched, x86: Optimize branch hint in __switch_to() · a3a1de0c
      Tim Blechmann authored
      Branch hint profiling on my nehalem machine showed 96%
      incorrect branch hints:
      
        6548732 174664120  96 __switch_to                    process_64.c
          406
        6548745 174565593  96 __switch_to                    process_64.c
          410
      Signed-off-by: default avatarTim Blechmann <tim@klingt.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      LKML-Reference: <4B0BBB93.3080307@klingt.org>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      a3a1de0c
    • Tim Blechmann's avatar
      sched: Optimize branch hint in context_switch() · 710390d9
      Tim Blechmann authored
      Branch hint profiling on my nehalem machine showed over 90%
      incorrect branch hints:
      
        10420275 170645395  94 context_switch                 sched.c
         3043
        10408421 171098521  94 context_switch                 sched.c
         3050
      Signed-off-by: default avatarTim Blechmann <tim@klingt.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      LKML-Reference: <4B0BBB9F.6080304@klingt.org>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      710390d9
    • Tim Blechmann's avatar
      sched: Optimize branch hint in pick_next_task_fair() · 36ace27e
      Tim Blechmann authored
      Branch hint profiling on my nehalem machine showed 90%
      incorrect branch hints:
      
        15728471 158903754  90 pick_next_task_fair
        sched_fair.c    1555
      Signed-off-by: default avatarTim Blechmann <tim@klingt.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      LKML-Reference: <4B0BBBB1.2050100@klingt.org>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      36ace27e
  5. 23 Nov, 2009 1 commit
    • Jan Blunck's avatar
      sched_feat_write(): Update ppos instead of file->f_pos · 42994724
      Jan Blunck authored
      sched_feat_write() should update ppos instead of file->f_pos.
      
      (This reduces some BKL dependencies of this code.)
      Signed-off-by: default avatarJan Blunck <jblunck@suse.de>
      Cc: jkacur@redhat.com
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jamie Lokier <jamie@shareable.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      LKML-Reference: <1258735245-25826-8-git-send-email-jblunck@suse.de>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      42994724
  6. 16 Nov, 2009 1 commit
  7. 15 Nov, 2009 1 commit
    • Tejun Heo's avatar
      sched, kvm: Fix race condition involving sched_in_preempt_notifers · 498657a4
      Tejun Heo authored
      In finish_task_switch(), fire_sched_in_preempt_notifiers() is
      called after finish_lock_switch().
      
      However, depending on architecture, preemption can be enabled after
      finish_lock_switch() which breaks the semantics of preempt
      notifiers.
      
      So move it before finish_arch_switch(). This also makes the in-
      notifiers symmetric to out- notifiers in terms of locking - now
      both are called under rq lock.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarAvi Kivity <avi@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <4AFD2801.7020900@kernel.org>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      498657a4
  8. 13 Nov, 2009 2 commits
  9. 12 Nov, 2009 2 commits
    • Hidetoshi Seto's avatar
      sched: Fix granularity of task_u/stime() · 761b1d26
      Hidetoshi Seto authored
      Originally task_s/utime() were designed to return clock_t but
      later changed to return cputime_t by following commit:
      
        commit efe567fc
        Author: Christian Borntraeger <borntraeger@de.ibm.com>
        Date:   Thu Aug 23 15:18:02 2007 +0200
      
      It only changed the type of return value, but not the
      implementation. As the result the granularity of task_s/utime()
      is still that of clock_t, not that of cputime_t.
      
      So using task_s/utime() in __exit_signal() makes values
      accumulated to the signal struct to be rounded and coarse
      grained.
      
      This patch removes casts to clock_t in task_u/stime(), to keep
      granularity of cputime_t over the calculation.
      
      v2:
        Use div_u64() to avoid error "undefined reference to `__udivdi3`"
        on some 32bit systems.
      Signed-off-by: default avatarHidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Acked-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: xiyou.wangcong@gmail.com
      Cc: Spencer Candland <spencer@bluehost.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      LKML-Reference: <4AFB9029.9000208@jp.fujitsu.com>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      761b1d26
    • Mike Galbraith's avatar
      sched: Fix/add missing update_rq_clock() calls · 055a0086
      Mike Galbraith authored
      kthread_bind(), migrate_task() and sched_fork were missing
      updates, and try_to_wake_up() was updating after having already
      used the stale clock.
      
      Aside from preventing potential latency hits, there' a side
      benefit in that early boot printk time stamps become monotonic.
      Signed-off-by: default avatarMike Galbraith <efault@gmx.de>
      Acked-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1258020464.6491.2.camel@marge.simson.net>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      LKML-Reference: <new-submission>
      055a0086
  10. 11 Nov, 2009 19 commits