1. 14 Aug, 2011 10 commits
    • sched/cpupri: Remove the vec->lock · c92211d9
      Steven Rostedt authored
      The cpupri vec->lock has been showing up as a top source of contention
      lately. This is because the RT push/pull logic takes an
      aggressive approach to migrating RT tasks. The cpupri logic is
      in place to improve the performance of the push/pull when dealing
      with machines that have a large number of CPUs.
      
      The problem though is a vec->lock is required, where a vec is a
      global per RT priority structure. That is, if there are lots of
      RT tasks at the same priority, every time they are added or removed
      from the RT queue, this global vec->lock is taken. Now that more
      kernel threads are becoming RT (RCU boost and threaded interrupts)
      this is becoming much more of an issue.
      
      There are two variables that are being synced by the vec->lock:
      the cpupri bitmask and the vec->count. The cpupri bitmask
      has one bit per priority. If an RT priority vec has a process queued,
      then the vec->count is > 0 and the cpupri bitmask bit is set for that
      RT priority.
      
      If the cpupri bitmask gets out of sync with the vec->count, we could
      end up pushing a low priority RT task to a high priority queue.
      That RT task, which could have run immediately, could then be queued on a
      run queue with a higher priority task indefinitely.
      
      The solution is not to use the cpupri bitmask and just look at the
      vec->count directly when doing a pull. The cpupri bitmask is just
      a fast way to scan the RT priorities when a pull is made. If, instead
      of using the bitmask, we just examine all RT priorities and
      look at the vec->counts, we can eliminate the vec->lock. The
      scan of RT tasks is to find a run queue that we can push an RT task
      to, and we do not push to a high priority queue, thus the scan only
      needs to go from 1 to RT task->prio, and not all 100 RT priorities.
      
      The push algorithm, which does the scan of RT priorities (and
      scan of the bitmask) only happens when we have an overloaded RT run
      queue (more than one RT task queued). The grabbing of the vec->lock
      happens every time any RT task is queued or dequeued on the run
      queue for that priority. The slowing down of the scan by not using
      a bitmask is negligible compared to the speed up from removing the vec->lock
      contention and replacing it with an atomic counter and memory barrier.
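
      The lock-free scheme described above can be sketched in user space
      as follows. This is only an illustration under stated assumptions,
      not the kernel code: the names (prio_vec, vec_add_cpu,
      vec_remove_cpu, find_lower_prio_cpus), the array size, and the
      convention that a lower index means a lower priority are all
      hypothetical, and C11 sequentially consistent atomics stand in for
      the kernel's atomic counter and memory barrier.

```c
#include <assert.h>
#include <stdatomic.h>

#define NR_PRIO 102 /* hypothetical number of priority levels */

/* One entry per priority level; there is no lock protecting it. */
struct prio_vec {
    atomic_int   count; /* number of CPUs currently at this priority */
    atomic_ulong cpus;  /* one bit per CPU at this priority */
};

static struct prio_vec vecs[NR_PRIO];

/* Publish the CPU bit first, then bump the counter. */
static void vec_add_cpu(int prio, int cpu)
{
    assert(prio >= 0 && prio < NR_PRIO);
    atomic_fetch_or(&vecs[prio].cpus, 1UL << cpu);
    atomic_fetch_add(&vecs[prio].count, 1);
}

/* Drop the counter first, then clear the CPU bit. */
static void vec_remove_cpu(int prio, int cpu)
{
    assert(prio >= 0 && prio < NR_PRIO);
    atomic_fetch_sub(&vecs[prio].count, 1);
    atomic_fetch_and(&vecs[prio].cpus, ~(1UL << cpu));
}

/*
 * Scan only the levels below task_prio, looking at the counters
 * directly instead of a per-priority bitmask. Returns the level found
 * (storing its CPU bits in *cpus_out), or -1 if there is none.
 */
static int find_lower_prio_cpus(int task_prio, unsigned long *cpus_out)
{
    for (int idx = 0; idx < task_prio; idx++) {
        unsigned long cpus;

        if (atomic_load(&vecs[idx].count) == 0)
            continue;
        cpus = atomic_load(&vecs[idx].cpus);
        if (cpus == 0)
            continue; /* raced with a concurrent removal */
        *cpus_out = cpus;
        return idx;
    }
    return -1;
}
```

      In this sketch the queue/dequeue path touches only two atomic
      variables, while the cost of walking the counters is paid only by
      the scan, which is the trade-off the text above argues for.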
      
      To prove this, I wrote a patch that times both the loop and the code
      that grabs the vec->locks. I passed the patches to various people
      (and companies) to test and show the results. I let everyone choose
      their own load to test, giving different loads on the system,
      for various different setups.
      
      Here's some of the results: (snipping to a few CPUs to not make
      this change log huge, but the results were consistent across
      the entire system).
      
      System 1 (24 CPUs)
      
      Before patch:
      CPU:    Name    Count   Max     Min     Average Total
      ----    ----    -----   ---     ---     ------- -----
      [...]
      cpu 20: loop    3057    1.766   0.061   0.642   1963.170
              vec     6782949 90.469  0.089   0.414   2811760.503
      cpu 21: loop    2617    1.723   0.062   0.641   1679.074
              vec     6782810 90.499  0.089   0.291   1978499.900
      cpu 22: loop    2212    1.863   0.063   0.699   1547.160
              vec     6767244 85.685  0.089   0.435   2949676.898
      cpu 23: loop    2320    2.013   0.062   0.594   1380.265
              vec     6781694 87.923  0.088   0.431   2928538.224
      
      After patch:
      cpu 20: loop    2078    1.579   0.061   0.533   1108.006
              vec     6164555 5.704   0.060   0.143   885185.809
      cpu 21: loop    2268    1.712   0.065   0.575   1305.248
              vec     6153376 5.558   0.060   0.187   1154960.469
      cpu 22: loop    1542    1.639   0.095   0.533   823.249
              vec     6156510 5.720   0.060   0.190   1172727.232
      cpu 23: loop    1650    1.733   0.068   0.545   900.781
              vec     6170784 5.533   0.060   0.167   1034287.953
      
      All times are in microseconds. The 'loop' is the amount of time spent
      doing the loop across the priorities (before patch uses bitmask).
      The 'vec' is the amount of time in the code that requires grabbing
      the vec->lock. The after-patch measurement encompasses the same
      code, just without the vec->lock.
      
      Amazingly the loop code even went down on average. The vec code went
      from .5 down to .18, that's more than half the time spent!
      
      Note, more than one test was run, but they all had the same results.
      
      System 2 (64 CPUs)
      
      Before patch:
      CPU:    Name    Count   Max     Min     Average Total
      ----    ----    -----   ---     ---     ------- -----
      cpu 60: loop    0       0       0       0       0
              vec     5410840 277.954 0.084   0.782   4232895.727
      cpu 61: loop    0       0       0       0       0
              vec     4915648 188.399 0.084   0.570   2803220.301
      cpu 62: loop    0       0       0       0       0
              vec     5356076 276.417 0.085   0.786   4214544.548
      cpu 63: loop    0       0       0       0       0
              vec     4891837 170.531 0.085   0.799   3910948.833
      
      After patch:
      cpu 60: loop    0       0       0       0       0
              vec     5365118 5.080   0.021   0.063   340490.267
      cpu 61: loop    0       0       0       0       0
              vec     4898590 1.757   0.019   0.071   347903.615
      cpu 62: loop    0       0       0       0       0
              vec     5737130 3.067   0.021   0.119   687108.734
      cpu 63: loop    0       0       0       0       0
              vec     4903228 1.822   0.021   0.071   348506.477
      
      The test run during the measurement did not have any (very few,
      from other CPUs) RT tasks pushing. But this shows that it helped
      out tremendously with the contention, as the contention happens
      because the vec->lock is taken only on queuing at an RT priority,
      and different CPUs that queue tasks at the same priority will
      have contention.
      
      I tested on my own 4 CPU machine with the following results:
      
      Before patch:
      CPU:    Name    Count   Max     Min     Average Total
      ----    ----    -----   ---     ---     ------- -----
      cpu 0:  loop    2377    1.489   0.158   0.588   1398.395
              vec     4484    770.146 2.301   4.396   19711.755
      cpu 1:  loop    2169    1.962   0.160   0.576   1250.110
              vec     4425    152.769 2.297   4.030   17834.228
      cpu 2:  loop    2324    1.749   0.155   0.559   1299.799
              vec     4368    779.632 2.325   4.665   20379.268
      cpu 3:  loop    2325    1.629   0.157   0.561   1306.113
              vec     4650    408.782 2.394   4.348   20222.577
      
      After patch:
      CPU:    Name    Count   Max     Min     Average Total
      ----    ----    -----   ---     ---     ------- -----
      cpu 0:  loop    2121    1.616   0.113   0.636   1349.189
              vec     4303    1.151   0.225   0.421   1811.966
      cpu 1:  loop    2130    1.638   0.178   0.644   1372.927
              vec     4627    1.379   0.235   0.428   1983.648
      cpu 2:  loop    2056    1.464   0.165   0.637   1310.141
              vec     4471    1.311   0.217   0.433   1937.927
      cpu 3:  loop    2154    1.481   0.162   0.601   1295.083
              vec     4236    1.253   0.230   0.425   1803.008
      
      This was running my migrate.c code that can be found at:
      http://lwn.net/Articles/425763/
      
      The migrate code does stress the RT tasks a bit. This shows that
      the loop did increase a little after the patch, but not by much.
      The vec code dropped dramatically. From 4.3us down to .42us.
      That's a 10x improvement!
      Tested-by: Mike Galbraith <mgalbraith@suse.de>
      Tested-by: Luis Claudio R. Gonçalves <lgoncalv@redhat.com>
      Tested-by: Matthew Hank Sabins <msabins@linux.vnet.ibm.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Reviewed-by: Gregory Haskins <gregory.haskins@gmail.com>
      Acked-by: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Chris Mason <chris.mason@oracle.com>
      Link: http://lkml.kernel.org/r/1312317372.18583.101.camel@gandalf.stny.rr.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Use pushable_tasks to determine next highest prio · 5181f4a4
      Steven Rostedt authored
      Hillf Danton proposed a patch (see link) that cleaned up the
      sched_rt code that calculates the priority of the next highest priority
      task to be used in finding run queues to pull from.
      
      His patch removed the calculation of the next prio and just used the current
      prio when determining if we should examine a run queue to pull from. The problem
      with his patch was that it caused more false checks, because we check a run
      queue for pushable tasks if the current priority of that run queue is higher
      in priority than the task about to run on our run queue. But after grabbing
      the locks and doing the real check, we may find that there is no
      higher prio task to pull. Thus the locks were taken with nothing to
      do.
      
      I added some trace_printks() to record when and how many times the run queue
      locks were taken to check for pullable tasks, compared to how many times we
      pulled a task.
      
      With the current method, it was:
      
        3806 locks taken vs 2812 pulled tasks
      
      With Hillf's patch:
      
        6728 locks taken vs 2804 pulled tasks
      
      The number of times locks were taken to pull a task almost doubled, with
      no increase in the number of tasks pulled.
      
      But his patch did get me thinking. When we look at the priority of the highest
      task to consider taking the locks to do a pull, a failure to pull can be one
      of the following: (in order of most likely)
      
       o RT task was pushed off already between the check and taking the lock
       o Waiting RT task can not be migrated
       o RT task's CPU affinity does not include the target run queue's CPU
       o RT task's priority changed between the check and taking the lock
      
      And with Hillf's patch, the thing that caused most of the failures was that
      the RT task to pull was not at the right priority to pull (not greater than
      the current RT task priority on the target run queue).
      
      Most of the above cases we can't help. But the current method does not check
      if the next highest prio RT task can be migrated or not, and if it can not,
      we still grab the locks to do the test (we don't find out about this fact until
      after we have the locks). I thought about this case, and realized that the
      pushable task plist that is maintained only holds RT tasks that can migrate.
      If we move the calculating of the next highest prio task from the inc/dec_rt_task()
      functions into the queuing of the pushable tasks, then we only measure the
      priorities of those tasks that we push, and we get this basically for free.
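
      The idea in the paragraph above can be illustrated with a small
      user-space sketch. It is not the kernel implementation: a plain
      sorted linked list stands in for the pushable task plist, and the
      names (sketch_rq, sketch_enqueue_pushable, sketch_dequeue_pushable,
      worth_taking_locks) are hypothetical. It assumes, as in the kernel,
      that a lower prio value means a higher priority.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RT_PRIO 100 /* sentinel: no pushable task queued */

struct sketch_task {
    int prio;                  /* lower value = higher priority */
    int nr_cpus_allowed;       /* 1 means the task cannot migrate */
    struct sketch_task *next;
};

struct sketch_rq {
    struct sketch_task *pushable; /* sorted, highest priority first */
    int highest_next_prio;        /* prio of the list head, if any */
};

/* The list head is always the highest priority migratable task. */
static void update_next_prio(struct sketch_rq *rq)
{
    rq->highest_next_prio = rq->pushable ? rq->pushable->prio : MAX_RT_PRIO;
}

static void sketch_enqueue_pushable(struct sketch_rq *rq, struct sketch_task *p)
{
    struct sketch_task **link = &rq->pushable;

    assert(p->prio >= 0 && p->prio < MAX_RT_PRIO);
    if (p->nr_cpus_allowed <= 1)
        return; /* pinned tasks never enter the pushable list */
    while (*link && (*link)->prio <= p->prio)
        link = &(*link)->next;
    p->next = *link;
    *link = p;
    update_next_prio(rq);
}

static void sketch_dequeue_pushable(struct sketch_rq *rq, struct sketch_task *p)
{
    struct sketch_task **link = &rq->pushable;

    while (*link && *link != p)
        link = &(*link)->next;
    if (*link) {
        *link = p->next;
        p->next = NULL;
    }
    update_next_prio(rq);
}

/* Cheap pre-check done before taking any run queue locks. */
static int worth_taking_locks(const struct sketch_rq *src, int dst_curr_prio)
{
    return src->highest_next_prio < dst_curr_prio;
}
```

      Because the value is refreshed only when the pushable list changes,
      a run queue whose waiting tasks are all pinned never looks like a
      pull candidate in this sketch.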
      
      Not only does this patch make the code a little more efficient, it cleans it
      up and makes it a little simpler.
      
      Thanks to Hillf Danton for inspiring me on this patch.
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Gregory Haskins <ghaskins@novell.com>
      Link: http://lkml.kernel.org/r/BANLkTimQ67180HxCx5vgMqumqw1EkFh3qg@mail.gmail.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Balance RT tasks when forked as well · c37495fd
      Steven Rostedt authored
      When a new task is woken, the code to balance the RT task is currently
      skipped in the select_task_rq() call. But it will be pushed if the rq
      is currently overloaded with RT tasks anyway. The issue is that we
      already queued the task, and if it does get pushed, it will have to
      be dequeued and requeued on the new run queue. The advantage with
      pushing it first is that we avoid this requeuing as we are pushing it
      off before the task is ever queued.
      
      See commit 318e0893 ("sched: pre-route RT tasks on wakeup")
      for more details.
      
      The return of select_task_rq() when it is not a wake up has also been
      changed to return task_cpu() instead of smp_processor_id(). This is more
      of a sanity measure, because currently the only other user of select_task_rq()
      besides wake ups is an exec, where task_cpu() should also be the same
      as smp_processor_id(). But if it is used for other purposes, let's keep
      the task on the same CPU. Why would we want to migrate it to the current
      CPU?
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Hillf Danton <dhillf@gmail.com>
      Link: http://lkml.kernel.org/r/20110617015919.832743148@goodmis.org
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Remove resetting exec_start in put_prev_task_rt() · 1812a643
      Hillf Danton authored
      There's no reason to clean the exec_start in put_prev_task_rt() as it is reset
      when the task gets back to the run queue. This saves us doing a store() in the
      fast path.
      Signed-off-by: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Yong Zhang <yong.zhang0@gmail.com>
      Link: http://lkml.kernel.org/r/BANLkTimqWD=q6YnSDi-v9y=LMWecgEzEWg@mail.gmail.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched, rt: Fix rq->rt.pushable_tasks bug in push_rt_task() · 311e800e
      Hillf Danton authored
      Do not call dequeue_pushable_task() when failing to push an eligible
      task, as it remains pushable, merely not at this particular moment.
      Signed-off-by: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: Mike Galbraith <mgalbraith@gmx.de>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Yong Zhang <yong.zhang0@gmail.com>
      Link: http://lkml.kernel.org/r/1306895385.4791.26.camel@marge.simson.net
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Remove noop in lowest_flag_domain() · 08354716
      Hillf Danton authored
      Checking for the validity of sd is removed, since it is already
      checked by the for_each_domain macro.
      Signed-off-by: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/BANLkTimT+Tut-3TshCDm-NiLLXrOznibNA@mail.gmail.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Remove noop in next_prio() · 67d95538
      Hillf Danton authored
      When computing the next priority for a given run-queue, the check for
      RT priority of the task determined by the pick_next_highest_task_rt()
      function could be removed, since only RT tasks are returned by the
      function.
      Reviewed-by: Yong Zhang <yong.zhang0@gmail.com>
      Signed-off-by: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/BANLkTimxmWiof9s5AvS3v_0X+sMiE=0x5g@mail.gmail.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: fix broken SCHED_RESET_ON_FORK handling · c350a04e
      Mike Galbraith authored
      Setting child->prio = current->normal_prio _after_ SCHED_RESET_ON_FORK has
      been handled for an RT parent gives birth to a deranged mutant child with
      non-RT policy, but RT prio and sched_class.
      
      Move PI leakage protection up, always set priorities and weight, and if the
      child is leaving RT class, reset rt_priority to the proper value.
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1311779695.8691.2.camel@marge.simson.net
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Kill WAKEUP_PREEMPT · 2c2efaed
      Yong Zhang authored
      Remove the WAKEUP_PREEMPT feature; disabling it doesn't make any sense
      and it has outlived its use by a long long while.
      Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
      Acked-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20110729082033.GB12106@zhy
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Remove rq->avg_load_per_task · e2b245f8
      Jan H. Schönherr authored
      Since commit a2d47777 ("sched: fix stale value in average load per task")
      the variable rq->avg_load_per_task is no longer required. Remove it.
      Signed-off-by: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1312189408-17172-1-git-send-email-schnhrr@cs.tu-berlin.de
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  2. 12 Aug, 2011 5 commits
  3. 11 Aug, 2011 6 commits
    • move RLIMIT_NPROC check from set_user() to do_execve_common() · 72fa5997
      Vasiliy Kulikov authored
      The patch http://lkml.org/lkml/2003/7/13/226 introduced an RLIMIT_NPROC
      check in set_user() to check for NPROC exceeding via setuid() and
      similar functions.
      
      Before the check there was a possibility to greatly exceed the allowed
      number of processes by an unprivileged user if the program relied on
      rlimit only.  But the check created a new security threat: many poorly
      written programs simply don't check the setuid() return code and believe it
      cannot fail if executed with root privileges.  So, the check is removed
      in this patch because privilege escalations related to
      buggy programs are too frequent.
      
      The NPROC can still be enforced in the common code flow of daemons
      spawning user processes.  Most daemons do fork()+setuid()+execve().
      The check introduced in execve() (1) enforces the same limit as in
      setuid() and (2) doesn't create similar security issues.
      
      Neil Brown suggested tracking which specific process has exceeded the
      limit by setting the PF_NPROC_EXCEEDED process flag.  With the change only
      this process would fail on execve(), and other processes' execve()
      behaviour is not changed.
      
      Solar Designer suggested re-checking whether the NPROC limit is still
      exceeded at the moment of execve().  If the process was sleeping for
      days between set*uid() and execve(), and the NPROC counter stepped down
      under the limit, the deferred execve() failure because the NPROC limit was
      exceeded days ago would be unexpected.  If the limit is not exceeded
      anymore, we clear the flag on successful calls to execve() and fork().
      
      The flag is also cleared on successful calls to set_user() as the limit
      was exceeded for the previous user, not the current one.
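
      The flow described above can be sketched as a small user-space
      model. It is a simplified illustration under stated assumptions,
      not the patch itself: the structures, the flag value, the per-user
      accounting done inside sketch_set_user(), and the function names
      are hypothetical stand-ins for the kernel's task and user
      structures.

```c
#include <assert.h>
#include <errno.h>

#define PF_NPROC_EXCEEDED 0x1u /* illustrative flag value only */

struct sketch_user {
    int processes;             /* processes currently owned by this user */
};

struct sketch_proc {
    unsigned int flags;
    struct sketch_user *user;
    int nproc_limit;           /* stands in for RLIMIT_NPROC */
};

/*
 * Switching users never fails because of NPROC any more. It only
 * records whether the new user is over the limit, and clears the
 * flag otherwise since it referred to the previous user.
 */
static int sketch_set_user(struct sketch_proc *p, struct sketch_user *new_user)
{
    assert(p && new_user);
    if (new_user->processes >= p->nproc_limit)
        p->flags |= PF_NPROC_EXCEEDED;
    else
        p->flags &= ~PF_NPROC_EXCEEDED;
    p->user->processes--;
    new_user->processes++;
    p->user = new_user;
    return 0;
}

/*
 * The deferred check: fail only if this process was flagged and the
 * limit is still exceeded at the moment of execve().
 */
static int sketch_execve(struct sketch_proc *p)
{
    if ((p->flags & PF_NPROC_EXCEEDED) &&
        p->user->processes > p->nproc_limit)
        return -EAGAIN;
    p->flags &= ~PF_NPROC_EXCEEDED; /* limit no longer exceeded */
    return 0;
}
```

      In this model a program that ignores the return value of the user
      switch no longer keeps running with the old identity. The failure
      is moved to execve(), and only for the process that was flagged.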
      
      A similar check was introduced in -ow patches (without the process flag).
      
      v3 - clear PF_NPROC_EXCEEDED on successful calls to set_user().
      Reviewed-by: James Morris <jmorris@namei.org>
      Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
      Acked-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge branch 'perf-urgent-for-linus' of... · 1d229d54
      Linus Torvalds authored
      Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
      
      * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
        perf symbols: Check '/tmp/perf-' symbol file ownership
        perf sched: Usage leftover from trace -> script rename
        perf sched: Do not delete session object prematurely
        perf tools: Check $HOME/.perfconfig ownership
        perf, x86: Add model 45 SandyBridge support
        perf tools: Add support to install perf python extension
        perf tools: do not look at ./config for configuration
        perf tools: Make clean leaves some files
        perf lock: Dropping unsupported ':r' modifier
        perf probe: Fix coredump introduced by probe module option
        jump label: Reduce the cycle count by changing the link order
        perf report: Use ui__warning in some more places
        perf python: Add PERF_RECORD_{LOST,READ,SAMPLE} routine tables
        perf evlist: Introduce 'disable' method
        trace events: Update version number reference to new 3.x scheme for EVENT_POWER_TRACING_DEPRECATED
        perf buildid-cache: Zero out buffer of filenames when adding/removing buildid
    • MAINTAINERS: Update linus' git repository · d16adea3
      Tracey Dent authored
      Change to new git tree -
       (git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git).
      Signed-off-by: Tracey Dent <tdent48227@gmail.com>
      Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Revert "EDAC: Correct Kconfig dependencies" · a9f729f0
      Linus Torvalds authored
      This reverts commit af9d220b.
      
      It turns out that one was meant to be applied on top of the edac.git
      tree in -next that has more i7core_edac changes, but that wasn't clear
      in the original email.
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: Borislav Petkov <borislav.petkov@amd.com>
      Cc: Randy Dunlap <rdunlap@xenotime.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • NFS41: make PNFS_BLOCK selectable · 54a33b19
      Peng Tao authored
      PNFS_BLOCK needs BLK_DEV_DM/MD, which is not a dependency for other
      pnfs layout drivers. Separate it out so others can still build when
      BLK_DEV_DM/MD is not enabled.
      
      Also change select to depends on to avoid build failures.
      Reported-and-tested-by: Randy Dunlap <rdunlap@xenotime.net>
      Signed-off-by: Peng Tao <peng_tao@emc.com>
      Acked-by: Benny Halevy <bhalevy@tonian.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge branch 'fixes' of master.kernel.org:/home/rmk/linux-2.6-arm · 068ef739
      Linus Torvalds authored
      * 'fixes' of master.kernel.org:/home/rmk/linux-2.6-arm:
        ARM: drop experimental status for ARM_PATCH_PHYS_VIRT
        ARM: 7008/1: alignment: Make SIGBUS sent to userspace POSIXly correct
        ARM: 7007/1: alignment: Prevent ignoring of faults with ARMv6 unaligned access model
        ARM: 7010/1: mm: fix invalid loop for poison_init_mem
        ARM: 7005/1: freshen up mm/proc-arm946.S
        dmaengine: PL08x: Fix trivial build error
        ARM: Fix build error for SMP=n builds
  4. 10 Aug, 2011 9 commits
  5. 09 Aug, 2011 10 commits