- 29 Oct, 2013 1 commit
-
-
Ben Segall authored
When we transition cfs_bandwidth_used to false, any currently throttled groups will incorrectly return false from cfs_rq_throttled. While tg_set_cfs_bandwidth will unthrottle them eventually, currently running code (including at least dequeue_task_fair and distribute_cfs_runtime) will cause errors. Fix this by turning off cfs_bandwidth_used only after unthrottling all cfs_rqs. Tested: toggle bandwidth back and forth on a loaded cgroup. Caused crashes in minutes without the patch, hasn't crashed with it. Signed-off-by: Ben Segall <bsegall@google.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: pjt@google.com Link: http://lkml.kernel.org/r/20131016181611.22647.80365.stgit@sword-of-the-dawn.mtv.corp.google.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
- 28 Oct, 2013 1 commit
-
-
Michael wang authored
Commit 6acce3ef: sched: Remove get_online_cpus() usage has left one extra put_online_cpus() inside sched_setaffinity(), remove it to fix the WARN: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 3166 at kernel/cpu.c:84 put_online_cpus+0x43/0x70() ... [<ffffffff810c3fef>] put_online_cpus+0x43/0x70 [ [<ffffffff810efd59>] sched_setaffinity+0x7d/0x1f9 [ ... Reported-by: Fengguang Wu <fengguang.wu@intel.com> Tested-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/526DD0EE.1090309@linux.vnet.ibm.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
- 26 Oct, 2013 1 commit
-
-
Li Bin authored
This issue was introduced by 454c7999 ("sched/rt: Fix SCHED_RR across cgroups") that missed the word 'not'. Fix it. Signed-off-by: Li Bin <huawei.libin@huawei.com> Cc: <guohanjun@huawei.com> Cc: <xiexiuqi@huawei.com> Cc: <peterz@infradead.org> Link: http://lkml.kernel.org/r/1382357743-54136-1-git-send-email-huawei.libin@huawei.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
- 23 Oct, 2013 1 commit
-
-
Thierry Reding authored
The wait_event_interruptible_lock_irq() macro is missing a semi-colon which causes a build failure in the i915 DRM driver. Signed-off-by: Thierry Reding <treding@nvidia.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1382528455-29911-1-git-send-email-treding@nvidia.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
- 16 Oct, 2013 5 commits
-
-
Oleg Nesterov authored
Add the new helper, prepare_to_wait_event() which should only be used by ___wait_event(). prepare_to_wait_event() returns -ERESTARTSYS if signal_pending_state() is true, otherwise it does prepare_to_wait/exclusive. This allows to uninline the signal-pending checks in wait_event*() macros. Also, it can initialize wait->private/func. We do not care if they were already initialized, the values are the same. This also shaves a couple of insns from the inlined code. This obviously makes prepare_*() path a little bit slower, but we are likely going to sleep anyway, so I think it makes sense to shrink .text: text data bss dec hex filename =================================================== before: 5126092 2959248 10117120 18202460 115bf5c vmlinux after: 5124618 2955152 10117120 18196890 115a99a vmlinux on my build. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20131007161824.GA29757@redhat.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Oleg Nesterov authored
Commit 4c663cfc ("wait: fix false timeouts when using wait_event_timeout()") introduced the additional condition checks after a timeout but only in the "slow" __wait*() paths. wait_event_timeout(wq, CONDITION, 0) still returns 0 if CONDITION is already true and we do not call __wait*(). Now that we have ___wait_cond_timeout() we can use it instead to ensure that __ret will be properly updated. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20131007183106.GA10973@redhat.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
Remove get_online_cpus() usage from the scheduler; there's 4 sites that use it: - sched_init_smp(); where its completely superfluous since we're in 'early' boot and there simply cannot be any hotplugging. - sched_getaffinity(); we already take a raw spinlock to protect the task cpus_allowed mask, this disables preemption and therefore also stabilizes cpu_online_mask as that's modified using stop_machine. However switch to active mask for symmetry with sched_setaffinity()/set_cpus_allowed_ptr(). We guarantee active mask stability by inserting sync_rcu/sched() into _cpu_down. - sched_setaffinity(); we don't appear to need get_online_cpus() either, there's two sites where hotplug appears relevant: * cpuset_cpus_allowed(); for the !cpuset case we use possible_mask, for the cpuset case we hold task_lock, which is a spinlock and thus for mainline disables preemption (might cause pain on RT). * set_cpus_allowed_ptr(); Holds all scheduler locks and thus has preemption properly disabled; also it already deals with hotplug races explicitly where it releases them. - migrate_swap(); we can make stop_two_cpus() do the heavy lifting for us with a little trickery. By adding a sync_sched/rcu() after the CPU_DOWN_PREPARE notifier we can provide preempt/rcu guarantees for cpu_active_mask. Use these to validate that both our cpus are active when queueing the stop work before we queue the stop_machine works for take_cpu_down(). Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Oleg Nesterov <oleg@redhat.com> Link: http://lkml.kernel.org/r/20131011123820.GV3081@twins.programming.kicks-ass.netSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
There is a subtle race in migrate_swap, when task P, on CPU A, decides to swap places with task T, on CPU B. Task P: - call migrate_swap Task T: - go to sleep, removing itself from the runqueue Task P: - double lock the runqueues on CPU A & B Task T: - get woken up, place itself on the runqueue of CPU C Task P: - see that task T is on a runqueue, and pretend to remove it from the runqueue on CPU B Now CPUs B & C both have corrupted scheduler data structures. This patch fixes it, by holding the pi_lock for both of the tasks involved in the migrate swap. This prevents task T from waking up, and placing itself onto another runqueue, until after migrate_swap has released all locks. This means that, when migrate_swap checks, task T will be either on the runqueue where it was originally seen, or not on any runqueue at all. Migrate_swap deals correctly with of those cases. Tested-by: Joe Mario <jmario@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: hannes@cmpxchg.org Cc: aarcange@redhat.com Cc: srikar@linux.vnet.ibm.com Cc: tglx@linutronix.de Cc: hpa@zytor.com Link: http://lkml.kernel.org/r/20131010181722.GO13848@laptop.programming.kicks-ass.netSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
While discussing the proposed SCHED_DEADLINE patches which in parts mimic the existing FIFO code it was noticed that the wmb in rt_set_overloaded() didn't have a matching barrier. The only site using rt_overloaded() to test the rto_count is pull_rt_task() and we should issue a matching rmb before then assuming there's an rto_mask bit set. Without that smp_rmb() in there we could actually miss seeing the rto_mask bit. Also, change to using smp_[wr]mb(), even though this is SMP only code; memory barriers without smp_ always make me think they're against hardware of some sort. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: vincent.guittot@linaro.org Cc: luca.abeni@unitn.it Cc: bruce.ashfield@windriver.com Cc: dhaval.giani@gmail.com Cc: rostedt@goodmis.org Cc: hgu1972@gmail.com Cc: oleg@redhat.com Cc: fweisbec@gmail.com Cc: darren@dvhart.com Cc: johan.eker@ericsson.com Cc: p.faure@akatech.ch Cc: paulmck@linux.vnet.ibm.com Cc: raistlin@linux.it Cc: claudio@evidence.eu.com Cc: insop.song@gmail.com Cc: michael@amarulasolutions.com Cc: liming.wang@windriver.com Cc: fchecconi@gmail.com Cc: jkacur@redhat.com Cc: tommaso.cucinotta@sssup.it Cc: Juri Lelli <juri.lelli@gmail.com> Cc: harald.gustafsson@ericsson.com Cc: nicola.manica@disi.unitn.it Cc: tglx@linutronix.de Link: http://lkml.kernel.org/r/20131015103507.GF10651@twins.programming.kicks-ass.netSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
- 14 Oct, 2013 1 commit
-
-
Kamalesh Babulal authored
- 'load_icx' => 'load_idx' - 'calculcate_imbalance' => 'calculate_imbalance' Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Cc: peterz@infradead.org Link: http://lkml.kernel.org/r/1381685775-3544-1-git-send-email-kamalesh@linux.vnet.ibm.com [ Also, don't capitalize 'idle' unnecessarily. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
- 12 Oct, 2013 1 commit
-
-
Ramkumar Ramachandra authored
The balance parameter was removed by 23f0d209 ("sched: Factor out code to should_we_balance()", 2013-08-06). Signed-off-by: Ramkumar Ramachandra <artagnon@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381400433-2030-1-git-send-email-artagnon@gmail.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
- 11 Oct, 2013 6 commits
-
-
Ingo Molnar authored
Apply the asm_volatile_goto() compiler quirk to the new rmwcc.h file as well, introduced in: c2daa3be sched, x86: Provide a per-cpu preempt_count implementation Reported-and-tested-by: Fengguang Wu <fengguang.wu@intel.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Suggested-by: Jakub Jelinek <jakub@redhat.com> Reviewed-by: Richard Henderson <rth@twiddle.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Ingo Molnar authored
Merge in asm goto fix, to be able to apply the asm/rmwcc.h fix. Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Ingo Molnar authored
Fengguang Wu, Oleg Nesterov and Peter Zijlstra tracked down a kernel crash to a GCC bug: GCC miscompiles certain 'asm goto' constructs, as outlined here: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58670 Implement a workaround suggested by Jakub Jelinek. Reported-and-tested-by: Fengguang Wu <fengguang.wu@intel.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Suggested-by: Jakub Jelinek <jakub@redhat.com> Reviewed-by: Richard Henderson <rth@twiddle.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: <stable@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Kent Overstreet authored
Commit c0f04d88 ("bcache: Fix flushes in writeback mode") was fixing a reported data corruption bug, but it seems some last minute refactoring or rebasing introduced a null pointer deref. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: linux-stable <stable@vger.kernel.org> # >= v3.10 Reported-by: Gabriel de Perthuis <g2p.code@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-stagingLinus Torvalds authored
Pull hwmon fix from Guenter Roeck: "Fix root cause of crash/error seen in applesmc driver" * tag 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging: hwmon: (applesmc) Always read until end of data
-
git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuildLinus Torvalds authored
Pull kbuild fix from Michal Marek: "Here is an ARM Makefile fix that you even acked. After nobody wanted to take it, it ended up in the kbuild tree" * 'rc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild: arm, kbuild: make "make install" not depend on vmlinux
-
- 10 Oct, 2013 11 commits
-
-
git://www.linux-watchdog.org/linux-watchdogLinus Torvalds authored
Pull watchdog fix from Wim Van Sebroeck: "Make sure that the hpwdt driver will not load auxilary iLO devices" * git://www.linux-watchdog.org/linux-watchdog: watchdog: hpwdt: Patch to ignore auxilary iLO devices
-
Fengguang Wu authored
Useful for locating buggy drivers on kernel oops. It may add dozens of new lines to boot dmesg. DEBUG_KOBJECT_RELEASE is hopefully only enabled in debug kernels (like maybe the Fedora rawhide one, or at developers), so being a bit more verbose is likely ok. Acked-by: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Mingarelli, Thomas authored
This patch is to prevent hpwdt from loading on any auxilary iLO devices defined after the initial (or main) iLO device. All auxilary iLO devices will have a subsystem device ID set to 0x1979 in order for hpwdt to differentiate between the two types. Signed-off-by: Thomas Mingarelli <thomas.mingarelli@hp.com> Tested-by: Lisa Mitchell <lisa.mitchell@hp.com> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
-
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/randomLinus Torvalds authored
Pull /dev/random changes from Ted Ts'o: "These patches are designed to enable improvements to /dev/random for non-x86 platforms, in particular MIPS and ARM" * tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random: random: allow architectures to optionally define random_get_entropy() random: run random_int_secret_init() run after all late_initcalls
-
git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds authored
Pull kvm fixes from Paolo Bonzini: "Fixes for 3.12-rc5: two old PPC bugs and one new (3.12-rc2) x86 bug" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: kvm: ppc: booke: check range page invalidation progress on page setup KVM: PPC: Book3S HV: Fix typo in saving DSCR KVM: nVMX: fix shadow on EPT
-
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spiLinus Torvalds authored
Pull spi fixes from Mark Brown: "This is all driver updates, mostly fixes for error handling paths except for the s3c64xx and hspi fixes for trying to use runtime PM before it is enabled and the pxa2xx fix for interactions between power management and interrupt handling" * tag 'spi-v3.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi: spi: atmel: Fix incorrect error path spi/hspi: fixup Runtime PM enable timing spi/s3c64xx: Ensure runtime PM is enabled prior to registration spi/clps711x: drop clk_put for devm_clk_get in spi_clps711x_probe() spi: fix return value check in dspi_probe() spi: mpc512x: fix error return code in mpc512x_psc_spi_do_probe() spi: clps711x: Don't call kfree() after spi_master_put/spi_unregister_master spi/pxa2xx: check status register as well to determine if the device is off
-
Theodore Ts'o authored
Allow architectures which have a disabled get_cycles() function to provide a random_get_entropy() function which provides a fine-grained, rapidly changing counter that can be used by the /dev/random driver. For example, an architecture might have a rapidly changing register used to control random TLB cache eviction, or DRAM refresh that doesn't meet the requirements of get_cycles(), but which is good enough for the needs of the random driver. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
-
git://git.infradead.org/linux-mtdLinus Torvalds authored
Pull MTD fixes from Brian Norris: - fix a small memory leak in some new ONFI code - account for additional odd variations of Micron SPI flash Acked by David Woodhouse. * tag 'for-linus-20131008' of git://git.infradead.org/linux-mtd: mtd: m25p80: Fix 4 byte addressing mode for Micron devices. mtd: nand: fix memory leak in ONFI extended parameter page
-
Bharat Bhushan authored
When the MM code is invalidating a range of pages, it calls the KVM kvm_mmu_notifier_invalidate_range_start() notifier function, which calls kvm_unmap_hva_range(), which arranges to flush all the TLBs for guest pages. However, the Linux PTEs for the range being flushed are still valid at that point. We are not supposed to establish any new references to pages in the range until the ...range_end() notifier gets called. The PPC-specific KVM code doesn't get any explicit notification of that; instead, we are supposed to use mmu_notifier_retry() to test whether we are or have been inside a range flush notifier pair while we have been referencing a page. This patch calls the mmu_notifier_retry() while mapping the guest page to ensure we are not referencing a page when in range invalidation. This call is inside a region locked with kvm->mmu_lock, which is the same lock that is called by the KVM MMU notifier functions, thus ensuring that no new notification can proceed while we are in the locked region. Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com> Acked-by: Alexander Graf <agraf@suse.de> [Backported to 3.12 - Paolo] Reviewed-by: Bharat Bhushan <bharat.bhushan@freescale.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paul Mackerras authored
This fixes a typo in the code that saves the guest DSCR (Data Stream Control Register) into the kvm_vcpu_arch struct on guest exit. The effect of the typo was that the DSCR value was saved in the wrong place, so changes to the DSCR by the guest didn't persist across guest exit and entry, and some host kernel memory got corrupted. Cc: stable@vger.kernel.org [v3.1+] Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: Alexander Graf <agraf@suse.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Gleb Natapov authored
72f85795 broke shadow on EPT. This patch reverts it and fixes PAE on nEPT (which reverted commit fixed) in other way. Shadow on EPT is now broken because while L1 builds shadow page table for L2 (which is PAE while L2 is in real mode) it never loads L2's GUEST_PDPTR[0-3]. They do not need to be loaded because without nested virtualization HW does this during guest entry if EPT is disabled, but in our case L0 emulates L2's vmentry while EPT is enables, so we cannot rely on vmcs12->guest_pdptr[0-3] to contain up-to-date values and need to re-read PDPTEs from L2 memory. This is what kvm_set_cr3() is doing, but by clearing cache bits during L2 vmentry we drop values that kvm_set_cr3() read from memory. So why the same code does not work for PAE on nEPT? kvm_set_cr3() reads pdptes into vcpu->arch.walk_mmu->pdptrs[]. walk_mmu points to vcpu->arch.nested_mmu while nested guest is running, but ept_load_pdptrs() uses vcpu->arch.mmu which contain incorrect values. Fix that by using walk_mmu in ept_(load|save)_pdptrs. Signed-off-by: Gleb Natapov <gleb@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Tested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 09 Oct, 2013 12 commits
-
-
Henrik Rydberg authored
The crash reported and investigated in commit 5f4513 turned out to be caused by a change to the read interface on newer (2012) SMCs. Tests by Chris show that simply reading the data valid line is enough for the problem to go away. Additional tests show that the newer SMCs no longer wait for the number of requested bytes, but start sending data right away. Apparently the number of bytes to read is no longer specified as before, but instead found out by reading until end of data. Failure to read until end of data confuses the state machine, which eventually causes the crash. As a remedy, assuming bit0 is the read valid line, make sure there is nothing more to read before leaving the read function. Tested to resolve the original problem, and runtested on MBA3,1, MBP4,1, MBP8,2, MBP10,1, MBP10,2. The patch seems to have no effect on machines before 2012. Tested-by: Chris Murphy <chris@cmurf.com> Cc: stable@vger.kernel.org Signed-off-by: Henrik Rydberg <rydberg@euromail.se> Signed-off-by: Guenter Roeck <linux@roeck-us.net>
-
Peter Zijlstra authored
Reflow the function a bit because GCC gets confused: kernel/sched/fair.c: In function ‘task_numa_fault’: kernel/sched/fair.c:1448:3: warning: ‘my_grp’ may be used uninitialized in this function [-Wmaybe-uninitialized] kernel/sched/fair.c:1463:27: note: ‘my_grp’ was declared here Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-6ebt6x7u64pbbonq1khqu2z9@git.kernel.orgSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Rik van Riel authored
Short spikes of CPU load can lead to a task being migrated away from its preferred node for temporary reasons. It is important that the task is migrated back to where it belongs, in order to avoid migrating too much memory to its new location, and generally disturbing a task's NUMA location. This patch fixes NUMA placement for 4 specjbb instances on a 4 node system. Without this patch, things take longer to converge, and processes are not always completely on their own node. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-64-git-send-email-mgorman@suse.deSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Mel Gorman authored
As Peter says "If you're going to hold locks you can also do away with all that atomic_long_*() nonsense". Lock aquisition moved slightly to protect the updates. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-63-git-send-email-mgorman@suse.deSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Rik van Riel authored
Shared faults can lead to lots of unnecessary page migrations, slowing down the system, and causing private faults to hit the per-pgdat migration ratelimit. This patch adds sysctl numa_balancing_migrate_deferred, which specifies how many shared page migrations to skip unconditionally, after each page migration that is skipped because it is a shared fault. This reduces the number of page migrations back and forth in shared fault situations. It also gives a strong preference to the tasks that are already running where most of the memory is, and to moving the other tasks to near the memory. Testing this with a much higher scan rate than the default still seems to result in fewer page migrations than before. Memory seems to be somewhat better consolidated than previously, with multi-instance specjbb runs on a 4 node system. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-62-git-send-email-mgorman@suse.deSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Rik van Riel authored
With the scan rate code working (at least for multi-instance specjbb), the large hammer that is "sched: Do not migrate memory immediately after switching node" can be replaced with something smarter. Revert temporarily migration disabling and all traces of numa_migrate_seq. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-61-git-send-email-mgorman@suse.deSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Mel Gorman authored
With scan rate adaptions based on whether the workload has properly converged or not there should be no need for the scan period reset hammer. Get rid of it. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-60-git-send-email-mgorman@suse.deSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Rik van Riel authored
Adjust numa_scan_period in task_numa_placement, depending on how much useful work the numa code can do. The more local faults there are in a given scan window the longer the period (and hence the slower the scan rate) during the next window. If there are excessive shared faults then the scan period will decrease with the amount of scaling depending on whether the ratio of shared/private faults. If the preferred node changes then the scan rate is reset to recheck if the task is properly placed. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-59-git-send-email-mgorman@suse.deSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Mel Gorman authored
Scan rate is altered based on whether shared/private faults dominated. task_numa_group() may detect false sharing but that information is not taken into account when adapting the scan rate. Take it into account. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-58-git-send-email-mgorman@suse.deSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Rik van Riel authored
Due to the way the pid is truncated, and tasks are moved between CPUs by the scheduler, it is possible for the current task_numa_fault to group together tasks that do not actually share memory together. This patch adds a few easy sanity checks to task_numa_fault, joining tasks together if they share the same tsk->mm, or if the fault was on a page with an elevated mapcount, in a shared VMA. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-57-git-send-email-mgorman@suse.deSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
This patch classifies scheduler domains and runqueues into types depending the number of tasks that are about their NUMA placement and the number that are currently running on their preferred node. The types are regular: There are tasks running that do not care about their NUMA placement. remote: There are tasks running that care about their placement but are currently running on a node remote to their ideal placement all: No distinction To implement this the patch tracks the number of tasks that are optimally NUMA placed (rq->nr_preferred_running) and the number of tasks running that care about their placement (nr_numa_running). The load balancer uses this information to avoid migrating idea placed NUMA tasks as long as better options for load balancing exists. For example, it will not consider balancing between a group whose tasks are all perfectly placed and a group with remote tasks. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1381141781-10992-56-git-send-email-mgorman@suse.deSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Rik van Riel authored
This patch separately considers task and group affinities when searching for swap candidates during NUMA placement. If tasks are part of the same group, or no group at all, the task weights are considered. Some hysteresis is added to prevent tasks within one group from getting bounced between NUMA nodes due to tiny differences. If tasks are part of different groups, the code compares group weights, in order to favor grouping task groups together. The patch also changes the group weight multiplier to be the same as the task weight multiplier, since the two are no longer added up like before. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-55-git-send-email-mgorman@suse.deSigned-off-by: Ingo Molnar <mingo@kernel.org>
-