1. 23 Sep, 2012 24 commits
    • rcu: Shrink RCU based on number of CPUs · b17c7035
      Paul E. McKenney authored
      Currently, rcu_init_geometry() only reshapes RCU's combining trees
      if the leaf fanout is changed at boot time.  This means that by
      default, kernels compiled with (say) NR_CPUS=4096 will keep oversized
      data structures, even when running on systems with (say) four CPUs.
      
      This commit therefore checks to see if the maximum number of CPUs on
      the actual running system (nr_cpu_ids) differs from NR_CPUS, and if so
      reshapes the combining trees accordingly.
      Reported-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      b17c7035
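To make the size difference concrete, here is a minimal sketch that counts how many rcu_node-like structures a combining tree needs for a given CPU count. It is not the kernel's rcu_init_geometry(); the function name and the example fanout values (16 CPUs per leaf, 64 children per interior node) are assumptions chosen for illustration.

```c
#include <stddef.h>

/*
 * Hypothetical helper: count the nodes in a combining tree that covers
 * nr_cpus CPUs, with leaf_fanout CPUs per leaf and fanout children per
 * interior node.  Both fanouts are assumed to be at least 2.
 */
static unsigned long demo_node_count(unsigned long nr_cpus,
                                     unsigned long leaf_fanout,
                                     unsigned long fanout)
{
    /* Number of leaf nodes, rounding up. */
    unsigned long n = (nr_cpus + leaf_fanout - 1) / leaf_fanout;
    unsigned long total = n;

    /* Add one level at a time until a single root remains. */
    while (n > 1) {
        n = (n + fanout - 1) / fanout;
        total += n;
    }
    return total;
}
```

Under these assumed fanouts, a tree sized for 4096 CPUs needs 261 nodes, while a tree sized for four CPUs needs a single node.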
    • rcu: Handle unbalanced rcu_node configurations with few CPUs · 4dbd6bb3
      Paul E. McKenney authored
      If CONFIG_RCU_FANOUT_EXACT=y, and there are not enough CPUs (according
      to nr_cpu_ids) to require more than a single rcu_node structure, but
      NR_CPUS is larger than would fit into a single rcu_node structure, then
      the current rcu_init_levelspread() code is subject to integer overflow
      in the eight-bit ->levelspread[] array in the rcu_state structure.
      
      In this case, the solution is -not- to increase the size of the
      elements in this array because the values in that array should be
      constrained to the number of bits in an unsigned long.  Instead, this
      commit replaces NR_CPUS with nr_cpu_ids in the rcu_init_levelspread()
      function's initialization of the cprv local variable.  This results in
      all of the arithmetic being consistently based off of the nr_cpu_ids
      value, thus avoiding the overflow, which was caused by the mixing of
      nr_cpu_ids and NR_CPUS.
      Reported-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      4dbd6bb3
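The overflow can be illustrated with a small sketch. It assumes that the per-level spread is a rounding-up division of the count at the level below (cprv) by the node count at this level (ccur), stored in eight bits; the function name is hypothetical and this is not the kernel's rcu_init_levelspread().

```c
#include <stdint.h>

/*
 * Hypothetical model of one ->levelspread[] element.  The result is
 * truncated to eight bits, matching the element size described above.
 * ccur must be nonzero.
 */
static uint8_t demo_levelspread(unsigned long cprv, unsigned long ccur)
{
    return (uint8_t)((cprv + ccur - 1) / ccur);
}
```

With a single node sized from nr_cpu_ids (ccur = 1), seeding cprv from NR_CPUS=4096 yields 4096, which truncates to 0 in eight bits. Seeding cprv from nr_cpu_ids=4 yields 4, which fits.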
    • rcu: Simplify quiescent-state detection · d7d6a11e
      Paul E. McKenney authored
      The current quiescent-state detection algorithm is needlessly
      complex.  It records the grace-period number corresponding to
      the quiescent state at the time of the quiescent state, which
      works, but it seems better to simply erase any record of previous
      quiescent states at the time that the CPU notices the new grace
      period.  This has the further advantage of removing another piece
      of RCU for which lockless reasoning is required.
      
      Therefore, this commit makes this change.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      d7d6a11e
    • rcu: Reduce synchronize_rcu_expedited() latency · 1943c89d
      Paul E. McKenney authored
      The synchronize_rcu_expedited() function disables interrupts across a
      scan of all leaf rcu_node structures, which is not good for real-time
      scheduling latency on large systems (hundreds or especially thousands
      of CPUs).  This commit therefore holds off CPU-hotplug operations using
      get_online_cpus(), and removes the prior acquisition of the ->onofflock
      (which required disabling interrupts).
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      1943c89d
    • rcu: Eliminate signed overflow in synchronize_rcu_expedited() · bcfa57ce
      Paul E. McKenney authored
      In the C language, signed overflow is undefined.  It is true that
      twos-complement arithmetic normally comes to the rescue, but the
      compiler can subvert this any time it has any information about the values
      being compared.  For example, given "if (a - b > 0)", if the compiler
      has enough information to realize that (for example) the value of "a"
      is positive and that of "b" is negative, the compiler is within its
      rights to optimize to a simple "if (1)", which might not be what you want.
      
      This commit therefore converts synchronize_rcu_expedited()'s work-done
      detection counter from signed to unsigned.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      bcfa57ce
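A minimal sketch of a wraparound-safe comparison on an unsigned counter follows. Unsigned subtraction is defined to wrap modulo 2^N, so the compiler cannot assume the result away. The macro and function names are hypothetical and are not taken from the commit.

```c
#include <limits.h>
#include <stdbool.h>

/*
 * True if counter value a is at or after b in modular order, assuming
 * the two are less than half the counter range apart.
 */
#define DEMO_ULONG_CMP_GE(a, b) (ULONG_MAX / 2 >= (a) - (b))

/* Has the work counter reached the snapshot taken earlier? */
static bool demo_work_done(unsigned long snap, unsigned long now)
{
    return DEMO_ULONG_CMP_GE(now, snap);
}
```

The comparison stays correct when the counter wraps past ULONG_MAX between the snapshot and the check, which is the case that a signed "a - b > 0" test leaves undefined.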
    • rcu: Adjust for unconditional ->completed assignment · 25d30cf4
      Paul E. McKenney authored
      Now that the rcu_node structures' ->completed fields are unconditionally
      assigned at grace-period cleanup time, they should already have the
      correct value for the new grace period at grace-period initialization
      time.  This commit therefore inserts a WARN_ON_ONCE() to verify this
      invariant.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      25d30cf4
    • rcu: Add random PROVE_RCU_DELAY to grace-period initialization · 661a85dc
      Paul E. McKenney authored
      Preemption greatly raised the probability of certain types of race
      conditions, so this commit adds an anti-heisenbug to greatly increase
      the collision cross section, also known as the probability of occurrence.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      661a85dc
    • rcu: Fix day-zero grace-period initialization/cleanup race · 5d4b8659
      Paul E. McKenney authored
      The current approach to grace-period initialization is vulnerable to
      extremely low-probability races.  These races stem from the fact that
      the old grace period is marked completed on the same traversal through
      the rcu_node structure that is marking the start of the new grace period.
      This means that some rcu_node structures will believe that the old grace
      period is still in effect at the same time that other rcu_node structures
      believe that the new grace period has already started.
      
      These sorts of disagreements can result in too-short grace periods,
      as shown in the following scenario:
      
      1.	CPU 0 completes a grace period, but needs an additional
      	grace period, so starts initializing one, initializing all
      	the non-leaf rcu_node structures and the first leaf rcu_node
      	structure.  Because CPU 0 is both completing the old grace
      	period and starting a new one, it marks the completion of
      	the old grace period and the start of the new grace period
      	in a single traversal of the rcu_node structures.
      
      	Therefore, CPUs corresponding to the first rcu_node structure
      	can become aware that the prior grace period has completed, but
      	CPUs corresponding to the other rcu_node structures will see
      	this same prior grace period as still being in progress.
      
      2.	CPU 1 passes through a quiescent state, and therefore informs
      	the RCU core.  Because its leaf rcu_node structure has already
      	been initialized, this CPU's quiescent state is applied to the
      	new (and only partially initialized) grace period.
      
      3.	CPU 1 enters an RCU read-side critical section and acquires
      	a reference to data item A.  Note that this CPU believes that
      	its critical section started after the beginning of the new
      	grace period, and therefore will not block this new grace period.
      
      4.	CPU 16 exits dyntick-idle mode.  Because it was in dyntick-idle
      	mode, other CPUs informed the RCU core of its extended quiescent
      	state for the past several grace periods.  This means that CPU 16
      	is not yet aware that these past grace periods have ended.  Assume
      	that CPU 16 corresponds to the second leaf rcu_node structure --
      	which has not yet been made aware of the new grace period.
      
      5.	CPU 16 removes data item A from its enclosing data structure
      	and passes it to call_rcu(), which queues a callback in the
      	RCU_NEXT_TAIL segment of the callback queue.
      
      6.	CPU 16 enters the RCU core, possibly because it has taken a
      	scheduling-clock interrupt, or alternatively because it has
      	more than 10,000 callbacks queued.  It notes that the second
      	most recent grace period has completed (recall that because it
      	corresponds to the second as-yet-uninitialized rcu_node structure,
      	it cannot yet become aware that the most recent grace period has
      	completed), and therefore advances its callbacks.  The callback
      	for data item A is therefore in the RCU_NEXT_READY_TAIL segment
      	of the callback queue.
      
      7.	CPU 0 completes initialization of the remaining leaf rcu_node
      	structures for the new grace period, including the structure
      	corresponding to CPU 16.
      
      8.	CPU 16 again enters the RCU core, again, possibly because it has
      	taken a scheduling-clock interrupt, or alternatively because
      	it now has more than 10,000 callbacks queued.	It notes that
      	the most recent grace period has ended, and therefore advances
      	its callbacks.	The callback for data item A is therefore in
      	the RCU_WAIT_TAIL segment of the callback queue.
      
      9.	All CPUs other than CPU 1 pass through quiescent states.  Because
      	CPU 1 already passed through its quiescent state, the new grace
      	period completes.  Note that CPU 1 is still in its RCU read-side
      	critical section, still referencing data item A.
      
      10.	Suppose that CPU 2 was the last CPU to pass through a quiescent
      	state for the new grace period, and suppose further that CPU 2
      	did not have any callbacks queued, therefore not needing an
      	additional grace period.  CPU 2 therefore traverses all of the
      	rcu_node structures, marking the new grace period as completed,
      	but does not initialize a new grace period.
      
      11.	CPU 16 yet again enters the RCU core, yet again possibly because
      	it has taken a scheduling-clock interrupt, or alternatively
      	because it now has more than 10,000 callbacks queued.	It notes
      	that the new grace period has ended, and therefore advances
      	its callbacks.	The callback for data item A is therefore in
      	the RCU_DONE_TAIL segment of the callback queue.  This means
      	that this callback is now considered ready to be invoked.
      
      12.	CPU 16 invokes the callback, freeing data item A while CPU 1
      	is still referencing it.
      
      This scenario represents a day-zero bug for TREE_RCU.  This commit
      therefore ensures that the old grace period is marked completed in
      all leaf rcu_node structures before a new grace period is marked
      started in any of them.
      
      That said, it would have been insanely difficult to force this race to
      happen before the grace-period initialization process was preemptible.
      Therefore, this commit is not a candidate for -stable.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      
      Conflicts:
      
      	kernel/rcutree.c
      5d4b8659
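The difference between the vulnerable single traversal and the two-pass fix can be sketched with a toy model. All names here are hypothetical, each node carries only a started/completed pair, and the limit argument stands in for the traversal being preempted partway through. This is an illustration of the ordering rule, not the kernel's code.

```c
#include <stdbool.h>
#include <stddef.h>

struct demo_node {
    unsigned long gpnum;     /* latest grace period this node saw start */
    unsigned long completed; /* latest grace period this node saw end */
};

/* Vulnerable pattern: end the old GP and start the new one per node. */
static void demo_single_pass(struct demo_node *nodes, size_t limit,
                             unsigned long old_gp)
{
    for (size_t i = 0; i < limit; i++) {
        nodes[i].completed = old_gp;
        nodes[i].gpnum = old_gp + 1;
    }
}

/* Fixed pattern, pass 1: mark the old GP completed in every node. */
static void demo_cleanup_pass(struct demo_node *nodes, size_t n,
                              unsigned long old_gp)
{
    for (size_t i = 0; i < n; i++)
        nodes[i].completed = old_gp;
}

/* Fixed pattern, pass 2: only then mark the new GP started. */
static void demo_init_pass(struct demo_node *nodes, size_t limit,
                           unsigned long old_gp)
{
    for (size_t i = 0; i < limit; i++)
        nodes[i].gpnum = old_gp + 1;
}

/* False if two nodes believe that different GPs are in progress. */
static bool demo_views_agree(const struct demo_node *nodes, size_t n)
{
    bool have = false;
    unsigned long in_progress = 0;

    for (size_t i = 0; i < n; i++) {
        if (nodes[i].gpnum == nodes[i].completed)
            continue; /* this node sees no GP in progress */
        if (have && nodes[i].gpnum != in_progress)
            return false;
        have = true;
        in_progress = nodes[i].gpnum;
    }
    return true;
}
```

Interrupting demo_single_pass() partway leaves some nodes on the new grace period while others still see the old one in progress. Interrupting demo_init_pass() after a full demo_cleanup_pass() leaves the remaining nodes idle, so no two nodes disagree about which grace period is running.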
    • rcu: Make rcutree module parameters visible in sysfs · 7e5c2dfb
      Paul E. McKenney authored
      The module parameters blimit, qhimark, and qlomark (and more
      recently, rcu_fanout_leaf) have permission masks of zero, so
      that their values are not visible from sysfs.  This is unnecessary
      and inconvenient to administrators who might like an easy way to
      see what these values are on a running system.  This commit therefore
      sets their permission masks to 0444, allowing them to be read but
      not written.
      Reported-by: Rusty Russell <rusty@ozlabs.org>
      Reported-by: Josh Triplett <josh@joshtriplett.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      7e5c2dfb
    • rcu: Control grace-period duration from sysfs · d40011f6
      Paul E. McKenney authored
      Although almost everyone is well-served by the defaults, some uses of RCU
      benefit from shorter grace periods, while others benefit more from the
      greater efficiency provided by longer grace periods.  Situations requiring
      a large number of grace periods to elapse (and wireshark startup has
      been called out as an example of this) are helped by lower-latency
      grace periods.  Furthermore, in some embedded applications, people are
      willing to accept a small degradation in update efficiency (due to there
      being more of the shorter grace-period operations) in order to gain the
      lower latency.
      
      In contrast, those few systems with thousands of CPUs need longer grace
      periods because the CPU overhead of a grace period rises roughly
      linearly with the number of CPUs.  Such systems normally do not make
      much use of facilities that require large numbers of grace periods to
      elapse, so this is a good tradeoff.
      
      Therefore, this commit allows the durations to be controlled from sysfs.
      There are two sysfs parameters, one named "jiffies_till_first_fqs" that
      specifies the delay in jiffies from the end of grace-period initialization
      until the first attempt to force quiescent states, and the other named
      "jiffies_till_next_fqs" that specifies the delay (again in jiffies)
      between subsequent attempts to force quiescent states.  They both default
      to three jiffies, which is compatible with the old hard-coded behavior.
      
      At some future time, it may be possible to automatically increase the
      grace-period length with the number of CPUs, but we do not yet have
      sufficient data to do a good job.  Preliminary data indicates that we
      should add an additional jiffy to each of the delays for every 200 CPUs
      in the system, but more experimentation is needed.  For now, the number
      of systems with more than 1,000 CPUs is small enough that this can be
      relegated to boot-time hand tuning.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      d40011f6
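The preliminary scaling rule mentioned above (one extra jiffy per 200 CPUs on top of the three-jiffy default) can be written down as a small sketch. The commit does not implement this; the function and macro names are hypothetical, and the result is only a starting point for hand tuning.

```c
#define DEMO_DEFAULT_FQS_JIFFIES  3UL   /* default for both delays */
#define DEMO_CPUS_PER_EXTRA_JIFFY 200UL /* preliminary figure from above */

/* Suggested force-quiescent-state delay, in jiffies, for nr_cpus CPUs. */
static unsigned long demo_fqs_delay(unsigned long nr_cpus)
{
    return DEMO_DEFAULT_FQS_JIFFIES + nr_cpus / DEMO_CPUS_PER_EXTRA_JIFFY;
}
```

Under this rule a four-CPU system keeps the default of 3 jiffies, while a 4096-CPU system would use 23.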
    • rcu: Prevent force_quiescent_state() memory contention · 394f2769
      Paul E. McKenney authored
      Large systems running RCU_FAST_NO_HZ kernels see extreme memory
      contention on the rcu_state structure's ->fqslock field.  This
      can be avoided by disabling RCU_FAST_NO_HZ, either at compile time
      or at boot time (via the nohz kernel boot parameter), but large
      systems will no doubt become sensitive to energy consumption.
      This commit therefore uses a combining-tree approach to spread the
      memory contention across new cache lines in the leaf rcu_node structures.
      This can be thought of as a tournament lock that has only a try-lock
      acquisition primitive.
      
      The effect on small systems is minimal, because such systems have
      an rcu_node "tree" consisting of a single node.  In addition, this
      functionality is not used on fastpaths.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      394f2769
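The "tournament lock with only a try-lock primitive" idea can be sketched in userspace with C11 atomics. Each caller starts at its own leaf and tries to take one lock per level, releasing the lock below as it climbs, and gives up at the first lock that is already held. The structure and function names are hypothetical, and the sketch omits everything else that the kernel's force_quiescent_state() does.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_fq_node {
    atomic_flag fqslock;         /* per-node try-lock */
    struct demo_fq_node *parent; /* NULL at the root */
};

/*
 * Climb from leaf to root.  Returns true if the caller won and now
 * holds the root's lock; returns false, holding nothing, on the first
 * contended level.
 */
static bool demo_funnel_trylock(struct demo_fq_node *leaf)
{
    struct demo_fq_node *held = NULL;

    for (struct demo_fq_node *n = leaf; n != NULL; n = n->parent) {
        bool busy = atomic_flag_test_and_set_explicit(&n->fqslock,
                                                      memory_order_acquire);
        if (held != NULL)
            atomic_flag_clear_explicit(&held->fqslock,
                                       memory_order_release);
        if (busy)
            return false; /* someone else is already ahead of us */
        held = n;
    }
    return true;
}

/* Release the root lock obtained by a winning demo_funnel_trylock(). */
static void demo_funnel_unlock(struct demo_fq_node *root)
{
    atomic_flag_clear_explicit(&root->fqslock, memory_order_release);
}
```

Because losers drop out at the lowest contended node, most failed attempts touch only a leaf's cache line rather than a single global lock word.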
    • rcu: Adjust debugfs tracing for kthread-based quiescent-state forcing · 4605c014
      Paul E. McKenney authored
      Moving quiescent-state forcing into a kthread dispenses with the need
      for the ->n_rp_need_fqs field, so this commit removes it.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      4605c014
    • rcu: Allow RCU quiescent-state forcing to be preempted · b4be093f
      Paul E. McKenney authored
      RCU quiescent-state forcing is currently carried out without preemption
      points, which can result in excessive latency spikes on large systems
      (many hundreds or thousands of CPUs).  This patch therefore inserts
      a voluntary preemption point into force_qs_rnp(), which should greatly
      reduce the magnitude of these spikes.
      Reported-by: Mike Galbraith <mgalbraith@suse.de>
      Reported-by: Dimitri Sivanich <sivanich@sgi.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      b4be093f
    • rcu: Move quiescent-state forcing into kthread · 4cdfc175
      Paul E. McKenney authored
      As the first step towards allowing quiescent-state forcing to be
      preemptible, this commit moves RCU quiescent-state forcing into the
      same kthread that is now used to initialize and clean up after grace
      periods.  This is yet another step towards keeping scheduling
      latency down to a dull roar.
      
      Updated to change from raw_spin_lock_irqsave() to raw_spin_lock_irq()
      and to remove the now-unused rcu_state structure fields as suggested by
      Peter Zijlstra.
      Reported-by: Mike Galbraith <mgalbraith@suse.de>
      Reported-by: Dimitri Sivanich <sivanich@sgi.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      4cdfc175
    • rcu: Segregate rcu_state fields to improve cache locality · b402b73b
      Dimitri Sivanich authored
      The fields in the rcu_state structure that are protected by the
      root rcu_node structure's ->lock can share a cache line with the
      fields protected by ->onofflock.  This can result in excessive
      memory contention on large systems, so this commit applies
      ____cacheline_internodealigned_in_smp to the ->onofflock field in
      order to segregate them.
      Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Dimitri Sivanich <sivanich@sgi.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      b402b73b
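The effect of the alignment annotation can be sketched in portable C11 with _Alignas. The structure below is a hypothetical stand-in for rcu_state, the 64-byte line size is an assumption, and the kernel's ____cacheline_internodealigned_in_smp macro may align to a different, configuration-dependent boundary.

```c
#include <stddef.h>

#define DEMO_CACHE_LINE 64 /* assumed cache-line size in bytes */

struct demo_state {
    /* Stand-ins for fields protected by the root node's ->lock. */
    unsigned long gpnum;
    unsigned long completed;

    /* Aligning this member starts a fresh cache line. */
    _Alignas(DEMO_CACHE_LINE) int onofflock;

    /* Hypothetical field protected by onofflock. */
    unsigned long orphan_count;
};
```

With this layout, a CPU spinning on onofflock no longer invalidates the cache line that holds gpnum and completed, which is the contention the commit sets out to remove.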
    • rcu: Provide OOM handler to motivate lazy RCU callbacks · b626c1b6
      Paul E. McKenney authored
      In kernels built with CONFIG_RCU_FAST_NO_HZ=y, CPUs can accumulate a
      large number of lazy callbacks, which as the name implies will be slow
      to be invoked.  This can be a problem on small-memory systems, where the
      default 6-second sleep for CPUs having only lazy RCU callbacks could well
      be fatal.  This commit therefore installs an OOM handler that ensures that
      every CPU with lazy callbacks has at least one non-lazy callback, in turn
      ensuring timely advancement for these callbacks.
      
      Updated to fix bug that disabled OOM killing, noted by Lai Jiangshan.
      
      Updated to push the for_each_rcu_flavor() loop into rcu_oom_notify_cpu(),
      thus reducing the number of IPIs, as suggested by Steven Rostedt.  Also
      to make the for_each_online_cpu() loop be preemptible.  (Later, it might
      be good to use smp_call_function(), as suggested by Peter Zijlstra.)
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Sasha Levin <levinsasha928@gmail.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      b626c1b6
    • rcu: Prevent offline CPUs from executing RCU core code · bfa00b4c
      Paul E. McKenney authored
      Earlier versions of RCU invoked the RCU core from the CPU_DYING notifier
      in order to note a quiescent state for the outgoing CPU.  Because the
      CPU is marked "offline" during the execution of the CPU_DYING notifiers,
      the RCU core had to tolerate being invoked from an offline CPU.  However,
      commit b1420f1c (Make rcu_barrier() less disruptive) left only tracing
      code in the CPU_DYING notifier, so the RCU core need no longer execute
      on offline CPUs.  This commit therefore enforces this restriction.
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      bfa00b4c
    • rcu: Break up rcu_gp_kthread() into subfunctions · 7fdefc10
      Paul E. McKenney authored
      The rcu_gp_kthread() function is too large and furthermore needs to
      have the force_quiescent_state() code pulled in.  This commit therefore
      breaks up rcu_gp_kthread() into rcu_gp_init() and rcu_gp_cleanup().
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      7fdefc10
    • rcu: Allow RCU grace-period cleanup to be preempted · c856bafa
      Paul E. McKenney authored
      RCU grace-period cleanup is currently carried out with interrupts
      disabled, which can result in excessive latency spikes on large systems
      (many hundreds or thousands of CPUs).  This patch therefore makes the
      RCU grace-period cleanup be preemptible, including voluntary preemption
      points, which should eliminate those latency spikes.  Similar spikes from
      forcing of quiescent states will be dealt with similarly by later patches.
      
      Updated to replace uses of spin_lock_irqsave() with spin_lock_irq(), as
      suggested by Peter Zijlstra.
      Reported-by: Mike Galbraith <mgalbraith@suse.de>
      Reported-by: Dimitri Sivanich <sivanich@sgi.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      c856bafa
    • rcu: Move RCU grace-period cleanup into kthread · cabc49c1
      Paul E. McKenney authored
      As a first step towards allowing grace-period cleanup to be preemptible,
      this commit moves the RCU grace-period cleanup into the same kthread
      that is now used to initialize grace periods.  This is needed to keep
      scheduling latency down to a dull roar.
      
      [ paulmck: Get rid of stray spin_lock_irqsave() calls. ]
      Reported-by: Mike Galbraith <mgalbraith@suse.de>
      Reported-by: Dimitri Sivanich <sivanich@sgi.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      cabc49c1
    • rcu: Allow RCU grace-period initialization to be preempted · 755609a9
      Paul E. McKenney authored
      RCU grace-period initialization is currently carried out with interrupts
      disabled, which can result in 200-microsecond latency spikes on systems
      on which RCU has been configured for 4096 CPUs.  This patch therefore
      makes the RCU grace-period initialization be preemptible, which should
      eliminate those latency spikes.  Similar spikes from grace-period cleanup
      and the forcing of quiescent states will be dealt with similarly by later
      patches.
      Reported-by: Mike Galbraith <mgalbraith@suse.de>
      Reported-by: Dimitri Sivanich <sivanich@sgi.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      755609a9
    • rcu: Prevent initialization-time quiescent-state race · 79bce672
      Paul E. McKenney authored
      The next step in reducing RCU's grace-period initialization latency on
      large systems will make this initialization preemptible.  Unfortunately,
      making the grace-period initialization subject to interrupts (let alone
      preemption) exposes the following race on systems whose rcu_node tree
      contains more than one node:
      
      1.	CPU 31 starts initializing the grace period, including the
          	first leaf rcu_node structures, and is then preempted.
      
      2.	CPU 0 refers to the first leaf rcu_node structure, and notes
          	that a new grace period has started.  It passes through a
          	quiescent state shortly thereafter, and informs the RCU core
          	of this rite of passage.
      
      3.	CPU 0 enters an RCU read-side critical section, acquiring
          	a pointer to an RCU-protected data item.
      
      4.	CPU 31 takes an interrupt whose handler removes the data item
      	referenced by CPU 0 from the data structure, and registers an
      	RCU callback in order to free it.
      
      5.	CPU 31 resumes initializing the grace period, including its
          	own rcu_node structure.  It invokes rcu_start_gp_per_cpu(),
          	which advances all callbacks, including the one registered
          	in #4 above, to be handled by the current grace period.
      
      6.	The remaining CPUs pass through quiescent states and inform
          	the RCU core, but CPU 0 remains in its RCU read-side critical
          	section, still referencing the now-removed data item.
      
      7.	The grace period completes and all the callbacks are invoked,
          	including the one that frees the data item that CPU 0 is still
          	referencing.  Oops!!!
      
      One way to avoid this race is to remove grace-period acceleration from
      rcu_start_gp_per_cpu().  Now, the only reason for this acceleration was
      to allow CPUs bringing RCU out of idle state to have their callbacks
      invoked after only one grace period, rather than the two grace periods
      that would otherwise be required.  But this acceleration does not
      work when RCU grace-period initialization is moved to a kthread because
      the CPU posting the callback is no longer necessarily the CPU that is
      initializing the resulting grace period.
      
      This commit therefore removes this now-pointless (and soon to be dangerous)
      grace-period acceleration, thus avoiding the above race.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      79bce672
    • rcu: Move RCU grace-period initialization into a kthread · b3dbec76
      Paul E. McKenney authored
      As the first step towards allowing grace-period initialization to be
      preemptible, this commit moves the RCU grace-period initialization
      into its own kthread.  This is needed to keep large-system scheduling
      latency at reasonable levels.
      
      Also change raw_spin_lock_irqsave() to raw_spin_lock_irq() as suggested
      by Peter Zijlstra in review comments.
      Reported-by: Mike Galbraith <mgalbraith@suse.de>
      Reported-by: Dimitri Sivanich <sivanich@sgi.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      b3dbec76
    • rcu: Fix day-one dyntick-idle stall-warning bug · a10d206e
      Paul E. McKenney authored
      Each grace period is supposed to have at least one callback waiting
      for that grace period to complete.  However, if CONFIG_NO_HZ=n, an
      extra callback-free grace period is no big problem -- it will chew up
      a tiny bit of CPU time, but it will complete normally.  In contrast,
      CONFIG_NO_HZ=y kernels have the potential for all the CPUs to go to
      sleep indefinitely, in turn indefinitely delaying completion of the
      callback-free grace period.  Given that nothing is waiting on this grace
      period, this is also not a problem.
      
      That is, unless RCU CPU stall warnings are also enabled, as they are
      in recent kernels.  In this case, if a CPU wakes up after at least one
      minute of inactivity, an RCU CPU stall warning will result.  The reason
      that no one noticed until quite recently is that most systems have enough
      OS noise that they will never remain absolutely idle for a full minute.
      But there are some embedded systems with cut-down userspace configurations
      that consistently get into this situation.
      
      All this begs the question of exactly how a callback-free grace period
      gets started in the first place.  This can happen due to the fact that
      CPUs do not necessarily agree on which grace period is in progress.
      If a CPU still believes that the grace period that just completed is
      still ongoing, it will believe that it has callbacks that need to wait for
      another grace period, never mind the fact that the grace period that they
      were waiting for just completed.  This CPU can therefore erroneously
      decide to start a new grace period.  Note that this can happen in
      TREE_RCU and TREE_PREEMPT_RCU even on a single-CPU system:  Deadlock
      considerations mean that the CPU that detected the end of the grace
      period is not necessarily officially informed of this fact for some time.
      
      Once this CPU notices that the earlier grace period completed, it will
      invoke its callbacks.  It then won't have any callbacks left.  If no
      other CPU has any callbacks, we now have a callback-free grace period.
      
      This commit therefore makes CPUs check more carefully before starting a
      new grace period.  This new check relies on an array of tail pointers
      into each CPU's list of callbacks.  If the CPU is up to date on which
      grace periods have completed, it checks to see if any callbacks follow
      the RCU_DONE_TAIL segment, otherwise it checks to see if any callbacks
      follow the RCU_WAIT_TAIL segment.  The reason that this works is that
      the RCU_WAIT_TAIL segment will be promoted to the RCU_DONE_TAIL segment
      as soon as the CPU is officially notified that the old grace period
      has ended.
      
      This change is to cpu_needs_another_gp(), which is called in a number
      of places.  The only one that really matters is in rcu_start_gp(), where
      the root rcu_node structure's ->lock is held, which prevents any
      other CPU from starting or completing a grace period, so that the
      comparison that determines whether the CPU is missing the completion
      of a grace period is stable.
      Reported-by: Becky Bruce <bgillbruce@gmail.com>
      Reported-by: Subodh Nijsure <snijsure@grid-net.com>
      Reported-by: Paul Walmsley <paul@pwsan.com>
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Paul Walmsley <paul@pwsan.com>  # OMAP3730, OMAP4430
      Cc: stable@vger.kernel.org
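
      The tail-pointer check described in this commit message can be
      illustrated with a small userspace model.  This is a sketch under
      stated assumptions, not the kernel's code: the cb and cpu_cbs
      structures, the three-segment layout, and every function name are
      hypothetical, and needs_gp_naive() is only an assumed example of a
      less careful check.  Only the callback test is modeled; locking and
      grace-period state are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; names are illustrative, not the kernel's. */
struct cb {
	struct cb *next;
};

/*
 * Segment order: callbacks ready to invoke (DONE), callbacks waiting on
 * the current grace period (WAIT), callbacks needing a later one (NEXT).
 */
enum { DONE_TAIL, WAIT_TAIL, NEXT_TAIL, NR_TAILS };

struct cpu_cbs {
	struct cb *head;
	struct cb **tail[NR_TAILS]; /* tail[i]: the ->next slot ending segment i */
	unsigned long completed;    /* last grace period this CPU was told about */
};

void cpu_cbs_init(struct cpu_cbs *c, unsigned long completed)
{
	c->head = NULL;
	for (int i = 0; i < NR_TAILS; i++)
		c->tail[i] = &c->head;
	c->completed = completed;
}

/* Queue a new callback at the end of the list (the NEXT segment). */
void cpu_cbs_enqueue(struct cpu_cbs *c, struct cb *cb)
{
	assert(cb != NULL);
	cb->next = NULL;
	*c->tail[NEXT_TAIL] = cb;
	c->tail[NEXT_TAIL] = &cb->next;
}

/* A grace period starts: everything queued so far now waits on it. */
void cpu_cbs_start_waiting(struct cpu_cbs *c)
{
	c->tail[WAIT_TAIL] = c->tail[NEXT_TAIL];
}

/* Assumed naive test: any callback past the DONE segment requests a GP. */
bool needs_gp_naive(const struct cpu_cbs *c)
{
	return *c->tail[DONE_TAIL] != NULL;
}

/*
 * Careful test: if this CPU has not yet noticed that a grace period
 * completed, its WAIT segment is about to become DONE, so only callbacks
 * following WAIT justify a new grace period.
 */
bool needs_gp_careful(const struct cpu_cbs *c, unsigned long global_completed)
{
	int seg = (global_completed != c->completed) ? WAIT_TAIL : DONE_TAIL;

	return *c->tail[seg] != NULL;
}
```

      In this model, a CPU whose only callbacks sit in the WAIT segment,
      and which has not yet been told that their grace period ended, asks
      for a new grace period under the naive test but not under the careful
      one.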
  2. 12 Sep, 2012 1 commit
  3. 08 Sep, 2012 3 commits
    • Linux 3.6-rc5 · 55d512e2
      Linus Torvalds authored
    • Merge branch 'fixes-for-3.6' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping · 32d687ca
      Linus Torvalds authored
      Pull DMA-mapping fixes from Marek Szyprowski:
       "Another set of fixes for ARM dma-mapping subsystem.
      
        Commit e9da6e99 replaced custom consistent buffer remapping code
        with generic vmalloc areas.  It however introduced some regressions
        caused by limited support for allocations in atomic context.  This
        series contains fixes for those regressions.
      
        For some subplatforms the default, pre-allocated pool for atomic
        allocations turned out to be too small, so a function for setting its
        size has been added.
      
        Another set of patches adds support for atomic allocations to
        IOMMU-aware DMA-mapping implementation.
      
        The last part of this pull request contains two fixes for Contiguous
        Memory Allocator, which relax too strict requirements."
      
      * 'fixes-for-3.6' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping:
        ARM: dma-mapping: IOMMU allocates pages from atomic_pool with GFP_ATOMIC
        ARM: dma-mapping: Introduce __atomic_get_pages() for __iommu_get_pages()
        ARM: dma-mapping: Refactor out to introduce __in_atomic_pool
        ARM: dma-mapping: atomic_pool with struct page **pages
        ARM: Kirkwood: increase atomic coherent pool size
        ARM: DMA-Mapping: print warning when atomic coherent allocation fails
        ARM: DMA-Mapping: add function for setting coherent pool size from platform code
        ARM: relax conditions required for enabling Contiguous Memory Allocator
        mm: cma: fix alignment requirements for contiguous regions
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input · 11be4bc6
      Linus Torvalds authored
      Pull input subsystem updates from Dmitry Torokhov.
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
        Input: wacom - add support for EMR on Cintiq 24HD touch
        Input: i8042 - add Gigabyte T1005 series netbooks to noloop table
        Input: imx_keypad - reset the hardware before enabling
        Input: edt-ft5x06 - fix build error when compiling wthout CONFIG_DEBUG_FS
  4. 07 Sep, 2012 4 commits
  5. 06 Sep, 2012 8 commits
    • Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc · eeea3ac9
      Linus Torvalds authored
      Pull ARM SoC bug fixes from Olof Johansson:
       "Mostly Renesas and Atmel bugfixes this time, targeting boot and build
        problems.  A couple of patches for gemini and kirkwood as well.  On
        the whole nothing very controversial."
      
      * tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
        ARM: gemini: fix the gemini build
        ARM: shmobile: armadillo800eva: enable rw rootfs mount
        ARM: Kirkwood: Fix 'SZ_1M' undeclared here for db88f6281-bp-setup.c
        ARM: shmobile: mackerel: fixup usb module order
        ARM: shmobile: armadillo800eva: fixup: sound card detection order
        ARM: shmobile: marzen: fixup smsc911x id for regulator
        ARM: at91/feature-removal-schedule: delay at91_mci removal
        ARM: mach-shmobile: armadillo800eva: Enable power button as wakeup source
        ARM: mach-shmobile: armadillo800eva: Fix GPIO buttons descriptions
        ARM: at91/dts: remove partial parameter in at91sam9g25ek.dts
        ARM: at91/clock: fix PLLA overclock warning
        ARM: at91: fix rtc-at91sam9 irq issue due to sparse irq support
        ARM: at91: fix system timer irq issue due to sparse irq support
        ARM: shmobile: sh73a0: fixup RELOC_BASE of intca_irq_pins_desc
    • Merge tag 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging · c7c6bf1e
      Linus Torvalds authored
      Pull a hwmon fix from Guenter Roeck:
       "One patch, fixing DIV_ROUND_CLOSEST to support negative dividends.
      
        While the changes are not in the drivers/hwmon directory, the problem
        primarily affects hwmon drivers, and it makes sense to push the patch
        through the hwmon tree."
      
      * tag 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
        linux/kernel.h: Fix DIV_ROUND_CLOSEST to support negative dividends
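
      To show why negative dividends need separate handling in
      round-to-nearest integer division, here is a minimal sketch.  It is
      not the kernel macro: both function names are hypothetical, the
      divisor is assumed to be positive, and the add-half form is only an
      assumed example of a variant that works for non-negative dividends.

```c
#include <assert.h>

/*
 * Add half the divisor, then divide.  Because C integer division
 * truncates toward zero, this rounds correctly only when x >= 0.
 */
long div_round_closest_nonneg(long x, long divisor)
{
	return (x + divisor / 2) / divisor;
}

/*
 * Sign-aware variant: move the dividend away from zero by half the
 * divisor before the truncating division.  Assumes divisor > 0.
 */
long div_round_closest_signed(long x, long divisor)
{
	assert(divisor > 0);
	if (x >= 0)
		return (x + divisor / 2) / divisor;
	return (x - divisor / 2) / divisor;
}
```

      For -9 divided by 4 (exactly -2.25), the add-half form computes
      (-9 + 2) / 4 = -7 / 4, which truncates to -1, while the sign-aware
      form computes (-9 - 2) / 4 = -11 / 4, which truncates to the expected
      -2.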
    • Merge branch 'rc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild · bd12ce8c
      Linus Torvalds authored
      Pull kbuild fixes from Michal Marek:
       "These are two fixes that should go into 3.6.  The link-vmlinux.sh one
        is obvious.
      
        The other one fixes make firmware_install with certain configurations,
        where a file in the toplevel firmware tree gets installed first, and
        $(INSTALL_FW_PATH)/$$(dir <file>) results in /lib/firmware/./, which
        confuses make 3.82 for some reason."
      
      * 'rc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
        firmware: fix directory creation rule matching with make 3.82
        link-vmlinux.sh: Fix stray "echo" in error message
    • Remove user-triggerable BUG from mpol_to_str · 80de7c31
      Dave Jones authored
      Trivially triggerable, found by trinity:
      
        kernel BUG at mm/mempolicy.c:2546!
        Process trinity-child2 (pid: 23988, threadinfo ffff88010197e000, task ffff88007821a670)
        Call Trace:
          show_numa_map+0xd5/0x450
          show_pid_numa_map+0x13/0x20
          traverse+0xf2/0x230
          seq_read+0x34b/0x3e0
          vfs_read+0xac/0x180
          sys_pread64+0xa2/0xc0
          system_call_fastpath+0x1a/0x1f
        RIP: mpol_to_str+0x156/0x360
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Dave Jones <davej@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • xen/pciback: Fix proper FLR steps. · 80ba77df
      Konrad Rzeszutek Wilk authored
      When we did an FLR and saved the PCI configuration, we did so in the
      wrong order.  As a result, if a PCI device was unbound from its driver,
      then bound to xen-pciback, and then bound back to its driver, we would
      get:
      
      > lspci -s 04:00.0
      04:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
      13:42:12 # 4 :~/
      > echo "0000:04:00.0" > /sys/bus/pci/drivers/pciback/unbind
      > modprobe e1000e
      e1000e: Intel(R) PRO/1000 Network Driver - 2.0.0-k
      e1000e: Copyright(c) 1999 - 2012 Intel Corporation.
      e1000e 0000:04:00.0: Disabling ASPM L0s L1
      e1000e 0000:04:00.0: enabling device (0000 -> 0002)
      xen: registering gsi 48 triggering 0 polarity 1
      Already setup the GSI :48
      e1000e 0000:04:00.0: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
      e1000e: probe of 0000:04:00.0 failed with error -2
      
      This fixes it by first saving the PCI configuration space, then
      doing the FLR.
      Reported-by: Ren, Yongjie <yongjie.ren@intel.com>
      Reported-and-Tested-by: Tobias Geiger <tobias.geiger@vido.info>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      CC: stable@vger.kernel.org
    • Merge tag 'mmc-fixes-for-3.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc · 08090950
      Linus Torvalds authored
      Pull MMC fixes from Chris Ball:
       - a firmware bug on several Samsung MoviNAND eMMC models causes
         permanent corruption on the device when secure erase and secure trim
         requests are made, so we disable those requests on these eMMC devices.
       - atmel-mci: fix a hang with some SD cards by waiting for not-busy flag.
       - dw_mmc: low-power mode breaks SDIO interrupts; fix PIO error handling;
         fix handling of error interrupts.
       - mxs-mmc: fix deadlocks; fix compile error due to dma.h arch change.
       - omap: fix broken PIO mode causing memory corruption.
       - sdhci-esdhc: fix card detection.
      
      * tag 'mmc-fixes-for-3.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc:
        mmc: omap: fix broken PIO mode
        mmc: card: Skip secure erase on MoviNAND; causes unrecoverable corruption.
        mmc: dw_mmc: Disable low power mode if SDIO interrupts are used
        mmc: dw_mmc: fix error handling in PIO mode
        mmc: dw_mmc: correct mishandling error interrupt
        mmc: dw_mmc: amend using error interrupt status
        mmc: atmel-mci: not busy flag has also to be used for read operations
        mmc: sdhci-esdhc: break out early if clock is 0
        mmc: mxs-mmc: fix deadlock caused by recursion loop
        mmc: mxs-mmc: fix deadlock in SDIO IRQ case
        mmc: bfin_sdh: fix dma_desc_array build error
    • uml: fix compile error in deliver_alarm() · bc6c8364
      Miklos Szeredi authored
      Fix the following compile error on UML.
      
        arch/um/os-Linux/time.c: In function 'deliver_alarm':
        arch/um/os-Linux/time.c:117:3: error: too few arguments to function 'alarm_handler'
        arch/um/os-Linux/internal.h:1:6: note: declared here
      
      The error was introduced by commit d3c1cfcd ("um: pass siginfo to guest
      process") in 3.6-rc1.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      CC: Martin Pärtel <martin.partel@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dj: memory scribble in logi_dj · 8a55ade7
      Alan Cox authored
      Allocate a structure not a pointer to it !
      Signed-off-by: Alan Cox <alan@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
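
      The class of bug named in this commit message, sizing an allocation
      by a pointer rather than by the structure it points to, can be
      sketched as follows.  This is an illustration only: struct report and
      all function names are hypothetical and are not taken from the driver.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical structure standing in for whatever the driver allocates. */
struct report {
	unsigned char id;
	unsigned char payload[31];
};

/* Buggy sizing: the size of a pointer, typically 4 or 8 bytes. */
size_t report_size_buggy(void)
{
	return sizeof(struct report *);
}

/* Correct sizing: the size of the structure itself. */
size_t report_size_correct(void)
{
	return sizeof(struct report);
}

/*
 * Writing sizeof(*r) ties the allocation size to the pointed-to type,
 * so it stays right even if the type of r changes later.
 */
struct report *report_alloc(void)
{
	struct report *r = calloc(1, sizeof(*r));

	return r;
}
```

      With the pointer-sized allocation, any write to a field beyond the
      first few bytes lands outside the buffer, which is the kind of
      memory scribble the commit title refers to.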