1. 18 Apr, 2017 27 commits
    • Paul E. McKenney's avatar
      srcu: Move rcu_node traversal macros to rcu.h · efbe451d
      Paul E. McKenney authored
      This commit moves rcu_for_each_node_breadth_first(),
      rcu_for_each_nonleaf_node_breadth_first(), and
      rcu_for_each_leaf_node() from kernel/rcu/tree.h to
      kernel/rcu/rcu.h so that SRCU can access them.
      This commit is code-movement only.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      efbe451d
    • Paul E. McKenney's avatar
      rcu: Remove redundant levelcnt[] array from rcu_init_one() · 41f5c631
      Paul E. McKenney authored
      The levelcnt[] array is identical to num_rcu_lvl[], so this commit
      removes levelcnt[].
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      41f5c631
    • Paul E. McKenney's avatar
      srcu: Move rcu_init_levelspread() to rcu_tree_node.h · 2b34c43c
      Paul E. McKenney authored
      This commit moves the rcu_init_levelspread() function from
      kernel/rcu/tree.c to kernel/rcu/rcu.h so that SRCU can access it.  This is
      another step towards enabling SRCU to create its own combining tree.
      This commit is code-movement only, give or take knock-on adjustments.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      2b34c43c
    • Paul E. McKenney's avatar
      srcu: Move combining-tree definitions for SRCU's benefit · f2425b4e
      Paul E. McKenney authored
      This commit moves the C preprocessor code that defines the default shape
      of the rcu_node combining tree to a new include/linux/rcu_node_tree.h
      file as a first step towards enabling SRCU to create its own combining
      tree, which in turn enables SRCU to implement per-CPU callback handling,
      thus avoiding contention on the lock currently guarding the single list
      of callbacks.  Note that users of SRCU still need to know the size of
      the srcu_struct structure, hence include/linux rather than kernel/rcu.
      
      This commit is code-movement only.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      f2425b4e
    • Paul E. McKenney's avatar
      srcu: Use rcu_segcblist to track SRCU callbacks · 8660b7d8
      Paul E. McKenney authored
      This commit switches SRCU from custom-built callback queues to the new
      rcu_segcblist structure.  This change associates grace-period sequence
      numbers with groups of callbacks, which will be needed for efficient
      processing of per-CPU callbacks.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      8660b7d8
    • Paul E. McKenney's avatar
      srcu: Add grace-period sequence numbers · ac367c1c
      Paul E. McKenney authored
      This commit adds grace-period sequence numbers, which will be used to
      handle mid-boot grace periods and per-CPU callback lists.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      ac367c1c
    • Paul E. McKenney's avatar
      srcu: Move to state-based grace-period sequencing · c2a8ec07
      Paul E. McKenney authored
      The current SRCU grace-period processing might never reach the last
      portion of srcu_advance_batches().  This is OK given the current
      implementation, as the first portion, up to the try_check_zero()
      following the srcu_flip() is sufficient to drive grace periods forward.
      However, it has the unfortunate side-effect of making it impossible to
      determine when a given grace period has ended, and it will be necessary
      to efficiently trace ends of grace periods in order to efficiently handle
      per-CPU SRCU callback lists.
      
      This commit therefore adds states to the SRCU grace-period processing,
      so that the end of a given SRCU grace period is marked by the transition
      to the SRCU_STATE_DONE state.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      c2a8ec07
    • Paul E. McKenney's avatar
      srcu: Push srcu_advance_batches() fastpath into common case · c6e56f59
      Paul E. McKenney authored
      This commit simplifies the SRCU state machine by pushing the
      srcu_advance_batches() idle-SRCU fastpath into the common case.  This is
      done by giving srcu_reschedule() a delay parameter, which is zero in
      the call from srcu_advance_batches().
      
      This commit is a step towards numbering callbacks in order to
      efficiently handle per-CPU callback lists.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      c6e56f59
    • Dmitry Vyukov's avatar
      rcu: Fix warning in rcu_seq_end() · f010ed82
      Dmitry Vyukov authored
      The rcu_seq_end() function increments seq to signify completion
      of a grace period, then checks that seq is even and wakes
      _synchronize_rcu_expedited().  The _synchronize_rcu_expedited() function
      uses wait_event() to wait for an even seq.  The problem is that wait_event()
      can return as soon as seq becomes even, without waiting for the wakeup.
      In that case the warning in rcu_seq_end() can fire falsely if the next
      expedited grace period starts before the check.
      
      Check that seq has a good value before incrementing it.
      Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: syzkaller@googlegroups.com
      Cc: linux-kernel@vger.kernel.org
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: josh@joshtriplett.org
      Cc: jiangshanlai@gmail.com
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      
      ---
      
      syzkaller-triggered warning:
      
      WARNING: CPU: 0 PID: 4832 at kernel/rcu/tree.c:3533
      rcu_seq_end+0x110/0x140 kernel/rcu/tree.c:3533
      CPU: 0 PID: 4832 Comm: kworker/0:3 Not tainted 4.10.0+ #276
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      Workqueue: events wait_rcu_exp_gp
      Call Trace:
       __dump_stack lib/dump_stack.c:15 [inline]
       dump_stack+0x2ee/0x3ef lib/dump_stack.c:51
       panic+0x1fb/0x412 kernel/panic.c:179
       __warn+0x1c4/0x1e0 kernel/panic.c:540
       warn_slowpath_null+0x2c/0x40 kernel/panic.c:583
       rcu_seq_end+0x110/0x140 kernel/rcu/tree.c:3533
       rcu_exp_gp_seq_end kernel/rcu/tree_exp.h:36 [inline]
       rcu_exp_wait_wake+0x8a9/0x1330 kernel/rcu/tree_exp.h:517
       rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:559 [inline]
       wait_rcu_exp_gp+0x83/0xc0 kernel/rcu/tree_exp.h:570
       process_one_work+0xc06/0x1c20 kernel/workqueue.c:2096
       worker_thread+0x223/0x19c0 kernel/workqueue.c:2230
       kthread+0x326/0x3f0 kernel/kthread.c:227
       ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430
      ---
      f010ed82
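The ordering issue described in this commit can be illustrated with a minimal userspace sketch. It assumes a plain single-threaded counter with no memory barriers, and the names `seq_start()`, `seq_end()`, and `seq_warnings` are hypothetical stand-ins (the last one for the kernel's warning), not the kernel's own code. An odd value means a grace period is in progress.

```c
#include <assert.h>

/* Hypothetical stand-in for a kernel warning: counts failed sanity checks. */
static int seq_warnings;

/* Begin a grace period: the counter becomes odd. */
static void seq_start(unsigned long *sp)
{
	*sp += 1;
	if (!(*sp & 0x1))
		seq_warnings++;
}

/*
 * End a grace period.  The sanity check runs BEFORE the increment,
 * while the counter is still odd and no waiter has been released,
 * so a subsequent seq_start() cannot slip in ahead of the check.
 */
static void seq_end(unsigned long *sp)
{
	if (!(*sp & 0x1))
		seq_warnings++;
	*sp += 1;
}
```

Checking after the increment would leave a window in which a released waiter starts the next grace period and makes the counter odd again, which is the false positive the commit message describes.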
    • Paul E. McKenney's avatar
      rcu: Expedited wakeups need to be fully ordered · 3c345825
      Paul E. McKenney authored
      Expedited grace periods use workqueue handlers that wake up the requesters,
      but there is no lock mediating this wakeup.  Therefore, memory barriers
      are required to ensure that the handler's memory references are seen by
      all to occur before synchronize_*_expedited() returns to its caller.
      Possibly detected by syzkaller.
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      3c345825
    • Paul E. McKenney's avatar
      srcu: Move rcu_seq_start() and friends to rcu.h · 2e8c28c2
      Paul E. McKenney authored
      This commit moves rcu_seq_start(), rcu_seq_end(), rcu_seq_snap(),
      and rcu_seq_done() from kernel/rcu/tree.c to kernel/rcu/rcu.h.
      This will allow SRCU to use these functions, which in turn will
      allow SRCU to move from a single global callback queue to a
      per-CPU callback queue.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      2e8c28c2
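As a rough illustration of how a snapshot/done pair over such a sequence counter can work, here is a userspace sketch. It assumes the convention that an odd counter means a grace period is in progress, omits all memory barriers, and uses hypothetical names (`seq_snap()`, `seq_done()`, `ulong_cmp_ge()`) rather than the kernel's definitions.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Wraparound-tolerant "a >= b" for free-running unsigned long counters. */
static bool ulong_cmp_ge(unsigned long a, unsigned long b)
{
	return ULONG_MAX / 2 >= a - b;
}

/*
 * Snapshot: the smallest even counter value that guarantees a full
 * grace period has elapsed after this call.  If a grace period is
 * already in progress (odd counter), it does not count, so the
 * snapshot points past the end of the following one.
 */
static unsigned long seq_snap(const unsigned long *sp)
{
	return (*sp + 3) & ~0x1UL;
}

/* Has the counter reached the value returned by seq_snap()? */
static bool seq_done(const unsigned long *sp, unsigned long s)
{
	return ulong_cmp_ge(*sp, s);
}
```

A requester takes a snapshot, arranges for a grace period to run, and polls or sleeps until `seq_done()` returns true. Tagging per-CPU callbacks with such a snapshot is one way to tell which of them a finished grace period covers.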
    • Paul E. McKenney's avatar
      rcu: Add single-element dequeue functions to rcu_segcblist · bdcabf4c
      Paul E. McKenney authored
      This commit adds single-element dequeue functions to rcu_segcblist.
      These are less efficient than using the extract and insert functions,
      but allow more precise debugging code.  These functions are thus
      expected to be used only in debug builds, for example, CONFIG_PROVE_RCU.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      bdcabf4c
    • Paul E. McKenney's avatar
      srcu: Allow early boot use of synchronize_srcu() · b5eaeaa5
      Paul E. McKenney authored
      This commit checks for pre-scheduler state, and if that early in the
      boot process, synchronize_srcu() and friends are no-ops.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      b5eaeaa5
    • Paul E. McKenney's avatar
      srcu: Allow SRCU to access rcu_scheduler_active · 900b1028
      Paul E. McKenney authored
      This is primarily a code-movement commit in preparation for allowing
      SRCU to handle early-boot SRCU grace periods.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      900b1028
    • Paul E. McKenney's avatar
      srcu: Abstract multi-tail callback list handling · 15fecf89
      Paul E. McKenney authored
      RCU has only one multi-tail callback list, which is implemented via
      the nxtlist, nxttail, nxtcompleted, qlen_lazy, and qlen fields in the
      rcu_data structure, and whose operations are open-coded throughout the
      Tree RCU implementation.  This has been more or less OK in the past,
      but upcoming callback-list optimizations in SRCU could really use
      a multi-tail callback list there as well.
      
      This commit therefore abstracts the multi-tail callback list handling
      into a new kernel/rcu/rcu_segcblist.h file, and uses this new API.
      The simple head-and-tail pointer callback list is also abstracted and
      applied everywhere except for the NOCB callback-offload lists.  (Yes,
      the plan is to apply them there as well, but this commit is already
      bigger than would be good.)
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      15fecf89
    • Paul E. McKenney's avatar
      rcu: Default RCU_FANOUT_LEAF to 16 unless explicitly changed · b8c78d3a
      Paul E. McKenney authored
      If the RCU_EXPERT Kconfig option is not set (the default), then the
      RCU_FANOUT_LEAF Kconfig option will not be defined, which will cause
      the leaf-level rcu_node tree fanout to default to 32 on 32-bit systems
      and 64 on 64-bit systems.  This can result in excessive lock contention.
      This commit therefore changes the computation of the leaf-level rcu_node
      tree fanout so that the result will be 16 unless an explicit Kconfig or
      kernel-boot setting says otherwise.
      Reported-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      b8c78d3a
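The effect of the leaf fanout on lock sharing can be sketched as follows. This is an illustration of the commit's intent under stated assumptions, not the kernel header itself: the fallback to 16 follows the commit message, while `leaf_nodes_needed()` is a hypothetical helper.

```c
#include <assert.h>

/* Use the Kconfig value when one is supplied, otherwise default to 16. */
#ifdef CONFIG_RCU_FANOUT_LEAF
#define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF
#else
#define RCU_FANOUT_LEAF 16
#endif

/*
 * Hypothetical helper: how many leaf rcu_node structures are needed
 * to cover ncpus CPUs when each leaf handles up to fanout_leaf CPUs.
 */
static int leaf_nodes_needed(int ncpus, int fanout_leaf)
{
	return (ncpus + fanout_leaf - 1) / fanout_leaf;
}
```

With a leaf fanout of 64, a 64-CPU system funnels every CPU through a single leaf node's lock. With 16, the same system spreads them over four leaf nodes.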
    • Paul E. McKenney's avatar
      rcu: Place guard on rcu_all_qs() and rcu_note_context_switch() actions · 9226b10d
      Paul E. McKenney authored
      The rcu_all_qs() and rcu_note_context_switch() functions do a series of checks,
      taking various actions to supply RCU with quiescent states, depending
      on the outcomes of the various checks.  This is a bit much for scheduling
      fastpaths, so this commit creates a separate ->rcu_urgent_qs field in
      the rcu_dynticks structure that acts as a global guard for these checks.
      Thus, in the common case, rcu_all_qs() and rcu_note_context_switch()
      check the ->rcu_urgent_qs field, find it false, and simply return.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      9226b10d
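The guard pattern described here can be sketched in userspace as below. Only the ->rcu_urgent_qs field name comes from the commit message. The struct, the function name, the `slow_path_runs` counter, and the choice to clear the guard once the slow path has run are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, trimmed-down stand-in for the per-CPU state. */
struct dynticks_model {
	bool rcu_urgent_qs;	/* guard: does RCU urgently need a quiescent state? */
	int slow_path_runs;	/* illustrative counter for the expensive checks */
};

/* Model of a scheduling fastpath hook guarded by a single flag. */
static void note_context_switch_model(struct dynticks_model *dt)
{
	/* Common case: one load, one branch, and out. */
	if (!dt->rcu_urgent_qs)
		return;

	/* Rare case: run the detailed checks (modeled as a counter). */
	dt->rcu_urgent_qs = false;
	dt->slow_path_runs++;
}
```

The point of the design is that the fastpath cost no longer depends on how many individual conditions the slow path has to examine.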
    • Paul E. McKenney's avatar
      rcu: Eliminate flavor scan in rcu_momentary_dyntick_idle() · 0f9be8ca
      Paul E. McKenney authored
      The rcu_momentary_dyntick_idle() function scans the RCU flavors, checking
      that one of them still needs a quiescent state before doing an expensive
      atomic operation on the ->dynticks counter.  However, this check reduces
      overhead only after a rare race condition, and increases complexity.  This
      commit therefore removes the scan and the mechanism enabling the scan.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      0f9be8ca
    • Paul E. McKenney's avatar
      rcu: Pull rcu_qs_ctr into rcu_dynticks structure · 9577df9a
      Paul E. McKenney authored
      The rcu_qs_ctr variable is yet another isolated per-CPU variable,
      so this commit pulls it into the pre-existing rcu_dynticks per-CPU
      structure.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      9577df9a
    • Paul E. McKenney's avatar
      rcu: Pull rcu_sched_qs_mask into rcu_dynticks structure · abb06b99
      Paul E. McKenney authored
      The rcu_sched_qs_mask variable is yet another isolated per-CPU variable,
      so this commit pulls it into the pre-existing rcu_dynticks per-CPU
      structure.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      abb06b99
    • Paul E. McKenney's avatar
      rcu: Semicolon inside RCU_TRACE() for tree.c · 88a4976d
      Paul E. McKenney authored
      The current use of "RCU_TRACE(statement);" can cause odd bugs, especially
      where "statement" is a local-variable declaration, as it can leave a
      misplaced ";" in the source code.  This commit therefore converts these
      to "RCU_TRACE(statement;)", which avoids the misplaced ";".
      Reported-by: Josh Triplett <josh@joshtriplett.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      88a4976d
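A small sketch of why the semicolon placement matters, assuming tracing is disabled so that the macro discards its argument. The `sum_to()` function and its `calls` variable are hypothetical and exist only to exercise the macro.

```c
#include <assert.h>

/* Tracing disabled: the macro expands to nothing. */
#define RCU_TRACE(stmt)

static int sum_to(int n)
{
	/*
	 * New style: the semicolon is part of the macro argument, so
	 * this line vanishes entirely.  The old style,
	 * "RCU_TRACE(int calls = 0);", would leave a bare ";" here,
	 * that is, an empty statement ahead of the declarations below.
	 */
	RCU_TRACE(int calls = 0;)
	int total = 0;
	int i;

	for (i = 1; i <= n; i++)
		total += i;
	RCU_TRACE(calls++;)
	return total;
}
```

Under a build that warns about declarations following statements, the stray empty statement from the old style would turn `int total` into a declaration after a statement.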
    • Paul E. McKenney's avatar
      rcu: Semicolon inside RCU_TRACE() for Tiny RCU · 6c8c1485
      Paul E. McKenney authored
      The current use of "RCU_TRACE(statement);" can cause odd bugs, especially
      where "statement" is a local-variable declaration, as it can leave a
      misplaced ";" in the source code.  This commit therefore converts these
      to "RCU_TRACE(statement;)", which avoids the misplaced ";".
      Reported-by: Josh Triplett <josh@joshtriplett.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      6c8c1485
    • Paul E. McKenney's avatar
      rcu: Semicolon inside RCU_TRACE() for rcu.h · dffd06a7
      Paul E. McKenney authored
      The current use of "RCU_TRACE(statement);" can cause odd bugs, especially
      where "statement" is a local-variable declaration, as it can leave a
      misplaced ";" in the source code.  This commit therefore converts these
      to "RCU_TRACE(statement;)", which avoids the misplaced ";".
      Reported-by: Josh Triplett <josh@joshtriplett.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      dffd06a7
    • Paul E. McKenney's avatar
      srcu: Check for tardy grace-period activity in cleanup_srcu_struct() · 15c68f7f
      Paul E. McKenney authored
      Users of SRCU are obliged to complete all grace-period activity before
      invoking cleanup_srcu_struct().  This means that all calls to either
      synchronize_srcu() or synchronize_srcu_expedited() must have returned,
      and all calls to call_srcu() must have returned, and the last call to
      call_srcu() must have been followed by a call to srcu_barrier().
      Furthermore, the caller must have done something to prevent any
      further calls to synchronize_srcu(), synchronize_srcu_expedited(),
      and call_srcu().
      
      Therefore, if there has ever been an invocation of call_srcu() on
      the srcu_struct in question, the sequence of events must be as
      follows:
      
      1.  Prevent any further calls to call_srcu().
      2.  Wait for any pre-existing call_srcu() invocations to return.
      3.  Invoke srcu_barrier().
      4.  It is now safe to invoke cleanup_srcu_struct().
      
      On the other hand, if there has ever been a call to synchronize_srcu()
      or synchronize_srcu_expedited(), the sequence of events must be as
      follows:
      
      1.  Prevent any further calls to synchronize_srcu() or
          synchronize_srcu_expedited().
      2.  Wait for any pre-existing synchronize_srcu() or
          synchronize_srcu_expedited() invocations to return.
      3.  It is now safe to invoke cleanup_srcu_struct().
      
      If there have been calls to both types of functions (call_srcu()
      and either of synchronize_srcu() and synchronize_srcu_expedited()), then
      the caller must do the first three steps of the call_srcu() procedure
      above and the first two steps of the synchronize_s*() procedure above,
      and only then invoke cleanup_srcu_struct().
      
      Note that cleanup_srcu_struct() does some probabilistic checks
      for the caller failing to follow these procedures, in which case
      cleanup_srcu_struct() does WARN_ON() and avoids freeing the per-CPU
      structures associated with the specified srcu_struct structure.
      Reported-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      15c68f7f
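The call_srcu() teardown sequence above can be modeled in userspace as follows. Everything here is a hypothetical mock: the `srcu_model` struct and the `model_*()` functions only imitate the ordering rules from the commit message and are not the kernel's SRCU API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mock of the state that matters for teardown ordering. */
struct srcu_model {
	bool accepting;		/* may new callbacks still be queued? */
	int pending_cbs;	/* callbacks queued but not yet invoked */
	bool freed;		/* has the per-CPU data been released? */
	int warnings;		/* stand-in for WARN_ON() */
};

/* Mock of call_srcu(): queue a callback if still permitted. */
static bool model_call_srcu(struct srcu_model *sp)
{
	if (!sp->accepting)
		return false;
	sp->pending_cbs++;
	return true;
}

/* Mock of srcu_barrier(): returns once all queued callbacks have run. */
static void model_srcu_barrier(struct srcu_model *sp)
{
	sp->pending_cbs = 0;
}

/* Mock of cleanup_srcu_struct(): warn and leak rather than free too early. */
static void model_cleanup(struct srcu_model *sp)
{
	if (sp->pending_cbs) {
		sp->warnings++;
		return;
	}
	sp->freed = true;
}

/* The four-step sequence for users of call_srcu(). */
static void model_teardown(struct srcu_model *sp)
{
	sp->accepting = false;		/* 1. prevent further calls */
	/* 2. wait for in-flight callers to return (trivial in this mock) */
	model_srcu_barrier(sp);		/* 3. wait for queued callbacks */
	model_cleanup(sp);		/* 4. now safe to clean up */
}
```

Skipping the barrier step leaves a callback pending, so the mock cleanup warns and declines to free, mirroring the behavior the commit message describes for cleanup_srcu_struct().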
    • Paul E. McKenney's avatar
      srcu: Consolidate batch checking into rcu_all_batches_empty() · cc985822
      Paul E. McKenney authored
      The srcu_reschedule() function invokes rcu_batch_empty() on each of
      the four rcu_batch structures in the srcu_struct in question twice.
      Given that this check will also be needed in cleanup_srcu_struct(), this
      commit consolidates these four checks into a new rcu_all_batches_empty()
      function.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      cc985822
    • Paul E. McKenney's avatar
      rcu: Make arch select smp_mb__after_unlock_lock() strength · 77e58496
      Paul E. McKenney authored
      The definition of smp_mb__after_unlock_lock() is currently smp_mb()
      for CONFIG_PPC and a no-op otherwise.  It would be better to instead
      provide an architecture-selectable Kconfig option, and select the
      strength of smp_mb__after_unlock_lock() based on that option.  This
      commit therefore creates ARCH_WEAK_RELEASE_ACQUIRE, has PPC select it,
      and bases the definition of smp_mb__after_unlock_lock() on this new
      ARCH_WEAK_RELEASE_ACQUIRE Kconfig option.
      Reported-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Boqun Feng <boqun.feng@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Cc: <linuxppc-dev@lists.ozlabs.org>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      77e58496
    • Paul E. McKenney's avatar
      rcu: Maintain special bits at bottom of ->dynticks counter · b8c17e66
      Paul E. McKenney authored
      Currently, IPIs are used to force other CPUs to invalidate their TLBs
      in response to a kernel virtual-memory mapping change.  This works, but
      degrades both battery lifetime (for idle CPUs) and real-time response
      (for nohz_full CPUs), and in addition results in unnecessary IPIs due to
      the fact that CPUs executing in usermode are unaffected by stale kernel
      mappings.  It would be better to cause a CPU executing in usermode to
      wait until it is entering kernel mode to do the flush, first to avoid
      interrupting usermode tasks and second to handle multiple flush requests
      with a single flush in the case of a long-running user task.
      
      This commit therefore reserves a bit at the bottom of the ->dynticks
      counter, which is checked upon exit from extended quiescent states.
      If it is set, it is cleared and then a new rcu_eqs_special_exit() macro is
      invoked, which, if not supplied, is an empty single-pass do-while loop.
      If this bottom bit is set on -entry- to an extended quiescent state,
      then a WARN_ON_ONCE() triggers.
      
      This bottom bit may be set using a new rcu_eqs_special_set() function,
      which returns true if the bit was set, or false if the CPU turned
      out to not be in an extended quiescent state.  Please note that this
      function refuses to set the bit for a non-nohz_full CPU when that CPU
      is executing in usermode because usermode execution is tracked by RCU
      as a dyntick-idle extended quiescent state only for nohz_full CPUs.
      Reported-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Signed-off-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: default avatarJosh Triplett <josh@joshtriplett.org>
      b8c17e66
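The reserved-bit scheme can be sketched with C11 atomics. This is a single-CPU userspace model under stated assumptions: the bottom bit is the request flag, the counter advances in steps of two so that the next bit up tells whether the CPU is in an extended quiescent state (EQS), and every name below (`CTRL_MASK`, `CTRL_CTR`, `eqs_enter()`, `eqs_exit()`, `eqs_special_set()`, and the two counters) is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define CTRL_MASK 0x1	/* reserved bottom bit: special action requested */
#define CTRL_CTR  0x2	/* counter step; this bit is clear while in an EQS */

static atomic_int dynticks = CTRL_CTR;	/* start outside any EQS */
static int special_exits;	/* stand-in for the special-exit hook */
static int entry_warnings;	/* stand-in for the warning on EQS entry */

/* Enter an EQS: the request bit must not already be set. */
static void eqs_enter(void)
{
	int seq = atomic_fetch_add(&dynticks, CTRL_CTR) + CTRL_CTR;

	if (seq & CTRL_MASK)
		entry_warnings++;
}

/* Exit an EQS: if a request was posted meanwhile, clear it and act on it. */
static void eqs_exit(void)
{
	int seq = atomic_fetch_add(&dynticks, CTRL_CTR) + CTRL_CTR;

	if (seq & CTRL_MASK) {
		atomic_fetch_and(&dynticks, ~CTRL_MASK);
		special_exits++;
	}
}

/* Post a request; fails if the CPU is not currently in an EQS. */
static bool eqs_special_set(void)
{
	int old = atomic_load(&dynticks);

	do {
		if (old & CTRL_CTR)
			return false;
	} while (!atomic_compare_exchange_weak(&dynticks, &old,
					       old | CTRL_MASK));
	return true;
}
```

A remote requester that gets `false` back knows the target is not in an extended quiescent state and has to fall back to another mechanism, such as the IPI the commit message aims to avoid.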
  2. 12 Mar, 2017 5 commits
    • Linus Torvalds's avatar
      Linux 4.11-rc2 · 4495c08e
      Linus Torvalds authored
      4495c08e
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · 56b24d1b
      Linus Torvalds authored
      Pull s390 fixes from Martin Schwidefsky:
      
       - four patches to get the new cputime code in shape for s390
      
       - add the new statx system call
      
       - a few bug fixes
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
        s390: wire up statx system call
        KVM: s390: Fix guest migration for huge guests resulting in panic
        s390/ipl: always use load normal for CCW-type re-IPL
        s390/timex: micro optimization for tod_to_ns
        s390/cputime: provide archicture specific cputime_to_nsecs
        s390/cputime: reset all accounting fields on fork
        s390/cputime: remove last traces of cputime_t
        s390: fix in-kernel program checks
        s390/crypt: fix missing unlock in ctr_paes_crypt on error path
      56b24d1b
    • Linus Torvalds's avatar
      Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 5a45a5a8
      Linus Torvalds authored
      Pull x86 fixes from Thomas Gleixner:
      
       - a fix for the kexec/purgatory regression which was introduced in the
         merge window via an innocent sparse fix. We could have reverted that
         commit, but on deeper inspection it turned out that the whole
         machinery is neither documented nor robust. So a proper cleanup was
         done instead
      
       - the fix for the TLB flush issue which was discovered recently
      
       - a simple typo fix for a reboot quirk
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/tlb: Fix tlb flushing when lguest clears PGE
        kexec, x86/purgatory: Unbreak it and clean it up
        x86/reboot/quirks: Fix typo in ASUS EeeBook X205TA reboot quirk
      5a45a5a8
    • Linus Torvalds's avatar
      Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · ecade114
      Linus Torvalds authored
      Pull irq fixes from Thomas Gleixner:
      
       - a workaround for a GIC erratum
      
       - a missing stub function for CONFIG_IRQDOMAIN=n
      
       - fixes for a couple of type inconsistencies
      
      * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        irqchip/crossbar: Fix incorrect type of register size
        irqchip/gicv3-its: Add workaround for QDF2400 ITS erratum 0065
        irqdomain: Add empty irq_domain_check_msi_remap
        irqchip/crossbar: Fix incorrect type of local variables
      ecade114
    • Daniel Borkmann's avatar
      x86/tlb: Fix tlb flushing when lguest clears PGE · 2c4ea6e2
      Daniel Borkmann authored
      Fengguang reported random corruptions from various locations on x86-32
      after commits d2852a22 ("arch: add ARCH_HAS_SET_MEMORY config") and
      9d876e79 ("bpf: fix unlocking of jited image when module ronx not set")
      that uses the former. While x86-32 doesn't have a JIT like x86_64, the
      bpf_prog_lock_ro() and bpf_prog_unlock_ro() got enabled due to
      ARCH_HAS_SET_MEMORY, whereas Fengguang's test kernel doesn't have module
      support built in and therefore never had the DEBUG_SET_MODULE_RONX setting
      enabled.
      
      After investigating the crashes further, it turned out that using
      set_memory_ro() and set_memory_rw() didn't have the desired effect, for
      example, setting the pages as read-only on x86-32 would still let
      probe_kernel_write() succeed without error. This behavior would manifest
      itself in situations where the vmalloc'ed buffer was accessed prior to
      set_memory_*() such as in case of bpf_prog_alloc(). In cases where it
      wasn't, the page attribute changes seemed to have taken effect, leading to
      the conclusion that a TLB invalidate didn't happen. Moreover, it turned out
      that this issue reproduced with qemu in "-cpu kvm64" mode, but not for
      "-cpu host". When the issue occurs, change_page_attr_set_clr() did trigger
      a TLB flush as expected via __flush_tlb_all() through cpa_flush_range(),
      though.
      
      There are 3 variants for issuing a TLB flush: invpcid_flush_all() (depends
      on CPU feature bits X86_FEATURE_INVPCID, X86_FEATURE_PGE), cr4 based flush
      (depends on X86_FEATURE_PGE), and cr3 based flush.  For "-cpu host" case in
      my setup, the flush used invpcid_flush_all() variant, whereas for "-cpu
      kvm64", the flush was cr4 based. Switching the kvm64 case to cr3 manually
      worked fine, and further investigating the cr4 one turned out that
      X86_CR4_PGE bit was not set in cr4 register, meaning the
      __native_flush_tlb_global_irq_disabled() wrote cr4 twice with the same
      value instead of clearing X86_CR4_PGE in the first write to trigger the
      flush.
      
      It turned out that X86_CR4_PGE was cleared from cr4 during init from
      lguest_arch_host_init() via adjust_pge(). The X86_FEATURE_PGE bit is also
      cleared from there due to concerns of using PGE in guest kernel that can
      lead to hard to trace bugs (see bff672e6 ("lguest: documentation V:
      Host") in init()). The CPU feature bits are cleared in dynamic
      boot_cpu_data, but they never propagated to __flush_tlb_all() as it uses
      static_cpu_has() instead of boot_cpu_has() for testing which variant of TLB
      flushing to use, meaning they still used the old setting of the host
      kernel.
      
      Clearing via setup_clear_cpu_cap(X86_FEATURE_PGE) so this would propagate
      to static_cpu_has() checks is too late at this point as sections have been
      patched already, so for now, it seems reasonable to switch back to
      boot_cpu_has(X86_FEATURE_PGE) as it was prior to commit c109bf95
      ("x86/cpufeature: Remove cpu_has_pge"). This lets the TLB flush trigger via
      cr3 as originally intended, properly makes the new page attributes visible
      and thus fixes the crashes seen by Fengguang.
      
      Fixes: c109bf95 ("x86/cpufeature: Remove cpu_has_pge")
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: bp@suse.de
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: netdev@vger.kernel.org
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: lkp@01.org
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernrl.org/r/20170301125426.l4nf65rx4wahohyl@wfg-t540p.sh.intel.com
      Link: http://lkml.kernel.org/r/25c41ad9eca164be4db9ad84f768965b7eb19d9e.1489191673.git.daniel@iogearbox.net
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      2c4ea6e2
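The failure mode of the cr4-based flush can be shown with a toy model. It assumes, as the commit message implies, that a CR4 write flushes global TLB entries only when it changes the PGE bit. The `cpu_model` struct and both functions are hypothetical and do not touch real hardware.

```c
#include <assert.h>

#define X86_CR4_PGE 0x80UL	/* page-global-enable, bit 7 of CR4 */

/* Hypothetical model of one CPU's CR4 and a count of global flushes. */
struct cpu_model {
	unsigned long cr4;
	int global_flushes;
};

/* Model assumption: a write flushes only if it toggles the PGE bit. */
static void write_cr4_model(struct cpu_model *cpu, unsigned long val)
{
	if ((cpu->cr4 ^ val) & X86_CR4_PGE)
		cpu->global_flushes++;
	cpu->cr4 = val;
}

/* The cr4-based flush: clear PGE, then restore the original value. */
static void flush_tlb_global_cr4_model(struct cpu_model *cpu)
{
	unsigned long cr4 = cpu->cr4;

	write_cr4_model(cpu, cr4 & ~X86_CR4_PGE);
	write_cr4_model(cpu, cr4);
}
```

When PGE is already clear, both writes carry the same value and nothing is flushed. That is why code that still believes PGE is available picks a flush method that silently does nothing.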
  3. 11 Mar, 2017 8 commits
    • Linus Torvalds's avatar
      Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 106e4da6
      Linus Torvalds authored
      Pull KVM fixes from Radim Krčmář:
       "ARM updates from Marc Zyngier:
         - vgic updates:
           - Honour disabling the ITS
           - Don't deadlock when deactivating own interrupts via MMIO
           - Correctly expose the lack of IRQ/FIQ bypass on GICv3
      
         - I/O virtualization:
           - Make KVM_CAP_NR_MEMSLOTS big enough for large guests with many
             PCIe devices
      
         - General bug fixes:
           - Gracefully handle exceptions generated with syndromes that the host
             doesn't understand
           - Properly invalidate TLBs on VHE systems
      
        x86:
         - improvements in emulation of VMCLEAR, VMX MSR bitmaps, and VCPU
           reset
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: nVMX: do not warn when MSR bitmap address is not backed
        KVM: arm64: Increase number of user memslots to 512
        KVM: arm/arm64: Remove KVM_PRIVATE_MEM_SLOTS definition that are unused
        KVM: arm/arm64: Enable KVM_CAP_NR_MEMSLOTS on arm/arm64
        KVM: Add documentation for KVM_CAP_NR_MEMSLOTS
        KVM: arm/arm64: VGIC: Fix command handling while ITS being disabled
        arm64: KVM: Survive unknown traps from guests
        arm: KVM: Survive unknown traps from guests
        KVM: arm/arm64: Let vcpu thread modify its own active state
        KVM: nVMX: reset nested_run_pending if the vCPU is going to be reset
        kvm: nVMX: VMCLEAR should not cause the vCPU to shut down
        KVM: arm/arm64: vgic-v3: Don't pretend to support IRQ/FIQ bypass
        arm64: KVM: VHE: Clear HCR_TGE when invalidating guest TLBs
      106e4da6
    • Linus Torvalds's avatar
      Merge tag 'extable-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux · 4b050f22
      Linus Torvalds authored
      Pull extable.h fix from Paul Gortmaker:
       "Fixup for arch/score after extable.h introduction.
      
        It seems that Guenter is the only one on the planet doing builds for
        arch/score -- we don't have compile coverage for it in linux-next or
        in the kbuild-bot either. Guenter couldn't even recall where he got
        his toolchain, but was kind enough to share it with me so I could
        validate this change and also add arch/score to my build coverage.
      
        I sat on this a bit in case there was any other fallout in other arch
        dirs, but since this still seems to be the only one, I might as well
        send it on its way"
      
      * tag 'extable-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux:
        score: Fix implicit includes now failing build after extable change
      4b050f22
    • Linus Torvalds's avatar
      Merge tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random · 84c37c16
      Linus Torvalds authored
      Pull random updates from Ted Ts'o:
       "Change get_random_{int,long} to use the CRNG used by /dev/urandom and
        getrandom(2). It's faster and arguably more secure than cut-down MD5
        that we had been using.
      
        Also do some code cleanup"
      
      * tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random:
        random: move random_min_urandom_seed into CONFIG_SYSCTL ifdef block
        random: convert get_random_int/long into get_random_u32/u64
        random: use chacha20 for get_random_int/long
        random: fix comment for unused random_min_urandom_seed
        random: remove variable limit
        random: remove stale urandom_init_wait
        random: remove stale maybe_reseed_primary_crng
      84c37c16
    • Guenter Roeck's avatar
      score: Fix implicit includes now failing build after extable change · 0acf6119
      Guenter Roeck authored
      After changing from module.h to extable.h, score builds fail with:
      
        arch/score/kernel/traps.c: In function 'do_ri':
        arch/score/kernel/traps.c:248:4: error: implicit declaration of function 'user_disable_single_step'
        arch/score/mm/extable.c: In function 'fixup_exception':
        arch/score/mm/extable.c:32:38: error: dereferencing pointer to incomplete type
        arch/score/mm/extable.c:34:24: error: dereferencing pointer to incomplete type
      
      because extable.h doesn't drag in as many headers as module.h
      did.  Add in the headers which were implicitly expected.
      
      Fixes: 90858794 ("module.h: remove extable.h include now users have migrated")
      Signed-off-by: Guenter Roeck <linux@roeck-us.net>
      [PG: tweak commit log; refresh for sched header refactoring.]
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      0acf6119
    • Linus Torvalds's avatar
      Merge tag 'tty-4.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty · 434fd635
      Linus Torvalds authored
      Pull tty/serial fixes from Greg KH:
       "Here are two bugfixes for tty stuff for 4.11-rc2.
      
        One of them resolves the pretty bad bug in the n_hdlc code that
        Alexander Popov found and fixed and has been reported everywhere. The
        other just fixes a samsung serial driver issue when DMA fails on some
        systems.
      
        Both have been in linux-next with no reported issues"
      
      * tag 'tty-4.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
        serial: samsung: Continue to work if DMA request fails
        tty: n_hdlc: get rid of racy n_hdlc.tbuf
      434fd635
    • Linus Torvalds's avatar
      Merge tag 'staging-4.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging · 85298808
      Linus Torvalds authored
      Pull staging driver fixes from Greg KH:
       "Here are two small build warning fixes for some staging drivers that
        Arnd has found on his valiant quest to get the kernel to build
        properly with no warnings.
      
        Both of these have been in linux-next this week and resolve the
        reported issues"
      
      * tag 'staging-4.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
        staging: octeon: remove unused variable
        staging/vc04_services: add CONFIG_OF dependency
      85298808
    • Linus Torvalds's avatar
      Merge tag 'usb-4.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · 46552bf4
      Linus Torvalds authored
      Pull USB fixes from Greg KH:
       "Here is a number of different USB fixes for 4.11-rc2.
      
        Seems like there were a lot of unresolved issues that people have been
        finding for this subsystem, and a bunch of good security auditing
        happening as well from Johan Hovold. There's the usual batch of gadget
        driver fixes and xhci issues resolved as well.
      
        All of these have been in linux-next with no reported issues"
      
      * tag 'usb-4.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (35 commits)
        usb: host: xhci-plat: Fix timeout on removal of hot pluggable xhci controllers
        usb: host: xhci-dbg: HCIVERSION should be a binary number
        usb: xhci: remove dummy extra_priv_size for size of xhci_hcd struct
        usb: xhci-mtk: check hcc_params after adding primary hcd
        USB: serial: digi_acceleport: fix OOB-event processing
        MAINTAINERS: usb251xb: remove reference inexistent file
        doc: dt-bindings: usb251xb: mark reg as required
        usb: usb251xb: dt: add unit suffix to oc-delay and power-on-time
        usb: usb251xb: remove max_{power,current}_{sp,bp} properties
        usb-storage: Add ignore-residue quirk for Initio INIC-3619
        USB: iowarrior: fix NULL-deref in write
        USB: iowarrior: fix NULL-deref at probe
        usb: phy: isp1301: Add OF device ID table
        usb: ohci-at91: Do not drop unhandled USB suspend control requests
        USB: serial: safe_serial: fix information leak in completion handler
        USB: serial: io_ti: fix information leak in completion handler
        USB: serial: omninet: drop open callback
        USB: serial: omninet: fix reference leaks at open
        USB: serial: io_ti: fix NULL-deref in interrupt callback
        usb: dwc3: gadget: make to increment req->remaining in all cases
        ...
      46552bf4
    • Linus Torvalds's avatar
      Merge tag 'pinctrl-v4.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl · cb853a82
      Linus Torvalds authored
      Pull pinctrl fixes from Linus Walleij:
       "Two smaller pin control fixes for the v4.11 series:
      
         - Add a get_direction() function to the qcom driver
      
         - Fix two pin names in the uniphier driver"
      
      * tag 'pinctrl-v4.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
        pinctrl: uniphier: change pin names of aio/xirq for LD11
        pinctrl: qcom: add get_direction function
      cb853a82