1. 28 Apr, 2017 8 commits
    • powerpc/64s: Dedicated system reset interrupt stack · b1ee8a3d
      Nicholas Piggin authored
      The system reset interrupt is used for crash/debug situations, so it is
      desirable to have as little impact on the normal state of the system as
      possible.
      
      Currently it uses the current kernel stack to process the exception.
      This stores into the stack which may be involved with the crash. The
      stack pointer may be corrupted, or it may have overflowed.
      
      Avoid or minimise these problems by creating a dedicated NMI stack for
      the system reset interrupt to use.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/64s: Disallow system reset vs system reset reentrancy · c4f3b52c
      Nicholas Piggin authored
      In preparation for using a dedicated stack for system reset interrupts,
      prevent a nested system reset from recovering, in order to simplify
      code that is called in crash/debug path. This allows a system reset
      interrupt to just use the base stack pointer.
      
      Keep an in_nmi nesting counter similarly to the in_mce counter. Consider
      the interrupt non-recoverable if it is taken inside another system
      reset.
      
      Interrupt nesting could be allowed similarly to MCE, but system reset
      is a special case that's not for normal operation, so simplicity wins
      until there is a requirement for nested system reset interrupts.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
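
The nesting rule in c4f3b52c can be sketched in plain C. This is a userspace illustration under stated assumptions, not the kernel code: `struct cpu_state`, `sreset_enter()` and `sreset_exit()` are hypothetical names, and only the `in_nmi` counter comes from the commit message.

```c
#include <stdbool.h>

/* Hypothetical per-CPU state; only the in_nmi counter is named in the commit. */
struct cpu_state {
    unsigned int in_nmi;   /* system reset nesting depth */
};

/*
 * Enter a system reset handler. Returns true if the interrupt may be
 * treated as recoverable, false if it arrived inside another system reset.
 */
static bool sreset_enter(struct cpu_state *cpu)
{
    cpu->in_nmi++;
    return cpu->in_nmi == 1;
}

/* Leave the handler, undoing the increment made by sreset_enter(). */
static void sreset_exit(struct cpu_state *cpu)
{
    cpu->in_nmi--;
}
```

A first call to `sreset_enter()` reports the interrupt as recoverable; a second call made before the matching `sreset_exit()` does not, which is the "simplicity wins" policy the message describes.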
    • powerpc/64s: Fix system reset vs general interrupt reentrancy · a3d96f70
      Nicholas Piggin authored
      The system reset interrupt can occur when MSR_EE=0, and it currently
      uses the PACA_EXGEN save area.
      
      Some PACA_EXGEN interrupts have a window where MSR_RI=1 and MSR_EE=0
      when the save area is still in use. A system reset interrupt in this
      window can lead to undetected corruption when the save area gets
      overwritten.
      
      This patch introduces PACA_EXNMI save area for system reset exceptions,
      which closes this corruption window. It's also helpful to retain the
      EXGEN state for debugging situations, even if not considering the
      recoverability aspect.
      
      This patch also moves the PACA_EXMC area down to a less frequently used
      part of the paca with the new save area.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
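
The reason a separate save area closes the corruption window in a3d96f70 can be shown with a small model. This is only a sketch: `struct paca_sketch` and both entry functions are hypothetical, the field names merely echo PACA_EXGEN and PACA_EXNMI, and the real save areas hold more registers.

```c
/* Hypothetical miniature of the per-CPU save areas. */
struct paca_sketch {
    unsigned long exgen[2];   /* used by general interrupts */
    unsigned long exnmi[2];   /* used only by system reset */
};

/* A general interrupt stashes two scratch registers in the EXGEN area. */
static void general_entry(struct paca_sketch *p,
                          unsigned long r9, unsigned long r10)
{
    p->exgen[0] = r9;
    p->exgen[1] = r10;
}

/*
 * A system reset arriving while EXGEN is still live writes to its own
 * area, so the interrupted handler's saved values are left intact.
 */
static void sreset_entry(struct paca_sketch *p,
                         unsigned long r9, unsigned long r10)
{
    p->exnmi[0] = r9;
    p->exnmi[1] = r10;
}
```

Calling `general_entry()` and then `sreset_entry()` on the same structure leaves `exgen` unchanged, which also keeps it available for debugging.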
    • powerpc/64s: Exception macro for stack frame and initial register save · a4087a4d
      Nicholas Piggin authored
      This code is common to a few exceptions, and another user will be added.
      This causes a trivial change to generated code:
      
      -     604: std     r9,416(r1)
      -     608: mfspr   r11,314
      -     60c: std     r11,368(r1)
      -     610: mfspr   r12,315
      +     604: mfspr   r11,314
      +     608: mfspr   r12,315
      +     60c: std     r9,416(r1)
      +     610: std     r11,368(r1)
      
      machine_check_powernv_early could also use this, but that requires
      non-trivial changes to generated code, so that's for another patch.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/64s: Add exception macro that does not enable RI · 83a980f7
      Nicholas Piggin authored
      Subsequent patches will add more non-RI variant exceptions, so
      create a macro for it rather than open-code it.
      
      This does not change generated instructions.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/cbe: Do not process external or decremeter interrupts from sreset · 6e83985b
      Nicholas Piggin authored
      Cell will wake from low power state at the system reset interrupt,
      with the event encoded in SRR1, rather than waking at the interrupt
      vector that corresponds to that event.
      
      The system reset handler for this platform decodes SRR1 event reason
      and calls the interrupt handler to process it directly from the system
      reset handler.
      
      A subsequent change will treat the system reset interrupt as a Linux NMI
      with its own per-CPU stack, and this will no longer work. Remove the
      external and decrementer handlers from the system reset handler.
      
      - The external exception remains raised and will fire again at the
        EE interrupt vector when system reset returns.
      
      - The decrementer is set to 1 so it will be raised again and fire when
        the system reset returns.
      
      It is possible to branch to an idle handler from the system reset
      interrupt (like POWER does), then restore a normal stack and restore
      this optimisation. But simplicity wins for now.
      Tested-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/pasemi: Do not process external or decrementer interrupts from sreset · 461e96a3
      Nicholas Piggin authored
      PA Semi will wake from low power state at the system reset interrupt,
      with the event encoded in SRR1, rather than waking at the interrupt
      vector that corresponds to that event.
      
      The system reset handler for this platform decodes SRR1 event reason
      and calls the interrupt handler to process it directly from the system
      reset handler.
      
      A subsequent change will treat the system reset interrupt as a Linux NMI
      with its own per-CPU stack, and this will no longer work. Remove the
      external and decrementer handlers from the system reset handler.
      
      - The external exception remains raised and will fire again at the
        EE interrupt vector when system reset returns.
      
      - The decrementer is set to 1 so it will be raised again and fire when
        the system reset returns.
      
      It is possible to branch to an idle handler from the system reset
      interrupt (like POWER does), then restore a normal stack and restore
      this optimisation. But simplicity wins for now.
      Tested-by: Christian Zigotzky <chzigotzky@xenosoft.de>
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • Merge branch 'topic/ppc-kvm' into next · b13f6683
      Michael Ellerman authored
      Merge the topic branch we were sharing with kvm-ppc, Paul has also
      merged it.
  2. 27 Apr, 2017 8 commits
  3. 26 Apr, 2017 3 commits
    • powerpc/powernv: Fix oops on P9 DD1 in cause_ipi() · 45b21cfe
      Michael Ellerman authored
      Recently we merged the native xive support for Power9, and then separately some
      reworks for doorbell IPI support. In isolation both series were OK, but the
      merged result had a bug in one case.
      
      On P9 DD1 we use pnv_p9_dd1_cause_ipi() which tries to use doorbells, and then
      falls back to the interrupt controller. However the fallback is implemented by
      calling icp_ops->cause_ipi. But now that xive support is merged we might be
      using xive, in which case icp_ops is not initialised because it's a
      xics-specific structure. This leads to an oops such as:
      
        Unable to handle kernel paging request for data at address 0x00000028
        Oops: Kernel access of bad area, sig: 11 [#1]
        NIP pnv_p9_dd1_cause_ipi+0x74/0xe0
        LR smp_muxed_ipi_message_pass+0x54/0x70
      
      To fix it, rather than using icp_ops which might be NULL, have both xics and
      xive set smp_ops->cause_ipi, and then in the powernv code we save that as
      ic_cause_ipi before overriding smp_ops->cause_ipi. For paranoia add a WARN_ON()
      to check if somehow smp_ops->cause_ipi is NULL.
      
      Fixes: b866cc21 ("powerpc: Change the doorbell IPI calling convention")
      Tested-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
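
The save-then-override pattern described in 45b21cfe can be sketched in userspace C. Everything here is an assumption made for illustration: `struct smp_ops_sketch`, `try_doorbell()`, `ic_backend_cause_ipi()` and the counters are hypothetical, and the real `smp_ops` has more fields. Only the idea of saving `cause_ipi` as `ic_cause_ipi` before overriding it comes from the commit message.

```c
#include <stddef.h>

/* Hypothetical miniature of the ops table. */
struct smp_ops_sketch {
    void (*cause_ipi)(int cpu);
};

static int ic_ipis;        /* IPIs sent through the interrupt controller */
static int doorbell_ipis;  /* IPIs sent through doorbells */

/* Stand-in for whatever xics or xive installed as its IPI routine. */
static void ic_backend_cause_ipi(int cpu)
{
    (void)cpu;
    ic_ipis++;
}

/* Hypothetical doorbell attempt: succeeds only for even CPU numbers. */
static int try_doorbell(int cpu)
{
    if (cpu % 2 != 0)
        return 0;
    doorbell_ipis++;
    return 1;
}

static void (*ic_cause_ipi)(int cpu);  /* saved controller routine */

/* Platform override: prefer doorbells, fall back to the saved routine. */
static void platform_cause_ipi(int cpu)
{
    if (!try_doorbell(cpu))
        ic_cause_ipi(cpu);
}

/* Save the controller's routine before overriding it. Returns 0 on success. */
static int platform_smp_setup(struct smp_ops_sketch *ops)
{
    if (ops->cause_ipi == NULL)
        return -1;  /* the commit message describes a WARN_ON() for this case */
    ic_cause_ipi = ops->cause_ipi;
    ops->cause_ipi = platform_cause_ipi;
    return 0;
}
```

Because the fallback goes through the saved pointer, it works the same way whichever interrupt controller backend filled in `cause_ipi`.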
    • powerpc/powernv: Fix missing attr initialisation in opal_export_attrs() · 83c49190
      Michael Ellerman authored
      In opal_export_attrs() we dynamically allocate some bin_attributes. They're
      allocated with kmalloc() and although we initialise most of the fields, we don't
      initialise write() or mmap(), and in particular we don't initialise the lockdep
      related fields in the embedded struct attribute.
      
      This leads to a lockdep warning at boot:
      
        BUG: key c0000000f11906d8 not in .data!
        WARNING: CPU: 0 PID: 1 at ../kernel/locking/lockdep.c:3136 lockdep_init_map+0x28c/0x2a0
        ...
        Call Trace:
          lockdep_init_map+0x288/0x2a0 (unreliable)
          __kernfs_create_file+0x8c/0x170
          sysfs_add_file_mode_ns+0xc8/0x240
          __machine_initcall_powernv_opal_init+0x60c/0x684
          do_one_initcall+0x60/0x1c0
          kernel_init_freeable+0x2f4/0x3d4
          kernel_init+0x24/0x160
          ret_from_kernel_thread+0x5c/0xb0
      
      Fix it by kzalloc'ing the attr, which fixes the uninitialised write() and
      mmap(), and calling sysfs_bin_attr_init() on it to initialise the lockdep
      fields.
      
      Fixes: 11fe909d ("powerpc/powernv: Add OPAL exports attributes to sysfs")
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/mm: Fix possible out-of-bounds shift in arch_mmap_rnd() · b409946b
      Michael Ellerman authored
      The recent patch to add runtime configuration of the ASLR limits added a bug in
      arch_mmap_rnd() where we may shift an integer (32-bits) by up to 33 bits,
      leading to undefined behaviour.
      
      In practice it exhibits as every process seg faulting instantly, presumably
      because the rnd value hasn't been restricted by the modulus at all. We didn't
      notice because it only happens under certain kernel configurations and if the
      number of bits is actually set to a large value.
      
      Fix it by switching to unsigned long.
      
      Fixes: 9fea59bd ("powerpc/mm: Add support for runtime configuration of ASLR limits")
      Reported-by: Balbir Singh <bsingharora@gmail.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
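
The class of bug fixed in b409946b comes down to the type of the constant being shifted. The sketch below is not the kernel's arch_mmap_rnd(): `restrict_rnd()` is a hypothetical helper, and it uses `uint64_t` rather than `unsigned long` so that the example does not depend on the platform's width of `long`.

```c
#include <stdint.h>

/*
 * Restrict rnd to its low `bits` bits by taking it modulo 2^bits.
 * Valid for bits in the range 0..63.
 *
 * The 64-bit constant keeps the shift defined. Writing (1 << bits)
 * instead would shift a 32-bit int, which is undefined behaviour
 * once bits reaches 32.
 */
static uint64_t restrict_rnd(uint64_t rnd, unsigned int bits)
{
    return rnd % (UINT64_C(1) << bits);
}
```

With `bits` set to 33, as in the case the commit message mentions, the result is confined to the low 33 bits instead of depending on an undefined shift.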
  4. 24 Apr, 2017 9 commits
    • powerpc/mm: Ensure IRQs are off in switch_mm() · 9765ad13
      David Gibson authored
      powerpc expects IRQs to already be (soft) disabled when switch_mm() is
      called, as made clear in the commit message of 9c1e1052 ("powerpc: Allow
      perf_counters to access user memory at interrupt time").
      
      Aside from any race conditions that might exist between switch_mm() and an IRQ,
      there is also an unconditional hard_irq_disable() in switch_slb(). If that isn't
      followed at some point by an IRQ enable then interrupts will remain disabled
      until we return to userspace.
      
      It is true that when switch_mm() is called from the scheduler IRQs are off, but
      not when it's called by use_mm(). Looking closer we see that last year in commit
      f98db601 ("sched/core: Add switch_mm_irqs_off() and use it in the scheduler")
      this was made more explicit by the addition of switch_mm_irqs_off() which is now
      called by the scheduler, vs switch_mm() which is used by use_mm().
      
      Arguably it is a bug in use_mm() to call switch_mm() in a different context than
      it expects, but fixing that will take time.
      
      This was discovered recently when vhost started throwing warnings such as:
      
        BUG: sleeping function called from invalid context at kernel/mutex.c:578
        in_atomic(): 0, irqs_disabled(): 1, pid: 10768, name: vhost-10760
        no locks held by vhost-10760/10768.
        irq event stamp: 10
        hardirqs last  enabled at (9):  _raw_spin_unlock_irq+0x40/0x80
        hardirqs last disabled at (10): switch_slb+0x2e4/0x490
        softirqs last  enabled at (0):  copy_process+0x5e8/0x1260
        softirqs last disabled at (0):  (null)
        Call Trace:
          show_stack+0x88/0x390 (unreliable)
          dump_stack+0x30/0x44
          __might_sleep+0x1c4/0x2d0
          mutex_lock_nested+0x74/0x5c0
          cgroup_attach_task_all+0x5c/0x180
          vhost_attach_cgroups_work+0x58/0x80 [vhost]
          vhost_worker+0x24c/0x3d0 [vhost]
          kthread+0xec/0x100
          ret_from_kernel_thread+0x5c/0xd4
      
      Prior to commit 04b96e55 ("vhost: lockless enqueuing") (Aug 2016) the
      vhost_worker() would do a spin_unlock_irq() not long after calling use_mm(),
      which had the effect of reenabling IRQs. Since that commit removed the locking
      in vhost_worker() the body of the vhost_worker() loop now runs with interrupts
      off causing the warnings.
      
      This patch addresses the problem by making the powerpc code mirror the x86 code,
      ie. we disable interrupts in switch_mm(), and optimise the scheduler case by
      defining switch_mm_irqs_off().
      
      Cc: stable@vger.kernel.org # v4.7+
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      [mpe: Flesh out/rewrite change log, add stable]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
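
The split between switch_mm() and switch_mm_irqs_off() in 9765ad13 follows a save, call, restore shape. The sketch below simulates that shape in userspace: the `sim_` functions and the `irqs_enabled` flag are hypothetical stand-ins, not the kernel's local_irq_save() and local_irq_restore(), and the real functions take mm and task arguments.

```c
#include <stdbool.h>

static bool irqs_enabled = true;  /* simulated interrupt state */
static bool saw_irqs_off;         /* what the inner routine observed */

/* Simulated counterpart of saving the IRQ state and disabling IRQs. */
static void sim_irq_save(bool *flags)
{
    *flags = irqs_enabled;
    irqs_enabled = false;
}

/* Simulated counterpart of restoring the previously saved IRQ state. */
static void sim_irq_restore(bool flags)
{
    irqs_enabled = flags;
}

/* Inner routine: expects interrupts to be off already (the scheduler's case). */
static void sim_switch_mm_irqs_off(void)
{
    saw_irqs_off = !irqs_enabled;
}

/* Outer routine: callable from any context, such as use_mm(). */
static void sim_switch_mm(void)
{
    bool flags;

    sim_irq_save(&flags);
    sim_switch_mm_irqs_off();
    sim_irq_restore(flags);
}
```

The outer routine restores whatever state it found, so a caller that entered with interrupts enabled gets them back, which is the property the vhost worker was missing.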
    • powerpc/sysfs: Fix reference leak of cpu device_nodes present at boot · e76ca277
      Tyrel Datwyler authored
      For CPUs present at boot each logical CPU acquires a reference to the
      associated device node of the core. This happens in register_cpu() which
      is called by topology_init(). The result of this is that we end up with
      a reference held by each thread of the core. However, these references
      are never freed if the CPU core is DLPAR removed.
      
      This patch fixes the reference leaks by acquiring and releasing the references
      in the CPU hotplug callbacks un/register_cpu_online(). With this patch symmetric
      reference counting is observed with both CPUs present at boot, and those DLPAR
      added after boot.
      
      Fixes: f86e4718 ("driver/core: cpu: initialize of_node in cpu's device struture")
      Cc: stable@vger.kernel.org # v3.12+
      Signed-off-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
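
The symmetric reference counting that e76ca277 aims for can be modelled with a plain counter. This is a sketch only: `struct node_sketch` and the four functions are hypothetical and stand in for the device node and the online/offline hotplug callbacks.

```c
/* Hypothetical stand-in for a device node with a reference count. */
struct node_sketch {
    int refcount;
};

static void node_get(struct node_sketch *n) { n->refcount++; }
static void node_put(struct node_sketch *n) { n->refcount--; }

/* Online callback: each logical CPU takes one reference on its core's node. */
static void cpu_online_sketch(struct node_sketch *core)
{
    node_get(core);
}

/* Offline callback: drops the reference taken when the CPU came online. */
static void cpu_offline_sketch(struct node_sketch *core)
{
    node_put(core);
}
```

Pairing the get and the put in the two callbacks means a core whose threads all go offline ends with the same count it started with, whether the CPU was present at boot or added later.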
    • powerpc/pseries: Fix of_node_put() underflow during DLPAR remove · 68baf692
      Tyrel Datwyler authored
      Historically struct device_node references were tracked using a kref embedded as
      a struct field. Commit 75b57ecf ("of: Make device nodes kobjects so they
      show up in sysfs") (Mar 2014) refactored device_nodes to be kobjects such that
      the device tree could be more simply exposed to userspace using sysfs.
      
      Commit 0829f6d1 ("of: device_node kobject lifecycle fixes") (Mar 2014)
      followed up these changes to better control the kobject lifecycle and in
      particular the reference counting via of_node_get(), of_node_put(), and
      of_node_init().
      
      A result of this second commit was that it introduced an of_node_put() call when
      a dynamic node is detached, in of_node_remove(), that removes the initial kobj
      reference created by of_node_init().
      
      Traditionally as the original dynamic device node user the pseries code had
      assumed responsibility for releasing this final reference in its platform
      specific DLPAR detach code.
      
      This patch fixes a refcount underflow introduced by commit 0829f6d1, and
      recently exposed by the upstreaming of the refcount API.
      
      Messages like the following are no longer seen in the kernel log with this
      patch following DLPAR remove operations of cpus and pci devices.
      
        rpadlpar_io: slot PHB 72 removed
        refcount_t: underflow; use-after-free.
        ------------[ cut here ]------------
        WARNING: CPU: 5 PID: 3335 at lib/refcount.c:128 refcount_sub_and_test+0xf4/0x110
      
      Fixes: 0829f6d1 ("of: device_node kobject lifecycle fixes")
      Cc: stable@vger.kernel.org # v3.15+
      Signed-off-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
      [mpe: Make change log commit references more verbose]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/xmon: Deindent the SLB dumping logic · 85673646
      Michael Ellerman authored
      Currently the code that dumps SLB entries uses a double-nested if. This
      means the actual dumping logic is a bit squashed. Deindent it by using
      continue.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com>
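
The deindenting technique named in 85673646 is a general one. The sketch below is not the xmon code: `struct entry_sketch`, its fields and both functions are hypothetical, chosen only to show a double-nested `if` next to the equivalent loop written with `continue`.

```c
#include <stddef.h>

/* Hypothetical entry with two conditions that gate processing. */
struct entry_sketch {
    int valid;
    int in_range;
    int value;
};

/* Nested form: the real work sits two levels deep. */
static int sum_nested(const struct entry_sketch *e, size_t n)
{
    int sum = 0;

    for (size_t i = 0; i < n; i++) {
        if (e[i].valid) {
            if (e[i].in_range) {
                sum += e[i].value;
            }
        }
    }
    return sum;
}

/* Flattened form: skip uninteresting entries early with continue. */
static int sum_flat(const struct entry_sketch *e, size_t n)
{
    int sum = 0;

    for (size_t i = 0; i < n; i++) {
        if (!e[i].valid)
            continue;
        if (!e[i].in_range)
            continue;
        sum += e[i].value;
    }
    return sum;
}
```

Both functions compute the same result; the second keeps the main body at one level of indentation, which leaves more room for the actual logic.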
    • Merge branch 'topic/kprobes' into next · 9fc84914
      Michael Ellerman authored
      Although most of these kprobes patches are powerpc specific, there's a couple
      that touch generic code (with Acks). At the moment there's one conflict with
      acme's tree, but it's not too bad. Still just in case some other conflicts show
      up, we've put these in a topic branch so another tree could merge some or all of
      it if necessary.
    • powerpc/kprobes: Prefer ftrace when probing function entry · 24bd909e
      Naveen N. Rao authored
      KPROBES_ON_FTRACE avoids much of the overhead of regular kprobes as it
      eliminates the need for a trap, as well as the need to emulate or single-step
      instructions.
      
      Though OPTPROBES provides us with similar performance, we have limited
      optprobes trampoline slots. As such, when asked to probe at a function
      entry, default to using the ftrace infrastructure.
      
      With:
        # cd /sys/kernel/debug/tracing
        # echo 'p _do_fork' > kprobe_events
      
      before patch:
        # cat ../kprobes/list
        c0000000000daf08  k  _do_fork+0x8    [DISABLED]
        c000000000044fc0  k  kretprobe_trampoline+0x0    [OPTIMIZED]
      
      and after patch:
        # cat ../kprobes/list
        c0000000000d074c  k  _do_fork+0xc    [DISABLED][FTRACE]
        c0000000000412b0  k  kretprobe_trampoline+0x0    [OPTIMIZED]
      Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc: Introduce a new helper to obtain function entry points · 1b32cd17
      Naveen N. Rao authored
      kprobe_lookup_name() is specific to the kprobe subsystem and may not always
      return the function entry point (in a subsequent patch for KPROBES_ON_FTRACE).
      For looking up function entry points, introduce a separate helper and use it
      in optprobes.c.
      Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/kprobes: Add support for KPROBES_ON_FTRACE · ead514d5
      Naveen N. Rao authored
      Allow kprobes to be placed on ftrace _mcount() call sites. This optimization
      avoids the use of a trap, by riding on ftrace infrastructure.
      
      This depends on HAVE_DYNAMIC_FTRACE_WITH_REGS which depends on MPROFILE_KERNEL,
      which is only currently enabled on powerpc64le with newer toolchains.
      
      Based on the x86 code by Masami.
      Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/ftrace: Restore LR from pt_regs · 2f59be5b
      Naveen N. Rao authored
      Pass the real LR to the ftrace handler. This is needed for KPROBES_ON_FTRACE for
      the pre handlers.
      
      Also, with KPROBES_ON_FTRACE, the link register may be updated by the pre
      handlers or by a registered kretprobe. Honor updated LR by restoring it from
      pt_regs, rather than from the stack save area.
      
      Live patch and function graph continue to work fine with this change.
      Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  5. 23 Apr, 2017 12 commits