  1. 19 Apr, 2017 2 commits
    • powerpc/mm/radix: Use mm->task_size for boundary checking instead of addr_limit · be77e999
      Aneesh Kumar K.V authored
      We don't init addr_limit correctly for 32 bit applications. So default to using
      mm->task_size for boundary condition checking. We use addr_limit to only control
      free space search. This makes sure that we do the right thing with 32 bit
      applications.
      
      We should consolidate the usage of TASK_SIZE/mm->task_size and
      mm->context.addr_limit later.
      
      This partially reverts commit fbfef902 (powerpc/mm: Switch some
      TASK_SIZE checks to use mm_context addr_limit).
      
      Fixes: fbfef902 ("powerpc/mm: Switch some TASK_SIZE checks to use mm_context addr_limit")
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/64s: Revert setting of LPCR[LPES] on POWER9 · 8d1b48ef
      Nicholas Piggin authored
      The XIVE enablement patches included a change to set the LPES (Logical
      Partitioning Environment Selector) bit (bit # 3) in LPCR (Logical Partitioning
      Control Register) on POWER9 hosts. This bit sets external interrupts to guest
      delivery mode, which uses SRR0/1. The host's EE interrupt handler is written to
      expect HSRR0/1 (for earlier CPUs). This should be fine because XIVE is
      configured not to deliver EEs to the host (Hypervisor Virtualization Interrupt is
      used instead) so the EE handler should never be executed.
      
      However a bug in interrupt controller code, hardware, or odd configuration of a
      simulator could result in the host getting an EE incorrectly. Keeping the EE
      delivery mode matching the host EE handler prevents strange crashes due to using
      the wrong exception registers.
      
      KVM will configure the LPCR to set LPES prior to running a guest so that EEs are
      delivered to the guest using SRR0/1.
      
      Fixes: 08a1e650 ("powerpc: Fixup LPCR:PECE and HEIC setting on POWER9")
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      [mpe: Massage change log to avoid referring to LPES0 which is now renamed LPES]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  2. 13 Apr, 2017 18 commits
  3. 12 Apr, 2017 6 commits
    • powerpc/mm: Fix hash table dump when memory is not contiguous · 9e4114b3
      Rashmica Gupta authored
      The current behaviour of the hash table dump assumes that memory is contiguous
      and iterates from the start of memory to (start + size of memory). When memory
      isn't physically contiguous, this doesn't work.
      
      If memory exists at 0-5 GB and 6-10 GB then the current approach will check if
      entries exist in the hash table from 0GB to 9GB. This patch changes the
      behaviour to iterate over any holes up to the end of memory.
      
      Fixes: 1515ab93 ("powerpc/mm: Dump hash table")
      Signed-off-by: Rashmica Gupta <rashmica.g@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
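      The difference between the two iteration strategies in the change log
      above can be sketched with a small user-space model. This is not the
      kernel code: the function names and the region list are assumptions made
      for illustration, using the 0-5 GB / 6-10 GB layout from the change log.

```python
GB = 1 << 30


def old_scan_range(regions):
    """Old behaviour: start of memory to (start + total size of memory)."""
    start = min(base for base, _ in regions)
    total = sum(size for _, size in regions)
    return start, start + total


def new_scan_range(regions):
    """New behaviour: start of memory to the end of memory, holes included."""
    start = min(base for base, _ in regions)
    end = max(base + size for base, size in regions)
    return start, end


# (base, size) pairs: memory at 0-5 GB and 6-10 GB, with a 1 GB hole between.
regions = [(0 * GB, 5 * GB), (6 * GB, 4 * GB)]

print([addr // GB for addr in old_scan_range(regions)])  # [0, 9]
print([addr // GB for addr in new_scan_range(regions)])  # [0, 10]
```

      The old range stops 1 GB short because the hole is not counted in the
      total size, so the last gigabyte of real memory is never checked.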
    • powerpc/mm: Add physical address to Linux page table dump · aaa22952
      Oliver O'Halloran authored
      The current page table dumper scans the Linux page tables and coalesces mappings
      with adjacent virtual addresses and similar PTE flags. This behaviour is
      somewhat broken when you consider the IOREMAP space where entirely unrelated
      mappings will appear to be virtually contiguous. This patch modifies the range
      coalescing so that only ranges that are both physically and virtually contiguous
      are combined. This patch also adds to the dump output the physical address at
      the start of each range.
      
      Fixes: 8eb07b18 ("powerpc/mm: Dump linux pagetables")
      Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
      [mpe: Print the physical address with 0x like the other addresses]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
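      The coalescing rule described above can be modelled in a few lines. This
      is a sketch under stated assumptions, not the dumper's implementation:
      the `Mapping` type, the `coalesce` helper, the addresses and the output
      format are all hypothetical, and "similar PTE flags" is simplified to an
      exact match.

```python
from typing import List, NamedTuple


class Mapping(NamedTuple):
    va: int     # virtual address of the start of the range
    pa: int     # physical address of the start of the range
    size: int   # length in bytes
    flags: int  # PTE flags (opaque in this model)


def coalesce(mappings: List[Mapping]) -> List[Mapping]:
    """Merge neighbours only when VA, PA and flags all line up."""
    ranges: List[Mapping] = []
    for m in sorted(mappings, key=lambda x: x.va):
        last = ranges[-1] if ranges else None
        if (last is not None
                and last.va + last.size == m.va   # virtually contiguous
                and last.pa + last.size == m.pa   # physically contiguous
                and last.flags == m.flags):       # simplification: exact match
            ranges[-1] = last._replace(size=last.size + m.size)
        else:
            ranges.append(m)
    return ranges


PAGE = 64 * 1024
VA = 0xD000080000000000  # hypothetical ioremap-style virtual address
pages = [
    Mapping(VA, 0x10000000, PAGE, 0x1),
    Mapping(VA + PAGE, 0x10000000 + PAGE, PAGE, 0x1),  # follows in VA and PA
    Mapping(VA + 2 * PAGE, 0x80000000, PAGE, 0x1),     # VA-adjacent, unrelated PA
]

for r in coalesce(pages):
    # Show the physical start address alongside each virtual range.
    print(f"0x{r.va:016x}-0x{r.va + r.size - 1:016x} 0x{r.pa:016x} {r.size >> 10}K")
```

      The third page is virtually adjacent to the second but maps unrelated
      physical memory, so it stays a separate range.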
    • powerpc/mm: Fix missing _PAGE_NON_IDEMPOTENT in pgtable dump · 70538eaa
      Oliver O'Halloran authored
      On Book3s we have two PTE flags used to mark cache-inhibited mappings:
      _PAGE_TOLERANT and _PAGE_NON_IDEMPOTENT. Currently the kernel page table dumper
      only looks at the generic _PAGE_NO_CACHE which is defined to be _PAGE_TOLERANT.
      This patch modifies the dumper so both flags are shown in the dump.
      
      Fixes: 8eb07b18 ("powerpc/mm: Dump linux pagetables")
      Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/tracing: Allow tracing of mmap syscalls · 9c355917
      Balbir Singh authored
      Currently sys_mmap() and sys_mmap2() (32-bit only) are not visible to the
      syscall tracing machinery. This means users are not able to see the execution of
      mmap() syscalls using the syscall tracer.
      
      Fix that by using SYSCALL_DEFINE6 for sys_mmap() and sys_mmap2() so that the
      meta-data associated with these syscalls is visible to the syscall tracer.
      
      A side-effect of this change is that the return type has changed from unsigned
      long to long. However, this should have no effect: the only code in the kernel
      which uses the result of these syscalls is in the syscall return path, which is
      written in asm and treats the result as unsigned regardless.
      
      Example output:
        cat-3399  [001] ....   196.542410: sys_mmap(addr: 7fff922a0000, len: 20000, prot: 3, flags: 812, fd: 3, offset: 1b0000)
        cat-3399  [001] ....   196.542443: sys_mmap -> 0x7fff922a0000
        cat-3399  [001] ....   196.542668: sys_munmap(addr: 7fff922c0000, len: 6d2c)
        cat-3399  [001] ....   196.542677: sys_munmap -> 0x0
      Signed-off-by: Balbir Singh <bsingharora@gmail.com>
      [mpe: Massage change log, add detail on return type change]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
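      The note on the return type relies on signed and unsigned values of the
      same width sharing one bit pattern. A minimal user-space illustration,
      assuming a 64-bit long; the helper name is hypothetical and -12 is just
      an example of a negative error return:

```python
import ctypes


def as_unsigned_long(value: int) -> int:
    """Reinterpret a signed 64-bit value as unsigned, bit for bit."""
    return ctypes.c_uint64(value).value


# A successful mmap() result reads the same under either type.
print(hex(as_unsigned_long(0x7FFF922A0000)))

# A negative error return keeps its bit pattern when read as unsigned.
print(hex(as_unsigned_long(-12)))
```

      Code that only ever treats the register contents as unsigned therefore
      sees the same bits whichever C type the syscall is declared with.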
    • powerpc/mm: Fix swapper_pg_dir size on 64-bit hash w/64K pages · 03dfee6d
      Michael Ellerman authored
      Recently in commit f6eedbba ("powerpc/mm/hash: Increase VA range to 128TB"),
      we increased H_PGD_INDEX_SIZE to 15 when we're building with 64K pages. This
      makes it larger than RADIX_PGD_INDEX_SIZE (13), which means the logic to
      calculate MAX_PGD_INDEX_SIZE in book3s/64/pgtable.h is wrong.
      
      The end result is that the PGD (Page Global Directory, ie top level page table)
      of the kernel (aka. swapper_pg_dir), is too small.
      
      This generally doesn't lead to a crash, as we don't use the full range in normal
      operation. However if we try to dump the kernel pagetables we can trigger a
      crash because we walk off the end of the pgd into other memory and eventually
      try to dereference something bogus:
      
        $ cat /sys/kernel/debug/kernel_pagetables
        Unable to handle kernel paging request for data at address 0xe8fece0000000000
        Faulting instruction address: 0xc000000000072314
        cpu 0xc: Vector: 380 (Data SLB Access) at [c0000000daa13890]
            pc: c000000000072314: ptdump_show+0x164/0x430
            lr: c000000000072550: ptdump_show+0x3a0/0x430
           dar: e802cf0000000000
        seq_read+0xf8/0x560
        full_proxy_read+0x84/0xc0
        __vfs_read+0x6c/0x1d0
        vfs_read+0xbc/0x1b0
        SyS_read+0x6c/0x110
        system_call+0x38/0xfc
      
      The root cause is that MAX_PGD_INDEX_SIZE isn't actually computed to be
      the max of H_PGD_INDEX_SIZE or RADIX_PGD_INDEX_SIZE. To fix that move
      the calculation into asm-offsets.c where we can do it easily using
      max().
      
      Fixes: f6eedbba ("powerpc/mm/hash: Increase VA range to 128TB")
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
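      The size mismatch is easy to see with the two index sizes quoted in the
      change log. The sketch below only models the intended result of the
      max() computation. How the old header logic arrived at the wrong value
      is not stated above and is not modelled, and `pgd_entries` is a
      hypothetical helper.

```python
H_PGD_INDEX_SIZE = 15      # hash MMU with 64K pages (value from the change log)
RADIX_PGD_INDEX_SIZE = 13  # radix MMU (value from the change log)

# Intended definition: the larger of the two index sizes.
MAX_PGD_INDEX_SIZE = max(H_PGD_INDEX_SIZE, RADIX_PGD_INDEX_SIZE)


def pgd_entries(index_size: int) -> int:
    """Number of entries in a PGD indexed by `index_size` bits."""
    return 1 << index_size


print(MAX_PGD_INDEX_SIZE)                 # 15
print(pgd_entries(MAX_PGD_INDEX_SIZE))    # 32768
print(pgd_entries(RADIX_PGD_INDEX_SIZE))  # 8192
```

      A PGD sized with the smaller index holds a quarter of the entries that a
      walk over the full hash range expects, which is how the dumper runs off
      the end.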
    • Merge branch 'topic/xive' (early part) into next · 3c19d5ad
      Michael Ellerman authored
      This merges the arch part of the XIVE support, leaving the final commit
      with the KVM specific pieces dangling on the branch for Paul to merge
      via the kvm-ppc tree.
  4. 10 Apr, 2017 14 commits
    • powerpc/powernv: Recover correct PACA on wakeup from a stop on P9 DD1 · 17ed4c8f
      Gautham R. Shenoy authored
      POWER9 DD1.0 hardware has a bug where the SPRs of a thread waking up
      from stop 0,1,2 with ESL=1 can end up being misplaced in the core. Thus
      the HSPRG0 of a thread waking up from stop can contain the paca pointer of
      its sibling.
      
      This patch implements a context recovery framework within threads of a
      core, by provisioning space in paca_struct for saving every sibling
      thread's paca pointers. Basically, we should be able to arrive at the
      right paca pointer from any of the threads' existing paca pointers.

      At bootup, during powernv idle-init, we save the paca address of every
      CPU in each one of its siblings' paca_struct in the slot corresponding to
      this CPU's index in the core.
      
      On wakeup from a stop, the thread will determine its index in the core
      from the TIR register and recover its PACA pointer by indexing into
      the correct slot in the provisioned space in the current PACA.
      
      Furthermore, ensure that the NVGPRs are restored from the stack on the
      way out by setting the NAPSTATELOST in paca.
      
      [Changelog written with inputs from svaidy@linux.vnet.ibm.com]
      Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      [mpe: Call it a bug]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
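      The recovery scheme described above amounts to a per-core lookup table
      that is replicated in every thread's PACA. A user-space model, assuming
      hypothetical names (`Paca`, `sibling_pacas`, `init_core`,
      `recover_paca`) and assuming each thread also stores its own entry:

```python
class Paca:
    """Stand-in for paca_struct with space for every sibling's pointer."""

    def __init__(self, cpu: int, threads_per_core: int):
        self.cpu = cpu
        self.sibling_pacas = [None] * threads_per_core


def init_core(first_cpu: int, threads_per_core: int):
    """Boot-time step: record every thread's PACA in all PACAs of the core."""
    pacas = [Paca(first_cpu + i, threads_per_core)
             for i in range(threads_per_core)]
    for index, paca in enumerate(pacas):
        for holder in pacas:
            # Slot `index` corresponds to that thread's index in the core.
            holder.sibling_pacas[index] = paca
    return pacas


def recover_paca(current_paca: Paca, thread_index: int) -> Paca:
    """Wakeup step: use the thread index (from TIR) to find the right PACA."""
    return current_paca.sibling_pacas[thread_index]


pacas = init_core(first_cpu=8, threads_per_core=4)

# Thread 2 wakes up holding thread 0's PACA pointer by mistake.
recovered = recover_paca(pacas[0], thread_index=2)
print(recovered.cpu)  # 10
```

      Because every PACA in the core carries the same table, the lookup gives
      the right answer whichever sibling's pointer the thread woke up with.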
    • powerpc/powernv/idle: Don't override default/deepest directly in kernel · f3b3f284
      Gautham R. Shenoy authored
      Currently during idle-init on power9, if we don't find suitable stop
      states in the device tree that can be used as the
      default_stop/deepest_stop, we set stop0 (ESL=1,EC=1) as the default
      stop state psscr to be used by power9_idle and deepest stop state
      which is used by CPU-Hotplug.
      
      However, if the platform firmware has not configured or enabled a stop
      state, the kernel should not make any assumptions or fall back to a
      default choice.
      
      If the kernel uses a stop state that is not configured by the platform
      firmware, it may lead to further failures which should be avoided.
      
      In this patch, we modify the init code to ensure that the kernel uses
      only the stop states exposed by the firmware through the device
      tree. When a suitable default stop state isn't found, we disable
      ppc_md.power_save for power9. Similarly, when a suitable
      deepest_stop_state is not found in the device tree exported by the
      firmware, fall back to the default busy-wait loop in the CPU-Hotplug
      code.
      
      [Changelog written with inputs from svaidy@linux.vnet.ibm.com]
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
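      The decision made at idle-init can be summarised as follows. This is a
      sketch only: `choose_idle_paths` is a hypothetical helper, the PSSCR
      values are made up, and `None` stands for "no suitable state found in
      the device tree".

```python
from typing import Optional, Tuple


def choose_idle_paths(default_stop_psscr: Optional[int],
                      deepest_stop_psscr: Optional[int]
                      ) -> Tuple[Optional[str], str]:
    """Return (power_save handler, CPU-Hotplug offline method)."""
    # No default stop state: leave power_save unset rather than guess one.
    power_save = "power9_idle" if default_stop_psscr is not None else None
    # No deepest stop state: offline CPUs spin in the busy-wait loop.
    offline = "stop" if deepest_stop_psscr is not None else "busy-wait"
    return power_save, offline


# Firmware exposed both states (hypothetical PSSCR values).
print(choose_idle_paths(0x300000, 0x300332))

# Firmware exposed nothing usable.
print(choose_idle_paths(None, None))
```

      The two choices are independent, so firmware that exposes only one of
      the states still gets the matching stop-based path for that one.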
    • powerpc/powernv/smp: Add busy-wait loop as fall back for CPU-Hotplug · 90061231
      Gautham R. Shenoy authored
      Currently, the powernv cpu-offline function assumes that platform idle
      states such as stop on POWER9, winkle/sleep/nap on POWER8 are always
      available. On POWER8, it picks nap as the default state if other deep
      idle states like sleep/winkle are not available and enabled in the
      platform.
      
      On POWER9, nap is not available and all idle states are managed by
      the STOP instruction. The parameters of the idle state are passed through
      the processor stop status control register (PSSCR), so executing STOP
      takes its parameters from the current PSSCR. We do not
      want to make any assumptions in the kernel about which STOP states and
      PSSCR features are configured by the platform.
      
      Ideally the platform will configure a good set of stop states that can be
      used in the kernel. We would like to start with a clean slate if the
      platform chooses not to configure any state, or if there is an error in
      platform firmware that leads to no stop states being configured or
      allowed to be requested.
      
      This patch adds a fallback method for CPU-Hotplug that is similar to
      snooze loop at idle where the threads are left to spin at low priority
      and hence reduce the cycles consumed.
      
      This is a safe fallback mechanism for the case where no stop state can
      be requested because the platform firmware did not configure any, most
      likely due to an error condition.
      
      Requesting a stop state when the platform has not configured them or
      enabled them would lead to further error conditions which could be
      difficult to debug.
      
      [Changelog written with inputs from svaidy@linux.vnet.ibm.com]
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv: Move CPU-Offline idle state invocation from smp.c to idle.c · a7cd88da
      Gautham R. Shenoy authored
      Move the piece of code in powernv/smp.c::pnv_smp_cpu_kill_self() which
      transitions the CPU to the deepest available platform idle state to a
      new function named pnv_cpu_offline() in powernv/idle.c. The rationale
      behind this code movement is that the data required to determine the
      deepest available platform state resides in powernv/idle.c.
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/hugetlb: Add ABI defines for supported HugeTLB page sizes · 2c9faa76
      Anshuman Khandual authored
      Add user space exported API definitions for 512KB, 1MB, 2MB, 8MB, 16MB,
      1GB, 16GB non default huge page sizes to be used with mmap() system
      call.
      Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      [mpe: Reword the comment to emphasise that these are only needed to use
       the non-default huge page size, and updated the change log.]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/mm: Remove redundant initmem information from log · ea614555
      Anshuman Khandual authored
      The generic core VM already prints this information in the log
      buffer, hence there is no need for a second print. This just
      removes the second print from the arch powerpc NUMA init path.
      
      Before the patch:
      
        $ dmesg | grep "Initmem"
      
        numa: Initmem setup node 0 [mem 0x00000000-0xffffffff]
        numa: Initmem setup node 1 [mem 0x100000000-0x1ffffffff]
        numa: Initmem setup node 2 [mem 0x200000000-0x2ffffffff]
        numa: Initmem setup node 3 [mem 0x300000000-0x3ffffffff]
        numa: Initmem setup node 4 [mem 0x400000000-0x4ffffffff]
        numa: Initmem setup node 5 [mem 0x500000000-0x5ffffffff]
        numa: Initmem setup node 6 [mem 0x600000000-0x6ffffffff]
        numa: Initmem setup node 7 [mem 0x700000000-0x7ffffffff]
        Initmem setup node 0 [mem 0x0000000000000000-0x00000000ffffffff]
        Initmem setup node 1 [mem 0x0000000100000000-0x00000001ffffffff]
        Initmem setup node 2 [mem 0x0000000200000000-0x00000002ffffffff]
        Initmem setup node 3 [mem 0x0000000300000000-0x00000003ffffffff]
        Initmem setup node 4 [mem 0x0000000400000000-0x00000004ffffffff]
        Initmem setup node 5 [mem 0x0000000500000000-0x00000005ffffffff]
        Initmem setup node 6 [mem 0x0000000600000000-0x00000006ffffffff]
        Initmem setup node 7 [mem 0x0000000700000000-0x00000007ffffffff]
      
      After the patch just the latter set is printed.
      Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc: Make sparsemem the default on 64-bit Book3S · 7b3912f4
      Michael Ellerman authored
      Make sparsemem the default on all 64-bit Book3S platforms. It already is
      for pseries and ps3, and we need to enable it for powernv because on
      POWER9 memory between chips is discontiguous.
      
      For the other platforms sparsemem should work fine, though it might add
      a small amount of overhead. We can always force FLATMEM in the
      defconfigs if necessary.
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/nohash: Fix use of mmu_has_feature() in setup_initial_memory_limit() · 4868e350
      Michael Ellerman authored
      setup_initial_memory_limit() is called from early_init_devtree(), which
      runs prior to feature patching. If the kernel is built with CONFIG_JUMP_LABEL=y
      and CONFIG_JUMP_LABEL_FEATURE_CHECKS=y then we will potentially get the
      wrong value.
      
      If we also have CONFIG_JUMP_LABEL_FEATURE_CHECK_DEBUG=y we get a warning
      and backtrace:
      
        Warning! mmu_has_feature() used prior to jump label init!
        CPU: 0 PID: 0 Comm: swapper Not tainted 4.11.0-rc4-gccN-next-20170331-g6af2434c #1
        Call Trace:
        [c000000000fc3d50] [c000000000a26c30] .dump_stack+0xa8/0xe8 (unreliable)
        [c000000000fc3de0] [c00000000002e6b8] .setup_initial_memory_limit+0xa4/0x104
        [c000000000fc3e60] [c000000000d5c23c] .early_init_devtree+0xd0/0x2f8
        [c000000000fc3f00] [c000000000d5d3b0] .early_setup+0x90/0x11c
        [c000000000fc3f90] [c000000000000520] start_here_multiplatform+0x68/0x80
      
      Fix it by using early_mmu_has_feature().
      
      Fixes: c12e6f24 ("powerpc: Add option to use jump label for mmu_has_feature()")
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc: Remove unnecessary includes of asm/debug.h · 3ae05fb3
      Michael Ellerman authored
      These files don't seem to have any need for asm/debug.h, now that all it
      includes are the debugger hooks and breakpoint definitions.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc: Create asm/debugfs.h and move powerpc_debugfs_root there · 7644d581
      Michael Ellerman authored
      powerpc_debugfs_root is the dentry representing the root of the
      "powerpc" directory tree in debugfs.
      
      Currently it sits in asm/debug.h, along with some other things that
      have "debug" in the name, but are otherwise unrelated.
      
      Pull it out into a separate header, which also includes linux/debugfs.h,
      and convert all the users to include debugfs.h instead of debug.h.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv: Require MMU_NOTIFIER to fix NPU build · abfe8026
      Alistair Popple authored
      In the recent commit 1ab66d1f ("powerpc/powernv: Introduce address
      translation services for Nvlink2") the NPU code gained a dependency on MMU
      notifiers.
      
      All our defconfigs have KVM enabled, which selects MMU_NOTIFIER, but if KVM is
      not enabled then the build breaks.
      
      Fix it by always selecting MMU_NOTIFIER when we're building powernv.
      
      Fixes: 1ab66d1f ("powerpc/powernv: Introduce address translation services for Nvlink2")
      Signed-off-by: Alistair Popple <alistair@popple.id.au>
      [mpe: Reword change log]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/mm/radix: Remove unnecessary ptesync · f7327e0b
      Aneesh Kumar K.V authored
      For a tlbiel with pid, we need to issue tlbiel with set number encoded. We
      don't need to do ptesync for each of those. Instead we need one for the entire
      tlbiel pid operation.
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/mm/radix: Don't do page walk cache flush when doing full mm flush · f6b0df55
      Aneesh Kumar K.V authored
      For fullmm tlb flush, we do a flush with RIC_FLUSH_ALL which will invalidate all
      related caches (radix__tlb_flush()). Hence the pwc flush is not needed.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc: Fixup LPCR:PECE and HEIC setting on POWER9 · 08a1e650
      Benjamin Herrenschmidt authored
      We need to set LPES in order for normal external interrupts (0x500)
      to be directed to the guest while running in guest state.
      
      We also need HEIC set to prevent them from being sent to the host while
      in host state.
      
      With XIVE the host never gets one of these and wouldn't know how to
      handle it. All host external interrupts come in via the new
      hypervisor virtualization interrupts vector.
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>