1. 10 Apr, 2017 13 commits
    • powerpc/powernv: Recover correct PACA on wakeup from a stop on P9 DD1 · 17ed4c8f
      Gautham R. Shenoy authored
      POWER9 DD1.0 hardware has a bug where the SPRs of a thread waking up
      from stop 0,1,2 with ESL=1 can end up being misplaced in the core. Thus
      the HSPRG0 of a thread waking up from stop can contain the paca pointer
      of its sibling.
      
      This patch implements a context recovery framework within threads of a
      core, by provisioning space in paca_struct for saving every sibling
      thread's paca pointer. Basically, we should be able to arrive at the
      right paca pointer from any of the threads' existing paca pointers.
      
      At bootup, during powernv idle-init, we save the paca address of every
      CPU in each of its siblings' paca_struct, in the slot corresponding to
      this CPU's index in the core.
      
      On wakeup from a stop, the thread will determine its index in the core
      from the TIR register and recover its PACA pointer by indexing into
      the correct slot in the provisioned space in the current PACA.
      
      Furthermore, ensure that the NVGPRs are restored from the stack on the
      way out by setting the NAPSTATELOST in paca.
      
      [Changelog written with inputs from svaidy@linux.vnet.ibm.com]
      Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      [mpe: Call it a bug]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      17ed4c8f
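The recovery scheme described in this commit can be sketched as a small user-space model. This is only an illustration, not the kernel's code: `toy_paca`, `sibling_pacas`, `toy_init_core`, `toy_recover_paca` and the choice of four threads per core are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define THREADS_PER_CORE 4  /* assumption: an SMT4 core, for illustration only */

/* Toy stand-in for the kernel's paca_struct; field names are hypothetical. */
struct toy_paca {
    int cpu_id;
    /* One slot per thread in the core, indexed by the thread's index. */
    struct toy_paca *sibling_pacas[THREADS_PER_CORE];
};

/* Boot-time step: record every thread's paca in each sibling's table. */
static void toy_init_core(struct toy_paca core[THREADS_PER_CORE])
{
    for (int owner = 0; owner < THREADS_PER_CORE; owner++)
        for (int idx = 0; idx < THREADS_PER_CORE; idx++)
            core[owner].sibling_pacas[idx] = &core[idx];
}

/*
 * Wakeup step: given a paca that may belong to a sibling, plus this
 * thread's index in the core (read from TIR on real hardware), return
 * the paca that really belongs to this thread.
 */
static struct toy_paca *toy_recover_paca(struct toy_paca *maybe_wrong,
                                         unsigned tir)
{
    assert(tir < THREADS_PER_CORE);
    return maybe_wrong->sibling_pacas[tir];
}
```

Because every paca in the core carries the full table, the lookup gives the same answer no matter which sibling's paca the thread happens to start from.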
    • powerpc/powernv/idle: Don't override default/deepest directly in kernel · f3b3f284
      Gautham R. Shenoy authored
      Currently during idle-init on power9, if we don't find suitable stop
      states in the device tree that can be used as the
      default_stop/deepest_stop, we set stop0 (ESL=1,EC=1) as the default
      stop state psscr to be used by power9_idle and deepest stop state
      which is used by CPU-Hotplug.
      
      However, if the platform firmware has not configured or enabled a stop
      state, the kernel should not make any assumptions and fall back to a
      default choice.
      
      If the kernel uses a stop state that is not configured by the platform
      firmware, it may lead to further failures which should be avoided.
      
      In this patch, we modify the init code to ensure that the kernel uses
      only the stop states exposed by the firmware through the device
      tree. When a suitable default stop state isn't found, we disable
      ppc_md.power_save for power9. Similarly, when a suitable
      deepest_stop_state is not found in the device tree exported by the
      firmware, fall back to the default busy-wait loop in the CPU-Hotplug
      code.
      
      [Changelog written with inputs from svaidy@linux.vnet.ibm.com]
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      f3b3f284
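A rough model of the policy this commit describes: use only what firmware exposes, and report "nothing found" instead of inventing a state. All names here (`toy_stop_state`, `usable_as_default`, `toy_pick_states`, `toy_pick_offline_action`) are hypothetical, and the `usable_as_default` flag stands in for whatever suitability checks the real init code performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one firmware-exposed stop state. */
struct toy_stop_state {
    unsigned level;          /* requested stop level */
    bool usable_as_default;  /* assumption: result of the real suitability checks */
};

struct toy_idle_config {
    bool have_default;
    unsigned default_level;
    bool have_deepest;
    unsigned deepest_level;
};

/* Pick states strictly from the firmware list; never make one up. */
static struct toy_idle_config toy_pick_states(const struct toy_stop_state *s,
                                              size_t n)
{
    struct toy_idle_config cfg = { false, 0, false, 0 };

    for (size_t i = 0; i < n; i++) {
        if (s[i].usable_as_default && !cfg.have_default) {
            cfg.have_default = true;
            cfg.default_level = s[i].level;
        }
        if (!cfg.have_deepest || s[i].level > cfg.deepest_level) {
            cfg.have_deepest = true;
            cfg.deepest_level = s[i].level;
        }
    }
    return cfg;
}

enum toy_offline_action { TOY_USE_STOP, TOY_BUSY_WAIT };

/* CPU-Hotplug side: without a deepest state, fall back to busy-waiting. */
static enum toy_offline_action
toy_pick_offline_action(const struct toy_idle_config *cfg)
{
    return cfg->have_deepest ? TOY_USE_STOP : TOY_BUSY_WAIT;
}
```

With an empty firmware list both `have_default` and `have_deepest` stay false, which corresponds to disabling the idle hook and using the busy-wait loop for offlined CPUs.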
    • powerpc/powernv/smp: Add busy-wait loop as fall back for CPU-Hotplug · 90061231
      Gautham R. Shenoy authored
      Currently, the powernv cpu-offline function assumes that platform idle
      states such as stop on POWER9, winkle/sleep/nap on POWER8 are always
      available. On POWER8, it picks nap as the default state if other deep
      idle states like sleep/winkle are not available and enabled in the
      platform.
      
      On POWER9, nap is not available and all idle states are managed by
      the STOP instruction. The parameters to the idle state are passed
      through the processor stop status control register (PSSCR). Hence,
      executing STOP takes its parameters from the current PSSCR. We do not
      want to make any assumptions in the kernel about which STOP states and
      PSSCR features are configured by the platform.
      
      Ideally the platform will configure a good set of stop states that can
      be used in the kernel. We would like to start with a clean slate if the
      platform chooses not to configure any state, or if an error in platform
      firmware leads to no stop states being configured or allowed to be
      requested.
      
      This patch adds a fallback method for CPU-Hotplug that is similar to
      the snooze loop at idle, where the threads are left to spin at low
      priority and hence reduce the cycles consumed.
      
      This is a safe fallback mechanism for the case where no stop state can
      be requested because the platform firmware did not configure any, most
      likely due to an error condition.
      
      Requesting a stop state when the platform has not configured them or
      enabled them would lead to further error conditions which could be
      difficult to debug.
      
      [Changelog written with inputs from svaidy@linux.vnet.ibm.com]
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      90061231
    • powerpc/powernv: Move CPU-Offline idle state invocation from smp.c to idle.c · a7cd88da
      Gautham R. Shenoy authored
      Move the piece of code in powernv/smp.c::pnv_smp_cpu_kill_self() which
      transitions the CPU to the deepest available platform idle state to a
      new function named pnv_cpu_offline() in powernv/idle.c. The rationale
      behind this code movement is that the data required to determine the
      deepest available platform state resides in powernv/idle.c.
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      a7cd88da
    • powerpc/hugetlb: Add ABI defines for supported HugeTLB page sizes · 2c9faa76
      Anshuman Khandual authored
      Add user space exported API definitions for the 512KB, 1MB, 2MB, 8MB,
      16MB, 1GB and 16GB non-default huge page sizes, to be used with the
      mmap() system call.
      Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      [mpe: Reword the comment to emphasise that these are only needed to use
       the non-default huge page size, and updated the change log.]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      2c9faa76
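For context, Linux encodes a non-default huge page size in the mmap() flags as the base-2 logarithm of the page size, shifted into a dedicated bit field. The sketch below shows that arithmetic only. `TOY_MAP_HUGE_SHIFT` and both helpers are illustrative names, and the shift value of 26 is an assumption taken from the generic Linux uapi headers, not from this commit's text.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: same value as MAP_HUGE_SHIFT in Linux's uapi headers. */
#define TOY_MAP_HUGE_SHIFT 26

/* Integer log2 for a power-of-two value. */
static unsigned toy_log2_u64(uint64_t v)
{
    unsigned n = 0;

    while (v > 1) {
        v >>= 1;
        n++;
    }
    return n;
}

/* Build the flag bits that select a given huge page size. */
static uint64_t toy_map_huge_flag(uint64_t page_size)
{
    /* Huge page sizes are powers of two. */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return (uint64_t)toy_log2_u64(page_size) << TOY_MAP_HUGE_SHIFT;
}
```

For example, a 2MB page is 2^21 bytes, so its flag value is 21 shifted left by 26; a 16GB page is 2^34 bytes and needs a 64-bit intermediate when shifted.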
    • powerpc/mm: Remove reduntant initmem information from log · ea614555
      Anshuman Khandual authored
      Generic core VM already prints this information in the log
      buffer, hence there is no need for a second print. This just
      removes the second print from the arch powerpc NUMA init path.
      
      Before the patch:
      
        $ dmesg | grep "Initmem"
      
        numa: Initmem setup node 0 [mem 0x00000000-0xffffffff]
        numa: Initmem setup node 1 [mem 0x100000000-0x1ffffffff]
        numa: Initmem setup node 2 [mem 0x200000000-0x2ffffffff]
        numa: Initmem setup node 3 [mem 0x300000000-0x3ffffffff]
        numa: Initmem setup node 4 [mem 0x400000000-0x4ffffffff]
        numa: Initmem setup node 5 [mem 0x500000000-0x5ffffffff]
        numa: Initmem setup node 6 [mem 0x600000000-0x6ffffffff]
        numa: Initmem setup node 7 [mem 0x700000000-0x7ffffffff]
        Initmem setup node 0 [mem 0x0000000000000000-0x00000000ffffffff]
        Initmem setup node 1 [mem 0x0000000100000000-0x00000001ffffffff]
        Initmem setup node 2 [mem 0x0000000200000000-0x00000002ffffffff]
        Initmem setup node 3 [mem 0x0000000300000000-0x00000003ffffffff]
        Initmem setup node 4 [mem 0x0000000400000000-0x00000004ffffffff]
        Initmem setup node 5 [mem 0x0000000500000000-0x00000005ffffffff]
        Initmem setup node 6 [mem 0x0000000600000000-0x00000006ffffffff]
        Initmem setup node 7 [mem 0x0000000700000000-0x00000007ffffffff]
      
      After the patch just the latter set is printed.
      Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      ea614555
    • powerpc: Make sparsemem the default on 64-bit Book3S · 7b3912f4
      Michael Ellerman authored
      Make sparsemem the default on all 64-bit Book3S platforms. It already is
      for pseries and ps3, and we need to enable it for powernv because on
      POWER9 memory between chips is discontiguous.
      
      For the other platforms sparsemem should work fine, though it might add
      a small amount of overhead. We can always force FLATMEM in the
      defconfigs if necessary.
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      7b3912f4
    • powerpc/nohash: Fix use of mmu_has_feature() in setup_initial_memory_limit() · 4868e350
      Michael Ellerman authored
      setup_initial_memory_limit() is called from early_init_devtree(), which
      runs prior to feature patching. If the kernel is built with CONFIG_JUMP_LABEL=y
      and CONFIG_JUMP_LABEL_FEATURE_CHECKS=y then we will potentially get the
      wrong value.
      
      If we also have CONFIG_JUMP_LABEL_FEATURE_CHECK_DEBUG=y we get a warning
      and backtrace:
      
        Warning! mmu_has_feature() used prior to jump label init!
        CPU: 0 PID: 0 Comm: swapper Not tainted 4.11.0-rc4-gccN-next-20170331-g6af2434c #1
        Call Trace:
        [c000000000fc3d50] [c000000000a26c30] .dump_stack+0xa8/0xe8 (unreliable)
        [c000000000fc3de0] [c00000000002e6b8] .setup_initial_memory_limit+0xa4/0x104
        [c000000000fc3e60] [c000000000d5c23c] .early_init_devtree+0xd0/0x2f8
        [c000000000fc3f00] [c000000000d5d3b0] .early_setup+0x90/0x11c
        [c000000000fc3f90] [c000000000000520] start_here_multiplatform+0x68/0x80
      
      Fix it by using early_mmu_has_feature().
      
      Fixes: c12e6f24 ("powerpc: Add option to use jump label for mmu_has_feature()")
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      4868e350
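The difference between the two checks can be illustrated with a toy model. This is not the kernel's implementation: `toy_cpu`, `labels_patched` and both functions are hypothetical, and the placeholder answer of "true" before patching is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_cpu {
    unsigned long mmu_features;  /* feature bitmask discovered at boot */
    bool labels_patched;         /* set once feature patching has run */
};

/* Early-safe check: always reads the feature bitmask directly. */
static bool toy_early_mmu_has_feature(const struct toy_cpu *c,
                                      unsigned long feature)
{
    return (c->mmu_features & feature) != 0;
}

/*
 * Patched check: until patching has run, the call site still holds its
 * build-time placeholder (assumed here to answer "true"), so the result
 * can disagree with the real feature bits.
 */
static bool toy_mmu_has_feature(const struct toy_cpu *c,
                                unsigned long feature)
{
    if (!c->labels_patched)
        return true;
    return (c->mmu_features & feature) != 0;
}
```

In this model, code that runs before patching gets a reliable answer only from the early variant, which is the substitution the fix makes.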
    • powerpc: Remove unnecessary includes of asm/debug.h · 3ae05fb3
      Michael Ellerman authored
      These files don't seem to have any need for asm/debug.h, now that all it
      includes are the debugger hooks and breakpoint definitions.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      3ae05fb3
    • powerpc: Create asm/debugfs.h and move powerpc_debugfs_root there · 7644d581
      Michael Ellerman authored
      powerpc_debugfs_root is the dentry representing the root of the
      "powerpc" directory tree in debugfs.
      
      Currently it sits in asm/debug.h, along with some other things that
      have "debug" in the name, but are otherwise unrelated.
      
      Pull it out into a separate header, which also includes linux/debugfs.h,
      and convert all the users to include debugfs.h instead of debug.h.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      7644d581
    • powerpc/powernv: Require MMU_NOTIFIER to fix NPU build · abfe8026
      Alistair Popple authored
      In the recent commit 1ab66d1f ("powerpc/powernv: Introduce address
      translation services for Nvlink2") the NPU code gained a dependency on MMU
      notifiers.
      
      All our defconfigs have KVM enabled, which selects MMU_NOTIFIER, but if KVM is
      not enabled then the build breaks.
      
      Fix it by always selecting MMU_NOTIFIER when we're building powernv.
      
      Fixes: 1ab66d1f ("powerpc/powernv: Introduce address translation services for Nvlink2")
      Signed-off-by: Alistair Popple <alistair@popple.id.au>
      [mpe: Reword change log]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      abfe8026
    • powerpc/mm/radix: Remove unnecessary ptesync · f7327e0b
      Aneesh Kumar K.V authored
      For a tlbiel with pid, we need to issue tlbiel with set number encoded. We
      don't need to do ptesync for each of those. Instead we need one for the entire
      tlbiel pid operation.
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      f7327e0b
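The saving can be shown with a counting model that replaces the real instructions with counters. Everything here is illustrative: the helper names, the `toy_flush_stats` struct, and the assumed count of 128 TLB sets are not taken from the commit, and the exact barrier placement in the real code may differ.

```c
#include <assert.h>

#define TOY_TLB_SETS 128u  /* assumption: number of TLB sets, for illustration */

struct toy_flush_stats {
    unsigned tlbiel_count;
    unsigned ptesync_count;
};

/* Stand-ins for the real instructions; they only count invocations. */
static void toy_tlbiel(struct toy_flush_stats *s, unsigned pid, unsigned set)
{
    (void)pid;
    (void)set;
    s->tlbiel_count++;
}

static void toy_ptesync(struct toy_flush_stats *s)
{
    s->ptesync_count++;
}

/* Variant that synchronises after every per-set invalidation. */
static void toy_flush_pid_sync_each(struct toy_flush_stats *s, unsigned pid)
{
    assert(s != 0);
    for (unsigned set = 0; set < TOY_TLB_SETS; set++) {
        toy_tlbiel(s, pid, set);
        toy_ptesync(s);
    }
}

/* Variant that synchronises once for the whole per-pid operation. */
static void toy_flush_pid_sync_once(struct toy_flush_stats *s, unsigned pid)
{
    assert(s != 0);
    for (unsigned set = 0; set < TOY_TLB_SETS; set++)
        toy_tlbiel(s, pid, set);
    toy_ptesync(s);
}
```

Both variants issue the same number of per-set invalidations; only the number of synchronisation barriers changes.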
    • powerpc/mm/radix: Don't do page walk cache flush when doing full mm flush · f6b0df55
      Aneesh Kumar K.V authored
      For fullmm tlb flush, we do a flush with RIC_FLUSH_ALL which will invalidate all
      related caches (radix__tlb_flush()). Hence the pwc flush is not needed.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      f6b0df55
  2. 04 Apr, 2017 3 commits
    • powerpc/powernv: Add OPAL exports attributes to sysfs · 11fe909d
      Matt Brown authored
      New versions of OPAL have a device node /ibm,opal/firmware/exports, each
      property of which describes a range of memory in OPAL that Linux might
      want to export to userspace for debugging.
      
      This patch adds a sysfs file under 'opal/exports' for each property
      found there, and makes it read-only by root.
      Signed-off-by: Matt Brown <matthew.brown.dev@gmail.com>
      [mpe: Drop counting of props, rename to attr, free on sysfs error, c'log]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      11fe909d
    • powerpc/prom: Increase minimum RMA size to 512MB · 687da8fc
      Sukadev Bhattiprolu authored
      When booting very large systems with a large initrd, we run out of
      space early in boot for either RTAS or the flattened device tree (FDT).
      Boot fails with messages like:
      
      	Could not allocate memory for RTAS
      or
      	No memory for flatten_device_tree (no room)
      
      Increasing the minimum RMA size to 512MB fixes the problem. This
      should not have an impact on smaller LPARs (with 256MB memory),
      as the firmware will cap the RMA to the memory assigned to the LPAR.
      
      Fix is based on input/discussions with Michael Ellerman. Thanks to
      Praveen K. Pandey for testing on a large system.
      Reported-by: Praveen K. Pandey <preveen.pandey@in.ibm.com>
      Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      687da8fc
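The claim that small LPARs are unaffected comes down to a simple cap, sketched below. `toy_usable_rma_mb` is a hypothetical helper that models only the capping behaviour described in the commit message; it does not model anything else the firmware may do when sizing the RMA.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Lower bound on the RMA the kernel can count on: the requested minimum,
 * capped by the memory actually assigned to the LPAR.
 */
static uint64_t toy_usable_rma_mb(uint64_t requested_min_mb,
                                  uint64_t lpar_mem_mb)
{
    assert(requested_min_mb > 0);
    return requested_min_mb < lpar_mem_mb ? requested_min_mb : lpar_mem_mb;
}
```

A 256MB LPAR therefore still ends up with 256MB under either minimum, while a large LPAR moves from a 256MB to a 512MB floor.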
    • powerpc/powernv: Introduce address translation services for Nvlink2 · 1ab66d1f
      Alistair Popple authored
      Nvlink2 supports address translation services (ATS) allowing devices
      to request address translations from an MMU known as the nest MMU,
      which is set up to walk the CPU page tables.
      
      To access this functionality certain firmware calls are required to
      set up and manage hardware context tables in the nvlink processing unit
      (NPU). The NPU also manages forwarding of TLB invalidates (known as
      address translation shootdowns/ATSDs) to attached devices.
      
      This patch exports several methods to allow device drivers to register
      a process id (PASID/PID) in the hardware tables and to receive
      notification of when a device should stop issuing address translation
      requests (ATRs). It also adds a fault handler to allow device drivers
      to demand fault pages in.
      Signed-off-by: Alistair Popple <alistair@popple.id.au>
      [mpe: Fix up comment formatting, use flush_tlb_mm()]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      1ab66d1f
  3. 03 Apr, 2017 6 commits
  4. 01 Apr, 2017 5 commits
    • powerpc/mm: Enable mappings above 128TB · f4ea6dcb
      Aneesh Kumar K.V authored
      Not all user space applications are ready to handle wide addresses. It's
      known that at least some JIT compilers use higher bits in pointers to
      encode their information. This collides with valid pointers with 512TB
      addresses and leads to crashes.
      
      To mitigate this, we are not going to allocate virtual address space
      above 128TB by default.
      
      But userspace can ask for allocation from the full address space by
      specifying a hint address (with or without MAP_FIXED) above 128TB.
      
      If the hint address is set above 128TB, but MAP_FIXED is not specified,
      we try to look for an unmapped area at the specified address. If it's
      already occupied, we look for an unmapped area in the *full* address
      space, rather than from the 128TB window.
      
      This approach makes it easy to make an application's memory allocator
      aware of the large address space without manually tracking allocated
      virtual address space.
      
      This is going to be a per-mmap decision, i.e. we can have some mmaps with
      larger addresses and others that do not.
      
      A sample memory layout looks like:
      
        10000000-10010000 r-xp 00000000 fc:00 9057045          /home/max_addr_512TB
        10010000-10020000 r--p 00000000 fc:00 9057045          /home/max_addr_512TB
        10020000-10030000 rw-p 00010000 fc:00 9057045          /home/max_addr_512TB
        10029630000-10029660000 rw-p 00000000 00:00 0          [heap]
        7fff834a0000-7fff834b0000 rw-p 00000000 00:00 0
        7fff834b0000-7fff83670000 r-xp 00000000 fc:00 9177190  /lib/powerpc64le-linux-gnu/libc-2.23.so
        7fff83670000-7fff83680000 r--p 001b0000 fc:00 9177190  /lib/powerpc64le-linux-gnu/libc-2.23.so
        7fff83680000-7fff83690000 rw-p 001c0000 fc:00 9177190  /lib/powerpc64le-linux-gnu/libc-2.23.so
        7fff83690000-7fff836a0000 rw-p 00000000 00:00 0
        7fff836a0000-7fff836c0000 r-xp 00000000 00:00 0        [vdso]
        7fff836c0000-7fff83700000 r-xp 00000000 fc:00 9177193  /lib/powerpc64le-linux-gnu/ld-2.23.so
        7fff83700000-7fff83710000 r--p 00030000 fc:00 9177193  /lib/powerpc64le-linux-gnu/ld-2.23.so
        7fff83710000-7fff83720000 rw-p 00040000 fc:00 9177193  /lib/powerpc64le-linux-gnu/ld-2.23.so
        7fffdccf0000-7fffdcd20000 rw-p 00000000 00:00 0        [stack]
        1000000000000-1000000010000 rw-p 00000000 00:00 0
        1ffff83710000-1ffff83720000 rw-p 00000000 00:00 0
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      f4ea6dcb
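The per-mmap decision described in this commit can be modelled as a choice of search ceiling based on the hint address. This is a simplified sketch: `toy_search_limit` and the `TOY_*` constants are hypothetical names, only the 128TB and 512TB figures come from the commit message, and the behaviour for a hint of exactly 128TB is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_TB             (1ULL << 40)
#define TOY_DEFAULT_WINDOW (128 * TOY_TB)  /* default allocation window */
#define TOY_FULL_SPACE     (512 * TOY_TB)  /* full user address space */

/*
 * Ceiling used when searching for an unmapped area: a hint above the
 * default window opts this one mmap() call in to the full address space.
 */
static uint64_t toy_search_limit(uint64_t hint_addr)
{
    assert(hint_addr < TOY_FULL_SPACE);
    return hint_addr > TOY_DEFAULT_WINDOW ? TOY_FULL_SPACE
                                          : TOY_DEFAULT_WINDOW;
}
```

An application that never passes a high hint keeps getting addresses below 128TB, which is what protects code that stores data in the upper pointer bits.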
    • powerpc/pseries: Skip using reserved virtual address range · 82228e36
      Aneesh Kumar K.V authored
      Now that we use all the available virtual address range, we need to make
      sure we don't generate a VSID that overlaps with the reserved VSID
      range. The reserved VSID range includes the virtual address range used by
      the adjunct partition and also the VRMA virtual segment. We find the
      context value that can result in generating such a VSID and reserve it
      early in boot.
      
      We don't look at the adjunct range, because for now we disable the
      adjunct usage in a Linux LPAR via CAS interface.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      [mpe: Rewrite hash__reserve_context_id(), move the rest into pseries]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      82228e36
    • powerpc/mm/hash: Store addr_limit in PACA · bb183221
      Aneesh Kumar K.V authored
      We optimize the slice page size array copy to the paca by copying only
      the range based on addr_limit. This requires us to not look at the page
      size array beyond addr_limit in the PACA on an slb fault. To enable
      that, copy the task size to the paca, where it will be used during an slb
      fault.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      [mpe: Rename from task_size to addr_limit, consolidate #ifdefs]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      bb183221
    • powerpc/mm: Add addr_limit to mm_context and use it to derive max slice index · 957b778a
      Aneesh Kumar K.V authored
      In the followup patch, we will increase the slice array size to handle
      a 512TB range, but will limit the max addr to 128TB. Avoid doing
      unnecessary computation and avoid doing slice mask related operations
      above the address limit.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      957b778a
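To see why bounding work by the address limit matters, consider how the number of high slices scales with the limit. The sketch assumes each high slice covers 1TB (a shift of 40); that granularity, along with the names `TOY_SLICE_HIGH_SHIFT` and `toy_max_high_slices`, is an assumption for illustration and is not stated in the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: one high slice per 1TB of address space. */
#define TOY_SLICE_HIGH_SHIFT 40

/* Number of high slices that need to be examined for a given limit. */
static uint64_t toy_max_high_slices(uint64_t addr_limit)
{
    assert(addr_limit != 0);
    return addr_limit >> TOY_SLICE_HIGH_SHIFT;
}
```

Under this assumption a 128TB limit means walking 128 slices, while sizing every loop for the full 512TB array would walk four times as many for mappings that can never exist.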
  5. 31 Mar, 2017 13 commits