- 12 Nov, 2017 7 commits
-
-
Vaidyanathan Srinivasan authored
On PowerNV platforms, firmware provides exit latency and target residency for each of the idle states in nano seconds. Cpuidle framework expects the values in micro seconds. Round up to nearest micro seconds to avoid errors in cases where the values are defined as fractional micro seconds. Default idle state of 'snooze' has exit latency of zero. If other states have fractional micro second exit latency, they would get rounded down to zero micro second and make cpuidle framework choose deeper idle state when snooze loop is the right choice. Reported-by: Anton Blanchard <anton@samba.org> Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Naveen N. Rao authored
Use safer string manipulation functions when dealing with a user-provided string in kprobe_lookup_name(). Reported-by: David Laight <David.Laight@ACULAB.COM> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Naveen N. Rao authored
Commit 3cdfcbfd ("powerpc: Change analyse_instr so it doesn't modify *regs") introduced emulate_update_regs() to perform part of what emulate_step() was doing earlier. However, this function was not added to the kprobes blacklist. Add it so as to prevent it from being probed. Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Naveen N. Rao authored
Per Documentation/kprobes.txt, we don't necessarily need to disable interrupts before invoking the kprobe handlers. Masami submitted similar changes for x86 via commit a19b2e3d ("kprobes/x86: Remove IRQ disabling from ftrace-based/optimized kprobes"). Do the same for powerpc. Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Naveen N. Rao authored
Per Documentation/kprobes.txt, probe handlers need to be invoked with preemption disabled. Update optimized_callback() to do so. Also move get_kprobe_ctlblk() invocation post preemption disable, since it accesses pre-cpu data. This was not an issue so far since optprobes wasn't selected if CONFIG_PREEMPT was enabled. Commit a30b85df ("kprobes: Use synchronize_rcu_tasks() for optprobe with CONFIG_PREEMPT=y") changes this. Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Stephen Rothwell authored
Commit 78adf6c2 ("powerpc/64s: Implement system reset idle wakeup reason"), added a call to ppc_save_regs() in the book3s code. ppc_save_regs() is only built if XMON and/or KEXEC_CORE are enabled, which is usually the case, however if they're not enabled then the build breaks. Fix it by making the Makefile check also build ppc_save_regs.o if CONFIG_PPC_BOOK3S is enabled. Fixes: 78adf6c2 ("powerpc/64s: Implement system reset idle wakeup reason") Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> [mpe: Write change log] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Balbir Singh authored
When using the radix MMU on Power9 DD1, to work around a hardware problem, radix__pte_update() is required to do a two stage update of the PTE. First we write a zero value into the PTE, then we flush the TLB, and then we write the new PTE value. In the normal case that works OK, but it does not work if we're updating the PTE that maps the code we're executing, because the mapping is removed by the TLB flush and we can no longer execute from it. Unfortunately the STRICT_RWX code needs to do exactly that. The exact symptoms when we hit this case vary, sometimes we print an oops and then get stuck after that, but I've also seen a machine just get stuck continually page faulting with no oops printed. The variance is presumably due to the exact layout of the text and the page size used for the mappings. In all cases we are unable to boot to a shell. There are possible solutions such as creating a second mapping of the TLB flush code, executing from that, and then jumping back to the original. However we don't want to add that level of complexity for a DD1 work around. So just detect that we're running on Power9 DD1 and refrain from changing the permissions, effectively disabling STRICT_RWX on Power9 DD1. Fixes: 7614ff32 ("powerpc/mm/radix: Implement STRICT_RWX/mark_rodata_ro() for Radix") Cc: stable@vger.kernel.org # v4.13+ Reported-by: Andrew Jeffery <andrew@aj.id.au> [Changelog as suggested by Michael Ellerman <mpe@ellerman.id.au>] Signed-off-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 11 Nov, 2017 18 commits
-
-
Haren Myneni authored
We are using percpu send window on P9 NX (powerNV) instead of opening / closing per each crypto session. Means txwin is removed from workmem. So we do not need to initialize workmem for each request. Signed-off-by: Haren Myneni <haren@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Haren Myneni authored
For P9 NX, the send window is opened for each crypto session and closed upon free. But VAS supports 64K windows per chip for all coprocessors including in user space support. So there is a possibility of not getting the window for kernel requests. This patch reserves windows for each coprocessor type (NX842) and are available forever for kernel requests, Opens each window for each CPU on the corresponding chip during driver initialization. So then use the percpu txwin for NX requests depends on the CPU on which the process is executing. Signed-off-by: Haren Myneni <haren@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Sukadev Bhattiprolu authored
Add support for user space receive window (for the Fast thread-wakeup coprocessor type) Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Sukadev Bhattiprolu authored
Define an interface to return a system-wide unique id for a given VAS window. The vas_win_id() will be used in a follow-on patch to generate an unique handle for a user space receive window. Applications can use this handle to pair send and receive windows for fast thread-wakeup. The hardware refers to this system-wide unique id as a Partition Send Window ID which is expected to be used during fault handling. Hence the "pswid" in the function names. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Sukadev Bhattiprolu authored
Define an interface that the NX drivers can use to find the physical paste address of a send window. This interface is expected to be used with the mmap() operation of the NX driver's device. i.e the user space process can use driver's mmap() operation to map the send window's paste address into their address space and then use copy and paste instructions to submit the CRBs to the NX engine. Note that kernel drivers will use vas_paste_crb() directly and don't need this interface. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Sukadev Bhattiprolu authored
A CP_ABORT instruction is required in processes that have mapped a VAS "paste address" with the intention of using COPY/PASTE instructions. But since CP_ABORT is expensive, we want to restrict it to only processes that use/intend to use COPY/PASTE. Define an interface, set_thread_uses_vas(), that VAS can use to indicate that the current process opened a send window. During context switch, issue CP_ABORT only for processes that have the flag set. Thanks for input from Nick Piggin, Michael Ellerman. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> [mpe: Fix to not use new_thread after _switch() returns] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Sukadev Bhattiprolu authored
We need the SPRN_TIDR to be set for use with fast thread-wakeup (core- to-core wakeup) and also with CAPI. Each thread in a process needs to have a unique id within the process. But for now, we assign globally unique thread ids to all threads in the system. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Philippe Bergheaud <felix@linux.vnet.ibm.com> Signed-off-by: Christophe Lombard <clombard@linux.vnet.ibm.com> [mpe: Simplify tidr clearing on fork() and ctx switch code] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Sukadev Bhattiprolu authored
Export the VAS Window context information to debugfs. We need to hold a mutex when closing the window to prevent a race with the debugfs read(). Rather than introduce a per-instance mutex, we use the global vas_mutex for now, since it is not heavily contended. The window->cop field is only relevant to a receive window so we were not setting it for a send window (which is is paired to a receive window anyway). But to simplify reporting in debugfs, set the 'cop' field for the send window also. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Sukadev Bhattiprolu authored
Define a helper, chip_to_vas_id() to map a given chip id to corresponding vas id. Normally, callers of vas_rx_win_open() and vas_tx_win_open() want the VAS window to be on the same chip where the calling thread is executing. These callers can pass in -1 for the VAS id. This interface will be useful if a thread running on one chip wants to open a window on another chip (like the NX-842 driver does during start up). Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Sukadev Bhattiprolu authored
Create a cpu to vasid mapping so callers can specify -1 instead of trying to find a VAS id. Changelog[v2] [Michael Ellerman] Use per-cpu variables to simplify code. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Sukadev Bhattiprolu authored
Normally, the NX driver waits for the CRBs to be processed before closing the window. But it is better to ensure that the credits are returned before the window gets reassigned later. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Sukadev Bhattiprolu authored
Save the configured max window credits for a window in the vas_window structure. We will need this when polling for return of window credits. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Sukadev Bhattiprolu authored
A VAS window is normally in "busy" state for only a short duration. Reduce the time we wait for the window to go to "not-busy" state to speed-up vas_win_close() a bit. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Sukadev Bhattiprolu authored
Use a helper to have the hardware unpin and mark a window closed. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Sukadev Bhattiprolu authored
Polling for window cast out is listed in the spec, but turns out that it is not strictly necessary and slows down window close. Making it a stub for now. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Sukadev Bhattiprolu authored
Clean up vas.h and the debug code around ifdef vas_debug. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Sukadev Bhattiprolu authored
NX-842, the only user of VAS, sets the window credits to default values but VAS should check the credits against the possible max values. The VAS_WCREDS_MIN is not needed and can be dropped. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Sukadev Bhattiprolu authored
Initialize a few missing window context fields from the window attributes specified by the caller. These fields are currently set to their default values by the caller (NX-842), but would be good to apply them anyway. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 10 Nov, 2017 8 commits
-
-
Nicholas Piggin authored
Take the DSCR value set by firmware as the dscr_default value, rather than zero. POWER9 recommends DSCR default to a non-zero value. Signed-off-by: From: Nicholas Piggin <npiggin@gmail.com> [mpe: Make record_spr_defaults() __init] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
OPAL boot does not insert secondaries at 0x60 to wait at the secondary hold spinloop. Instead they are started later, and inserted at generic_secondary_smp_init(), which is after the secondary hold spinloop. Avoid waiting on this spinloop when booting with OPAL firmware. This wait always times out that case. This saves 100ms boot time on powernv, and 10s of seconds of real time when booting on the simulator in SMP. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
Unmaps that free page tables always flush the entire PID, which is sub-optimal. Provide TLB range flushing with an additional PWC flush that can be use for va range invalidations with PWC flush. Time to munmap N pages of memory including last level page table teardown (after mmap, touch), local invalidate: N 1 2 4 8 16 32 64 vanilla 3.2us 3.3us 3.4us 3.6us 4.1us 5.2us 7.2us patched 1.4us 1.5us 1.7us 1.9us 2.6us 3.7us 6.2us Global invalidate: N 1 2 4 8 16 32 64 vanilla 2.2us 2.3us 2.4us 2.6us 3.2us 4.1us 6.2us patched 2.1us 2.5us 3.4us 5.2us 8.7us 15.7us 6.2us Local invalidates get much better across the board. Global ones have the same issue where multiple tlbies for va flush do get slower than the single tlbie to invalidate the PID. None of this test captures the TLB benefits of avoiding killing everything. Global gets worse, but it is brought in to line with global invalidate for munmap()s that do not free page tables. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
The single page flush ceiling is the cut-off point at which we switch from invalidating individual pages, to invalidating the entire process address space in response to a range flush. Introduce a local variant of this heuristic because local and global tlbie have significantly different properties: - Local tlbiel requires 128 instructions to invalidate a PID, global tlbie only 1 instruction. - Global tlbie instructions are expensive broadcast operations. The local ceiling has been made much higher, 2x the number of instructions required to invalidate the entire PID (i.e., 256 pages). Time to mprotect N pages of memory (after mmap, touch), local invalidate: N 32 34 64 128 256 512 vanilla 7.4us 9.0us 14.6us 26.4us 50.2us 98.3us patched 7.4us 7.8us 13.8us 26.4us 51.9us 98.3us The behaviour of both is identical at N=32 and N=512. Between there, the vanilla kernel does a PID invalidate and the patched kernel does a va range invalidate. At N=128, these require the same number of tlbiel instructions, so the patched version can be sen to be cheaper when < 128, and more expensive when > 128. However this does not well capture the cost of invalidated TLB. The additional cost at 256 pages does not seem prohibitive. It may be the case that increasing the limit further would continue to be beneficial to avoid invalidating all of the process's TLB entries. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
Currently for radix, flush_tlb_range flushes the entire PID, because the Linux mm code does not tell us about page size here for THP vs regular pages. This is quite sub-optimal for small mremap / mprotect / change_protection. So implement va range flushes with two flush passes, one for each page size (regular and THP). The second flush has an order of matnitude fewer tlbie instructions than the first, so it is a relatively small additional cost. There is still room for improvement here with some changes to generic APIs, particularly if there are mostly THP pages to be invalidated, the small page flushes could be reduced. Time to mprotect 1 page of memory (after mmap, touch): vanilla 2.9us 1.8us patched 1.2us 1.6us Time to mprotect 30 pages of memory (after mmap, touch): vanilla 8.2us 7.2us patched 6.9us 17.9us Time to mprotect 34 pages of memory (after mmap, touch): vanilla 9.1us 8.0us patched 9.0us 8.0us 34 pages is the point at which the invalidation switches from va to entire PID, which tlbie can do in a single instruction. This is why in the case of 30 pages, the new code runs slower for this test. This is a deliberate tradeoff already present in the unmap and THP promotion code, the idea is that the benefit from avoiding flushing entire TLB for this PID on all threads in the system. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
Move the barriers and range iteration down into the _tlbie* level, which improves readability. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
Short range flushes issue a sequences of tlbie(l) instructions for individual effective addresses. These do not all require individual barrier sequences, only one covering all tlbie(l) instructions. Commit f7327e0b ("powerpc/mm/radix: Remove unnecessary ptesync") made a similar optimization for tlbiel for PID flushing. For tlbie, the ISA says: The tlbsync instruction provides an ordering function for the effects of all tlbie instructions executed by the thread executing the tlbsync instruction, with respect to the memory barrier created by a subsequent ptesync instruction executed by the same thread. Time to munmap 30 pages of memory (after mmap, touch): local global vanilla 10.9us 22.3us patched 3.4us 14.4us Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Michael Ellerman authored
We have some dependencies & conflicts between patches in fixes and things to go in next, both in the radix TLB flush code and the IMC PMU driver. So merge fixes into next.
-
- 09 Nov, 2017 1 commit
-
-
Gustavo Romero authored
Add a self test to check if FP/VEC/VSX registers are sane (restored correctly) after a FP/VEC/VSX unavailable exception is caught during a transaction. This test checks all possibilities in a thread regarding the combination of MSR.[FP|VEC] states in a thread and for each scenario raises a FP/VEC/VSX unavailable exception in transactional state, verifying if vs0 and vs32 registers, which are representatives of FP/VEC/VSX reg sets, are not corrupted. Signed-off-by: Gustavo Romero <gromero@linux.vnet.ibm.com> Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Cyril Bur <cyrilbur@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 08 Nov, 2017 1 commit
-
-
Balbir Singh authored
It would be nice to be able to dump page tables in a particular context. eg: dumping vmalloc space: 0:mon> dv 0xd00037fffff00000 pgd @ 0xc0000000017c0000 pgdp @ 0xc0000000017c00d8 = 0x00000000f10b1000 pudp @ 0xc0000000f10b13f8 = 0x00000000f10d0000 pmdp @ 0xc0000000f10d1ff8 = 0x00000000f1102000 ptep @ 0xc0000000f1102780 = 0xc0000000f1ba018e Maps physical address = 0x00000000f1ba0000 Flags = Accessed Dirty Read Write This patch does not replicate the complex code of dump_pagetable and has no support for bolted linear mapping, thats why I've it's called dump virtual page table support. The format of the PTE can be expanded even further to add more useful information about the flags in the PTE if required. Signed-off-by: Balbir Singh <bsingharora@gmail.com> [mpe: Bike shed the output format, show the pgdir, fix build failures] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 07 Nov, 2017 3 commits
-
-
Michal Suchanek authored
In commit e6f81a92 ("powerpc/mm/hash: Support 68 bit VA") the masking is folded into ASM_VSID_SCRAMBLE but the comment about masking is removed only from the firt use of ASM_VSID_SCRAMBLE. Signed-off-by: Michal Suchanek <msuchanek@suse.de> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Alexey Kardashevskiy authored
DMA windows can only have a size of power of two on IODA2 hardware and using memory_hotplug_max() to determine the upper limit won't work correcly if it returns not power of two value. This removes the check as the platform code does this check in pnv_pci_ioda2_setup_default_config() anyway; the other client is VFIO and that thing checks against locked_vm limit which prevents the userspace from locking too much memory. It is expected to impact DPDK on machines with non-power-of-two RAM size, mostly. KVM guests are less likely to be affected as usually guests get less than half of hosts RAM. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Shriya authored
The call to /proc/cpuinfo in turn calls cpufreq_quick_get() which returns the last frequency requested by the kernel, but may not reflect the actual frequency the processor is running at. This patch makes a call to cpufreq_get() instead which returns the current frequency reported by the hardware. Fixes: fb5153d0 ("powerpc: powernv: Implement ppc_md.get_proc_freq()") Signed-off-by: Shriya <shriyak@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 06 Nov, 2017 2 commits
-
-
Nicholas Piggin authored
DD2.1 does not have to save MMCR0 for all state-loss idle states, only after deep idle states (like other PMU registers). Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
DD2.1 does not have to flush the ERAT after a state-loss idle. Performance testing was done on a DD2.1 using only the stop0 idle state (the shallowest state which supports state loss), using context_switch selftest configured to ping-poing between two threads on the same core and two different cores. Performance improvement for same core is 7.0%, different cores is 14.8%. Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-