- 12 Nov, 2017 2 commits
-
-
Stephen Rothwell authored
Commit 78adf6c2 ("powerpc/64s: Implement system reset idle wakeup reason"), added a call to ppc_save_regs() in the book3s code. ppc_save_regs() is only built if XMON and/or KEXEC_CORE are enabled, which is usually the case, however if they're not enabled then the build breaks. Fix it by making the Makefile check also build ppc_save_regs.o if CONFIG_PPC_BOOK3S is enabled. Fixes: 78adf6c2 ("powerpc/64s: Implement system reset idle wakeup reason") Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> [mpe: Write change log] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Balbir Singh authored
When using the radix MMU on Power9 DD1, to work around a hardware problem, radix__pte_update() is required to do a two stage update of the PTE. First we write a zero value into the PTE, then we flush the TLB, and then we write the new PTE value. In the normal case that works OK, but it does not work if we're updating the PTE that maps the code we're executing, because the mapping is removed by the TLB flush and we can no longer execute from it. Unfortunately the STRICT_RWX code needs to do exactly that. The exact symptoms when we hit this case vary, sometimes we print an oops and then get stuck after that, but I've also seen a machine just get stuck continually page faulting with no oops printed. The variance is presumably due to the exact layout of the text and the page size used for the mappings. In all cases we are unable to boot to a shell. There are possible solutions such as creating a second mapping of the TLB flush code, executing from that, and then jumping back to the original. However we don't want to add that level of complexity for a DD1 work around. So just detect that we're running on Power9 DD1 and refrain from changing the permissions, effectively disabling STRICT_RWX on Power9 DD1. Fixes: 7614ff32 ("powerpc/mm/radix: Implement STRICT_RWX/mark_rodata_ro() for Radix") Cc: stable@vger.kernel.org # v4.13+ Reported-by: Andrew Jeffery <andrew@aj.id.au> [Changelog as suggested by Michael Ellerman <mpe@ellerman.id.au>] Signed-off-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 11 Nov, 2017 18 commits
-
-
Haren Myneni authored
We are using percpu send window on P9 NX (powerNV) instead of opening / closing per each crypto session. Means txwin is removed from workmem. So we do not need to initialize workmem for each request. Signed-off-by: Haren Myneni <haren@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Haren Myneni authored
For P9 NX, the send window is opened for each crypto session and closed upon free. But VAS supports 64K windows per chip for all coprocessors including in user space support. So there is a possibility of not getting the window for kernel requests. This patch reserves windows for each coprocessor type (NX842) and are available forever for kernel requests, Opens each window for each CPU on the corresponding chip during driver initialization. So then use the percpu txwin for NX requests depends on the CPU on which the process is executing. Signed-off-by: Haren Myneni <haren@us.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Sukadev Bhattiprolu authored
Add support for user space receive window (for the Fast thread-wakeup coprocessor type) Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Sukadev Bhattiprolu authored
Define an interface to return a system-wide unique id for a given VAS window. The vas_win_id() will be used in a follow-on patch to generate an unique handle for a user space receive window. Applications can use this handle to pair send and receive windows for fast thread-wakeup. The hardware refers to this system-wide unique id as a Partition Send Window ID which is expected to be used during fault handling. Hence the "pswid" in the function names. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Sukadev Bhattiprolu authored
Define an interface that the NX drivers can use to find the physical paste address of a send window. This interface is expected to be used with the mmap() operation of the NX driver's device. i.e the user space process can use driver's mmap() operation to map the send window's paste address into their address space and then use copy and paste instructions to submit the CRBs to the NX engine. Note that kernel drivers will use vas_paste_crb() directly and don't need this interface. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Sukadev Bhattiprolu authored
A CP_ABORT instruction is required in processes that have mapped a VAS "paste address" with the intention of using COPY/PASTE instructions. But since CP_ABORT is expensive, we want to restrict it to only processes that use/intend to use COPY/PASTE. Define an interface, set_thread_uses_vas(), that VAS can use to indicate that the current process opened a send window. During context switch, issue CP_ABORT only for processes that have the flag set. Thanks for input from Nick Piggin, Michael Ellerman. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> [mpe: Fix to not use new_thread after _switch() returns] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Sukadev Bhattiprolu authored
We need the SPRN_TIDR to be set for use with fast thread-wakeup (core- to-core wakeup) and also with CAPI. Each thread in a process needs to have a unique id within the process. But for now, we assign globally unique thread ids to all threads in the system. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Philippe Bergheaud <felix@linux.vnet.ibm.com> Signed-off-by: Christophe Lombard <clombard@linux.vnet.ibm.com> [mpe: Simplify tidr clearing on fork() and ctx switch code] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Sukadev Bhattiprolu authored
Export the VAS Window context information to debugfs. We need to hold a mutex when closing the window to prevent a race with the debugfs read(). Rather than introduce a per-instance mutex, we use the global vas_mutex for now, since it is not heavily contended. The window->cop field is only relevant to a receive window so we were not setting it for a send window (which is is paired to a receive window anyway). But to simplify reporting in debugfs, set the 'cop' field for the send window also. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Sukadev Bhattiprolu authored
Define a helper, chip_to_vas_id() to map a given chip id to corresponding vas id. Normally, callers of vas_rx_win_open() and vas_tx_win_open() want the VAS window to be on the same chip where the calling thread is executing. These callers can pass in -1 for the VAS id. This interface will be useful if a thread running on one chip wants to open a window on another chip (like the NX-842 driver does during start up). Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Sukadev Bhattiprolu authored
Create a cpu to vasid mapping so callers can specify -1 instead of trying to find a VAS id. Changelog[v2] [Michael Ellerman] Use per-cpu variables to simplify code. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Sukadev Bhattiprolu authored
Normally, the NX driver waits for the CRBs to be processed before closing the window. But it is better to ensure that the credits are returned before the window gets reassigned later. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Sukadev Bhattiprolu authored
Save the configured max window credits for a window in the vas_window structure. We will need this when polling for return of window credits. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Sukadev Bhattiprolu authored
A VAS window is normally in "busy" state for only a short duration. Reduce the time we wait for the window to go to "not-busy" state to speed-up vas_win_close() a bit. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Sukadev Bhattiprolu authored
Use a helper to have the hardware unpin and mark a window closed. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Sukadev Bhattiprolu authored
Polling for window cast out is listed in the spec, but turns out that it is not strictly necessary and slows down window close. Making it a stub for now. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Sukadev Bhattiprolu authored
Clean up vas.h and the debug code around ifdef vas_debug. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Sukadev Bhattiprolu authored
NX-842, the only user of VAS, sets the window credits to default values but VAS should check the credits against the possible max values. The VAS_WCREDS_MIN is not needed and can be dropped. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Sukadev Bhattiprolu authored
Initialize a few missing window context fields from the window attributes specified by the caller. These fields are currently set to their default values by the caller (NX-842), but would be good to apply them anyway. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 10 Nov, 2017 8 commits
-
-
Nicholas Piggin authored
Take the DSCR value set by firmware as the dscr_default value, rather than zero. POWER9 recommends DSCR default to a non-zero value. Signed-off-by: From: Nicholas Piggin <npiggin@gmail.com> [mpe: Make record_spr_defaults() __init] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
OPAL boot does not insert secondaries at 0x60 to wait at the secondary hold spinloop. Instead they are started later, and inserted at generic_secondary_smp_init(), which is after the secondary hold spinloop. Avoid waiting on this spinloop when booting with OPAL firmware. This wait always times out that case. This saves 100ms boot time on powernv, and 10s of seconds of real time when booting on the simulator in SMP. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
Unmaps that free page tables always flush the entire PID, which is sub-optimal. Provide TLB range flushing with an additional PWC flush that can be use for va range invalidations with PWC flush. Time to munmap N pages of memory including last level page table teardown (after mmap, touch), local invalidate: N 1 2 4 8 16 32 64 vanilla 3.2us 3.3us 3.4us 3.6us 4.1us 5.2us 7.2us patched 1.4us 1.5us 1.7us 1.9us 2.6us 3.7us 6.2us Global invalidate: N 1 2 4 8 16 32 64 vanilla 2.2us 2.3us 2.4us 2.6us 3.2us 4.1us 6.2us patched 2.1us 2.5us 3.4us 5.2us 8.7us 15.7us 6.2us Local invalidates get much better across the board. Global ones have the same issue where multiple tlbies for va flush do get slower than the single tlbie to invalidate the PID. None of this test captures the TLB benefits of avoiding killing everything. Global gets worse, but it is brought in to line with global invalidate for munmap()s that do not free page tables. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
The single page flush ceiling is the cut-off point at which we switch from invalidating individual pages, to invalidating the entire process address space in response to a range flush. Introduce a local variant of this heuristic because local and global tlbie have significantly different properties: - Local tlbiel requires 128 instructions to invalidate a PID, global tlbie only 1 instruction. - Global tlbie instructions are expensive broadcast operations. The local ceiling has been made much higher, 2x the number of instructions required to invalidate the entire PID (i.e., 256 pages). Time to mprotect N pages of memory (after mmap, touch), local invalidate: N 32 34 64 128 256 512 vanilla 7.4us 9.0us 14.6us 26.4us 50.2us 98.3us patched 7.4us 7.8us 13.8us 26.4us 51.9us 98.3us The behaviour of both is identical at N=32 and N=512. Between there, the vanilla kernel does a PID invalidate and the patched kernel does a va range invalidate. At N=128, these require the same number of tlbiel instructions, so the patched version can be sen to be cheaper when < 128, and more expensive when > 128. However this does not well capture the cost of invalidated TLB. The additional cost at 256 pages does not seem prohibitive. It may be the case that increasing the limit further would continue to be beneficial to avoid invalidating all of the process's TLB entries. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
Currently for radix, flush_tlb_range flushes the entire PID, because the Linux mm code does not tell us about page size here for THP vs regular pages. This is quite sub-optimal for small mremap / mprotect / change_protection. So implement va range flushes with two flush passes, one for each page size (regular and THP). The second flush has an order of matnitude fewer tlbie instructions than the first, so it is a relatively small additional cost. There is still room for improvement here with some changes to generic APIs, particularly if there are mostly THP pages to be invalidated, the small page flushes could be reduced. Time to mprotect 1 page of memory (after mmap, touch): vanilla 2.9us 1.8us patched 1.2us 1.6us Time to mprotect 30 pages of memory (after mmap, touch): vanilla 8.2us 7.2us patched 6.9us 17.9us Time to mprotect 34 pages of memory (after mmap, touch): vanilla 9.1us 8.0us patched 9.0us 8.0us 34 pages is the point at which the invalidation switches from va to entire PID, which tlbie can do in a single instruction. This is why in the case of 30 pages, the new code runs slower for this test. This is a deliberate tradeoff already present in the unmap and THP promotion code, the idea is that the benefit from avoiding flushing entire TLB for this PID on all threads in the system. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
Move the barriers and range iteration down into the _tlbie* level, which improves readability. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
Short range flushes issue a sequences of tlbie(l) instructions for individual effective addresses. These do not all require individual barrier sequences, only one covering all tlbie(l) instructions. Commit f7327e0b ("powerpc/mm/radix: Remove unnecessary ptesync") made a similar optimization for tlbiel for PID flushing. For tlbie, the ISA says: The tlbsync instruction provides an ordering function for the effects of all tlbie instructions executed by the thread executing the tlbsync instruction, with respect to the memory barrier created by a subsequent ptesync instruction executed by the same thread. Time to munmap 30 pages of memory (after mmap, touch): local global vanilla 10.9us 22.3us patched 3.4us 14.4us Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Michael Ellerman authored
We have some dependencies & conflicts between patches in fixes and things to go in next, both in the radix TLB flush code and the IMC PMU driver. So merge fixes into next.
-
- 09 Nov, 2017 1 commit
-
-
Gustavo Romero authored
Add a self test to check if FP/VEC/VSX registers are sane (restored correctly) after a FP/VEC/VSX unavailable exception is caught during a transaction. This test checks all possibilities in a thread regarding the combination of MSR.[FP|VEC] states in a thread and for each scenario raises a FP/VEC/VSX unavailable exception in transactional state, verifying if vs0 and vs32 registers, which are representatives of FP/VEC/VSX reg sets, are not corrupted. Signed-off-by: Gustavo Romero <gromero@linux.vnet.ibm.com> Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Cyril Bur <cyrilbur@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 08 Nov, 2017 1 commit
-
-
Balbir Singh authored
It would be nice to be able to dump page tables in a particular context. eg: dumping vmalloc space: 0:mon> dv 0xd00037fffff00000 pgd @ 0xc0000000017c0000 pgdp @ 0xc0000000017c00d8 = 0x00000000f10b1000 pudp @ 0xc0000000f10b13f8 = 0x00000000f10d0000 pmdp @ 0xc0000000f10d1ff8 = 0x00000000f1102000 ptep @ 0xc0000000f1102780 = 0xc0000000f1ba018e Maps physical address = 0x00000000f1ba0000 Flags = Accessed Dirty Read Write This patch does not replicate the complex code of dump_pagetable and has no support for bolted linear mapping, thats why I've it's called dump virtual page table support. The format of the PTE can be expanded even further to add more useful information about the flags in the PTE if required. Signed-off-by: Balbir Singh <bsingharora@gmail.com> [mpe: Bike shed the output format, show the pgdir, fix build failures] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 07 Nov, 2017 3 commits
-
-
Michal Suchanek authored
In commit e6f81a92 ("powerpc/mm/hash: Support 68 bit VA") the masking is folded into ASM_VSID_SCRAMBLE but the comment about masking is removed only from the firt use of ASM_VSID_SCRAMBLE. Signed-off-by: Michal Suchanek <msuchanek@suse.de> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Alexey Kardashevskiy authored
DMA windows can only have a size of power of two on IODA2 hardware and using memory_hotplug_max() to determine the upper limit won't work correcly if it returns not power of two value. This removes the check as the platform code does this check in pnv_pci_ioda2_setup_default_config() anyway; the other client is VFIO and that thing checks against locked_vm limit which prevents the userspace from locking too much memory. It is expected to impact DPDK on machines with non-power-of-two RAM size, mostly. KVM guests are less likely to be affected as usually guests get less than half of hosts RAM. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Shriya authored
The call to /proc/cpuinfo in turn calls cpufreq_quick_get() which returns the last frequency requested by the kernel, but may not reflect the actual frequency the processor is running at. This patch makes a call to cpufreq_get() instead which returns the current frequency reported by the hardware. Fixes: fb5153d0 ("powerpc: powernv: Implement ppc_md.get_proc_freq()") Signed-off-by: Shriya <shriyak@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 06 Nov, 2017 7 commits
-
-
Nicholas Piggin authored
DD2.1 does not have to save MMCR0 for all state-loss idle states, only after deep idle states (like other PMU registers). Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
DD2.1 does not have to flush the ERAT after a state-loss idle. Performance testing was done on a DD2.1 using only the stop0 idle state (the shallowest state which supports state loss), using context_switch selftest configured to ping-poing between two threads on the same core and two different cores. Performance improvement for same core is 7.0%, different cores is 14.8%. Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
Cc: Michael Neuling <mikey@neuling.org> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Cyril Bur authored
After handling a transactional FP, Altivec or VSX unavailable exception. The return to userspace code will detect that the TIF_RESTORE_TM bit is set and call restore_tm_state(). restore_tm_state() will call restore_math() to ensure that the correct facilities are loaded. This means that all the loadup code in {fp,altivec,vsx}_unavailable_tm() is doing pointless work and can simply be removed. Signed-off-by: Cyril Bur <cyrilbur@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Cyril Bur authored
Lazy save and restore of FP/Altivec means that a userspace process can be sent to userspace with FP or Altivec disabled and loaded only as required (by way of an FP/Altivec unavailable exception). Transactional Memory complicates this situation as a transaction could be started without FP/Altivec being loaded up. This causes the hardware to checkpoint incorrect registers. Handling FP/Altivec unavailable exceptions while a thread is transactional requires a reclaim and recheckpoint to ensure the CPU has correct state for both sets of registers. tm_reclaim() has optimisations to not always save the FP/Altivec registers to the checkpointed save area. This was originally done because the caller might have information that the checkpointed registers aren't valid due to lazy save and restore. We've also been a little vague as to how tm_reclaim() leaves the FP/Altivec state since it doesn't necessarily always save it to the thread struct. This has lead to an (incorrect) assumption that it leaves the checkpointed state on the CPU. tm_recheckpoint() has similar optimisations in reverse. It may not always reload the checkpointed FP/Altivec registers from the thread struct before the trecheckpoint. It is therefore quite unclear where it expects to get the state from. This didn't help with the assumption made about tm_reclaim(). These optimisations sit in what is by definition a slow path. If a process has to go through a reclaim/recheckpoint then its transaction will be doomed on returning to userspace. This mean that the process will be unable to complete its transaction and be forced to its failure handler. This is already an out if line case for userspace. Furthermore, the cost of copying 64 times 128 bits from registers isn't very long[0] (at all) on modern processors. As such it appears these optimisations have only served to increase code complexity and are unlikely to have had a measurable performance impact. Our transactional memory handling has been riddled with bugs. A cause of this has been difficulty in following the code flow, code complexity has not been our friend here. It makes sense to remove these optimisations in favour of a (hopefully) more stable implementation. This patch does mean that some times the assembly will needlessly save 'junk' registers which will subsequently get overwritten with the correct value by the C code which calls the assembly function. This small inefficiency is far outweighed by the reduction in complexity for general TM code, context switching paths, and transactional facility unavailable exception handler. 0: I tried to measure it once for other work and found that it was hiding in the noise of everything else I was working with. I find it exceedingly likely this will be the case here. Signed-off-by: Cyril Bur <cyrilbur@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Cyril Bur authored
Lazy save and restore of FP/Altivec means that a userspace process can be sent to userspace with FP or Altivec disabled and loaded only as required (by way of an FP/Altivec unavailable exception). Transactional Memory complicates this situation as a transaction could be started without FP/Altivec being loaded up. This causes the hardware to checkpoint incorrect registers. Handling FP/Altivec unavailable exceptions while a thread is transactional requires a reclaim and recheckpoint to ensure the CPU has correct state for both sets of registers. tm_reclaim() has optimisations to not always save the FP/Altivec registers to the checkpointed save area. This was originally done because the caller might have information that the checkpointed registers aren't valid due to lazy save and restore. We've also been a little vague as to how tm_reclaim() leaves the FP/Altivec state since it doesn't necessarily always save it to the thread struct. This has lead to an (incorrect) assumption that it leaves the checkpointed state on the CPU. tm_recheckpoint() has similar optimisations in reverse. It may not always reload the checkpointed FP/Altivec registers from the thread struct before the trecheckpoint. It is therefore quite unclear where it expects to get the state from. This didn't help with the assumption made about tm_reclaim(). This patch is a minimal fix for ease of backporting. A more correct fix which removes the msr parameter to tm_reclaim() and tm_recheckpoint() altogether has been upstreamed to apply on top of this patch. Fixes: dc310669 ("powerpc: tm: Always use fp_state and vr_state to store live registers") Signed-off-by: Cyril Bur <cyrilbur@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Cyril Bur authored
Lazy save and restore of FP/Altivec means that a userspace process can be sent to userspace with FP or Altivec disabled and loaded only as required (by way of an FP/Altivec unavailable exception). Transactional Memory complicates this situation as a transaction could be started without FP/Altivec being loaded up. This causes the hardware to checkpoint incorrect registers. Handling FP/Altivec unavailable exceptions while a thread is transactional requires a reclaim and recheckpoint to ensure the CPU has correct state for both sets of registers. Lazy save and restore of FP/Altivec cannot be done if a process is transactional. If a facility was enabled it must remain enabled whenever a thread is transactional. Commit dc16b553 ("powerpc: Always restore FPU/VEC/VSX if hardware transactional memory in use") ensures that the facilities are always enabled if a thread is transactional. A bug in the introduced code may cause it to inadvertently enable a facility that was (and should remain) disabled. The problem with this extraneous enablement is that the registers for the erroneously enabled facility have not been correctly recheckpointed - the recheckpointing code assumed the facility would remain disabled. Further compounding the issue, the transactional {fp,altivec,vsx} unavailable code has been incorrectly using the MSR to enable facilities. The presence of the {FP,VEC,VSX} bit in the regs->msr simply means if the registers are live on the CPU, not if the kernel should load them before returning to userspace. This has worked due to the bug mentioned above. This causes transactional threads which return to their failure handler to observe incorrect checkpointed registers. Perhaps an example will help illustrate the problem: A userspace process is running and uses both FP and Altivec registers. This process then continues to run for some time without touching either sets of registers. The kernel subsequently disables the facilities as part of lazy save and restore. The userspace process then performs a tbegin and the CPU checkpoints 'junk' FP and Altivec registers. The process then performs a floating point instruction triggering a fp unavailable exception in the kernel. The kernel then loads the FP registers - and only the FP registers. Since the thread is transactional it must perform a reclaim and recheckpoint to ensure both the checkpointed registers and the transactional registers are correct. It then (correctly) enables MSR[FP] for the process. Later (on exception exist) the kernel also (inadvertently) enables MSR[VEC]. The process is then returned to userspace. Since the act of loading the FP registers doomed the transaction we know CPU will fail the transaction, restore its checkpointed registers, and return the process to its failure handler. The problem is that we're now running with Altivec enabled and the 'junk' checkpointed registers are restored. The kernel had only recheckpointed FP. This patch solves this by only activating FP/Altivec if userspace was using them when it entered the kernel and not simply if the process is transactional. Fixes: dc16b553 ("powerpc: Always restore FPU/VEC/VSX if hardware transactional memory in use") Signed-off-by: Cyril Bur <cyrilbur@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-