- 06 May, 2020 1 commit
-
-
Michael Ellerman authored
This merges the lockless page table walk rework series from Aneesh. Because it touches powerpc KVM code we are sharing it with the kvm-ppc tree in our topic/ppc-kvm branch. This is the cover letter from Aneesh: Avoid IPI while updating page table entries. Problem Summary: Slow termination of KVM guest with large guest RAM config due to a large number of IPIs that were caused by clearing level 1 PTE entries (THP) entries. This is shown in the stack trace below. - qemu-system-ppc [kernel.vmlinux] [k] smp_call_function_many - smp_call_function_many - 36.09% smp_call_function_many serialize_against_pte_lookup radix__pmdp_huge_get_and_clear zap_huge_pmd unmap_page_range unmap_vmas unmap_region __do_munmap __vm_munmap sys_munmap system_call __munmap qemu_ram_munmap qemu_anon_ram_free reclaim_ramblock call_rcu_thread qemu_thread_start start_thread __clone Why we need to do IPI when clearing PMD entries: This was added as part of commit: 13bd817b ("powerpc/thp: Serialize pmd clear against a linux page table walk") serialize_against_pte_lookup makes sure that all parallel lockless page table walk completes before we convert a PMD pte entry to regular pmd entry. We end up doing that conversion in the below scenarios 1) __split_huge_zero_page_pmd 2) do_huge_pmd_wp_page_fallback 3) MADV_DONTNEED running parallel to page faults. local_irq_disable and lockless page table walk: The lockless page table walk work with the assumption that we can dereference the page table contents without holding a lock. For this to work, we need to make sure we read the page table contents atomically and page table pages are not going to be freed/released while we are walking the table pages. We can achieve by using a rcu based freeing for page table pages or if the architecture implements broadcast tlbie, we can block the IPI as we walk the page table pages. To support both the above framework, lockless page table walk is done with irq disabled instead of rcu_read_lock() We do have two interface for lockless page table walk, gup fast and __find_linux_pte. This patch series makes __find_linux_pte table walk safe against the conversion of PMD PTE to regular PMD. gup fast: gup fast is already safe against THP split because kernel now differentiate between a pmd split and a compound page split. gup fast can run parallel to a pmd split and we prevent a parallel gup fast to a hugepage split, by freezing the page refcount and failing the speculative page ref increment. Similar to how gup is safe against parallel pmd split, this patch series updates the __find_linux_pte callers to be safe against a parallel pmd split. We do that by enforcing the following rules. 1) Don't reload the pte value, because that can be updated in parallel. 2) Code should be able to work with a stale PTE value and not the recent one. ie, the pte value that we are looking at may not be the latest value in the page table. 3) Before looking at pte value check for _PAGE_PTE bit. We now do this as part of pte_present() check. Performance: This speeds up Qemu guest RAM del/unplug time as below 128 core, 496GB guest: Without patch: munmap start: timer = 13162 ms, PID=7684 munmap finish: timer = 95312 ms, PID=7684 - delta = 82150 ms With patch (upto removing IPI) munmap start: timer = 196449 ms, PID=6681 munmap finish: timer = 196488 ms, PID=6681 - delta = 39ms With patch (with adding the tlb invalidate in pmdp_huge_get_and_clear_full) munmap start: timer = 196345 ms, PID=6879 munmap finish: timer = 196714 ms, PID=6879 - delta = 369ms Link: https://lore.kernel.org/r/20200505071729.54912-1-aneesh.kumar@linux.ibm.com
-
- 05 May, 2020 23 commits
-
-
Aneesh Kumar K.V authored
MADV_DONTNEED holds mmap_sem in read mode and that implies a parallel page fault is possible and the kernel can end up with a level 1 PTE entry (THP entry) converted to a level 0 PTE entry without flushing the THP TLB entry. Most architectures including POWER have issues with kernel instantiating a level 0 PTE entry while holding level 1 TLB entries. The code sequence I am looking at is down_read(mmap_sem) down_read(mmap_sem) zap_pmd_range() zap_huge_pmd() pmd lock held pmd_cleared table details added to mmu_gather pmd_unlock() insert a level 0 PTE entry() tlb_finish_mmu(). Fix this by forcing a tlb flush before releasing pmd lock if this is not a fullmm invalidate. We can safely skip this invalidate for task exit case (fullmm invalidate) because in that case we are sure there can be no parallel fault handlers. This do change the Qemu guest RAM del/unplug time as below 128 core, 496GB guest: Without patch: munmap start: timer = 196449 ms, PID=6681 munmap finish: timer = 196488 ms, PID=6681 - delta = 39ms With patch: munmap start: timer = 196345 ms, PID=6879 munmap finish: timer = 196714 ms, PID=6879 - delta = 369ms Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200505071729.54912-23-aneesh.kumar@linux.ibm.com
-
Aneesh Kumar K.V authored
Now that all the lockless page table walk is careful w.r.t the PTE address returned, we can now revert commit: 13bd817b ("powerpc/thp: Serialize pmd clear against a linux page table walk.") We also drop the equivalent IPI from other pte updates routines. We still keep IPI in hash pmdp collapse and that is to take care of parallel hash page table insert. The radix pmdp collapse flush can possibly be removed once I am sure generic code doesn't have the any expectations around parallel gup walk. This speeds up Qemu guest RAM del/unplug time as below 128 core, 496GB guest: Without patch: munmap start: timer = 13162 ms, PID=7684 munmap finish: timer = 95312 ms, PID=7684 - delta = 82150 ms With patch: munmap start: timer = 196449 ms, PID=6681 munmap finish: timer = 196488 ms, PID=6681 - delta = 39ms Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200505071729.54912-21-aneesh.kumar@linux.ibm.com
-
Aneesh Kumar K.V authored
This adds _PAGE_PTE check and makes sure we validate the pte value returned via find_kvm_host_pte. NOTE: this also considers _PAGE_INVALID to the software valid bit. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200505071729.54912-20-aneesh.kumar@linux.ibm.com
-
Aneesh Kumar K.V authored
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200505071729.54912-19-aneesh.kumar@linux.ibm.com
-
Aneesh Kumar K.V authored
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200505071729.54912-18-aneesh.kumar@linux.ibm.com
-
Aneesh Kumar K.V authored
We now depend on kvm->mmu_lock Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200505071729.54912-17-aneesh.kumar@linux.ibm.com
-
Aneesh Kumar K.V authored
Current code just hold rmap lock to ensure parallel page table update is prevented. That is not sufficient. The kernel should also check whether a mmu_notifer callback was running in parallel. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200505071729.54912-16-aneesh.kumar@linux.ibm.com
-
Aneesh Kumar K.V authored
Since kvmppc_do_h_enter can get called in realmode use low level arch_spin_lock which is safe to be called in realmode. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200505071729.54912-15-aneesh.kumar@linux.ibm.com
-
Aneesh Kumar K.V authored
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200505071729.54912-14-aneesh.kumar@linux.ibm.com
-
Aneesh Kumar K.V authored
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200505071729.54912-13-aneesh.kumar@linux.ibm.com
-
Aneesh Kumar K.V authored
update kvmppc_hv_handle_set_rc to use find_kvm_nested_guest_pte and find_kvm_secondary_pte Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200505071729.54912-12-aneesh.kumar@linux.ibm.com
-
Aneesh Kumar K.V authored
The locking rules for walking nested shadow linux page table is different from process scoped table. Hence add a helper for nested page table walk and also add check whether we are holding the right locks. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200505071729.54912-11-aneesh.kumar@linux.ibm.com
-
Aneesh Kumar K.V authored
The locking rules for walking partition scoped table is different from process scoped table. Hence add a helper for secondary linux page table walk and also add check whether we are holding the right locks. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200505071729.54912-10-aneesh.kumar@linux.ibm.com
-
Aneesh Kumar K.V authored
These functions can get called in realmode. Hence use low level arch_spin_lock which is safe to be called in realmode. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200505071729.54912-9-aneesh.kumar@linux.ibm.com
-
Aneesh Kumar K.V authored
read_user_stack_slow is called with interrupts soft disabled and it copies contents from the page which we find mapped to a specific address. To convert userspace address to pfn, the kernel now uses lockless page table walk. The kernel needs to make sure the pfn value read remains stable and is not released and reused for another process while the contents are read from the page. This can only be achieved by holding a page reference. One of the first approaches I tried was to check the pte value after the kernel copies the contents from the page. But as shown below we can still get it wrong CPU0 CPU1 pte = READ_ONCE(*ptep); pte_clear(pte); put_page(page); page = alloc_page(); memcpy(page_address(page), "secret password", nr); memcpy(buf, kaddr + offset, nb); put_page(page); handle_mm_fault() page = alloc_page(); set_pte(pte, page); if (pte_val(pte) != pte_val(*ptep)) Hence switch to __get_user_pages_fast. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200505071729.54912-8-aneesh.kumar@linux.ibm.com
-
Aneesh Kumar K.V authored
A lockless page table walk should be safe against parallel THP collapse, THP split and madvise(MADV_DONTNEED)/parallel fault. This patch makes sure kernel won't reload the pteval when checking for different conditions. The patch also added a check for pte_present to make sure the kernel is indeed operating on a PTE and not a pointer to level 0 table page. The pfn value we find here can be different from the actual pfn on which machine check happened. This can happen if we raced with a parallel update of the page table. In such a scenario we end up isolating a wrong pfn. But that doesn't have any other side effect. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200505071729.54912-7-aneesh.kumar@linux.ibm.com
-
Aneesh Kumar K.V authored
Don't fetch the pte value using lockless page table walk. Instead use the value from the caller. hash_preload is called with ptl lock held. So it is safe to use the pte_t address directly. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200505071729.54912-6-aneesh.kumar@linux.ibm.com
-
Aneesh Kumar K.V authored
This is only used with init_mm currently. Walking init_mm is much simpler because we don't need to handle concurrent page table like other mm_context Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200505071729.54912-5-aneesh.kumar@linux.ibm.com
-
Aneesh Kumar K.V authored
This makes the pte_present check stricter by checking for additional _PAGE_PTE bit. A level 1 pte pointer (THP pte) can be switched to a pointer to level 0 pte page table page by following two operations. 1) THP split. 2) madvise(MADV_DONTNEED) in parallel to page fault. A lockless page table walk need to make sure we can handle such changes gracefully. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200505071729.54912-4-aneesh.kumar@linux.ibm.com
-
Aneesh Kumar K.V authored
If multiple threads in userspace keep changing the protection keys mapping a range, there can be a scenario where kernel takes a key fault but the pkey value found in the siginfo struct is a permissive one. This can confuse the userspace as shown in the below test case. /* use this to control the number of test iterations */ static void pkeyreg_set(int pkey, unsigned long rights) { unsigned long reg, shift; shift = (NR_PKEYS - pkey - 1) * PKEY_BITS_PER_PKEY; asm volatile("mfspr %0, 0xd" : "=r"(reg)); reg &= ~(((unsigned long) PKEY_BITS_MASK) << shift); reg |= (rights & PKEY_BITS_MASK) << shift; asm volatile("mtspr 0xd, %0" : : "r"(reg)); } static unsigned long pkeyreg_get(void) { unsigned long reg; asm volatile("mfspr %0, 0xd" : "=r"(reg)); return reg; } static int sys_pkey_mprotect(void *addr, size_t len, int prot, int pkey) { return syscall(SYS_pkey_mprotect, addr, len, prot, pkey); } static int sys_pkey_alloc(unsigned long flags, unsigned long access_rights) { return syscall(SYS_pkey_alloc, flags, access_rights); } static int sys_pkey_free(int pkey) { return syscall(SYS_pkey_free, pkey); } static int faulting_pkey; static int permissive_pkey; static pthread_barrier_t pkey_set_barrier; static pthread_barrier_t mprotect_barrier; static void pkey_handle_fault(int signum, siginfo_t *sinfo, void *ctx) { unsigned long pkeyreg; /* FIXME: printf is not signal-safe but for the current purpose, it gets the job done. */ printf("pkey: exp = %d, got = %d\n", faulting_pkey, sinfo->si_pkey); fflush(stdout); assert(sinfo->si_code == SEGV_PKUERR); assert(sinfo->si_pkey == faulting_pkey); /* clear pkey permissions to let the faulting instruction continue */ pkeyreg_set(faulting_pkey, 0x0); } static void *do_mprotect_fault(void *p) { unsigned long rights, pkeyreg, pgsize; unsigned int i; void *region; int pkey; srand(time(NULL)); pgsize = sysconf(_SC_PAGESIZE); rights = PKEY_DISABLE_WRITE; region = p; /* allocate key, no permissions */ assert((pkey = sys_pkey_alloc(0, PKEY_DISABLE_ACCESS)) > 0); pkeyreg_set(4, 0x0); /* cache the pkey here as the faulting pkey for future reference in the signal handler */ faulting_pkey = pkey; printf("%s: faulting pkey = %d\n", __func__, faulting_pkey); /* try to allocate, mprotect and free pkeys repeatedly */ for (i = 0; i < NUM_ITERATIONS; i++) { /* sync up with the other thread here */ pthread_barrier_wait(&pkey_set_barrier); /* make sure that the pkey used by the non-faulting thread is made permissive for this thread's context too so that no faults are triggered because it still might have been set to a restrictive value */ // pkeyreg_set(permissive_pkey, 0x0); /* sync up with the other thread here */ pthread_barrier_wait(&mprotect_barrier); /* perform mprotect */ assert(!sys_pkey_mprotect(region, pgsize, PROT_READ | PROT_WRITE, pkey)); /* choose a random byte from the protected region and attempt to write to it, this will generate a fault */ *((char *) region + (rand() % pgsize)) = rand(); /* restore pkey permissions as the signal handler may have cleared the bit out for the sake of continuing */ pkeyreg_set(pkey, PKEY_DISABLE_WRITE); } /* free pkey */ sys_pkey_free(pkey); return NULL; } static void *do_mprotect_nofault(void *p) { unsigned long pgsize; unsigned int i, j; void *region; int pkey; pgsize = sysconf(_SC_PAGESIZE); region = p; /* try to allocate, mprotect and free pkeys repeatedly */ for (i = 0; i < NUM_ITERATIONS; i++) { /* allocate pkey, all permissions */ assert((pkey = sys_pkey_alloc(0, 0)) > 0); permissive_pkey = pkey; /* sync up with the other thread here */ pthread_barrier_wait(&pkey_set_barrier); pthread_barrier_wait(&mprotect_barrier); /* perform mprotect on the common page, no faults will be triggered as this is most permissive */ assert(!sys_pkey_mprotect(region, pgsize, PROT_READ | PROT_WRITE, pkey)); /* free pkey */ assert(!sys_pkey_free(pkey)); } return NULL; } int main(int argc, char **argv) { pthread_t fault_thread, nofault_thread; unsigned long pgsize; struct sigaction act; pthread_attr_t attr; cpu_set_t fault_cpuset, nofault_cpuset; unsigned int i; void *region; /* allocate memory region to protect */ pgsize = sysconf(_SC_PAGESIZE); assert(region = memalign(pgsize, pgsize)); CPU_ZERO(&fault_cpuset); CPU_SET(0, &fault_cpuset); CPU_ZERO(&nofault_cpuset); CPU_SET(8, &nofault_cpuset); assert(!pthread_attr_init(&attr)); /* setup sigsegv signal handler */ act.sa_handler = 0; act.sa_sigaction = pkey_handle_fault; assert(!sigprocmask(SIG_SETMASK, 0, &act.sa_mask)); act.sa_flags = SA_SIGINFO; act.sa_restorer = 0; assert(!sigaction(SIGSEGV, &act, NULL)); /* setup barrier for the two threads */ pthread_barrier_init(&pkey_set_barrier, NULL, 2); pthread_barrier_init(&mprotect_barrier, NULL, 2); /* setup and start threads */ assert(!pthread_create(&fault_thread, &attr, &do_mprotect_fault, region)); assert(!pthread_setaffinity_np(fault_thread, sizeof(cpu_set_t), &fault_cpuset)); assert(!pthread_create(&nofault_thread, &attr, &do_mprotect_nofault, region)); assert(!pthread_setaffinity_np(nofault_thread, sizeof(cpu_set_t), &nofault_cpuset)); /* cleanup */ assert(!pthread_attr_destroy(&attr)); assert(!pthread_join(fault_thread, NULL)); assert(!pthread_join(nofault_thread, NULL)); assert(!pthread_barrier_destroy(&pkey_set_barrier)); assert(!pthread_barrier_destroy(&mprotect_barrier)); free(region); puts("PASS"); return EXIT_SUCCESS; } The above test can result the below failure without this patch. pkey: exp = 3, got = 3 pkey: exp = 3, got = 4 a.out: pkey-siginfo-race.c:100: pkey_handle_fault: Assertion `sinfo->si_pkey == faulting_pkey' failed. Aborted Check for vma access before considering this a key fault. If vma pkey allow access retry the acess again. Test case is written by Sandipan Das <sandipan@linux.ibm.com> hence added SOB from him. Signed-off-by: Sandipan Das <sandipan@linux.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200505071729.54912-3-aneesh.kumar@linux.ibm.com
-
Aneesh Kumar K.V authored
Fetch pkey from vma instead of linux page table. Also document the fact that in some cases the pkey returned in siginfo won't be the same as the one we took keyfault on. Even with linux page table walk, we can end up in a similar scenario. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200505071729.54912-2-aneesh.kumar@linux.ibm.com
-
Aneesh Kumar K.V authored
We will use this in later patch to do tlb flush when clearing pmd entries. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200505071729.54912-22-aneesh.kumar@linux.ibm.com
-
Michael Ellerman authored
This brings in a fix from the kvm-ppc tree that was merged to mainline after rc2, and so isn't in the base of our topic branch. We'd like it in the topic branch because it interacts with patches we plan to carry in this branch.
-
- 04 May, 2020 3 commits
-
-
Hari Bathini authored
Commit 0962e800 ("powerpc/prom: Scan reserved-ranges node for memory reservations") enabled support to parse reserved-ranges DT node and reserve kernel memory falling in these ranges for F/W purposes. Memory reserved for FADump should not overlap with these ranges as it could corrupt memory meant for F/W or crash'ed kernel memory to be exported as vmcore. But since commit 579ca1a2 ("powerpc/fadump: make use of memblock's bottom up allocation mode"), memblock_find_in_range() is being used to find the appropriate area to reserve memory for FADump, which can't account for reserved-ranges as these ranges are reserved only after FADump memory reservation. With reserved-ranges now being populated during early boot, look out for these memory ranges while reserving memory for FADump. Without this change, MPIPL on PowerNV systems aborts with hostboot failure, when memory reserved for FADump is less than 4096MB. Fixes: 579ca1a2 ("powerpc/fadump: make use of memblock's bottom up allocation mode") Cc: stable@vger.kernel.org Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Reviewed-by: Mahesh Salgaonkar <mahesh@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/158737297693.26700.16193820746269425424.stgit@hbathini.in.ibm.com
-
Hari Bathini authored
At times, memory ranges have to be looked up during early boot, when kernel couldn't be initialized for dynamic memory allocation. In fact, reserved-ranges look up is needed during FADump memory reservation. Without accounting for reserved-ranges in reserving memory for FADump, MPIPL boot fails with memory corruption issues. So, extend memory ranges handling to support static allocation and populate reserved memory ranges during early boot. Fixes: dda9dbfe ("powerpc/fadump: consider reserved ranges while releasing memory") Cc: stable@vger.kernel.org Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Reviewed-by: Mahesh Salgaonkar <mahesh@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/158737294432.26700.4830263187856221314.stgit@hbathini.in.ibm.com
-
Xiongfeng Wang authored
Move the static keyword to the front of declaration of 'vuart_bus_priv', and resolve the following compiler warning that can be seen when building with warnings enabled (W=1): drivers/ps3/ps3-vuart.c:867:1: warning: ‘static’ is not at beginning of declaration [-Wold-style-declaration] } static vuart_bus_priv; ^ Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/1588154448-56759-1-git-send-email-wangxiongfeng2@huawei.com
-
- 30 Apr, 2020 6 commits
-
-
Naveen N. Rao authored
Currently, it is possible to have CONFIG_FUNCTION_TRACER disabled, but CONFIG_MPROFILE_KERNEL enabled. Though all existing users of MPROFILE_KERNEL are doing the right thing, it is weird to have MPROFILE_KERNEL enabled when the function tracer isn't. Fix this by making MPROFILE_KERNEL depend on FUNCTION_TRACER. Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200422092612.514301-1-naveen.n.rao@linux.vnet.ibm.com
-
Gautham R. Shenoy authored
Add documentation for the following sysfs interfaces: /sys/devices/system/cpu/cpuX/purr /sys/devices/system/cpu/cpuX/spurr /sys/devices/system/cpu/cpuX/idle_purr /sys/devices/system/cpu/cpuX/idle_spurr Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/1586249263-14048-6-git-send-email-ego@linux.vnet.ibm.com
-
Gautham R. Shenoy authored
On Pseries LPARs, to calculate utilization, we need to know the [S]PURR ticks when the CPUs were busy or idle. The total PURR and SPURR ticks are already exposed via the per-cpu sysfs files "purr" and "spurr". This patch adds support for exposing the idle PURR and SPURR ticks via new per-cpu sysfs files named "idle_purr" and "idle_spurr". This patch also adds helper functions to accurately read the values of idle_purr and idle_spurr especially from an interrupt context between when the interrupt has occurred between the pseries_idle_prolog() and pseries_idle_epilog(). This will ensure that the idle purr/spurr values corresponding to the latest idle period is accounted for before these values are read. Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/1586249263-14048-5-git-send-email-ego@linux.vnet.ibm.com
-
Gautham R. Shenoy authored
On Pseries LPARs, to calculate utilization, we need to know the [S]PURR ticks when the CPUs were busy or idle. Via pseries_idle_prolog(), pseries_idle_epilog(), we track the idle PURR ticks in the VPA variable "wait_state_cycles". This patch extends the support to account for the idle SPURR ticks. Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/1586249263-14048-4-git-send-email-ego@linux.vnet.ibm.com
-
Gautham R. Shenoy authored
Currently when CPU goes idle, we take a snapshot of PURR via pseries_idle_prolog() which is used at the CPU idle exit to compute the idle PURR cycles via the function pseries_idle_epilog(). Thus, the value of idle PURR cycle thus read before pseries_idle_prolog() and after pseries_idle_epilog() is always correct. However, if we were to read the idle PURR cycles from an interrupt context between pseries_idle_prolog() and pseries_idle_epilog() (this will be done in a future patch), then, the value of the idle PURR thus read will not include the cycles spent in the most recent idle period. Thus, in that interrupt context, we will need access to the snapshot of the PURR before going idle, in order to compute the idle PURR cycles for the latest idle duration. In this patch, we save the snapshot of PURR in pseries_idle_prolog() in a per-cpu variable, instead of on the stack, so that it can be accessed from an interrupt context. Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/1586249263-14048-3-git-send-email-ego@linux.vnet.ibm.com
-
Gautham R. Shenoy authored
Currently prior to entering an idle state on a Linux Guest, the pseries cpuidle driver implement an idle_loop_prolog() and idle_loop_epilog() functions which ensure that idle_purr is correctly computed, and the hypervisor is informed that the CPU cycles have been donated. These prolog and epilog functions are also required in the default idle call, i.e pseries_lpar_idle(). Hence move these accessor functions to a common header file and call them from pseries_lpar_idle(). Since the existing header files such as asm/processor.h have enough clutter, create a new header file asm/idle.h. Finally rename idle_loop_prolog() and idle_loop_epilog() to pseries_idle_prolog() and pseries_idle_epilog() as they are only relavent for on pseries guests. Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/1586249263-14048-2-git-send-email-ego@linux.vnet.ibm.com
-
- 22 Apr, 2020 1 commit
-
-
Stephen Rothwell authored
allyesconfig fails with: ./usr/include/asm/vas-api.h:15:2: error: unknown type name '__u32' 15 | __u32 version; | ^~~~~ ./usr/include/asm/vas-api.h:16:2: error: unknown type name '__s16' 16 | __s16 vas_id; /* specific instance of vas or -1 for default */ | ^~~~~ ./usr/include/asm/vas-api.h:17:2: error: unknown type name '__u16' 17 | __u16 reserved1; | ^~~~~ ./usr/include/asm/vas-api.h:18:2: error: unknown type name '__u64' 18 | __u64 flags; /* Future use */ | ^~~~~ ./usr/include/asm/vas-api.h:19:2: error: unknown type name '__u64' 19 | __u64 reserved2[6]; | ^~~~~ uapi headers should be self contained, so add an include of linux/types.h. Fixes: 45f25a79 ("powerpc/vas: Define VAS_TX_WIN_OPEN ioctl API") Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Haren Myneni <haren@linux.ibm.com> [mpe: Flesh out change log from linux-next error report] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200422154129.11f988fd@canb.auug.org.au
-
- 21 Apr, 2020 6 commits
-
-
Raphael Moreira Zinsly authored
Include a README file with the instructions to use the testcases at selftests/powerpc/nx-gzip. Signed-off-by: Bulent Abali <abali@us.ibm.com> Signed-off-by: Raphael Moreira Zinsly <rzinsly@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200420205538.25181-6-rzinsly@linux.ibm.com
-
Raphael Moreira Zinsly authored
Include a decompression testcase for the powerpc NX-GZIP engine. Signed-off-by: Bulent Abali <abali@us.ibm.com> Signed-off-by: Raphael Moreira Zinsly <rzinsly@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200420205538.25181-5-rzinsly@linux.ibm.com
-
Raphael Moreira Zinsly authored
Add a compression testcase for the powerpc NX-GZIP engine. Signed-off-by: Bulent Abali <abali@us.ibm.com> Signed-off-by: Raphael Moreira Zinsly <rzinsly@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200420205538.25181-4-rzinsly@linux.ibm.com
-
Raphael Moreira Zinsly authored
Add files to be able to compress and decompress files using the powerpc NX-GZIP engine. Signed-off-by: Bulent Abali <abali@us.ibm.com> Signed-off-by: Raphael Moreira Zinsly <rzinsly@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200420205538.25181-3-rzinsly@linux.ibm.com
-
Raphael Moreira Zinsly authored
Add files to access the powerpc NX-GZIP engine in user space. Signed-off-by: Bulent Abali <abali@us.ibm.com> Signed-off-by: Raphael Moreira Zinsly <rzinsly@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200420205538.25181-2-rzinsly@linux.ibm.com
-
Michael Ellerman authored
As described by Haren: Power9 processor supports Virtual Accelerator Switchboard (VAS) which allows kernel and userspace to send compression requests to Nest Accelerator (NX) directly. The NX unit comprises of 2 842 compression engines and 1 GZIP engine. Linux kernel already has 842 compression support on kernel. This patch series adds GZIP compression support from user space. The GZIP Compression engine implements the ZLIB and GZIP compression algorithms. No plans of adding NX-GZIP compression support in kernel right now. Applications can send requests to NX directly with COPY/PASTE instructions. But kernel has to establish channel / window on NX-GZIP device for the userspace. So userspace access to the GZIP engine is provided through /dev/crypto/nx-gzip device with several operations. An application must open the this device to obtain a file descriptor (fd). Using the fd, application should issue the VAS_TX_WIN_OPEN ioctl to establish a connection to the engine. Once window is opened, should use mmap() system call to map the hardware address of engine's request queue into the application's virtual address space. Then user space forms the request as co-processor Request Block (CRB) and paste this CRB on the mapped HW address using COPY/PASTE instructions. Application can poll on status flags (part of CRB) with timeout for request completion. For VAS_TX_WIN_OPEN ioctl, if user space passes vas_id = -1 (struct vas_tx_win_open_attr), kernel determines the VAS instance on the corresponding chip based on the CPU on which the process is executing. Otherwise, the specified VAS instance is used if application passes the proper VAS instance (vas_id listed in /proc/device-tree/vas@*/ibm,vas_id). Process can open multiple windows with different FDs or can send several requests to NX on the same window at the same time.
-