1. 15 Jun, 2017 8 commits
    • powerpc/64s: Avoid cpabort in context switch when possible · 07d2a628
      Nicholas Piggin authored
      The ISA v3.0B copy-paste facility only requires cpabort when switching
      to a process that has foreign real addresses mapped (direct access to
      accelerators), to clear a potential copy buffer filled by a previous
      thread. There is no accelerator driver implemented yet, so cpabort can
      be removed. It can be re-added when a driver is implemented.
      
      POWER9 DD1 requires the copy buffer to always be cleared on context
      switch, but if accelerators are not in use, then an unpaired copy from
      a dummy region is sufficient to clear data out of the copy buffer.
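
      A sketch of the DD1 workaround as it would sit in __switch_to()
      (the buffer name and the PPC_COPY() opcode helper are assumptions
      for illustration, not quoted from the patch):

        static char dummy_copy_buffer[128] __aligned(128);

        /* One unpaired "copy" from a dummy cacheline overwrites whatever
         * the previous thread may have left in the copy buffer. */
        if (cpu_has_feature(CPU_FTR_POWER9_DD1))
        	asm volatile(PPC_COPY(%0, %1)
        		     : : "r"(dummy_copy_buffer), "r"(0));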
      
      This increases context switch performance by about 5% on POWER9.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      07d2a628
    • powerpc/64: Drop explicit hwsync in context switch · 9145effd
      Nicholas Piggin authored
      The context switch code issues a sync (a.k.a. hwsync, heavyweight sync)
      to prevent MMIO accesses being reordered, from the point of view of a
      single process, if it gets migrated to a different CPU. This sync is
      not required, because an hwsync is already performed earlier in the
      context switch path.
      
      Comment this so it's clear enough if anything changes on the scheduler
      or the powerpc sides. Remove the hwsync from _switch.
      
      This improves context switch performance by 2-3% on POWER8.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      9145effd
    • powerpc/64: Drop reservation-clearing ldarx in context switch · 837e72f7
      Nicholas Piggin authored
      There is no need to explicitly break the reservation in _switch,
      because we are guaranteed that the context switch path will include a
      larx/stcx.
      
      Comment the guarantee and remove the reservation clear from _switch.
      
      This is worth 1-2% in context switch performance.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      837e72f7
    • powerpc/64s: Leave interrupts hard enabled in context switch for radix · e4c0fc5f
      Nicholas Piggin authored
      Commit 4387e9ff25 ("[POWERPC] Fix PMU + soft interrupt disable bug")
      hard disabled interrupts over the low level context switch, because
      the SLB management can't cope with a PMU interrupt accessing the stack
      in that window.
      
      Radix based kernel mapping does not use the SLB so it does not require
      interrupts hard disabled here.
      
      This is worth 1-2% in context switch performance on POWER9.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      e4c0fc5f
    • powerpc/64: Avoid restore_math call if possible in syscall exit · bc4f65e4
      Nicholas Piggin authored
      The syscall exit code that branches to restore_math is quite heavy on
      Book3S, consisting of 2 mtmsr instructions. Threads that don't use both
      FP and vector can get caught here if the kernel ever uses FP or vector.
      Lazy-FP/vec context switching also trips this case.
      
      So check for lazy FP and vector before switching RI for restore_math.
      Move most of this case out of line.
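
      In C terms, the new early-out amounts to the following sketch (the
      real test is done in the syscall exit assembly before MSR_RI is
      switched):

        /* Fast path: nothing to restore unless lazy FP or VEC state is
         * pending for this thread. */
        if (!current->thread.load_fp && !current->thread.load_vec)
        	return;
        restore_math(regs);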
      
      For threads that do want to restore math registers, the MSR switches are
      still suboptimal. Future direction may be to use a soft-RI bit to avoid
      MSR switches in kernel (similar to soft-EE), but for now at least the
      no-restore path is improved.
      
      POWER9 context switch rate increases by about 5% due to sched_yield(2)
      return performance. I haven't constructed a test to measure the syscall
      cost.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      bc4f65e4
    • powerpc/64s: Optimize hypercall/syscall entry · acd7d8ce
      Nicholas Piggin authored
      After bc355125 ("powerpc/64: Allow for relocation-on interrupts from
      guest to host"), a getppid() system call goes from 307 cycles to 358
      cycles (+17%) on POWER8. This is due significantly to the scratch SPR
      used by the hypercall check.
      
      It turns out there are some volatile registers common to both system
      call and hypercall (in particular, r12, cr0, ctr), which can be used to
      avoid the SPR and some other overheads. This brings getppid to 320 cycles
      (+4%).
      
      Testing hcall entry performance by running "sc 1" in guest userspace:
      before this patch it takes 854 cycles, afterwards 826. Also a small win
      there.
      
      POWER9 syscall is improved by about the same amount, hcall not tested.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      acd7d8ce
    • powerpc/mm/radix: Only add X for pages overlapping kernel text · 9abcc981
      Michael Ellerman authored
      Currently we map the whole linear mapping with PAGE_KERNEL_X. Instead we
      should check if the page overlaps the kernel text and only then add
      PAGE_KERNEL_X.
      
      Note that we still use 1G pages if they're available, so this will
      typically still result in a 1G executable page at KERNELBASE. So this fix is
      primarily useful for catching stray branches to high linear mapping addresses.
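
      A sketch of the per-page decision in the radix linear mapping code
      (treat overlaps_kernel_text() as an assumed helper name):

        pgprot_t prot;

        if (overlaps_kernel_text(vaddr, vaddr + mapping_size))
        	prot = PAGE_KERNEL_X;
        else
        	prot = PAGE_KERNEL;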
      
      Without this patch, we can execute at 1G in xmon using:
      
        0:mon> m c000000040000000
        c000000040000000  00 l
        c000000040000000  00000000 01006038
        c000000040000004  00000000 2000804e
        c000000040000008  00000000 x
        0:mon> di c000000040000000
        c000000040000000  38600001      li      r3,1
        c000000040000004  4e800020      blr
        0:mon> p c000000040000000
        return value is 0x1
      
      After the patch we get a 400 exception, as expected:
      
        0:mon> p c000000040000000
        *** 400 exception occurred
      
      Fixes: 2bfd65e4 ("powerpc/mm/radix: Add radix callbacks for early init routines")
      Cc: stable@vger.kernel.org # v4.7+
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      9abcc981
    • Revert "powerpc: Handle simultaneous interrupts at once" · 0edc2ca9
      Michael Ellerman authored
      This reverts commit 45cb08f4.
      
      For some reason this is causing IRQ problems on Freescale Book3E
      machines, eg on my p5020ds:
      
        irq 25: nobody cared (try booting with the "irqpoll" option)
        CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.12.0-rc3-gcc-6.3.1-00037-g45cb08f4 #624
        Call Trace:
        [c0000000fffdbb10] [c00000000049962c] .dump_stack+0xa8/0xe8 (unreliable)
        [c0000000fffdbba0] [c0000000000babf4] .__report_bad_irq+0x54/0x140
        [c0000000fffdbc40] [c0000000000bb11c] .note_interrupt+0x324/0x380
        [c0000000fffdbd00] [c0000000000b7110] .handle_irq_event_percpu+0x68/0x88
        [c0000000fffdbd90] [c0000000000b718c] .handle_irq_event+0x5c/0xa8
        [c0000000fffdbe10] [c0000000000bc01c] .handle_fasteoi_irq+0xe4/0x298
        [c0000000fffdbe90] [c0000000000b59c4] .generic_handle_irq+0x50/0x74
        [c0000000fffdbf10] [c0000000000075d8] .__do_irq+0x74/0x1f0
        [c0000000fffdbf90] [c0000000000189f8] .call_do_irq+0x14/0x24
        [c0000000f7173060] [c0000000000077e4] .do_IRQ+0x90/0x120
        [c0000000f7173100] [c00000000001d93c] exc_0x500_common+0xfc/0x100
        --- interrupt: 501 at .prepare_to_wait_event+0xc/0x14c
            LR = .fsl_elbc_run_command+0xc8/0x23c
        [c0000000f71734d0] [c00000000065f418] .nand_reset+0xb8/0x168
        [c0000000f7173560] [c00000000065fec4] .nand_scan_ident+0x2b0/0x1638
        [c0000000f7173650] [c000000000666cd8] .fsl_elbc_nand_probe+0x34c/0x5f0
        ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
        [c0000000f7173750] [c0000000005a3c60] .platform_drv_probe+0x64/0xb0
        [c0000000f71737d0] [c0000000005a12e0] .really_probe+0x290/0x334
        [c0000000f7173870] [c0000000005a14a0] .__driver_attach+0x11c/0x120
        [c0000000f7173900] [c00000000059e6a0] .bus_for_each_dev+0x98/0xfc
        [c0000000f71739a0] [c0000000005a0b3c] .driver_attach+0x34/0x4c
        [c0000000f7173a20] [c0000000005a04b0] .bus_add_driver+0x1ac/0x2e0
        [c0000000f7173ac0] [c0000000005a2170] .driver_register+0x94/0x160
        [c0000000f7173b40] [c0000000005a3be0] .__platform_driver_register+0x60/0x7c
        [c0000000f7173bc0] [c000000000d6aab4] .fsl_elbc_nand_driver_init+0x24/0x38
        [c0000000f7173c30] [c000000000001934] .do_one_initcall+0x68/0x1b8
        [c0000000f7173d00] [c000000000d210f8] .kernel_init_freeable+0x260/0x338
        [c0000000f7173db0] [c0000000000021b0] .kernel_init+0x20/0xe70
        [c0000000f7173e30] [c0000000000009bc] .ret_from_kernel_thread+0x58/0x9c
        handlers:
        [<c000000000ed85c8>] .fsl_lbc_ctrl_irq
        Disabling IRQ #25
      
      Ben also had concerns with the implementation being potentially slow on
      some PICs, so revert it for now.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      0edc2ca9
  2. 06 Jun, 2017 2 commits
  3. 05 Jun, 2017 9 commits
  4. 02 Jun, 2017 18 commits
    • powerpc/lib/xor_vmx: Ensure no altivec code executes before enable_kernel_altivec() · f718d426
      Matt Brown authored
      The xor_vmx.c file is used for the RAID5 xor operations. In these functions
      altivec is enabled to run the operation and then disabled.
      
      The code uses enable_kernel_altivec() around the core of the algorithm, however
      the whole file is built with -maltivec, so the compiler is within its rights to
      generate altivec code anywhere. This has been seen at least once in the wild:
      
        0:mon> di $xor_altivec_2
        c0000000000b97d0  3c4c01d9	addis   r2,r12,473
        c0000000000b97d4  3842db30	addi    r2,r2,-9424
        c0000000000b97d8  7c0802a6	mflr    r0
        c0000000000b97dc  f8010010	std     r0,16(r1)
        c0000000000b97e0  60000000	nop
        c0000000000b97e4  7c0802a6	mflr    r0
        c0000000000b97e8  faa1ffa8	std     r21,-88(r1)
        ...
        c0000000000b981c  f821ff41	stdu    r1,-192(r1)
        c0000000000b9820  7f8101ce	stvx    v28,r1,r0		<-- POP
        c0000000000b9824  38000030	li      r0,48
        c0000000000b9828  7fa101ce	stvx    v29,r1,r0
        ...
        c0000000000b984c  4bf6a06d	bl      c0000000000238b8 # enable_kernel_altivec
      
      This patch splits the non-altivec code into xor_vmx_glue.c which calls the
      altivec functions in xor_vmx.c. By compiling xor_vmx_glue.c without
      -maltivec we can guarantee that altivec instructions will not be executed
      outside of the enable/disable block.
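
      A sketch of one glue wrapper (assuming the vmx-side routine is renamed
      with a __ prefix):

        /* xor_vmx_glue.c is built without -maltivec, so the compiler
         * cannot emit vector instructions outside the enabled region. */
        void xor_altivec_2(unsigned long bytes, unsigned long *v1_in,
        		   unsigned long *v2_in)
        {
        	preempt_disable();
        	enable_kernel_altivec();
        	__xor_altivec_2(bytes, v1_in, v2_in);
        	disable_kernel_altivec();
        	preempt_enable();
        }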
      Signed-off-by: Matt Brown <matthew.brown.dev@gmail.com>
      [mpe: Rework change log and include disassembly]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      f718d426
    • powerpc/fadump: Set an upper limit for boot memory size · 48a316e3
      Hari Bathini authored
      By default, 5% of system RAM is reserved for preserving boot memory.
      Alternatively, a user can specify the amount of memory to reserve.
      See Documentation/powerpc/firmware-assisted-dump.txt for details. In
      addition to the memory reserved for preserving boot memory, some more
      memory is reserved to save the HPTE region, CPU state data and ELF core
      headers.
      
      Memory reservation during the first kernel looks like this:
      
        Low memory                                        Top of memory
        0      boot memory size                                       |
        |           |                       |<--Reserved dump area -->|
        V           V                       |   Permanent Reservation V
        +-----------+----------/ /----------+---+----+-----------+----+
        |           |                       |CPU|HPTE|  DUMP     |ELF |
        +-----------+----------/ /----------+---+----+-----------+----+
              |                                           ^
              |                                           |
              \                                           /
               -------------------------------------------
                Boot memory content gets transferred to
                reserved area by firmware at the time of
                crash
      
      This implicitly means that the sum of the sizes of boot memory, CPU
      state data, HPTE region, DUMP preserving area and ELF core headers
      can't be greater than the total memory size. But currently, a user is
      allowed to specify any value as boot memory size. So, the above rule
      is violated when a boot memory size around 50% of the total available
      memory is specified. As the kernel is not handling this currently, it
      may lead to undefined behavior. Fix it by setting an upper limit for
      boot memory size to 25% of the total available memory. Also, instead
      of using memblock_end_of_DRAM(), which doesn't take the holes, if any,
      in the memory layout into account, use memblock_phys_mem_size() to
      calculate the percentage of total available memory.
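
      A sketch of the clamp (fw_dump is fadump's global state; field and
      variable names are approximate):

        /* Cap the user-supplied boot memory size at 25% of the RAM that
         * memblock actually knows about, holes excluded. */
        unsigned long max_size = memblock_phys_mem_size() / 4;

        if (fw_dump.boot_memory_size > max_size)
        	fw_dump.boot_memory_size = max_size;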
      Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      48a316e3
    • powerpc/fadump: Update comment about offset where fadump is reserved · e7467dc6
      Hari Bathini authored
      With commit f6e6bedb ("powerpc/fadump: Reserve memory at an offset
      closer to bottom of RAM"), memory for fadump is no longer reserved at
      the top of RAM. But there are still a few places which say so. Change
      them appropriately.
      Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      e7467dc6
    • powerpc/fadump: Add a warning when 'fadump_reserve_mem=' is used · 81d9eca5
      Hari Bathini authored
      With commit 11550dc0 ("powerpc/fadump: reuse crashkernel parameter
      for fadump memory reservation"), 'fadump_reserve_mem=' parameter is
      deprecated in favor of 'crashkernel=' parameter. Add a warning if
      'fadump_reserve_mem=' is still used.
      
      Fixes: 11550dc0 ("powerpc/fadump: reuse crashkernel parameter for fadump memory reservation")
      Suggested-by: Prarit Bhargava <prarit@redhat.com>
      Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
      [mpe: Unsplit long printk strings]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      81d9eca5
    • powerpc/fadump: Return error when fadump registration fails · 98b8cd7f
      Michal Suchanek authored
       - log an error message when registration fails with an error code
         not handled in the switch statement
       - translate the hv error code to posix error code and return it from
         fw_register
       - return the posix error code from fw_register to the process writing
         to sysfs
       - return EEXIST on re-registration
       - return success on deregistration when fadump is not registered
       - return ENODEV when no memory is reserved for fadump
      Signed-off-by: Michal Suchanek <msuchanek@suse.de>
      Tested-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
      [mpe: Use pr_err() to shrink the error print]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      98b8cd7f
    • powerpc: Remove __ilog2()s and use generic ones · f782ddf2
      Christophe Leroy authored
      With the __ilog2() function as defined in
      arch/powerpc/include/asm/bitops.h, GCC will not optimise the code
      in the case of a constant parameter.
      
      The generic ilog2() function in include/linux/log2.h is written
      to handle the case of a constant parameter.
      
      This patch discards the three __ilog2() functions and
      defines __ilog2() as ilog2().
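
      For constant calls, the gain is easy to demonstrate (a small example,
      not taken from the patch):

      int test_const(void)
      {
      	return ilog2(512);	/* folds to "li r3,9" at compile time */
      }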
      
      For non-constant calls, the generated code is the same:
      int test__ilog2(unsigned long x)
      {
      	return __ilog2(x);
      }
      
      int test__ilog2_u32(u32 n)
      {
      	return __ilog2_u32(n);
      }
      
      int test__ilog2_u64(u64 n)
      {
      	return __ilog2_u64(n);
      }
      
      On PPC32 before the patch:
      00000000 <test__ilog2>:
         0:	7c 63 00 34 	cntlzw  r3,r3
         4:	20 63 00 1f 	subfic  r3,r3,31
         8:	4e 80 00 20 	blr
      
      0000000c <test__ilog2_u32>:
         c:	7c 63 00 34 	cntlzw  r3,r3
        10:	20 63 00 1f 	subfic  r3,r3,31
        14:	4e 80 00 20 	blr
      
      On PPC32 after the patch:
      00000000 <test__ilog2>:
         0:	7c 63 00 34 	cntlzw  r3,r3
         4:	20 63 00 1f 	subfic  r3,r3,31
         8:	4e 80 00 20 	blr
      
      0000000c <test__ilog2_u32>:
         c:	7c 63 00 34 	cntlzw  r3,r3
        10:	20 63 00 1f 	subfic  r3,r3,31
        14:	4e 80 00 20 	blr
      
      On PPC64 before the patch:
      0000000000000000 <.test__ilog2>:
         0:	7c 63 00 74 	cntlzd  r3,r3
         4:	20 63 00 3f 	subfic  r3,r3,63
         8:	7c 63 07 b4 	extsw   r3,r3
         c:	4e 80 00 20 	blr
      
      0000000000000010 <.test__ilog2_u32>:
        10:	7c 63 00 34 	cntlzw  r3,r3
        14:	20 63 00 1f 	subfic  r3,r3,31
        18:	7c 63 07 b4 	extsw   r3,r3
        1c:	4e 80 00 20 	blr
      
      0000000000000020 <.test__ilog2_u64>:
        20:	7c 63 00 74 	cntlzd  r3,r3
        24:	20 63 00 3f 	subfic  r3,r3,63
        28:	7c 63 07 b4 	extsw   r3,r3
        2c:	4e 80 00 20 	blr
      
      On PPC64 after the patch:
      0000000000000000 <.test__ilog2>:
         0:	7c 63 00 74 	cntlzd  r3,r3
         4:	20 63 00 3f 	subfic  r3,r3,63
         8:	7c 63 07 b4 	extsw   r3,r3
         c:	4e 80 00 20 	blr
      
      0000000000000010 <.test__ilog2_u32>:
        10:	7c 63 00 34 	cntlzw  r3,r3
        14:	20 63 00 1f 	subfic  r3,r3,31
        18:	7c 63 07 b4 	extsw   r3,r3
        1c:	4e 80 00 20 	blr
      
      0000000000000020 <.test__ilog2_u64>:
        20:	7c 63 00 74 	cntlzd  r3,r3
        24:	20 63 00 3f 	subfic  r3,r3,63
        28:	7c 63 07 b4 	extsw   r3,r3
        2c:	4e 80 00 20 	blr
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      f782ddf2
    • powerpc: Replace ffz() by equivalent generic function · 22ef33b3
      Christophe Leroy authored
      With the ffz() function as defined in arch/powerpc/include/asm/bitops.h,
      GCC will not optimise the code in the case of a constant parameter.
      
      This patch replaces ffz() by the generic function.
      
      The generic ffz(x) expects to never be called with ~x == 0,
      as noted in the comment in include/asm-generic/bitops/ffz.h.
      The only user of ffz() within arch/powerpc/ is
      platforms/512x/mpc5121_ads_cpld.c, which checks that x is not 0xff.
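
      The generic version is simply a wrapper over __ffs(); the definition
      in include/asm-generic/bitops/ffz.h is the one-liner:

      #define ffz(x)  __ffs(~(x))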
      
      For non-constant calls, the generated code is the same:
      
      unsigned long testffz(unsigned long x)
      {
      	return ffz(x);
      }
      
      On PPC32, before the patch:
      00000018 <testffz>:
        18:	7c 63 18 f9 	not.    r3,r3
        1c:	40 82 00 0c 	bne     28 <testffz+0x10>
        20:	38 60 00 20 	li      r3,32
        24:	4e 80 00 20 	blr
        28:	7d 23 00 d0 	neg     r9,r3
        2c:	7d 23 18 38 	and     r3,r9,r3
        30:	7c 63 00 34 	cntlzw  r3,r3
        34:	20 63 00 1f 	subfic  r3,r3,31
        38:	4e 80 00 20 	blr
      
      On PPC32, after the patch:
      00000018 <testffz>:
        18:	39 23 00 01 	addi    r9,r3,1
        1c:	7d 23 18 78 	andc    r3,r9,r3
        20:	7c 63 00 34 	cntlzw  r3,r3
        24:	20 63 00 1f 	subfic  r3,r3,31
        28:	4e 80 00 20 	blr
      
      On PPC64, before the patch:
      0000000000000030 <.testffz>:
        30:	7c 60 18 f9 	not.    r0,r3
        34:	38 60 00 40 	li      r3,64
        38:	4d 82 00 20 	beqlr
        3c:	7c 60 00 d0 	neg     r3,r0
        40:	7c 63 00 38 	and     r3,r3,r0
        44:	7c 63 00 74 	cntlzd  r3,r3
        48:	20 63 00 3f 	subfic  r3,r3,63
        4c:	7c 63 07 b4 	extsw   r3,r3
        50:	4e 80 00 20 	blr
      
      On PPC64, after the patch:
      0000000000000030 <.testffz>:
        30:	38 03 00 01 	addi    r0,r3,1
        34:	7c 03 18 78 	andc    r3,r0,r3
        38:	7c 63 00 74 	cntlzd  r3,r3
        3c:	20 63 00 3f 	subfic  r3,r3,63
        40:	4e 80 00 20 	blr
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      22ef33b3
    • powerpc: Use builtin functions for fls()/__fls()/fls64() · 2fcff790
      Christophe Leroy authored
      With the fls() functions as defined in arch/powerpc/include/asm/bitops.h,
      GCC will not optimise the code in the case of a constant parameter.
      
      This patch replaces __fls() by the builtin function, and modifies
      fls() and fls64() to use builtins instead of inline assembly.
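
      The builtin-based versions are essentially the following (a sketch,
      modulo the kernel's exact types and decoration; GCC emits
      cntlzw/cntlzd for these builtins on powerpc):

      static inline int fls(unsigned int x)
      {
      	/* __builtin_clz(0) is undefined in C, but powerpc's cntlzw
      	 * returns 32 for 0, so fls(0) == 0 works out. */
      	return 32 - __builtin_clz(x);
      }

      static inline unsigned long __fls(unsigned long x)
      {
      	return BITS_PER_LONG - 1 - __builtin_clzl(x);
      }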
      
      For non-constant calls, the generated code is the same:
      
      int testfls(unsigned int x)
      {
      	return fls(x);
      }
      
      unsigned long test__fls(unsigned long x)
      {
      	return __fls(x);
      }
      
      int testfls64(__u64 x)
      {
      	return fls64(x);
      }
      
      On PPC32, before the patch:
      00000064 <testfls>:
        64:	7c 63 00 34 	cntlzw  r3,r3
        68:	20 63 00 20 	subfic  r3,r3,32
        6c:	4e 80 00 20 	blr
      
      00000070 <test__fls>:
        70:	7c 63 00 34 	cntlzw  r3,r3
        74:	20 63 00 1f 	subfic  r3,r3,31
        78:	4e 80 00 20 	blr
      
      0000007c <testfls64>:
        7c:	2c 03 00 00 	cmpwi   r3,0
        80:	40 82 00 10 	bne     90 <testfls64+0x14>
        84:	7c 83 00 34 	cntlzw  r3,r4
        88:	20 63 00 20 	subfic  r3,r3,32
        8c:	4e 80 00 20 	blr
        90:	7c 63 00 34 	cntlzw  r3,r3
        94:	20 63 00 40 	subfic  r3,r3,64
        98:	4e 80 00 20 	blr
      
      On PPC32, after the patch:
      00000054 <testfls>:
        54:	7c 63 00 34 	cntlzw  r3,r3
        58:	20 63 00 20 	subfic  r3,r3,32
        5c:	4e 80 00 20 	blr
      
      00000060 <test__fls>:
        60:	7c 63 00 34 	cntlzw  r3,r3
        64:	20 63 00 1f 	subfic  r3,r3,31
        68:	4e 80 00 20 	blr
      
      0000006c <testfls64>:
        6c:	2c 03 00 00 	cmpwi   r3,0
        70:	41 82 00 10 	beq     80 <testfls64+0x14>
        74:	7c 63 00 34 	cntlzw  r3,r3
        78:	20 63 00 40 	subfic  r3,r3,64
        7c:	4e 80 00 20 	blr
        80:	7c 83 00 34 	cntlzw  r3,r4
        84:	20 63 00 20 	subfic  r3,r3,32
        88:	4e 80 00 20 	blr
      
      On PPC64, before the patch:
      00000000000000a0 <.testfls>:
        a0:	7c 63 00 34 	cntlzw  r3,r3
        a4:	20 63 00 20 	subfic  r3,r3,32
        a8:	7c 63 07 b4 	extsw   r3,r3
        ac:	4e 80 00 20 	blr
      
      00000000000000b0 <.test__fls>:
        b0:	7c 63 00 74 	cntlzd  r3,r3
        b4:	20 63 00 3f 	subfic  r3,r3,63
        b8:	7c 63 07 b4 	extsw   r3,r3
        bc:	4e 80 00 20 	blr
      
      00000000000000c0 <.testfls64>:
        c0:	7c 63 00 74 	cntlzd  r3,r3
        c4:	20 63 00 40 	subfic  r3,r3,64
        c8:	7c 63 07 b4 	extsw   r3,r3
        cc:	4e 80 00 20 	blr
      
      On PPC64, after the patch:
      0000000000000090 <.testfls>:
        90:	7c 63 00 34 	cntlzw  r3,r3
        94:	20 63 00 20 	subfic  r3,r3,32
        98:	7c 63 07 b4 	extsw   r3,r3
        9c:	4e 80 00 20 	blr
      
      00000000000000a0 <.test__fls>:
        a0:	7c 63 00 74 	cntlzd  r3,r3
        a4:	20 63 00 3f 	subfic  r3,r3,63
        a8:	4e 80 00 20 	blr
        ac:	60 00 00 00 	nop
      
      00000000000000b0 <.testfls64>:
        b0:	7c 63 00 74 	cntlzd  r3,r3
        b4:	20 63 00 40 	subfic  r3,r3,64
        b8:	7c 63 07 b4 	extsw   r3,r3
        bc:	4e 80 00 20 	blr
      
      Those builtins have been in GCC since at least 3.4.6 (see
      https://gcc.gnu.org/onlinedocs/gcc-3.4.6/gcc/Other-Builtins.html )
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      2fcff790
    • powerpc: Discard ffs()/__ffs() function and use builtin functions instead · f83647d6
      Christophe Leroy authored
      With the ffs() function as defined in arch/powerpc/include/asm/bitops.h,
      GCC will not optimise the code in the case of a constant parameter, as
      shown by the small example below.
      
      int ffs_test(void)
      {
      	return 4 << ffs(31);
      }
      
      c0012334 <ffs_test>:
      c0012334:       39 20 00 01     li      r9,1
      c0012338:       38 60 00 04     li      r3,4
      c001233c:       7d 29 00 34     cntlzw  r9,r9
      c0012340:       21 29 00 20     subfic  r9,r9,32
      c0012344:       7c 63 48 30     slw     r3,r3,r9
      c0012348:       4e 80 00 20     blr
      
      With this patch, the same function will compile as follows:
      
      c0012334 <ffs_test>:
      c0012334:       38 60 00 08     li      r3,8
      c0012338:       4e 80 00 20     blr
      
      The same happens with __ffs().
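
      The builtin-based versions are essentially the following (a sketch,
      modulo the kernel's exact types):

      static inline int ffs(int x)
      {
      	return __builtin_ffs(x);
      }

      static inline unsigned long __ffs(unsigned long x)
      {
      	/* As with the removed asm version, __ffs(0) is undefined. */
      	return __builtin_ctzl(x);
      }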
      
      For non-constant calls, the generated code is the same,
      although it is slightly different on 64 bits for ffs():
      
      unsigned long test__ffs(unsigned long x)
      {
      	return __ffs(x);
      }
      
      int testffs(int x)
      {
      	return ffs(x);
      }
      
      On PPC32, before the patch:
      0000003c <test__ffs>:
        3c:	7d 23 00 d0 	neg     r9,r3
        40:	7d 23 18 38 	and     r3,r9,r3
        44:	7c 63 00 34 	cntlzw  r3,r3
        48:	20 63 00 1f 	subfic  r3,r3,31
        4c:	4e 80 00 20 	blr
      
      00000050 <testffs>:
        50:	7d 23 00 d0 	neg     r9,r3
        54:	7d 23 18 38 	and     r3,r9,r3
        58:	7c 63 00 34 	cntlzw  r3,r3
        5c:	20 63 00 20 	subfic  r3,r3,32
        60:	4e 80 00 20 	blr
      
      On PPC32, after the patch:
      0000002c <test__ffs>:
        2c:	7d 23 00 d0 	neg     r9,r3
        30:	7d 23 18 38 	and     r3,r9,r3
        34:	7c 63 00 34 	cntlzw  r3,r3
        38:	20 63 00 1f 	subfic  r3,r3,31
        3c:	4e 80 00 20 	blr
      
      00000040 <testffs>:
        40:	7d 23 00 d0 	neg     r9,r3
        44:	7d 23 18 38 	and     r3,r9,r3
        48:	7c 63 00 34 	cntlzw  r3,r3
        4c:	20 63 00 20 	subfic  r3,r3,32
        50:	4e 80 00 20 	blr
      
      On PPC64, before the patch:
      0000000000000060 <.test__ffs>:
        60:	7c 03 00 d0 	neg     r0,r3
        64:	7c 03 18 38 	and     r3,r0,r3
        68:	7c 63 00 74 	cntlzd  r3,r3
        6c:	20 63 00 3f 	subfic  r3,r3,63
        70:	7c 63 07 b4 	extsw   r3,r3
        74:	4e 80 00 20 	blr
      
      0000000000000080 <.testffs>:
        80:	7c 03 00 d0 	neg     r0,r3
        84:	7c 03 18 38 	and     r3,r0,r3
        88:	7c 63 00 74 	cntlzd  r3,r3
        8c:	20 63 00 40 	subfic  r3,r3,64
        90:	7c 63 07 b4 	extsw   r3,r3
        94:	4e 80 00 20 	blr
      
      On PPC64, after the patch:
      0000000000000050 <.test__ffs>:
        50:	7c 03 00 d0 	neg     r0,r3
        54:	7c 03 18 38 	and     r3,r0,r3
        58:	7c 63 00 74 	cntlzd  r3,r3
        5c:	20 63 00 3f 	subfic  r3,r3,63
        60:	4e 80 00 20 	blr
      
      0000000000000070 <.testffs>:
        70:	7c 03 00 d0 	neg     r0,r3
        74:	7c 03 18 38 	and     r3,r0,r3
        78:	7c 63 00 34 	cntlzw  r3,r3
        7c:	20 63 00 20 	subfic  r3,r3,32
        80:	7c 63 07 b4 	extsw   r3,r3
        84:	4e 80 00 20 	blr
      (ffs() operates on an int so cntlzw is equivalent to cntlzd)
      
      In addition, when reading the generated vmlinux, we can observe
      that with the builtin functions, GCC sometimes efficiently spreads
      the instructions within the generated functions while the inline
      assembly forces them to remain grouped together.
      
      __builtin_ffs() is already used in arch/powerpc/include/asm/page_32.h
      
      Those builtins have been in GCC since at least 3.4.6 (see
      https://gcc.gnu.org/onlinedocs/gcc-3.4.6/gcc/Other-Builtins.html )
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      f83647d6
    • powerpc: Handle simultaneous interrupts at once · 45cb08f4
      Christophe Leroy authored
      It often happens that several interrupts are pending simultaneously,
      for instance with a double Ethernet attachment. With the current
      implementation, we pay the cost of a kernel entry/exit for each
      interrupt.
      
      This patch introduces a loop in __do_irq() to handle all interrupts
      at once before returning.
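
      A sketch of the loop (paraphrased; the spurious-interrupt accounting
      in __do_irq() is omitted):

      	unsigned int irq;

      	do {
      		irq = ppc_md.get_irq();
      		if (irq)
      			generic_handle_irq(irq);
      	} while (irq);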
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      45cb08f4
    • powerpc/8xx: fix mpc8xx_get_irq() return on no irq · 3c29b603
      Christophe Leroy authored
      IRQ 0 is a valid HW interrupt, so get_irq() should return 0 when there
      is no pending irq, instead of returning irq_linear_revmap(..., 0).
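
      A sketch of the fix (register and identifier names recalled from the
      8xx PIC driver, so treat them as illustrative):

      unsigned int mpc8xx_get_irq(void)
      {
      	int irq;

      	irq = in_be32(&siu_reg->sc_sivec) >> 26;
      	if (irq == PIC_VEC_SPURRIOUS)
      		return 0;	/* no pending irq: don't map hwirq 0 */

      	return irq_linear_revmap(mpc8xx_pic_host, irq);
      }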
      
      Fixes: f2a0bd37 ("[POWERPC] 8xx: powerpc port of core CPM PIC")
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      3c29b603
    • powerpc/mm: The 8xx doesn't call do_page_fault() for breakpoints · 92aa2fe0
      Christophe Leroy authored
      The 8xx has a dedicated exception for breakpoints, which directly
      calls do_break().
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      92aa2fe0
    • powerpc/mm: Evaluate user_mode(regs) only once in do_page_fault() · da929f6a
      Christophe Leroy authored
      Analysis of the assembly code shows that, when using user_mode(regs),
      at least the 'andi.' is redone every time, and the 'lwz ,132(r31)'
      most of the time. With the new form, 'is_user' is mapped to cr4, and
      all further uses of is_user result in just instructions like
      'beq cr4,218 <do_page_fault+0x218>'.
      
      Without the patch:
      
        50:	81 1e 00 84 	lwz     r8,132(r30)
        54:	71 09 40 00 	andi.   r9,r8,16384
        58:	40 82 00 0c 	bne     64 <do_page_fault+0x64>
      
        84:	81 3e 00 84 	lwz     r9,132(r30)
        8c:	71 2a 40 00 	andi.   r10,r9,16384
        90:	41 a2 01 64 	beq     1f4 <do_page_fault+0x1f4>
      
        d4:	81 3e 00 84 	lwz     r9,132(r30)
        dc:	71 28 40 00 	andi.   r8,r9,16384
        e0:	41 82 02 08 	beq     2e8 <do_page_fault+0x2e8>
      
       108:	81 3e 00 84 	lwz     r9,132(r30)
       110:	71 28 40 00 	andi.   r8,r9,16384
       118:	41 82 02 28 	beq     340 <do_page_fault+0x340>
      
       1e4:	81 3e 00 84 	lwz     r9,132(r30)
       1e8:	71 2a 40 00 	andi.   r10,r9,16384
       1ec:	40 82 01 68 	bne     354 <do_page_fault+0x354>
      
       228:	81 3e 00 84 	lwz     r9,132(r30)
       22c:	71 28 40 00 	andi.   r8,r9,16384
       230:	41 82 ff c4 	beq     1f4 <do_page_fault+0x1f4>
      
       288:	71 2a 40 00 	andi.   r10,r9,16384
       294:	41 a2 fe 60 	beq     f4 <do_page_fault+0xf4>
      
       50c:	81 3e 00 84 	lwz     r9,132(r30)
       514:	71 2a 40 00 	andi.   r10,r9,16384
       518:	40 a2 fc e0 	bne     1f8 <do_page_fault+0x1f8>
      
       534:	81 3e 00 84 	lwz     r9,132(r30)
       53c:	71 2a 40 00 	andi.   r10,r9,16384
       540:	41 82 fc b8 	beq     1f8 <do_page_fault+0x1f8>
      
      This patch creates a local variable called 'is_user' which contains
      the result of user_mode(regs).
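
      A sketch of the change (FAULT_FLAG_USER is one real use among many in
      do_page_fault()):

      	/* Before: user_mode(regs) re-evaluated at each use. */
      	if (user_mode(regs))
      		flags |= FAULT_FLAG_USER;

      	/* After: evaluated once; GCC keeps the result in cr4. */
      	bool is_user = user_mode(regs);

      	if (is_user)
      		flags |= FAULT_FLAG_USER;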
      
      With the patch:
      
        20:	81 03 00 84 	lwz     r8,132(r3)
        48:	55 09 97 fe 	rlwinm  r9,r8,18,31,31
        58:	2e 09 00 00 	cmpwi   cr4,r9,0
        5c:	40 92 00 0c 	bne     cr4,68 <do_page_fault+0x68>
      
        88:	41 b2 01 90 	beq     cr4,218 <do_page_fault+0x218>
      
        d4:	40 92 01 d0 	bne     cr4,2a4 <do_page_fault+0x2a4>
      
       120:	41 b2 00 f8 	beq     cr4,218 <do_page_fault+0x218>
      
       138:	41 b2 ff a0 	beq     cr4,d8 <do_page_fault+0xd8>
      
       1d4:	40 92 00 e0 	bne     cr4,2b4 <do_page_fault+0x2b4>
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      da929f6a
    • powerpc/mm: Remove a redundant test in do_page_fault() · 97a011e6
      Christophe Leroy authored
      The result of (trap == 0x400) is already in is_exec.
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      97a011e6
    • powerpc/mm: Only call store_updates_sp() on stores in do_page_fault() · e8de85ca
      Christophe Leroy authored
      Function store_updates_sp() checks whether the faulting
      instruction is a store updating r1. Therefore we can limit its calls
      to store exceptions.
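
      A sketch of the narrowed call site (flag names approximate):

      	/* Only decode the faulting instruction for write faults. */
      	if (is_write && is_user)
      		store_update_sp = store_updates_sp(regs);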
      
      This patch is an improvement of commit a7a9dcd8 ("powerpc: Avoid
      taking a data miss on every userspace instruction miss")
      
      With the same microbenchmark app, run with 500 as argument, on an
      MPC885 we get:
      
      Before this patch: 152000 DTLB misses
      After this patch:  147000 DTLB misses
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      e8de85ca
    • powerpc/mm: Remove __this_fixmap_does_not_exist() · 9affa9e2
      Christophe Leroy authored
      This function has not been used since commit 9494a1e8
      ("powerpc: use generic fixmap.h").
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      9affa9e2
    • powerpc/mm/ptdump: Dump the first entry of the linear mapping as well · e63739b1
      Balbir Singh authored
      The check in hpte_find() should be < and not <= for PAGE_OFFSET.
      Signed-off-by: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      e63739b1
  5. 30 May, 2017 3 commits
    • powerpc: Link warning for orphan sections · 83a092cf
      Nicholas Piggin authored
      Add --orphan-handling=warn to final link flags. This ensures we can
      handle all sections explicitly. This would have caught subtle breakage
      such as 7de3b27b at build-time.
      
      Also bring existing orphan sections into the fold:
      - .text.hot and .text.unlikely are compiler generated sections.
      - .sdata2, .dynsbss, .plt are used by PPC32
      - We previously did not specify DWARF_DEBUG or STABS_DEBUG
      - DWARF_DEBUG did not include all DWARF sections that can be emitted
      - A number of sections are unused and can be discarded.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      83a092cf
    • powerpc/64: Tool to check head sections location sanity · c494adef
      Nicholas Piggin authored
      Use a tool to check that the locations of the "fixed sections" are
      where we expect them to be, which catches cases the linker script
      can't (stubs being added to the start of the .text section), and
      which ends up being neater.
      
      Sample output:
      
        ERROR: start_text address is c000000000008100, should be c000000000008000
        ERROR: see comments in arch/powerpc/tools/head_check.sh
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      [mpe: Fold in fix from Nick for 4.6 era toolchains]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      c494adef
    • powerpc/64: Handle linker stubs in low .text code · 951eedeb
      Nicholas Piggin authored
      Very large kernels may require linker stubs for branches from HEAD
      text code. The linker may place these stubs before the HEAD text
      sections, which breaks the assumption that HEAD text is located at 0
      (or the .text section being located at 0x7000/0x8000 on Book3S
      kernels).
      
      Provide an option to create a small section just before the .text
      section with an empty 256 - 4 bytes, and adjust the start of the .text
      section to match. The linker will tend to put stubs in that section
      and not break our relative-to-absolute offset assumptions.
      
      This causes a small waste of space on common kernels, but allows large
      kernels to build and boot. For now, it is an EXPERT config option,
      defaulting to =n, but a reference is provided for it in the build-time
      check for such breakage. This is good enough for allyesconfig and
      custom users / hackers.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      951eedeb