Commit f867d556 authored by Christophe Leroy, committed by Scott Wood

powerpc32: optimise csum_partial() loop

On the 8xx, load latency is 2 cycles and taking branches also takes
2 cycles. So let's unroll the loop.

This patch improves csum_partial() speed by around 10% on both:
* 8xx (single issue processor with parallel execution)
* 83xx (superscalar 6xx processor with dual instruction fetch
and parallel execution)

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
parent 48821a34
@@ -38,10 +38,24 @@ _GLOBAL(csum_partial)
         srwi.   r6,r4,2         /* # words to do */
         adde    r5,r5,r0
         beq     3f
-1:      mtctr   r6
+1:      andi.   r6,r6,3         /* Prepare to handle words 4 by 4 */
+        beq     21f
+        mtctr   r6
 2:      lwzu    r0,4(r3)
         adde    r5,r5,r0
         bdnz    2b
+21:     srwi.   r6,r4,4         /* # blocks of 4 words to do */
+        beq     3f
+        mtctr   r6
+22:     lwz     r0,4(r3)
+        lwz     r6,8(r3)
+        lwz     r7,12(r3)
+        lwzu    r8,16(r3)
+        adde    r5,r5,r0
+        adde    r5,r5,r6
+        adde    r5,r5,r7
+        adde    r5,r5,r8
+        bdnz    22b
 3:      andi.   r0,r4,2
         beq+    4f
         lhz     r0,4(r3)
......
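
As a rough illustration of the restructuring in this hunk, the C sketch below models the word loop before and after unrolling. It is not the kernel's code. The names add_carry, csum_words_simple and csum_words_unrolled are hypothetical, and add_carry folds the carry back into the sum immediately, whereas the assembly chains it through the carry bit with adde. The shape is what matters here: up to three leftover words are summed one at a time, then the rest are handled in blocks of four, with the four loads grouped ahead of the four adds.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: 32-bit add with end-around carry. */
static uint32_t add_carry(uint32_t sum, uint32_t word)
{
    uint64_t t = (uint64_t)sum + word;
    return (uint32_t)t + (uint32_t)(t >> 32);
}

/* Model of the original loop: one load and one add per iteration. */
uint32_t csum_words_simple(const uint32_t *p, size_t nwords, uint32_t sum)
{
    while (nwords--)
        sum = add_carry(sum, *p++);
    return sum;
}

/* Model of the unrolled loop (assumed to mirror labels 2: and 22:). */
uint32_t csum_words_unrolled(const uint32_t *p, size_t nwords, uint32_t sum)
{
    size_t head = nwords & 3;    /* leftover words, done one by one */
    size_t blocks = nwords >> 2; /* blocks of 4 words */

    while (head--)
        sum = add_carry(sum, *p++);

    while (blocks--) {
        /* Issue the four loads first so the adds do not wait on them. */
        uint32_t w0 = p[0], w1 = p[1], w2 = p[2], w3 = p[3];

        sum = add_carry(sum, w0);
        sum = add_carry(sum, w1);
        sum = add_carry(sum, w2);
        sum = add_carry(sum, w3);
        p += 4;
    }
    return sum;
}
```

Both functions visit the words in the same order, so they return the same value for any input. The unrolled form simply takes one branch per four words instead of one per word.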