internal/bytealg: use word-wise comparison for Equal on arm

Follow CL 165338 and use word-wise comparison for aligned buffers in Equal on arm, otherwise fall back to the current byte-wise comparison. name old time/op new time/op delta Equal/0-4 25.7ns ± 1% 23.5ns ± 1% -8.78% (p=0.000 n=10+10) Equal/1-4 65.8ns ± 0% 60.1ns ± 1% -8.69% (p=0.000 n=10+9) Equal/6-4 82.9ns ± 1% 86.7ns ± 0% +4.59% (p=0.000 n=10+10) Equal/9-4 90.0ns ± 0% 101.0ns ± 0% +12.18% (p=0.000 n=9+10) Equal/15-4 108ns ± 0% 119ns ± 0% +10.19% (p=0.000 n=8+8) Equal/16-4 111ns ± 0% 82ns ± 0% -26.37% (p=0.000 n=8+10) Equal/20-4 124ns ± 1% 87ns ± 1% -29.94% (p=0.000 n=9+10) Equal/32-4 160ns ± 1% 97ns ± 1% -39.40% (p=0.000 n=10+10) Equal/4K-4 14.0µs ± 0% 3.6µs ± 1% -74.57% (p=0.000 n=9+10) Equal/4M-4 12.8ms ± 1% 3.2ms ± 0% -74.93% (p=0.000 n=9+9) Equal/64M-4 204ms ± 1% 51ms ± 0% -74.78% (p=0.000 n=10+10) EqualPort/1-4 47.0ns ± 1% 46.8ns ± 0% -0.40% (p=0.015 n=10+6) EqualPort/6-4 82.6ns ± 1% 81.9ns ± 1% -0.87% (p=0.002 n=10+10) EqualPort/32-4 232ns ± 0% 232ns ± 0% ~ (p=0.496 n=8+10) EqualPort/4K-4 29.0µs ± 1% 29.0µs ± 1% ~ (p=0.604 n=9+10) EqualPort/4M-4 24.0ms ± 1% 23.8ms ± 0% -0.65% (p=0.001 n=9+9) EqualPort/64M-4 383ms ± 1% 382ms ± 0% ~ (p=0.218 n=10+10) CompareBytesEqual-4 61.2ns ± 1% 61.0ns ± 1% ~ (p=0.539 n=10+10) name old speed new speed delta Equal/1-4 15.2MB/s ± 0% 16.6MB/s ± 1% +9.52% (p=0.000 n=10+9) Equal/6-4 72.4MB/s ± 1% 69.2MB/s ± 0% -4.40% (p=0.000 n=10+10) Equal/9-4 100MB/s ± 0% 89MB/s ± 0% -11.40% (p=0.000 n=9+10) Equal/15-4 138MB/s ± 1% 125MB/s ± 1% -9.41% (p=0.000 n=10+10) Equal/16-4 144MB/s ± 1% 196MB/s ± 0% +36.41% (p=0.000 n=10+10) Equal/20-4 162MB/s ± 1% 231MB/s ± 1% +42.98% (p=0.000 n=9+10) Equal/32-4 200MB/s ± 1% 331MB/s ± 1% +65.64% (p=0.000 n=10+10) Equal/4K-4 292MB/s ± 0% 1149MB/s ± 1% +293.19% (p=0.000 n=9+10) Equal/4M-4 328MB/s ± 1% 1307MB/s ± 0% +298.87% (p=0.000 n=9+9) Equal/64M-4 329MB/s ± 1% 1306MB/s ± 0% +296.56% (p=0.000 n=10+10) EqualPort/1-4 21.3MB/s ± 1% 21.4MB/s ± 0% +0.42% (p=0.002 n=10+9) EqualPort/6-4 72.6MB/s ± 1% 73.2MB/s ± 1% +0.87% (p=0.003 n=10+10) EqualPort/32-4 138MB/s ± 0% 138MB/s ± 0% ~ (p=0.953 n=9+10) EqualPort/4K-4 141MB/s ± 1% 141MB/s ± 1% ~ (p=0.382 n=10+10) EqualPort/4M-4 175MB/s ± 1% 176MB/s ± 0% +0.65% (p=0.001 n=9+9) EqualPort/64M-4 175MB/s ± 1% 176MB/s ± 0% ~ (p=0.225 n=10+10) The 5-12% decrease in performance on Equal/{6,9,15} are due to the benchmarks splitting the bytes buffer in half. The b argument to Equal then ends up being unaligned and thus the fast word-wise compare doesn't kick in. Updates #29001 Change-Id: I73be501c18e67d211ed19da7771b4f254254e609 Reviewed-on: https://go-review.googlesource.com/c/go/+/167557 Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>

internal/bytealg: use word-wise comparison for Equal on arm
Follow CL 165338 and use word-wise comparison for aligned buffers in Equal on arm, otherwise fall back to the current byte-wise comparison. name old time/op new time/op delta Equal/0-4 25.7ns ± 1% 23.5ns ± 1% -8.78% (p=0.000 n=10+10) Equal/1-4 65.8ns ± 0% 60.1ns ± 1% -8.69% (p=0.000 n=10+9) Equal/6-4 82.9ns ± 1% 86.7ns ± 0% +4.59% (p=0.000 n=10+10) Equal/9-4 90.0ns ± 0% 101.0ns ± 0% +12.18% (p=0.000 n=9+10) Equal/15-4 108ns ± 0% 119ns ± 0% +10.19% (p=0.000 n=8+8) Equal/16-4 111ns ± 0% 82ns ± 0% -26.37% (p=0.000 n=8+10) Equal/20-4 124ns ± 1% 87ns ± 1% -29.94% (p=0.000 n=9+10) Equal/32-4 160ns ± 1% 97ns ± 1% -39.40% (p=0.000 n=10+10) Equal/4K-4 14.0µs ± 0% 3.6µs ± 1% -74.57% (p=0.000 n=9+10) Equal/4M-4 12.8ms ± 1% 3.2ms ± 0% -74.93% (p=0.000 n=9+9) Equal/64M-4 204ms ± 1% 51ms ± 0% -74.78% (p=0.000 n=10+10) EqualPort/1-4 47.0ns ± 1% 46.8ns ± 0% -0.40% (p=0.015 n=10+6) EqualPort/6-4 82.6ns ± 1% 81.9ns ± 1% -0.87% (p=0.002 n=10+10) EqualPort/32-4 232ns ± 0% 232ns ± 0% ~ (p=0.496 n=8+10) EqualPort/4K-4 29.0µs ± 1% 29.0µs ± 1% ~ (p=0.604 n=9+10) EqualPort/4M-4 24.0ms ± 1% 23.8ms ± 0% -0.65% (p=0.001 n=9+9) EqualPort/64M-4 383ms ± 1% 382ms ± 0% ~ (p=0.218 n=10+10) CompareBytesEqual-4 61.2ns ± 1% 61.0ns ± 1% ~ (p=0.539 n=10+10) name old speed new speed delta Equal/1-4 15.2MB/s ± 0% 16.6MB/s ± 1% +9.52% (p=0.000 n=10+9) Equal/6-4 72.4MB/s ± 1% 69.2MB/s ± 0% -4.40% (p=0.000 n=10+10) Equal/9-4 100MB/s ± 0% 89MB/s ± 0% -11.40% (p=0.000 n=9+10) Equal/15-4 138MB/s ± 1% 125MB/s ± 1% -9.41% (p=0.000 n=10+10) Equal/16-4 144MB/s ± 1% 196MB/s ± 0% +36.41% (p=0.000 n=10+10) Equal/20-4 162MB/s ± 1% 231MB/s ± 1% +42.98% (p=0.000 n=9+10) Equal/32-4 200MB/s ± 1% 331MB/s ± 1% +65.64% (p=0.000 n=10+10) Equal/4K-4 292MB/s ± 0% 1149MB/s ± 1% +293.19% (p=0.000 n=9+10) Equal/4M-4 328MB/s ± 1% 1307MB/s ± 0% +298.87% (p=0.000 n=9+9) Equal/64M-4 329MB/s ± 1% 1306MB/s ± 0% +296.56% (p=0.000 n=10+10) EqualPort/1-4 21.3MB/s ± 1% 21.4MB/s ± 0% +0.42% (p=0.002 n=10+9) EqualPort/6-4 72.6MB/s ± 1% 73.2MB/s ± 1% +0.87% (p=0.003 n=10+10) EqualPort/32-4 138MB/s ± 0% 138MB/s ± 0% ~ (p=0.953 n=9+10) EqualPort/4K-4 141MB/s ± 1% 141MB/s ± 1% ~ (p=0.382 n=10+10) EqualPort/4M-4 175MB/s ± 1% 176MB/s ± 0% +0.65% (p=0.001 n=9+9) EqualPort/64M-4 175MB/s ± 1% 176MB/s ± 0% ~ (p=0.225 n=10+10) The 5-12% decrease in performance on Equal/{6,9,15} are due to the benchmarks splitting the bytes buffer in half. The b argument to Equal then ends up being unaligned and thus the fast word-wise compare doesn't kick in. Updates #29001 Change-Id: I73be501c18e67d211ed19da7771b4f254254e609 Reviewed-on: https://go-review.googlesource.com/c/go/+/167557 Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>
a734601b · Tobias Klauser · Tobias Klauser · 746f405f · a734601b
Commit a734601b authored Mar 14, 2019 by Tobias Klauser Committed by Tobias Klauser Mar 17, 2019
Hide whitespace changes
Inline Side-by-side

Showing with 48 additions and 9 deletions

src/internal/bytealg/equal_arm.s src/internal/bytealg/equal_arm.s +48 -9

No files found.
--- a/src/internal/bytealg/equal_arm.s
+++ b/src/internal/bytealg/equal_arm.s
@@ -8,14 +8,19 @@
 TEXT ·Equal(SB),NOSPLIT,$0-25
 	MOVW	a_len+4(FP), R1
 	MOVW	b_len+16(FP), R3
-
 	CMP	R1, R3		// unequal lengths are not equal
 	B.NE	notequal
+	CMP	$0, R1		// short path to handle 0-byte case
+	B.EQ	equal

 	MOVW	a_base+0(FP), R0
 	MOVW	b_base+12(FP), R2
 	MOVW	$ret+24(FP), R7
 	B	memeqbody<>(SB)
+equal:
+	MOVW	$1, R0
+	MOVB	R0, ret+24(FP)
+	RET
 notequal:
 	MOVW	$0, R0
 	MOVBU	R0, ret+24(FP)
@@ -28,6 +33,8 @@ TEXT runtime·memequal(SB),NOSPLIT|NOFRAME,$0-13
 	CMP	R0, R2
 	B.EQ	eq
 	MOVW	size+8(FP), R1
+	CMP	$0, R1
+	B.EQ	eq		// short path to handle 0-byte case
 	MOVW	$ret+12(FP), R7
 	B	memeqbody<>(SB)
 eq:
@@ -41,7 +48,9 @@ TEXT runtime·memequal_varlen(SB),NOSPLIT|NOFRAME,$0-9
 	MOVW	b+4(FP), R2
 	CMP	R0, R2
 	B.EQ	eq
-	MOVW	4(R7), R1    // compiler stores size at offset 4 in the closure
+	MOVW	4(R7), R1	// compiler stores size at offset 4 in the closure
+	CMP	$0, R1
+	B.EQ	eq		// short path to handle 0-byte case
 	MOVW	$ret+8(FP), R7
 	B	memeqbody<>(SB)
 eq:
@@ -54,20 +63,50 @@ eq:
 // R1: length
 // R2: data of b
 // R7: points to return value
+//
+// On exit:
+// R4, R5 and R6 are clobbered
 TEXT memeqbody<>(SB),NOSPLIT|NOFRAME,$0-0
-	ADD	R0, R1		// end
-loop:
+	CMP	$1, R1
+	B.EQ	one		// 1-byte special case for better performance
+
+	CMP	$4, R1
+	ADD	R0, R1		// R1 is the end of the range to compare
+	B.LT	byte_loop	// length < 4
+	AND	$3, R0, R6
+	CMP	$0, R6
+	B.NE	byte_loop	// unaligned a, use byte-wise compare (TODO: try to align a)
+	AND	$3, R2, R6
+	CMP	$0, R6
+	B.NE	byte_loop	// unaligned b, use byte-wise compare
+	AND	$0xfffffffc, R1, R6
+	// length >= 4
+chunk4_loop:
+	MOVW.P	4(R0), R4
+	MOVW.P	4(R2), R5
+	CMP	R4, R5
+	B.NE	notequal
+	CMP	R0, R6
+	B.NE	chunk4_loop
 	CMP	R0, R1
 	B.EQ	equal		// reached the end
+byte_loop:
 	MOVBU.P	1(R0), R4
 	MOVBU.P	1(R2), R5
 	CMP	R4, R5
-	B.EQ	loop
-notequal:
-	MOVW	$0, R0
-	MOVB	R0, (R7)
-	RET
+	B.NE	notequal
+	CMP	R0, R1
+	B.NE	byte_loop
 equal:
 	MOVW	$1, R0
 	MOVB	R0, (R7)
 	RET
+one:
+	MOVBU	(R0), R4
+	MOVBU	(R2), R5
+	CMP	R4, R5
+	B.EQ	equal
+notequal:
+	MOVW	$0, R0
+	MOVB	R0, (R7)
+	RET