Commit a734601b authored by Tobias Klauser's avatar Tobias Klauser Committed by Tobias Klauser

internal/bytealg: use word-wise comparison for Equal on arm

Follow CL 165338 and use word-wise comparison for aligned buffers in
Equal on arm, otherwise fall back to the current byte-wise comparison.

name                 old time/op    new time/op    delta
Equal/0-4              25.7ns ± 1%    23.5ns ± 1%    -8.78%  (p=0.000 n=10+10)
Equal/1-4              65.8ns ± 0%    60.1ns ± 1%    -8.69%  (p=0.000 n=10+9)
Equal/6-4              82.9ns ± 1%    86.7ns ± 0%    +4.59%  (p=0.000 n=10+10)
Equal/9-4              90.0ns ± 0%   101.0ns ± 0%   +12.18%  (p=0.000 n=9+10)
Equal/15-4              108ns ± 0%     119ns ± 0%   +10.19%  (p=0.000 n=8+8)
Equal/16-4              111ns ± 0%      82ns ± 0%   -26.37%  (p=0.000 n=8+10)
Equal/20-4              124ns ± 1%      87ns ± 1%   -29.94%  (p=0.000 n=9+10)
Equal/32-4              160ns ± 1%      97ns ± 1%   -39.40%  (p=0.000 n=10+10)
Equal/4K-4             14.0µs ± 0%     3.6µs ± 1%   -74.57%  (p=0.000 n=9+10)
Equal/4M-4             12.8ms ± 1%     3.2ms ± 0%   -74.93%  (p=0.000 n=9+9)
Equal/64M-4             204ms ± 1%      51ms ± 0%   -74.78%  (p=0.000 n=10+10)
EqualPort/1-4          47.0ns ± 1%    46.8ns ± 0%    -0.40%  (p=0.015 n=10+6)
EqualPort/6-4          82.6ns ± 1%    81.9ns ± 1%    -0.87%  (p=0.002 n=10+10)
EqualPort/32-4          232ns ± 0%     232ns ± 0%      ~     (p=0.496 n=8+10)
EqualPort/4K-4         29.0µs ± 1%    29.0µs ± 1%      ~     (p=0.604 n=9+10)
EqualPort/4M-4         24.0ms ± 1%    23.8ms ± 0%    -0.65%  (p=0.001 n=9+9)
EqualPort/64M-4         383ms ± 1%     382ms ± 0%      ~     (p=0.218 n=10+10)
CompareBytesEqual-4    61.2ns ± 1%    61.0ns ± 1%      ~     (p=0.539 n=10+10)

name                 old speed      new speed      delta
Equal/1-4            15.2MB/s ± 0%  16.6MB/s ± 1%    +9.52%  (p=0.000 n=10+9)
Equal/6-4            72.4MB/s ± 1%  69.2MB/s ± 0%    -4.40%  (p=0.000 n=10+10)
Equal/9-4             100MB/s ± 0%    89MB/s ± 0%   -11.40%  (p=0.000 n=9+10)
Equal/15-4            138MB/s ± 1%   125MB/s ± 1%    -9.41%  (p=0.000 n=10+10)
Equal/16-4            144MB/s ± 1%   196MB/s ± 0%   +36.41%  (p=0.000 n=10+10)
Equal/20-4            162MB/s ± 1%   231MB/s ± 1%   +42.98%  (p=0.000 n=9+10)
Equal/32-4            200MB/s ± 1%   331MB/s ± 1%   +65.64%  (p=0.000 n=10+10)
Equal/4K-4            292MB/s ± 0%  1149MB/s ± 1%  +293.19%  (p=0.000 n=9+10)
Equal/4M-4            328MB/s ± 1%  1307MB/s ± 0%  +298.87%  (p=0.000 n=9+9)
Equal/64M-4           329MB/s ± 1%  1306MB/s ± 0%  +296.56%  (p=0.000 n=10+10)
EqualPort/1-4        21.3MB/s ± 1%  21.4MB/s ± 0%    +0.42%  (p=0.002 n=10+9)
EqualPort/6-4        72.6MB/s ± 1%  73.2MB/s ± 1%    +0.87%  (p=0.003 n=10+10)
EqualPort/32-4        138MB/s ± 0%   138MB/s ± 0%      ~     (p=0.953 n=9+10)
EqualPort/4K-4        141MB/s ± 1%   141MB/s ± 1%      ~     (p=0.382 n=10+10)
EqualPort/4M-4        175MB/s ± 1%   176MB/s ± 0%    +0.65%  (p=0.001 n=9+9)
EqualPort/64M-4       175MB/s ± 1%   176MB/s ± 0%      ~     (p=0.225 n=10+10)

The 5-12% decrease in performance on Equal/{6,9,15} are due to the
benchmarks splitting the bytes buffer in half. The b argument to Equal
then ends up being unaligned and thus the fast word-wise compare doesn't
kick in.

Updates #29001

Change-Id: I73be501c18e67d211ed19da7771b4f254254e609
Reviewed-on: https://go-review.googlesource.com/c/go/+/167557
Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: default avatarCherry Zhang <cherryyz@google.com>
parent 746f405f
......@@ -8,14 +8,19 @@
TEXT ·Equal(SB),NOSPLIT,$0-25
MOVW a_len+4(FP), R1
MOVW b_len+16(FP), R3
CMP R1, R3 // unequal lengths are not equal
B.NE notequal
CMP $0, R1 // short path to handle 0-byte case
B.EQ equal
MOVW a_base+0(FP), R0
MOVW b_base+12(FP), R2
MOVW $ret+24(FP), R7
B memeqbody<>(SB)
equal:
MOVW $1, R0
MOVB R0, ret+24(FP)
RET
notequal:
MOVW $0, R0
MOVBU R0, ret+24(FP)
......@@ -28,6 +33,8 @@ TEXT runtime·memequal(SB),NOSPLIT|NOFRAME,$0-13
CMP R0, R2
B.EQ eq
MOVW size+8(FP), R1
CMP $0, R1
B.EQ eq // short path to handle 0-byte case
MOVW $ret+12(FP), R7
B memeqbody<>(SB)
eq:
......@@ -41,7 +48,9 @@ TEXT runtime·memequal_varlen(SB),NOSPLIT|NOFRAME,$0-9
MOVW b+4(FP), R2
CMP R0, R2
B.EQ eq
MOVW 4(R7), R1 // compiler stores size at offset 4 in the closure
MOVW 4(R7), R1 // compiler stores size at offset 4 in the closure
CMP $0, R1
B.EQ eq // short path to handle 0-byte case
MOVW $ret+8(FP), R7
B memeqbody<>(SB)
eq:
......@@ -54,20 +63,50 @@ eq:
// R1: length
// R2: data of b
// R7: points to return value
//
// On exit:
// R4, R5 and R6 are clobbered
TEXT memeqbody<>(SB),NOSPLIT|NOFRAME,$0-0
ADD R0, R1 // end
loop:
CMP $1, R1
B.EQ one // 1-byte special case for better performance
CMP $4, R1
ADD R0, R1 // R1 is the end of the range to compare
B.LT byte_loop // length < 4
AND $3, R0, R6
CMP $0, R6
B.NE byte_loop // unaligned a, use byte-wise compare (TODO: try to align a)
AND $3, R2, R6
CMP $0, R6
B.NE byte_loop // unaligned b, use byte-wise compare
AND $0xfffffffc, R1, R6
// length >= 4
chunk4_loop:
MOVW.P 4(R0), R4
MOVW.P 4(R2), R5
CMP R4, R5
B.NE notequal
CMP R0, R6
B.NE chunk4_loop
CMP R0, R1
B.EQ equal // reached the end
byte_loop:
MOVBU.P 1(R0), R4
MOVBU.P 1(R2), R5
CMP R4, R5
B.EQ loop
notequal:
MOVW $0, R0
MOVB R0, (R7)
RET
B.NE notequal
CMP R0, R1
B.NE byte_loop
equal:
MOVW $1, R0
MOVB R0, (R7)
RET
one:
MOVBU (R0), R4
MOVBU (R2), R5
CMP R4, R5
B.EQ equal
notequal:
MOVW $0, R0
MOVB R0, (R7)
RET
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment