bytes: add asm version of Index for short strings on arm64
Currently we have special case for 1-byte strings, this extends it to strings shorter than 9 bytes on arm64. Benchmark results: name old time/op new time/op delta IndexByte/10-32 18.6ns ± 0% 18.1ns ± 0% -2.69% (p=0.008 n=5+5) IndexByte/32-32 16.8ns ± 1% 16.9ns ± 1% ~ (p=0.762 n=5+5) IndexByte/4K-32 464ns ± 0% 464ns ± 0% ~ (all equal) IndexByte/4M-32 528µs ± 1% 506µs ± 1% -4.17% (p=0.008 n=5+5) IndexByte/64M-32 18.7ms ± 0% 18.7ms ± 1% ~ (p=0.730 n=4+5) IndexBytePortable/10-32 33.8ns ± 0% 34.9ns ± 3% ~ (p=0.167 n=5+5) IndexBytePortable/32-32 65.3ns ± 0% 66.1ns ± 2% ~ (p=0.444 n=5+5) IndexBytePortable/4K-32 5.88µs ± 0% 5.88µs ± 0% ~ (p=0.325 n=5+5) IndexBytePortable/4M-32 6.03ms ± 0% 6.03ms ± 0% ~ (p=1.000 n=5+5) IndexBytePortable/64M-32 98.8ms ± 0% 98.9ms ± 0% +0.10% (p=0.008 n=5+5) IndexRune/10-32 57.7ns ± 0% 49.2ns ± 0% -14.73% (p=0.000 n=5+4) IndexRune/32-32 57.7ns ± 0% 58.6ns ± 0% +1.56% (p=0.008 n=5+5) IndexRune/4K-32 511ns ± 0% 513ns ± 0% +0.39% (p=0.008 n=5+5) IndexRune/4M-32 527µs ± 1% 527µs ± 1% ~ (p=0.690 n=5+5) IndexRune/64M-32 18.7ms ± 0% 18.7ms ± 1% ~ (p=0.190 n=4+5) IndexRuneASCII/10-32 23.8ns ± 0% 23.8ns ± 0% ~ (all equal) IndexRuneASCII/32-32 24.3ns ± 0% 24.3ns ± 0% ~ (all equal) IndexRuneASCII/4K-32 468ns ± 0% 468ns ± 0% ~ (all equal) IndexRuneASCII/4M-32 521µs ± 1% 531µs ± 2% +1.91% (p=0.016 n=5+5) IndexRuneASCII/64M-32 18.6ms ± 1% 18.5ms ± 0% ~ (p=0.730 n=5+4) Index/10-32 89.1ns ±13% 25.2ns ± 0% -71.72% (p=0.008 n=5+5) Index/32-32 225ns ± 2% 226ns ± 3% ~ (p=0.683 n=5+5) Index/4K-32 11.9µs ± 0% 11.8µs ± 0% -0.22% (p=0.008 n=5+5) Index/4M-32 12.1ms ± 0% 12.1ms ± 0% ~ (p=0.548 n=5+5) Index/64M-32 197ms ± 0% 197ms ± 0% ~ (p=0.690 n=5+5) IndexEasy/10-32 46.2ns ± 0% 22.1ns ± 8% -52.16% (p=0.008 n=5+5) IndexEasy/32-32 46.2ns ± 0% 47.2ns ± 0% +2.16% (p=0.008 n=5+5) IndexEasy/4K-32 499ns ± 0% 502ns ± 0% +0.44% (p=0.008 n=5+5) IndexEasy/4M-32 529µs ± 2% 529µs ± 1% ~ (p=0.841 n=5+5) IndexEasy/64M-32 18.6ms ± 1% 18.7ms ± 1% ~ (p=0.222 n=5+5) IndexAnyASCII/1:1-32 15.7ns ± 0% 15.7ns ± 0% ~ (all equal) IndexAnyASCII/1:2-32 17.2ns ± 0% 17.2ns ± 0% ~ (all equal) IndexAnyASCII/1:4-32 20.0ns ± 0% 20.0ns ± 0% ~ (all equal) IndexAnyASCII/1:8-32 34.8ns ± 0% 34.8ns ± 0% ~ (all equal) IndexAnyASCII/1:16-32 48.1ns ± 0% 48.1ns ± 0% ~ (all equal) IndexAnyASCII/16:1-32 97.9ns ± 1% 97.7ns ± 0% ~ (p=0.857 n=5+5) IndexAnyASCII/16:2-32 102ns ± 0% 102ns ± 0% ~ (all equal) IndexAnyASCII/16:4-32 116ns ± 1% 116ns ± 1% ~ (p=1.000 n=5+5) IndexAnyASCII/16:8-32 141ns ± 1% 141ns ± 0% ~ (p=0.571 n=5+4) IndexAnyASCII/16:16-32 178ns ± 0% 178ns ± 0% ~ (all equal) IndexAnyASCII/256:1-32 1.09µs ± 0% 1.09µs ± 0% ~ (all equal) IndexAnyASCII/256:2-32 1.09µs ± 0% 1.10µs ± 0% +0.27% (p=0.008 n=5+5) IndexAnyASCII/256:4-32 1.11µs ± 0% 1.11µs ± 0% ~ (p=0.397 n=5+5) IndexAnyASCII/256:8-32 1.10µs ± 0% 1.10µs ± 0% ~ (p=0.444 n=5+5) IndexAnyASCII/256:16-32 1.14µs ± 0% 1.14µs ± 0% ~ (all equal) IndexAnyASCII/4096:1-32 16.5µs ± 0% 16.5µs ± 0% ~ (p=1.000 n=5+5) IndexAnyASCII/4096:2-32 17.0µs ± 0% 17.0µs ± 0% ~ (p=0.159 n=5+4) IndexAnyASCII/4096:4-32 17.1µs ± 0% 17.1µs ± 0% ~ (p=0.921 n=4+5) IndexAnyASCII/4096:8-32 16.5µs ± 0% 16.5µs ± 0% ~ (p=0.460 n=5+5) IndexAnyASCII/4096:16-32 16.5µs ± 0% 16.5µs ± 0% ~ (p=0.794 n=5+4) IndexPeriodic/IndexPeriodic2-32 189µs ± 0% 189µs ± 0% ~ (p=0.841 n=5+5) IndexPeriodic/IndexPeriodic4-32 189µs ± 0% 189µs ± 0% -0.03% (p=0.016 n=5+4) IndexPeriodic/IndexPeriodic8-32 189µs ± 0% 189µs ± 0% ~ (p=0.651 n=5+5) IndexPeriodic/IndexPeriodic16-32 175µs ± 9% 174µs ± 7% ~ (p=1.000 n=5+5) IndexPeriodic/IndexPeriodic32-32 75.1µs ± 0% 75.1µs ± 0% ~ (p=0.690 n=5+5) IndexPeriodic/IndexPeriodic64-32 42.6µs ± 0% 44.7µs ± 0% +4.98% (p=0.008 n=5+5) name old speed new speed delta IndexByte/10-32 538MB/s ± 0% 552MB/s ± 0% +2.65% (p=0.008 n=5+5) IndexByte/32-32 1.90GB/s ± 1% 1.90GB/s ± 1% ~ (p=0.548 n=5+5) IndexByte/4K-32 8.82GB/s ± 0% 8.81GB/s ± 0% ~ (p=0.548 n=5+5) IndexByte/4M-32 7.95GB/s ± 1% 8.29GB/s ± 1% +4.35% (p=0.008 n=5+5) IndexByte/64M-32 3.58GB/s ± 0% 3.60GB/s ± 1% ~ (p=0.730 n=4+5) IndexBytePortable/10-32 296MB/s ± 0% 286MB/s ± 3% ~ (p=0.381 n=4+5) IndexBytePortable/32-32 490MB/s ± 0% 485MB/s ± 2% ~ (p=0.286 n=5+5) IndexBytePortable/4K-32 697MB/s ± 0% 697MB/s ± 0% ~ (p=0.413 n=5+5) IndexBytePortable/4M-32 696MB/s ± 0% 695MB/s ± 0% ~ (p=0.897 n=5+5) IndexBytePortable/64M-32 679MB/s ± 0% 678MB/s ± 0% -0.10% (p=0.008 n=5+5) IndexRune/10-32 173MB/s ± 0% 203MB/s ± 0% +17.24% (p=0.016 n=5+4) IndexRune/32-32 555MB/s ± 0% 546MB/s ± 0% -1.62% (p=0.008 n=5+5) IndexRune/4K-32 8.01GB/s ± 0% 7.98GB/s ± 0% -0.38% (p=0.008 n=5+5) IndexRune/4M-32 7.97GB/s ± 1% 7.95GB/s ± 1% ~ (p=0.690 n=5+5) IndexRune/64M-32 3.59GB/s ± 0% 3.58GB/s ± 1% ~ (p=0.190 n=4+5) IndexRuneASCII/10-32 420MB/s ± 0% 420MB/s ± 0% ~ (p=0.190 n=5+4) IndexRuneASCII/32-32 1.32GB/s ± 0% 1.32GB/s ± 0% ~ (p=0.333 n=5+5) IndexRuneASCII/4K-32 8.75GB/s ± 0% 8.75GB/s ± 0% ~ (p=0.690 n=5+5) IndexRuneASCII/4M-32 8.04GB/s ± 1% 7.89GB/s ± 2% -1.87% (p=0.016 n=5+5) IndexRuneASCII/64M-32 3.61GB/s ± 1% 3.62GB/s ± 0% ~ (p=0.730 n=5+4) Index/10-32 113MB/s ±14% 397MB/s ± 0% +249.76% (p=0.008 n=5+5) Index/32-32 142MB/s ± 2% 141MB/s ± 3% ~ (p=0.794 n=5+5) Index/4K-32 345MB/s ± 0% 346MB/s ± 0% +0.22% (p=0.008 n=5+5) Index/4M-32 345MB/s ± 0% 345MB/s ± 0% ~ (p=0.619 n=5+5) Index/64M-32 341MB/s ± 0% 341MB/s ± 0% ~ (p=0.595 n=5+5) IndexEasy/10-32 216MB/s ± 0% 453MB/s ± 8% +109.60% (p=0.008 n=5+5) IndexEasy/32-32 692MB/s ± 0% 678MB/s ± 0% -2.01% (p=0.008 n=5+5) IndexEasy/4K-32 8.19GB/s ± 0% 8.16GB/s ± 0% -0.45% (p=0.008 n=5+5) IndexEasy/4M-32 7.93GB/s ± 2% 7.93GB/s ± 1% ~ (p=0.841 n=5+5) IndexEasy/64M-32 3.60GB/s ± 1% 3.59GB/s ± 1% ~ (p=0.222 n=5+5) Change-Id: I4ca69378a2df6f9ba748c6a2706953ee1bd07343 Reviewed-on: https://go-review.googlesource.com/96555 Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>
Showing
Please register or sign in to comment