    crypto: arm/blake2s - add ARM scalar optimized BLAKE2s · 5172d322
    Eric Biggers authored
    Add an ARM scalar optimized implementation of BLAKE2s.
    
    NEON isn't very useful for BLAKE2s because the BLAKE2s block size is too
    small for NEON to help.  Each NEON instruction would depend on the
    previous one, resulting in poor performance.
    
    With scalar instructions, on the other hand, we can take advantage of
    ARM's "free" rotations (like I did in chacha-scalar-core.S) to get an
    implementation that runs much faster than the C implementation.
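
For reference, the rotations in question come from the BLAKE2s mixing function G, which rotates right by 16, 12, 8, and 7 bits. ARM data-processing instructions can take a rotated register as their second operand, so in assembly a rotation can often be folded into the add or eor that consumes the value. The portable C sketch below only shows where the rotations occur; the helper names ror32 and blake2s_g are illustrative assumptions, and this is not the assembly code added by the commit.

```c
#include <stdint.h>

/* Rotate a 32-bit word right by n bits; valid for 0 < n < 32. */
static inline uint32_t ror32(uint32_t x, unsigned int n)
{
	return (x >> n) | (x << (32 - n));
}

/*
 * BLAKE2s mixing function G applied to state words v[a], v[b], v[c], v[d]
 * with message words mx and my.  Each ror32() call is a rotation that a
 * scalar ARM implementation can try to merge into a neighbouring
 * instruction's shifted operand instead of issuing it separately.
 * (Hypothetical helper name; illustrative only.)
 */
static void blake2s_g(uint32_t v[16], int a, int b, int c, int d,
		      uint32_t mx, uint32_t my)
{
	v[a] += v[b] + mx;
	v[d] = ror32(v[d] ^ v[a], 16);
	v[c] += v[d];
	v[b] = ror32(v[b] ^ v[c], 12);
	v[a] += v[b] + my;
	v[d] = ror32(v[d] ^ v[a], 8);
	v[c] += v[d];
	v[b] = ror32(v[b] ^ v[c], 7);
}
```

A C compiler may or may not fold these rotations into shifted operands on its own; hand-written assembly makes that scheduling explicit.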
    
    Performance results on Cortex-A7 in cycles per byte using the shash API:
    
    	4096-byte messages:
    		blake2s-256-arm:     18.8
    		blake2s-256-generic: 26.0
    
    	500-byte messages:
    		blake2s-256-arm:     20.3
    		blake2s-256-generic: 27.9
    
    	100-byte messages:
    		blake2s-256-arm:     29.7
    		blake2s-256-generic: 39.2
    
    	32-byte messages:
    		blake2s-256-arm:     50.6
    		blake2s-256-generic: 66.2
    
    Except on very short messages, this is still slower than the NEON
    implementation of BLAKE2b which I've written; that is 14.0, 16.4, 25.8,
    and 76.1 cpb on 4096, 500, 100, and 32-byte messages, respectively.
    However, optimized BLAKE2s is useful for cases where BLAKE2s is used
    instead of BLAKE2b, such as WireGuard.
    
    This new implementation is added in the form of a new module
    blake2s-arm.ko, which is analogous to blake2s-x86_64.ko in that it
    provides blake2s_compress_arch() for use by the library API and
    optionally registers the algorithms with the shash API.
    
    Acked-by: Ard Biesheuvel <ardb@kernel.org>
    Signed-off-by: Eric Biggers <ebiggers@google.com>
    Tested-by: Ard Biesheuvel <ardb@kernel.org>
    Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
blake2s-glue.c 2.39 KB