    crypto: arm/chacha-neon - optimize for non-block size multiples · 86cd97ec
    Ard Biesheuvel authored
    The current NEON based ChaCha implementation for ARM is optimized for
    multiples of 4x the ChaCha block size (64 bytes). This makes sense for
    block encryption, but given that ChaCha is also often used in the
    context of networking, it makes sense to consider arbitrary length
    inputs as well.
    
    For example, WireGuard typically uses 1420 byte packets, and performing
    ChaCha encryption involves 5 invocations of chacha_4block_xor_neon()
    and 3 invocations of chacha_block_xor_neon(), where the last one also
    involves a memcpy() using a buffer on the stack to process the final
    chunk of 1420 % 64 == 12 bytes.
    
    Let's optimize for this case as well, by letting chacha_4block_xor_neon()
    deal with any input size between 64 and 256 bytes, using NEON permutation
    instructions and overlapping loads and stores. This way, the 140 byte
    tail of a 1420 byte input buffer can simply be processed in one go.
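
As a rough illustration of the call counts described above, the following C sketch tallies how many times each routine would run before and after the change. It is only a model of the dispatch arithmetic, not the kernel's glue code: `struct chacha_call_count`, `count_calls_before()` and `count_calls_after()` are hypothetical names, and the choice to send an input of exactly 64 bytes to the single-block routine is an assumption.

```c
#include <stddef.h>

#define CHACHA_BLOCK_SIZE 64 /* bytes per ChaCha block, as stated above */

/* Hypothetical tally of how the work is split between the two routines. */
struct chacha_call_count {
	unsigned int four_block; /* chacha_4block_xor_neon() invocations */
	unsigned int one_block;  /* chacha_block_xor_neon() invocations */
	size_t bounce_bytes;     /* bytes copied through the stack buffer */
};

/* Before: the 4-block routine only accepts full 256-byte chunks. */
struct chacha_call_count count_calls_before(size_t bytes)
{
	struct chacha_call_count c = { 0, 0, 0 };

	while (bytes >= 4 * CHACHA_BLOCK_SIZE) {
		c.four_block++;
		bytes -= 4 * CHACHA_BLOCK_SIZE;
	}
	while (bytes >= CHACHA_BLOCK_SIZE) {
		c.one_block++;
		bytes -= CHACHA_BLOCK_SIZE;
	}
	if (bytes) {
		/* Partial final block: bounced via memcpy() and a stack buffer. */
		c.one_block++;
		c.bounce_bytes = bytes;
	}
	return c;
}

/* After: the 4-block routine accepts any chunk above 64 and up to 256 bytes. */
struct chacha_call_count count_calls_after(size_t bytes)
{
	struct chacha_call_count c = { 0, 0, 0 };

	while (bytes > CHACHA_BLOCK_SIZE) {
		size_t chunk = bytes < 4 * CHACHA_BLOCK_SIZE ?
			       bytes : 4 * CHACHA_BLOCK_SIZE;

		c.four_block++;
		bytes -= chunk;
	}
	if (bytes) {
		/* Assumption: 64 bytes or fewer still use the 1-block routine. */
		c.one_block++;
		if (bytes < CHACHA_BLOCK_SIZE)
			c.bounce_bytes = bytes;
	}
	return c;
}
```

For a 1420 byte input, `count_calls_before()` reports 5 four-block calls, 3 single-block calls and a 12 byte bounce copy, while `count_calls_after()` reports 6 four-block calls and no bounce copy, since the 140 byte tail fits in one call.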
    
    This results in the following performance improvements for 1420 byte
    inputs, without significant impact on power-of-2 input sizes. (Note
    that Raspberry Pi is widely used in combination with a 32-bit kernel,
    even though the core is 64-bit capable.)
    
       Cortex-A8  (BeagleBone)       :   7%
       Cortex-A15 (Calxeda Midway)   :  21%
       Cortex-A53 (Raspberry Pi 3)   :   3%
       Cortex-A72 (Raspberry Pi 4)   :  19%
    
    Cc: Eric Biggers <ebiggers@google.com>
    Cc: "Jason A . Donenfeld" <Jason@zx2c4.com>
    Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
    Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
chacha-neon-core.S 14.7 KB