    crypto: arm64/chacha - simplify tail block handling · c4fc6328
    Ard Biesheuvel authored
    Based on lessons learnt from optimizing the 32-bit version of this driver,
    we can simplify the arm64 version considerably, by reordering the final
    two stores when the last block is shorter than 64 bytes. This removes
    the need to use permutation instructions to calculate the elements that are
    clobbered by the final overlapping store, given that the store of the
    penultimate block now follows it, and that one carries the correct values
    for those elements already.
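
The store reordering described above can be sketched in plain C. This is only an illustration under stated assumptions, not the driver's code: `store_tail`, `penult`, `last` and `BLOCK` are hypothetical names, `memcpy` stands in for the NEON vector stores, and the sketch assumes the output is longer than one block and not a multiple of 64 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 64

/*
 * Hypothetical helper: write the last two blocks of a `len`-byte output.
 * `penult` is the complete penultimate block; only the first len % BLOCK
 * bytes of `last` are valid output.
 */
static void store_tail(uint8_t *dst, size_t len,
                       const uint8_t penult[BLOCK],
                       const uint8_t last[BLOCK])
{
    size_t rem = len % BLOCK;
    uint8_t shifted[BLOCK];

    assert(len > BLOCK && rem != 0);

    /* Move the valid tail bytes to the end of the vector; the leading
     * bytes are deliberately left as junk. */
    memset(shifted, 0xAA, BLOCK);
    memcpy(shifted + BLOCK - rem, last, rem);

    /* Final store first: it ends exactly at dst + len and clobbers the
     * last BLOCK - rem bytes of the penultimate block's region. */
    memcpy(dst + len - BLOCK, shifted, BLOCK);

    /* Penultimate store second: it rewrites the clobbered bytes with
     * the correct values, so no fix-up of `shifted` is needed. */
    memcpy(dst + len - rem - BLOCK, penult, BLOCK);
}
```

With the stores in the opposite order, the leading bytes of `shifted` would have to be filled with the matching bytes of `penult` before storing, which is the extra permutation work the commit message says is no longer needed.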
    
    While at it, simplify the overlapping loads as well, by calculating the
    address of the final overlapping load upfront, and switching to this
    address for every load that would otherwise extend past the end of the
    source buffer.
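
A minimal C sketch of the load-address selection described above, again under assumptions: `tail_safe_load_addr` and `BLOCK` are hypothetical names, and the sketch assumes the source buffer holds at least 64 bytes. The actual driver works on NEON registers in assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK 64

/*
 * Hypothetical helper: return the address to load 64-byte block `i` from.
 * Any load that would run past src + len is redirected to the single
 * overlapping address that ends exactly at the end of the buffer.
 */
static const uint8_t *tail_safe_load_addr(const uint8_t *src, size_t len,
                                          size_t i)
{
    assert(len >= BLOCK);

    /* Address of the final overlapping load, computed once up front. */
    const uint8_t *final = src + len - BLOCK;

    /* Block i is fully inside the buffer: load it in place. */
    if ((i + 1) * BLOCK <= len)
        return src + i * BLOCK;

    /* Otherwise switch to the precomputed overlapping address. */
    return final;
}
```

For a 100-byte buffer, block 0 is loaded from `src`, while block 1 and every later block are loaded from `src + 36`, so no load reads past the end of the source.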
    
    There is no impact on performance, but the resulting code is substantially
    smaller and easier to follow.
    
    Cc: Eric Biggers <ebiggers@google.com>
    Cc: "Jason A . Donenfeld" <Jason@zx2c4.com>
    Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
    Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
chacha-neon-core.S 18.6 KB