1. 14 Jan, 2021 35 commits
  2. 08 Jan, 2021 3 commits
    • crypto: vmx - Move extern declarations into header file · 622aae87
      Herbert Xu authored
      This patch moves the extern algorithm declarations into a header
      file so that a number of compiler warnings are silenced.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: x86/aes-ni-xts - rewrite and drop indirections via glue helper · 2481104f
      Ard Biesheuvel authored
      The AES-NI driver implements XTS via the glue helper, which consumes
      a struct with sets of function pointers which are invoked on chunks
      of input data of the appropriate size, as annotated in the struct.
      
      Let's get rid of this indirection, so that we can perform direct calls
      to the assembler helpers. Instead, let's adopt the arm64 strategy, i.e.,
      provide a helper which can consume inputs of any size, provided that the
      penultimate, full block is passed via the last call if ciphertext stealing
      needs to be applied.
      
      This also allows us to enable the XTS mode for i386.
      
      Tested-by: Eric Biggers <ebiggers@google.com> # x86_64
      Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
      Reported-by: kernel test robot <lkp@intel.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
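The calling convention described in the commit above (any input size is accepted, as long as the last full block travels with the partial tail in the final call when ciphertext stealing applies) can be sketched as a length split. This is only an illustration: `struct xts_split` and `xts_split_len()` are hypothetical names and are not taken from the kernel source.

```c
#include <assert.h>
#include <stddef.h>

#define XTS_BLOCK_SIZE 16 /* AES block size in bytes */

/* Hypothetical helper type: how a caller might split one XTS request. */
struct xts_split {
	size_t bulk; /* bytes that may go in earlier calls (multiple of 16) */
	size_t last; /* bytes that must go in the final call (0 if no CTS) */
};

static struct xts_split xts_split_len(size_t len)
{
	struct xts_split s;
	size_t tail = len % XTS_BLOCK_SIZE;

	/* XTS needs at least one full block of input. */
	assert(len >= XTS_BLOCK_SIZE);

	/*
	 * With a partial tail, ciphertext stealing is needed, so the last
	 * full block is kept together with the tail for the final call.
	 */
	s.last = tail ? XTS_BLOCK_SIZE + tail : 0;
	s.bulk = len - s.last;
	return s;
}
```

Under these assumptions, a 100-byte request splits into 80 bulk bytes plus a 20-byte final call, while a 512-byte sector needs no special final call at all.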
    • crypto: x86/aes-ni-xts - use direct calls to and 4-way stride · 86ad60a6
      Ard Biesheuvel authored
      The XTS asm helper arrangement is a bit odd: the 8-way stride helper
      consists of back-to-back calls to the 4-way core transforms, which
      are called indirectly, based on a boolean that indicates whether we
      are performing encryption or decryption.
      
      Given how costly indirect calls are on x86, let's switch to direct
      calls, and given how the 8-way stride doesn't really add anything
      substantial, use a 4-way stride instead, and make the asm core
      routine deal with any multiple of 4 blocks. Since 512 byte sectors
      or 4 KB blocks are the typical quantities XTS operates on, increase
      the stride exported to the glue helper to 512 bytes as well.
      
      As a result, the number of indirect calls is reduced from 3 per 64 bytes
      of in/output to 1 per 512 bytes of in/output, which produces a 65% speedup
      when operating on 1 KB blocks (measured on an Intel(R) Core(TM) i7-8650U CPU).
      
      Fixes: 9697fa39 ("x86/retpoline/crypto: Convert crypto assembler indirect jumps")
      Tested-by: Eric Biggers <ebiggers@google.com> # x86_64
      Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
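The call-count arithmetic in the commit above (3 indirect calls per 64 bytes before, 1 per 512 bytes after) can be worked through for concrete buffer sizes. The sketch below is a back-of-the-envelope model only: the function names are made up for illustration, and rounding a partial stride up to a whole one is an assumption rather than something the commit states.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of indirect calls needed for len bytes when calls_per_stride
 * indirect calls are issued for every stride bytes of input.
 * Assumption: a partial stride costs as much as a full one.
 */
static size_t indirect_calls(size_t len, size_t stride, size_t calls_per_stride)
{
	assert(stride > 0);
	return (len + stride - 1) / stride * calls_per_stride;
}

/* Before the change: 3 indirect calls per 64 bytes. */
static size_t calls_before(size_t len)
{
	return indirect_calls(len, 64, 3);
}

/* After the change: 1 indirect call per 512 bytes. */
static size_t calls_after(size_t len)
{
	return indirect_calls(len, 512, 1);
}
```

For the 1 KB case quoted in the commit, this model gives 48 indirect calls before and 2 after; for a 4 KB block it gives 192 and 8.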
  3. 02 Jan, 2021 2 commits
    • crypto: picoxcell - Remove PicoXcell driver · fecff3b9
      Rob Herring authored
      PicoXcell has had nothing but treewide cleanups for at least the last 8
      years and no signs of activity. The most recent activity is a yocto vendor
      kernel based on v3.0 in 2015.
      
      Cc: Jamie Iles <jamie@jamieiles.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: linux-crypto@vger.kernel.org
      Signed-off-by: Rob Herring <robh@kernel.org>
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: arm/blake2b - add NEON-accelerated BLAKE2b · 1862eb00
      Eric Biggers authored
      Add a NEON-accelerated implementation of BLAKE2b.
      
      On Cortex-A7 (which these days is the most common ARM processor that
      doesn't have the ARMv8 Crypto Extensions), this is over twice as fast as
      SHA-256, and slightly faster than SHA-1.  It is also almost three times
      as fast as the generic implementation of BLAKE2b:
      
      	Algorithm            Cycles per byte (on 4096-byte messages)
      	===================  =======================================
      	blake2b-256-neon     14.0
      	sha1-neon            16.3
      	blake2s-256-arm      18.8
      	sha1-asm             20.8
      	blake2s-256-generic  26.0
      	sha256-neon          28.9
      	sha256-asm           32.0
      	blake2b-256-generic  38.9
      
      This implementation isn't directly based on any other implementation,
      but it borrows some ideas from previous NEON code I've written as well
      as from chacha-neon-core.S.  At least on Cortex-A7, it is faster than
      the other NEON implementations of BLAKE2b I'm aware of (the
      implementation in the BLAKE2 official repository using intrinsics, and
      Andrew Moon's implementation which can be found in SUPERCOP).  It does
      only one block at a time, so it performs well on short messages too.
      
      NEON-accelerated BLAKE2b is useful because there is interest in using
      BLAKE2b-256 for dm-verity on low-end Android devices (specifically,
      devices that lack the ARMv8 Crypto Extensions) to replace SHA-1.  On
      these devices, the performance cost of upgrading to SHA-256 may be
      unacceptable, whereas BLAKE2b-256 would actually improve performance.
      
      Although BLAKE2b is intended for 64-bit platforms (unlike BLAKE2s which
      is intended for 32-bit platforms), on 32-bit ARM processors with NEON,
      BLAKE2b is actually faster than BLAKE2s.  This is because NEON supports
      64-bit operations, and because BLAKE2s's block size is too small for
      NEON to be helpful for it.  The best I've been able to do with BLAKE2s
      on Cortex-A7 is 18.8 cpb with an optimized scalar implementation.
      
      (I didn't try BLAKE2sp and BLAKE3, which in theory would be faster, but
      they're more complex as they require running multiple hashes at once.
      Note that BLAKE2b already uses all the NEON bandwidth on the Cortex-A7,
      so I expect that any speedup from BLAKE2sp or BLAKE3 would come only
      from the smaller number of rounds, not from the extra parallelism.)
      
      For now this BLAKE2b implementation is only wired up to the shash API,
      since there is no library API for BLAKE2b yet.  However, I've tried to
      keep things consistent with BLAKE2s, e.g. by defining
      blake2b_compress_arch() which is analogous to blake2s_compress_arch()
      and could be exported for use by the library API later if needed.
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Tested-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
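The relative-speed claims in the BLAKE2b commit above ("over twice as fast as SHA-256", "almost three times as fast as the generic implementation") follow directly from its cycles-per-byte table. The sketch below copies the table values verbatim and derives the ratios; the helper names `cpb_of()` and `speedup()` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct cpb_entry {
	const char *name;
	double cpb; /* cycles per byte on 4096-byte messages (Cortex-A7) */
};

/* Values copied from the table in the commit message. */
static const struct cpb_entry cpb_table[] = {
	{ "blake2b-256-neon",    14.0 },
	{ "sha1-neon",           16.3 },
	{ "blake2s-256-arm",     18.8 },
	{ "sha1-asm",            20.8 },
	{ "blake2s-256-generic", 26.0 },
	{ "sha256-neon",         28.9 },
	{ "sha256-asm",          32.0 },
	{ "blake2b-256-generic", 38.9 },
};

/* Look up an algorithm by name; returns -1.0 if it is not in the table. */
static double cpb_of(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(cpb_table) / sizeof(cpb_table[0]); i++)
		if (strcmp(cpb_table[i].name, name) == 0)
			return cpb_table[i].cpb;
	return -1.0;
}

/* How many times faster 'fast' is than 'slow' (ratio of cycles per byte). */
static double speedup(const char *fast, const char *slow)
{
	double f = cpb_of(fast);
	double s = cpb_of(slow);

	assert(f > 0.0 && s > 0.0);
	return s / f;
}
```

With these numbers, blake2b-256-neon comes out at roughly 2.06x sha256-neon, 2.78x blake2b-256-generic, and 1.16x sha1-neon, which matches the wording of the commit message.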