1. 14 Jan, 2021 24 commits
  2. 08 Jan, 2021 3 commits
    • crypto: vmx - Move extern declarations into header file · 622aae87
      Herbert Xu authored
      This patch moves the extern algorithm declarations into a header
      file so that a number of compiler warnings are silenced.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: x86/aes-ni-xts - rewrite and drop indirections via glue helper · 2481104f
      Ard Biesheuvel authored
      The AES-NI driver implements XTS via the glue helper, which consumes
      a struct with sets of function pointers which are invoked on chunks
      of input data of the appropriate size, as annotated in the struct.
      
      Let's get rid of this indirection, so that we can perform direct calls
      to the assembler helpers. Instead, let's adopt the arm64 strategy, i.e.,
      provide a helper which can consume inputs of any size, provided that the
      penultimate, full block is passed via the last call if ciphertext stealing
      needs to be applied.
      
      This also allows us to enable the XTS mode for i386.
      
      Tested-by: Eric Biggers <ebiggers@google.com> # x86_64
      Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: x86/aes-ni-xts - use direct calls and a 4-way stride · 86ad60a6
      Ard Biesheuvel authored
      The XTS asm helper arrangement is a bit odd: the 8-way stride helper
      consists of back-to-back calls to the 4-way core transforms, which
      are called indirectly, based on a boolean that indicates whether we
      are performing encryption or decryption.
      
      Given how costly indirect calls are on x86, let's switch to direct
      calls, and given how the 8-way stride doesn't really add anything
      substantial, use a 4-way stride instead, and make the asm core
      routine deal with any multiple of 4 blocks. Since 512 byte sectors
      or 4 KB blocks are the typical quantities XTS operates on, increase
      the stride exported to the glue helper to 512 bytes as well.
      
      As a result, the number of indirect calls is reduced from 3 per 64 bytes
      of in/output to 1 per 512 bytes of in/output, which produces a 65% speedup
      when operating on 1 KB blocks (measured on an Intel(R) Core(TM) i7-8650U CPU).
      
      Fixes: 9697fa39 ("x86/retpoline/crypto: Convert crypto assembler indirect jumps")
      Tested-by: Eric Biggers <ebiggers@google.com> # x86_64
      Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
  3. 02 Jan, 2021 13 commits
    • crypto: picoxcell - Remove PicoXcell driver · fecff3b9
      Rob Herring authored
      PicoXcell has had nothing but treewide cleanups for at least the last 8
      years and no signs of activity. The most recent activity is a yocto vendor
      kernel based on v3.0 in 2015.
      
      Cc: Jamie Iles <jamie@jamieiles.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: linux-crypto@vger.kernel.org
      Signed-off-by: Rob Herring <robh@kernel.org>
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: arm/blake2b - add NEON-accelerated BLAKE2b · 1862eb00
      Eric Biggers authored
      Add a NEON-accelerated implementation of BLAKE2b.
      
      On Cortex-A7 (which these days is the most common ARM processor that
      doesn't have the ARMv8 Crypto Extensions), this is over twice as fast as
      SHA-256, and slightly faster than SHA-1.  It is also almost three times
      as fast as the generic implementation of BLAKE2b:
      
      	Algorithm            Cycles per byte (on 4096-byte messages)
      	===================  =======================================
      	blake2b-256-neon     14.0
      	sha1-neon            16.3
      	blake2s-256-arm      18.8
      	sha1-asm             20.8
      	blake2s-256-generic  26.0
      	sha256-neon          28.9
      	sha256-asm           32.0
      	blake2b-256-generic  38.9
      
      This implementation isn't directly based on any other implementation,
      but it borrows some ideas from previous NEON code I've written as well
      as from chacha-neon-core.S.  At least on Cortex-A7, it is faster than
      the other NEON implementations of BLAKE2b I'm aware of (the
      implementation in the BLAKE2 official repository using intrinsics, and
      Andrew Moon's implementation which can be found in SUPERCOP).  It does
      only one block at a time, so it performs well on short messages too.
      
      NEON-accelerated BLAKE2b is useful because there is interest in using
      BLAKE2b-256 for dm-verity on low-end Android devices (specifically,
      devices that lack the ARMv8 Crypto Extensions) to replace SHA-1.  On
      these devices, the performance cost of upgrading to SHA-256 may be
      unacceptable, whereas BLAKE2b-256 would actually improve performance.
      
      Although BLAKE2b is intended for 64-bit platforms (unlike BLAKE2s which
      is intended for 32-bit platforms), on 32-bit ARM processors with NEON,
      BLAKE2b is actually faster than BLAKE2s.  This is because NEON supports
      64-bit operations, and because BLAKE2s's block size is too small for
      NEON to be helpful for it.  The best I've been able to do with BLAKE2s
      on Cortex-A7 is 18.8 cpb with an optimized scalar implementation.
      
      (I didn't try BLAKE2sp and BLAKE3, which in theory would be faster, but
      they're more complex as they require running multiple hashes at once.
      Note that BLAKE2b already uses all the NEON bandwidth on the Cortex-A7,
      so I expect that any speedup from BLAKE2sp or BLAKE3 would come only
      from the smaller number of rounds, not from the extra parallelism.)
      
      For now this BLAKE2b implementation is only wired up to the shash API,
      since there is no library API for BLAKE2b yet.  However, I've tried to
      keep things consistent with BLAKE2s, e.g. by defining
      blake2b_compress_arch() which is analogous to blake2s_compress_arch()
      and could be exported for use by the library API later if needed.
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Tested-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: blake2b - update file comment · 0cdc438e
      Eric Biggers authored
      The file comment for blake2b_generic.c makes it sound like it's the
      reference implementation of BLAKE2b with only minor changes.  But it's
      actually been changed a lot.  Update the comment to make this clearer.
      Reviewed-by: David Sterba <dsterba@suse.com>
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: blake2b - sync with blake2s implementation · 28dcca4c
      Eric Biggers authored
      Sync the BLAKE2b code with the BLAKE2s code as much as possible:
      
      - Move a lot of code into new headers <crypto/blake2b.h> and
        <crypto/internal/blake2b.h>, and adjust it to be like the
        corresponding BLAKE2s code, i.e. like <crypto/blake2s.h> and
        <crypto/internal/blake2s.h>.
      
      - Rename constants, e.g. BLAKE2B_*_DIGEST_SIZE => BLAKE2B_*_HASH_SIZE.
      
      - Use a macro BLAKE2B_ALG() to define the shash_alg structs.
      
      - Export blake2b_compress_generic() for use as a fallback.
      
      This makes it much easier to add optimized implementations of BLAKE2b,
      as optimized implementations can use the helper functions
      crypto_blake2b_{setkey,init,update,final}() and
      blake2b_compress_generic().  The ARM implementation will use these.
      
      But this change is also helpful because it eliminates unnecessary
      differences between the BLAKE2b and BLAKE2s code, so that the same
      improvements can easily be made to both.  (The two algorithms are
      basically identical, except for the word size and constants.)  It also
      makes it straightforward to add a library API for BLAKE2b in the future
      if/when it's needed.
      
      This change does make the BLAKE2b code slightly more complicated than it
      needs to be, as it doesn't actually provide a library API yet.  For
      example, __blake2b_update() doesn't really need to exist yet; it could
      just be inlined into crypto_blake2b_update().  But I believe this is
      outweighed by the benefits of keeping the code in sync.
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • wireguard: Kconfig: select CRYPTO_BLAKE2S_ARM · a64bfe7a
      Eric Biggers authored
      When available, select the new implementation of BLAKE2s for 32-bit ARM.
      This is faster than the generic C implementation.
      Reviewed-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: arm/blake2s - add ARM scalar optimized BLAKE2s · 5172d322
      Eric Biggers authored
      Add an ARM scalar optimized implementation of BLAKE2s.
      
      NEON isn't very useful for BLAKE2s because the BLAKE2s block size is too
      small for NEON to help.  Each NEON instruction would depend on the
      previous one, resulting in poor performance.
      
      With scalar instructions, on the other hand, we can take advantage of
      ARM's "free" rotations (like I did in chacha-scalar-core.S) to get an
      implementation that runs much faster than the C implementation.
      
      Performance results on Cortex-A7 in cycles per byte using the shash API:
      
      	4096-byte messages:
      		blake2s-256-arm:     18.8
      		blake2s-256-generic: 26.0
      
      	500-byte messages:
      		blake2s-256-arm:     20.3
      		blake2s-256-generic: 27.9
      
      	100-byte messages:
      		blake2s-256-arm:     29.7
      		blake2s-256-generic: 39.2
      
      	32-byte messages:
      		blake2s-256-arm:     50.6
      		blake2s-256-generic: 66.2
      
      Except on very short messages, this is still slower than the NEON
      implementation of BLAKE2b which I've written; that is 14.0, 16.4, 25.8,
      and 76.1 cpb on 4096, 500, 100, and 32-byte messages, respectively.
      However, optimized BLAKE2s is useful for cases where BLAKE2s is used
      instead of BLAKE2b, such as WireGuard.
      
      This new implementation is added in the form of a new module
      blake2s-arm.ko, which is analogous to blake2s-x86_64.ko in that it
      provides blake2s_compress_arch() for use by the library API as well as
      optionally registering the algorithms with the shash API.
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Tested-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: blake2s - include <linux/bug.h> instead of <asm/bug.h> · bbda6e0f
      Eric Biggers authored
      Address the following checkpatch warning:
      
      	WARNING: Use #include <linux/bug.h> instead of <asm/bug.h>
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: blake2s - adjust include guard naming · 8786841b
      Eric Biggers authored
      Use the full path in the include guards for the BLAKE2s headers to avoid
      ambiguity and to match the convention for most files in include/crypto/.
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: blake2s - add comment for blake2s_state fields · 7d87131f
      Eric Biggers authored
      The first three fields of 'struct blake2s_state' are used in assembly
      code, which isn't immediately obvious, so add a comment to this effect.
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: blake2s - optimize blake2s initialization · 42ad8cf8
      Eric Biggers authored
      If no key was provided, then don't waste time initializing the block
      buffer, as its initial contents won't be used.
      
      Also, make crypto_blake2s_init() and blake2s() call a single internal
      function __blake2s_init() which treats the key as optional, rather than
      conditionally calling blake2s_init() or blake2s_init_key().  This
      reduces the compiled code size, as previously both blake2s_init() and
      blake2s_init_key() were being inlined into these two callers, except
      when the key size passed to blake2s() was a compile-time constant.
      
      These optimizations aren't that significant for BLAKE2s.  However, the
      equivalent optimizations will be more significant for BLAKE2b, as
      everything is twice as big in BLAKE2b.  And it's good to keep things
      consistent rather than making optimizations for BLAKE2b but not BLAKE2s.
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: blake2s - share the "shash" API boilerplate code · 8c4a93a1
      Eric Biggers authored
      Add helper functions for shash implementations of BLAKE2s to
      include/crypto/internal/blake2s.h, taking advantage of
      __blake2s_update() and __blake2s_final() that were added by the previous
      patch to share more code between the library and shash implementations.
      
      crypto_blake2s_setkey() and crypto_blake2s_init() are usable as
      shash_alg::setkey and shash_alg::init directly, while
      crypto_blake2s_update() and crypto_blake2s_final() take an extra
      'blake2s_compress_t' function pointer parameter.  This allows the
      implementation of the compression function to be overridden, which is
      the only part that optimized implementations really care about.
      
      The new functions are inline functions (similar to those in sha1_base.h,
      sha256_base.h, and sm3_base.h) because this avoids needing to add a new
      module blake2s_helpers.ko, they aren't *too* long, and this avoids
      indirect calls which are expensive these days.  Note that they can't go
      in blake2s_generic.ko, as that would require selecting CRYPTO_BLAKE2S
      from CRYPTO_BLAKE2S_X86, which would cause a recursive dependency.
      
      Finally, use these new helper functions in the x86 implementation of
      BLAKE2s.  (This part should be a separate patch, but unfortunately the
      x86 implementation used the exact same function names like
      "crypto_blake2s_update()", so it had to be updated at the same time.)
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: blake2s - move update and final logic to internal/blake2s.h · 057edc9c
      Eric Biggers authored
      Move most of blake2s_update() and blake2s_final() into new inline
      functions __blake2s_update() and __blake2s_final() in
      include/crypto/internal/blake2s.h so that this logic can be shared by
      the shash helper functions.  This will avoid duplicating this logic
      between the library and shash implementations.
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: blake2s - remove unneeded includes · df412e7e
      Eric Biggers authored
      It doesn't make sense for the generic implementation of BLAKE2s to
      include <crypto/internal/simd.h> and <linux/jump_label.h>, as these are
      things that would only be useful in an architecture-specific
      implementation.  Remove these unnecessary includes.
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>