    crypto: x86/aes-gcm - add VAES and AVX512 / AVX10 optimized AES-GCM · b06affb1
    Eric Biggers authored
    Add implementations of AES-GCM for x86_64 CPUs that support VAES (vector
    AES), VPCLMULQDQ (vector carryless multiplication), and either AVX512 or
    AVX10.  There are two implementations, sharing most source code: one
    using 256-bit vectors and one using 512-bit vectors.  This patch
    improves AES-GCM performance by up to 162%; see Tables 1 and 2 below.
    
    I wrote the new AES-GCM assembly code from scratch, focusing on
    correctness, performance, code size (both source and binary), and
    documenting the source.  The new assembly file aes-gcm-avx10-x86_64.S is
    about 1200 lines including extensive comments, and it generates less
    than 8 KB of binary code.  The main loop does 4 vectors at a time, with
    the AES and GHASH instructions interleaved.  Any remainder is handled
    using a simple 1 vector at a time loop, with masking.
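    
    The loop structure described above can be sketched in Python.  This is
    only an illustration of how a message length might split between the
    4-vector main loop and the masked 1-vector remainder loop; the function
    name plan_loops and its return layout are hypothetical and do not
    appear in aes-gcm-avx10-x86_64.S.  vec_bytes is assumed to be 32 for
    256-bit vectors and 64 for 512-bit vectors.

```python
def plan_loops(msg_len: int, vec_bytes: int) -> tuple:
    """Split msg_len bytes into main-loop and remainder-loop work.

    Returns (main_iters, tail_iters, last_vec_bytes):
      main_iters     - iterations of the loop that handles 4 vectors at a time
      tail_iters     - iterations of the loop that handles 1 vector at a time
      last_vec_bytes - valid bytes in the final tail vector; the rest of
                       that vector would be masked off (0 if there is no tail)
    Hypothetical helper for illustration, not taken from the assembly file.
    """
    if vec_bytes not in (32, 64):
        raise ValueError("vec_bytes must be 32 (256-bit) or 64 (512-bit)")
    if msg_len < 0:
        raise ValueError("msg_len must be non-negative")

    stride = 4 * vec_bytes                    # bytes consumed per main-loop pass
    main_iters, remainder = divmod(msg_len, stride)
    tail_iters = -(-remainder // vec_bytes)   # ceiling division
    if tail_iters == 0:
        last_vec_bytes = 0
    else:
        last_vec_bytes = remainder - (tail_iters - 1) * vec_bytes
    return main_iters, tail_iters, last_vec_bytes
```

    For example, under these assumptions a 4095-byte message with 512-bit
    vectors takes 15 main-loop iterations, then 4 single-vector iterations
    of which the last covers only 63 bytes.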
    
    Several VAES + AVX512 implementations of AES-GCM exist from Intel,
    including one in OpenSSL and one proposed for inclusion in Linux in 2021
    (https://lore.kernel.org/linux-crypto/1611386920-28579-6-git-send-email-megha.dey@intel.com/).
    These aren't really suitable to be used, though, due to the massive
    amount of binary code generated (696 KB for OpenSSL, 200 KB for Linux)
    as well as the significantly larger amount of assembly source (4978
    lines for OpenSSL, 1788 lines for Linux).  Also, Intel's code does not
    support 256-bit vectors, which makes it not usable on future
    AVX10/256-only CPUs, and also not ideal for certain Intel CPUs that have
    downclocking issues.  So I ended up starting from scratch.  Usually my
    much shorter code is actually slightly faster than Intel's AVX512 code,
    though it depends on message length and on which of Intel's
    implementations is used; for details, see Tables 3 and 4 below.
    
    To facilitate potential integration into other projects, I've
    dual-licensed aes-gcm-avx10-x86_64.S under Apache-2.0 OR BSD-2-Clause,
    the same as the recently added RISC-V crypto code.
    
    The following two tables summarize the performance improvement over the
    existing AES-GCM code in Linux that uses AES-NI and AVX2:
    
    Table 1: AES-256-GCM encryption throughput improvement,
             CPU microarchitecture vs. message length in bytes:
    
                          | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
    ----------------------+-------+-------+-------+-------+-------+-------+
    Intel Ice Lake        |   42% |   48% |   60% |   62% |   70% |   69% |
    Intel Sapphire Rapids |  157% |  145% |  162% |  119% |   96% |   96% |
    Intel Emerald Rapids  |  156% |  144% |  161% |  115% |   95% |  100% |
    AMD Zen 4             |  103% |   89% |   78% |   56% |   54% |   54% |
    
                          |   300 |   200 |    64 |    63 |    16 |
    ----------------------+-------+-------+-------+-------+-------+
    Intel Ice Lake        |   66% |   48% |   49% |   70% |   53% |
    Intel Sapphire Rapids |   80% |   60% |   41% |   62% |   38% |
    Intel Emerald Rapids  |   79% |   60% |   41% |   62% |   38% |
    AMD Zen 4             |   51% |   35% |   27% |   32% |   25% |
    
    Table 2: AES-256-GCM decryption throughput improvement,
             CPU microarchitecture vs. message length in bytes:
    
                          | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
    ----------------------+-------+-------+-------+-------+-------+-------+
    Intel Ice Lake        |   42% |   48% |   59% |   63% |   67% |   71% |
    Intel Sapphire Rapids |  159% |  145% |  161% |  125% |  102% |  100% |
    Intel Emerald Rapids  |  158% |  144% |  161% |  124% |  100% |  103% |
    AMD Zen 4             |  110% |   95% |   80% |   59% |   56% |   54% |
    
                          |   300 |   200 |    64 |    63 |    16 |
    ----------------------+-------+-------+-------+-------+-------+
    Intel Ice Lake        |   67% |   56% |   46% |   70% |   56% |
    Intel Sapphire Rapids |   79% |   62% |   39% |   61% |   39% |
    Intel Emerald Rapids  |   80% |   62% |   40% |   58% |   40% |
    AMD Zen 4             |   49% |   36% |   30% |   35% |   28% |
    
    The above numbers are percentage improvements in single-thread
    throughput, so e.g. an increase from 4000 MB/s to 6000 MB/s would be
    listed as 50%.  They were collected by directly measuring the Linux
    crypto API performance using a custom kernel module.  Note that indirect
    benchmarks (e.g. 'cryptsetup benchmark' or benchmarking dm-crypt I/O)
    include more overhead and won't see quite as much of a difference.  All
    these benchmarks used an associated data length of 16 bytes.  Note that
    AES-GCM is almost always used with short associated data lengths.
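    
    The percentage convention used in Tables 1 and 2 can be written out as
    a short Python sketch.  The function names below are hypothetical and
    exist only to make the arithmetic explicit; they are not part of the
    benchmark module.

```python
def improvement_percent(old_mbps: float, new_mbps: float) -> float:
    """Percentage improvement in throughput, as listed in Tables 1 and 2."""
    if old_mbps <= 0:
        raise ValueError("old throughput must be positive")
    return (new_mbps - old_mbps) / old_mbps * 100.0


def speedup_factor(percent: float) -> float:
    """Convert a listed percentage improvement into a throughput multiplier."""
    return 1.0 + percent / 100.0


# The example from the text: 4000 MB/s -> 6000 MB/s is listed as 50%.
example = improvement_percent(4000, 6000)
```

    Read this way, the largest entry in Table 1 (162%) corresponds to a
    throughput multiplier of about 2.62x, not 1.62x.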
    
    The following two tables summarize how the performance of my code
    compares with Intel's AVX512 AES-GCM code, both the version that is in
    OpenSSL and the version that was proposed for inclusion in Linux.
    Neither version exists in Linux currently, but these are alternative
    AES-GCM implementations that could be chosen instead of mine.  I
    collected the following numbers on Emerald Rapids using a userspace
    benchmark program that calls the assembly functions directly.
    
    I've also included a comparison with Cloudflare's AES-GCM implementation
    from https://boringssl-review.googlesource.com/c/boringssl/+/65987/3.
    
    Table 3: VAES-based AES-256-GCM encryption throughput in MB/s,
             implementation name vs. message length in bytes:
    
                         | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
    ---------------------+-------+-------+-------+-------+-------+-------+
    This implementation  | 14171 | 12956 | 12318 |  9588 |  7293 |  6449 |
    AVX512_Intel_OpenSSL | 14022 | 12467 | 11863 |  9107 |  5891 |  6472 |
    AVX512_Intel_Linux   | 13954 | 12277 | 11530 |  8712 |  6627 |  5898 |
    AVX512_Cloudflare    | 12564 | 11050 | 10905 |  8152 |  5345 |  5202 |
    
                         |   300 |   200 |    64 |    63 |    16 |
    ---------------------+-------+-------+-------+-------+-------+
    This implementation  |  4939 |  3688 |  1846 |  1821 |   738 |
    AVX512_Intel_OpenSSL |  4629 |  4532 |  2734 |  2332 |  1131 |
    AVX512_Intel_Linux   |  4035 |  2966 |  1567 |  1330 |   639 |
    AVX512_Cloudflare    |  3344 |  2485 |  1141 |  1127 |   456 |
    
    Table 4: VAES-based AES-256-GCM decryption throughput in MB/s,
             implementation name vs. message length in bytes:
    
                         | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
    ---------------------+-------+-------+-------+-------+-------+-------+
    This implementation  | 14276 | 13311 | 13007 | 11086 |  8268 |  8086 |
    AVX512_Intel_OpenSSL | 14067 | 12620 | 12421 |  9587 |  5954 |  7060 |
    AVX512_Intel_Linux   | 14116 | 12795 | 11778 |  9269 |  7735 |  6455 |
    AVX512_Cloudflare    | 13301 | 12018 | 11919 |  9182 |  7189 |  6726 |
    
                         |   300 |   200 |    64 |    63 |    16 |
    ---------------------+-------+-------+-------+-------+-------+
    This implementation  |  6454 |  5020 |  2635 |  2602 |  1079 |
    AVX512_Intel_OpenSSL |  5184 |  5799 |  2957 |  2545 |  1228 |
    AVX512_Intel_Linux   |  4394 |  4247 |  2235 |  1635 |   922 |
    AVX512_Cloudflare    |  4289 |  3851 |  1435 |  1417 |   574 |
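    
    As a rough way to read Table 3, the following Python sketch compares
    the "This implementation" row against the AVX512_Intel_OpenSSL row.
    The throughput values are copied from Table 3; the helper names are
    hypothetical and are not part of the userspace benchmark program.

```python
# Message lengths in bytes and encryption throughput in MB/s, from Table 3.
LENGTHS = [16384, 4096, 4095, 1420, 512, 500, 300, 200, 64, 63, 16]
THIS_IMPL = [14171, 12956, 12318, 9588, 7293, 6449, 4939, 3688, 1846, 1821, 738]
INTEL_OPENSSL = [14022, 12467, 11863, 9107, 5891, 6472, 4629, 4532, 2734, 2332, 1131]


def relative_difference(ours, theirs):
    """Map each message length to our throughput advantage in percent.

    Positive values mean `ours` is faster; negative values mean slower.
    """
    return {
        length: (a - b) / b * 100.0
        for length, a, b in zip(LENGTHS, ours, theirs)
    }


def lengths_where_faster(ours, theirs):
    """List the message lengths at which `ours` has higher throughput."""
    return [length for length, a, b in zip(LENGTHS, ours, theirs) if a > b]


diffs = relative_difference(THIS_IMPL, INTEL_OPENSSL)
faster = lengths_where_faster(THIS_IMPL, INTEL_OPENSSL)
```

    The same helpers can be applied to any other pair of rows in Tables 3
    and 4 by substituting the corresponding lists of values.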
    
    So, usually my code is actually slightly faster than Intel's code,
    though the OpenSSL implementation has a slight edge on messages shorter
    than 256 bytes in this microbenchmark.  (This also holds true when doing
    the same tests on AMD Zen 4.)  It can be seen that the large code size
    (up to 94x larger!) of the Intel implementations doesn't seem to bring
    much benefit, so starting from scratch with much smaller code, as I've
    done, seems appropriate.  The performance of my code on messages shorter
    than 256 bytes could be improved through a limited amount of unrolling,
    but it's unclear it would be worth it, given code size considerations
    (e.g. caches) that don't get measured in microbenchmarks.
    Signed-off-by: Eric Biggers <ebiggers@google.com>
    Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>