Post-quantum cryptography (PQC) standards are being deployed at scale following
NIST's 2024 finalization of \mlkem{} (FIPS~203), \mldsa{} (FIPS~204), and
\slhdsa{} (FIPS~205). Hand-written SIMD implementations of these algorithms
report dramatic performance advantages, yet the mechanistic origins of these
speedups are rarely quantified with statistical rigor.
We present the first systematic empirical decomposition of SIMD speedup across
the operations of \mlkem{} (Kyber) on Intel x86-64 with AVX2. Using a
reproducible benchmark harness across four compilation variants---\varrefo{}
(unoptimized), \varrefnv{} (O3, auto-vectorization disabled), \varref{}
(O3 with auto-vectorization), and \varavx{} (hand-written AVX2 intrinsics)---we
isolate three distinct contributions: compiler optimization, compiler
auto-vectorization, and hand-written SIMD. All measurements are conducted on a
pinned core of an Intel Xeon Platinum 8268 on Brown University's OSCAR HPC
cluster, with statistical significance assessed via Mann--Whitney $U$ tests and
Cliff's~$\delta$ effect-size analysis across $n \ge 2{,}000$ independent
observations per group.
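For reference, Cliff's~$\delta$ is conventionally defined over all cross-group
pairs of observations. As a sketch, assume $x_1, \dots, x_{n_x}$ are timings
from the slower baseline variant and $y_1, \dots, y_{n_y}$ are timings from the
faster variant; this notation and group ordering are illustrative assumptions,
not fixed by the text above:
\[
  \delta \;=\;
  \frac{\#\{(i,j) : x_i > y_j\} \;-\; \#\{(i,j) : x_i < y_j\}}{n_x\, n_y},
  \qquad -1 \le \delta \le 1 .
\]
Under this convention, $\delta = +1$ holds exactly when every baseline timing
exceeds every timing of the faster variant, and $\delta = 0$ indicates no
systematic ordering between the two groups.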
Our key findings are: (1) hand-written AVX2 code delivers a
\speedup{35}--\speedup{56} speedup over compiler-optimized C for the dominant
arithmetic operations (NTT, INVNTT, base multiplication), with Cliff's
$\delta = +1.000$ in every comparison---meaning AVX2 is faster in
\emph{every single} observation pair; (2) GCC's auto-vectorizer contributes
negligibly, or even slightly negatively, to NTT-based operations because the
modular reduction step prevents vectorization; (3) end-to-end KEM speedups of
\speedup{5.4}--\speedup{7.1} result from a weighted combination of large
gains in the arithmetic operations and smaller gains in SHAKE-heavy operations
(gen\_a: \speedup{3.8}--\speedup{4.7}; noise sampling:
\speedup{1.2}--\speedup{1.4}).
The benchmark harness, raw data, and analysis pipeline are released as an open
reproducible artifact.