% ── 1. Introduction ───────────────────────────────────────────────────────────
\section{Introduction}
\label{sec:intro}
The 2024 NIST post-quantum cryptography standards~\cite{fips203,fips204,fips205}
mark a turning point in deployed cryptography. \mlkem{} (Module-Lattice Key
Encapsulation Mechanism, FIPS~203) is already being integrated into TLS~1.3 by
major browser vendors~\cite{bettini2024} and is planned for inclusion in OpenSSH.
At deployment scale, performance matters: a server handling thousands of TLS
handshakes per second experiences a non-trivial computational overhead from
replacing elliptic-curve key exchange with a lattice-based KEM.
Reference implementations of \mlkem{} ship with hand-optimized AVX2 assembly
for the dominant operations~\cite{kyber-avx2}. Benchmarks routinely report
that the AVX2 path is ``$5$--$7\times$ faster'' than the portable C reference.
However, such top-level numbers conflate three distinct factors: general
compiler optimization, compiler auto-vectorization, and hand-written SIMD. They
also say nothing about \emph{which} operations drive the speedup or \emph{why}
the assembly is faster than what a compiler can produce automatically.
\subsection*{Contributions}
This paper makes the following contributions:
\begin{enumerate}
\item \textbf{Three-way speedup decomposition.} We isolate compiler
optimization, auto-vectorization, and hand-written SIMD as separate
factors using four compilation variants (§\ref{sec:methodology}).
\item \textbf{Statistically rigorous benchmarking.} All comparisons are
backed by Mann--Whitney $U$ tests and Cliff's~$\delta$ effect-size
analysis over $n \ge 2{,}000$ independent observations, with
bootstrapped 95\% confidence intervals on speedup ratios
(§\ref{sec:results}).
\item \textbf{Mechanistic analysis without hardware counters.} We explain
the quantitative speedup pattern analytically from the structure of
the NTT butterfly, Montgomery multiplication, and the SHAKE-128
permutation (§\ref{sec:discussion}).
\item \textbf{Open reproducible artifact.} The full pipeline from raw
SLURM outputs to publication figures is released publicly.
\end{enumerate}
\subsection*{Scope and roadmap}
This report covers Phase~1 of a broader study: \mlkem{} on Intel x86-64 with
AVX2. Planned extensions include hardware performance counter profiles (PAPI),
energy measurement (Intel RAPL), extension to \mldsa{} (Dilithium), and
cross-ISA comparison with ARM NEON/SVE and RISC-V V. Those results will be
incorporated in subsequent revisions.