% ── 1. Introduction ───────────────────────────────────────────────────────────
\section{Introduction}
\label{sec:intro}
The 2024 NIST post-quantum cryptography standards~\cite{fips203,fips204,fips205}
mark a turning point in deployed cryptography. \mlkem{} (Module-Lattice Key
Encapsulation Mechanism, FIPS~203) is already being integrated into TLS~1.3 by
major browser vendors~\cite{bettini2024} and is planned for inclusion in OpenSSH.
At deployment scale, performance matters: a server handling thousands of TLS
handshakes per second experiences a non-trivial computational overhead from
replacing elliptic-curve key exchange with a lattice-based KEM.
Reference implementations of \mlkem{} ship with hand-optimized AVX2 assembly
for the dominant operations~\cite{kyber-avx2}. Benchmarks routinely report
that the AVX2 path is ``$5$--$7\times$ faster'' than the portable C reference.
However, such top-level numbers conflate three distinct factors: general
compiler optimization, compiler auto-vectorization, and hand-written SIMD. They
also say nothing about \emph{which} operations drive the speedup or \emph{why}
the assembly is faster than what a compiler can produce automatically.
\subsection*{Contributions}
This paper makes the following contributions:
\begin{enumerate}
\item \textbf{Three-way speedup decomposition.} We isolate compiler
optimization, auto-vectorization, and hand-written SIMD as separate
factors using four compilation variants (§\ref{sec:methodology}).
\item \textbf{Statistically rigorous benchmarking.} All comparisons are
backed by Mann--Whitney $U$ tests and Cliff's~$\delta$ effect-size
analysis over $n \ge 2{,}000$ independent observations, with
bootstrapped 95\% confidence intervals on speedup ratios
(§\ref{sec:results}).
\item \textbf{Mechanistic analysis without hardware counters.} We explain
the quantitative speedup pattern analytically from the structure of
the NTT butterfly, Montgomery multiplication, and the SHAKE-128
permutation (§\ref{sec:discussion}).
\item \textbf{Open reproducible artifact.} The full pipeline from raw
SLURM outputs to publication figures is released publicly.
\end{enumerate}
\subsection*{Scope and roadmap}
This report covers Phase~1 of a broader study: \mlkem{} on Intel x86-64 with
AVX2. Planned extensions include hardware performance counter profiles (PAPI),
energy measurement (Intel RAPL), extension to \mldsa{} (Dilithium), and
cross-ISA comparison with ARM NEON/SVE and RISC-V V. Those results will be
incorporated in subsequent revisions.