% ── 1. Introduction ───────────────────────────────────────────────────────────
\section{Introduction}
\label{sec:intro}

The 2024 NIST post-quantum cryptography standards~\cite{fips203,fips204,fips205}
mark a turning point in deployed cryptography. \mlkem{} (Module-Lattice-Based
Key-Encapsulation Mechanism, FIPS~203) is already being integrated into TLS~1.3 by
major browser vendors~\cite{bettini2024} and is planned for inclusion in OpenSSH.
At deployment scale, performance matters: a server handling thousands of TLS
handshakes per second incurs non-trivial computational overhead when
elliptic-curve key exchange is replaced with a lattice-based KEM.

Reference implementations of \mlkem{} ship with hand-optimized AVX2 assembly for
the dominant operations~\cite{kyber-avx2}. Benchmarks routinely report that the
AVX2 path is ``$5$--$7\times$ faster'' than the portable C reference. However,
such top-level numbers conflate three distinct effects: scalar compiler
optimization, compiler auto-vectorization, and hand-written SIMD. They also say
nothing about \emph{which} operations drive the speedup or \emph{why} the
assembly is faster than what a compiler can produce automatically.

\subsection*{Contributions}

This paper makes the following contributions:
\begin{enumerate}
  \item \textbf{Three-way speedup decomposition.} We isolate compiler
    optimization, auto-vectorization, and hand-written SIMD as separate factors
    using four compilation variants (§\ref{sec:methodology}).
  \item \textbf{Statistically rigorous benchmarking.} All comparisons are backed
    by Mann--Whitney $U$ tests and Cliff's~$\delta$ effect-size analysis over
    $n \ge 2{,}000$ independent observations, with bootstrapped 95\% confidence
    intervals on speedup ratios (§\ref{sec:results}).
  \item \textbf{Mechanistic analysis without hardware counters.} We explain the
    quantitative speedup pattern analytically from the structure of the NTT
    butterfly, Montgomery multiplication, and the Keccak-$f[1600]$ permutation
    underlying SHAKE-128 (§\ref{sec:discussion}).
  \item \textbf{Open reproducible artifact.} The full pipeline, from raw SLURM
    outputs to publication figures, is released publicly.
\end{enumerate}

\subsection*{Scope and roadmap}

This paper covers Phase~1 of a broader study: \mlkem{} on Intel x86-64 with
AVX2. Planned extensions include hardware performance counter profiles (PAPI),
energy measurement (Intel RAPL), extension to \mldsa{} (Dilithium), and
cross-ISA comparison with ARM NEON/SVE and the RISC-V vector extension. Those
results will be incorporated in subsequent revisions.