Post-quantum cryptography (PQC) standards are being deployed at scale following NIST's 2024 finalization of \mlkem{} (FIPS~203), \mldsa{} (FIPS~204), and \slhdsa{} (FIPS~205). Hand-written SIMD implementations of these algorithms are reported to achieve dramatic performance advantages, yet the mechanistic origins of these speedups are rarely quantified with statistical rigor. We present the first systematic empirical decomposition of SIMD speedup across the operations of \mlkem{} (Kyber) on Intel x86-64 with AVX2. Using a reproducible benchmark harness across four compilation variants---\varrefo{} (unoptimized), \varrefnv{} (O3, auto-vectorization disabled), \varref{} (O3 with auto-vectorization), and \varavx{} (hand-written AVX2 intrinsics)---we isolate three distinct contributions: compiler optimization, compiler auto-vectorization, and hand-written SIMD. All measurements are conducted on a pinned core of an Intel Xeon Platinum 8268 on Brown University's OSCAR HPC cluster, with statistical significance assessed via Mann-Whitney U tests and Cliff's~$\delta$ effect-size analysis across $n \ge 2{,}000$ independent observations per group. Our key findings are: (1) hand-written AVX2 code yields \speedup{35}--\speedup{56} speedups over compiler-optimized C for the dominant arithmetic operations (NTT, INVNTT, base multiplication), with Cliff's $\delta = +1.000$ in every comparison---meaning the AVX2 variant is faster in \emph{every single} observation pair; (2) GCC's auto-vectorizer contributes negligibly, or even slightly negatively, to the NTT-based operations because their modular-reduction step prevents automatic vectorization; (3) end-to-end KEM speedups of \speedup{5.4}--\speedup{7.1} arise from a weighted combination of large per-operation gains and smaller gains in SHAKE-heavy operations (gen\_a: \speedup{3.8}--\speedup{4.7}; noise sampling: \speedup{1.2}--\speedup{1.4}). The benchmark harness, raw data, and analysis pipeline are released as an open reproducible artifact.