Post-quantum cryptography (PQC) standards are being deployed at scale following NIST's 2024 finalization of \mlkem{} (FIPS~203), \mldsa{} (FIPS~204), and \slhdsa{} (FIPS~205). Hand-written SIMD implementations of these algorithms are reported to achieve dramatic performance advantages, yet the mechanistic origins of these speedups are rarely quantified with statistical rigor. We present the first systematic empirical decomposition of SIMD speedup across the operations of \mlkem{} (Kyber) on Intel x86-64 with AVX2. Using a reproducible benchmark harness across four compilation variants---\varrefo{} (unoptimized), \varrefnv{} (O3, auto-vectorization disabled), \varref{} (O3 with auto-vectorization), and \varavx{} (hand-written AVX2 intrinsics)---we isolate three distinct contributions: compiler optimization, compiler auto-vectorization, and hand-written SIMD. All measurements are conducted on a pinned core of an Intel Xeon Platinum 8268 on Brown University's OSCAR HPC cluster, with statistical significance assessed via Mann-Whitney U tests and Cliff's~$\delta$ effect-size analysis across $n \ge 2{,}000$ independent observations per group. Our key findings are: (1) hand-written AVX2 code yields \speedup{35}--\speedup{56} speedups over compiler-optimized C for the dominant arithmetic operations (NTT, INVNTT, base multiplication), with Cliff's $\delta = +1.000$ in every comparison---meaning the AVX2 variant is faster in \emph{every single} observation pair; (2) GCC's auto-vectorizer contributes negligibly, or even slightly negatively, to the NTT-based operations because their modular-reduction step prevents automatic vectorization; (3) end-to-end KEM speedups of \speedup{5.4}--\speedup{7.1} arise from a weighted combination of large per-operation gains and smaller gains in SHAKE-heavy operations (gen\_a: \speedup{3.8}--\speedup{4.7}; noise sampling: \speedup{1.2}--\speedup{1.4}). The benchmark harness, raw data, and analysis pipeline are released as an open reproducible artifact.