% ── 7. Conclusion ─────────────────────────────────────────────────────────────
\section{Conclusion}
\label{sec:conclusion}

We presented the first statistically rigorous decomposition of SIMD speedup in \mlkem{} (Kyber), isolating the contributions of compiler optimization, auto-vectorization, and hand-written AVX2 assembly. Our main findings are:

\begin{enumerate}
  \item \textbf{Hand-written SIMD is necessary, not optional.} GCC's auto-vectorizer provides negligible benefit ($<10\%$) for NTT-based arithmetic, and for \op{INVNTT} it actually produces slightly slower code than non-vectorized \texttt{-O3}. The full \speedup{35}--\speedup{56} speedup on arithmetic operations comes entirely from hand-written assembly.
  \item \textbf{The distribution of SIMD benefit across operations is highly non-uniform.} Arithmetic operations (NTT, INVNTT, basemul, frommsg) achieve \speedup{35}--\speedup{56}; SHAKE-based expansion (gen\_a) achieves only \speedup{3.8}--\speedup{4.7}; and noise sampling achieves just \speedup{1.2}--\speedup{1.4}. For the non-arithmetic operations, the bottleneck shifts from compute to memory bandwidth.
  \item \textbf{The statistical signal is overwhelming.} Cliff's $\delta = +1.000$ for nearly all operations means AVX2 is faster than \varref{} in every single observation pair across $n \ge 2{,}000$ measurements. These results are stable across all three \mlkem{} parameter sets.
  \item \textbf{Context affects even isolated micro-benchmarks.} The NTT speedup varies by 13\% across parameter sets despite identical polynomial dimensions; we attribute this to cache-state effects from the surrounding polyvec operations.
\end{enumerate}

\paragraph{Future work.}
Planned extensions include: hardware performance counter profiles (IPC, cache miss rates) via PAPI to validate the mechanistic explanations in \S\ref{sec:discussion}; energy measurement via Intel RAPL; extension to \mldsa{} (Dilithium) and \slhdsa{} (SPHINCS+) with the same harness; and cross-ISA comparison with ARM NEON/SVE (Graviton3) and RISC-V~V. A compiler version sensitivity study (GCC 11--14, Clang 14--17) will characterize how stable the auto-vectorization gap is across compiler releases.

\paragraph{Artifact.}
The benchmark harness, SLURM job templates, raw cycle-count data, analysis pipeline, and this paper are released at \url{https://github.com/lneuwirth/where-simd-helps} under an open license.
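To make the effect-size claim in finding~3 concrete, the following is a minimal sketch of the naive $O(nm)$ Cliff's $\delta$ computation; $\delta = +1$ corresponds to every reference cycle count exceeding every AVX2 cycle count. The function name, the sign convention (positive when the reference is slower), and the sample values are illustrative assumptions, not taken from the released analysis pipeline.

```python
def cliffs_delta(ref, avx2):
    """Naive pairwise Cliff's delta (sketch, not the paper's pipeline).

    Sign convention assumed here: positive when `ref` samples tend to
    exceed `avx2` samples, i.e. the reference implementation is slower.
    """
    n, m = len(ref), len(avx2)
    gt = sum(1 for r in ref for a in avx2 if r > a)  # ref slower pairs
    lt = sum(1 for r in ref for a in avx2 if r < a)  # ref faster pairs
    return (gt - lt) / (n * m)

# Hypothetical cycle counts: when the two samples do not overlap at all,
# delta is exactly +1.0, matching the "every observation pair" reading.
ref_cycles  = [5200, 5180, 5310]
avx2_cycles = [140, 150, 138]
print(cliffs_delta(ref_cycles, avx2_cycles))  # → 1.0
```

The quadratic pairwise loop is fine at $n \approx 2{,}000$; for larger samples one would compute the same statistic from rank sums instead.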