% ── 7. Conclusion ─────────────────────────────────────────────────────────────
\section{Conclusion}
\label{sec:conclusion}
We presented the first statistically rigorous decomposition of SIMD speedup
in \mlkem{} (Kyber), isolating the contributions of compiler optimization,
auto-vectorization, and hand-written AVX2 assembly. Our main findings are:
\begin{enumerate}
\item \textbf{Hand-written SIMD is necessary, not optional.} GCC's
auto-vectorizer provides negligible benefit ($<10\%$) for NTT-based
arithmetic, and for \op{INVNTT} actually produces slightly slower code
than \texttt{-O3} with vectorization disabled. The full
\speedup{35}--\speedup{56} speedup on arithmetic operations comes
entirely from hand-written assembly.
\item \textbf{The distribution of SIMD benefit across operations is
highly non-uniform.} Arithmetic operations (NTT, INVNTT, basemul,
frommsg) achieve \speedup{35}--\speedup{56}; SHAKE-based expansion
(gen\_a) achieves only \speedup{3.8}--\speedup{4.7}; and noise
sampling achieves \speedup{1.2}--\speedup{1.4}. The bottleneck shifts
from compute to memory bandwidth for non-arithmetic operations.
\item \textbf{The statistical signal is overwhelming.} Cliff's $\delta =
+1.000$ for nearly all operations means AVX2 is faster than \varref{}
in every single observation pair across $n \ge 2{,}000$ measurements.
These results are stable across three \mlkem{} parameter sets.
\item \textbf{Context affects even isolated micro-benchmarks.} The NTT
speedup varies by 13\% across parameter sets despite identical
polynomial dimensions, attributed to cache-state effects from
surrounding polyvec operations.
\end{enumerate}
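The effect size in item~3 is Cliff's $\delta$, the normalized difference
between the number of pairs in which one sample exceeds the other. As an
illustrative sketch only (not the paper's actual analysis code; the function
name and sample values are hypothetical), the computation behind
$\delta = +1.000$ looks like this in Python:

```python
def cliffs_delta(xs, ys):
    """Cliff's delta effect size between two samples.

    Returns +1.0 when every x exceeds every y (complete dominance),
    -1.0 in the opposite case, and 0.0 when the samples overlap evenly.
    """
    gt = sum(1 for x in xs for y in ys if x > y)  # pairs where x wins
    lt = sum(1 for x in xs for y in ys if x < y)  # pairs where y wins
    return (gt - lt) / (len(xs) * len(ys))

# Hypothetical cycle counts: reference build vs. AVX2 build.
# Every reference measurement is slower than every AVX2 measurement,
# so delta is exactly +1.0 -- the "every observation pair" case.
ref_cycles  = [5400, 5510, 5470, 5620]
avx2_cycles = [ 150,  148,  152,  149]
print(cliffs_delta(ref_cycles, avx2_cycles))  # -> 1.0
```

A $\delta$ of exactly $+1.000$ is thus a stronger statement than any mean
speedup: the two cycle-count distributions do not overlap at all.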
\paragraph{Future work.}
Planned extensions include: hardware performance counter profiles (IPC, cache
miss rates) via PAPI to validate the mechanistic explanations in
\S\ref{sec:discussion}; energy measurement via Intel RAPL; extension to
\mldsa{} (Dilithium) and \slhdsa{} (SPHINCS+) with the same harness; and
cross-ISA comparison with ARM NEON/SVE (Graviton3) and RISC-V V. A compiler
version sensitivity study (GCC 11--14, Clang 14--17) will characterize how
stable the auto-vectorization gap is across compiler releases.
\paragraph{Artifact.}
The benchmark harness, SLURM job templates, raw cycle-count data, analysis
pipeline, and this paper are released at
\url{https://github.com/lneuwirth/where-simd-helps} under an open license.