% ── 7. Conclusion ─────────────────────────────────────────────────────────────
\section{Conclusion}
\label{sec:conclusion}
We presented the first statistically rigorous decomposition of SIMD speedup
in \mlkem{} (Kyber), isolating the contributions of compiler optimization,
auto-vectorization, and hand-written AVX2 assembly. Our main findings are:
\begin{enumerate}
\item \textbf{Hand-written SIMD is necessary, not optional.} GCC's
auto-vectorizer provides negligible benefit ($<10\%$) for NTT-based
arithmetic, and for \op{INVNTT} actually produces slightly slower code
than \texttt{-O3} with vectorization disabled. The full
\speedup{35}--\speedup{56} speedup on arithmetic operations comes
entirely from hand-written assembly.
\item \textbf{The distribution of SIMD benefit across operations is
highly non-uniform.} Arithmetic operations (NTT, INVNTT, basemul,
frommsg) achieve \speedup{35}--\speedup{56}; SHAKE-based expansion
(gen\_a) achieves only \speedup{3.8}--\speedup{4.7}; and noise
sampling achieves \speedup{1.2}--\speedup{1.4}. The bottleneck shifts
from compute to memory bandwidth for non-arithmetic operations.
\item \textbf{The statistical signal is overwhelming.} Cliff's $\delta =
+1.000$ for nearly all operations means AVX2 is faster than \varref{}
in every single observation pair across $n \ge 2{,}000$ measurements.
These results are stable across three \mlkem{} parameter sets.
\item \textbf{Context affects even isolated micro-benchmarks.} The NTT
speedup varies by 13\% across parameter sets despite identical
polynomial dimensions, attributed to cache-state effects from
surrounding polyvec operations.
\end{enumerate}
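The effect size in item~3 is Cliff's $\delta$, the normalized difference
between the number of pairs in which one sample exceeds the other. As an
illustrative sketch only (not the paper's actual analysis code; the function
name and sample values are hypothetical), the computation behind
$\delta = +1.000$ looks like this in Python:

```python
def cliffs_delta(xs, ys):
    """Cliff's delta effect size between two samples.

    Returns +1.0 when every x exceeds every y (complete dominance),
    -1.0 in the opposite case, and 0.0 when the samples overlap evenly.
    """
    gt = sum(1 for x in xs for y in ys if x > y)  # pairs where x wins
    lt = sum(1 for x in xs for y in ys if x < y)  # pairs where y wins
    return (gt - lt) / (len(xs) * len(ys))

# Hypothetical cycle counts: reference build vs. AVX2 build.
# Every reference measurement is slower than every AVX2 measurement,
# so delta is exactly +1.0 -- the "every observation pair" case.
ref_cycles  = [5400, 5510, 5470, 5620]
avx2_cycles = [ 150,  148,  152,  149]
print(cliffs_delta(ref_cycles, avx2_cycles))  # -> 1.0
```

A $\delta$ of exactly $+1.000$ is thus a stronger statement than any mean
speedup: the two cycle-count distributions do not overlap at all.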
\paragraph{Future work.}
Planned extensions include: hardware performance counter profiles (IPC, cache
miss rates) via PAPI to validate the mechanistic explanations in
\S\ref{sec:discussion}; energy measurement via Intel RAPL; extension to
\mldsa{} (Dilithium) and \slhdsa{} (SPHINCS+) with the same harness; and
cross-ISA comparison with ARM NEON/SVE (Graviton3) and RISC-V V. A compiler
version sensitivity study (GCC 11--14, Clang 14--17) will characterize how
stable the auto-vectorization gap is across compiler releases.
\paragraph{Artifact.}
The benchmark harness, SLURM job templates, raw cycle-count data, analysis
pipeline, and this paper are released at
\url{https://github.com/lneuwirth/where-simd-helps} under an open license.