% ── 7. Conclusion ─────────────────────────────────────────────────────────────
\section{Conclusion}
\label{sec:conclusion}

We presented the first statistically rigorous decomposition of SIMD speedup
in \mlkem{} (Kyber), isolating the contributions of compiler optimization,
auto-vectorization, and hand-written AVX2 assembly. Our main findings are:

\begin{enumerate}
  \item \textbf{Hand-written SIMD is necessary, not optional.} GCC's
        auto-vectorizer provides negligible benefit ($<10\%$) for NTT-based
        arithmetic, and for \op{INVNTT} it actually produces slightly slower
        code than the non-vectorized \texttt{-O3} build. The full
        \speedup{35}--\speedup{56} speedup on arithmetic operations comes
        entirely from hand-written assembly.

  \item \textbf{The distribution of SIMD benefit across operations is
        highly non-uniform.} Arithmetic operations (NTT, INVNTT, basemul,
        frommsg) achieve \speedup{35}--\speedup{56}; SHAKE-based expansion
        (gen\_a) achieves only \speedup{3.8}--\speedup{4.7}; and noise
        sampling achieves \speedup{1.2}--\speedup{1.4}. The bottleneck shifts
        from compute to memory bandwidth for non-arithmetic operations.

  \item \textbf{The statistical signal is overwhelming.} Cliff's $\delta =
        +1.000$ for nearly all operations means that every AVX2 measurement
        is faster than every \varref{} measurement across $n \ge 2{,}000$
        measurements, i.e., the two cycle-count distributions do not overlap.
        These results are stable across all three \mlkem{} parameter sets.

  \item \textbf{Context affects even isolated micro-benchmarks.} The NTT
        speedup varies by 13\% across parameter sets despite identical
        polynomial dimensions, which we attribute to cache-state effects
        from the surrounding polyvec operations.
\end{enumerate}
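
For reference, the effect size quoted in the third finding can be written out
explicitly. The following is a sketch under the standard definition of
Cliff's $\delta$; the symbols $r_i$, $a_j$, $n_r$, and $n_a$ are notation
introduced here for illustration, and the sign convention (positive values
favor AVX2) is an assumption consistent with the reported $+1.000$. Let
$r_1,\dots,r_{n_r}$ denote the cycle counts measured for \varref{} and
$a_1,\dots,a_{n_a}$ those measured for AVX2. Then
\[
  \delta \;=\;
  \frac{\#\{(i,j) : r_i > a_j\} \;-\; \#\{(i,j) : r_i < a_j\}}{n_r\, n_a}.
\]
Under this convention, $\delta = +1$ holds exactly when $r_i > a_j$ for every
pair $(i,j)$, that is, when the slowest AVX2 run is still faster than the
fastest \varref{} run. Ties contribute to neither count.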

\paragraph{Future work.}
Planned extensions include: hardware performance counter profiles (IPC, cache
miss rates) via PAPI to validate the mechanistic explanations in
§\ref{sec:discussion}; energy measurement via Intel RAPL; extension to
\mldsa{} (Dilithium) and \slhdsa{} (SPHINCS+) with the same harness; and
cross-ISA comparison with ARM NEON/SVE (Graviton3) and RISC-V V. A compiler
version sensitivity study (GCC 11--14, Clang 14--17) will characterize how
stable the auto-vectorization gap is across compiler releases.

\paragraph{Artifact.}
The benchmark harness, SLURM job templates, raw cycle-count data, analysis
pipeline, and this paper are released at
\url{https://github.com/lneuwirth/where-simd-helps} under an open license.