
% ── 6. Related Work ───────────────────────────────────────────────────────────
\section{Related Work}
\label{sec:related}

\paragraph{ML-KEM / Kyber implementations.}
The AVX2 implementation studied here was developed by Schwabe and
Seiler~\cite{kyber-avx2} and forms the optimized path in both the
\texttt{pq-crystals/kyber} reference repository and
PQClean~\cite{pqclean}. Bos et al.~\cite{kyber2018} describe the original
Kyber submission; FIPS~203~\cite{fips203} specifies its standardized form, ML-KEM.
ARM Cortex-M4 implementations are available in pqm4~\cite{pqm4}; a cross-ISA
comparison, including ARM NEON, is planned for Phase~3.

\paragraph{PQC benchmarking.}
eBACS/SUPERCOP provides a cross-platform benchmark suite~\cite{supercop} that
reports median cycle counts for many cryptographic primitives, including Kyber.
Our contribution complements these measurements with a statistically rigorous
decomposition based on nonparametric effect sizes and bootstrapped confidence
intervals. Kannwischer et al.~\cite{pqm4} present pqm4, a systematic
benchmarking framework for the ARM Cortex-M4 that focuses on
constrained-device performance rather than SIMD analysis.
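
As an illustrative sketch of such an analysis, assume cycle-count samples
$x_1,\dots,x_m$ for one variant and $y_1,\dots,y_n$ for another, and assume
Cliff's $\delta$ as the effect-size measure (the choice of statistic and the
symbols are assumptions made for this illustration):
\[
  \delta \;=\; \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}
  \bigl(\mathbf{1}[x_i > y_j] - \mathbf{1}[x_i < y_j]\bigr),
  \qquad -1 \le \delta \le 1 .
\]
A percentile bootstrap interval for a statistic $\theta$ (for instance, a
ratio of medians) resamples each sample with replacement $B$ times, recomputes
the replicates $\theta^{*}_{1},\dots,\theta^{*}_{B}$, and reports their
empirical $\alpha/2$ and $1-\alpha/2$ quantiles as the $(1-\alpha)$ interval.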

\paragraph{SIMD in cryptography.}
Gueron and Krasnov demonstrated AVX2 speedups for AES-GCM~\cite{gueron2014};
similar techniques underpin the Kyber AVX2 implementation. Bernstein's
vectorized polynomial arithmetic for Curve25519~\cite{bernstein2006} established
the template of hand-written vector intrinsics for cryptographic field
arithmetic.

\paragraph{NTT optimization.}
Longa and Naehrig~\cite{ntt-survey} survey NTT algorithms for ideal
lattice-based cryptography and analyze instruction counts for vectorized
implementations. To our knowledge, our measurements provide the first
empirical cycle-count decomposition that separates the compiler's contribution
from that of hand-written SIMD for the ML-KEM NTT specifically.
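
A decomposition of this kind can be written as a product of ratios. As a
sketch, let $T_{\mathrm{scalar}}$, $T_{\mathrm{auto}}$, and
$T_{\mathrm{avx2}}$ denote median cycle counts for a non-vectorized build, a
compiler-vectorized build, and the hand-written AVX2 code; these three labels
are hypothetical and introduced only for this illustration:
\[
  \frac{T_{\mathrm{scalar}}}{T_{\mathrm{avx2}}}
  \;=\;
  \frac{T_{\mathrm{scalar}}}{T_{\mathrm{auto}}}
  \cdot
  \frac{T_{\mathrm{auto}}}{T_{\mathrm{avx2}}} ,
\]
so the compiler's share and the hand-written share of the speedup are additive
on a logarithmic scale.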

\paragraph{Hardware counter profiling.}
Bernstein and Schwabe~\cite{cachetime} discuss the relationship between cache
behavior and cryptographic timing. PAPI~\cite{papi} provides a portable
interface to hardware performance counters used in related profiling work.
Phase~2 of this study will add PAPI counter collection to provide a
mechanistic, hardware-level explanation of the speedups observed here.