% ── 6. Related Work ───────────────────────────────────────────────────────────
\section{Related Work}
\label{sec:related}

\paragraph{ML-KEM / Kyber implementations.}
The AVX2 implementation studied here was developed by Schwabe and Seiler~\cite{kyber-avx2} and forms the optimized path in both the \texttt{pq-crystals/kyber} reference repository and PQClean~\cite{pqclean}. Bos et al.~\cite{kyber2018} describe the original Kyber submission; FIPS~203~\cite{fips203} is the standardized form. ARM NEON and Cortex-M4 implementations also exist, the latter collected in pqm4~\cite{pqm4}; a cross-ISA comparison is planned for Phase~3.

\paragraph{PQC benchmarking.}
eBACS/SUPERCOP provides a cross-platform benchmark suite~\cite{supercop} that reports median cycle counts for many cryptographic primitives, including Kyber. Our contribution complements this with a statistically rigorous decomposition of the observed speedups, using nonparametric effect-size analysis and bootstrapped confidence intervals. Kannwischer et al.~\cite{pqm4} present systematic benchmarks on ARM Cortex-M4 (pqm4), focusing on constrained-device performance rather than SIMD analysis.

\paragraph{SIMD in cryptography.}
Gueron and Krasnov demonstrated AVX2 speedups for AES-GCM~\cite{gueron2014}; similar techniques underpin the Kyber AVX2 implementation. Bernstein's vectorized polynomial arithmetic for Curve25519~\cite{bernstein2006} established the template of hand-written vector intrinsics for cryptographic field arithmetic.

\paragraph{NTT optimization.}
Longa and Naehrig~\cite{ntt-survey} survey NTT algorithms for ideal lattice-based cryptography and analyze instruction counts for vectorized implementations. To our knowledge, our measurements provide the first empirical cycle-count decomposition that separates the compiler's contribution from that of hand-written SIMD for the ML-KEM NTT specifically.

\paragraph{Hardware counter profiling.}
Bernstein and Schwabe~\cite{cachetime} discuss the relationship between cache behavior and cryptographic timing.
PAPI~\cite{papi} provides a portable interface to hardware performance counters and is used in related profiling work. Phase~2 of this study will add PAPI counter collection to provide a mechanistic, hardware-level explanation of the speedups observed here.