% ── 2. Background ─────────────────────────────────────────────────────────────
\section{Background}
\label{sec:background}

\subsection{ML-KEM and the Number Theoretic Transform}

\mlkem{}~\cite{fips203} is a key encapsulation mechanism built on the Module-Learning-With-Errors (Module-LWE) problem. Its security parameter $k \in \{2, 3, 4\}$ controls the module dimension, yielding the three instantiations \mlkemk{512}, \mlkemk{768}, and \mlkemk{1024}. The scheme operates on polynomials in $\mathbb{Z}_q[x]/(x^{256}+1)$ with $q = 3329$.

The computational core is polynomial multiplication, which \mlkem{} evaluates using the Number Theoretic Transform (NTT)~\cite{ntt-survey}. The NTT is a modular analog of the Fast Fourier Transform: it replaces schoolbook $O(n^2)$ polynomial multiplication with an $O(n \log n)$ transform, a linear-time pointwise step, and an inverse transform. Because $q \equiv 1 \pmod{256}$ but $q \not\equiv 1 \pmod{512}$, \mlkem{} uses an \emph{incomplete} seven-layer NTT built from radix-2 Cooley--Tukey butterflies: the transform maps each 256-coefficient polynomial to 128 degree-one residues, so multiplication in the NTT domain consists of 128 independent products of linear polynomials.

The primitive operations benchmarked in this paper are:
\begin{itemize}
\item \op{NTT} / \op{INVNTT}: forward and inverse NTT over a single polynomial ($n = 256$).
\item \op{basemul}: pointwise multiplication in the NTT domain (base multiplication of two NTT-domain polynomials).
\item \op{poly\_frommsg}: encodes a 32-byte message into a polynomial.
\item \op{gen\_a}: generates the public matrix $\mathbf{A}$ by expanding a seed with SHAKE-128.
\item \op{poly\_getnoise\_eta\{1,2\}}: samples a centered binomial distribution (CBD) noise polynomial from SHAKE-256 output.
\item \op{indcpa\_\{keypair, enc, dec\}}: full IND-CPA key generation, encryption, and decryption.
\end{itemize}

\subsection{AVX2 SIMD on x86-64}

Intel's Advanced Vector Extensions 2 (AVX2) extends integer SIMD operations to the 256-bit YMM registers, so a single instruction can process sixteen 16-bit integers.
The \mlkem{} AVX2 implementation~\cite{kyber-avx2} by Schwabe and Seiler uses hand-written assembly and intrinsics rather than compiler-generated vectorized code. The key instruction patterns exploited are:
\begin{itemize}
\item \texttt{vpaddw} / \texttt{vpsubw}: packed 16-bit addition/subtraction, operating on 16 coefficients per instruction.
\item \texttt{vpmullw} / \texttt{vpmulhw}: packed 16-bit low/high multiply, used to implement 16-wide Montgomery reduction.
\item \texttt{vpunpcklwd} / \texttt{vpunpckhwd}: interleave operations realizing the NTT butterfly shuffle pattern.
\end{itemize}
Because \mlkem{} coefficients are 16-bit integers and the butterflies within each NTT layer are independent, AVX2 can process 16 coefficient pairs per instruction, offering a theoretical $16\times$ instruction-count reduction for the arithmetic steps. As \S\ref{sec:results} shows, the observed speedups \emph{exceed} $16\times$ for \op{INVNTT} and \op{basemul} because the unrolled hand-written loops also expose additional instruction-level parallelism (ILP).

\subsection{Compilation Variants}

To isolate distinct sources of speedup, we define four compilation variants (detailed in \S\ref{sec:methodology}):
\begin{description}
\item[\varrefo{}] Compiled at \texttt{-O0}: no optimization. Serves as the unoptimized baseline.
\item[\varrefnv{}] Compiled at \texttt{-O3 -fno-tree-vectorize}: full compiler optimization with auto-vectorization disabled. Isolates the contribution of general compiler optimizations (register allocation, loop unrolling, constant propagation) from that of SIMD.
\item[\varref{}] Compiled at \texttt{-O3}: full optimization including GCC's auto-vectorizer. Represents what production deployments without hand-tuned SIMD would achieve.
\item[\varavx{}] Hand-written AVX2 assembly: the production-quality optimized implementation.
\end{description}

\subsection{Hardware Performance Counters and Energy}
\label{sec:bg:papi}

\phasetwo{Expand with PAPI and RAPL background once data is collected.}

Hardware performance counters (accessed via PAPI~\cite{papi} or Linux \texttt{perf\_event}) allow measuring instructions per cycle (IPC), cache miss rates, and branch mispredictions at per-routine granularity. Intel RAPL~\cite{rapl} provides package- and DRAM-domain energy readings. These will be incorporated in Phase~2 to provide a mechanistic hardware-level explanation complementing the cycle-count analysis presented here.