% ── 2. Background ─────────────────────────────────────────────────────────────
\section{Background}
\label{sec:background}
\subsection{ML-KEM and the Number Theoretic Transform}
\mlkem{}~\cite{fips203} is a key encapsulation mechanism built on the
Module-Learning-With-Errors (Module-LWE) problem. Its security parameter
$k \in \{2, 3, 4\}$ controls the module dimension, yielding the three
instantiations \mlkemk{512}, \mlkemk{768}, and \mlkemk{1024}. The scheme
operates on polynomials in $\mathbb{Z}_q[x]/(x^{256}+1)$ with $q = 3329$.
The computational core is polynomial multiplication, which \mlkem{} evaluates
using the Number Theoretic Transform (NTT)~\cite{ntt-survey}. The NTT is a
modular analog of the Fast Fourier Transform that replaces schoolbook
$O(n^2)$ polynomial multiplication with an $O(n \log n)$ transform followed
by coefficient-wise products. Because $q = 3329$ satisfies $q \equiv 1
\pmod{256}$ but not $q \equiv 1 \pmod{512}$, the \mlkem{} NTT is incomplete:
seven layers of radix-2 Cooley--Tukey butterflies split a polynomial of
$n = 256$ coefficients into 128 degree-one residue polynomials, and
multiplication in the NTT domain operates on these degree-one factors.
The primitive operations benchmarked in this paper are:
\begin{itemize}
\item \op{NTT} / \op{INVNTT}: forward and inverse NTT over a single
polynomial ($n = 256$).
\item \op{basemul}: base multiplication of two NTT-domain polynomials
(the pointwise step of NTT-based multiplication).
\item \op{poly\_frommsg}: encodes a 32-byte message into a polynomial.
\item \op{gen\_a}: generates the public matrix $\mathbf{A}$ by expanding
a seed with SHAKE-128.
\item \op{poly\_getnoise\_eta\{1,2\}}: samples a centered binomial
distribution (CBD) noise polynomial using SHAKE-256 output.
\item \op{indcpa\_\{keypair, enc, dec\}}: full IND-CPA key generation,
encryption, and decryption.
\end{itemize}
\subsection{AVX2 SIMD on x86-64}
Intel's Advanced Vector Extensions 2 (AVX2) extends integer SIMD operations
to the 256-bit YMM registers, each accommodating sixteen 16-bit integers. The
\mlkem{} AVX2 implementation~\cite{kyber-avx2} by Schwabe and Seiler uses
hand-written assembly rather than compiler intrinsics or auto-vectorized code.
The key instruction patterns exploited are:
\begin{itemize}
\item \texttt{vpaddw} / \texttt{vpsubw}: packed 16-bit addition/subtraction,
operating on 16 coefficients per instruction.
\item \texttt{vpmullw} / \texttt{vpmulhw}: packed 16-bit low/high multiply,
used to implement 16-wide Montgomery reduction.
\item \texttt{vpunpcklwd} / \texttt{vpunpckhwd}: interleave operations for
the NTT butterfly shuffle pattern.
\end{itemize}
Because \mlkem{} coefficients are 16-bit integers and the butterflies within
each NTT layer are mutually independent, AVX2 offers a theoretical
$16\times$ instruction-count reduction for the arithmetic steps. As
\S\ref{sec:results} shows, observed speedups \emph{exceed} $16\times$ for
\op{INVNTT} and \op{basemul} due to additional instruction-level parallelism
(ILP) in the unrolled hand-written loops.
\subsection{Compilation Variants}
To isolate distinct sources of speedup, we define four compilation variants
(detailed in \S\ref{sec:methodology}):
\begin{description}
\item[\varrefo{}] Compiled at \texttt{-O0}: no optimization. Serves as the
unoptimized baseline.
\item[\varrefnv{}] Compiled at \texttt{-O3 -fno-tree-vectorize}: full
compiler optimization but with auto-vectorization disabled. Isolates
the contribution of general compiler optimizations (register
allocation, loop unrolling, constant propagation) from SIMD.
\item[\varref{}] Compiled at \texttt{-O3}: full optimization including GCC's
auto-vectorizer. Represents what production deployments without
hand-tuned SIMD would achieve.
\item[\varavx{}] Hand-written AVX2 assembly: the production-quality
optimized implementation.
\end{description}
\subsection{Hardware Performance Counters and Energy}
\label{sec:bg:papi}
\phasetwo{Expand with PAPI and RAPL background once data is collected.}
Hardware performance counters (accessed via PAPI~\cite{papi} or Linux
\texttt{perf\_event}) allow measuring IPC, cache miss rates, and branch
mispredictions at the instruction level. Intel RAPL~\cite{rapl} provides
package- and DRAM-domain energy readings. These will be incorporated in
Phase~2 to provide a mechanistic hardware-level explanation complementing the
cycle-count analysis presented here.