% ── 2. Background ─────────────────────────────────────────────────────────────
\section{Background}
\label{sec:background}
\subsection{ML-KEM and the Number Theoretic Transform}
\mlkem{}~\cite{fips203} is a key encapsulation mechanism built on the
Module-Learning-With-Errors (Module-LWE) problem. Its security parameter
$k \in \{2, 3, 4\}$ controls the module dimension, yielding the three
instantiations \mlkemk{512}, \mlkemk{768}, and \mlkemk{1024}. The scheme
operates on polynomials in $\mathbb{Z}_q[x]/(x^{256}+1)$ with $q = 3329$.
The computational core is polynomial multiplication, which \mlkem{} evaluates
using the Number Theoretic Transform (NTT)~\cite{ntt-survey}. The NTT is a
modular analog of the Fast Fourier Transform that replaces schoolbook
$O(n^2)$ polynomial multiplication with an $O(n \log n)$ transform followed
by coefficient-wise products. Because $q = 3329$ satisfies $q \equiv 1
\pmod{256}$ but not $q \equiv 1 \pmod{512}$, the \mlkem{} NTT is incomplete:
seven layers of radix-2 Cooley--Tukey butterflies split a polynomial of
$n = 256$ coefficients into 128 degree-one residue polynomials, and
multiplication in the NTT domain operates on these degree-one factors.
The primitive operations benchmarked in this paper are:
\begin{itemize}
\item \op{NTT} / \op{INVNTT}: forward and inverse NTT over a single
polynomial ($n = 256$).
\item \op{basemul}: base multiplication of two NTT-domain polynomials
(the pointwise step of NTT-based multiplication).
\item \op{poly\_frommsg}: encodes a 32-byte message into a polynomial.
\item \op{gen\_a}: generates the public matrix $\mathbf{A}$ by expanding
a seed with SHAKE-128.
\item \op{poly\_getnoise\_eta\{1,2\}}: samples a centered binomial
distribution (CBD) noise polynomial using SHAKE-256 output.
\item \op{indcpa\_\{keypair, enc, dec\}}: full IND-CPA key generation,
encryption, and decryption.
\end{itemize}
\subsection{AVX2 SIMD on x86-64}
Intel's Advanced Vector Extensions 2 (AVX2) extends integer SIMD operations
to the 256-bit YMM registers, each accommodating sixteen 16-bit integers. The
\mlkem{} AVX2 implementation~\cite{kyber-avx2} by Schwabe and Seiler uses
hand-written assembly rather than compiler intrinsics or auto-vectorized code.
The key instruction patterns exploited are:
\begin{itemize}
\item \texttt{vpaddw} / \texttt{vpsubw}: packed 16-bit addition/subtraction,
operating on 16 coefficients per instruction.
\item \texttt{vpmullw} / \texttt{vpmulhw}: packed 16-bit low/high multiply,
used to implement 16-wide Montgomery reduction.
\item \texttt{vpunpcklwd} / \texttt{vpunpckhwd}: interleave operations for
the NTT butterfly shuffle pattern.
\end{itemize}
Because \mlkem{} coefficients are 16-bit integers and the butterflies within
each NTT layer are mutually independent, AVX2 offers a theoretical
$16\times$ instruction-count reduction for the arithmetic steps. As
\S\ref{sec:results} shows, observed speedups \emph{exceed} $16\times$ for
\op{INVNTT} and \op{basemul} due to additional instruction-level parallelism
(ILP) in the unrolled hand-written loops.
\subsection{Compilation Variants}
To isolate distinct sources of speedup, we define four compilation variants
(detailed in \S\ref{sec:methodology}):
\begin{description}
\item[\varrefo{}] Compiled at \texttt{-O0}: no optimization. Serves as the
unoptimized baseline.
\item[\varrefnv{}] Compiled at \texttt{-O3 -fno-tree-vectorize}: full
compiler optimization but with auto-vectorization disabled. Isolates
the contribution of general compiler optimizations (register
allocation, loop unrolling, constant propagation) from SIMD.
\item[\varref{}] Compiled at \texttt{-O3}: full optimization including GCC's
auto-vectorizer. Represents what production deployments without
hand-tuned SIMD would achieve.
\item[\varavx{}] Hand-written AVX2 assembly: the production-quality
optimized implementation.
\end{description}
\subsection{Hardware Performance Counters and Energy}
\label{sec:bg:papi}
\phasetwo{Expand with PAPI and RAPL background once data is collected.}
Hardware performance counters (accessed via PAPI~\cite{papi} or Linux
\texttt{perf\_event}) allow measuring IPC, cache miss rates, and branch
mispredictions at the instruction level. Intel RAPL~\cite{rapl} provides
package- and DRAM-domain energy readings. These will be incorporated in
Phase~2 to provide a mechanistic hardware-level explanation complementing the
cycle-count analysis presented here.