% ── 2. Background ─────────────────────────────────────────────────────────────
\section{Background}
\label{sec:background}

\subsection{ML-KEM and the Number Theoretic Transform}

\mlkem{}~\cite{fips203} is a key encapsulation mechanism built on the
Module-Learning-With-Errors (Module-LWE) problem. Its security parameter
$k \in \{2, 3, 4\}$ controls the module dimension, yielding the three
instantiations \mlkemk{512}, \mlkemk{768}, and \mlkemk{1024}. The scheme
operates on polynomials in $\mathbb{Z}_q[x]/(x^{256}+1)$ with $q = 3329$.

The computational core is polynomial multiplication, which \mlkem{} evaluates
using the Number Theoretic Transform (NTT)~\cite{ntt-survey}. The NTT is a
modular analog of the Fast Fourier Transform that reduces schoolbook
$O(n^2)$ polynomial multiplication to $O(n \log n)$ transform work plus
$O(n)$ pointwise operations. For $n = 256$ and $q = 3329$ the NTT is
incomplete: $\mathbb{Z}_q$ contains a primitive 256th root of unity but no
512th root, so seven layers of radix-2 Cooley-Tukey butterflies split a
polynomial into 128 residues modulo quadratic factors, and multiplication
in the NTT domain reduces to 128 independent products of linear polynomials.

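The Cooley-Tukey butterfly can be sketched in C. The following is a minimal
illustration, not the reference code: it uses a plain \texttt{\%} reduction
where optimized implementations use Montgomery or Barrett arithmetic, and
the function names are ours.

```c
#include <stdint.h>

#define KYBER_Q 3329

/* Reduce a 32-bit intermediate into [0, q). Optimized code replaces
   the '%' operator with Montgomery or Barrett reduction. */
static int16_t reduce(int32_t a) {
    int16_t r = (int16_t)(a % KYBER_Q);
    if (r < 0)
        r += KYBER_Q;
    return r;
}

/* One radix-2 Cooley-Tukey butterfly: replaces (a, b) with
   (a + zeta*b, a - zeta*b) mod q, where zeta is a power of the
   primitive 256th root of unity (17 for q = 3329). */
static void ct_butterfly(int16_t *a, int16_t *b, int16_t zeta) {
    int16_t t = reduce((int32_t)zeta * (*b));
    *b = reduce((int32_t)(*a) - t);
    *a = reduce((int32_t)(*a) + t);
}
```

Each NTT layer applies 128 such butterflies with layer-specific twiddle
factors; because the butterflies within a layer touch disjoint coefficient
pairs, they vectorize cleanly.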
The primitive operations benchmarked in this paper are:
\begin{itemize}
  \item \op{NTT} / \op{INVNTT}: forward and inverse NTT over a single
    polynomial ($n = 256$).
  \item \op{basemul}: pointwise multiplication in the NTT domain (base
    multiplication of two NTT-domain polynomials).
  \item \op{poly\_frommsg}: encodes a 32-byte message into a polynomial.
  \item \op{gen\_a}: generates the public matrix $\mathbf{A}$ by expanding
    a seed with SHAKE-128.
  \item \op{poly\_getnoise\_eta\{1,2\}}: samples a centered binomial
    distribution (CBD) noise polynomial using SHAKE-256 output.
  \item \op{indcpa\_\{keypair, enc, dec\}}: full IND-CPA key generation,
    encryption, and decryption.
\end{itemize}

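In \mlkem{}, \op{basemul} multiplies coefficient pairs as linear polynomials
modulo $x^2 - \zeta$, with a per-pair twiddle factor $\zeta$. One such
product can be sketched as follows (a simplified C sketch with a
\texttt{\%}-based reduction; the naming is ours, not the reference API):

```c
#include <stdint.h>

#define KYBER_Q 3329

/* Reduce a 32-bit intermediate into [0, q). */
static int16_t reduce(int32_t a) {
    int16_t r = (int16_t)(a % KYBER_Q);
    if (r < 0)
        r += KYBER_Q;
    return r;
}

/* (a0 + a1*x)(b0 + b1*x) mod (x^2 - zeta):
   r0 = a0*b0 + zeta*a1*b1,  r1 = a0*b1 + a1*b0   (all mod q) */
static void basemul_pair(int16_t r[2], const int16_t a[2],
                         const int16_t b[2], int16_t zeta) {
    r[0] = reduce((int32_t)a[0] * b[0]
                  + (int32_t)zeta * reduce((int32_t)a[1] * b[1]));
    r[1] = reduce((int32_t)a[0] * b[1] + (int32_t)a[1] * b[0]);
}
```

A full \op{basemul} applies this to all 128 pairs of an NTT-domain
polynomial.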
\subsection{AVX2 SIMD on x86-64}

Intel's Advanced Vector Extensions 2 (AVX2) widen integer SIMD to the
256-bit YMM registers, so a single register holds sixteen 16-bit integers.
The \mlkem{} AVX2 implementation~\cite{kyber-avx2} by Schwabe and Seiler
uses hand-written assembly rather than compiler-generated vectorized code.

The key instruction patterns exploited are:
\begin{itemize}
  \item \texttt{vpaddw} / \texttt{vpsubw}: packed 16-bit addition/subtraction,
    operating on 16 coefficients per instruction.
  \item \texttt{vpmullw} / \texttt{vpmulhw}: packed 16-bit low/high multiply,
    used to implement 16-wide Montgomery reduction.
  \item \texttt{vpunpcklwd} / \texttt{vpunpckhwd}: interleave operations for
    the NTT butterfly shuffle pattern.
\end{itemize}

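The low/high multiply pair maps directly onto Montgomery reduction: the low
multiply forms a 16-bit quotient estimate, and the high multiply extracts
the reduced result. A scalar C sketch of the same arithmetic (the AVX2 code
performs it across 16 lanes at once; this mirrors the scalar pattern used
in the reference implementation):

```c
#include <stdint.h>

#define KYBER_Q 3329
#define QINV    (-3327)   /* q^{-1} mod 2^16 */

/* Montgomery reduction: for |a| < q * 2^15, returns r with
   r == a * 2^{-16} (mod q) and |r| < q. The truncating 16-bit
   multiply corresponds to vpmullw; the final >> 16 keeps the
   high half, corresponding to vpmulhw. */
static int16_t montgomery_reduce(int32_t a) {
    int16_t t = (int16_t)((int16_t)a * QINV); /* low 16 bits only */
    return (int16_t)((a - (int32_t)t * KYBER_Q) >> 16);
}
```

Results carry an extra factor of $2^{-16}$, which implementations absorb by
premultiplying twiddle factors with $2^{16} \bmod q$.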
Because \mlkem{} coefficients are 16-bit integers and the butterflies
within each NTT layer are mutually independent, AVX2 can process 16
coefficient pairs per instruction, for a theoretical $16\times$
instruction-count reduction on arithmetic steps. As \S\ref{sec:results}
shows, observed speedups \emph{exceed} $16\times$ for \op{INVNTT} and
\op{basemul} due to additional instruction-level parallelism (ILP) in the
unrolled hand-written loops.

\subsection{Compilation Variants}

To isolate distinct sources of speedup, we define four compilation variants
(detailed in \S\ref{sec:methodology}):

\begin{description}
  \item[\varrefo{}] Compiled at \texttt{-O0}: no optimization. Serves as the
    unoptimized baseline.
  \item[\varrefnv{}] Compiled at \texttt{-O3 -fno-tree-vectorize}: full
    compiler optimization but with auto-vectorization disabled. Isolates
    the contribution of general compiler optimizations (register
    allocation, loop unrolling, constant propagation) from SIMD.
  \item[\varref{}] Compiled at \texttt{-O3}: full optimization including GCC's
    auto-vectorizer. Represents what production deployments without
    hand-tuned SIMD would achieve.
  \item[\varavx{}] Hand-written AVX2 assembly: the production-quality
    optimized implementation.
\end{description}

\subsection{Hardware Performance Counters and Energy}
\label{sec:bg:papi}
\phasetwo{Expand with PAPI and RAPL background once data is collected.}

Hardware performance counters (accessed via PAPI~\cite{papi} or Linux
\texttt{perf\_event}) allow measuring IPC, cache miss rates, and branch
mispredictions at the instruction level. Intel RAPL~\cite{rapl} provides
package- and DRAM-domain energy readings. These will be incorporated in
Phase~2 to provide a mechanistic hardware-level explanation complementing the
cycle-count analysis presented here.
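On Linux, RAPL readings are exposed through the powercap interface (e.g.\
\texttt{/sys/class/powercap/intel-rapl:0/energy\_uj}); the counter wraps
around at \texttt{max\_energy\_range\_uj}, so energy deltas over a
benchmark run must be computed wrap-aware. A minimal sketch, assuming at
most one wrap between readings (the helper name is ours):

```c
#include <stdint.h>

/* Difference between two RAPL energy readings in microjoules,
   assuming the counter wrapped around at most once between them. */
static uint64_t rapl_delta_uj(uint64_t before, uint64_t after,
                              uint64_t max_range_uj) {
    if (after >= before)
        return after - before;
    return (max_range_uj - before) + after; /* one wraparound */
}
```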