% ── 1. Introduction ───────────────────────────────────────────────────────────
\section{Introduction}
\label{sec:intro}

The 2024 NIST post-quantum cryptography standards~\cite{fips203,fips204,fips205}
mark a turning point in deployed cryptography. \mlkem{} (Module-Lattice-Based
Key-Encapsulation Mechanism, FIPS~203) is already being integrated into
TLS~1.3 by major browser vendors~\cite{bettini2024} and is planned for
inclusion in OpenSSH.
At deployment scale, performance matters: a server handling thousands of TLS
handshakes per second incurs non-trivial computational overhead when
elliptic-curve key exchange is replaced with a lattice-based KEM.

Reference implementations of \mlkem{} ship with hand-optimized AVX2 assembly
for the dominant operations~\cite{kyber-avx2}. Benchmarks routinely report
that the AVX2 path is ``$5$--$7\times$ faster'' than the portable C reference.
However, such headline numbers conflate three distinct effects: general
compiler optimization, compiler auto-vectorization, and hand-written SIMD. They
also say nothing about \emph{which} operations drive the speedup or \emph{why}
the assembly is faster than what a compiler can produce automatically.

\subsection*{Contributions}

This paper makes the following contributions:

\begin{enumerate}
\item \textbf{Three-way speedup decomposition.} We isolate compiler
optimization, auto-vectorization, and hand-written SIMD as separate
factors using four compilation variants (\S\ref{sec:methodology}).

\item \textbf{Statistically rigorous benchmarking.} All comparisons are
backed by Mann--Whitney $U$ tests and Cliff's~$\delta$ effect-size
analysis over $n \ge 2{,}000$ independent observations, with
bootstrapped 95\% confidence intervals on speedup ratios
(\S\ref{sec:results}).

\item \textbf{Mechanistic analysis without hardware counters.} We explain
the quantitative speedup pattern analytically from the structure of
the NTT butterfly, Montgomery multiplication, and the Keccak-$f[1600]$
permutation underlying SHAKE-128 (\S\ref{sec:discussion}).

\item \textbf{Open reproducible artifact.} The full pipeline from raw
SLURM outputs to publication figures is released publicly.
\end{enumerate}
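
As a reading aid for the first two contributions, the quantities involved can
be sketched as follows. The notation is an assumption made here for
illustration and need not match the definitions in \S\ref{sec:methodology} and
\S\ref{sec:results}. Suppose the four compilation variants are ordered from
least to most optimized and have typical running times $T_0, T_1, T_2, T_3$
(hypothetical labels). The end-to-end speedup then factors as a telescoping
product,
\[
  \frac{T_0}{T_3}
  \;=\; \frac{T_0}{T_1} \cdot \frac{T_1}{T_2} \cdot \frac{T_2}{T_3},
\]
so that each ratio can be attributed to one of the three factors. For two
samples $x_1, \dots, x_{n_x}$ and $y_1, \dots, y_{n_y}$, Cliff's~$\delta$ in
its standard form is
\[
  \delta \;=\;
  \frac{\#\{(i,j) : x_i > y_j\} \;-\; \#\{(i,j) : x_i < y_j\}}{n_x\, n_y}
  \;\in\; [-1, 1],
\]
and it is related to the Mann--Whitney statistic by
$\delta = 2U/(n_x n_y) - 1$, where $U$ counts the pairs with $x_i > y_j$ and
ties contribute $\tfrac{1}{2}$ each.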

\subsection*{Scope and roadmap}

This report covers Phase~1 of a broader study: \mlkem{} on Intel x86-64 with
AVX2. Planned extensions include hardware performance-counter profiles (PAPI),
energy measurement (Intel RAPL), extension to \mldsa{} (Dilithium), and
cross-ISA comparison with ARM NEON/SVE and the RISC-V vector extension (RVV).
Those results will be incorporated in subsequent revisions.