106 lines
4.8 KiB
TeX
106 lines
4.8 KiB
TeX
% ── 3. Methodology ────────────────────────────────────────────────────────────
|
|
\section{Methodology}
|
|
\label{sec:methodology}
|
|
|
|
\subsection{Implementation Source}
|
|
|
|
We use the \mlkem{} reference implementation from the \texttt{pq-crystals/kyber}
|
|
repository~\cite{kyber-avx2}, which provides both a portable C reference
|
|
(\varref{} / \varrefnv{}) and hand-written AVX2 assembly (\varavx{}). The
|
|
implementation targets the CRYSTALS-Kyber specification, functionally identical
|
|
to FIPS~203.
|
|
|
|
\subsection{Compilation Variants}
|
|
\label{sec:meth:variants}
|
|
|
|
We compile the same C source under four variant configurations using GCC 13.3.0:
|
|
|
|
\begin{description}
|
|
\item[\varrefo{}] \texttt{-O0}: unoptimized. Every operation is loaded/stored
|
|
through memory; no inlining, no register allocation. Establishes a
|
|
reproducible performance floor.
|
|
\item[\varrefnv{}] \texttt{-O3 -fno-tree-vectorize}: aggressive scalar
|
|
optimization but with the tree-vectorizer disabled. Isolates the
|
|
auto-vectorization contribution from general O3 optimizations.
|
|
\item[\varref{}] \texttt{-O3}: full optimization with GCC auto-vectorization
|
|
enabled. Represents realistic scalar-C performance.
|
|
\item[\varavx{}] \texttt{-O3} with hand-written AVX2 assembly linked in:
|
|
the production optimized path.
|
|
\end{description}
|
|
|
|
All four variants are built with position-independent code and identical linker
|
|
flags. The AVX2 assembly sources use the same \texttt{KYBER\_NAMESPACE} macro
|
|
as the C sources to prevent symbol collisions.
|
|
|
|
\subsection{Benchmark Harness}
|
|
|
|
Each binary runs a \emph{spin loop}: $N = 1{,}000$ outer iterations (spins),
|
|
each performing 20~repetitions of the target operation followed by a median
|
|
and mean cycle count report via \texttt{RDTSC}. Using the median of 20
|
|
repetitions per spin suppresses within-spin outliers; collecting 1{,}000 spins
|
|
produces a distribution of 1{,}000 median observations per binary invocation.
|
|
|
|
Two independent job submissions per (algorithm, variant) pair yield
|
|
$n \ge 2{,}000$ independent observations per group (3{,}000 for \varref{} and
|
|
\varavx{}, which had a third clean run). All runs used \texttt{taskset} to pin
|
|
to a single logical core, preventing OS scheduling interference.
|
|
|
|
\subsection{Hardware Platform}
|
|
|
|
All benchmarks were conducted on Brown University's OSCAR HPC cluster, node
|
|
\texttt{node2334}, pinned via SLURM's \texttt{{-}{-}nodelist} directive to
|
|
ensure all variants measured on identical hardware. The node specifications are:
|
|
|
|
\begin{center}
|
|
\small
|
|
\begin{tabular}{ll}
|
|
\toprule
|
|
CPU model & Intel Xeon Platinum 8268 (Cascade Lake) \\
|
|
Clock speed & 2.90\,GHz base \\
|
|
ISA extensions & SSE4.2, AVX, AVX2, AVX-512F \\
|
|
L1D cache & 32\,KB (per core) \\
|
|
L2 cache & 1\,MB (per core) \\
|
|
L3 cache & 35.75\,MB (shared) \\
|
|
OS & Linux (kernel 3.10) \\
|
|
Compiler & GCC 13.3.0 \\
|
|
\bottomrule
|
|
\end{tabular}
|
|
\end{center}
|
|
|
|
\noindent\textbf{Reproducibility note:} The \texttt{perf\_event\_paranoid}
|
|
setting on OSCAR nodes is 2, which prevents unprivileged access to hardware
|
|
performance counters. Hardware counter data (IPC, cache miss rates) will be
|
|
collected in Phase~2 after requesting elevated permissions from the cluster
|
|
administrators. \phasetwo{Hardware counter collection via PAPI.}
|
|
|
|
\subsection{Statistical Methodology}
|
|
\label{sec:meth:stats}
|
|
|
|
Cycle count distributions are right-skewed with occasional outliers from
|
|
OS interrupts and cache-cold starts (Figure~\ref{fig:distributions}). We
|
|
therefore use nonparametric statistics throughout:
|
|
|
|
\begin{itemize}
|
|
\item \textbf{Speedup}: ratio of group medians, $\hat{s} =
|
|
\text{median}(X_\text{baseline}) / \text{median}(X_\text{variant})$.
|
|
\item \textbf{Confidence interval}: 95\% bootstrap CI on $\hat{s}$,
|
|
computed by resampling both groups independently $B = 5{,}000$ times
|
|
with replacement.
|
|
\item \textbf{Mann-Whitney U test}: one-sided test for the hypothesis that
|
|
the variant distribution is stochastically smaller than the baseline
|
|
($H_1: P(X_\text{variant} < X_\text{baseline}) > 0.5$).
|
|
\item \textbf{Cliff's $\delta$}: effect size defined as $\delta =
|
|
[P(X_\text{variant} < X_\text{baseline}) -
|
|
P(X_\text{variant} > X_\text{baseline})]$, derived from the
|
|
Mann-Whitney U statistic. $\delta = +1$ indicates that
|
|
\emph{every} variant observation is faster than \emph{every}
|
|
baseline observation.
|
|
\end{itemize}
|
|
|
|
\subsection{Energy Measurement}
|
|
\label{sec:meth:energy}
|
|
\phasetwo{Intel RAPL (pkg + DRAM domains), EDP computation, per-operation joules.}
|
|
Energy measurements via Intel RAPL will be incorporated in Phase~2. The harness
|
|
already includes conditional RAPL support (\texttt{-DWITH\_RAPL=ON}) pending
|
|
appropriate system permissions.
|