where-simd-helps/paper/sections/methodology.tex

106 lines
4.8 KiB
TeX

% ── 3. Methodology ────────────────────────────────────────────────────────────
\section{Methodology}
\label{sec:methodology}
\subsection{Implementation Source}
We use the \mlkem{} reference implementation from the \texttt{pq-crystals/kyber}
repository~\cite{kyber-avx2}, which provides both a portable C reference
(\varref{} / \varrefnv{}) and hand-written AVX2 assembly (\varavx{}). The
implementation targets the CRYSTALS-Kyber specification, functionally identical
to FIPS~203.
\subsection{Compilation Variants}
\label{sec:meth:variants}
We compile the same C source under four variant configurations using GCC 13.3.0:
\begin{description}
\item[\varrefo{}] \texttt{-O0}: unoptimized. Every operation is loaded/stored
through memory; no inlining, no register allocation. Establishes a
reproducible performance floor.
\item[\varrefnv{}] \texttt{-O3 -fno-tree-vectorize}: aggressive scalar
optimization but with the tree-vectorizer disabled. Isolates the
auto-vectorization contribution from general O3 optimizations.
\item[\varref{}] \texttt{-O3}: full optimization with GCC auto-vectorization
enabled. Represents realistic scalar-C performance.
\item[\varavx{}] \texttt{-O3} with hand-written AVX2 assembly linked in:
the production optimized path.
\end{description}
All four variants are built with position-independent code and identical linker
flags. The AVX2 assembly sources use the same \texttt{KYBER\_NAMESPACE} macro
as the C sources to prevent symbol collisions.
\subsection{Benchmark Harness}
Each binary runs a \emph{spin loop}: $N = 1{,}000$ outer iterations (spins),
each performing 20~repetitions of the target operation followed by a median
and mean cycle count report via \texttt{RDTSC}. Using the median of 20
repetitions per spin suppresses within-spin outliers; collecting 1{,}000 spins
produces a distribution of 1{,}000 median observations per binary invocation.
Two independent job submissions per (algorithm, variant) pair yield
$n \ge 2{,}000$ independent observations per group (3{,}000 for \varref{} and
\varavx{}, which had a third clean run). All runs used \texttt{taskset} to pin
to a single logical core, preventing OS scheduling interference.
\subsection{Hardware Platform}
All benchmarks were conducted on Brown University's OSCAR HPC cluster, node
\texttt{node2334}, pinned via SLURM's \texttt{{-}{-}nodelist} directive to
ensure all variants measured on identical hardware. The node specifications are:
\begin{center}
\small
\begin{tabular}{ll}
\toprule
CPU model & Intel Xeon Platinum 8268 (Cascade Lake) \\
Clock speed & 2.90\,GHz base \\
ISA extensions & SSE4.2, AVX, AVX2, AVX-512F \\
L1D cache & 32\,KB (per core) \\
L2 cache & 1\,MB (per core) \\
L3 cache & 35.75\,MB (shared) \\
OS & Linux (kernel 3.10) \\
Compiler & GCC 13.3.0 \\
\bottomrule
\end{tabular}
\end{center}
\noindent\textbf{Reproducibility note:} The \texttt{perf\_event\_paranoid}
setting on OSCAR nodes is 2, which prevents unprivileged access to hardware
performance counters. Hardware counter data (IPC, cache miss rates) will be
collected in Phase~2 after requesting elevated permissions from the cluster
administrators. \phasetwo{Hardware counter collection via PAPI.}
\subsection{Statistical Methodology}
\label{sec:meth:stats}
Cycle count distributions are right-skewed with occasional outliers from
OS interrupts and cache-cold starts (Figure~\ref{fig:distributions}). We
therefore use nonparametric statistics throughout:
\begin{itemize}
\item \textbf{Speedup}: ratio of group medians, $\hat{s} =
\text{median}(X_\text{baseline}) / \text{median}(X_\text{variant})$.
\item \textbf{Confidence interval}: 95\% bootstrap CI on $\hat{s}$,
computed by resampling both groups independently $B = 5{,}000$ times
with replacement.
\item \textbf{Mann-Whitney U test}: one-sided test for the hypothesis that
the variant distribution is stochastically smaller than the baseline
($H_1: P(X_\text{variant} < X_\text{baseline}) > 0.5$).
\item \textbf{Cliff's $\delta$}: effect size defined as $\delta =
[P(X_\text{variant} < X_\text{baseline}) -
P(X_\text{variant} > X_\text{baseline})]$, derived from the
Mann-Whitney U statistic. $\delta = +1$ indicates that
\emph{every} variant observation is faster than \emph{every}
baseline observation.
\end{itemize}
\subsection{Energy Measurement}
\label{sec:meth:energy}
\phasetwo{Intel RAPL (pkg + DRAM domains), EDP computation, per-operation joules.}
Energy measurements via Intel RAPL will be incorporated in Phase~2. The harness
already includes conditional RAPL support (\texttt{-DWITH\_RAPL=ON}) pending
appropriate system permissions.