where-simd-helps/paper/sections/methodology.tex

% ── 3. Methodology ────────────────────────────────────────────────────────────
\section{Methodology}
\label{sec:methodology}

\subsection{Implementation Source}

We use the \mlkem{} reference implementation from the \texttt{pq-crystals/kyber}
repository~\cite{kyber-avx2}, which provides both a portable C reference
(\varref{} / \varrefnv{}) and hand-written AVX2 assembly (\varavx{}). The
implementation targets the CRYSTALS-Kyber specification, functionally identical
to FIPS~203.

\subsection{Compilation Variants}
\label{sec:meth:variants}

We compile the same C source under four variant configurations using GCC 13.3.0:

\begin{description}
  \item[\varrefo{}] \texttt{-O0}: unoptimized. Every operation is loaded/stored
        through memory; no inlining, no register allocation. Establishes a
        reproducible performance floor.
  \item[\varrefnv{}] \texttt{-O3 -fno-tree-vectorize}: aggressive scalar
        optimization but with the tree-vectorizer disabled. Isolates the
        auto-vectorization contribution from general O3 optimizations.
  \item[\varref{}] \texttt{-O3}: full optimization with GCC auto-vectorization
        enabled. Represents realistic scalar-C performance.
  \item[\varavx{}] \texttt{-O3} with hand-written AVX2 assembly linked in:
        the production optimized path.
\end{description}

All four variants are built with position-independent code and identical linker
flags. The AVX2 assembly sources use the same \texttt{KYBER\_NAMESPACE} macro
as the C sources to prevent symbol collisions.

\subsection{Benchmark Harness}

Each binary runs a \emph{spin loop}: $N = 1{,}000$ outer iterations (spins),
each performing 20~repetitions of the target operation followed by a median
and mean cycle count report via \texttt{RDTSC}. Using the median of 20
repetitions per spin suppresses within-spin outliers; collecting 1{,}000 spins
produces a distribution of 1{,}000 median observations per binary invocation.

Two independent job submissions per (algorithm, variant) pair yield
$n \ge 2{,}000$ independent observations per group (3{,}000 for \varref{} and
\varavx{}, which had a third clean run). All runs used \texttt{taskset} to pin
to a single logical core, preventing OS scheduling interference.

\subsection{Hardware Platform}

All benchmarks were conducted on Brown University's OSCAR HPC cluster, node
\texttt{node2334}, pinned via SLURM's \texttt{{-}{-}nodelist} directive to
ensure all variants measured on identical hardware. The node specifications are:

\begin{center}
\small
\begin{tabular}{ll}
\toprule
CPU model     & Intel Xeon Platinum 8268 (Cascade Lake) \\
Clock speed   & 2.90\,GHz base \\
ISA extensions & SSE4.2, AVX, AVX2, AVX-512F \\
L1D cache     & 32\,KB (per core) \\
L2 cache      & 1\,MB (per core) \\
L3 cache      & 35.75\,MB (shared) \\
OS            & Linux (kernel 3.10) \\
Compiler      & GCC 13.3.0 \\
\bottomrule
\end{tabular}
\end{center}

\noindent\textbf{Reproducibility note:} The \texttt{perf\_event\_paranoid}
setting on OSCAR nodes is 2, which prevents unprivileged access to hardware
performance counters. Hardware counter data (IPC, cache miss rates) will be
collected in Phase~2 after requesting elevated permissions from the cluster
administrators. \phasetwo{Hardware counter collection via PAPI.}

\subsection{Statistical Methodology}
\label{sec:meth:stats}

Cycle count distributions are right-skewed with occasional outliers from
OS interrupts and cache-cold starts (Figure~\ref{fig:distributions}). We
therefore use nonparametric statistics throughout:

\begin{itemize}
  \item \textbf{Speedup}: ratio of group medians, $\hat{s} =
        \text{median}(X_\text{baseline}) / \text{median}(X_\text{variant})$.
  \item \textbf{Confidence interval}: 95\% bootstrap CI on $\hat{s}$,
        computed by resampling both groups independently $B = 5{,}000$ times
        with replacement.
  \item \textbf{Mann-Whitney U test}: one-sided test for the hypothesis that
        the variant distribution is stochastically smaller than the baseline
        ($H_1: P(X_\text{variant} < X_\text{baseline}) > 0.5$).
  \item \textbf{Cliff's $\delta$}: effect size defined as $\delta =
        [P(X_\text{variant} < X_\text{baseline}) -
         P(X_\text{variant} > X_\text{baseline})]$, derived from the
        Mann-Whitney U statistic. $\delta = +1$ indicates that
        \emph{every} variant observation is faster than \emph{every}
        baseline observation.
\end{itemize}

\subsection{Energy Measurement}
\label{sec:meth:energy}
\phasetwo{Intel RAPL (pkg + DRAM domains), EDP computation, per-operation joules.}
Energy measurements via Intel RAPL will be incorporated in Phase~2. The harness
already includes conditional RAPL support (\texttt{-DWITH\_RAPL=ON}) pending
appropriate system permissions.