% ── 3. Methodology ──────────────────────────────────────────────────────────── \section{Methodology} \label{sec:methodology} \subsection{Implementation Source} We use the \mlkem{} reference implementation from the \texttt{pq-crystals/kyber} repository~\cite{kyber-avx2}, which provides both a portable C reference (\varref{} / \varrefnv{}) and hand-written AVX2 assembly (\varavx{}). The implementation targets the CRYSTALS-Kyber specification, functionally identical to FIPS~203. \subsection{Compilation Variants} \label{sec:meth:variants} We compile the same C source under four variant configurations using GCC 13.3.0: \begin{description} \item[\varrefo{}] \texttt{-O0}: unoptimized. Every operation is loaded/stored through memory; no inlining, no register allocation. Establishes a reproducible performance floor. \item[\varrefnv{}] \texttt{-O3 -fno-tree-vectorize}: aggressive scalar optimization but with the tree-vectorizer disabled. Isolates the auto-vectorization contribution from general O3 optimizations. \item[\varref{}] \texttt{-O3}: full optimization with GCC auto-vectorization enabled. Represents realistic scalar-C performance. \item[\varavx{}] \texttt{-O3} with hand-written AVX2 assembly linked in: the production optimized path. \end{description} All four variants are built with position-independent code and identical linker flags. The AVX2 assembly sources use the same \texttt{KYBER\_NAMESPACE} macro as the C sources to prevent symbol collisions. \subsection{Benchmark Harness} Each binary runs a \emph{spin loop}: $N = 1{,}000$ outer iterations (spins), each performing 20~repetitions of the target operation followed by a median and mean cycle count report via \texttt{RDTSC}. Using the median of 20 repetitions per spin suppresses within-spin outliers; collecting 1{,}000 spins produces a distribution of 1{,}000 median observations per binary invocation. Two independent job submissions per (algorithm, variant) pair yield $n \ge 2{,}000$ independent observations per group (3{,}000 for \varref{} and \varavx{}, which had a third clean run). All runs used \texttt{taskset} to pin to a single logical core, preventing OS scheduling interference. \subsection{Hardware Platform} All benchmarks were conducted on Brown University's OSCAR HPC cluster, node \texttt{node2334}, pinned via SLURM's \texttt{{-}{-}nodelist} directive to ensure all variants measured on identical hardware. The node specifications are: \begin{center} \small \begin{tabular}{ll} \toprule CPU model & Intel Xeon Platinum 8268 (Cascade Lake) \\ Clock speed & 2.90\,GHz base \\ ISA extensions & SSE4.2, AVX, AVX2, AVX-512F \\ L1D cache & 32\,KB (per core) \\ L2 cache & 1\,MB (per core) \\ L3 cache & 35.75\,MB (shared) \\ OS & Linux (kernel 3.10) \\ Compiler & GCC 13.3.0 \\ \bottomrule \end{tabular} \end{center} \noindent\textbf{Reproducibility note:} The \texttt{perf\_event\_paranoid} setting on OSCAR nodes is 2, which prevents unprivileged access to hardware performance counters. Hardware counter data (IPC, cache miss rates) will be collected in Phase~2 after requesting elevated permissions from the cluster administrators. \phasetwo{Hardware counter collection via PAPI.} \subsection{Statistical Methodology} \label{sec:meth:stats} Cycle count distributions are right-skewed with occasional outliers from OS interrupts and cache-cold starts (Figure~\ref{fig:distributions}). We therefore use nonparametric statistics throughout: \begin{itemize} \item \textbf{Speedup}: ratio of group medians, $\hat{s} = \text{median}(X_\text{baseline}) / \text{median}(X_\text{variant})$. \item \textbf{Confidence interval}: 95\% bootstrap CI on $\hat{s}$, computed by resampling both groups independently $B = 5{,}000$ times with replacement. \item \textbf{Mann-Whitney U test}: one-sided test for the hypothesis that the variant distribution is stochastically smaller than the baseline ($H_1: P(X_\text{variant} < X_\text{baseline}) > 0.5$). \item \textbf{Cliff's $\delta$}: effect size defined as $\delta = [P(X_\text{variant} < X_\text{baseline}) - P(X_\text{variant} > X_\text{baseline})]$, derived from the Mann-Whitney U statistic. $\delta = +1$ indicates that \emph{every} variant observation is faster than \emph{every} baseline observation. \end{itemize} \subsection{Energy Measurement} \label{sec:meth:energy} \phasetwo{Intel RAPL (pkg + DRAM domains), EDP computation, per-operation joules.} Energy measurements via Intel RAPL will be incorporated in Phase~2. The harness already includes conditional RAPL support (\texttt{-DWITH\_RAPL=ON}) pending appropriate system permissions.