% ── 4. Results ────────────────────────────────────────────────────────────────
\section{Results} \label{sec:results}

\subsection{Cycle Count Distributions} \label{sec:results:distributions}

Figure~\ref{fig:distributions} shows the cycle count distributions for three representative operations in \mlkemk{512}, comparing \varref{} and \varavx{}. All distributions are right-skewed, with a long tail from OS interrupts and cache-cold executions. The median (dashed lines) is robust to these outliers, justifying the nonparametric approach of §\ref{sec:meth:stats}. The separation between \varref{} and \varavx{} is qualitatively different across operation types: for \op{INVNTT} the distributions do not overlap at all (disjoint spikes separated by well over an order of magnitude on the log scale); for \op{gen\_a} there is partial overlap; for noise sampling the distributions are nearly coincident.

\begin{figure}[t]
\centering
\includegraphics[width=\columnwidth]{figures/distributions.pdf}
\caption{Cycle count distributions for three representative \mlkemk{512} operations. Log $x$-axis. Dashed lines mark medians. Right-skew and outlier structure motivate nonparametric statistics.}
\label{fig:distributions}
\end{figure}

\subsection{Speedup Decomposition} \label{sec:results:decomp}

Figure~\ref{fig:decomp} shows the cumulative speedup at each optimization stage for all three \mlkem{} parameter sets. Each group of bars represents one operation; the three bars within a group show the total speedup achieved after applying (i)~O3 without auto-vec (\varrefnv{}), (ii)~O3 with auto-vec (\varref{}), and (iii)~hand-written AVX2 (\varavx{})---all normalized to the unoptimized \varrefo{} baseline. The log scale makes the three orders of magnitude of variation across operations legible.
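All speedups reported below are ratios of medians, normalized to a fixed baseline. A small simulation illustrates why a median-based ratio shrugs off the interrupt tail while a mean-based ratio does not (the cycle counts are synthetic, chosen only for illustration and unrelated to the measured data; the percentile form of the bootstrap CI is an assumption):

```python
import random
import statistics

random.seed(42)

# Synthetic cycle counts (illustration only; NOT the paper's measurements):
# tight bodies plus rare huge outliers mimicking OS interrupts.
N = 2000
ref = [random.gauss(10_000, 50) for _ in range(N)]  # scalar path
avx = [random.gauss(180, 5) for _ in range(N)]      # hand-SIMD path
for i in range(0, N, 1000):                         # two interrupt-sized outliers
    ref[i] += 500_000
    avx[i] += 500_000

# The mean-based ratio is dragged down by the tail; the median-based is not.
mean_speedup = statistics.fmean(ref) / statistics.fmean(avx)      # ~15x
median_speedup = statistics.median(ref) / statistics.median(avx)  # ~56x

def bootstrap_ci(xs, ys, reps=300, alpha=0.05):
    """Percentile-bootstrap CI on the ratio of medians (an assumed form
    of the 95% bootstrap CI): resample, recompute, take quantiles."""
    ratios = sorted(
        statistics.median(random.choices(xs, k=len(xs)))
        / statistics.median(random.choices(ys, k=len(ys)))
        for _ in range(reps)
    )
    return ratios[int(reps * alpha / 2)], ratios[int(reps * (1 - alpha / 2))]

lo, hi = bootstrap_ci(ref, avx)  # extremely tight around the median ratio
print(f"mean: {mean_speedup:.1f}x  median: {median_speedup:.1f}x  "
      f"CI: [{lo:.2f}, {hi:.2f}]")
```

With the tail present, the mean-based ratio collapses toward the outliers, while the median ratio and its bootstrap CI stay pinned to the bulk of the distribution.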
Several structural features are immediately apparent: \begin{itemize} \item The \varrefnv{} and \varref{} bars are nearly indistinguishable for arithmetic operations (NTT, INVNTT, basemul, frommsg), confirming that GCC's auto-vectorizer contributes negligibly to these operations. \item The \varavx{} bars are 1--2 orders of magnitude taller than the \varref{} bars for arithmetic operations, indicating that hand-written SIMD dominates the speedup. \item For SHAKE-heavy operations (gen\_a, noise), all three bars are much closer together, reflecting the memory-bandwidth bottleneck that limits SIMD benefit. \end{itemize} \begin{figure*}[t] \centering \input{figures/fig_decomp} \caption{Cumulative speedup at each optimization stage, normalized to \varrefo{} (1×). Three bars per operation: \textcolor{colRefnv}{$\blacksquare$}~O3 no auto-vec, \textcolor{colRef}{$\blacksquare$}~O3 + auto-vec, \textcolor{colAvx}{$\blacksquare$}~O3 + hand SIMD (AVX2). Log $y$-axis; 95\% bootstrap CI shown on \varavx{} bars. Sorted by \varavx{} speedup.} \label{fig:decomp} \end{figure*} \subsection{Hand-Written SIMD Speedup} \label{sec:results:simd} Figure~\ref{fig:handsimd} isolates the hand-written SIMD speedup (\varref{} $\to$ \varavx{}) across all three \mlkem{} parameter sets. Table~\ref{tab:simd} summarizes the numerical values. Key observations: \begin{itemize} \item \textbf{Arithmetic operations} achieve the largest speedups: \speedup{56.3} for \op{INVNTT} at \mlkemk{512}, \speedup{52.0} for \op{basemul}, and \speedup{45.6} for \op{frommsg}. The 95\% bootstrap CIs on these ratios are extremely tight (often $[\hat{s}, \hat{s}]$ to two decimal places), reflecting near-perfect measurement stability. \item \textbf{gen\_a} achieves \speedup{3.8}--\speedup{4.7}: substantially smaller than arithmetic operations because SHAKE-128 generation is memory-bandwidth limited. \item \textbf{Noise sampling} achieves only \speedup{1.2}--\speedup{1.4}, the smallest SIMD benefit. 
The centered binomial distribution (CBD) sampler is bit-manipulation-heavy, with sequential bitstream reads that do not parallelize well.
\item Speedups are broadly consistent across parameter sets for per-polynomial operations, as expected (§\ref{sec:results:crossparams}).
\end{itemize}

\begin{figure*}[t]
\centering
\input{figures/fig_hand_simd}
\caption{Hand-written SIMD speedup (\varref{} $\to$ \varavx{}) per operation, across all three \mlkem{} parameter sets. Log $y$-axis. 95\% bootstrap CI error bars (often sub-pixel). Sorted by \mlkemk{512} speedup.}
\label{fig:handsimd}
\end{figure*}

\begin{table}[t]
\caption{Hand-written SIMD speedup (\varref{} $\to$ \varavx{}), median ratio. 95\% bootstrap CIs are narrower than the displayed precision; Cliff's $\delta \geq +0.999$ and $p < 10^{-300}$ for all entries.}
\label{tab:simd}
\small
\begin{tabular}{lccc}
\toprule
Operation & \mlkemk{512} & \mlkemk{768} & \mlkemk{1024} \\
\midrule
\op{INVNTT}  & $56.3\times$ & $52.2\times$ & $50.5\times$ \\
\op{basemul} & $52.0\times$ & $47.6\times$ & $41.6\times$ \\
\op{frommsg} & $45.6\times$ & $49.2\times$ & $55.4\times$ \\
\op{NTT}     & $35.5\times$ & $39.4\times$ & $34.6\times$ \\
\op{iDec}    & $35.1\times$ & $35.0\times$ & $31.1\times$ \\
\op{iEnc}    & $10.0\times$ & $9.4\times$  & $9.4\times$ \\
\op{iKeypair}& $8.3\times$  & $7.6\times$  & $8.1\times$ \\
\op{gen\_a}  & $4.7\times$  & $3.8\times$  & $4.8\times$ \\
\op{noise}   & $1.4\times$  & $1.4\times$  & $1.2\times$ \\
\bottomrule
\end{tabular}
\end{table}

\subsection{Statistical Significance} \label{sec:results:stats}

All \varref{} vs.\ \varavx{} comparisons pass the Mann--Whitney U test at $p < 10^{-300}$. Cliff's $\delta = +1.000$ for all operations except \op{NTT} at \mlkemk{512} and \mlkemk{1024} ($\delta = +0.999$), meaning AVX2 achieves a strictly smaller cycle count than \varref{} in effectively every observation pair. Figure~\ref{fig:cliffs} shows the heatmap of Cliff's $\delta$ values across all operations and parameter sets.
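The limited SIMD benefit of noise sampling traces to the structure of the CBD itself. As an illustration, here is a Python transliteration of an $\eta = 2$ centered-binomial sampler using the pairwise-popcount bit trick common to Kyber-style reference code (a sketch for exposition, not the benchmarked implementation):

```python
def cbd_eta2(buf: bytes) -> list[int]:
    """Sample 256 coefficients in [-2, 2] from 128 bytes of XOF output.

    Each coefficient is (a - b), where a and b are each sums of two
    bits: the centered binomial distribution for eta = 2. The pairwise
    popcount and the shift cascade below proceed serially through the
    bitstream, which is why this kernel vectorizes poorly.
    """
    assert len(buf) == 128
    coeffs = []
    for i in range(0, 128, 4):
        t = int.from_bytes(buf[i:i + 4], "little")
        # Each 2-bit field of d now holds the sum of two adjacent bits of t.
        d = (t & 0x55555555) + ((t >> 1) & 0x55555555)
        for j in range(8):
            a = (d >> (4 * j)) & 0x3        # first 2-bit sum
            b = (d >> (4 * j + 2)) & 0x3    # second 2-bit sum
            coeffs.append(a - b)
    return coeffs
```

An all-zero (or all-one) input maps every coefficient to 0; in general each coefficient follows the centered binomial law on $\{-2,\dots,2\}$.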
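The $\delta = +1.000$ cells correspond to fully disjoint samples. A minimal sketch of the statistic (toy values, not the measured cycle counts):

```python
import bisect

def cliffs_delta(xs, ys):
    """Cliff's delta = [#(x > y) - #(x < y)] / (|xs| * |ys|), computed
    in O((n + m) log n) by counting order statistics against sorted xs."""
    sx, n, m = sorted(xs), len(xs), len(ys)
    gt = lt = 0
    for y in ys:
        gt += n - bisect.bisect_right(sx, y)  # pairs with x > y
        lt += bisect.bisect_left(sx, y)       # pairs with x < y
    return (gt - lt) / (n * m)

# Disjoint samples: the first group is larger in every pair -> +1.0
print(cliffs_delta([5600, 5630, 5710], [100, 101, 103]))  # 1.0
# Overlapping samples give |delta| < 1.
print(cliffs_delta([1, 2, 3], [2, 3, 4]))
```

On this view, the $+0.999$ entries for \op{NTT} indicate a small fraction of overlapping observation pairs rather than any qualitative difference.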
\begin{figure}[t]
\centering
\includegraphics[width=\columnwidth]{figures/cliffs_delta_heatmap.pdf}
\caption{Cliff's $\delta$ (\varref{} vs.\ \varavx{}) for all operations and parameter sets. $\delta = +1$: AVX2 is faster in every observation pair. Nearly all cells are at $+1.000$.}
\label{fig:cliffs}
\end{figure}

\subsection{Cross-Parameter Consistency} \label{sec:results:crossparams}

Figure~\ref{fig:crossparams} shows the \varavx{} speedup for the four per-polynomial operations across \mlkemk{512}, \mlkemk{768}, and \mlkemk{1024}. Since all three instantiations operate on 256-coefficient polynomials, these speedups should be parameter-independent. This holds approximately: \op{frommsg} varies by only $\pm10\%$ and \op{INVNTT} by $\pm6\%$. \op{NTT} shows a more pronounced variation ($35.5\times$ at \mlkemk{512}, $39.4\times$ at \mlkemk{768}, $34.6\times$ at \mlkemk{1024}) that is statistically real (non-overlapping 95\% CIs). We attribute this to \emph{cache state effects}: the surrounding polyvec loops that precede each NTT call have a memory footprint that grows with $k$, leaving different cache residency patterns that affect NTT latency in the scalar \varref{} path. The AVX2 path is less sensitive because it keeps more working state in vector registers and therefore depends less on cache contents.

\begin{figure}[t]
\centering
\input{figures/fig_cross_param}
\caption{Per-polynomial operation speedup (\varref{} $\to$ \varavx{}) across security parameters. Polynomial dimension is 256 for all; variation reflects cache-state differences in the calling context.}
\label{fig:crossparams}
\end{figure}

\subsection{Hardware Counter Breakdown} \label{sec:results:papi}

\phasetwo{IPC, L1/L2/L3 cache miss rates, branch mispredictions via PAPI.
This section will contain bar charts of per-counter values comparing ref and avx2 for each operation, explaining the mechanistic origins of the speedup.} \subsection{Energy Efficiency} \label{sec:results:energy} \phasetwo{Intel RAPL pkg + DRAM energy readings per operation. EDP (energy-delay product) comparison. Energy per KEM operation.}