
% ── 4. Results ────────────────────────────────────────────────────────────────
\section{Results}
\label{sec:results}
\subsection{Cycle Count Distributions}
\label{sec:results:distributions}
Figure~\ref{fig:distributions} shows the cycle count distributions for three
representative operations in \mlkemk{512}, comparing \varref{} and \varavx{}.
All distributions are right-skewed with a long tail from OS interrupts and
cache-cold executions. The median (dashed lines) is robust to these outliers,
justifying the nonparametric approach of §\ref{sec:meth:stats}.
The separation between \varref{} and \varavx{} is qualitatively different
across operation types: for \op{INVNTT} the distributions are completely
disjoint, with spikes separated by roughly two orders of magnitude; for
\op{gen\_a} there is partial overlap; for noise sampling the distributions
are nearly coincident.
\begin{figure}[t]
\centering
\includegraphics[width=\columnwidth]{figures/distributions.pdf}
\caption{Cycle count distributions for three representative \mlkemk{512}
operations. Log $x$-axis. Dashed lines mark medians. Right-skew and
outlier structure motivate nonparametric statistics.}
\label{fig:distributions}
\end{figure}
\subsection{Speedup Decomposition}
\label{sec:results:decomp}
Figure~\ref{fig:decomp} shows the cumulative speedup at each optimization stage
for all three \mlkem{} parameter sets. Each group of bars represents one
operation; the three bars within a group show the total speedup achieved after
applying (i)~O3 without auto-vec (\varrefnv{}), (ii)~O3 with auto-vec
(\varref{}), and (iii)~hand-written AVX2 (\varavx{})---all normalized to the
unoptimized \varrefo{} baseline. The log scale makes the three orders of
magnitude of variation legible.
Several structural features are immediately apparent:
\begin{itemize}
\item The \varrefnv{} and \varref{} bars are nearly indistinguishable for
arithmetic operations (\op{NTT}, \op{INVNTT}, \op{basemul}, \op{frommsg}),
confirming that GCC's auto-vectorizer contributes negligibly to these
operations.
\item The \varavx{} bars are 1--2 orders of magnitude taller than the
\varref{} bars for arithmetic operations, indicating that hand-written
SIMD dominates the speedup.
\item For SHAKE-heavy operations (\op{gen\_a}, \op{noise}), all three bars
are much closer together, reflecting the memory-bandwidth bottleneck that
limits SIMD benefit.
\end{itemize}
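The per-stage decomposition is simply the ratio of median cycle counts against the common baseline; a minimal sketch with hypothetical cycle samples (the numbers below are invented, not measured values):

```python
import statistics

def median_speedup(baseline_cycles, variant_cycles):
    """Speedup of a variant over the unoptimized baseline,
    computed as the ratio of median cycle counts."""
    return statistics.median(baseline_cycles) / statistics.median(variant_cycles)

# Hypothetical cycle samples for one operation at each stage.
refo  = [10_000, 10_050, 9_990]   # unoptimized baseline
refnv = [2_000, 2_010, 1_995]     # O3, auto-vectorization disabled
ref   = [1_990, 2_000, 1_985]     # O3 with auto-vectorization
avx   = [40, 41, 40]              # hand-written AVX2

for name, cycles in [("refnv", refnv), ("ref", ref), ("avx2", avx)]:
    print(f"{name:>5}: {median_speedup(refo, cycles):6.1f}x")
```

Because every bar shares the same \varrefo{} denominator, the gap between adjacent bars within a group reads directly as the incremental contribution of that stage.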
\begin{figure*}[t]
\centering
\input{figures/fig_decomp}
\caption{Cumulative speedup at each optimization stage, normalized to
\varrefo{} ($1\times$). Three bars per operation:
\textcolor{colRefnv}{$\blacksquare$}~O3 no auto-vec,
\textcolor{colRef}{$\blacksquare$}~O3 + auto-vec,
\textcolor{colAvx}{$\blacksquare$}~O3 + hand SIMD (AVX2).
Log $y$-axis; 95\% bootstrap CI shown on \varavx{} bars.
Sorted by \varavx{} speedup.}
\label{fig:decomp}
\end{figure*}
\subsection{Hand-Written SIMD Speedup}
\label{sec:results:simd}
Figure~\ref{fig:handsimd} isolates the hand-written SIMD speedup (\varref{}
$\to$ \varavx{}) across all three \mlkem{} parameter sets. Table~\ref{tab:simd}
summarizes the numerical values.
Key observations:
\begin{itemize}
\item \textbf{Arithmetic operations} achieve the largest speedups:
\speedup{56.3} for \op{INVNTT} at \mlkemk{512}, \speedup{52.0} for
\op{basemul}, and \speedup{45.6} for \op{frommsg}. The 95\% bootstrap
CIs on these ratios are extremely tight (often $[\hat{s}, \hat{s}]$ to
two decimal places), reflecting near-perfect measurement stability.
\item \textbf{gen\_a} achieves \speedup{3.8}--\speedup{4.7}: substantially
smaller than arithmetic operations because SHAKE-128 generation is
memory-bandwidth limited.
\item \textbf{Noise sampling} achieves only \speedup{1.2}--\speedup{1.4},
the smallest SIMD benefit. The centered binomial distribution (CBD)
sampler is bit-manipulation-heavy with sequential bitstream reads that
do not parallelize well.
\item Speedups are broadly consistent across parameter sets for per-polynomial
operations, as expected (§\ref{sec:results:crossparams}).
\end{itemize}
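The near-degenerate CIs reported above arise because the bootstrap percentile interval on a ratio of medians collapses to a point when both underlying distributions are essentially constant. A sketch of the two-sample percentile bootstrap we describe (illustrative data; the resampling parameters are our own choices for the example):

```python
import random
import statistics

def bootstrap_ci_median_ratio(ref, avx, n_boot=2_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for median(ref) / median(avx).

    Each bootstrap replicate resamples both groups independently
    with replacement and recomputes the ratio of medians.
    """
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_boot):
        r = [rng.choice(ref) for _ in ref]
        a = [rng.choice(avx) for _ in avx]
        ratios.append(statistics.median(r) / statistics.median(a))
    ratios.sort()
    lo = ratios[int(alpha / 2 * n_boot)]
    hi = ratios[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Near-constant cycle counts: every resample yields the same medians,
# so the interval degenerates to a point estimate.
ref_cycles = [5_630] * 1_000
avx_cycles = [100] * 1_000
print(bootstrap_ci_median_ratio(ref_cycles, avx_cycles))  # → (56.3, 56.3)
```

When measurement noise is sub-cycle relative to the median, as for the arithmetic operations here, the interval $[\hat{s}, \hat{s}]$ is the expected outcome rather than an artifact.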
\begin{figure*}[t]
\centering
\input{figures/fig_hand_simd}
\caption{Hand-written SIMD speedup (\varref{} $\to$ \varavx{}) per operation,
across all three \mlkem{} parameter sets. Log $y$-axis.
95\% bootstrap CI error bars (often sub-pixel).
Sorted by \mlkemk{512} speedup.}
\label{fig:handsimd}
\end{figure*}
\begin{table}[t]
\caption{Hand-written SIMD speedup (\varref{} $\to$ \varavx{}), median ratio
with 95\% bootstrap CI. Cliff's $\delta = +1.000$ throughout, except
\op{NTT} at \mlkemk{512} and \mlkemk{1024} ($\delta = +0.999$);
all $p < 10^{-300}$.}
\label{tab:simd}
\small
\begin{tabular}{lccc}
\toprule
Operation & \mlkemk{512} & \mlkemk{768} & \mlkemk{1024} \\
\midrule
\op{INVNTT} & $56.3\times$ & $52.2\times$ & $50.5\times$ \\
\op{basemul} & $52.0\times$ & $47.6\times$ & $41.6\times$ \\
\op{frommsg} & $45.6\times$ & $49.2\times$ & $55.4\times$ \\
\op{NTT} & $35.5\times$ & $39.4\times$ & $34.6\times$ \\
\op{iDec} & $35.1\times$ & $35.0\times$ & $31.1\times$ \\
\op{iEnc} & $10.0\times$ & $9.4\times$ & $9.4\times$ \\
\op{iKeypair}& $8.3\times$ & $7.6\times$ & $8.1\times$ \\
\op{gen\_a} & $4.7\times$ & $3.8\times$ & $4.8\times$ \\
\op{noise} & $1.4\times$ & $1.4\times$ & $1.2\times$ \\
\bottomrule
\end{tabular}
\end{table}
\subsection{Statistical Significance}
\label{sec:results:stats}
All \varref{} vs.\ \varavx{} comparisons are significant under the
Mann--Whitney $U$ test at $p < 10^{-300}$. Cliff's $\delta = +1.000$ for all
operations except \op{NTT} at \mlkemk{512} and \mlkemk{1024}
($\delta = +0.999$), meaning AVX2 achieves a strictly smaller cycle count
than \varref{} in effectively every observation pair.
Figure~\ref{fig:cliffs} shows the heatmap of Cliff's $\delta$ values across
all operations and parameter sets.
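Cliff's $\delta$ compares every cross-pair of observations: it is the fraction of pairs where the first sample exceeds the second, minus the reverse. A minimal $O(nm)$ sketch (the sample values are invented for illustration; production code would use the rank-based $O(n \log n)$ formulation):

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all pairs (x, y).

    delta = +1 means every value in xs exceeds every value in ys,
    i.e. the two samples are completely separated.
    """
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))

# Disjoint samples (scalar always slower than AVX2) give delta = +1.
ref_cycles = [5_600, 5_650, 5_630]
avx_cycles = [100, 101, 99]
print(cliffs_delta(ref_cycles, avx_cycles))  # → 1.0
```

A $\delta$ of $+0.999$, as for \op{NTT} at two parameter sets, means a tiny fraction of pairs overlap (e.g.\ an interrupt-inflated AVX2 run exceeding a fast scalar run) while separation is otherwise complete.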
\begin{figure}[t]
\centering
\includegraphics[width=\columnwidth]{figures/cliffs_delta_heatmap.pdf}
\caption{Cliff's $\delta$ (\varref{} vs.\ \varavx{}) for all operations and
parameter sets. $\delta = +1$: AVX2 is faster in every observation
pair. Nearly all cells are at $+1.000$.}
\label{fig:cliffs}
\end{figure}
\subsection{Cross-Parameter Consistency}
\label{sec:results:crossparams}
Figure~\ref{fig:crossparams} shows the \varavx{} speedup for the four
per-polynomial operations across \mlkemk{512}, \mlkemk{768}, and
\mlkemk{1024}. Since all three instantiations operate on 256-coefficient
polynomials, speedups for \op{frommsg} and \op{INVNTT} should be
parameter-independent. This holds approximately: \op{frommsg} varies by only
$\pm 10\%$ and \op{INVNTT} by $\pm 6\%$.
\op{NTT} shows a more pronounced variation ($35.5\times$ at \mlkemk{512},
$39.4\times$ at \mlkemk{768}, $34.6\times$ at \mlkemk{1024}) that is
statistically real (non-overlapping 95\% CIs). We attribute this to
\emph{cache state effects}: the surrounding polyvec loops that precede each
NTT call have a footprint that varies with $k$, leaving different cache
residency patterns that affect NTT latency in the scalar \varref{} path.
The AVX2 path is less sensitive because its smaller register footprint keeps
more state in vector registers.
\begin{figure}[t]
\centering
\input{figures/fig_cross_param}
\caption{Per-polynomial operation speedup (\varref{} $\to$ \varavx{}) across
security parameters. Polynomial dimension is 256 for all; variation
reflects cache-state differences in the calling context.}
\label{fig:crossparams}
\end{figure}
\subsection{Hardware Counter Breakdown}
\label{sec:results:papi}
\phasetwo{IPC, L1/L2/L3 cache miss rates, branch mispredictions via PAPI.
This section will contain bar charts of per-counter values comparing ref and
avx2 for each operation, explaining the mechanistic origins of the speedup.}
\subsection{Energy Efficiency}
\label{sec:results:energy}
\phasetwo{Intel RAPL pkg + DRAM energy readings per operation.
EDP (energy-delay product) comparison. Energy per KEM operation.}