
% ── 4. Results ────────────────────────────────────────────────────────────────
\section{Results}
\label{sec:results}
\subsection{Cycle Count Distributions}
\label{sec:results:distributions}
Figure~\ref{fig:distributions} shows the cycle count distributions for three
representative operations in \mlkemk{512}, comparing \varref{} and \varavx{}.
All distributions are right-skewed with a long tail from OS interrupts and
cache-cold executions. The median (dashed lines) is robust to these outliers,
justifying the nonparametric approach of §\ref{sec:meth:stats}.
The separation between \varref{} and \varavx{} is qualitatively different
across operation types: for \op{INVNTT} the distributions are completely
disjoint, with spikes separated by roughly two orders of magnitude; for
\op{gen\_a} there is partial overlap; for noise sampling the distributions
are nearly coincident.
\begin{figure}[t]
\centering
\includegraphics[width=\columnwidth]{figures/distributions.pdf}
\caption{Cycle count distributions for three representative \mlkemk{512}
operations. Log $x$-axis. Dashed lines mark medians. Right-skew and
outlier structure motivate nonparametric statistics.}
\label{fig:distributions}
\end{figure}
\subsection{Speedup Decomposition}
\label{sec:results:decomp}
Figure~\ref{fig:decomp} shows the cumulative speedup at each optimization stage
for all three \mlkem{} parameter sets. Each group of bars represents one
operation; the three bars within a group show the total speedup achieved after
applying (i)~O3 without auto-vec (\varrefnv{}), (ii)~O3 with auto-vec
(\varref{}), and (iii)~hand-written AVX2 (\varavx{})---all normalized to the
unoptimized \varrefo{} baseline. The log scale makes the three orders of
magnitude of variation legible.
Several structural features are immediately apparent:
\begin{itemize}
\item The \varrefnv{} and \varref{} bars are nearly indistinguishable for
arithmetic operations (\op{NTT}, \op{INVNTT}, \op{basemul}, \op{frommsg}),
confirming that GCC's auto-vectorizer contributes negligibly to these
operations.
\item The \varavx{} bars are 1--2 orders of magnitude taller than the
\varref{} bars for arithmetic operations, indicating that hand-written
SIMD dominates the speedup.
\item For SHAKE-heavy operations (\op{gen\_a}, \op{noise}), all three bars
are much closer together, reflecting the memory-bandwidth bottleneck that
limits SIMD benefit.
\end{itemize}
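The per-stage decomposition is simply the ratio of median cycle counts against the common baseline; a minimal sketch with hypothetical cycle samples (the numbers below are invented, not measured values):

```python
import statistics

def median_speedup(baseline_cycles, variant_cycles):
    """Speedup of a variant over the unoptimized baseline,
    computed as the ratio of median cycle counts."""
    return statistics.median(baseline_cycles) / statistics.median(variant_cycles)

# Hypothetical cycle samples for one operation at each stage.
refo  = [10_000, 10_050, 9_990]   # unoptimized baseline
refnv = [2_000, 2_010, 1_995]     # O3, auto-vectorization disabled
ref   = [1_990, 2_000, 1_985]     # O3 with auto-vectorization
avx   = [40, 41, 40]              # hand-written AVX2

for name, cycles in [("refnv", refnv), ("ref", ref), ("avx2", avx)]:
    print(f"{name:>5}: {median_speedup(refo, cycles):6.1f}x")
```

Because every bar shares the same \varrefo{} denominator, the gap between adjacent bars within a group reads directly as the incremental contribution of that stage.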
\begin{figure*}[t]
\centering
\input{figures/fig_decomp}
\caption{Cumulative speedup at each optimization stage, normalized to
\varrefo{} ($1\times$). Three bars per operation:
\textcolor{colRefnv}{$\blacksquare$}~O3 no auto-vec,
\textcolor{colRef}{$\blacksquare$}~O3 + auto-vec,
\textcolor{colAvx}{$\blacksquare$}~O3 + hand SIMD (AVX2).
Log $y$-axis; 95\% bootstrap CI shown on \varavx{} bars.
Sorted by \varavx{} speedup.}
\label{fig:decomp}
\end{figure*}
\subsection{Hand-Written SIMD Speedup}
\label{sec:results:simd}
Figure~\ref{fig:handsimd} isolates the hand-written SIMD speedup (\varref{}
$\to$ \varavx{}) across all three \mlkem{} parameter sets. Table~\ref{tab:simd}
summarizes the numerical values.
Key observations:
\begin{itemize}
\item \textbf{Arithmetic operations} achieve the largest speedups:
\speedup{56.3} for \op{INVNTT} at \mlkemk{512}, \speedup{52.0} for
\op{basemul}, and \speedup{45.6} for \op{frommsg}. The 95\% bootstrap
CIs on these ratios are extremely tight (often $[\hat{s}, \hat{s}]$ to
two decimal places), reflecting near-perfect measurement stability.
\item \textbf{gen\_a} achieves \speedup{3.8}--\speedup{4.7}: substantially
smaller than arithmetic operations because SHAKE-128 generation is
memory-bandwidth limited.
\item \textbf{Noise sampling} achieves only \speedup{1.2}--\speedup{1.4},
the smallest SIMD benefit. The centered binomial distribution (CBD)
sampler is bit-manipulation-heavy with sequential bitstream reads that
do not parallelize well.
\item Speedups are broadly consistent across parameter sets for per-polynomial
operations, as expected (§\ref{sec:results:crossparams}).
\end{itemize}
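The near-degenerate CIs reported above arise because the bootstrap percentile interval on a ratio of medians collapses to a point when both underlying distributions are essentially constant. A sketch of the two-sample percentile bootstrap we describe (illustrative data; the resampling parameters are our own choices for the example):

```python
import random
import statistics

def bootstrap_ci_median_ratio(ref, avx, n_boot=2_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for median(ref) / median(avx).

    Each bootstrap replicate resamples both groups independently
    with replacement and recomputes the ratio of medians.
    """
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_boot):
        r = [rng.choice(ref) for _ in ref]
        a = [rng.choice(avx) for _ in avx]
        ratios.append(statistics.median(r) / statistics.median(a))
    ratios.sort()
    lo = ratios[int(alpha / 2 * n_boot)]
    hi = ratios[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Near-constant cycle counts: every resample yields the same medians,
# so the interval degenerates to a point estimate.
ref_cycles = [5_630] * 1_000
avx_cycles = [100] * 1_000
print(bootstrap_ci_median_ratio(ref_cycles, avx_cycles))  # → (56.3, 56.3)
```

When measurement noise is sub-cycle relative to the median, as for the arithmetic operations here, the interval $[\hat{s}, \hat{s}]$ is the expected outcome rather than an artifact.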
\begin{figure*}[t]
\centering
\input{figures/fig_hand_simd}
\caption{Hand-written SIMD speedup (\varref{} $\to$ \varavx{}) per operation,
across all three \mlkem{} parameter sets. Log $y$-axis.
95\% bootstrap CI error bars (often sub-pixel).
Sorted by \mlkemk{512} speedup.}
\label{fig:handsimd}
\end{figure*}
\begin{table}[t]
\caption{Hand-written SIMD speedup (\varref{} $\to$ \varavx{}), median ratio
with 95\% bootstrap CI. Cliff's $\delta = +1.000$ throughout, except
\op{NTT} at \mlkemk{512} and \mlkemk{1024} ($\delta = +0.999$);
all $p < 10^{-300}$.}
\label{tab:simd}
\small
\begin{tabular}{lccc}
\toprule
Operation & \mlkemk{512} & \mlkemk{768} & \mlkemk{1024} \\
\midrule
\op{INVNTT} & $56.3\times$ & $52.2\times$ & $50.5\times$ \\
\op{basemul} & $52.0\times$ & $47.6\times$ & $41.6\times$ \\
\op{frommsg} & $45.6\times$ & $49.2\times$ & $55.4\times$ \\
\op{NTT} & $35.5\times$ & $39.4\times$ & $34.6\times$ \\
\op{iDec} & $35.1\times$ & $35.0\times$ & $31.1\times$ \\
\op{iEnc} & $10.0\times$ & $9.4\times$ & $9.4\times$ \\
\op{iKeypair}& $8.3\times$ & $7.6\times$ & $8.1\times$ \\
\op{gen\_a} & $4.7\times$ & $3.8\times$ & $4.8\times$ \\
\op{noise} & $1.4\times$ & $1.4\times$ & $1.2\times$ \\
\bottomrule
\end{tabular}
\end{table}
\subsection{Statistical Significance}
\label{sec:results:stats}
All \varref{} vs.\ \varavx{} comparisons are significant under the
Mann--Whitney $U$ test at $p < 10^{-300}$. Cliff's $\delta = +1.000$ for all
operations except \op{NTT} at \mlkemk{512} and \mlkemk{1024}
($\delta = +0.999$), meaning AVX2 achieves a strictly smaller cycle count
than \varref{} in effectively every observation pair.
Figure~\ref{fig:cliffs} shows the heatmap of Cliff's $\delta$ values across
all operations and parameter sets.
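Cliff's $\delta$ compares every cross-pair of observations: it is the fraction of pairs where the first sample exceeds the second, minus the reverse. A minimal $O(nm)$ sketch (the sample values are invented for illustration; production code would use the rank-based $O(n \log n)$ formulation):

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all pairs (x, y).

    delta = +1 means every value in xs exceeds every value in ys,
    i.e. the two samples are completely separated.
    """
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))

# Disjoint samples (scalar always slower than AVX2) give delta = +1.
ref_cycles = [5_600, 5_650, 5_630]
avx_cycles = [100, 101, 99]
print(cliffs_delta(ref_cycles, avx_cycles))  # → 1.0
```

A $\delta$ of $+0.999$, as for \op{NTT} at two parameter sets, means a tiny fraction of pairs overlap (e.g.\ an interrupt-inflated AVX2 run exceeding a fast scalar run) while separation is otherwise complete.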
\begin{figure}[t]
\centering
\includegraphics[width=\columnwidth]{figures/cliffs_delta_heatmap.pdf}
\caption{Cliff's $\delta$ (\varref{} vs.\ \varavx{}) for all operations and
parameter sets. $\delta = +1$: AVX2 is faster in every observation
pair. Nearly all cells are at $+1.000$.}
\label{fig:cliffs}
\end{figure}
\subsection{Cross-Parameter Consistency}
\label{sec:results:crossparams}
Figure~\ref{fig:crossparams} shows the \varavx{} speedup for the four
per-polynomial operations across \mlkemk{512}, \mlkemk{768}, and
\mlkemk{1024}. Since all three instantiations operate on 256-coefficient
polynomials, speedups for \op{frommsg} and \op{INVNTT} should be
parameter-independent. This holds approximately: \op{frommsg} varies by only
$\pm 10\%$ and \op{INVNTT} by $\pm 6\%$.
\op{NTT} shows a more pronounced variation ($35.5\times$ at \mlkemk{512},
$39.4\times$ at \mlkemk{768}, $34.6\times$ at \mlkemk{1024}) that is
statistically real (non-overlapping 95\% CIs). We attribute this to
\emph{cache state effects}: the surrounding polyvec loops that precede each
NTT call have a footprint that varies with $k$, leaving different cache
residency patterns that affect NTT latency in the scalar \varref{} path.
The AVX2 path is less sensitive because its smaller register footprint keeps
more state in vector registers.
\begin{figure}[t]
\centering
\input{figures/fig_cross_param}
\caption{Per-polynomial operation speedup (\varref{} $\to$ \varavx{}) across
security parameters. Polynomial dimension is 256 for all; variation
reflects cache-state differences in the calling context.}
\label{fig:crossparams}
\end{figure}
\subsection{Hardware Counter Breakdown}
\label{sec:results:papi}
\phasetwo{IPC, L1/L2/L3 cache miss rates, branch mispredictions via PAPI.
This section will contain bar charts of per-counter values comparing ref and
avx2 for each operation, explaining the mechanistic origins of the speedup.}
\subsection{Energy Efficiency}
\label{sec:results:energy}
\phasetwo{Intel RAPL pkg + DRAM energy readings per operation.
EDP (energy-delay product) comparison. Energy per KEM operation.}