182 lines
8.1 KiB
TeX
182 lines
8.1 KiB
TeX
% ── 4. Results ────────────────────────────────────────────────────────────────
|
||
\section{Results}
|
||
\label{sec:results}
|
||
|
||
\subsection{Cycle Count Distributions}
|
||
\label{sec:results:distributions}
|
||
|
||
Figure~\ref{fig:distributions} shows the cycle count distributions for three
|
||
representative operations in \mlkemk{512}, comparing \varref{} and \varavx{}.
|
||
All distributions are right-skewed with a long tail from OS interrupts and
|
||
cache-cold executions. The median (dashed lines) is robust to these outliers,
|
||
justifying the nonparametric approach of §\ref{sec:meth:stats}.
|
||
|
||
The separation between \varref{} and \varavx{} is qualitatively different
|
||
across operation types: for \op{INVNTT} the distributions do not overlap at
|
||
all (disjoint spikes separated by two orders of magnitude on the log scale);
|
||
for \op{gen\_a} there is partial overlap; for noise sampling the distributions
|
||
are nearly coincident.
|
||
|
||
\begin{figure}[t]
|
||
\centering
|
||
\includegraphics[width=\columnwidth]{figures/distributions.pdf}
|
||
\caption{Cycle count distributions for three representative \mlkemk{512}
|
||
operations. Log $x$-axis. Dashed lines mark medians. Right-skew and
|
||
outlier structure motivate nonparametric statistics.}
|
||
\label{fig:distributions}
|
||
\end{figure}
|
||
|
||
\subsection{Speedup Decomposition}
|
||
\label{sec:results:decomp}
|
||
|
||
Figure~\ref{fig:decomp} shows the cumulative speedup at each optimization stage
|
||
for all three \mlkem{} parameter sets. Each group of bars represents one
|
||
operation; the three bars within a group show the total speedup achieved after
|
||
applying (i)~O3 without auto-vec (\varrefnv{}), (ii)~O3 with auto-vec
|
||
(\varref{}), and (iii)~hand-written AVX2 (\varavx{})---all normalized to the
|
||
unoptimized \varrefo{} baseline. The log scale makes the three orders of
|
||
magnitude of variation legible.
|
||
|
||
Several structural features are immediately apparent:
|
||
\begin{itemize}
|
||
\item The \varrefnv{} and \varref{} bars are nearly indistinguishable for
|
||
arithmetic operations (NTT, INVNTT, basemul, frommsg), confirming that
|
||
GCC's auto-vectorizer contributes negligibly to these operations.
|
||
\item The \varavx{} bars are 1--2 orders of magnitude taller than the
|
||
\varref{} bars for arithmetic operations, indicating that hand-written
|
||
SIMD dominates the speedup.
|
||
\item For SHAKE-heavy operations (gen\_a, noise), all three bars are much
|
||
closer together, reflecting the memory-bandwidth bottleneck that limits
|
||
SIMD benefit.
|
||
\end{itemize}
|
||
|
||
\begin{figure*}[t]
|
||
\centering
|
||
\input{figures/fig_decomp}
|
||
\caption{Cumulative speedup at each optimization stage, normalized to
|
||
\varrefo{} (1×). Three bars per operation:
|
||
\textcolor{colRefnv}{$\blacksquare$}~O3 no auto-vec,
|
||
\textcolor{colRef}{$\blacksquare$}~O3 + auto-vec,
|
||
\textcolor{colAvx}{$\blacksquare$}~O3 + hand SIMD (AVX2).
|
||
Log $y$-axis; 95\% bootstrap CI shown on \varavx{} bars.
|
||
Sorted by \varavx{} speedup.}
|
||
\label{fig:decomp}
|
||
\end{figure*}
|
||
|
||
\subsection{Hand-Written SIMD Speedup}
|
||
\label{sec:results:simd}
|
||
|
||
Figure~\ref{fig:handsimd} isolates the hand-written SIMD speedup (\varref{}
|
||
$\to$ \varavx{}) across all three \mlkem{} parameter sets. Table~\ref{tab:simd}
|
||
summarizes the numerical values.
|
||
|
||
Key observations:
|
||
\begin{itemize}
|
||
\item \textbf{Arithmetic operations} achieve the largest speedups:
|
||
\speedup{56.3} for \op{INVNTT} at \mlkemk{512}, \speedup{52.0} for
|
||
\op{basemul}, and \speedup{45.6} for \op{frommsg}. The 95\% bootstrap
|
||
CIs on these ratios are extremely tight (often $[\hat{s}, \hat{s}]$ to
|
||
two decimal places), reflecting near-perfect measurement stability.
|
||
\item \textbf{gen\_a} achieves \speedup{3.8}--\speedup{4.7}: substantially
|
||
smaller than arithmetic operations because SHAKE-128 generation is
|
||
memory-bandwidth limited.
|
||
\item \textbf{Noise sampling} achieves only \speedup{1.2}--\speedup{1.4},
|
||
the smallest SIMD benefit. The centered binomial distribution (CBD)
|
||
sampler is bit-manipulation-heavy with sequential bitstream reads that
|
||
do not parallelise well.
|
||
\item Speedups are broadly consistent across parameter sets for per-polynomial
|
||
operations, as expected (§\ref{sec:results:crossparams}).
|
||
\end{itemize}
|
||
|
||
\begin{figure*}[t]
|
||
\centering
|
||
\input{figures/fig_hand_simd}
|
||
\caption{Hand-written SIMD speedup (\varref{} $\to$ \varavx{}) per operation,
|
||
across all three \mlkem{} parameter sets. Log $y$-axis.
|
||
95\% bootstrap CI error bars (often sub-pixel).
|
||
Sorted by \mlkemk{512} speedup.}
|
||
\label{fig:handsimd}
|
||
\end{figure*}
|
||
|
||
\begin{table}[t]
|
||
\caption{Hand-written SIMD speedup (\varref{} $\to$ \varavx{}), median ratio
|
||
with 95\% bootstrap CI. All Cliff's $\delta = +1.000$, $p < 10^{-300}$.}
|
||
\label{tab:simd}
|
||
\small
|
||
\begin{tabular}{lccc}
|
||
\toprule
|
||
Operation & \mlkemk{512} & \mlkemk{768} & \mlkemk{1024} \\
|
||
\midrule
|
||
\op{INVNTT} & $56.3\times$ & $52.2\times$ & $50.5\times$ \\
|
||
\op{basemul} & $52.0\times$ & $47.6\times$ & $41.6\times$ \\
|
||
\op{frommsg} & $45.6\times$ & $49.2\times$ & $55.4\times$ \\
|
||
\op{NTT} & $35.5\times$ & $39.4\times$ & $34.6\times$ \\
|
||
\op{iDec} & $35.1\times$ & $35.0\times$ & $31.1\times$ \\
|
||
\op{iEnc} & $10.0\times$ & $9.4\times$ & $9.4\times$ \\
|
||
\op{iKeypair}& $8.3\times$ & $7.6\times$ & $8.1\times$ \\
|
||
\op{gen\_a} & $4.7\times$ & $3.8\times$ & $4.8\times$ \\
|
||
\op{noise} & $1.4\times$ & $1.4\times$ & $1.2\times$ \\
|
||
\bottomrule
|
||
\end{tabular}
|
||
\end{table}
|
||
|
||
\subsection{Statistical Significance}
|
||
\label{sec:results:stats}
|
||
|
||
All \varref{} vs.\ \varavx{} comparisons pass the Mann-Whitney U test at
|
||
$p < 10^{-300}$. Cliff's $\delta = +1.000$ for all operations except
|
||
\op{NTT} at \mlkemk{512} and \mlkemk{1024} ($\delta = +0.999$), meaning AVX2
|
||
achieves a strictly smaller cycle count than \varref{} in effectively every
|
||
observation pair.
|
||
|
||
Figure~\ref{fig:cliffs} shows the heatmap of Cliff's $\delta$ values across
|
||
all operations and parameter sets.
|
||
|
||
\begin{figure}[t]
|
||
\centering
|
||
\includegraphics[width=\columnwidth]{figures/cliffs_delta_heatmap.pdf}
|
||
\caption{Cliff's $\delta$ (\varref{} vs.\ \varavx{}) for all operations and
|
||
parameter sets. $\delta = +1$: AVX2 is faster in every observation
|
||
pair. Nearly all cells are at $+1.000$.}
|
||
\label{fig:cliffs}
|
||
\end{figure}
|
||
|
||
\subsection{Cross-Parameter Consistency}
|
||
\label{sec:results:crossparams}
|
||
|
||
Figure~\ref{fig:crossparams} shows the \varavx{} speedup for the four
|
||
per-polynomial operations across \mlkemk{512}, \mlkemk{768}, and
|
||
\mlkemk{1024}. Since all three instantiations operate on 256-coefficient
|
||
polynomials, speedups for \op{frommsg} and \op{INVNTT} should be
|
||
parameter-independent. This holds approximately: frommsg varies by only
|
||
$\pm{10\%}$, INVNTT by $\pm{6\%}$.
|
||
|
||
\op{NTT} shows a more pronounced variation ($35.5\times$ at \mlkemk{512},
|
||
$39.4\times$ at \mlkemk{768}, $34.6\times$ at \mlkemk{1024}) that is
|
||
statistically real (non-overlapping 95\% CIs). We attribute this to
|
||
\emph{cache state effects}: the surrounding polyvec loops that precede each
|
||
NTT call have a footprint that varies with $k$, leaving different cache
|
||
residency patterns that affect NTT latency in the scalar \varref{} path.
|
||
The AVX2 path is less sensitive because its smaller register footprint keeps
|
||
more state in vector registers.
|
||
|
||
\begin{figure}[t]
|
||
\centering
|
||
\input{figures/fig_cross_param}
|
||
\caption{Per-polynomial operation speedup (\varref{} $\to$ \varavx{}) across
|
||
security parameters. Polynomial dimension is 256 for all; variation
|
||
reflects cache-state differences in the calling context.}
|
||
\label{fig:crossparams}
|
||
\end{figure}
|
||
|
||
\subsection{Hardware Counter Breakdown}
|
||
\label{sec:results:papi}
|
||
\phasetwo{IPC, L1/L2/L3 cache miss rates, branch mispredictions via PAPI.
|
||
This section will contain bar charts of per-counter values comparing ref and
|
||
avx2 for each operation, explaining the mechanistic origins of the speedup.}
|
||
|
||
\subsection{Energy Efficiency}
|
||
\label{sec:results:energy}
|
||
\phasetwo{Intel RAPL pkg + DRAM energy readings per operation.
|
||
EDP (energy-delay product) comparison. Energy per KEM operation.}
|