% ── 5. Discussion ─────────────────────────────────────────────────────────────

\section{Discussion}
\label{sec:discussion}

\subsection{Why Arithmetic Operations Benefit Most}

The NTT butterfly loop processes 128 pairs of 16-bit coefficients per forward
transform. In the scalar \varref{} path, each butterfly requires a modular
multiplication (implemented as a Barrett reduction), an addition, and a
subtraction---roughly 10--15 instructions per pair, with data-dependent
serialization through the multiply-add chain. The AVX2 path uses
\texttt{vpmullw}/\texttt{vpmulhw} to compute 16 Montgomery multiplications
per instruction pair, processing an entire butterfly layer in roughly
$16\times$ fewer instructions.
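To make the serialization concrete, the following portable C sketch spells out
one Cooley--Tukey butterfly. It is illustrative only (the reduction constant
and function names are ours, not the reference implementation's exact
\texttt{fqmul}): the add and subtract both wait on the reduced product, which
is the dependency chain the scalar path cannot hide.

```c
#include <stdint.h>

#define Q 3329  /* ML-KEM modulus */

/* Hypothetical scalar modular multiply: Barrett-style quotient estimate
 * via a high multiply (5039 ~ 2^24 / Q), then one conditional correction. */
static int16_t modmul(int16_t a, int16_t b) {
    int32_t p = (int32_t)a * b;                       /* full 32-bit product */
    int32_t t = (int32_t)(((int64_t)p * 5039) >> 24); /* ~ p / Q */
    int16_t r = (int16_t)(p - t * Q);
    return (r >= Q) ? (int16_t)(r - Q) : r;           /* r in [0, Q) */
}

/* One Cooley--Tukey butterfly: every step depends on the previous one. */
static void butterfly(int16_t *a, int16_t *b, int16_t zeta) {
    int16_t t   = modmul(*b, zeta);            /* multiply + reduce */
    int16_t sub = (int16_t)(*a - t);           /* waits on t        */
    int16_t add = (int16_t)(*a + t);           /* waits on t        */
    *b = (sub < 0) ? (int16_t)(sub + Q) : sub;
    *a = (add >= Q) ? (int16_t)(add - Q) : add;
}
```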

The observed INVNTT speedup of \speedup{56.3} at \mlkemk{512} \emph{exceeds}
the theoretical $16\times$ register-width advantage. We attribute this to
two compounding factors: (1) the unrolled hand-written assembly eliminates
loop overhead and branch-prediction pressure; (2) the inverse NTT's access
pattern differs slightly from the forward NTT's, in a way that benefits from
out-of-order execution across the wide issue ports of the Cascade Lake
microarchitecture. \phasetwo{Confirm with IPC and port-utilization counters.}
\subsection{Why the Compiler Cannot Auto-Vectorise NTT}

A striking result is that \varref{} and \varrefnv{} perform nearly identically
across all arithmetic operations ($\leq 10\%$ difference, with \varrefnv{}
occasionally faster). This means GCC's tree-vectorizer produces no net benefit
for the NTT inner loop.

The fundamental obstacle is \emph{modular reduction}: Barrett reduction and
Montgomery reduction require a multiply-high operation (\texttt{vpmulhw}) that
GCC cannot express through the scalar multiply-add chain it generates for the
C reference code. Additionally, the NTT butterfly requires coefficient
interleaving (odd/even index separation) that the auto-vectorizer does not
recognize as a known shuffle pattern. The hand-written assembly encodes these
patterns directly in \texttt{vpunpck*} instructions.
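As a concrete illustration of the multiply-high dependency, here is the
standard scalar Montgomery reduction for $q = 3329$ (a sketch following the
well-known reference-style formulation; constant and function names are ours).
The final \texttt{>{}> 16} extracts exactly the high half that
\texttt{vpmulhw} computes sixteen lanes at a time.

```c
#include <stdint.h>

#define Q    3329
#define QINV (-3327)  /* q^{-1} mod 2^16, as a signed 16-bit value */

/* Montgomery reduction: given a in (-Q*2^15, Q*2^15), returns
 * r = a * 2^{-16} mod Q with |r| < Q.  The low halves of a and m*Q agree
 * by construction, so the shift is exact: it is the scalar counterpart of
 * the vpmulhw high-half multiply. */
static int16_t montgomery_reduce(int32_t a) {
    int16_t m = (int16_t)((int16_t)a * QINV); /* m = a * q^{-1} mod 2^16 */
    return (int16_t)((a - (int32_t)m * Q) >> 16);
}
```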

This finding has practical significance: developers porting \mlkem{} to new
platforms cannot rely on the compiler to provide SIMD speedup for the NTT.
Hand-written intrinsics or architecture-specific assembly are necessary.
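As an illustration of what such intrinsics look like, the sketch below (our
own illustrative code, not the reference implementation's) performs sixteen
Montgomery multiplications with the \texttt{vpmullw}/\texttt{vpmulhw} pairing
discussed above. It assumes a GCC/Clang x86 target; the \texttt{target}
attribute lets it compile without a global \texttt{-mavx2} flag.

```c
#include <immintrin.h>
#include <stdint.h>

#define Q    3329
#define QINV (-3327)  /* q^{-1} mod 2^16 */

/* Sixteen lane-wise Montgomery multiplications.  _mm256_mulhi_epi16 maps
 * to vpmulhw, the instruction the auto-vectorizer never emits here. */
__attribute__((target("avx2")))
static __m256i fqmul16(__m256i a, __m256i b) {
    __m256i lo  = _mm256_mullo_epi16(a, b);                        /* vpmullw */
    __m256i hi  = _mm256_mulhi_epi16(a, b);                        /* vpmulhw */
    __m256i m   = _mm256_mullo_epi16(lo, _mm256_set1_epi16(QINV));
    __m256i mqh = _mm256_mulhi_epi16(m, _mm256_set1_epi16(Q));
    return _mm256_sub_epi16(hi, mqh);  /* = (a*b - m*Q) >> 16, lane-wise */
}

/* Array wrapper so callers compiled without AVX2 codegen can invoke it. */
__attribute__((target("avx2")))
static void fqmul16_arrays(int16_t r[16], const int16_t x[16],
                           const int16_t y[16]) {
    __m256i a = _mm256_loadu_si256((const __m256i *)x);
    __m256i b = _mm256_loadu_si256((const __m256i *)y);
    _mm256_storeu_si256((__m256i *)r, fqmul16(a, b));
}
```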
\subsection{Why SHAKE Operations Benefit Less}

\op{gen\_a} expands a public seed into a $k \times k$ matrix of polynomials
using SHAKE-128. Each Keccak-f[1600] permutation operates on a 200-byte state
that does not fit in AVX2 registers (16 lanes $\times$ 16 bits = 32 bytes). The
AVX2 Keccak implementation achieves \speedup{3.8}--\speedup{4.7} primarily by
batching multiple independent absorb phases and using vectorized XOR across
parallel state words---a different kind of SIMD parallelism than the arithmetic
path. The bottleneck shifts to memory bandwidth as the permutation state is
repeatedly loaded from and stored to L1 cache.
\subsection{Why Noise Sampling Barely Benefits}

CBD noise sampling reads adjacent bits from a byte stream and computes
Hamming weights. The scalar path already uses branch-free bitwise operations
(a constant-time design). The AVX2 path can batch the popcount computation
but remains bottlenecked by the sequential bitstream access pattern. The
modest \speedup{1.2}--\speedup{1.4} gain reflects this memory-access
bottleneck rather than a compute limitation.
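For concreteness, the following portable C sketch shows the bit-sliced
centered-binomial computation for $\eta = 2$ in the style of the reference
implementation (function name ours): each coefficient is the difference of
two 2-bit Hamming weights taken from consecutive bits of the SHAKE output.

```c
#include <stdint.h>

/* Centered binomial sampling, eta = 2: turn 4 bytes of SHAKE output into
 * 8 coefficients in [-2, 2], branch-free. */
static void cbd2_block(int16_t r[8], const uint8_t buf[4]) {
    uint32_t t = (uint32_t)buf[0]         | ((uint32_t)buf[1] << 8) |
                 ((uint32_t)buf[2] << 16) | ((uint32_t)buf[3] << 24);
    /* Sum adjacent bit pairs: d now holds sixteen 2-bit Hamming weights. */
    uint32_t d = (t & 0x55555555u) + ((t >> 1) & 0x55555555u);
    for (int i = 0; i < 8; i++) {
        int16_t a = (int16_t)((d >> (4 * i))     & 0x3); /* bits 4i..4i+1   */
        int16_t b = (int16_t)((d >> (4 * i + 2)) & 0x3); /* bits 4i+2..4i+3 */
        r[i] = (int16_t)(a - b);
    }
}
```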
\subsection{NTT Cache-State Variation Across Parameter Sets}

The 13\% variation in NTT speedup across parameter sets
(\S\ref{sec:results:crossparams}), despite identical polynomial dimensions,
suggests that execution context matters even for nominally isolated
micro-benchmarks. The polyvec operations that precede each NTT call have
larger memory footprints at higher $k$ ($k$ more polynomials in the
accumulation buffer), potentially evicting portions of the instruction cache
or L1 data cache that the scalar NTT path relies on. The AVX2 path is less
affected because it keeps more coefficient state in vector registers between
operations. \phasetwo{Verify with L1/L2 miss counters split by scalar vs.\ AVX2.}
\subsection{Implications for Deployment}

The end-to-end KEM speedups of \speedup{5.4}--\speedup{7.1} (Appendix,
Figure~\ref{fig:kemlevel}) represent the practical deployment benefit.
Deployments that cannot use hand-written SIMD (e.g., some constrained
environments, or languages without inline-assembly support) should expect
performance within a factor of $5$--$7$ of the AVX2 reference.
Auto-vectorization provides essentially no shortcut: the gap between
compiler-optimized C and hand-written SIMD is the full $5$--$7\times$, not
a fraction of it.
\subsection{Limitations}

\paragraph{No hardware counter data (Phase~1).} The mechanistic explanations
in this section are derived analytically from instruction-set structure and
publicly known microarchitectural details. Phase~2 will validate them with
PAPI counter measurements. \phasetwo{PAPI counters: IPC, cache miss rates.}

\paragraph{Single microarchitecture.} All results are from Intel Cascade Lake
(Xeon Platinum 8268). Speedup ratios may differ on other AVX2 hosts (e.g.,
Intel Skylake, AMD Zen~3/4) due to differences in execution-port configuration,
vector throughput, and out-of-order window size.
\phasethree{Repeat on AMD Zen, ARM Graviton3, RISC-V.}

\paragraph{Frequency scaling.} OSCAR nodes may operate in a power-capped mode
that reduces Turbo Boost frequency under sustained SIMD load. RDTSC counts
wall-clock ticks at the invariant TSC frequency, which may differ from the
actual core frequency during SIMD execution.
\phasetwo{Characterize frequency during benchmarks; consider RAPL-normalized
cycle counts.}
|