% ── 5. Discussion ─────────────────────────────────────────────────────────────
\section{Discussion}
\label{sec:discussion}
\subsection{Why Arithmetic Operations Benefit Most}
The NTT butterfly loop processes 128 pairs of 16-bit coefficients per forward
transform. In the scalar \varref{} path, each butterfly requires a modular
multiplication (implemented as a Barrett reduction), an addition, and a
subtraction---roughly 10--15 instructions per pair with data-dependent
serialization through the multiply-add chain. The AVX2 path uses
\texttt{vpmullw}/\texttt{vpmulhw} to compute 16 Montgomery multiplications
per instruction pair, processing an entire butterfly layer with roughly
$16\times$ fewer instructions.
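To make the scalar cost concrete, the following is a minimal sketch of a
Barrett reduction and a single butterfly for $q = 3329$ (the \mlkem{}
modulus). The helper name \texttt{butterfly} and the use of an explicit
\texttt{\%} for the twiddle multiply are illustrative simplifications, not
the reference implementation's exact code:

```c
#include <assert.h>
#include <stdint.h>

#define Q 3329  /* ML-KEM modulus */

/* Barrett reduction: returns r congruent to a mod Q, centered in
 * roughly (-Q/2, Q/2]. Note the data-dependent multiply-shift chain
 * that serializes the scalar path. */
static int16_t barrett_reduce(int16_t a) {
    const int16_t v = ((1 << 26) + Q / 2) / Q;  /* precomputed rounded 2^26/Q */
    int16_t t = (int16_t)(((int32_t)v * a + (1 << 25)) >> 26);
    return (int16_t)(a - t * Q);
}

/* One Cooley-Tukey butterfly on a coefficient pair; zeta is the
 * twiddle factor for this layer. Illustrative only: the reference
 * code uses a Montgomery multiply here instead of '%'. */
static void butterfly(int16_t *lo, int16_t *hi, int16_t zeta) {
    int16_t l = *lo;
    int16_t t = (int16_t)(((int32_t)zeta * *hi) % Q);
    *hi = barrett_reduce((int16_t)(l - t));
    *lo = barrett_reduce((int16_t)(l + t));
}
```

Each such pair costs on the order of the 10--15 scalar instructions cited
above; the AVX2 path performs the equivalent work for 16 pairs per
vectorized multiply sequence.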
The observed INVNTT speedup of \speedup{56.3} at \mlkemk{512} \emph{exceeds}
the theoretical $16\times$ register-width advantage. We attribute this to
two compounding factors: (1) the unrolled hand-written assembly eliminates
loop overhead and branch-prediction pressure; (2) the inverse NTT has an
access pattern that differs slightly from the forward NTT's and benefits
more from out-of-order execution across the wide issue ports of the
Cascade Lake microarchitecture. \phasetwo{Confirm with IPC and port
utilisation counters.}
\subsection{Why the Compiler Cannot Auto-Vectorise NTT}
A striking result is that \varref{} and \varrefnv{} perform nearly identically
for all arithmetic operations ($\leq 10\%$ difference, with \varrefnv{}
occasionally faster). This means GCC's tree-vectorizer produces no net benefit
for the NTT inner loop.
The fundamental obstacle is \emph{modular reduction}: Barrett reduction and
Montgomery reduction require a multiply-high operation (\texttt{vpmulhw})
that GCC does not synthesize from the scalar multiply-and-shift sequence it
generates for the C reference code. Additionally, the NTT butterfly
requires coefficient
interleaving (odd/even index separation) that the auto-vectorizer does not
recognize as a known shuffle pattern. The hand-written assembly encodes these
patterns directly in \texttt{vpunpck*} instructions.
This finding has practical significance: developers porting \mlkem{} to new
platforms cannot rely on the compiler to provide SIMD speedup for the NTT.
Hand-written intrinsics or architecture-specific assembly are necessary.
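The multiply-high obstacle is visible in a scalar Montgomery reduction
sketch for $q = 3329$ (QINV is $q^{-1} \bmod 2^{16}$ as a signed value;
both constants are standard for \mlkem{}). The final \texttt{>> 16} is
exactly the per-lane operation \texttt{vpmulhw} provides, but the
auto-vectorizer does not recognize it in this scalar form:

```c
#include <assert.h>
#include <stdint.h>

#define Q 3329
#define QINV -3327  /* q^-1 mod 2^16, as a signed 16-bit value */

/* Montgomery reduction: for |a| < q*2^15, returns a * 2^-16 mod q.
 * The two halves of the computation map onto the AVX2 pair:
 * the low-half product is a vpmullw analogue, the high-half
 * extraction a vpmulhw analogue -- 16 lanes per instruction there. */
static int16_t montgomery_reduce(int32_t a) {
    int16_t t = (int16_t)((int16_t)a * (int16_t)QINV); /* low 16 bits */
    t = (int16_t)((a - (int32_t)t * Q) >> 16);         /* high 16 bits */
    return t;
}
```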
\subsection{Why SHAKE Operations Benefit Less}
\op{gen\_a} expands a public seed into a $k \times k$ matrix of polynomials
using SHAKE-128. Each Keccak-f[1600] permutation operates on a 200-byte state
that does not fit in AVX2 registers (16 lanes $\times$ 16 bits = 32 bytes). The
AVX2 Keccak implementation achieves \speedup{3.8}--\speedup{4.7} primarily by
batching multiple independent absorb phases and using vectorized XOR across
parallel state words---a different kind of SIMD parallelism than the arithmetic
path. The bottleneck shifts to load/store traffic as the permutation state
is repeatedly moved between vector registers and the L1 cache.
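The state-size mismatch is easy to see in the $\theta$ step of
Keccak-f[1600], sketched below: the 25-lane (200-byte) state is $6.25\times$
wider than one 32-byte AVX2 register, so vectorizing a single permutation
is awkward, and batching independent permutations is the effective
strategy. The function names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

static uint64_t rotl64(uint64_t x, int n) {
    return (x << n) | (x >> (64 - n));
}

/* Theta step of Keccak-f[1600]: XOR column parities into the
 * 25-lane state. Entirely XOR/rotate, branch-free -- amenable to
 * vectorized XOR across batched states, but a single 200-byte
 * state does not fit in AVX2 registers. */
static void theta(uint64_t A[25]) {
    uint64_t C[5], D[5];
    for (int x = 0; x < 5; x++)
        C[x] = A[x] ^ A[x + 5] ^ A[x + 10] ^ A[x + 15] ^ A[x + 20];
    for (int x = 0; x < 5; x++)
        D[x] = C[(x + 4) % 5] ^ rotl64(C[(x + 1) % 5], 1);
    for (int i = 0; i < 25; i++)
        A[i] ^= D[i % 5];
}
```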
\subsection{Why Noise Sampling Barely Benefits}
CBD noise sampling reads adjacent bits from a byte stream and computes
Hamming weights. The scalar path already uses bitwise operations with no
data-dependent branches (constant-time design). The AVX2 path can batch the
popcount computation but remains bottlenecked by the sequential bitstream
access pattern. The modest \speedup{1.2}--\speedup{1.4} gain reflects this
memory-access bottleneck rather than a compute limitation.
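The bit-level structure is illustrated by a centered-binomial sample for
$\eta = 2$ over a single 32-bit word, in the style of the reference code
(the helper name \texttt{cbd2\_word} is ours; the real sampler streams
SHAKE output across a whole polynomial):

```c
#include <assert.h>
#include <stdint.h>

/* CBD(eta=2): from one 32-bit word of PRF output, derive 8
 * coefficients a - b, where a and b are Hamming weights of adjacent
 * 2-bit groups. Branch-free (constant-time), but the bitstream is
 * consumed sequentially, which limits what SIMD batching can add. */
static void cbd2_word(uint32_t t, int16_t out[8]) {
    /* pairwise bit sums: each 2-bit field of d holds the Hamming
     * weight of the corresponding 2-bit group of t */
    uint32_t d = (t & 0x55555555u) + ((t >> 1) & 0x55555555u);
    for (int j = 0; j < 8; j++) {
        int16_t a = (int16_t)((d >> (4 * j)) & 0x3);
        int16_t b = (int16_t)((d >> (4 * j + 2)) & 0x3);
        out[j] = (int16_t)(a - b);
    }
}
```

The popcount-style masking batches cleanly, but each word of PRF output
must still be fetched and decoded in order, which is the sequential
bottleneck described above.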
\subsection{NTT Cache-State Variation Across Parameter Sets}
The 13\% variation in NTT speedup across parameter sets
(Section~\ref{sec:results:crossparams}) despite identical polynomial dimensions
suggests that execution context matters even for nominally isolated
micro-benchmarks. Higher-$k$ polyvec operations that precede each NTT call
have larger memory footprints ($k$ more polynomials in the accumulation
buffer), potentially evicting portions of the instruction cache or L1 data
cache that the scalar NTT path relies on. The AVX2 path is less affected
because it maintains more coefficient state in vector registers between
operations. \phasetwo{Verify with L1/L2 miss counters split by scalar vs AVX2.}
\subsection{Implications for Deployment}
The end-to-end KEM speedups of \speedup{5.4}--\speedup{7.1} (Appendix,
Figure~\ref{fig:kemlevel}) represent the practical deployment benefit.
Deployments that cannot use hand-written SIMD (e.g., some constrained
environments, or languages without inline assembly support) should expect
performance within a factor of $5$--$7$ of the AVX2 reference.
Auto-vectorization provides essentially no shortcut: the gap between
compiler-optimized C and hand-written SIMD is the full $5$--$7\times$, not
a fraction of it.
\subsection{Limitations}
\paragraph{No hardware counter data (Phase~1).} The mechanistic explanations
in this section are derived analytically from instruction-set structure and
publicly known microarchitecture details. Phase~2 will validate these with
PAPI counter measurements. \phasetwo{PAPI counters: IPC, cache miss rates.}
\paragraph{Single microarchitecture.} All results are from Intel Cascade Lake
(Xeon Platinum 8268). Speedup ratios may differ on other AVX2 hosts (e.g.,
Intel Skylake, AMD Zen 3/4) due to differences in execution port configuration,
vector throughput, and out-of-order window size.
\phasethree{Repeat on AMD Zen, ARM Graviton3, RISC-V.}
\paragraph{Frequency scaling.} OSCAR nodes may operate in a power-capped mode
that reduces Turbo Boost frequency under sustained SIMD load. RDTSC counts
wall-clock ticks at the invariant TSC frequency, which may differ from the
actual core frequency during SIMD execution.
\phasetwo{Characterize frequency during benchmarks; consider RAPL-normalized
cycle counts.}