% ── 5. Discussion ─────────────────────────────────────────────────────────────
\section{Discussion}
\label{sec:discussion}

\subsection{Why Arithmetic Operations Benefit Most}

The NTT butterfly loop processes 128 pairs of 16-bit coefficients per forward transform. In the scalar \varref{} path, each butterfly requires a modular multiplication (implemented as a Barrett reduction), an addition, and a subtraction---roughly 10--15 instructions per pair, with data-dependent serialization through the multiply--add chain. The AVX2 path uses \texttt{vpmullw}/\texttt{vpmulhw} to compute 16 Montgomery multiplications per instruction pair, processing an entire butterfly layer in roughly $16\times$ fewer instructions.

The observed INVNTT speedup of \speedup{56.3} at \mlkemk{512} \emph{exceeds} the theoretical $16\times$ register-width advantage. We attribute this to two compounding factors: (1) the unrolled, hand-written assembly eliminates loop overhead and branch-prediction pressure; (2) the inverse NTT has a slightly different access pattern than the forward NTT, one that benefits from out-of-order execution across the wide issue ports of the Cascade Lake microarchitecture. \phasetwo{Confirm with IPC and port-utilization counters.}

\subsection{Why the Compiler Cannot Auto-Vectorize NTT}

A striking result is that \varref{} and \varrefnv{} perform nearly identically for all arithmetic operations ($\leq 10\%$ difference, with \varrefnv{} occasionally faster): GCC's tree-vectorizer produces no net benefit for the NTT inner loop. The fundamental obstacle is \emph{modular reduction}: Barrett and Montgomery reduction both hinge on a multiply-high operation (\texttt{vpmulhw}) that GCC cannot synthesize from the scalar multiply-and-shift sequence it generates for the C reference code. Additionally, the NTT butterfly requires coefficient interleaving (odd/even index separation) that the auto-vectorizer does not recognize as a known shuffle pattern.
The hand-written assembly encodes these patterns directly with \texttt{vpunpck*} instructions.

This finding has practical significance: developers porting \mlkem{} to new platforms cannot rely on the compiler for SIMD speedup in the NTT; hand-written intrinsics or architecture-specific assembly are necessary.

\subsection{Why SHAKE Operations Benefit Less}

\op{gen\_a} expands a public seed into a $k \times k$ matrix of polynomials using SHAKE-128. Each Keccak-$f$[1600] permutation operates on a 200-byte state, which does not fit in a single AVX2 register (16 lanes $\times$ 16 bits $=$ 32 bytes). The AVX2 Keccak implementation achieves \speedup{3.8}--\speedup{4.7} primarily by batching multiple independent absorb phases and vectorizing the XOR operations across parallel state words---a different kind of SIMD parallelism than the arithmetic path exploits. The bottleneck shifts to load/store traffic as the permutation state is repeatedly moved between vector registers and the L1 cache.

\subsection{Why Noise Sampling Barely Benefits}

CBD noise sampling reads adjacent bits from a byte stream and computes Hamming weights. The scalar path already uses branch-free bitwise operations (a constant-time design). The AVX2 path can batch the popcount computation but remains bound by the sequential bitstream access pattern. The modest \speedup{1.2}--\speedup{1.4} gain therefore reflects a memory-access bottleneck rather than a compute limitation.

\subsection{NTT Cache-State Variation Across Parameter Sets}

The 13\% variation in NTT speedup across parameter sets (§\ref{sec:results:crossparams}), despite identical polynomial dimensions, suggests that execution context matters even for nominally isolated micro-benchmarks. Higher-$k$ polyvec operations that precede each NTT call have larger memory footprints ($k$ more polynomials in the accumulation buffer), potentially evicting portions of the instruction cache or L1 data cache that the scalar NTT path relies on.
The AVX2 path is less affected because it keeps more coefficient state in vector registers between operations. \phasetwo{Verify with L1/L2 miss counters split by scalar vs.\ AVX2.}

\subsection{Implications for Deployment}

The end-to-end KEM speedups of \speedup{5.4}--\speedup{7.1} (Appendix, Figure~\ref{fig:kemlevel}) represent the practical deployment benefit. Deployments that cannot use hand-written SIMD (e.g., some constrained environments, or languages without inline-assembly support) should expect to run a factor of $5$--$7$ slower than the AVX2 reference. Auto-vectorization provides essentially no shortcut: the gap between compiler-optimized C and hand-written SIMD is the full $5$--$7\times$, not a fraction of it.

\subsection{Limitations}

\paragraph{No hardware counter data (Phase~1).} The mechanistic explanations in this section are derived analytically from instruction-set structure and publicly documented microarchitecture details. Phase~2 will validate them with PAPI counter measurements. \phasetwo{PAPI counters: IPC, cache-miss rates.}

\paragraph{Single microarchitecture.} All results are from Intel Cascade Lake (Xeon Platinum 8268). Speedup ratios may differ on other AVX2 hosts (e.g., Intel Skylake, AMD Zen~3/4) owing to differences in execution-port configuration, vector throughput, and out-of-order window size. \phasethree{Repeat on AMD Zen, ARM Graviton3, RISC-V.}

\paragraph{Frequency scaling.} OSCAR nodes may operate in a power-capped mode that reduces Turbo Boost frequency under sustained SIMD load. RDTSC counts wall-clock ticks at the invariant TSC frequency, which may differ from the actual core frequency during SIMD execution. \phasetwo{Characterize frequency during benchmarks; consider RAPL-normalized cycle counts.}