- add-popup-source.sh: slug validated against ^[a-z0-9-]+$ before nginx
interpolation; UPSTREAM_HOST derived unconditionally so the CSP
reminder fires in the no-proxy case — which is exactly when the host
must be added to connect-src (AUDIT §4.8)
- refreeze.sh: backs up the freeze and restores it on a failed resolve
instead of leaving the repo with no freeze file (§4.9)
- einops gets the policy-mandated upper bound and a comment naming its
consumer (nomic's remote modeling code) (§1.5)
- Makefile: pdftoppm failures warn instead of vanishing in the while
pipeline; .NOTPARALLEL guards deploy's clean->build->sign ordering
against -j invocations (§8.4)
- Atomic writers (embed, archive, the three sidecar extractors):
PID-unique temp names so concurrent runs can't interleave, cleanup on
failure everywhere, fsync where the artifact is not trivially
regenerable (§4.10)
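A minimal sketch of the atomic-writer pattern described in the last item, for reference. The helper name atomic_write and its durable flag are hypothetical; the actual writers in embed, archive, and the sidecar extractors may be structured differently.

```python
import os


def atomic_write(path: str, data: bytes, durable: bool = True) -> None:
    """Write data to path through a PID-unique temp file in the same directory."""
    # The PID suffix keeps two concurrent runs from sharing one temp file.
    tmp = f"{path}.{os.getpid()}.tmp"
    try:
        with open(tmp, "wb") as f:
            f.write(data)
            if durable:
                # Only worth the cost when the artifact is expensive to rebuild.
                f.flush()
                os.fsync(f.fileno())
        # Atomic when tmp and path live on the same filesystem.
        os.replace(tmp, path)
    except BaseException:
        # Clean up on every failure path, then re-raise the original error.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
```

Readers of path see either the old file or the complete new one, never a partial write.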
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The 'skip if outputs newer than every HTML' check could never fire:
stamp-build-time.py rewrites every page's footer AFTER embed.py runs,
so the comparison was always false and the full MiniLM paragraph pass
(and model load) ran on every build (AUDIT §4.3). Replaced with the
same content-hash cache the page pass already had — generalized
load/save_vec_cache, keyed by sha256 of the input text, invalidated on
model/revision/dim change. A no-change rerun now does no model loads:
measured 97s cold -> 4.8s warm.
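A rough sketch of the content-hash cache idea, under stated assumptions: the real cache is an npz file, while this version uses JSON to stay stdlib-only, and the signatures of load_vec_cache, save_vec_cache, and embed_all as well as the encode callable are illustrative, not the ones in embed.py.

```python
import hashlib
import json
import os


def cache_key(text: str) -> str:
    # Key by content, not mtime, so footer restamping order no longer matters
    # as long as the extracted text itself is unchanged.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def load_vec_cache(path, model, revision, dim):
    """Return {sha256: vector}; empty if missing, unreadable, or stale."""
    try:
        with open(path, encoding="utf-8") as f:
            blob = json.load(f)
    except (OSError, ValueError):
        return {}
    meta = {"model": model, "revision": revision, "dim": dim}
    if not isinstance(blob, dict) or blob.get("meta") != meta:
        return {}  # model, revision, or dim changed: invalidate everything
    return blob.get("vectors", {})


def save_vec_cache(path, model, revision, dim, vectors):
    blob = {"meta": {"model": model, "revision": revision, "dim": dim},
            "vectors": vectors}
    tmp = f"{path}.{os.getpid()}.tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(blob, f)
    os.replace(tmp, path)


def embed_all(texts, encode, cache):
    """Embed texts, calling encode only for texts missing from the cache."""
    missing = list(dict.fromkeys(t for t in texts if cache_key(t) not in cache))
    if missing:
        # encode is the expensive step (model load plus inference).
        for text, vec in zip(missing, encode(missing)):
            cache[cache_key(text)] = list(vec)
    return [cache[cache_key(t)] for t in texts]
```

When every text hits the cache, encode is never called, which is what makes the warm path skip the model load.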
Also strips section.footnotes from extraction: the new no-JS fallback
duplicates each sidenote's text at document end, which would double
footnotes in search results and skew page similarity.
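For illustration, one way to skip section.footnotes during text extraction, using only the standard library. The TextExtractor class and extract_text helper are hypothetical; the real extractor may use a different HTML library and selector logic.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text, ignoring anything inside <section class="footnotes">."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside the footnotes section

    def handle_starttag(self, tag, attrs):
        if tag != "section":
            return
        classes = (dict(attrs).get("class") or "").split()
        # Count nested sections too, so the matching end tag is found.
        if self.skip_depth or "footnotes" in classes:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "section" and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse whitespace so the result is stable across formatting changes.
    return " ".join(" ".join(parser.parts).split())
```

The inline sidenote text is kept once; the duplicated copy at document end is dropped.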
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- embed.py: pin nomic's auto_map modeling repo via code_revision —
revision= alone left nomic-bert-2048 unpinned under
trust_remote_code (AUDIT §1.3; verified loadable with
HF_HUB_OFFLINE=1). Catch BadZipFile/EOFError when loading the page
cache so a half-written npz is discarded, not fatal (§4.2), and
unlink the tmp file on a failed save (§4.1)
- nginx: collapse the CSP to one physical line — nginx has no line
continuation in quoted strings, so the old value embedded literal
backslash+LF bytes, illegal in HTTP/2 (§8.1). Add the externals the
site actually uses: KaTeX webfonts + onnxruntime wasm via jsdelivr,
and the popup provider APIs popups.js documents (§8.2)
- Makefile: pathspec-limit the auto-commit to content/ so pre-staged
unrelated work is no longer swept into auto: commits (§8.3)
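A small sketch of the tolerant cache load from the embed.py item. An npz file is a zip container, so the standard zipfile module stands in here for the numpy loader; the function name load_cache_members is hypothetical, and treating the damaged file as empty (rather than deleting it) is an assumption about what "discarded" means.

```python
import zipfile


def load_cache_members(path):
    """Return {member name: bytes} from a zip-format cache, or {} if unusable."""
    try:
        with zipfile.ZipFile(path) as zf:
            return {name: zf.read(name) for name in zf.namelist()}
    except FileNotFoundError:
        # No cache yet: a cold start, not an error.
        return {}
    except (zipfile.BadZipFile, EOFError):
        # Half-written cache from an interrupted run: fall back to an empty
        # cache and let the next successful save overwrite the file.
        return {}
```

A truncated file then costs one full re-embed instead of failing the build.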
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Pages (similar-links.json, build-only) move to nomic-embed-text-v1.5
(768d) with an on-disk npz cache; paragraphs (browser semantic search)
stay on all-MiniLM-L6-v2 (384d), so the client contract is unchanged.
WRITING.md search row updated accordingly. einops added for nomic's
remote modeling code; cache gitignored with a trailing glob so
interrupted-write debris is covered too.
Known follow-ups (AUDIT-2026-06-09.md §1.3, §4): pin the
nomic-bert-2048 remote code, catch BadZipFile in cache loads, fix the
staleness check defeated by stamp-build-time ordering.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Preserve external works the site cites against link rot, host them at
permanent /archive/<slug>/ URLs in site chrome, and treat them as
first-class citizens of the backlinks and similar-pages indexes.
Curated, not crawled: the author adds one line to archive/manifest.yaml
and the build fetches, hashes, snapshots, and indexes the work.
* archive/manifest.yaml + tools/archive.py (fetch / refresh / wayback /
check / gc) — PDFs downloaded directly, HTML pages snapshotted with a
vendored monolith (tools/bin/monolith @ 2.10.1) into a single
self-contained file with the archive CSP and a noarchive robots meta
injected. Per-entry PROVENANCE.json committed; gitignored .txt
sidecars regenerated from the artifact's SHA-256.
* build/Archive.hs + build/ArchiveIndex.hs + build/Filters/Archive.hs
— Hakyll rules for /archive/ and /archive/<slug>/, a body Pandoc
filter that appends an archive affordance to live citations and
flips dead ones to the local copy on archive.py check's asymmetric
hysteresis (rotted needs 3 fails over >= 14 days; one ok recovers).
* build/Backlinks.hs — keeps archived external URLs through pass 1 and
canonicalises them to /archive/<slug>/ in pass 2, producing a
"Referenced by" section grouped by the fragment each citation
targets. build/Stats.hs gains a "Link archive" telemetry block on
/build/ (count, total size, median age, by-status / by-quality /
by-visibility, orphans).
* Integrity: archive.py fetch and build/Archive.hs (via sha256sum)
both re-hash every committed artifact, so a tampered file halts the
build even with cabal invoked directly or no .venv present. refresh
refuses to replace an uncommitted prior snapshot and rolls back
atomically on any exit path. removed.yaml is honoured by fetch,
wayback, and check using canonical-form (tracking-stripped,
arXiv-canonicalised) comparison.
* visibility: private keeps an entry in-repo but undeployed.
nginx/archive.conf emits X-Robots-Tag: noindex, noarchive for raw
artifacts that cannot carry meta directives.
The full design, phase plan (1-5), and three refinement passes live
in ARCHIVE.md.
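The asymmetric hysteresis in archive.py check can be sketched roughly as below. This is one reading of "3 fails over >= 14 days; one ok recovers": it assumes the fails are consecutive and the span is measured from the first fail of the current streak. LinkState and record_check are hypothetical names; ARCHIVE.md has the authoritative rules.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

ROT_MIN_FAILS = 3                    # failures needed before a link is rotted
ROT_MIN_SPAN = timedelta(days=14)    # and they must span at least this long


@dataclass
class LinkState:
    status: str = "live"
    fails: int = 0
    first_fail: Optional[date] = None


def record_check(state: LinkState, ok: bool, today: date) -> LinkState:
    """Fold one check result into the state; slow to rot, quick to recover."""
    if ok:
        # A single success clears the streak and restores the live link.
        return LinkState()
    first = state.first_fail or today
    fails = state.fails + 1
    rotted = fails >= ROT_MIN_FAILS and (today - first) >= ROT_MIN_SPAN
    return LinkState("rotted" if rotted else "live", fails, first)
```

The asymmetry keeps a brief outage from flipping citations to the local copy while still letting a recovered site win back its link immediately.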
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Add tools/model-checksums.sha256 with sha256 hashes for the five
Xenova/all-MiniLM-L6-v2 files served from static/models/.
download-model.sh was already plumbed to verify against this file
when present; the file itself was missing, so downloads were
unverified. Now every fetch checks against committed hashes and
fails closed on mismatch.
- Pin embed.py's SentenceTransformer load to a specific HF commit
(c9745ed1d9f207416be6d2e6f8de32d1f16199bf of
sentence-transformers/all-MiniLM-L6-v2). An upstream model update can no
longer silently change embedding semantics across builds. Bump the pin
deliberately when validating a new revision, then re-run a full embed
pass to refresh the semantic + similar-links data.
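For reference, a fail-closed checksum check in the spirit of the first item, assuming tools/model-checksums.sha256 uses the usual sha256sum line format ("<hex digest>  <filename>"). download-model.sh is a shell script; this Python version and its function names are illustrative only.

```python
import hashlib
from pathlib import Path


def parse_checksums(text: str) -> dict:
    """Parse sha256sum-style lines into {filename: hex digest}."""
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        digest, name = line.split(maxsplit=1)
        # sha256sum marks binary-mode entries with a leading '*'.
        expected[name.lstrip("*")] = digest.lower()
    return expected


def sha256_file(path) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_all(checksum_file, root) -> None:
    """Raise unless every listed file exists under root and matches its hash."""
    expected = parse_checksums(Path(checksum_file).read_text())
    if not expected:
        raise ValueError("checksum file lists no entries")
    for name, digest in expected.items():
        # A missing file raises FileNotFoundError, which also fails closed.
        if sha256_file(Path(root) / name) != digest:
            raise ValueError(f"checksum mismatch for {name}")
```

An empty or missing checksum list is treated as an error here, so verification cannot be skipped silently.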
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>