# Archive

Design and implementation plan for the link-archiving system of levineuwirth.org. This is the source of truth for how external references are preserved, hosted, displayed, and indexed. It sits alongside `WRITING.md`, `PHOTOGRAPHY.md`, `HOMEPAGE.md`, and `MARKS.md` as authoritative spec.

## Status

**Reviewed and ratified 2026-05-21, with revisions.** The original draft was reviewed against the live site over three rounds; the decisions below incorporate every round of deltas and are now locked.

**Phase 1 complete (2026-05-22).** PDF entries: `archive/manifest.yaml`, `tools/archive.py` (`fetch` + `gc`), `build/Archive.hs`, the four templates, and the Makefile / `head.html` / `.gitignore` wiring are built and verified — `/archive/` and `/archive/nist-fips-203/` render.

**Phase 2 complete (2026-05-22).** HTML snapshots: the pinned `monolith` binary is vendored at `tools/bin/monolith`, `archive.py fetch` snapshots HTML pages (CSP injected, text extracted, quality classified), and `archive.html` renders them in a sandboxed iframe — `/archive/djb-aes-speed/` renders. The cross-browser CSP check and the per-snapshot review remain author-gated by design.

**Archive pages styled (2026-05-22).** `static/css/archive.css` gives the index and entry pages a framed treatment (banner callout, provenance panel, artifact viewer); the PDF embed was changed to the raw `document.pdf` (browser-native viewer), symmetric with HTML snapshots — see the Display — PDF decision.

**Phase 3 complete (2026-05-22).** Link annotation + Wayback: `Filters/Archive.hs` appends an archive affordance to body links whose target is archived; `archive.py wayback` (+ `make archive-wayback`) backfills Wayback captures; `visibility: private` keeps an entry's artifact in-repo but undeployed. Bibliography annotation is documented as a `Citations.hs` follow-up.


**Phase 4 complete (2026-05-22).** Backlinks + similar-pages: `Backlinks.hs` keeps archived external links and canonicalises them to their `/archive//` page, so an archived work lists every essay that cites it under "Referenced by" (grouped by the fragment each citation targets); `archive.html` also carries a "Related" block from the `embed.py` similarity corpus, which now indexes archive pages and excludes the `/archive/` index.

**Phase 5 complete (2026-05-22).** Link-rot detection: `tools/archive.py check` (+ `make archive-check`) HEAD/GET-probes every manifest URL and updates the gitignored `data/archive-state.json` under asymmetric hysteresis (`rotted` needs 3 fails over ≥14 days; a single success recovers immediately). `Filters.Archive` flips a body link to the archive when its target is `rotted`; each archive page surfaces its link status (provenance row, header note, Pagefind `status` filter tag); `/archive/` flags rotted entries; `/build/` gains a "Link archive" telemetry section. The search-UI `status` filter wiring in `search-filters.js` is deliberately partial — see the Phase 5 Met note.

**All five phases done.** Refinements next; see the Phase 5 Met note for the documented deferrals (search-UI status filter; bibliography annotation from Phase 3; pull-from-Wayback at fetch time).

**Refinements (2026-05-22).** A code-review pass found and fixed several correctness and posture issues across the system:

- **Missing committed artifact no longer re-fetches silently.** `cmd_fetch` used to skip its SHA guard when the artifact was absent and then download fresh bytes whose hash differed from the recorded `sha256` — replacing the recorded snapshot without surfacing it. The guard now also halts when `PROVENANCE.json` is present but the artifact is missing, requiring the author to restore the committed bytes before rebuilding.
- **`archive/removed.yaml` is now enforced in `fetch` and `check`.** It was only read by `gc`.
  A removed URL re-added to the manifest now halts `cmd_fetch` loudly; `cmd_check` skips removed URLs so the link-rot scanner does not keep probing a deliberate takedown.
- **SHA verification closed the `.venv`-bypass hole.** The original decision relied solely on `archive.py fetch` re-hashing, but that step is `.venv`-gated — a contributor or deploy host without `.venv`, or a direct `cabal run site -- build`, would publish a tampered artifact unchecked. `build/Archive.hs` now also re-hashes via `sha256sum` from `loadArchiveEntries` and halts the build on a mismatch, so the guarantee holds independent of the Python step.
- **Raw artifacts are no longer publicly indexable.** Pass 1 added a `robots.txt` `Disallow: /archive/`, which pass 2 then reverted (see below — it was counter-productive). Pass 1's other change — injecting `` into every new HTML snapshot alongside the archive CSP — remains in place; the deploy-side header for raw PDFs landed in pass 2 as `nginx/archive.conf`.
- **The documented `archive.py refresh {slug}` subcommand is implemented.** It clears the slug's directory, re-fetches via `cmd_fetch`, and records the prior `sha256` as `previous-sha256` in the new `PROVENANCE.json`. The URL-changed error message in `cmd_fetch` now points at it instead of asking the author to delete the directory by hand.
- **`url_aliases` widened** to the design's full equivalent-URL set: tracking-parameter stripping (`utm_*`, `fbclid`, `gclid`, `mc_*`, `ref`, `igshid`, `_hsenc`, `_hsmi`, `mkt_tok`) and arXiv abs / pdf / versioned / `.pdf` form expansion. Phase 1 had deliberately kept these as a Phase 4 deferral, but Phase 4 missed the follow-through.
- **`X-Robots-Tag: noarchive` is now honoured on both HEAD and GET.** Some servers omit the header on HEAD but emit it on GET; HTML capture now aborts if either response carries the directive.

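
The tracking-parameter half of the `url_aliases` widening can be sketched in a few lines of Python. This is a minimal illustration, not the code in `tools/archive.py`: the function names `is_tracking` and `strip_tracking` are hypothetical, and treating `utm_*` and `mc_*` as prefix matches while the remaining names match exactly is an assumption read off the list above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Names taken from the parameter list above. Whether each one is matched
# exactly or by prefix is an assumption of this sketch.
TRACKING_EXACT = {"fbclid", "gclid", "ref", "igshid", "_hsenc", "_hsmi", "mkt_tok"}
TRACKING_PREFIXES = ("utm_", "mc_")


def is_tracking(name: str) -> bool:
    """Return True if a query-parameter name looks like a tracking parameter."""
    lowered = name.lower()
    return lowered in TRACKING_EXACT or lowered.startswith(TRACKING_PREFIXES)


def strip_tracking(url: str) -> str:
    """Drop tracking parameters from a URL, keeping every other component."""
    parts = urlsplit(url)
    # keep_blank_values=True so parameters such as "?flag=" are not lost.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not is_tracking(key)
    ]
    # An empty query string makes urlunsplit omit the "?" entirely.
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Under these assumptions, `strip_tracking("https://cr.yp.to/aes-speed.html?utm_source=mail")` yields `https://cr.yp.to/aes-speed.html`, while an unrelated parameter such as `id=7` is kept. `urlencode` re-encodes the surviving parameters, so a tool that must preserve the remaining query byte-for-byte would need a different approach.
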

Three smaller items remain documented and deferred:

- **Archive tags joining the site-wide tag indexes.** `manifest.yaml`'s `tags:` is authored but `Tags.hs`/`Patterns.tagIndexable` does not yet ingest archive entries — it needs a Tags.hs-side integration with its own design pass (archive pages aren't `match`ed Hakyll items in the normal way).
- **`archive.py suggest`** (bibliography discovery — diff `.bib` URLs against the manifest) is documented but not implemented.
- **The controlled-host end-to-end link-rot test** (reserve `archive-test.levineuwirth.org`, run it through a 14-day-spanning fail streak, watch the flip happen) is inherently a multi-week real-world verification the author runs; the hysteresis logic is unit-tested deterministically and the rendering side is verified by a hand-crafted `rotted` state file.

**Refinements pass 2 (2026-05-23).** A second code-review pass surfaced correctness gaps the first pass missed:

- **`refresh` is now atomic.** It used to delete the slug directory and then call `cmd_fetch`; a failed re-fetch left the entry with no snapshot at all, while `refresh` returned 0 (because `cmd_fetch` reports per-entry skips, not a process failure). The slug directory is now *renamed* to a `.refresh-backup` sibling; success removes the backup, any failure restores it. Verified by hiding the `monolith` binary and confirming the prior snapshot survives intact.
- **Invalid `visibility` values fail closed.** The `ManifestEntry` parser used to accept any string and only treat the exact `"private"` as private — a typo like `privte` would publish a work the author intended to keep offline. The parser now rejects any value other than `public` or `private`, and `readManifest` halts the build on any parse error of a present file (instead of warning + returning an empty list — that silent-skip was for `file absent`, not `file present but corrupt`).
- **Lookup-side URL normalisation.** Alias generation alone cannot cover unbounded forms (arXiv versions, arbitrary tracking-parameter combinations). `ArchiveIndex` now normalises both index keys and lookup inputs through the same `normalizeUrl` (drop fragment, strip tracking, fold http→https, arXiv-canonicalise, trim trailing slash). Verified: `https://cr.yp.to/aes-speed.html`, `https://cr.yp.to/aes-speed.html?utm_source=mail`, and `http://cr.yp.to/aes-speed.html/` all match the same archived entry.
- **Raw-artifact indexing posture corrected.** The Phase-5 `robots.txt` `Disallow: /archive/` was counter-productive: a URL blocked by robots.txt can still appear in results when externally linked, and the Disallow also prevents compliant crawlers from reading the wrapper pages' ``. The Disallow is reverted; a new `nginx/archive.conf` snippet emits `X-Robots-Tag: noindex, noarchive` for the whole `/archive/` tree, which crawlers honour for any resource (HTML and PDF alike). The deploy vhost should `include snippets/archive.conf`.
- **`cmd_wayback` skips `removed.yaml`.** The eviction procedure says record in `removed.yaml` *before* dropping the manifest line; `fetch` and `check` now honour that ordering, but `wayback` did not. A removed entry whose manifest line was still in place could be submitted to a third-party archive after a takedown was recorded.
- **The shipped HTML snapshot was refreshed in the working tree** so it carries the noarchive meta the Phase-5 inject promises. `archive.py refresh djb-aes-speed` re-fetched cr.yp.to, applied `inject_archive_metas`, and recorded the prior SHA as `previous-sha256`. `archive/djb-aes-speed/{snapshot.html, PROVENANCE.json}` now reflect the new bytes; matching SHA is verified by `Archive.hs`. *Caveat surfaced in pass 3 (below): the prior snapshot was not committed at the moment of this refresh, so its bytes are no longer recoverable via `git log -S`.
  A pass-3 fix to `refresh` now refuses to replace an uncommitted prior, but the historical artifact survives — `previous-sha256` records a hash whose bytes this working tree cannot reproduce.*
- **The URL-changed error in `cmd_fetch`** now points at `archive.py refresh {slug}` instead of asking the author to delete the directory by hand.

Tag integration remains the one deferred refinement (it needs a Tags.hs design pass).

**Refinements pass 3 (2026-05-23).** A third audit surfaced gaps the pass-2 fixes didn't fully close:

- **`refresh` refuses to replace an uncommitted prior snapshot.** Pass 2 preserved a prior snapshot through *failed* re-fetches, but a *successful* one happily discarded uncommitted bytes — `previous-sha256` then pointed at a hash no `git log -S` could recover. Pass 3 shells out to `git ls-files` + `git diff --quiet HEAD` and refuses the refresh unless both the prior PROVENANCE.json and its artifact are tracked and clean.
- **`refresh` is atomic across *every* exit path.** Pass 2 handled the ordinary `cmd_fetch returns 0 but the artifact wasn't produced` case but not fatal `sys.exit`s (e.g. a `removed.yaml` conflict halting `cmd_fetch` mid-refresh) nor mid-refresh exceptions, and it never rolled back the `data/archive-index.json` rewrite. The work is now wrapped in `try/finally` that restores both the slug directory and the index on any exit path — normal failure, `SystemExit`, `KeyboardInterrupt`, or exception.
- **Removal enforcement now uses the same equivalence as link matching.** Pass 2 introduced `normalizeUrl` for incoming citations but compared removals as literal URL strings, so a tracking-laden manifest URL could bypass a takedown. Python gains `normalize_url` mirroring the Haskell helper, and `fetch` / `check` / `wayback` compare normalised forms. `cmd_fetch` additionally rejects two manifest entries whose canonical forms collide — that would otherwise route both under one slug.
- **`fetch_html` honours `X-Robots-Tag: noarchive` on the captured GET too.** Pass 1 added HEAD + ranged-GET probes, but a server can emit the header only on the full document response. The Python tool now downloads that response itself, checks its header and body directives, then passes those exact bytes to `monolith --base-url ... -` so the saved snapshot is not obtained through a second unobservable document request.
- **`nginx/archive.conf` is wired into the deploy template** and re-`include`s `security-headers.conf` inside its `location` block. `nginx/vhost.conf.example` now includes `archive.conf`; the snippet itself re-emits the baseline headers because nginx's `add_header` chain is inherited from a parent only when the current context declares *no* `add_header` directives — without the re-include, /archive/ would lose HSTS, CSP, etc.
- **Contract doc cleanups.** The Phase-5 paragraph claiming `robots.txt` disallows `/archive/` is reworded to acknowledge the pass-2 reversal; the Phase-1 checkbox claiming `Archive.hs` does not re-hash is updated to point at `verifyArtifactSha`; the pass-2 note about the refreshed djb snapshot now carries the caveat that its prior bytes were uncommitted and are therefore unrecoverable.

The historical `previous-sha256` value in `archive/djb-aes-speed/PROVENANCE.json` is left in place: it is a truthful record that *a* prior snapshot existed and what its hash was. It just is not recoverable from git in this working tree — the pass-3 `refresh` precondition exists so that property is never broken again.

**Refinements pass 4 (2026-05-23).** A fourth audit completed the failure-closed paths:

- **Direct Hakyll builds now enforce removals and missing-artifact failures.** `Archive.hs` reads `removed.yaml`, rejects normalized manifest conflicts and duplicate archive targets, and aborts if provenance exists without its artifact.
  `ArchiveIndex.hs` filters the generated index through the live manifest minus normalized removals, so a stale ignored index cannot retain archive affordances after a takedown when `archive.py` was skipped.
- **`refresh` verifies the prior bytes before replacing them.** A prior snapshot must now be present, tracked, clean, and match its recorded SHA-256 before its hash can be written into `previous-sha256`.
- **Failed refresh restores an originally-absent index state.** If `data/archive-index.json` did not exist before a failed refresh, any index created by the attempted fetch is deleted during rollback.

The genuinely-open questions that remain are collected at the end — the list is short.

---

## Motivation

The site cites external work — papers, articles, blog posts, documentation. Three things go wrong with a plain hyperlink over time:

1. **Link rot.** The target moves, paywalls, or vanishes. A 2019 essay's citations decay silently; nobody notices until a reader clicks.
2. **Content drift.** The target stays up but changes. The sentence you quoted is no longer the sentence at that URL.
3. **Opacity to the site's own machinery.** An external link is invisible to `Backlinks.hs` (`isPageLink` drops every `http(s)://` URL) and to `embed.py` (it indexes only `_site/**/*.html`). The site knows nothing about the things it most often points at. A paper cited by six essays has no page, no backlinks list, no place in any "Related" set.

The archive fixes all three by keeping a **local, hosted, immutable snapshot** of each referenced work, giving it a stable URL on this domain, and making that URL a first-class citizen of the existing backlinks and similar-pages systems.

This is deliberately *not* a general web crawler. It archives a curated set: the things this site references. The author adds a URL to a manifest; the build does the rest.

### Relationship to existing pieces

| Existing piece | What it does | Why the archive is different |
|----------------|--------------|------------------------------|
| `static/papers/` | Hosts Levi's **own** typeset PDFs (`preprint:`, `{{pdf:}}`) | The archive holds **third-party** works. Distinct directory, distinct purpose. Never conflate the two. |
| nginx `popup-proxy.conf` | Caches **metadata** (title/abstract) from arXiv / archive.org / PubMed for hover previews | Caches structured metadata, not documents. A preview accelerator, not preservation. |
| `Backlinks.hs` | Inverts **internal** links into a "who links here" map | Indexes site content only; external URLs are dropped. The archive makes referenced works internal enough to index. |
| `embed.py` / `SimilarLinks.hs` | Semantic "Related" block from `_site/**/*.html` embeddings | Only sees site pages. Archived works become site pages, so they enter the embedding corpus for free. |

---

## Goals

- **Preservation.** Every referenced work the author chooses to archive has a byte-for-byte local snapshot that survives the original going dark.
- **Stable hosting.** Each snapshot is reachable at a permanent `/archive/{slug}/` URL on levineuwirth.org, rendered in site chrome.
- **Hyperlink-able.** Archive URLs are ordinary internal links: usable in prose, wikilinks, citations, and `further-reading`.
- **Indexed.** Archived works appear in the **backlinks** ("Referenced by") and **similar-pages** ("Related") systems exactly as native content does — and, where the source structure allows, granularly by section.
- **Curated, low-friction.** Adding an archive is one line in one manifest. Everything else — fetch, text extraction, page generation, indexing — is automatic and build-time.
- **Static-friendly.** Every archive page renders at build time; JS is layered on, never required. Matches the rest of the site's contract.
- **Honest.** Archive pages never impersonate the original.
  They are framed as archived copies, link prominently to the source, are kept out of search engines, and carry a real, advertised removal channel on every page.
- **Safe by default.** No build step ever deletes or overwrites a committed artifact; destruction and replacement are always explicit, opt-in acts.

---

## Decisions (locked)

| Topic | Decision | Rationale |
|-------|----------|-----------|
| Trigger | Curated manifest, not auto-crawl | Archives what the site *references*, not the web. Legally and operationally sane. |
| Authored input | One hand-edited file: `archive/manifest.yaml` | One line per archived link. Mirrors `data/commonplace.yaml`'s authoring model. |
| Bibliography seeding | **Rejected** as auto-seeding. `make archive-suggest` prints a "cited but not archived" diff; the author copies lines by hand. | Keeps the manifest the *identity* of the archive, not a cache of the `.bib` files. |
| Per-entry provenance | `archive/{slug}/PROVENANCE.json`, committed — immutable for the current snapshot | An immutability claim that isn't in version control isn't immutable. |
| Mutable state | `data/archive-state.json`, gitignored — link-rot status only | Strict split: immutable facts committed, volatile status disposable. |
| Hakyll input | `data/archive-index.json` — `url` + aliases → slug, written by the tool | Minimal stable shape for the Haskell side; treated like `data/annotations.json`. |
| Missing-index behaviour | `Backlinks.hs` and `Filters/Archive.hs` silently no-op when `archive-index.json` is absent | Preserves the established `.venv`-gated silent-skip convention. The archive degrades to invisible, never to an error. |
| `fetch` idempotence | `fetch` is keyed on `(slug, url)` together; a slug whose recorded URL has changed is refused, not overwritten. `fetch` always rewrites `archive-index.json` to mirror the manifest. | A committed artifact is replaced only by an explicit `refresh`, never as a `fetch` side effect. |
| Artifact storage | `archive/{slug}/` at repo root, **committed to git** | A preservation guarantee that depends on an un-versioned store is weaker. Repo stays reproducible. |
| Per-artifact size cap | 25 MB; `archive.py fetch` warns and skips above it; `git add -f` to override deliberately | A 200 MB scan must never land in an auto-commit silently. |
| Storage migration | If `archive/` exceeds ~5 GB or doubles year-over-year, evaluate a separate archive repo / object store. **Never git LFS.** | LFS breaks `git clone → make build` reproducibility — a regression for a preservation system. |
| HTML snapshots | `monolith -j` → one self-contained HTML file; the pinned `monolith` binary is committed at `tools/bin/monolith` | Single static binary, no headless browser. Strips JS. Committing it (vs downloading) removes a network dependency and keeps the build reproducible from a bare clone. |
| PDF snapshots | Direct download via `requests` | Papers are usually clean PDF URLs (arXiv etc.). |
| Display — PDF | The raw `document.pdf` in an `