Archive
Design and implementation plan for the link-archiving system of levineuwirth.org.
This is the source of truth for how external references are preserved, hosted,
displayed, and indexed. It sits alongside WRITING.md, PHOTOGRAPHY.md,
HOMEPAGE.md, and MARKS.md as authoritative spec.
Status
Reviewed and ratified 2026-05-21, with revisions. The original draft was reviewed against the live site over three rounds; the decisions below incorporate every round of deltas and are now locked.
Phase 1 complete (2026-05-22). PDF entries: archive/manifest.yaml,
tools/archive.py (fetch + gc), build/Archive.hs, the four templates,
and the Makefile / head.html / .gitignore wiring are built and verified —
/archive/ and /archive/nist-fips-203/ render.
Phase 2 complete (2026-05-22). HTML snapshots: the pinned monolith
binary is vendored at tools/bin/monolith, archive.py fetch snapshots HTML
pages (CSP injected, text extracted, quality classified), and archive.html
renders them in a sandboxed iframe — /archive/djb-aes-speed/ renders. The
cross-browser CSP check and the per-snapshot review remain author-gated by
design.
Archive pages styled (2026-05-22). static/css/archive.css gives the
index and entry pages a framed treatment (banner callout, provenance panel,
artifact viewer); the PDF embed was changed to the raw document.pdf (browser-
native viewer), symmetric with HTML snapshots — see the Display — PDF decision.
Phase 3 complete (2026-05-22). Link annotation + Wayback: Filters/Archive.hs
appends an archive affordance to body links whose target is archived;
archive.py wayback (+ make archive-wayback) backfills Wayback captures;
visibility: private keeps an entry's artifact in-repo but undeployed.
Bibliography annotation is documented as a Citations.hs follow-up.
Phase 4 complete (2026-05-22). Backlinks + similar-pages: Backlinks.hs
keeps archived external links and canonicalises them to their /archive/<slug>/
page, so an archived work lists every essay that cites it under "Referenced by"
(grouped by the fragment each citation targets); archive.html also carries a
"Related" block from the embed.py similarity corpus, which now indexes archive
pages and excludes the /archive/ index.
Phase 5 complete (2026-05-22). Link-rot detection: tools/archive.py check
(+ make archive-check) HEAD/GET-probes every manifest URL and updates the
gitignored data/archive-state.json under asymmetric hysteresis (rotted
needs 3 fails over ≥14 days; a single success recovers immediately).
Filters.Archive flips a body link to the archive when its target is rotted;
each archive page surfaces its link status (provenance row, header note,
Pagefind status filter tag); /archive/ flags rotted entries; /build/
gains a "Link archive" telemetry section. The search-UI status filter wiring
in search-filters.js is deliberately partial — see the Phase 5 Met note.
All five phases done. Refinements next; see the Phase 5 Met note for the documented deferrals (search-UI status filter; bibliography annotation from Phase 3; pull-from-Wayback at fetch time).
Refinements (2026-05-22). A code-review pass found and fixed several correctness and posture issues across the system:
- Missing committed artifact no longer re-fetches silently. cmd_fetch used to skip its SHA guard when the artifact was absent and then download fresh bytes whose hash differed from the recorded sha256, replacing the recorded snapshot without surfacing it. The guard now also halts when PROVENANCE.json is present but the artifact is missing, requiring the author to restore the committed bytes before rebuilding.
- archive/removed.yaml is now enforced in fetch and check. It was only read by gc. A removed URL re-added to the manifest now halts cmd_fetch loudly; cmd_check skips removed URLs so the link-rot scanner does not keep probing a deliberate takedown.
- SHA verification closed the .venv-bypass hole. The original decision relied solely on archive.py fetch re-hashing, but that step is .venv-gated: a contributor or deploy host without .venv, or a direct cabal run site -- build, would publish a tampered artifact unchecked. build/Archive.hs now also re-hashes via sha256sum from loadArchiveEntries and halts the build on a mismatch, so the guarantee holds independent of the Python step.
- Raw artifacts are no longer publicly indexable. Pass 1 added a robots.txt Disallow: /archive/, which pass 2 then reverted (see below; it was counter-productive). Pass 1's other change, injecting <meta name=robots content="noindex, noarchive"> into every new HTML snapshot alongside the archive CSP, remains in place; the deploy-side header for raw PDFs landed in pass 2 as nginx/archive.conf.
- The documented archive.py refresh {slug} subcommand is implemented. It clears the slug's directory, re-fetches via cmd_fetch, and records the prior sha256 as previous-sha256 in the new PROVENANCE.json. The URL-changed error message in cmd_fetch now points at it instead of asking the author to delete the directory by hand.
- url_aliases widened to the design's full equivalent-URL set: tracking-parameter stripping (utm_*, fbclid, gclid, mc_*, ref, igshid, _hsenc, _hsmi, mkt_tok) and arXiv abs / pdf / versioned / .pdf form expansion. Phase 1 had deliberately kept these as a Phase 4 deferral, but Phase 4 missed the follow-through.
- X-Robots-Tag: noarchive is now honoured on both HEAD and GET. Some servers omit the header on HEAD but emit it on GET; HTML capture now aborts if either response carries the directive.
Three smaller items remain documented and deferred:
- Archive tags joining the site-wide tag indexes. manifest.yaml's tags: is authored but Tags.hs / Patterns.tagIndexable does not yet ingest archive entries; it needs a Tags.hs-side integration with its own design pass (archive pages aren't matched Hakyll items in the normal way).
- archive.py suggest (bibliography discovery: diff .bib URLs against the manifest) is documented but not implemented.
- The controlled-host end-to-end link-rot test (reserve archive-test.levineuwirth.org, run it through a 14-day-spanning fail streak, watch the flip happen) is inherently a multi-week real-world verification the author runs; the hysteresis logic is unit-tested deterministically and the rendering side is verified by a hand-crafted rotted state file.
Refinements pass 2 (2026-05-23). A second code-review pass surfaced correctness gaps the first pass missed:
- refresh is now atomic. It used to delete the slug directory and then call cmd_fetch; a failed re-fetch left the entry with no snapshot at all, while refresh returned 0 (because cmd_fetch reports per-entry skips, not a process failure). The slug directory is now renamed to a .refresh-backup sibling; success removes the backup, any failure restores it. Verified by hiding the monolith binary and confirming the prior snapshot survives intact.
- Invalid visibility values fail closed. The ManifestEntry parser used to accept any string and only treat the exact "private" as private; a typo like privte would publish a work the author intended to keep offline. The parser now rejects any value other than public or private, and readManifest halts the build on any parse error of a present file (instead of warning + returning an empty list; that silent-skip was for file absent, not file present but corrupt).
- Lookup-side URL normalisation. Alias generation alone cannot cover unbounded forms (arXiv versions, arbitrary tracking-parameter combinations). ArchiveIndex now normalises both index keys and lookup inputs through the same normalizeUrl (drop fragment, strip tracking, fold http→https, arXiv-canonicalise, trim trailing slash). Verified: https://cr.yp.to/aes-speed.html, https://cr.yp.to/aes-speed.html?utm_source=mail, and http://cr.yp.to/aes-speed.html/ all match the same archived entry.
- Raw-artifact indexing posture corrected. The Phase-5 robots.txt Disallow: /archive/ was counter-productive: a URL blocked by robots.txt can still appear in results when externally linked, and the Disallow also prevents compliant crawlers from reading the wrapper pages' <meta name=robots>. The Disallow is reverted; a new nginx/archive.conf snippet emits X-Robots-Tag: noindex, noarchive for the whole /archive/ tree, which crawlers honour for any resource (HTML and PDF alike). The deploy vhost should include snippets/archive.conf.
- cmd_wayback skips removed.yaml. The eviction procedure says record in removed.yaml before dropping the manifest line; fetch and check now honour that ordering, but wayback did not. A removed entry whose manifest line was still in place could be submitted to a third-party archive after a takedown was recorded.
- The shipped HTML snapshot was refreshed in the working tree so it carries the noarchive meta the Phase-5 inject promises. archive.py refresh djb-aes-speed re-fetched cr.yp.to, applied inject_archive_metas, and recorded the prior SHA as previous-sha256. archive/djb-aes-speed/{snapshot.html, PROVENANCE.json} now reflect the new bytes; matching SHA is verified by Archive.hs. Caveat surfaced in pass 3 (below): the prior snapshot was not committed at the moment of this refresh, so its bytes are no longer recoverable via git log -S. A pass-3 fix to refresh now refuses to replace an uncommitted prior, but the historical artifact survives: previous-sha256 records a hash whose bytes this working tree cannot reproduce.
- The URL-changed error in cmd_fetch now points at archive.py refresh {slug} instead of asking the author to delete the directory by hand.
Tag integration remains the one deferred refinement (it needs a Tags.hs design pass).
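The lookup-side normalisation listed in this pass can be sketched in Python. This is a minimal illustration, not the project's normalize_url or normalizeUrl: the five steps and the tracking-parameter list come from the notes above, while the helper name is_tracking, the arXiv path regular expression (modern numeric IDs only), and the choice of /abs/ as the canonical arXiv form are assumptions.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Tracking parameters named in the refinement notes.
TRACKING_EXACT = {"fbclid", "gclid", "ref", "igshid", "_hsenc", "_hsmi", "mkt_tok"}
TRACKING_PREFIXES = ("utm_", "mc_")

# Assumption: only modern arXiv IDs (YYMM.NNNNN) are canonicalised.
ARXIV_PATH = re.compile(r"^/(?:abs|pdf)/(\d{4}\.\d{4,5})(?:v\d+)?(?:\.pdf)?$")


def is_tracking(name: str) -> bool:
    """True for query parameters that carry no identity (hypothetical helper)."""
    return name in TRACKING_EXACT or name.startswith(TRACKING_PREFIXES)


def normalize_url(url: str) -> str:
    """Map equivalent spellings of a URL onto one comparison key."""
    parts = urlsplit(url)
    scheme = "https" if parts.scheme in ("http", "https") else parts.scheme
    host = parts.netloc.lower()
    path = parts.path
    if host == "arxiv.org":
        match = ARXIV_PATH.match(path)
        if match:
            path = "/abs/" + match.group(1)   # fold abs / pdf / versioned / .pdf
    path = path.rstrip("/")                   # trim trailing slash
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if not is_tracking(k)]            # other query parameters survive
    return urlunsplit((scheme, host, path, urlencode(kept), ""))  # fragment dropped
```

Applying the same function to index keys and to lookup inputs is what makes the match symmetric; the three cr.yp.to spellings quoted above all reduce to the same key under this sketch.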
Refinements pass 3 (2026-05-23). A third audit surfaced gaps the pass-2 fixes didn't fully close:
- refresh refuses to replace an uncommitted prior snapshot. Pass 2 preserved a prior snapshot through failed re-fetches, but a successful one happily discarded uncommitted bytes; previous-sha256 then pointed at a hash no git log -S could recover. Pass 3 shells out to git ls-files + git diff --quiet HEAD and refuses the refresh unless both the prior PROVENANCE.json and its artifact are tracked and clean.
- refresh is atomic across every exit path. Pass 2 handled the ordinary "cmd_fetch returns 0 but the artifact wasn't produced" case but not fatal sys.exits (e.g. a removed.yaml conflict halting cmd_fetch mid-refresh) nor mid-refresh exceptions, and it never rolled back the data/archive-index.json rewrite. The work is now wrapped in try/finally that restores both the slug directory and the index on any exit path: normal failure, SystemExit, KeyboardInterrupt, or exception.
- Removal enforcement now uses the same equivalence as link matching. Pass 2 introduced normalizeUrl for incoming citations but compared removals as literal URL strings, so a tracking-laden manifest URL could bypass a takedown. Python gains normalize_url mirroring the Haskell helper, and fetch / check / wayback compare normalised forms. cmd_fetch additionally rejects two manifest entries whose canonical forms collide; that would otherwise route both under one slug.
- fetch_html honours X-Robots-Tag: noarchive on the captured GET too. Pass 1 added HEAD + ranged-GET probes, but a server can emit the header only on the full document response. The Python tool now downloads that response itself, checks its header and body directives, then passes those exact bytes to monolith --base-url ... - so the saved snapshot is not obtained through a second unobservable document request.
- nginx/archive.conf is wired into the deploy template and re-includes security-headers.conf inside its location block. nginx/vhost.conf.example now includes archive.conf; the snippet itself re-emits the baseline headers because nginx's add_header chain is inherited from a parent only when the current context declares no add_header directives; without the re-include, /archive/ would lose HSTS, CSP, etc.
- Contract doc cleanups. The Phase-5 paragraph claiming robots.txt disallows /archive/ is reworded to acknowledge the pass-2 reversal; the Phase-1 checkbox claiming Archive.hs does not re-hash is updated to point at verifyArtifactSha; the pass-2 note about the refreshed djb snapshot now carries the caveat that its prior bytes were uncommitted and are therefore unrecoverable.
The historical previous-sha256 value in archive/djb-aes-speed/ PROVENANCE.json is left in place: it is a truthful record that a prior
snapshot existed and what its hash was. It just is not recoverable from
git in this working tree — the pass-3 refresh precondition exists so
that property is never broken again.
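The rename-then-restore pattern behind the atomic refresh can be sketched as follows. This is an illustration under stated assumptions, not the code in archive.py: the .refresh-backup suffix and the try/finally shape come from the notes above, while refresh_atomically and the refetch callback are hypothetical names, and the index rollback and the git cleanliness precondition are left out.

```python
import shutil
from pathlib import Path
from typing import Callable


def refresh_atomically(slug_dir: Path, refetch: Callable[[Path], None]) -> bool:
    """Replace slug_dir's snapshot via refetch, restoring the old one on any failure."""
    backup = slug_dir.with_name(slug_dir.name + ".refresh-backup")
    slug_dir.rename(backup)              # set the prior snapshot aside
    succeeded = False
    try:
        refetch(slug_dir)                # hypothetical callback; may raise or sys.exit
        # A fetch that returns normally but produced no files is still a failure.
        succeeded = slug_dir.is_dir() and any(slug_dir.iterdir())
        return succeeded
    finally:
        # Runs on normal return, SystemExit, KeyboardInterrupt, and exceptions.
        if succeeded:
            shutil.rmtree(backup)
        else:
            shutil.rmtree(slug_dir, ignore_errors=True)
            backup.rename(slug_dir)      # put the prior snapshot back
```

Because the restore lives in finally rather than in an except clause, a sys.exit raised deep inside the fetch cannot skip it.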
Refinements pass 4 (2026-05-23). A fourth audit completed the failure-closed paths:
- Direct Hakyll builds now enforce removals and missing-artifact failures. Archive.hs reads removed.yaml, rejects normalized manifest conflicts and duplicate archive targets, and aborts if provenance exists without its artifact. ArchiveIndex.hs filters the generated index through the live manifest minus normalized removals, so a stale ignored index cannot retain archive affordances after a takedown when archive.py was skipped.
- refresh verifies the prior bytes before replacing them. A prior snapshot must now be present, tracked, clean, and match its recorded SHA-256 before its hash can be written into previous-sha256.
- Failed refresh restores an originally-absent index state. If data/archive-index.json did not exist before a failed refresh, any index created by the attempted fetch is deleted during rollback.
The genuinely-open questions that remain are collected at the end — the list is short.
Motivation
The site cites external work — papers, articles, blog posts, documentation. Three things go wrong with a plain hyperlink over time:
- Link rot. The target moves, paywalls, or vanishes. A 2019 essay's citations decay silently; nobody notices until a reader clicks.
- Content drift. The target stays up but changes. The sentence you quoted is no longer the sentence at that URL.
- Opacity to the site's own machinery. An external link is invisible to Backlinks.hs (isPageLink drops every http(s):// URL) and to embed.py (it indexes only _site/**/*.html). The site knows nothing about the things it most often points at. A paper cited by six essays has no page, no backlinks list, no place in any "Related" set.
The archive fixes all three by keeping a local, hosted, immutable snapshot of each referenced work, giving it a stable URL on this domain, and making that URL a first-class citizen of the existing backlinks and similar-pages systems.
This is deliberately not a general web crawler. It archives a curated set: the things this site references. The author adds a URL to a manifest; the build does the rest.
Relationship to existing pieces
| Existing piece | What it does | Why the archive is different |
|---|---|---|
| static/papers/ | Hosts Levi's own typeset PDFs (preprint:, {{pdf:}}) | The archive holds third-party works. Distinct directory, distinct purpose. Never conflate the two. |
| nginx popup-proxy.conf | Caches metadata (title/abstract) from arXiv / archive.org / PubMed for hover previews | Caches structured metadata, not documents. A preview accelerator, not preservation. |
| Backlinks.hs | Inverts internal links into a "who links here" map | Indexes site content only; external URLs are dropped. The archive makes referenced works internal enough to index. |
| embed.py / SimilarLinks.hs | Semantic "Related" block from _site/**/*.html embeddings | Only sees site pages. Archived works become site pages, so they enter the embedding corpus for free. |
Goals
- Preservation. Every referenced work the author chooses to archive has a byte-for-byte local snapshot that survives the original going dark.
- Stable hosting. Each snapshot is reachable at a permanent /archive/{slug}/ URL on levineuwirth.org, rendered in site chrome.
- Hyperlink-able. Archive URLs are ordinary internal links: usable in prose, wikilinks, citations, and further-reading.
- Indexed. Archived works appear in the backlinks ("Referenced by") and similar-pages ("Related") systems exactly as native content does and, where the source structure allows, granularly by section.
- Curated, low-friction. Adding an archive is one line in one manifest. Everything else — fetch, text extraction, page generation, indexing — is automatic and build-time.
- Static-friendly. Every archive page renders at build time; JS is layered on, never required. Matches the rest of the site's contract.
- Honest. Archive pages never impersonate the original. They are framed as archived copies, link prominently to the source, are kept out of search engines, and carry a real, advertised removal channel on every page.
- Safe by default. No build step ever deletes or overwrites a committed artifact; destruction and replacement are always explicit, opt-in acts.
Decisions (locked)
| Topic | Decision | Rationale |
|---|---|---|
| Trigger | Curated manifest, not auto-crawl | Archives what the site references, not the web. Legally and operationally sane. |
| Authored input | One hand-edited file: archive/manifest.yaml | One line per archived link. Mirrors data/commonplace.yaml's authoring model. |
| Bibliography seeding | Rejected as auto-seeding. make archive-suggest prints a "cited but not archived" diff; the author copies lines by hand. | Keeps the manifest the identity of the archive, not a cache of the .bib files. |
| Per-entry provenance | archive/{slug}/PROVENANCE.json, committed; immutable for the current snapshot | An immutability claim that isn't in version control isn't immutable. |
| Mutable state | data/archive-state.json, gitignored; link-rot status only | Strict split: immutable facts committed, volatile status disposable. |
| Hakyll input | data/archive-index.json: url + aliases → slug, written by the tool | Minimal stable shape for the Haskell side; treated like data/annotations.json. |
| Missing-index behaviour | Backlinks.hs and Filters/Archive.hs silently no-op when archive-index.json is absent | Preserves the established .venv-gated silent-skip convention. The archive degrades to invisible, never to an error. |
| fetch idempotence | fetch is keyed on (slug, url) together; a slug whose recorded URL has changed is refused, not overwritten. fetch always rewrites archive-index.json to mirror the manifest. | A committed artifact is replaced only by an explicit refresh, never as a fetch side effect. |
| Artifact storage | archive/{slug}/ at repo root, committed to git | A preservation guarantee that depends on an un-versioned store is weaker. Repo stays reproducible. |
| Per-artifact size cap | 25 MB; archive.py fetch warns and skips above it; git add -f to override deliberately | A 200 MB scan must never land in an auto-commit silently. |
| Storage migration | If archive/ exceeds ~5 GB or doubles year-over-year, evaluate a separate archive repo / object store. Never git LFS. | LFS breaks git clone → make build reproducibility, a regression for a preservation system. |
| HTML snapshots | monolith -j → one self-contained HTML file; the pinned monolith binary is committed at tools/bin/monolith | Single static binary, no headless browser. Strips JS. Committing it (vs downloading) removes a network dependency and keeps the build reproducible from a bare clone. |
| PDF snapshots | Direct download via requests | Papers are usually clean PDF URLs (arXiv etc.). |
| Display — PDF | The raw document.pdf in an <iframe>; the browser's native PDF viewer renders it | A hyperlinked archive should display the document exactly as it is. Symmetric with the HTML snapshot (both embed the raw artifact); no PDF.js wrapper. static/pdfjs/ stays vendored for the site's own {{pdf:}} embeds. |
| Display — HTML | Snapshot in a sandboxed <iframe> (referrerpolicy="no-referrer", no allow-scripts) + CSP <meta> baked into the snapshot + extracted text in the wrapper | Sandbox isolates markup; CSP is defense-in-depth; no-referrer stops leaking the reading path; extracted text feeds indexing. |
| Snapshot quality | Recorded per entry (ok / degraded / js-required); degraded snapshots flagged on /archive/ and /build/ | monolith fails quietly on lazy-loaded images and SPAs; silent degradation is the enemy. |
| Index thumbnails | Dropped for v1. /archive/ is a text list. | At v1 scale a text list is faster to scan and to build than a thumbnail grid; revisit past ~50 entries (it is a deferred capability, not a rejected one). |
| Second archive | Submit every URL to the Wayback Machine, non-blocking; record the URL when it returns, backfill via make archive-wayback | Belt-and-suspenders, never on the critical path of a build. |
| URL scheme | /archive/{slug}/ | Permanent, human-readable, internal. |
| URL matching | archive-index.json carries each entry's equivalent-URL aliases; only tracking parameters are stripped, other query parameters preserved; backlinks match any alias | Without it, "Referenced by" silently under-counts; blanket query stripping would over-match. |
| Homepage portal | No | Infrastructure, not a content section. Reachable from /archive/, /colophon, footer. |
| Search engines | noindex on every archive page | Preserving, not republishing or competing with originals. |
| robots.txt | Not gated: a curated single-shot fetch of an already-cited URL is not crawling. But honour X-Robots-Tag: noarchive and <meta name="robots" content="noarchive">; skip anything behind authentication. | Matches Save-Page-Now / reference-manager norms. The load-bearing ethic is the removal channel, not robots.txt. |
| Removal channel | A request to ln@levineuwirth.org is honoured; advertised on /archive/, on every archive page, and in the fetcher's User-Agent string | This is the real ethical commitment robots.txt only proxies for. |
| Pagefind | Archived full text is indexed, tagged by type: archive and by link-rot status | Searching everything you've cited is a feature; the tags let results be filtered or excluded. |
| Visibility levels | public (default) / private | private keeps the artifact in-repo but undeployed, for content not safe to redistribute. |
| Paywalled originals | A manual paywalled: true manifest flag, not an automated scanner state. Soft paywalls return 200 and cannot be reliably detected. | Drives a banner note only, never a link flip. |
| Eviction | Opt-in make archive-gc, never part of make build. Procedure: record in removed.yaml first, then drop the manifest line, then GC. GC deletes only slugs listed in removed.yaml. | A rename, branch-switch, or typo'd manifest edit must not silently eat committed artifacts. |
| Snapshot mutability | Immutable for the current snapshot; archive.py refresh deliberately replaces it | A stable citation target must not move under readers, except by an explicit act. |
| Rot hysteresis | Asymmetric: rotted requires 3 consecutive failed scans over ≥ 14 days; one failure is error. Recovery is immediate: a single success → live. | A transient failure must not flip a live citation; a recovered original should be reached eagerly, so un-rotting needs no delay. |
| SHA verification | Both archive.py fetch and build/Archive.hs re-hash every committed artifact against PROVENANCE.json and halt non-zero on a mismatch. archive.py runs first in make build; Archive.hs shells out to sha256sum from loadArchiveEntries, so the integrity guarantee holds even when archive.py did not run (no .venv, a direct cabal run site -- build, or a deploy host that bypasses make build). | The original "Python tool is the sufficient enforcement point" assumption was unsafe: the Python step is .venv-gated, and a contributor or deploy without it could publish a tampered artifact unchecked. Two enforcement points cost a sha256sum call per entry and close the hole. |
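The noarchive rule in the robots.txt row can be illustrated with a short check over one response's headers and body. This is a sketch under assumptions, not the logic in fetch_html: forbids_archiving and the parser class are hypothetical names, and user-agent-scoped header forms such as "googlebot: noarchive" are not handled.

```python
from html.parser import HTMLParser


class _RobotsMeta(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() == "robots":
            tokens = attr.get("content", "").split(",")
            self.directives.update(t.strip().lower() for t in tokens)


def forbids_archiving(headers: dict, body_html: str) -> bool:
    """True when either the X-Robots-Tag header or a robots meta says noarchive."""
    header_tokens = set()
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":      # header names are case-insensitive
            header_tokens.update(t.strip().lower() for t in value.split(","))
    parser = _RobotsMeta()
    parser.feed(body_html)
    return "noarchive" in header_tokens or "noarchive" in parser.directives
```

Running the check against the same response bytes that are later saved keeps the decision and the snapshot tied to one observable request.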
Content model & directory structure
archive/
├── manifest.yaml # AUTHORED — the curated list of links
├── removed.yaml # AUTHORED — record of evicted entries
├── arxiv-2403-12345/
│ ├── document.pdf # the snapshot (committed)
│ ├── PROVENANCE.json # immutable archival facts (committed)
│ ├── document.txt # extracted text (gitignored, regenerated)
│ └── document.txt.sha256 # artifact SHA the .txt was built from (gitignored)
├── gwern-net-scaling-hypothesis/
│ ├── snapshot.html # self-contained monolith snapshot (committed)
│ ├── PROVENANCE.json # immutable archival facts (committed)
│ ├── snapshot.txt # extracted readable text (gitignored)
│ └── snapshot.txt.sha256 # artifact SHA the .txt was built from (gitignored)
└── ...
- archive/ is a top-level directory, sibling to content/, static/, and data/, not under content/. Files in content/ are author-written Markdown processed by Pandoc; archive/ holds raw third-party artifacts plus the manifest and provenance.
- One directory per entry, keyed by slug.
- Committed: the artifact (document.pdf / snapshot.html), which is the preservation payload, and PROVENANCE.json, the immutable record of the archival event.
- Gitignored: the regenerable extracted text (*.txt) and its staleness stamp (*.txt.sha256). Both are deterministic from the committed artifact, so committing them is pure churn. This mirrors the photography sidecar and *.webp companion rules already in .gitignore.
- make build's auto-commit stages content/ only. Changes under archive/ (new artifacts, PROVENANCE.json, manifest edits) are committed deliberately by the author. This is a feature, not a gap: it is the eyeball-before-commit checkpoint where a degraded snapshot gets caught.
Authored input — archive/manifest.yaml
The only file the author edits for normal operation. Adding an archive =
adding one list item. Minimum is a bare url:; everything else is optional or
auto-derived.
# archive/manifest.yaml — curated list of works to preserve.
# Edited by hand. Tools never write to this file.
# Per-artifact cap: 25 MB. Above that, archive.py warns and skips the fetch;
# commit an oversize artifact deliberately with `git add -f`.
# To evict an entry, see archive/removed.yaml — record there FIRST, then
# delete the line here, then run `make archive-gc`.
- url: "https://arxiv.org/abs/2403.12345"
  # slug: auto-derived → arxiv-2403-12345 (override only to disambiguate)
  # title: auto-derived from the artifact / popup-proxy metadata
  # type: auto-detected (pdf | html)
  tags: [research/ml]   # optional; same slash-hierarchy as content
  note: >               # optional; why this is referenced
    Cited in the scaling-laws essay; section 4 is the load-bearing part.
- url: "https://www.gwern.net/Scaling-hypothesis"
  type: html            # optional override when detection is wrong
  visibility: public    # public (default) | private
- url: "https://example.com/paywalled-report"
  paywalled: true       # author-set; the original sits behind a paywall
  visibility: private   # archived for the author; artifact not deployed
| Field | Required | Notes |
|---|---|---|
| url | yes | The original URL. The identity of the entry. |
| slug | no | Override the auto-derived slug. Must be unique. |
| title | no | Override the auto-derived title. |
| type | no | pdf \| html. Auto-detected from Content-Type / extension. |
| tags | no | Slash-hierarchy tags (Tags.hs). Place the work on tag indexes. |
| note | no | Author's reason for archiving; shown on the archive page. |
| visibility | no | public (default) or private. |
| paywalled | no | Author-set flag: the original is gated. Declared, not inferred: no reliable automated detection exists. Drives a banner note only. |
| source-date | no | Publication date of the original, if known. |
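The fail-closed handling of these fields can be sketched as a small validator. The project's actual parser is the Haskell ManifestEntry parser; the Python below is only an illustration, and validate_entry and ManifestError are hypothetical names. The allowed values are the ones listed in the table.

```python
ALLOWED_VISIBILITY = {"public", "private"}
ALLOWED_TYPES = {"pdf", "html"}


class ManifestError(ValueError):
    """Raised for any entry that cannot be interpreted unambiguously."""


def validate_entry(entry: dict) -> dict:
    """Return the entry with defaults filled in, or raise instead of guessing."""
    url = entry.get("url")
    if not isinstance(url, str) or not url.strip():
        raise ManifestError("entry is missing the required url field")

    visibility = entry.get("visibility", "public")
    if not isinstance(visibility, str) or visibility not in ALLOWED_VISIBILITY:
        # A typo must never fall through to "public".
        raise ManifestError(f"{url}: invalid visibility {visibility!r}")

    kind = entry.get("type")
    if kind is not None and kind not in ALLOWED_TYPES:
        raise ManifestError(f"{url}: invalid type {kind!r}")

    paywalled = entry.get("paywalled", False)
    if not isinstance(paywalled, bool):
        raise ManifestError(f"{url}: paywalled must be true or false")

    return {**entry, "visibility": visibility, "paywalled": paywalled}
```

Rejecting an unknown visibility outright, instead of treating anything other than the exact string private as public, is the property that keeps a misspelled private entry from being deployed.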
Per-entry provenance — archive/{slug}/PROVENANCE.json
Committed alongside the artifact. Written by tools/archive.py fetch and then
stable for the lifetime of that snapshot — wayback is the one field backfilled
later (by make archive-wayback).
"Immutable" means immutable for the current snapshot, not forever.
archive.py refresh deliberately re-snapshots an entry and replaces both
the artifact and its PROVENANCE.json (new sha256, new archived date),
moving the old sha256 into previous-sha256. A refresh is a conscious act;
absent one, the file does not change.
tools/archive.py fetch re-hashes every present artifact against the recorded
sha256 on every run, before the Hakyll build, and exits non-zero on a
mismatch, halting make build. build/Archive.hs repeats the check: the Haskell
toolchain carries no SHA-256 library, so it shells out to sha256sum from
loadArchiveEntries and halts the build on a mismatch, which keeps the
guarantee in force when archive.py did not run (see the SHA verification
decision). Archive.hs also aborts when provenance exists without its artifact.
PROVENANCE.json holds the facts that make the archival claim verifiable:
{
  "url": "https://arxiv.org/abs/2403.12345",
  "slug": "arxiv-2403-12345",
  "title": "Scaling Laws for Neural Language Models",
  "type": "pdf",
  "artifact": "document.pdf",
  "sha256": "9f86d0818884...",
  "previous-sha256": null,
  "bytes": 2317004,
  "archived": "2026-05-21",
  "source-date": "2024-03-15",
  "snapshot-quality": "ok",
  "wayback": "https://web.archive.org/web/20260521.../https://arxiv.org/abs/2403.12345"
}
previous-sha256 is null on first fetch and set by refresh to the
immediately-prior snapshot's hash, so the last prior snapshot is reachable
(via git log -S) without deeper archaeology. PROVENANCE.json lives with
the artifact, not in a rolling global file, so the immutable claim is
genuinely immutable in git history.
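The integrity check against the recorded sha256 can be sketched with the standard library. This is a minimal illustration rather than the code in archive.py: verify_entry, sha256_of, and IntegrityError are hypothetical names, and only the artifact and sha256 fields shown in the example above are assumed.

```python
import hashlib
import json
from pathlib import Path


class IntegrityError(RuntimeError):
    """Raised when a committed artifact no longer matches its provenance."""


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large PDF is never read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_entry(entry_dir: Path) -> None:
    """Raise unless the artifact named in PROVENANCE.json matches its recorded hash."""
    provenance_path = entry_dir / "PROVENANCE.json"
    if not provenance_path.exists():
        return                                   # nothing archived yet for this slug
    provenance = json.loads(provenance_path.read_text(encoding="utf-8"))
    artifact = entry_dir / provenance["artifact"]
    if not artifact.exists():
        # Provenance without bytes: halt instead of silently re-fetching.
        raise IntegrityError(f"{artifact} is missing; restore the committed bytes")
    actual = sha256_of(artifact)
    if actual != provenance["sha256"]:
        raise IntegrityError(f"{artifact}: expected {provenance['sha256']}, got {actual}")
```

A caller would run this over every archive/{slug}/ directory and exit non-zero on the first IntegrityError, which is what halts the build.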
Mutable state — data/archive-state.json
Written only by tools/archive.py check. Holds the volatile link-rot
status, keyed by URL. Gitignored (data/ generated files already are); a fresh
clone simply rebuilds it on the next scan. Until a scan has run, every entry
renders as the safe default (live, no link flip).
{
  "https://arxiv.org/abs/2403.12345": {
    "status": "live",
    "checked": "2026-05-21",
    "consecutive-failures": 0,
    "status-since": "2026-05-21"
  }
}
status ∈ live / moved / rotted / error — set by the scanner.
(paywalled is not here: it is a manual manifest flag, not a scanner state.)
consecutive-failures + status-since implement the rot hysteresis (Phase 5).
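The asymmetric hysteresis can be sketched as a pure state-update function over one record of this file. This is an illustration, not the scanner's code: update_status is a hypothetical name, the moved status is not modelled, and it assumes that while an entry is in error, status-since marks the first failure of the current streak, so "3 fails over ≥14 days" is read as at least three consecutive failures whose streak spans at least 14 days.

```python
from datetime import date

ROT_MIN_FAILURES = 3   # consecutive failed scans required
ROT_MIN_DAYS = 14      # minimum span of the failure streak, in days


def update_status(entry: dict, probe_ok: bool, today: date) -> dict:
    """Return the next state record for one URL after a single probe."""
    prev_status = entry.get("status", "live")
    since = entry.get("status-since", today.isoformat())

    if probe_ok:
        status, failures = "live", 0             # recovery is immediate
    else:
        failures = entry.get("consecutive-failures", 0) + 1
        if prev_status == "rotted":
            status = "rotted"                     # stays rotted until a success
        else:
            # Assumption: in "error", status-since is the start of the streak.
            streak_start = date.fromisoformat(since) if prev_status == "error" else today
            spans_enough = (today - streak_start).days >= ROT_MIN_DAYS
            status = "rotted" if failures >= ROT_MIN_FAILURES and spans_enough else "error"

    if status != prev_status:
        since = today.isoformat()                 # record when the status changed
    return {
        "status": status,
        "checked": today.isoformat(),
        "consecutive-failures": failures,
        "status-since": since,
    }
```

Under this reading, three failures inside a single week leave the entry in error; it only becomes rotted once a later failed scan falls 14 or more days after the first.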
Hakyll input — data/archive-index.json
A small map written by tools/archive.py fetch, consumed inside the Hakyll
build by Backlinks.hs and the link-annotation filter. fetch always
rewrites this file to mirror the current manifest exactly — whether or not any
network I/O occurred — so an entry un-listed from the manifest (even without a
GC) immediately stops being treated as archived, and Backlinks.hs never keeps
writing backlinks toward a slug whose page no longer exists. The index is cheap
to recompute (manifest + provenance, no network) and must never lag the
manifest. Kept separate from archive-state.json so the Haskell side loads a
minimal, stable shape; treated exactly like the existing data/annotations.json
build input.
{
  "https://arxiv.org/abs/2403.12345": {
    "slug": "arxiv-2403-12345",
    "type": "pdf",
    "title": "Scaling Laws for Neural Language Models",
    "aliases": [
      "http://arxiv.org/abs/2403.12345",
      "https://arxiv.org/abs/2403.12345v1",
      "https://arxiv.org/abs/2403.12345v2",
      "https://arxiv.org/pdf/2403.12345",
      "https://arxiv.org/pdf/2403.12345.pdf"
    ]
  }
}
aliases is the equivalent-URL set (see URL matching, under Backlinks). The
Haskell side flattens it into an alias → entry lookup on load.
When archive-index.json is absent — .venv not set up, or archive.py
has never run — it is treated as empty: Backlinks.hs and Filters/Archive.hs
silently no-op, and the build succeeds unchanged. This is the same
.venv-gated silent-skip convention used by embed.py and the photography
extractors. (This exact phrasing recurs below; it is the canonical statement of
the property.)
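The flattening step and the absent-file behaviour can be shown together. The real consumer is the Haskell side; the Python below is only a sketch of the same idea, with load_archive_lookup as a hypothetical name and the index shape taken from the example above.

```python
import json
from pathlib import Path


def load_archive_lookup(index_path: Path) -> dict:
    """Build an alias -> entry map; an absent index yields an empty map."""
    if not index_path.exists():
        return {}                      # silent no-op: nothing is treated as archived
    index = json.loads(index_path.read_text(encoding="utf-8"))
    lookup = {}
    for url, entry in index.items():
        # The canonical URL and every alias resolve to the same entry.
        for key in [url, *entry.get("aliases", [])]:
            lookup[key] = entry
    return lookup
```

A link filter built on this map needs no special case for a missing index: every lookup simply misses, and the link is left as written.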
Eviction & removal
Removing an archived work is a first-class, supported operation — a takedown request, an author request, a legal concern, or a quality cull will arrive, and probably before the system is mature. The cardinal rule: no build step ever deletes a committed artifact. Deletion is opt-in and explicit.
Procedure (documented in the manifest.yaml header comment), in order:
1. Record the removal in archive/removed.yaml first, before touching the manifest:

       - url: "https://example.com/withdrawn-article"
         slug: example-com-withdrawn-article
         removed: 2026-06-01
         reason: takedown   # takedown | author-request | legal | quality
         note: "DMCA from X; see archived email."

   | Field | Required | Notes |
   |---|---|---|
   | url | yes | The original URL (matches the manifest URL at time of removal) |
   | slug | yes | The slug whose archive/{slug}/ directory make archive-gc is authorized to delete |
   | removed | yes | ISO date of removal |
   | reason | yes | Closed enum: takedown \| author-request \| legal \| quality |
   | note | no | Free-text context |

2. Delete the entry's line from manifest.yaml.
3. Run make archive-gc (opt-in; never invoked by make build). It deletes only archive/{slug}/ directories whose slug is recorded in removed.yaml. A directory orphaned by a rename, a branch switch, or a typo'd manifest edit (i.e. not in removed.yaml) is never deleted; it is reported to stderr with its slug and a one-line hint, and gc exits non-zero while any orphan is present (--ignore-orphans suppresses the non-zero exit once the author has consciously reviewed them). The author commits the deletion.
An orphaned archive/{slug}/ directory (manifest line gone, not yet GC'd) is
inert in the meantime: Archive.hs generates pages and routes artifacts only
for current manifest.yaml entries, so an orphan produces no page and is not
deployed.
removed.yaml is not a hostile-tracking list. It exists so that (a)
make archive-gc knows exactly what is safe to delete, (b) re-adding a removed
URL to the manifest is surfaced loudly at build time, (c) the link-rot scanner
skips removed entries instead of probing them forever, and (d) make archive-suggest never re-suggests a deliberately-removed work. A removed URL
still cited from a site page falls back to the original-only link: no archive
affordance, no backlink canonicalization.
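The gc decision reduces to set arithmetic over three slug lists. The sketch below is a minimal illustration, not the tool's code: plan_gc and its parameters are hypothetical names, and leaving alone a directory whose slug is in both removed.yaml and the current manifest is a conservative assumption the spec does not state.

```python
def plan_gc(on_disk, manifest_slugs, removed_slugs, ignore_orphans=False):
    """Decide what gc may delete. Returns (to_delete, orphans, exit_code)."""
    on_disk = set(on_disk)
    live = set(manifest_slugs)
    removed = set(removed_slugs)
    # Only directories authorised by removed.yaml are deletable. A slug that is
    # still in the manifest is left alone (assumption, see lead-in).
    to_delete = sorted((on_disk & removed) - live)
    # Orphans: on disk, but in neither the manifest nor removed.yaml.
    orphans = sorted(on_disk - live - removed)
    exit_code = 1 if orphans and not ignore_orphans else 0
    return to_delete, orphans, exit_code
```

Keeping the plan separate from the deletion itself means the orphan report and the exit code can be checked without touching the filesystem.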
Routing & generated pages
| URL | Source | Notes |
|---|---|---|
| /archive/ | Generated from manifest.yaml | Index of all archived works; text list, filter by type, tag, status |
| /archive/{slug}/ | Generated per manifest entry | The archive page: wrapper chrome + embedded snapshot |
| /archive/{slug}/document.pdf | archive/{slug}/document.pdf | Raw artifact, copied through unchanged |
| /archive/{slug}/snapshot.html | archive/{slug}/snapshot.html | Raw HTML snapshot, copied through unchanged |
| /archive/{tag}/ | Existing Tags.hs | Archive entries with tags join the normal tag indexes |
PROVENANCE.json is build input, not a routed page — it is consumed by
Archive.hs, not served (the archive page surfaces the relevant fields).
Slugs are auto-derived as {domain-stem}-{path-slug}, truncated, with a short
hash appended on collision (arxiv-2403-12345, gwern-net-scaling-hypothesis).
slug: in the manifest overrides.
/archive/ is not a homepage portal — it is infrastructure. It is reachable
from /colophon (where the site explains its own machinery), from the footer's
infrastructure links, and optionally as a shelf on /library.html. The
/archive/ page also carries the removal-request notice.
The archive page
/archive/{slug}/ is a wrapper: site chrome around a preserved artifact.
Top to bottom:
- Archive banner. An unmissable strip: "Archived copy — snapshot taken 2026-05-21. View the original ↗". The original URL is the most prominent link on the page. The page never pretends to be the source.
- Metadata block. Title, original URL, archive date, source publication date, content hash (short form), file size, snapshot quality, the author's note, the Wayback Machine link, and current link-rot status.
- The artifact.
  - PDF: the raw document.pdf embedded in an <iframe>, rendered by the browser's native PDF viewer. Deliberately not the site's PDF.js viewer: a hyperlinked archive should display the document as it is.
  - HTML: the monolith snapshot loaded in a sandboxed <iframe>, with a sandbox attribute that omits allow-scripts (JS already stripped at fetch time) and referrerpolicy="no-referrer" (so a click inside the snapshot does not leak levineuwirth.org/archive/..., and with it which essay the reader came from, to the original site). The snapshot file itself carries a restrictive Content-Security-Policy <meta> tag, injected at fetch time, as defense-in-depth (see Fetch pipeline).
- Full text. The extracted readable text (document.txt / snapshot.txt) rendered into the DOM: collapsed in a <details> for PDFs, inline for HTML. This block is the load-bearing one for indexing: embed.py and Pagefind see text, not an opaque iframe. It also gives readers a fast, styled, dark-mode reading path that does not depend on the original's markup.
- Referenced by. The backlinks list: every site page that cites this work. (See Backlinks integration.)
- Related. The similar-pages list: semantically near content, site pages and other archives alike. (See Similar-pages integration.)
A removal-request line — the partials/archive-removal-notice.html partial,
carrying ln@levineuwirth.org — is included on every archive page and on
/archive/. It is its own partial, included directly by archive.html and
archive-index.html; the site-wide page-footer.html is not touched.
The page carries <meta name="robots" content="noindex">. The head.html
partial currently has no robots hook; adding a noindex context flag is part
of Phase 1.
Fetch & snapshot pipeline
tools/archive.py — a Python tool, gated on .venv, silent-skip when absent,
matching the established embed.py / extract-exif.py pattern. Subcommands:
- archive.py fetch: for every manifest URL without an artifact, download it, detect the type, store it, extract text, and write PROVENANCE.json. Always rewrites archive-index.json to mirror the manifest (see below). Records wayback: null (filled in later). Incremental: only URLs without an artifact incur network I/O.
- archive.py wayback: submit URLs whose PROVENANCE.json has wayback: null to the Wayback Machine; backfill the returned URL. (make archive-wayback)
- archive.py check: the link-rot scan. (make archive-check, Phase 5)
- archive.py suggest: scan data/*.bib for url and doi fields; a DOI-only entry is resolved to its https://doi.org/{doi} form. Prints a diff of works cited but not yet in manifest.yaml, excluding any URL already in archive/removed.yaml, so a deliberately-removed work is never re-suggested. (make archive-suggest)
- archive.py gc: delete archive/{slug}/ directories whose slug is recorded in removed.yaml. Orphan directories (not in manifest.yaml, not in removed.yaml) are never deleted: each is reported to stderr with its slug and a one-line hint, and gc exits non-zero while any orphan is present (--ignore-orphans to override). (make archive-gc)
- archive.py refresh {slug}: deliberately re-snapshot one entry, replacing both the artifact and its PROVENANCE.json; the prior sha256 is written to previous-sha256 and printed.
fetch is keyed on (slug, url) together
If a slug's directory already exists and its PROVENANCE.json records a
different URL than the manifest now gives — the author edited a URL but kept
the slug — fetch refuses to overwrite the committed artifact. It prints
URL changed for {slug}: run 'archive.py refresh {slug}' to re-snapshot and
leaves the entry untouched. Overwriting a committed artifact is always an
explicit act (refresh), never a side effect of fetch — the same principle
as GC requiring removed.yaml.
Regardless of whether any artifact was fetched, fetch finishes by rewriting
data/archive-index.json from the current manifest + provenance, so the index
can never lag a manifest edit.
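The (slug, url) keying amounts to a three-way decision per manifest entry. A minimal sketch follows, assuming PROVENANCE.json stores the fetched URL under a "url" key and that the presence of PROVENANCE.json stands in for "an artifact exists"; both are assumptions, and fetch_action is a hypothetical name.

```python
import json
from pathlib import Path


def fetch_action(slug_dir, manifest_url):
    """Return 'fetch', 'skip', or 'refuse' for one manifest entry."""
    provenance = Path(slug_dir) / "PROVENANCE.json"
    if not provenance.exists():
        # No artifact yet: the only case that incurs network I/O.
        return "fetch"
    recorded = json.loads(provenance.read_text(encoding="utf-8")).get("url")
    if recorded != manifest_url:
        # URL edited under an existing slug: never overwrite, require refresh.
        return "refuse"
    return "skip"
```

On 'refuse' the caller would print the URL-changed message and leave the directory untouched; only refresh is allowed to replace a committed artifact.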
PDF
Direct download via requests, with a per-request timeout and the size cap
(25 MB; warn + skip above). User-Agent:
levineuwirth.org/archive (ln@levineuwirth.org; removal requests honored).
Stored as document.pdf; text extracted with pdftotext.
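One way to enforce the size cap without buffering an oversize response is to count bytes as chunks arrive. The helper below is a sketch under stated assumptions: read_capped and TooLarge are hypothetical names, and reading "25 MB" as 25 × 1024 × 1024 bytes is an assumption.

```python
MAX_BYTES = 25 * 1024 * 1024  # the 25 MB cap; binary megabytes are an assumption


class TooLarge(Exception):
    """Raised when a download grows past the cap."""


def read_capped(chunks, cap=MAX_BYTES):
    """Concatenate byte chunks, raising TooLarge once the total exceeds cap."""
    buf = bytearray()
    for chunk in chunks:
        buf.extend(chunk)
        if len(buf) > cap:
            # Stop reading immediately; the caller warns and skips the entry.
            raise TooLarge(f"artifact exceeds {cap} bytes")
    return bytes(buf)
```

With requests, the chunks could come from a streamed response, for example requests.get(url, stream=True, timeout=30).iter_content(chunk_size=65536), where the timeout and chunk size are illustrative values rather than the tool's settings.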
HTML
monolith -j {url} produces a single self-contained HTML file: CSS, images,
and fonts inlined as data URIs, JavaScript stripped (-j).
monolith is a single static Rust binary — no headless browser. Unlike Leaflet
and PDF.js (servable assets fetched at build time and gitignored), monolith is
a build-time executable: the pinned linux-x86_64 binary is committed at
tools/bin/monolith, with its version and sha256 recorded in
tools/monolith-version.txt. Committing it removes a network dependency from
make build and keeps the archive pipeline reproducible from a bare clone.
(If the build host ever changes architecture, re-vendor the matching binary.)
After capture, archive.py injects a CSP <meta> into the snapshot's <head>:
<meta http-equiv="Content-Security-Policy"
content="default-src 'none'; img-src data:;
style-src 'unsafe-inline'; style-src-elem 'unsafe-inline';
style-src-attr 'unsafe-inline'; font-src data:;
script-src 'none'; object-src 'none'; frame-src 'none'">
monolith inlines images and fonts as data URIs, and inlines styles both as
<style> elements and as inline style="" attributes — so style-src-elem
and style-src-attr are spelled out alongside style-src to cover both in
browsers that honour the granular directives. script-src 'none' /
object-src 'none' / frame-src 'none' are explicit because monolith inlines
SVGs as data: images, and an SVG can carry a <script> block — the iframe
sandbox already blocks execution, but a belt-and-suspenders claim should not
rely on the sandbox alone. This CSP permits everything a correct snapshot needs
and blocks every network fetch and script a broken or malicious snapshot might
attempt. Correct rendering under this CSP is verified cross-browser as a
Phase 2 exit criterion. (An nginx location ^~ /archive/ block may add the
header at the HTTP level too; the baked-in <meta> is what makes make dev's
plain server safe.)
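The injection step can be sketched with a regular expression over the snapshot text. This is an illustration, not archive.py's code: inject_csp is a hypothetical name, and placing the tag first inside <head> is a choice made here because a policy delivered by <meta> does not apply to content that precedes it.

```python
import re

CSP = (
    "default-src 'none'; img-src data:; "
    "style-src 'unsafe-inline'; style-src-elem 'unsafe-inline'; "
    "style-src-attr 'unsafe-inline'; font-src data:; "
    "script-src 'none'; object-src 'none'; frame-src 'none'"
)
CSP_META = f'<meta http-equiv="Content-Security-Policy" content="{CSP}">'

# \b keeps this from matching <header>.
_HEAD_OPEN = re.compile(r"<head\b[^>]*>", re.IGNORECASE)


def inject_csp(html):
    """Insert the CSP <meta> directly after the opening <head> tag."""
    match = _HEAD_OPEN.search(html)
    if match is None:
        raise ValueError("snapshot has no <head> element")
    return html[: match.end()] + CSP_META + html[match.end():]
```

A snapshot with no <head> raises rather than being written without a policy, which fits the fail-closed posture of the rest of the pipeline.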
monolith failure modes — capture is not always faithful, and fails
quietly. Known cases: lazy-loaded images using data-src (common on Substack,
Medium, modern blogs) are not resolved — the snapshot looks complete but is
missing images; soft-paywalled pages (Medium, NYT) often serve full article
HTML to the fetch and gate it with a client-side overlay, so -j yields a
snapshot that looks like unauthorized access (it is not — the server sent it
— but the optics are poor); <picture>/srcset sources are inconsistently
inlined. archive.py therefore classifies each capture and records
snapshot-quality ∈ ok / degraded / js-required in PROVENANCE.json;
degraded captures are flagged on /archive/ and /build/. The author reviews
the rendered snapshot before committing archive/ (Phase 2 exit criterion). A
headless-browser fallback for js-required pages is deferred — see Open
questions.
Wayback Machine — non-blocking
Wayback submission is never on the critical path of a build. archive.py fetch records wayback: null and moves on. make archive-wayback runs
separately, POSTs the outstanding URLs to https://web.archive.org/save/
(retrying transient 5xx, tolerating rate limits and hangs), and backfills the
returned timestamped URL into each PROVENANCE.json. This second, independent
copy means a rotted entry whose local artifact is somehow lost still has a
fallback. If the original is already dead at first fetch, archive.py fetch
pulls the most recent existing Wayback capture instead.
Politeness & safety
The manifest is author-controlled, so SSRF is not a real threat, but the tool still: sets per-request timeouts, enforces the 25 MB cap, rate-limits to one request per host at a time, and identifies itself honestly. Beyond that:
- Honour X-Robots-Tag: noarchive and the equivalent <meta name="robots" content="noarchive"> in an HTML response body (cheap to check: it is in the head of the document just fetched). If either is present, the fetch is abandoned and the manifest entry flagged. This is the directive that actually governs archiving (as opposed to crawling); respecting it costs nothing and makes the posture defensible.
- Skip authenticated content. archive.py never sends cookies or credentials. If a URL needs authentication, it is not archived; at most it is a manual visibility: private artifact.
- robots.txt is not gated. A curated, single-shot, attributed, noindex'd fetch of a URL the site already cites is not crawling; it is the same operation a reader's browser performs on click. This matches Save-Page-Now and reference-manager norms. The load-bearing ethical commitment is the removal channel, advertised on /archive/, on every archive page, and inside the User-Agent string.
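The noarchive check covers two places: the response header and the robots <meta> tag in the body. A standard-library sketch follows; forbids_archiving is a hypothetical name, and the sketch deliberately ignores user-agent-scoped header values such as "googlebot: noarchive", which a fuller implementation might also handle.

```python
from html.parser import HTMLParser


class _RobotsMeta(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = {k: (v or "") for k, v in attrs}
        if a.get("name", "").strip().lower() == "robots":
            self.directives.update(
                d.strip().lower() for d in a.get("content", "").split(",")
            )


def forbids_archiving(headers, html_text=""):
    """True if the header or the HTML body carries a noarchive directive."""
    header = next(
        (v for k, v in headers.items() if k.lower() == "x-robots-tag"), ""
    )
    if "noarchive" in {d.strip().lower() for d in header.split(",")}:
        return True
    parser = _RobotsMeta()
    parser.feed(html_text)
    return "noarchive" in parser.directives
```

For a PDF response only the header applies, so html_text can be left empty.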
Text extraction & indexing
The "Full text" block is what makes an archived work indexable rather than an opaque blob. Extraction:
- PDF → pdftotext (from poppler, already a build dependency for the pdf-thumbs Makefile target). Stored as document.txt.
- HTML → readable text pulled from the monolith snapshot with BeautifulSoup (already a dependency of embed.py). Headings are preserved. Stored as snapshot.txt.
Both .txt files are gitignored. archive.py fetch regenerates a .txt
whenever the artifact's current SHA-256 differs from the value stamped in the
adjacent *.txt.sha256 sidecar (also gitignored), then re-stamps it. This
catches every way the committed artifact and the local — gitignored, not
git pull-ed — text could drift apart: a refresh, a pdftotext upgrade, a
truncated file. The indexed text is thus always in sync with the embedded
artifact.
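The sidecar comparison can be sketched as two small functions. The names text_is_stale and restamp are hypothetical, and the sidecar format (a bare hex digest followed by a newline) is an assumption; the spec fixes only that the stamp holds the artifact's SHA-256.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hex SHA-256 of a file, read in 1 MiB blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()


def text_is_stale(artifact, text_file):
    """True when the .txt must be regenerated from the artifact."""
    text_file = Path(text_file)
    stamp = Path(str(text_file) + ".sha256")
    if not text_file.exists() or not stamp.exists():
        return True
    return stamp.read_text(encoding="ascii").strip() != sha256_of(artifact)


def restamp(artifact, text_file):
    """Record the artifact's current hash next to the regenerated .txt."""
    stamp = Path(str(text_file) + ".sha256")
    stamp.write_text(sha256_of(artifact) + "\n", encoding="ascii")
```

A missing .txt or a missing stamp both count as stale, so a fresh clone regenerates every text file on its first fetch.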
Once the archive page renders this text into _site/archive/{slug}/index.html:
- embed.py walks _site/**/*.html after the Hakyll build. Archive pages are ordinary HTML files in that tree, so they are embedded with no change to embed.py: they automatically join both the page-level similarity corpus (similar-links.json) and the paragraph-level semantic index (semantic-index.bin / semantic-meta.json).
- Pagefind likewise indexes them automatically. Two filter tags on the archive template, type: archive and the link-rot status, let search-filters.js separate archive hits from native content and let a reader see (or exclude) rotted-citation archive pages.
The one requirement this imposes: the archived text must be in the rendered
DOM, not only inside the PDF or snapshot iframe. embed.py's BeautifulSoup
pass and Pagefind both see DOM text only. Hence the "Full text" block, the
fourth item in the archive page layout, is non-optional.
Backlinks integration — "Referenced by"
The goal: an archived paper's page shows every site page that cites it.
Today Backlinks.hs runs in two passes (see its module header). Pass 1
(version "links") extracts links per content file; isPageLink drops every
external URL. Pass 2 inverts target → [sources]. The archive needs two
surgical changes, both driven by data/archive-index.json:
- Pass 1: keep archived externals. isPageLink is widened: an external URL is kept if it matches an entry in archive-index.json. Non-archived externals are still dropped, exactly as now.
- Pass 2: canonicalize to the archive URL. When inverting, an archived external URL is rewritten to its /archive/{slug}/ key.
backlinksField then works unchanged: the archive page looks up its own route
and finds its citing pages. The archive template labels the section
"Referenced by" rather than "Backlinks" — semantically truer for a
third-party work — but the underlying field is the same.
This is purely additive: the visible link in the essay still points at the
original URL (reader expectation is preserved); only the backlink relationship
is recorded against the archive page. Archive pages do not need to be added to
Patterns.allContent — they only receive backlinks, and that needs a route,
not a version "links" pass.
When archive-index.json is absent — .venv not set up, or archive.py
has never run — it is treated as empty: Backlinks.hs and Filters/Archive.hs
silently no-op, and the build succeeds unchanged. For Backlinks.hs that means
every external URL is dropped exactly as today, with no canonicalization and no
error. This is a hard requirement, not a nicety: it preserves the established
.venv-gated silent-skip convention so a contributor without the Python
environment still gets a clean build.
URL matching — the alias problem
A cited URL in the wild has many equivalent forms: http:// vs https://,
trailing slash or not, ?utm_source=… query junk, arXiv abs ↔ pdf ↔
versioned (/abs/2403.12345, /abs/2403.12345v2, /pdf/2403.12345.pdf). If
the index is keyed only by the manifest's canonical URL, a citation to any
variant misses, and "Referenced by" silently under-counts — a failure that
breaks nothing visibly and is miserable to debug.
So archive.py computes the equivalent-URL set per entry and stores it as
aliases in archive-index.json. The normalization is deliberately narrow:
- Tracking parameters are stripped: utm_*, fbclid, gclid, mc_*, ref, igshid, _hsenc, _hsmi, mkt_tok.
- All other query parameters are preserved. A ?v=…, a ?id=…, a Wayback timestamp is load-bearing; blanket query stripping would alias …/article?id=42 to every other article on the host.
- http / https are folded, trailing slashes normalized, and known arXiv families (abs / pdf / versioned) expanded.
Backlinks.hs matches an incoming link against any alias before keying it to
the archive URL.
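The narrow normalization can be sketched with urllib.parse. This is an illustration under stated assumptions: normalize_url is a hypothetical name, it produces one comparison key rather than the stored alias set, it folds to https and drops the fragment for matching purposes, and the arXiv family expansion is left out.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_EXACT = {"fbclid", "gclid", "ref", "igshid", "_hsenc", "_hsmi", "mkt_tok"}
TRACKING_PREFIXES = ("utm_", "mc_")


def is_tracking(name):
    """True for the tracking parameters listed above, and nothing else."""
    name = name.lower()
    return name in TRACKING_EXACT or name.startswith(TRACKING_PREFIXES)


def normalize_url(url):
    """Fold scheme, host case, trailing slash, and tracking parameters."""
    parts = urlsplit(url)
    scheme = "https" if parts.scheme in ("http", "https") else parts.scheme
    # Every non-tracking parameter is kept, in its original order.
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if not is_tracking(k)
    ]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((scheme, parts.netloc.lower(), path, urlencode(query), ""))


print(normalize_url("http://example.com/article/?utm_source=x&id=42"))
# prints https://example.com/article?id=42
```

Because id=42 survives, …/article?id=42 and …/article?id=43 stay distinct, which is the point of stripping only a closed list of parameters.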
Granular backlinks (Phase 4 refinement)
If a citation targets a fragment — …/abs/2403.12345#section-4, or a PDF page
…/document.pdf#page=7 — the fragment is preserved through pass 2 instead of
being stripped by normaliseUrl. The archive page can then group "Referenced
by" entries by which section/page they cite: "Section 4 — referenced by [Essay
A], [Essay B]." This is the "indexed granularly, by section" behaviour, on the
backlinks side.
Similar-pages integration — "Related"
This side is almost free. embed.py produces data/similar-links.json (page
similarity) from every file in _site/. Once archive pages render with their
full text (above), they are in the corpus:
- An essay's "Related" block can surface an archived paper.
- An archive page's "Related" block surfaces neighbouring archives and the site content nearest to it.
SimilarLinks.hs needs no change — /archive/{slug}/ is just another URL key,
and similarLinksField resolves it like any page. Two small embed.py config
nudges: add /archive/ to EXCLUDE_URLS (the index is a list page and would
otherwise dominate neighbours), and let individual archive pages through.
Cost — a Phase 4 risk with a concrete trigger. embed.py has a coarse
whole-run staleness skip but no per-document incrementality: when it does run,
it re-embeds the entire corpus. A serious archive (hundreds of entries, several
MB of extracted text each for long papers) materially extends every run that
executes. Phase 4 measures this and applies a fixed trigger: once the archive
passes 50 entries, or embed.py's runtime exceeds 60 seconds, add a
per-document embedding cache keyed by content hash to embed.py. Below both
thresholds, the full-corpus re-embed is left alone — premature optimization
otherwise.
Granular similar-pages (deferred)
embed.py already builds a paragraph-level index
(semantic-index.bin + semantic-meta.json, keyed {url, title, heading, excerpt}). An archived HTML snapshot's preserved headings mean its sections get
distinct paragraph vectors automatically — the data for section-granular
"Related" exists the moment archive text is in the DOM. What does not yet
exist is a UI that consumes it per-section, for any content type. A
per-section "Related" block is deferred site-wide; the archive system feeds
the granular index regardless. For PDFs, section structure is unreliable
(pdftotext flattens it); per-page chunking is the realistic granularity —
see Open questions.
Link annotation in content
When the author writes a link to a URL that is archived, the build appends a
small archive affordance — a superscript "[A]" / "archived" marker next to the
link — pointing at /archive/{slug}/. No per-link markup; entirely automatic.
Implementation: a Pandoc filter, Filters/Archive.hs, registered in
Filters.hs. For every Link whose URL matches archive-index.json (alias
set included), it appends the affordance inline.
Filter ordering — pinned, then verified. Per /colophon, the site's AST
chain is markdown → pandoc → citations → wikilinks → preprocessing → sidenotes → smallcaps/dropcaps → links → images → math. Filters/Archive.hs is pinned
immediately after smallcaps/dropcaps and immediately before links, not
merely "somewhere before links". The narrower window matters because
smallcaps/dropcaps rewrites the text content of nodes, so if Archive.hs
decorated first, the [A] affordance could be swept into a smallcaps run or
mistaken for an opening character by dropcap logic. Running it after
smallcaps/dropcaps appends the affordance to already-styled text that nothing
downstream re-touches; running it before links lets the link-decoration pass
(and any future popup hooks) act on the already-annotated tree. This chain is
transcribed from a published page; Phase 3 confirms it against Filters.hs's
actual registration order before the position is pinned in code — a doc and
the implementation can drift.
Confirmed (2026-05-22). Filters.hs's applyAll applies, innermost
first: Images → SourceRefs → Code → Math → Dropcaps → Smallcaps → Links → Typography → Sidenotes → Aftermatter. The /colophon narrative is a loose
paraphrase — Images and Math run early, Sidenotes runs late — but
Smallcaps and Links are adjacent, so Filters.Archive is pinned between
them, exactly as specified above. (/colophon is prose, not authoritative for
filter order, and was left unchanged.)
When archive-index.json is absent — .venv not set up, or archive.py
has never run — it is treated as empty: Backlinks.hs and Filters/Archive.hs
silently no-op, and the build succeeds unchanged. For Filters/Archive.hs that
means every Link passes through un-annotated, no error raised.
Bibliography — confirmed (2026-05-22): a separate context field.
Citations.hs runs applyCitations before the applyAll filter chain; it
partitions the citeproc refs Div out of the document AST
(extractBibliography) and renders it to an HTML string via writeHtml5String
for the template's $bibliography$ field. The body filter chain — and so
Filters.Archive — never sees the bibliography. Prose links get affordances;
bibliography reference links do not.
This does not put the broken popup layer on the critical path, as the
draft feared. Citations.hs already performs AST surgery on each bibliography
entry (enhanceEntry — it wraps file: PDF links and appends keyword strips),
so the realistic annotation hook is enhanceEntry, reusing Filters.Archive's
index lookup — no popup dependency. That is deferred to a Phase 3 follow-up:
it first needs a check that chicago-notes.csl renders a cited work's
url/doi as a Link node (a CSL style that omits URLs would leave nothing
to match). Phase 3 ships prose-link annotation; bibliography annotation is
documented as in-scope and hookable via enhanceEntry, pending that check. A
future popup rewrite may also consult archive-index.json, but the archive
system depends on neither the current nor a future popup implementation.
Link-rot detection & maintenance (Phase 5)
tools/archive.py check issues a HEAD (falling back to a ranged GET) to
every original URL in the manifest and updates data/archive-state.json.
Hysteresis is asymmetric. Rotting is slow; recovery is fast.
- Rotting. A failed probe increments consecutive-failures and sets status: error. Only after 3 consecutive failed scans spanning ≥ 14 days does the status become rotted. A single transient failure (a Cloudflare challenge, a temporary 5xx, a DNS hiccup) therefore never flips a live citation.
- Recovery. A single successful probe resets consecutive-failures to 0 and returns the status straight to live, from error or rotted alike. There is no cost to un-rotting eagerly (if the original is reachable again, the reader should go there), so recovery needs no hysteresis.
| status | Meaning | Rendering effect |
|---|---|---|
| live | Original reachable, unchanged | Normal: link to original, archive as backup |
| moved | 3xx to a new location | Banner notes the move; new URL recorded |
| rotted | Failed the hysteresis threshold (3 fails / ≥14 days) | Build flips the primary link to the archive copy; original shown struck-through as "(dead link)" |
| error | Transient / inconclusive, below the hysteresis threshold | No rendering change; retried next scan |
paywalled is deliberately absent from this table: a soft paywall returns
200, so an automated HEAD/GET cannot reliably detect it. Paywall status is
the manual paywalled: true manifest flag instead, and it drives only a banner
note — never a link flip.
The flip on rotted is the actual link-rot cure: a reader of a 2019 essay
clicks through to a working local snapshot instead of a 404, with no manual
intervention — and only after the rot is confirmed, not guessed.
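The asymmetric hysteresis is a small state transition per probe. The sketch below assumes the state also records the date of the first failure in the current run under a "first-failure" key, which the spec does not name; apply_probe is hypothetical, and the moved status is left out for brevity.

```python
from datetime import date, timedelta

ROT_FAILURES = 3
ROT_SPAN = timedelta(days=14)


def apply_probe(state, ok, today):
    """Return the next state dict after one scan of one URL."""
    if ok:
        # Recovery needs no hysteresis: one success returns straight to live.
        return {"status": "live", "consecutive-failures": 0, "first-failure": None}
    failures = state.get("consecutive-failures", 0) + 1
    first = state.get("first-failure") or today
    # Rot needs both conditions: enough failures AND enough elapsed time.
    rotted = failures >= ROT_FAILURES and (today - first) >= ROT_SPAN
    return {
        "status": "rotted" if rotted else "error",
        "consecutive-failures": failures,
        "first-failure": first,
    }
```

Three failures on three consecutive days leave the entry at error; the status becomes rotted only on a failed scan at least 14 days after the first failure in the run.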
check is a slow network job, not something every make build should pay for.
It runs on its own cadence — a periodic local make archive-check, or a
scheduled remote agent. It is decoupled from the main build: the build consumes
whatever archive-state.json exists.
Build-pipeline integration
New steps slot into the Makefile build target, gated on .venv (silent
skip), consistent with embed.py and the photography extractors:
make build:
git auto-commit content/ (existing — archive/ NOT swept in)
tools/convert-images.sh (existing)
pdf-thumbs (existing)
download-pdfjs.sh / download-leaflet.sh (existing)
→ tools/archive.py fetch (NEW — fetch missing artifacts,
extract text, write
PROVENANCE.json +
archive-index.json)
extract-exif / palette / dimensions (existing)
cabal run site -- build (existing — now also routes archive/)
pagefind --site _site (existing — now also indexes archive pages)
tools/embed.py (existing — now also embeds archive pages)
stamp-build-time.py / compress-assets.sh (existing)
tools/archive.py fetch runs before cabal run site -- build so the
artifacts, PROVENANCE.json files, and archive-index.json all exist when
Hakyll routes the archive/ tree and when Backlinks.hs loads the index.
fetch is incremental — a normal build with no new manifest entries does no
network I/O — but it still rewrites archive-index.json every run. Wayback
submission is not in this path. The monolith binary is committed
(tools/bin/monolith), so there is no download step.
make build never deletes anything under archive/. Artifact removal is
exclusively the job of the opt-in make archive-gc (see Eviction).
Standalone targets, none a dependency of build:
- make archive-check: link-rot scan.
- make archive-wayback: backfill outstanding Wayback captures.
- make archive-suggest: print the "cited but not archived" diff against data/*.bib (DOI-only entries resolved; removed.yaml entries excluded).
- make archive-gc: delete archive/{slug}/ directories whose slug is recorded in removed.yaml; report (never delete) orphans that are not.
Build module structure
New Haskell module:
- build/Archive.hs: patterns, routing rules, and contexts for the archive. Generates /archive/ and every /archive/{slug}/ page from archive/manifest.yaml + PROVENANCE.json + data/archive-state.json; routes the raw artifacts through unchanged. Pages and routed artifacts come only from current manifest.yaml entries, so an orphaned archive/{slug}/ directory is inert (no page, not deployed). Integrity (SHA-256) verification is tools/archive.py's job: it runs first and halts the build on a mismatch; Archive.hs trusts a present (provenance, artifact) pair and skips any entry lacking either. Separated from Site.hs for the same reason Catalog.hs, Authors.hs, and Photography.hs are: scoped concerns, isolated reasoning.
New Pandoc filter:
- build/Filters/Archive.hs: the link-annotation filter; registered in Filters.hs immediately after smallcaps/dropcaps, before the links pass. No-op when archive-index.json is absent.
Edits to existing modules:
- build/Patterns.hs: add archivePattern (artifact files) and archiveManifest. Add archive entries to tagIndexable so tagged archives reach the tag indexes. (Deliberately not added to allContent: archive pages receive backlinks but are not crawled for outbound links in v1.)
- build/Backlinks.hs: load data/archive-index.json (silent no-op if absent); widen isPageLink to keep archived externals; match incoming links against the alias set; canonicalize them to /archive/{slug}/ in pass 2.
- build/Site.hs: wire the archive rules from Archive.hs; add the /archive/ link to the footer /colophon routing.
- build/Stats.hs: contribute archive metrics to the /build/ telemetry page: count; total bytes; median artifact age; counts by snapshot-quality, status, and visibility; paywalled count; and any orphan slugs (directories not in manifest.yaml and not in removed.yaml; they should not exist, so surface them where drift is visible).
- templates/partials/head.html: add a noindex context hook and a $if(archive)$ link to static/css/archive.css (the archive pages' stylesheet: banner, provenance panel, artifact viewer, index list; scoped under #markdownBody to clear the prose rules in typography.css).
Templates
New files under templates/:
| File | Role |
|---|---|
| archive-index.html | /archive/ (the full text list, type/tag/status filters); includes archive-removal-notice |
| archive.html | /archive/{slug}/ (banner, metadata, embedded artifact, full text, Referenced-by, Related); includes archive-removal-notice |
New partials:
| File | Role |
|---|---|
| partials/archive-banner.html | The "archived copy / view original" strip, reused by archive.html and any inline archive embed |
| partials/archive-card.html | Archive-entry card (text-only; no thumbnail in v1) for the index and for /library.html |
| partials/archive-removal-notice.html | The removal-request line (ln@levineuwirth.org); included directly by archive.html and archive-index.html |
Existing partials reused unchanged: nav.html, head.html (with the new
noindex flag), footer.html, page-footer.html. The removal notice is a
new partial precisely so page-footer.html stays untouched.
Storage, repo size & .gitignore
Committed: the artifacts (document.pdf, snapshot.html), PROVENANCE.json,
manifest.yaml, removed.yaml, and the pinned monolith binary
(tools/bin/monolith). Gitignored: everything regenerable.
Append to .gitignore:
# Archive: generated text + its staleness stamp (recreated from the committed
# artifact on every build — deterministic, so committing them is churn).
archive/**/*.txt
archive/**/*.txt.sha256
# Archive: generated state (written by tools/archive.py).
# NOTE: archive/**/PROVENANCE.json is deliberately NOT ignored — it is the
# committed, immutable record of each archival event.
data/archive-state.json
data/archive-index.json
Repo-size policy. Archived artifacts are immutable once taken, so they add
no history bloat — but the working tree grows. v1 commits them: a preservation
guarantee that depends on an un-versioned side store is a weaker guarantee, and
git clone → make build must reproduce the whole site.
- Per-artifact cap: 25 MB. archive.py fetch warns and skips above it; a deliberately-oversize artifact is committed with git add -f. This stops a 200 MB scan from being swept silently into a commit.
- Migration tripwire. If archive/ exceeds ~5 GB, or doubles year-over-year, evaluate moving the artifact store out of the main repo, to a separate archive repository or a content-addressed store the VPS rsyncs independently. tools/archive.py reads the store root from a single config value, so the move is a config change, not a redesign.
- Never git LFS. LFS smudges the property that makes this system worth having: with LFS, git clone no longer yields the artifacts unless the LFS server is up and authenticated. For a system whose value proposition is "this survives," that is a regression. If migration is needed, the destination is a separate repo or object store, not LFS in this one.
Legal, ethical & SEO posture
Archiving third-party content touches copyright. The design's guardrails:
- noindex on every archive page. The archive preserves; it does not republish to search engines or compete with originals for ranking.
- The original is the hero. Every archive page links prominently to the source and is explicitly framed as a dated archived copy.
- A real removal channel, everywhere. A request to ln@levineuwirth.org gets the entry removed (see Eviction). The channel is advertised on /archive/, on every individual archive page, and inside the fetcher's User-Agent string. This is the load-bearing ethical commitment; robots.txt is only a proxy for it.
- noarchive honoured. Both X-Robots-Tag: noarchive (HTTP header) and <meta name="robots" content="noarchive"> (HTML body) abort a fetch.
- Authenticated content skipped. The fetcher sends no credentials. Anything behind a login is not archived.
- visibility: private keeps a snapshot in-repo for the author's own reference without deploying the artifact to _site/, the appropriate setting for licensed material the author may read but should not redistribute. The archive page still exists (metadata + "held offline"), so link-rot tracking and the Wayback link survive.
- Curated, not crawled. The archive only ever contains works this site deliberately references, a fundamentally different posture from a scraper.
- Attribution preserved. Author, source title, source date, and original URL are surfaced on every archive page.
This is a personal-scale citation archive, consistent with long-standing practice on research-oriented personal sites. It is not a content platform.
Phased implementation
Each phase has explicit exit criteria. Do not start a phase until the previous one passes.
Phase 1 — Skeleton, PDF only
Bootstrap entry: NIST FIPS 203 (ML-KEM), PDF at
https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.203.pdf — a stable, auth-free
PDF already cited in data/simd-paper.bib, so the test entry keeps its value
after Phase 1 ships.
- Define archive/manifest.yaml and archive/removed.yaml schemas; create manifest.yaml with the bootstrap entry
- tools/archive.py fetch: PDF download, size cap, pdftotext, .txt.sha256 staleness stamp, write per-entry PROVENANCE.json; always rewrite archive-index.json; refuse a (slug, url) mismatch, and re-hash every committed artifact (non-zero exit on a SHA mismatch)
- build/Archive.hs: routing for /archive/, /archive/{slug}/, and the raw document.pdf; orphaned directories produce no page (a pass-1 refinement subsequently added a Haskell-side SHA-256 re-hash via sha256sum, so the integrity guarantee holds even when archive.py did not run first, as with direct cabal invocations, deploy hosts without .venv, etc.)
- templates/archive.html, templates/archive-index.html, partials/archive-banner.html, partials/archive-removal-notice.html
- PDF artifact embedded on the page (Phase 2 changed this to a raw, browser-native <iframe> embed; see the Display — PDF decision)
- Extracted text rendered into the page DOM (collapsed <details>)
- noindex hook in head.html; set on archive pages
- Eviction works end-to-end: make archive-gc, removed.yaml gating, orphan reporting (see Eviction & removal)
- Wire tools/archive.py fetch into the Makefile, .venv-gated
- .gitignore additions (PROVENANCE.json explicitly not ignored)
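The integrity step in the fetch item above (re-hash every committed artifact, non-zero exit on a SHA mismatch) can be sketched as follows. This is a minimal illustration, not the code in tools/archive.py: it assumes each entry directory holds a PROVENANCE.json with hypothetical "artifact" and "sha256" keys, and the real field names may differ.

```python
import hashlib
import json
from pathlib import Path
from typing import List


def sha256_of(path: Path) -> str:
    """Hex SHA-256 of a file, read in chunks so large PDFs stay cheap."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_store(root: Path) -> List[str]:
    """Return the slugs whose artifact is missing or no longer matches its recorded hash."""
    mismatched = []
    for prov_path in sorted(root.glob("*/PROVENANCE.json")):
        prov = json.loads(prov_path.read_text(encoding="utf-8"))
        # "artifact" and "sha256" are assumed key names, for illustration only.
        artifact = prov_path.parent / prov["artifact"]
        if not artifact.is_file() or sha256_of(artifact) != prov["sha256"]:
            mismatched.append(prov_path.parent.name)
    return mismatched
```

A caller would treat a non-empty result as fatal and exit non-zero, which is the behaviour the list item asks for.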
Exit criteria: the FIPS 203 PDF renders at /archive/{slug}/ with banner,
metadata, working PDF.js embed, visible extracted text, and a removal-request
notice; /archive/ lists it; both carry noindex. The eviction procedure
(record in removed.yaml → drop the manifest line → make archive-gc) removes
the artifact; a manifest line deleted without a removed.yaml entry leaves
the artifact intact and emits a warning. Running make build ten times in
succession with no manifest edits produces no changes under archive/ — no
deletions, no PROVENANCE.json rewrites, no artifact replacements.
Met (2026-05-22). FIPS 203 fetched (1.25 MB, 3601 lines of extracted
text); /archive/nist-fips-203/ renders with banner, metadata, PDF.js iframe,
in-DOM full text, and removal notice; /archive/ lists it; both carry
noindex. gc was verified on both paths — an orphan directory is reported
and left intact (exit 1); a removed.yaml-listed directory is deleted while
the manifest entry is untouched. archive/ is byte-identical across repeated
fetch + build cycles. The PDF.js iframe is correctly wired; rendering the
viewer needs static/pdfjs/, which make build vendors via download-pdfjs.sh.
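The two gc paths verified above reduce to a small set computation. The sketch below is an assumption-laden illustration rather than the shipped make archive-gc logic: it takes the slugs as plain collections (YAML parsing is left out), treats removed.yaml as the only gate for deletion, and the function names are hypothetical.

```python
from typing import Iterable, List, Tuple


def plan_gc(on_disk: Iterable[str],
            manifest: Iterable[str],
            removed: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split artifact directories into (to_delete, orphans).

    Deletion is gated on removed.yaml alone; a directory that is merely
    missing from the manifest is reported and left intact.
    """
    manifest_set, removed_set = set(manifest), set(removed)
    to_delete: List[str] = []
    orphans: List[str] = []
    for slug in sorted(set(on_disk)):
        if slug in removed_set:
            to_delete.append(slug)
        elif slug not in manifest_set:
            orphans.append(slug)
    return to_delete, orphans


def gc_exit_code(orphans: List[str]) -> int:
    """Non-zero when orphans were found, mirroring the reported exit 1."""
    return 1 if orphans else 0
```

Keeping the planning step pure makes both paths easy to test without touching the filesystem; the actual directory removal would happen only for the first list.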
Phase 2 — HTML snapshots
Bootstrap entry: https://cr.yp.to/aes-speed.html (slug: djb-aes-speed)
— Bernstein's cache-timing-attacks page, cited in data/simd-paper.bib. A
stable, JavaScript-free static page, so its snapshot is reproducible and
classifies cleanly as ok; like FIPS 203 it keeps its value after the phase
ships.
- Commit the pinned monolith binary at tools/bin/monolith; record version + sha256 in tools/monolith-version.txt
- tools/archive.py fetch, HTML branch: monolith --no-js, CSP <meta> injection (style-src + -elem + -attr, script-src / object-src / frame-src 'none'), text extraction via BeautifulSoup, type detection
- snapshot-quality classification (ok / degraded / js-required) written to PROVENANCE.json; degraded captures flagged on /archive/
- Sandboxed <iframe> rendering (referrerpolicy="no-referrer", no allow-scripts) in archive.html
Exit criteria: an HTML URL snapshots to a self-contained file with a CSP
<meta>, renders in a sandboxed no-referrer iframe with the original's styling
isolated, and shows extracted readable text in site chrome; the sandboxed
snapshot renders correctly under the CSP in both Firefox and a Chromium-based
browser; capture quality is classified and a degraded snapshot is visibly
flagged; the author has reviewed the rendered snapshot before committing it.
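The CSP <meta> injection named in the plan can be sketched with the standard library. Treat this as a rough illustration: the directive names come from the list above, but the style-src values ('unsafe-inline' and data:) are an assumption, and the regex-based insertion is a dependency-free stand-in for whatever parser the fetcher actually uses.

```python
import re

# Directive names follow the plan; the style-src values are an assumption
# (inline styles and data: URIs, which a single-file snapshot relies on).
ARCHIVE_CSP = "; ".join([
    "script-src 'none'",
    "object-src 'none'",
    "frame-src 'none'",
    "style-src 'unsafe-inline' data:",
    "style-src-elem 'unsafe-inline' data:",
    "style-src-attr 'unsafe-inline'",
])

# Matches <head> or <head attr="..."> but not <header>.
_HEAD_OPEN = re.compile(r"<head(\s[^>]*)?>", re.IGNORECASE)


def inject_csp(html: str, policy: str = ARCHIVE_CSP) -> str:
    """Insert a CSP <meta> immediately after the opening <head> tag."""
    meta = '<meta http-equiv="Content-Security-Policy" content="%s">' % policy
    match = _HEAD_OPEN.search(html)
    if match is None:
        raise ValueError("snapshot has no <head> element")
    return html[:match.end()] + meta + html[match.end():]
```

Placing the tag first matters because a <meta> policy only governs content that follows it in the document.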
Met (2026-05-22). monolith 2.10.1 (monolith-gnu-linux-x86_64) is
vendored at tools/bin/monolith with its version + sha256 in
tools/monolith-version.txt; archive.py fetch locates it via $MONOLITH_BIN
→ tools/bin/monolith → $PATH, and warns-and-skips (build continues) when it
is absent. cr.yp.to/aes-speed.html snapshots to a 26 KB self-contained
snapshot.html with the archive CSP <meta> as the first <head> child;
/archive/djb-aes-speed/ renders it in a sandboxed, no-referrer iframe with
291 lines of extracted prose shown inline as <p> paragraphs; snapshot-quality
classifies ok, and a (synthetically forced) degraded entry shows the warning
note on the page and a flag on /archive/. fetch is idempotent — archive/
is byte-identical across re-runs. The committed artifact is snapshot.html;
snapshot.txt + .sha256 are gitignored (the existing archive/**/*.txt
globs already cover them).
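The lookup order recorded above ($MONOLITH_BIN, then tools/bin/monolith, then $PATH) is simple enough to sketch. The function name and the repo_root parameter are hypothetical, and falling through when $MONOLITH_BIN points at a missing file is an assumption about the intended behaviour.

```python
import os
import shutil
from pathlib import Path
from typing import Optional


def locate_monolith(repo_root: Path) -> Optional[Path]:
    """Resolve the monolith binary, or return None so the caller can warn and skip."""
    override = os.environ.get("MONOLITH_BIN")
    if override and Path(override).is_file():
        return Path(override)
    vendored = repo_root / "tools" / "bin" / "monolith"
    if vendored.is_file():
        return vendored
    on_path = shutil.which("monolith")
    return Path(on_path) if on_path else None
```

Returning None instead of raising keeps the build running when the binary is absent, matching the warn-and-skip behaviour described above.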
Author-gated, by design (exit-criteria wording). Two criteria are not
machine-checkable here and remain the author's: (1) the cross-browser CSP
render in Firefox and a Chromium browser; (2) the per-snapshot review before
committing archive/. The vendored monolith binary and the FIPS 203 / djb
artifacts are staged but not committed — committing archive/ and
tools/bin/monolith is the deliberate author act the design specifies.
One real-world note from the bootstrap: cr.yp.to ships
<meta name="robots" content="none">. Per spec none ≡ noindex, nofollow —
it is not noarchive, so the snapshot proceeded correctly; only an explicit
noarchive (header or meta) aborts a fetch.
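The abort rule can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the function names are hypothetical, html.parser stands in for whatever parser the fetcher uses, and agent-scoped header forms such as "googlebot: noarchive" are not handled.

```python
from html.parser import HTMLParser
from typing import List, Mapping


class _RobotsMeta(HTMLParser):
    """Collect the content of every <meta name="robots"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.values: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key.lower(): (value or "") for key, value in attrs}
        if attr.get("name", "").strip().lower() == "robots":
            self.values.append(attr.get("content", ""))


def _has_noarchive(directives: str) -> bool:
    # Only the literal token counts: "none" means noindex, nofollow.
    return "noarchive" in {token.strip().lower() for token in directives.split(",")}


def forbids_archiving(headers: Mapping[str, str], html: str) -> bool:
    """True when an X-Robots-Tag header or a robots <meta> says noarchive."""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag" and _has_noarchive(value):
            return True
    parser = _RobotsMeta()
    parser.feed(html)
    return any(_has_noarchive(value) for value in parser.values)
```

Matching whole comma-separated tokens, rather than searching for a substring, is what keeps content="none" from being misread as a refusal.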
Phase 3 — Link annotation & Wayback
- Confirm Filters.hs's actual filter registration order matches the AST chain documented on /colophon before pinning the filter's position
- Confirm whether the bibliography is rendered into the document AST or a separate context field; this decides whether bibliography annotation is in scope here or gated on the popup rewrite (see Link annotation)
- build/Filters/Archive.hs: annotate body links to archived URLs; register in Filters.hs after smallcaps/dropcaps, before links; no-op when archive-index.json is absent
- archive.py wayback + make archive-wayback: non-blocking submission, backfill wayback into PROVENANCE.json
- visibility: private handling (artifact not routed to _site/)
Exit criteria: a prose link to an archived URL gets an automatic archive
affordance; a build without .venv (no archive-index.json) still succeeds
with links un-annotated; every entry has a recorded Wayback URL after make archive-wayback; a private entry's page renders without deploying its
artifact; the bibliography-annotation path is documented as either in-scope or
popup-gated.
Met (2026-05-22). build/Filters/Archive.hs walks body Link nodes and,
for any URL in data/archive-index.json (canonical + alias set, fragment- and
trailing-slash-tolerant), appends a superscript archive-affordance link to
/archive/<slug>/ — emitted as RawInline HTML so the downstream Links
pass leaves it alone. It is registered in Filters.applyAll between
Smallcaps and Links; the index loads once via an unsafePerformIO CAF and
an absent/empty index makes the filter the identity (verified: a prose link to
the archived cr.yp.to/aes-speed.html gains the affordance, a non-archived
link does not). archive.py wayback (+ make archive-wayback) submits each
entry lacking a wayback capture to the Wayback Machine and backfills
PROVENANCE.json; it always exits 0 and is never on a build's critical path.
visibility: private is a manifest.yaml field: a private entry's artifact is
never routed to _site/ (artifacts are routed by an explicit public-only list,
which also stops an orphan directory's artifact deploying), and its page
renders provenance + a "held offline" panel with no embed and no extracted text
(verified: a private _site/archive/<slug>/ contains only index.html).
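The matching rule (canonical + alias set, fragment- and trailing-slash-tolerant) is implemented in Haskell; the Python sketch below only illustrates the idea. The index key names "slug", "url" and "aliases" are assumptions, as is the http:// alias used in the usage example.

```python
from typing import Dict, Iterable, Optional
from urllib.parse import urldefrag


def match_key(url: str) -> str:
    """Normalise a URL for lookup: drop the fragment and any trailing slash."""
    without_fragment, _fragment = urldefrag(url)
    return without_fragment.rstrip("/")


def build_lookup(entries: Iterable[dict]) -> Dict[str, str]:
    """Map every canonical and alias URL form to its entry's slug."""
    lookup: Dict[str, str] = {}
    for entry in entries:
        # "slug", "url" and "aliases" are assumed key names for the index.
        for form in [entry["url"], *entry.get("aliases", [])]:
            lookup[match_key(form)] = entry["slug"]
    return lookup


def archive_page_for(url: str, lookup: Dict[str, str]) -> Optional[str]:
    """Return the /archive/<slug>/ path for an archived URL, else None."""
    slug = lookup.get(match_key(url))
    return f"/archive/{slug}/" if slug else None


index = build_lookup([{
    "slug": "djb-aes-speed",
    "url": "https://cr.yp.to/aes-speed.html",
    "aliases": ["http://cr.yp.to/aes-speed.html"],  # hypothetical alias
}])
print(archive_page_for("https://cr.yp.to/aes-speed.html#top", index))
```

Normalising both the index keys and the looked-up link through the same function is what keeps the two sides in agreement.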
Two items are deliberately scoped out of this pass, both documented above:
bibliography annotation (the bibliography is a separate $bibliography$
field; the hook is Citations.hs's enhanceEntry, pending a CSL-URL check —
not popup-gated) and pull-from-Wayback when the original is dead at fetch
time (it belongs with Phase 5 link-rot detection, where a dead URL is the
central case and a Wayback-sourced artifact's provenance can be handled
properly). The live make archive-wayback run is author-initiated — it submits
public captures to a third-party service.
Phase 4 — Backlinks & similar-pages indexing
- Backlinks.hs: load archive-index.json (silent no-op if absent); widen isPageLink; match the alias set; canonicalize archived externals to /archive/{slug}/ in pass 2
- "Referenced by" section on archive.html
- embed.py: add /archive/ to EXCLUDE_URLS; verify archive pages join similar-links.json and the paragraph index
- Measure embed.py runtime against a populated archive; add a per-document embedding cache (keyed by content hash) once the archive passes 50 entries or embed.py exceeds 60 s
- "Related" section on archive.html
- Fragment-preserving backlinks → grouped "Referenced by" by section/page
Exit criteria: an archive page lists the essays that cite it under
"Referenced by", including citations that used an alias URL form; essays surface
relevant archived works under "Related"; a fragment-targeted citation appears
grouped under its section; embed.py runtime with the archive populated is
measured and either under the thresholds or the cache is in place.
Met (2026-05-22). A shared build/ArchiveIndex.hs loads
data/archive-index.json once (the unsafePerformIO CAF formerly private to
Filters.Archive); Backlinks.hs and Filters.Archive both consume it.
Backlinks.isPageLink keeps an archived external URL regardless of scheme or
extension; pass 2 (targetKey) canonicalises it to the archived work's
/archive/<slug>/ page key — computed as the same string fed through
normaliseUrl that backlinksField uses for the page's own route, so the two
always agree. archiveEntryCtx gains referencedByField and
similarLinksField; archive.html renders $if(referenced-by)$ /
$if(similar-links)$ sections. referencedByField reuses the backlinks lookup
but groups sources by the fragment each citation targets — a #page=12
citation renders under a "Page 12" subheading, a bare citation in a flat list
above. embed.py excludes the /archive/ index from the corpus (individual
entry pages stay in) and is measured at ~12 s for the whole site (43 → 25
pages, 802 paragraphs) — far under the 60 s threshold and the 50-entry trigger,
so the per-document embedding cache is correctly not built (premature at this
scale; revisit at the threshold).
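The fragment grouping described above (a #page=12 citation under "Page 12", a bare citation in the flat list) can be illustrated in a few lines. The real logic lives in referencedByField in Haskell; this Python sketch uses hypothetical names, and showing a non-page fragment under its raw text is an assumption.

```python
import re
from typing import Dict, Iterable, List, Tuple
from urllib.parse import urldefrag

_PAGE = re.compile(r"^page=(\d+)$")


def fragment_label(fragment: str) -> str:
    """Heading for a fragment: 'page=12' becomes 'Page 12', anything else is kept raw."""
    match = _PAGE.match(fragment)
    return f"Page {match.group(1)}" if match else fragment


def group_citations(
    citations: Iterable[Tuple[str, str]],
) -> Tuple[List[str], Dict[str, List[str]]]:
    """Split (source_page, cited_url) pairs into a flat list and labelled groups."""
    flat: List[str] = []
    grouped: Dict[str, List[str]] = {}
    for source, cited_url in citations:
        _url, fragment = urldefrag(cited_url)
        if fragment:
            grouped.setdefault(fragment_label(fragment), []).append(source)
        else:
            flat.append(source)
    return flat, grouped
```

A template would render the flat list first and then one subheading per key of the grouped mapping.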
Verified end-to-end with a temporary citation in content/about.md: the
FIPS 203 page listed it under "Referenced by" with a flat entry and a grouped
"Page 12" entry; both archive pages surfaced the SIMD/PQC essay and each other
under "Related"; the /archive/ index was absent from similar-links.json.
One pre-existing embed.py issue was surfaced and fixed: the /source/
repository code mirror was in the similarity corpus — a template file was
surfacing as a neighbour, titled with its unrendered $title$ placeholder. An
EXCLUDE_PREFIXES rule now keeps /source/ out, which also dropped 18 junk
pages from the site-wide corpus (43 → 25).
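The difference between the two exclusion rules (an exact match for the /archive/ index, a prefix match for the /source/ mirror) is easy to get wrong, so a small sketch may help. The constant names come from the text; their contents here are reduced to the two cases discussed, and in_corpus is a hypothetical helper name.

```python
# Exact matches: only the index page itself is dropped.
EXCLUDE_URLS = {"/archive/"}
# Prefix matches: the whole repository code mirror is dropped.
EXCLUDE_PREFIXES = ("/source/",)


def in_corpus(url: str) -> bool:
    """True when a page should take part in similarity indexing."""
    if url in EXCLUDE_URLS:
        return False
    return not url.startswith(EXCLUDE_PREFIXES)


for candidate in ["/archive/", "/archive/nist-fips-203/", "/source/templates/archive.html"]:
    print(candidate, in_corpus(candidate))
```

Using an exact match for /archive/ is what lets individual entry pages stay in the corpus while the index stays out.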
Phase 5 — Link-rot detection & maintenance
Prerequisite — resolved 2026-05-22. /build/ had been serving a stale
cached page: its build-varying telemetry is gathered in unsafeCompiler, which
Hakyll does not dependency-track, so the page recompiled only when tracked
content changed. Fixed — build/Main.hs writes a per-build
data/build-stamp.txt that Stats.hs loads as a dependency, forcing /build/
and /stats/ to recompile every build. The archive-metrics exit criterion
below is now measurable.
- tools/archive.py check + make archive-check: HEAD/GET scan
- Asymmetric hysteresis: rotted requires 3 consecutive failed scans over ≥ 14 days; a single success → live; consecutive-failures + status-since tracked in archive-state.json
- Dead-link rendering: flip primary link to the archive on rotted
- Pagefind status filter tag wired into search-filters.js
- Archive metrics on /build/ telemetry (Stats.hs)
- /archive/ index shows per-entry health
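The asymmetric hysteresis rule in the list above can be written as a pure state transition. This is a sketch under stated assumptions, not the shipped next_state: the LinkState class is hypothetical, and its failing_since field (the time of the first failure in the current streak) stands in for however status-since is actually used.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

ROT_FAILURES = 3                  # consecutive failed scans required
ROT_WINDOW = timedelta(days=14)   # minimum span of the failure streak


@dataclass(frozen=True)
class LinkState:
    status: str = "live"                       # "live", "error" or "rotted"
    consecutive_failures: int = 0
    failing_since: Optional[datetime] = None   # assumed field: start of the streak


def next_state(state: LinkState, ok: bool, now: datetime) -> LinkState:
    """Advance one scan: failure is slow to conclude, recovery is immediate."""
    if ok:
        return LinkState()  # a single success resets everything to live
    failures = state.consecutive_failures + 1
    since = state.failing_since or now
    rotted = failures >= ROT_FAILURES and now - since >= ROT_WINDOW
    return LinkState("rotted" if rotted else "error", failures, since)
```

Both conditions must hold before a link is declared rotted, so a burst of failures during a short outage never flips a citation to its archived copy.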
Test endpoint: reserve a controlled host — e.g. archive-test.levineuwirth.org,
a sub-host the author owns — that can be toggled to return 404 on demand, so the
rot-detection test flips without depending on a third party's uptime.
Exit criteria: the controlled test URL is detected as rotted only after
the hysteresis threshold is met, and the citing essay's link then flips to the
archived copy; a single transient failure does not flip it; restoring the URL
returns it to live on the next successful scan; the /build/ page reports
archive coverage and health; search results can be filtered by archive status.
Met (2026-05-22). tools/archive.py check HEAD/GET-probes every manifest
URL (HEAD first, ranged GET on 403/405/501) and updates the gitignored
data/archive-state.json, which mirrors the manifest exactly (state for
dropped URLs is discarded). The asymmetric hysteresis in next_state is
unit-verified against synthetic scenarios — fail/fail/fail across 20 days flips
to rotted; three fast fails within 2 days stay at error; a single ok from
any non-live status recovers immediately to live. ArchiveIndex.hs exposes
the parsed status to consumers as archiveStatusForSlug. Filters.Archive
flips a rotted body link's href to /archive/<slug>/ (adding an
archive-rotted class and a solid "archived" affordance marker) — verified
end-to-end with a hand-crafted rotted state file: a content link to the
djb URL was rewritten to the archive page; reverting the state restored the
original link. archive.html carries data-pagefind-filter="type:archive, status:$status$", a "Link status" row in the provenance panel, and a
status-note callout in the header for non-live states. The /archive/ index
flags rotted entries with a solid "link rotted" chip. Stats.hs /build/
gains a "Link archive" section (count, total size, median age, by-status /
by-quality / by-visibility breakdowns, paywalled count, orphan directories) —
verified showing the test state's error 1 · rotted 1 mix.
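The probe order recorded at the start of this note (HEAD first, ranged GET on 403/405/501) can be sketched with urllib. The function names are hypothetical, the single-byte Range value and the 2xx/3xx success rule are assumptions, and the transport is injectable so the logic can be exercised without network access.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict

FALLBACK_STATUSES = {403, 405, 501}   # servers that refuse or lack HEAD

Requester = Callable[[str, str, Dict[str, str]], int]


def http_status(url: str, method: str, headers: Dict[str, str]) -> int:
    """Perform one request with urllib and return the HTTP status code."""
    request = urllib.request.Request(url, method=method, headers=headers)
    try:
        with urllib.request.urlopen(request, timeout=20) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
    except OSError:
        return 0  # no HTTP response at all (DNS failure, timeout, refused)


def probe(url: str, request: Requester = http_status) -> bool:
    """True when the URL answers successfully: HEAD first, ranged GET as fallback."""
    status = request(url, "HEAD", {})
    if status in FALLBACK_STATUSES:
        # Ask for a single byte so a large PDF is not downloaded in full.
        status = request(url, "GET", {"Range": "bytes=0-0"})
    return 200 <= status < 400
```

The boolean result is exactly the input a hysteresis step needs, which keeps network concerns out of the state logic.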
Rendering staleness — by design. Rot status is consumed at build time via
unsafePerformIO CAFs; archive entry pages and content pages don't have a
Hakyll dependency edge to archive-state.json (that would only fix half the
problem — the archive pages — while leaving content-link flips stale, since
Filters.Archive runs during content compilation and can't cheaply force
every content page to depend on the state). So after make archive-check,
an incremental build can leave both surfaces uniformly stale until a clean
build refreshes everything. make deploy always does make clean, which
makes the deployed site consistent. The /build/ page is the one
always-fresh surface: it recompiles every build via the existing build-stamp
dependency, so its archive metrics always reflect the current scan.
Test endpoint deferred. Spinning up archive-test.levineuwirth.org and
running it through a 14-day-spanning fail streak is a multi-week real-world
verification the author runs (or a CI cron); the hysteresis logic itself is
unit-tested deterministically in next_state, and the rendering side is
verified by the hand-crafted rotted state file.
Search-UI filter (search-filters.js) — partial. The data-side is in
place: every archive page carries data-pagefind-filter="type:archive, status:$status$", so Pagefind's filter index now distinguishes archive hits
by rot status and (when pagefind-ui is configured to show filters) lists
them as a filterable facet. The remaining work — wiring a custom UI control
into search-filters.js — is a deliberate refinement, not done in Phase 5:
its existing status filter is reserved for epistemic status (working
model / drafting / etc.) sourced from data/epistemic-meta.json, so adding an
archive status dimension needs a distinct name to avoid the collision, plus new
filter-panel buttons. Search UX is best iterated with the live page in front of
the author.
Open / deferred questions
Non-blocking, and now a short list — the draft's larger set was resolved into Decisions during review.
- JS-heavy / SPA pages. monolith cannot execute JavaScript; js-required captures are degraded. A headless-browser fallback (SingleFile, Chromium capture) would handle them but adds a heavyweight dependency. Defer until a real entry needs it.
- First-viewport thumbnails. Dropped for v1: /archive/ is a text list. A visual grid does not earn its keep at small N; revisit past ~50 entries.
- PDF section-granularity. pdftotext flattens structure. Per-page chunking (#page=N anchors, per-page text) is the realistic granularity for PDF backlinks and semantic indexing. Defer.
- Per-section "Related" UI. The paragraph-level semantic index already receives archive text; a UI surfacing section-level "Related" does not exist for any content type yet. Out of scope here; a site-wide feature.
- Snapshot versioning. v1 snapshots are immutable per snapshot; refresh replaces in place but records previous-sha256. If a referenced work is meaningfully revised, should a new dated snapshot be kept alongside the old (document-2027-01-01.pdf) with a version switcher? previous-sha256 is the seed: extend it to a list and the switcher reads it. Defer until needed.
- Intra-archive link rewriting. When archived page A links to a URL that is also archived, A's snapshot could be rewritten to point at the local copy of B, keeping the reader inside the preserved set. Gwern-style; defer.
- Media beyond PDF/HTML. EPUB, plain images, video. Out of scope for v1; type is an open enum so it can extend.
References
- WRITING.md: authoring conventions; the link-annotation feature will be documented there once Phase 3 lands
- PHOTOGRAPHY.md: the closest precedent (authored-input/generated-sidecar split, phased build, .venv-gated tools, vendored binaries)
- build/Backlinks.hs: two-pass backlinks; isPageLink is the integration point
- build/SimilarLinks.hs: "Related" block; consumes embed.py output
- tools/embed.py: embedding pipeline; archive pages join its corpus for free
- build/Patterns.hs: canonical content patterns
- build/Tags.hs: slash-hierarchy tags (reused for archive tags)
- tools/download-leaflet.sh, tools/download-pdfjs.sh: the sha256-pinning convention; monolith is committed directly rather than downloaded (a build-time executable, not a servable asset)
- nginx/popup-proxy.conf: the metadata proxy; related but distinct (caches previews, does not preserve documents)