# Archive

Design and implementation plan for the link-archiving system of levineuwirth.org.
This is the source of truth for how external references are preserved, hosted,
displayed, and indexed. It sits alongside `WRITING.md`, `PHOTOGRAPHY.md`,
`HOMEPAGE.md`, and `MARKS.md` as authoritative spec.

## Status

**Reviewed and ratified 2026-05-21, with revisions.** The original draft was
reviewed against the live site over three rounds; the decisions below
incorporate every round of deltas and are now locked.

**Phase 1 complete (2026-05-22).** PDF entries: `archive/manifest.yaml`,
`tools/archive.py` (`fetch` + `gc`), `build/Archive.hs`, the four templates,
and the Makefile / `head.html` / `.gitignore` wiring are built and verified —
`/archive/` and `/archive/nist-fips-203/` render.

**Phase 2 complete (2026-05-22).** HTML snapshots: the pinned `monolith`
binary is vendored at `tools/bin/monolith`, `archive.py fetch` snapshots HTML
pages (CSP injected, text extracted, quality classified), and `archive.html`
renders them in a sandboxed iframe — `/archive/djb-aes-speed/` renders. The
cross-browser CSP check and the per-snapshot review remain author-gated by
design.

**Archive pages styled (2026-05-22).** `static/css/archive.css` gives the
index and entry pages a framed treatment (banner callout, provenance panel,
artifact viewer); the PDF embed was changed to the raw `document.pdf`
(browser-native viewer), symmetric with HTML snapshots — see the Display — PDF
decision.

**Phase 3 complete (2026-05-22).** Link annotation + Wayback: `Filters/Archive.hs`
appends an archive affordance to body links whose target is archived;
`archive.py wayback` (+ `make archive-wayback`) backfills Wayback captures;
`visibility: private` keeps an entry's artifact in-repo but undeployed.
Bibliography annotation is documented as a `Citations.hs` follow-up.

**Phase 4 complete (2026-05-22).** Backlinks + similar-pages: `Backlinks.hs`
keeps archived external links and canonicalises them to their `/archive/<slug>/`
page, so an archived work lists every essay that cites it under "Referenced by"
(grouped by the fragment each citation targets); `archive.html` also carries a
"Related" block from the `embed.py` similarity corpus, which now indexes archive
pages and excludes the `/archive/` index.

**Phase 5 complete (2026-05-22).** Link-rot detection: `tools/archive.py check`
(+ `make archive-check`) HEAD/GET-probes every manifest URL and updates the
gitignored `data/archive-state.json` under asymmetric hysteresis (`rotted`
needs 3 fails over ≥14 days; a single success recovers immediately).
`Filters.Archive` flips a body link to the archive when its target is `rotted`;
each archive page surfaces its link status (provenance row, header note,
Pagefind `status` filter tag); `/archive/` flags rotted entries; `/build/`
gains a "Link archive" telemetry section. The search-UI `status` filter wiring
in `search-filters.js` is deliberately partial — see the Phase 5 Met note.

**All five phases done.** Refinements next; see the Phase 5 Met note for the
documented deferrals (search-UI status filter; bibliography annotation from
Phase 3; pull-from-Wayback at fetch time).

**Refinements (2026-05-22).** A code-review pass found and fixed several
correctness and posture issues across the system:

- **Missing committed artifact no longer re-fetches silently.** `cmd_fetch`
  used to skip its SHA guard when the artifact was absent and then download
  fresh bytes whose hash differed from the recorded `sha256` — replacing the
  recorded snapshot without surfacing it. The guard now also halts when
  `PROVENANCE.json` is present but the artifact is missing, requiring the
  author to restore the committed bytes before rebuilding.
- **`archive/removed.yaml` is now enforced in `fetch` and `check`.** It was
  only read by `gc`. A removed URL re-added to the manifest now halts
  `cmd_fetch` loudly; `cmd_check` skips removed URLs so the link-rot
  scanner does not keep probing a deliberate takedown.
- **SHA verification closed the `.venv`-bypass hole.** The original
  decision relied solely on `archive.py fetch` re-hashing, but that step is
  `.venv`-gated — a contributor or deploy host without `.venv`, or a direct
  `cabal run site -- build`, would publish a tampered artifact unchecked.
  `build/Archive.hs` now also re-hashes via `sha256sum` from
  `loadArchiveEntries` and halts the build on a mismatch, so the guarantee
  holds independent of the Python step.
- **Raw artifacts are no longer publicly indexable.** Pass 1 added a
  `robots.txt` `Disallow: /archive/`, which pass 2 then reverted (see
  below — it was counter-productive). Pass 1's other change — injecting
  `<meta name=robots content="noindex, noarchive">` into every new HTML
  snapshot alongside the archive CSP — remains in place; the
  deploy-side header for raw PDFs landed in pass 2 as `nginx/archive.conf`.
- **The documented `archive.py refresh {slug}` subcommand is implemented.**
  It clears the slug's directory, re-fetches via `cmd_fetch`, and records
  the prior `sha256` as `previous-sha256` in the new `PROVENANCE.json`. The
  URL-changed error message in `cmd_fetch` now points at it instead of
  asking the author to delete the directory by hand.
- **`url_aliases` widened** to the design's full equivalent-URL set:
  tracking-parameter stripping (`utm_*`, `fbclid`, `gclid`, `mc_*`, `ref`,
  `igshid`, `_hsenc`, `_hsmi`, `mkt_tok`) and arXiv abs / pdf / versioned /
  `.pdf` form expansion. Phase 1 had deliberately kept these as a Phase 4
  deferral, but Phase 4 missed the follow-through.
- **`X-Robots-Tag: noarchive` is now honoured on both HEAD and GET.** Some
  servers omit the header on HEAD but emit it on GET; HTML capture now
  aborts if either response carries the directive.
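
The `noarchive` rule in the last item can be sketched as a pure function over
already-fetched responses. This is a minimal illustration, not the code in
`tools/archive.py`: the name `forbids_archiving` and its argument shapes are
assumptions, and the `<meta name="robots">` check is included because the
Decisions table says that directive is honoured too.

```python
from html.parser import HTMLParser


class _RobotsMeta(HTMLParser):
    """Collect the content attribute of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() == "robots":
            self.directives.append(attr.get("content", ""))


def _has_noarchive(value):
    # "noindex, noarchive" and "googlebot: noarchive" both split into tokens.
    tokens = value.lower().replace(":", ",").split(",")
    return any(token.strip() == "noarchive" for token in tokens)


def forbids_archiving(head_headers, get_headers, body=""):
    """Return True if either response, or the document itself, says noarchive.

    Hypothetical helper: header mappings are plain dicts of name -> value.
    """
    for headers in (head_headers, get_headers):
        for name, value in headers.items():
            if name.lower() == "x-robots-tag" and _has_noarchive(value):
                return True
    parser = _RobotsMeta()
    parser.feed(body)
    return any(_has_noarchive(d) for d in parser.directives)
```

Checking both header sets in one place keeps the "either response" rule
explicit: a server that stays silent on HEAD cannot slip past the check.
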

Three smaller items remain documented and deferred:

- **Archive tags joining the site-wide tag indexes.** `manifest.yaml`'s
  `tags:` is authored but `Tags.hs`/`Patterns.tagIndexable` does not yet
  ingest archive entries — it needs a Tags.hs-side integration with its
  own design pass (archive pages aren't `match`ed Hakyll items in the
  normal way).
- **`archive.py suggest`** (bibliography discovery — diff `.bib` URLs
  against the manifest) is documented but not implemented.
- **The controlled-host end-to-end link-rot test** (reserve
  `archive-test.levineuwirth.org`, run it through a 14-day-spanning fail
  streak, watch the flip happen) is inherently a multi-week real-world
  verification the author runs; the hysteresis logic is unit-tested
  deterministically and the rendering side is verified by a hand-crafted
  `rotted` state file.

**Refinements pass 2 (2026-05-23).** A second code-review pass surfaced
correctness gaps the first pass missed:

- **`refresh` is now atomic.** It used to delete the slug directory and
  then call `cmd_fetch`; a failed re-fetch left the entry with no
  snapshot at all, while `refresh` returned 0 (because `cmd_fetch`
  reports per-entry skips, not a process failure). The slug directory is
  now *renamed* to a `.refresh-backup` sibling; success removes the
  backup, any failure restores it. Verified by hiding the `monolith`
  binary and confirming the prior snapshot survives intact.
- **Invalid `visibility` values fail closed.** The `ManifestEntry` parser
  used to accept any string and only treat the exact `"private"` as
  private — a typo like `privte` would publish a work the author intended
  to keep offline. The parser now rejects any value other than `public`
  or `private`, and `readManifest` halts the build on any parse error of
  a present file (instead of warning + returning an empty list — that
  silent-skip was for `file absent`, not `file present but corrupt`).
- **Lookup-side URL normalisation.** Alias generation alone cannot cover
  unbounded forms (arXiv versions, arbitrary tracking-parameter
  combinations). `ArchiveIndex` now normalises both index keys and
  lookup inputs through the same `normalizeUrl` (drop fragment, strip
  tracking, fold http→https, arXiv-canonicalise, trim trailing slash).
  Verified: `https://cr.yp.to/aes-speed.html`,
  `https://cr.yp.to/aes-speed.html?utm_source=mail`, and
  `http://cr.yp.to/aes-speed.html/` all match the same archived entry.
- **Raw-artifact indexing posture corrected.** The Phase-5 `robots.txt`
  `Disallow: /archive/` was counter-productive: a URL blocked by
  robots.txt can still appear in results when externally linked, and the
  Disallow also prevents compliant crawlers from reading the wrapper
  pages' `<meta name=robots>`. The Disallow is reverted; a new
  `nginx/archive.conf` snippet emits `X-Robots-Tag: noindex, noarchive`
  for the whole `/archive/` tree, which crawlers honour for any resource
  (HTML and PDF alike). The deploy vhost should `include
  snippets/archive.conf`.
- **`cmd_wayback` skips `removed.yaml`.** The eviction procedure says
  record in `removed.yaml` *before* dropping the manifest line; `fetch`
  and `check` now honour that ordering, but `wayback` did not. A removed
  entry whose manifest line was still in place could be submitted to a
  third-party archive after a takedown was recorded.
- **The shipped HTML snapshot was refreshed in the working tree** so it
  carries the noarchive meta the Phase-5 inject promises. `archive.py
  refresh djb-aes-speed` re-fetched cr.yp.to, applied
  `inject_archive_metas`, and recorded the prior SHA as `previous-sha256`.
  `archive/djb-aes-speed/{snapshot.html, PROVENANCE.json}` now reflect the
  new bytes; matching SHA is verified by `Archive.hs`. *Caveat surfaced
  in pass 3 (below): the prior snapshot was not committed at the moment
  of this refresh, so its bytes are no longer recoverable via `git log
  -S`. A pass-3 fix to `refresh` now refuses to replace an uncommitted
  prior, but the historical artifact survives — `previous-sha256`
  records a hash whose bytes this working tree cannot reproduce.*
- **The URL-changed error in `cmd_fetch`** now points at
  `archive.py refresh {slug}` instead of asking the author to delete the
  directory by hand.
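
The lookup-side normalisation described in the pass-2 list (drop fragment,
strip tracking, fold http→https, arXiv-canonicalise, trim trailing slash) can
be sketched in Python as follows. This is a hedged illustration rather than the
shipped `normalizeUrl` / `normalize_url`: the regular expression, the choice of
the `abs` form as the arXiv canonical form, and the handling of new-style arXiv
identifiers only are assumptions made for the example.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Tracking parameters named in the pass-1 refinement list.
TRACKING_EXACT = {"fbclid", "gclid", "ref", "igshid", "_hsenc", "_hsmi", "mkt_tok"}
TRACKING_PREFIXES = ("utm_", "mc_")

# Assumption: only new-style arXiv ids (YYMM.NNNNN) are canonicalised here.
ARXIV_PATH = re.compile(r"^/(?:abs|pdf)/(\d{4}\.\d{4,5})(?:v\d+)?(?:\.pdf)?$")


def _is_tracking(key):
    key = key.lower()
    return key in TRACKING_EXACT or key.startswith(TRACKING_PREFIXES)


def normalize_url(url):
    """Map equivalent spellings of a URL onto one comparison key."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    if scheme == "http":
        scheme = "https"                      # fold http -> https
    host = parts.netloc.lower()
    path = parts.path.rstrip("/")             # trim trailing slash
    # Strip tracking parameters only; every other query parameter is kept.
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if not _is_tracking(k)]
    if host in ("arxiv.org", "www.arxiv.org"):
        match = ARXIV_PATH.match(path)
        if match:                             # abs / pdf / versioned / .pdf
            host, path = "arxiv.org", "/abs/" + match.group(1)
    # The empty final component drops the fragment.
    return urlunsplit((scheme, host, path, urlencode(kept), ""))
```

Applying the same function to index keys and to lookup inputs is what makes the
three `cr.yp.to` spellings quoted above compare equal; aliases alone could not
enumerate every tracking-parameter combination.
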

Tag integration remains the one deferred refinement (it needs a Tags.hs
design pass).

**Refinements pass 3 (2026-05-23).** A third audit surfaced gaps the pass-2
fixes didn't fully close:

- **`refresh` refuses to replace an uncommitted prior snapshot.** Pass 2
  preserved a prior snapshot through *failed* re-fetches, but a *successful*
  one happily discarded uncommitted bytes — `previous-sha256` then pointed
  at a hash no `git log -S` could recover. Pass 3 shells out to `git
  ls-files` + `git diff --quiet HEAD` and refuses the refresh unless both
  the prior PROVENANCE.json and its artifact are tracked and clean.
- **`refresh` is atomic across *every* exit path.** Pass 2 handled the
  ordinary `cmd_fetch returns 0 but the artifact wasn't produced` case but
  not fatal `sys.exit`s (e.g. a `removed.yaml` conflict halting `cmd_fetch`
  mid-refresh) nor mid-refresh exceptions, and it never rolled back the
  `data/archive-index.json` rewrite. The work is now wrapped in
  `try/finally` that restores both the slug directory and the index on any
  exit path — normal failure, `SystemExit`, `KeyboardInterrupt`, or
  exception.
- **Removal enforcement now uses the same equivalence as link matching.**
  Pass 2 introduced `normalizeUrl` for incoming citations but compared
  removals as literal URL strings, so a tracking-laden manifest URL could
  bypass a takedown. Python gains `normalize_url` mirroring the Haskell
  helper, and `fetch` / `check` / `wayback` compare normalised forms.
  `cmd_fetch` additionally rejects two manifest entries whose canonical
  forms collide — that would otherwise route both under one slug.
- **`fetch_html` honours `X-Robots-Tag: noarchive` on the captured GET too.**
  Pass 1 added HEAD + ranged-GET probes, but a server can emit the header
  only on the full document response. The Python tool now downloads that
  response itself, checks its header and body directives, then passes those
  exact bytes to `monolith --base-url ... -` so the saved snapshot is not
  obtained through a second unobservable document request.
- **`nginx/archive.conf` is wired into the deploy template** and
  re-`include`s `security-headers.conf` inside its `location` block.
  `nginx/vhost.conf.example` now includes `archive.conf`; the snippet
  itself re-emits the baseline headers because nginx's `add_header` chain
  is inherited from a parent only when the current context declares *no*
  `add_header` directives — without the re-include, /archive/ would lose
  HSTS, CSP, etc.
- **Contract doc cleanups.** The Phase-5 paragraph claiming `robots.txt`
  disallows `/archive/` is reworded to acknowledge the pass-2 reversal;
  the Phase-1 checkbox claiming `Archive.hs` does not re-hash is updated
  to point at `verifyArtifactSha`; the pass-2 note about the refreshed
  djb snapshot now carries the caveat that its prior bytes were
  uncommitted and are therefore unrecoverable.

The historical `previous-sha256` value in `archive/djb-aes-speed/
PROVENANCE.json` is left in place: it is a truthful record that *a* prior
snapshot existed and what its hash was. It just is not recoverable from
git in this working tree — the pass-3 `refresh` precondition exists so
that property is never broken again.
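
The rename-then-restore pattern behind the atomic `refresh` can be sketched as
below. This is a minimal sketch under stated assumptions, not the code in
`tools/archive.py`: `refresh_atomically` and the `refetch` callback are
hypothetical names, the backup is assumed to be named `<slug>.refresh-backup`,
and the git cleanliness and SHA preconditions are left out.

```python
import shutil
from pathlib import Path


def refresh_atomically(slug_dir, index_path, refetch):
    """Replace a snapshot directory, rolling back on every failing exit path.

    `refetch` is a caller-supplied callable (hypothetical) that is expected to
    recreate `slug_dir` and may rewrite `index_path`.
    """
    slug_dir, index_path = Path(slug_dir), Path(index_path)
    backup = slug_dir.with_name(slug_dir.name + ".refresh-backup")
    # Remember the index exactly as it was, including "did not exist".
    prior_index = index_path.read_bytes() if index_path.exists() else None

    slug_dir.rename(backup)            # the prior snapshot is moved, not deleted
    succeeded = False
    try:
        refetch()
        # A clean return is not enough: the artifact must actually be there.
        succeeded = slug_dir.is_dir() and any(slug_dir.iterdir())
    finally:
        # Runs for normal failure, SystemExit, KeyboardInterrupt, exceptions.
        if succeeded:
            shutil.rmtree(backup)
        else:
            shutil.rmtree(slug_dir, ignore_errors=True)
            backup.rename(slug_dir)
            if prior_index is None:
                index_path.unlink(missing_ok=True)
            else:
                index_path.write_bytes(prior_index)
    return succeeded
```

Putting the rollback in `finally` rather than in an `except` clause is the
point of the pass-3 fix: `SystemExit` and `KeyboardInterrupt` are not
subclasses of `Exception`, so an `except Exception` handler would miss them.
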

**Refinements pass 4 (2026-05-23).** A fourth audit completed the
failure-closed paths:

- **Direct Hakyll builds now enforce removals and missing-artifact failures.**
  `Archive.hs` reads `removed.yaml`, rejects normalized manifest conflicts
  and duplicate archive targets, and aborts if provenance exists without its
  artifact. `ArchiveIndex.hs` filters the generated index through the live
  manifest minus normalized removals, so a stale ignored index cannot retain
  archive affordances after a takedown when `archive.py` was skipped.
- **`refresh` verifies the prior bytes before replacing them.** A prior
  snapshot must now be present, tracked, clean, and match its recorded
  SHA-256 before its hash can be written into `previous-sha256`.
- **Failed refresh restores an originally-absent index state.** If
  `data/archive-index.json` did not exist before a failed refresh, any index
  created by the attempted fetch is deleted during rollback.

The genuinely-open questions that remain are collected at the end — the list is
short.

---

## Motivation

The site cites external work — papers, articles, blog posts, documentation.
Three things go wrong with a plain hyperlink over time:

1. **Link rot.** The target moves, paywalls, or vanishes. A 2019 essay's
   citations decay silently; nobody notices until a reader clicks.
2. **Content drift.** The target stays up but changes. The sentence you quoted
   is no longer the sentence at that URL.
3. **Opacity to the site's own machinery.** An external link is invisible to
   `Backlinks.hs` (`isPageLink` drops every `http(s)://` URL) and to
   `embed.py` (it indexes only `_site/**/*.html`). The site knows nothing about
   the things it most often points at. A paper cited by six essays has no page,
   no backlinks list, no place in any "Related" set.

The archive fixes all three by keeping a **local, hosted, immutable snapshot**
of each referenced work, giving it a stable URL on this domain, and making that
URL a first-class citizen of the existing backlinks and similar-pages systems.

This is deliberately *not* a general web crawler. It archives a curated set:
the things this site references. The author adds a URL to a manifest; the build
does the rest.

### Relationship to existing pieces

| Existing piece | What it does | Why the archive is different |
|----------------|--------------|------------------------------|
| `static/papers/` | Hosts Levi's **own** typeset PDFs (`preprint:`, `{{pdf:}}`) | The archive holds **third-party** works. Distinct directory, distinct purpose. Never conflate the two. |
| nginx `popup-proxy.conf` | Caches **metadata** (title/abstract) from arXiv / archive.org / PubMed for hover previews | Caches structured metadata, not documents. A preview accelerator, not preservation. |
| `Backlinks.hs` | Inverts **internal** links into a "who links here" map | Indexes site content only; external URLs are dropped. The archive makes referenced works internal enough to index. |
| `embed.py` / `SimilarLinks.hs` | Semantic "Related" block from `_site/**/*.html` embeddings | Only sees site pages. Archived works become site pages, so they enter the embedding corpus for free. |

---

## Goals

- **Preservation.** Every referenced work the author chooses to archive has a
  byte-for-byte local snapshot that survives the original going dark.
- **Stable hosting.** Each snapshot is reachable at a permanent
  `/archive/{slug}/` URL on levineuwirth.org, rendered in site chrome.
- **Hyperlink-able.** Archive URLs are ordinary internal links: usable in
  prose, wikilinks, citations, and `further-reading`.
- **Indexed.** Archived works appear in the **backlinks** ("Referenced by") and
  **similar-pages** ("Related") systems exactly as native content does — and,
  where the source structure allows, granularly by section.
- **Curated, low-friction.** Adding an archive is one line in one manifest.
  Everything else — fetch, text extraction, page generation, indexing — is
  automatic and build-time.
- **Static-friendly.** Every archive page renders at build time; JS is layered
  on, never required. Matches the rest of the site's contract.
- **Honest.** Archive pages never impersonate the original. They are framed as
  archived copies, link prominently to the source, are kept out of search
  engines, and carry a real, advertised removal channel on every page.
- **Safe by default.** No build step ever deletes or overwrites a committed
  artifact; destruction and replacement are always explicit, opt-in acts.

---

## Decisions (locked)

| Topic | Decision | Rationale |
|-------|----------|-----------|
| Trigger | Curated manifest, not auto-crawl | Archives what the site *references*, not the web. Legally and operationally sane. |
| Authored input | One hand-edited file: `archive/manifest.yaml` | One line per archived link. Mirrors `data/commonplace.yaml`'s authoring model. |
| Bibliography seeding | **Rejected** as auto-seeding. `make archive-suggest` prints a "cited but not archived" diff; the author copies lines by hand. | Keeps the manifest the *identity* of the archive, not a cache of the `.bib` files. |
| Per-entry provenance | `archive/{slug}/PROVENANCE.json`, committed — immutable for the current snapshot | An immutability claim that isn't in version control isn't immutable. |
| Mutable state | `data/archive-state.json`, gitignored — link-rot status only | Strict split: immutable facts committed, volatile status disposable. |
| Hakyll input | `data/archive-index.json` — `url` + aliases → slug, written by the tool | Minimal stable shape for the Haskell side; treated like `data/annotations.json`. |
| Missing-index behaviour | `Backlinks.hs` and `Filters/Archive.hs` silently no-op when `archive-index.json` is absent | Preserves the established `.venv`-gated silent-skip convention. The archive degrades to invisible, never to an error. |
| `fetch` idempotence | `fetch` is keyed on `(slug, url)` together; a slug whose recorded URL has changed is refused, not overwritten. `fetch` always rewrites `archive-index.json` to mirror the manifest. | A committed artifact is replaced only by an explicit `refresh`, never as a `fetch` side effect. |
| Artifact storage | `archive/{slug}/` at repo root, **committed to git** | A preservation guarantee that depends on an un-versioned store is weaker. Repo stays reproducible. |
| Per-artifact size cap | 25 MB; `archive.py fetch` warns and skips above it; `git add -f` to override deliberately | A 200 MB scan must never land in an auto-commit silently. |
| Storage migration | If `archive/` exceeds ~5 GB or doubles year-over-year, evaluate a separate archive repo / object store. **Never git LFS.** | LFS breaks `git clone → make build` reproducibility — a regression for a preservation system. |
| HTML snapshots | `monolith -j` → one self-contained HTML file; the pinned `monolith` binary is committed at `tools/bin/monolith` | Single static binary, no headless browser. Strips JS. Committing it (vs downloading) removes a network dependency and keeps the build reproducible from a bare clone. |
| PDF snapshots | Direct download via `requests` | Papers are usually clean PDF URLs (arXiv etc.). |
| Display — PDF | The raw `document.pdf` in an `<iframe>` — the browser's native PDF viewer renders it | A hyperlinked archive should display the document exactly as it is. Symmetric with the HTML snapshot (both embed the raw artifact); no PDF.js wrapper. `static/pdfjs/` stays vendored for the site's own `{{pdf:}}` embeds. |
| Display — HTML | Snapshot in a sandboxed `<iframe>` (`referrerpolicy="no-referrer"`, no `allow-scripts`) + CSP `<meta>` baked into the snapshot + extracted text in the wrapper | Sandbox isolates markup; CSP is defense-in-depth; no-referrer stops leaking the reading path; extracted text feeds indexing. |
| Snapshot quality | Recorded per entry (`ok` / `degraded` / `js-required`); degraded snapshots flagged on `/archive/` and `/build/` | `monolith` fails quietly on lazy-loaded images and SPAs; silent degradation is the enemy. |
| Index thumbnails | **Dropped for v1.** `/archive/` is a text list. | At v1 scale a text list is faster to scan and to build than a thumbnail grid; revisit past ~50 entries (it is deferred capability, not a rejected one). |
| Second archive | Submit every URL to the Wayback Machine — **non-blocking**; record the URL when it returns, backfill via `make archive-wayback` | Belt-and-suspenders, never on the critical path of a build. |
| URL scheme | `/archive/{slug}/` | Permanent, human-readable, internal. |
| URL matching | `archive-index.json` carries each entry's equivalent-URL aliases; **only tracking parameters** are stripped, other query parameters preserved; backlinks match any alias | Without it, "Referenced by" silently under-counts; blanket query stripping would over-match. |
| Homepage portal | No | Infrastructure, not a content section. Reachable from `/archive/`, `/colophon`, footer. |
| Search engines | `noindex` on every archive page | Preserving, not republishing or competing with originals. |
| `robots.txt` | Not gated: a curated single-shot fetch of an already-cited URL is not crawling. But honour `X-Robots-Tag: noarchive` and `<meta name="robots" content="noarchive">`; skip anything behind authentication. | Matches Save-Page-Now / reference-manager norms. The load-bearing ethic is the removal channel, not `robots.txt`. |
| Removal channel | A request to `ln@levineuwirth.org` is honoured; advertised on `/archive/`, on **every archive page**, and in the fetcher's User-Agent string | This is the real ethical commitment `robots.txt` only proxies for. |
| Pagefind | Archived full text is indexed, tagged by `type: archive` and by link-rot `status` | Searching everything you've cited is a feature; the tags let results be filtered or excluded. |
| Visibility levels | `public` (default) / `private` | `private` keeps the artifact in-repo but undeployed, for content not safe to redistribute. |
| Paywalled originals | A manual `paywalled: true` manifest flag — **not** an automated scanner state. Soft paywalls return `200` and cannot be reliably detected. | Drives a banner note only, never a link flip. |
| Eviction | Opt-in `make archive-gc`, **never part of `make build`**. Procedure: record in `removed.yaml` *first*, then drop the manifest line, then GC. GC deletes only slugs listed in `removed.yaml`. | A rename, branch-switch, or typo'd manifest edit must not silently eat committed artifacts. |
| Snapshot mutability | Immutable for the current snapshot; `archive.py refresh` deliberately replaces it | A stable citation target must not move under readers — except by an explicit act. |
| Rot hysteresis | Asymmetric: `rotted` requires 3 consecutive failed scans over ≥ 14 days; one failure is `error`. Recovery is immediate — a single success → `live`. | A transient failure must not flip a live citation; a recovered original should be reached eagerly, so un-rotting needs no delay. |
| SHA verification | Both `archive.py fetch` *and* `build/Archive.hs` re-hash every committed artifact against `PROVENANCE.json` and halt non-zero on a mismatch. `archive.py` runs first in `make build`; `Archive.hs` shells out to `sha256sum` from `loadArchiveEntries`, so the integrity guarantee holds even when `archive.py` did not run (no `.venv`, a direct `cabal run site -- build`, or a deploy host that bypasses `make build`). | The original "Python tool is the sufficient enforcement point" assumption was unsafe: the Python step is `.venv`-gated, and a contributor or deploy without it could publish a tampered artifact unchecked. Two enforcement points cost a `sha256sum` call per entry and close the hole. |

---

## Content model & directory structure

```
archive/
├── manifest.yaml                 # AUTHORED — the curated list of links
├── removed.yaml                  # AUTHORED — record of evicted entries
├── arxiv-2403-12345/
│   ├── document.pdf              # the snapshot (committed)
│   ├── PROVENANCE.json           # immutable archival facts (committed)
│   ├── document.txt              # extracted text (gitignored, regenerated)
│   └── document.txt.sha256       # artifact SHA the .txt was built from (gitignored)
├── gwern-net-scaling-hypothesis/
│   ├── snapshot.html             # self-contained monolith snapshot (committed)
│   ├── PROVENANCE.json           # immutable archival facts (committed)
│   ├── snapshot.txt              # extracted readable text (gitignored)
│   └── snapshot.txt.sha256       # artifact SHA the .txt was built from (gitignored)
└── ...
```

- `archive/` is a top-level directory, sibling to `content/`, `static/`, and
  `data/` — **not** under `content/`. Files in `content/` are author-written
  Markdown processed by Pandoc; `archive/` holds raw third-party artifacts plus
  the manifest and provenance.
- One directory per entry, keyed by **slug**.
- Committed: the artifact (`document.pdf` / `snapshot.html`) — the preservation
  payload — and `PROVENANCE.json` — the immutable record of the archival event.
- Gitignored: the regenerable extracted text (`*.txt`) and its staleness stamp
  (`*.txt.sha256`) — deterministic from the committed artifact, so committing
  them is pure churn. This mirrors the photography sidecar and `*.webp`
  companion rules already in `.gitignore`.
- `make build`'s auto-commit stages `content/` **only**. Changes under
  `archive/` (new artifacts, `PROVENANCE.json`, manifest edits) are committed
  **deliberately by the author**. This is a feature, not a gap: it is the
  eyeball-before-commit checkpoint where a degraded snapshot gets caught.

### Authored input — `archive/manifest.yaml`

The **only** file the author edits for normal operation. Adding an archive =
adding one list item. Minimum is a bare `url:`; everything else is optional or
auto-derived.

```yaml
# archive/manifest.yaml — curated list of works to preserve.
# Edited by hand. Tools never write to this file.
# Per-artifact cap: 25 MB. Above that, archive.py warns and skips the fetch;
# commit an oversize artifact deliberately with `git add -f`.
# To evict an entry, see archive/removed.yaml — record there FIRST, then
# delete the line here, then run `make archive-gc`.

- url: "https://arxiv.org/abs/2403.12345"
  # slug: auto-derived → arxiv-2403-12345 (override only to disambiguate)
  # title: auto-derived from the artifact / popup-proxy metadata
  # type: auto-detected (pdf | html)
  tags: [research/ml]    # optional — same slash-hierarchy as content
  note: >                # optional — why this is referenced
    Cited in the scaling-laws essay; section 4 is the load-bearing part.

- url: "https://www.gwern.net/Scaling-hypothesis"
  type: html             # optional override when detection is wrong
  visibility: public     # public (default) | private

- url: "https://example.com/paywalled-report"
  paywalled: true        # author-set; the original sits behind a paywall
  visibility: private    # archived for the author; artifact not deployed
```

| Field | Required | Notes |
|-------|----------|-------|
| `url` | yes | The original URL. The identity of the entry. |
| `slug` | no | Override the auto-derived slug. Must be unique. |
| `title` | no | Override the auto-derived title. |
| `type` | no | `pdf` \| `html`. Auto-detected from `Content-Type` / extension. |
| `tags` | no | Slash-hierarchy tags (`Tags.hs`). Intended to place the work on tag indexes; ingestion by `Tags.hs` is still deferred (see Status). |
| `note` | no | Author's reason for archiving; shown on the archive page. |
| `visibility` | no | `public` (default) or `private`. Any other value is rejected and halts the build. |
| `paywalled` | no | Author-set flag: the original is gated. Declared, not inferred — no reliable automated detection exists. Drives a banner note only. |
| `source-date` | no | Publication date of the original, if known. |

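
The fail-closed treatment of manifest fields can be illustrated with a small
validator over an already-parsed entry. This is only a sketch: the real check
lives in the Haskell `ManifestEntry` parser, and the names `validate_entry` and
`ManifestError`, as well as the `type` and `paywalled` checks, are assumptions
made for the example.

```python
ALLOWED_VISIBILITY = {"public", "private"}
ALLOWED_TYPES = {"pdf", "html"}


class ManifestError(ValueError):
    """Raised for any entry the build must not proceed with (hypothetical)."""


def validate_entry(entry):
    """Return the entry with defaults filled in, or raise ManifestError."""
    if not isinstance(entry, dict):
        raise ManifestError("manifest entry must be a mapping")
    url = entry.get("url")
    if not isinstance(url, str) or not url.strip():
        raise ManifestError("manifest entry needs a non-empty url")

    # Fail closed: a typo such as "privte" must never fall back to public.
    visibility = entry.get("visibility", "public")
    if not isinstance(visibility, str) or visibility not in ALLOWED_VISIBILITY:
        raise ManifestError(f"{url}: invalid visibility {visibility!r}")

    kind = entry.get("type")
    if kind is not None and kind not in ALLOWED_TYPES:
        raise ManifestError(f"{url}: invalid type {kind!r}")

    paywalled = entry.get("paywalled", False)
    if not isinstance(paywalled, bool):
        raise ManifestError(f"{url}: paywalled must be true or false")

    return {**entry, "visibility": visibility, "paywalled": paywalled}
```

A bare `{"url": ...}` entry passes and comes back with `visibility: public`,
while `visibility: privte` raises instead of silently publishing the work.
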

### Per-entry provenance — `archive/{slug}/PROVENANCE.json`

Committed alongside the artifact. Written by `tools/archive.py fetch` and then
stable for the lifetime of that snapshot — `wayback` is the one field backfilled
later (by `make archive-wayback`).

**"Immutable" means immutable for the *current* snapshot, not forever.**
`archive.py refresh` deliberately re-snapshots an entry and **replaces** both
the artifact and its `PROVENANCE.json` (new `sha256`, new `archived` date),
moving the old `sha256` into `previous-sha256`. A refresh is a conscious act;
absent one, the file does not change.

`PROVENANCE.json` holds the facts that make the archival claim verifiable.
`tools/archive.py fetch` re-hashes every present artifact against the recorded
`sha256` on every run, *before* the Hakyll build, and **exits non-zero on a
mismatch, halting `make build`**. Because that step is `.venv`-gated,
`build/Archive.hs` repeats the check: the Haskell toolchain carries no SHA-256
library, so `loadArchiveEntries` shells out to `sha256sum` and halts the build
on a mismatch. The guarantee therefore holds even when `archive.py` did not
run. An entry with no `PROVENANCE.json` yet is skipped; provenance without its
artifact aborts the build (see Refinements pass 4).

```json
{
  "url": "https://arxiv.org/abs/2403.12345",
  "slug": "arxiv-2403-12345",
  "title": "Scaling Laws for Neural Language Models",
  "type": "pdf",
  "artifact": "document.pdf",
  "sha256": "9f86d0818884...",
  "previous-sha256": null,
  "bytes": 2317004,
  "archived": "2026-05-21",
  "source-date": "2024-03-15",
  "snapshot-quality": "ok",
  "wayback": "https://web.archive.org/web/20260521.../https://arxiv.org/abs/2403.12345"
}
```

`previous-sha256` is `null` on first fetch and set by `refresh` to the
immediately-prior snapshot's hash, so the last prior snapshot is reachable
(via `git log -S`) without deeper archaeology. `PROVENANCE.json` lives **with
the artifact**, not in a rolling global file, so the immutable claim is
genuinely immutable in git history.
|
|
|
|
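The integrity check described in this section can be sketched as follows. This is a minimal illustration, not the actual `archive.py` code: the function names `sha256_of`, `verify_entry`, and `verify_all` are hypothetical, and the only fields assumed are the `artifact` and `sha256` keys shown in the example above.

```python
import hashlib
import json
from pathlib import Path
from typing import List, Optional


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so a large PDF is never read in one piece."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_entry(entry_dir: Path) -> Optional[str]:
    """Return an error message for a mismatched entry, or None if it is fine.

    An entry lacking either the provenance file or the artifact is skipped,
    mirroring how Archive.hs ignores an incomplete pair.
    """
    prov_path = entry_dir / "PROVENANCE.json"
    if not prov_path.is_file():
        return None
    prov = json.loads(prov_path.read_text(encoding="utf-8"))
    artifact = entry_dir / prov["artifact"]
    if not artifact.is_file():
        return None
    actual = sha256_of(artifact)
    if actual != prov["sha256"]:
        return f"{entry_dir.name}: sha256 mismatch (actual {actual[:12]})"
    return None


def verify_all(archive_root: Path) -> List[str]:
    """Collect a mismatch message for every entry directory under archive_root."""
    errors = []
    for entry_dir in sorted(p for p in archive_root.iterdir() if p.is_dir()):
        message = verify_entry(entry_dir)
        if message:
            errors.append(message)
    return errors
```

A caller would print the returned messages to stderr and exit non-zero when the list is non-empty, which is what halts `make build`.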
### Mutable state — `data/archive-state.json`

Written **only** by `tools/archive.py check`. Holds the volatile link-rot
status, keyed by URL. Gitignored (`data/` generated files already are); a fresh
clone simply rebuilds it on the next scan. Until a scan has run, every entry
renders as the safe default (`live`, no link flip).

```json
{
  "https://arxiv.org/abs/2403.12345": {
    "status": "live",
    "checked": "2026-05-21",
    "consecutive-failures": 0,
    "status-since": "2026-05-21"
  }
}
```

`status` ∈ `live` / `moved` / `rotted` / `error` — set by the scanner.
(`paywalled` is *not* here: it is a manual manifest flag, not a scanner state.)
`consecutive-failures` + `status-since` implement the rot hysteresis (Phase 5).

### Hakyll input — `data/archive-index.json`

A small map written by `tools/archive.py fetch`, consumed inside the Hakyll
build by `Backlinks.hs` and the link-annotation filter. **`fetch` always
rewrites this file to mirror the current manifest exactly** — whether or not any
network I/O occurred — so an entry un-listed from the manifest (even without a
GC) immediately stops being treated as archived, and `Backlinks.hs` never keeps
writing backlinks toward a slug whose page no longer exists. The index is cheap
to recompute (manifest + provenance, no network) and must never lag the
manifest. Kept separate from `archive-state.json` so the Haskell side loads a
minimal, stable shape; treated exactly like the existing `data/annotations.json`
build input.

```json
{
  "https://arxiv.org/abs/2403.12345": {
    "slug": "arxiv-2403-12345",
    "type": "pdf",
    "title": "Scaling Laws for Neural Language Models",
    "aliases": [
      "http://arxiv.org/abs/2403.12345",
      "https://arxiv.org/abs/2403.12345v1",
      "https://arxiv.org/abs/2403.12345v2",
      "https://arxiv.org/pdf/2403.12345",
      "https://arxiv.org/pdf/2403.12345.pdf"
    ]
  }
}
```

`aliases` is the equivalent-URL set (see URL matching, under Backlinks). The
Haskell side flattens it into an `alias → entry` lookup on load.

**When `archive-index.json` is absent** — `.venv` not set up, or `archive.py`
has never run — it is treated as empty: `Backlinks.hs` and `Filters/Archive.hs`
silently no-op, and the build succeeds unchanged. This is the same
`.venv`-gated silent-skip convention used by `embed.py` and the photography
extractors. (This exact phrasing recurs below; it is the canonical statement of
the property.)

### Eviction & removal

Removing an archived work is a first-class, supported operation — a takedown
request, an author request, a legal concern, or a quality cull will arrive, and
probably before the system is mature. The cardinal rule: **no build step ever
deletes a committed artifact.** Deletion is opt-in and explicit.

Procedure (documented in the `manifest.yaml` header comment), in order:

1. **Record the removal in `archive/removed.yaml` first** — before touching the
   manifest:

   ```yaml
   - url: "https://example.com/withdrawn-article"
     slug: example-com-withdrawn-article
     removed: 2026-06-01
     reason: takedown          # takedown | author-request | legal | quality
     note: "DMCA from X; see archived email."
   ```

   | Field | Required | Notes |
   |-------|----------|-------|
   | `url` | yes | The original URL (matches the manifest URL at time of removal) |
   | `slug` | yes | The slug whose `archive/{slug}/` directory `make archive-gc` is authorized to delete |
   | `removed` | yes | ISO date of removal |
   | `reason` | yes | Closed enum: `takedown` \| `author-request` \| `legal` \| `quality` |
   | `note` | no | Free-text context |

2. Delete the entry's line from `manifest.yaml`.
3. Run `make archive-gc` (opt-in; **never** invoked by `make build`). It deletes
   only `archive/{slug}/` directories whose slug is recorded in `removed.yaml`.
   A directory orphaned by a rename, a branch switch, or a typo'd manifest edit
   — i.e. *not* in `removed.yaml` — is **never deleted**; it is reported to
   stderr with its slug and a one-line hint, and `gc` exits non-zero while any
   orphan is present (`--ignore-orphans` suppresses the non-zero exit once the
   author has consciously reviewed them). The author commits the deletion.

An orphaned `archive/{slug}/` directory (manifest line gone, not yet GC'd) is
inert in the meantime: `Archive.hs` generates pages and routes artifacts only
for current `manifest.yaml` entries, so an orphan produces no page and is not
deployed.

`removed.yaml` is **not** a hostile-tracking list. It exists so that (a)
`make archive-gc` knows exactly what is safe to delete, (b) re-adding a removed
URL to the manifest is surfaced loudly at build time, (c) the link-rot scanner
skips removed entries instead of probing them forever, and (d) `make
archive-suggest` never re-suggests a deliberately-removed work. A removed URL
still cited from a site page falls back to the original-only link: no archive
affordance, no backlink canonicalization.

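The deletion rule for `make archive-gc` can be sketched as a pure planning step that touches nothing on disk. The names `plan_gc` and `gc_exit_code` are hypothetical, and one choice here is an assumption not stated in the spec: a slug that appears in both `manifest.yaml` and `removed.yaml` is treated as live and left alone, which is the conservative reading of "no build step ever deletes a committed artifact".

```python
from pathlib import Path
from typing import List, Set, Tuple


def plan_gc(
    archive_root: Path, manifest_slugs: Set[str], removed_slugs: Set[str]
) -> Tuple[List[Path], List[Path]]:
    """Split entry directories into (deletable, orphans) without deleting anything.

    deletable: slug recorded in removed.yaml and absent from the manifest
    orphans:   slug in neither file; reported to the author, never deleted
    """
    deletable: List[Path] = []
    orphans: List[Path] = []
    for entry_dir in sorted(p for p in archive_root.iterdir() if p.is_dir()):
        slug = entry_dir.name
        if slug in manifest_slugs:
            continue  # live entry (assumption: the manifest wins over removed.yaml)
        if slug in removed_slugs:
            deletable.append(entry_dir)
        else:
            orphans.append(entry_dir)
    return deletable, orphans


def gc_exit_code(orphans: List[Path], ignore_orphans: bool = False) -> int:
    """Non-zero while any orphan is present, unless the author overrides."""
    return 1 if orphans and not ignore_orphans else 0
```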

---

## Routing & generated pages

| URL | Source | Notes |
|-----|--------|-------|
| `/archive/` | Generated from `manifest.yaml` | Index of all archived works; text list, filter by type, tag, status |
| `/archive/{slug}/` | Generated per manifest entry | The archive page — wrapper chrome + embedded snapshot |
| `/archive/{slug}/document.pdf` | `archive/{slug}/document.pdf` | Raw artifact, copied through unchanged |
| `/archive/{slug}/snapshot.html` | `archive/{slug}/snapshot.html` | Raw HTML snapshot, copied through unchanged |
| `/archive/{tag}/` | Existing `Tags.hs` | Archive entries with tags join the normal tag indexes |

`PROVENANCE.json` is build input, not a routed page — it is consumed by
`Archive.hs`, not served (the archive page surfaces the relevant fields).

Slugs are auto-derived as `{domain-stem}-{path-slug}`, truncated, with a short
hash appended on collision (`arxiv-2403-12345`, `gwern-net-scaling-hypothesis`).
`slug:` in the manifest overrides.

`/archive/` is **not** a homepage portal — it is infrastructure. It is reachable
from `/colophon` (where the site explains its own machinery), from the footer's
infrastructure links, and optionally as a shelf on `/library.html`. The
`/archive/` page also carries the removal-request notice.

---

## The archive page

`/archive/{slug}/` is a **wrapper**: site chrome around a preserved artifact.
Top to bottom:

1. **Archive banner.** An unmissable strip: "Archived copy — snapshot taken
   2026-05-21. View the original ↗". The original URL is the most prominent
   link on the page. The page never pretends to be the source.
2. **Metadata block.** Title, original URL, archive date, source publication
   date, content hash (short form), file size, snapshot quality, the author's
   `note`, the Wayback Machine link, and current link-rot `status`.
3. **The artifact.**
   - **PDF** — the raw `document.pdf` embedded in an `<iframe>`, rendered by
     the browser's native PDF viewer. Deliberately *not* the site's PDF.js
     viewer: a hyperlinked archive should display the document as it is.
   - **HTML** — the `monolith` snapshot loaded in a sandboxed `<iframe>`:
     `sandbox` without `allow-scripts` (JS already stripped at fetch time) and
     `referrerpolicy="no-referrer"` (so a click inside the snapshot does not
     leak `levineuwirth.org/archive/...` — and which essay the reader came
     from — to the original site). The snapshot file itself carries a
     restrictive `Content-Security-Policy` `<meta>` tag, injected at fetch time,
     as defense-in-depth (see Fetch pipeline).
4. **Full text.** The extracted readable text (`document.txt` / `snapshot.txt`)
   rendered into the DOM — collapsed in a `<details>` for PDFs, inline for HTML.
   This block is the load-bearing one for indexing: `embed.py` and Pagefind see
   text, not an opaque iframe. It also gives readers a fast, styled, dark-mode
   reading path that does not depend on the original's markup.
5. **Referenced by.** The backlinks list — every site page that cites this work.
   (See Backlinks integration.)
6. **Related.** The similar-pages list — semantically near content, site pages
   and other archives alike. (See Similar-pages integration.)

A removal-request line — the `partials/archive-removal-notice.html` partial,
carrying `ln@levineuwirth.org` — is included on **every** archive page and on
`/archive/`. It is its own partial, included directly by `archive.html` and
`archive-index.html`; the site-wide `page-footer.html` is *not* touched.

The page carries `<meta name="robots" content="noindex">`. The `head.html`
partial currently has no robots hook; adding a `noindex` context flag is part
of Phase 1.

---

## Fetch & snapshot pipeline

`tools/archive.py` — a Python tool, gated on `.venv`, silent-skip when absent,
matching the established `embed.py` / `extract-exif.py` pattern. Subcommands:

- `archive.py fetch` — for every manifest URL without an artifact: download it,
  detect the type, store it, extract text, write `PROVENANCE.json`. Always
  rewrites `archive-index.json` to mirror the manifest (see below). Records
  `wayback: null` (filled in later). Incremental — only URLs without an
  artifact incur network I/O.
- `archive.py wayback` — submit URLs whose `PROVENANCE.json` has `wayback: null`
  to the Wayback Machine; backfill the returned URL. (`make archive-wayback`)
- `archive.py check` — the link-rot scan. (`make archive-check`, Phase 5)
- `archive.py suggest` — scan `data/*.bib` for `url` and `doi` fields; a
  DOI-only entry is resolved to its `https://doi.org/{doi}` form. Prints a diff
  of works cited but not yet in `manifest.yaml`, **excluding any URL already in
  `archive/removed.yaml`** — a deliberately-removed work is never re-suggested.
  (`make archive-suggest`)
- `archive.py gc` — delete `archive/{slug}/` directories whose slug is recorded
  in `removed.yaml`. Orphan directories (not in `manifest.yaml`, not in
  `removed.yaml`) are never deleted: each is reported to stderr with its slug
  and a one-line hint, and `gc` exits non-zero while any orphan is present
  (`--ignore-orphans` to override). (`make archive-gc`)
- `archive.py refresh {slug}` — deliberately re-snapshot one entry, replacing
  both the artifact and its `PROVENANCE.json`; the prior `sha256` is written to
  `previous-sha256` and printed.

### `fetch` is keyed on `(slug, url)` together

If a slug's directory already exists and its `PROVENANCE.json` records a
*different* URL than the manifest now gives — the author edited a URL but kept
the slug — `fetch` **refuses to overwrite** the committed artifact. It prints
`URL changed for {slug}: run 'archive.py refresh {slug}' to re-snapshot` and
leaves the entry untouched. Overwriting a committed artifact is always an
explicit act (`refresh`), never a side effect of `fetch` — the same principle
as GC requiring `removed.yaml`.

Regardless of whether any artifact was fetched, `fetch` finishes by rewriting
`data/archive-index.json` from the current manifest + provenance, so the index
can never lag a manifest edit.

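The `(slug, url)` rule reduces to a three-way decision per manifest entry. The sketch below is illustrative only: `fetch_action` is a hypothetical name, and comparing the two URLs as exact strings (rather than after normalization) is an assumption.

```python
from typing import Optional


def fetch_action(slug: str, manifest_url: str, recorded_url: Optional[str]) -> str:
    """Decide what fetch does for one manifest entry.

    recorded_url is the "url" field of archive/{slug}/PROVENANCE.json,
    or None when the slug has no provenance file yet.
    Returns "fetch", "skip", or "refuse".
    """
    if recorded_url is None:
        return "fetch"  # nothing committed yet, so downloading is safe
    if recorded_url == manifest_url:
        return "skip"  # same (slug, url) pair: no network I/O needed
    # Same slug, different URL: overwriting is reserved for `refresh`.
    print(f"URL changed for {slug}: run 'archive.py refresh {slug}' to re-snapshot")
    return "refuse"
```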
### PDF

Direct download via `requests`, with a per-request timeout and the size cap
(25 MB; warn + skip above). User-Agent:
`levineuwirth.org/archive (ln@levineuwirth.org; removal requests honored)`.
Stored as `document.pdf`; text extracted with `pdftotext`.

### HTML

`monolith -j {url}` produces a single self-contained HTML file: CSS, images,
and fonts inlined as data URIs, JavaScript stripped (`-j`).

`monolith` is a single static Rust binary — no headless browser. Unlike Leaflet
and PDF.js (servable assets fetched at build time and gitignored), `monolith` is
a build-time **executable**: the pinned linux-x86_64 binary is **committed** at
`tools/bin/monolith`, with its version and sha256 recorded in
`tools/monolith-version.txt`. Committing it removes a network dependency from
`make build` and keeps the archive pipeline reproducible from a bare clone.
(If the build host ever changes architecture, re-vendor the matching binary.)

After capture, `archive.py` injects a CSP `<meta>` into the snapshot's `<head>`:

```html
<meta http-equiv="Content-Security-Policy"
      content="default-src 'none'; img-src data:;
               style-src 'unsafe-inline'; style-src-elem 'unsafe-inline';
               style-src-attr 'unsafe-inline'; font-src data:;
               script-src 'none'; object-src 'none'; frame-src 'none'">
```

`monolith` inlines images and fonts as data URIs, and inlines styles both as
`<style>` elements *and* as inline `style=""` attributes — so `style-src-elem`
and `style-src-attr` are spelled out alongside `style-src` to cover both in
browsers that honour the granular directives. `script-src 'none'` /
`object-src 'none'` / `frame-src 'none'` are explicit because `monolith` inlines
SVGs as `data:` images, and an SVG can carry a `<script>` block — the iframe
sandbox already blocks execution, but a belt-and-suspenders claim should not
rely on the sandbox alone. This CSP permits everything a correct snapshot needs
and blocks every network fetch and script a broken or malicious snapshot might
attempt. Correct rendering under this CSP is verified cross-browser as a
Phase 2 exit criterion. (An nginx `location ^~ /archive/` block may add the
header at the HTTP level too; the baked-in `<meta>` is what makes `make dev`'s
plain server safe.)

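One way the injection step could work, using only the standard library, is sketched below. `inject_csp` and the constants are hypothetical names. Two choices here are assumptions rather than documented behaviour: the tag is placed as the first child of `<head>` (a `<meta>` policy only governs content that follows it), and a snapshot with no `<head>` tag raises an error rather than being patched blindly.

```python
import re

CSP = (
    "default-src 'none'; img-src data:; "
    "style-src 'unsafe-inline'; style-src-elem 'unsafe-inline'; "
    "style-src-attr 'unsafe-inline'; font-src data:; "
    "script-src 'none'; object-src 'none'; frame-src 'none'"
)
CSP_META = f'<meta http-equiv="Content-Security-Policy" content="{CSP}">'

# Matches <head> or <head attr="..."> but not <header>.
_HEAD_OPEN = re.compile(r"<head\b[^>]*>", re.IGNORECASE)


def inject_csp(html: str) -> str:
    """Insert the CSP meta tag directly after the opening <head> tag.

    Idempotent: a snapshot that already carries this exact tag is returned
    unchanged. Raises ValueError when no <head> tag is found.
    """
    if CSP_META in html:
        return html
    match = _HEAD_OPEN.search(html)
    if match is None:
        raise ValueError("snapshot has no <head> element")
    return html[: match.end()] + CSP_META + html[match.end():]
```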

**`monolith` failure modes** — capture is not always faithful, and fails
*quietly*. Known cases: lazy-loaded images using `data-src` (common on Substack,
Medium, modern blogs) are not resolved — the snapshot looks complete but is
missing images; soft-paywalled pages (Medium, NYT) often serve full article
HTML to the fetch and gate it with a client-side overlay, so `-j` yields a
snapshot that *looks* like unauthorized access (it is not — the server sent it
— but the optics are poor); `<picture>`/`srcset` sources are inconsistently
inlined. `archive.py` therefore classifies each capture and records
`snapshot-quality` ∈ `ok` / `degraded` / `js-required` in `PROVENANCE.json`;
degraded captures are flagged on `/archive/` and `/build/`. The author reviews
the rendered snapshot before committing `archive/` (Phase 2 exit criterion). A
headless-browser fallback for `js-required` pages is deferred — see Open
questions.

### Wayback Machine — non-blocking

Wayback submission is **never on the critical path of a build.** `archive.py
fetch` records `wayback: null` and moves on. `make archive-wayback` runs
separately, POSTs the outstanding URLs to `https://web.archive.org/save/`
(retrying transient 5xx, tolerating rate limits and hangs), and backfills the
returned timestamped URL into each `PROVENANCE.json`. This second, independent
copy means a rotted entry whose local artifact is somehow lost still has a
fallback. If the original is *already* dead at first fetch, `archive.py fetch`
pulls the most recent existing Wayback capture instead.

### Politeness & safety

The manifest is author-controlled, so SSRF is not a real threat, but the tool
still: sets per-request timeouts, enforces the 25 MB cap, rate-limits to one
request per host at a time, and identifies itself honestly. Beyond that:

- **Honour `X-Robots-Tag: noarchive`** — and the equivalent
  `<meta name="robots" content="noarchive">` in an HTML response body (cheap to
  check: it is in the head of the document just fetched). If either is present,
  the fetch is abandoned and the manifest entry flagged. This is the directive
  that actually governs *archiving* (as opposed to crawling); respecting it
  costs nothing and makes the posture defensible.
- **Skip authenticated content.** `archive.py` never sends cookies or
  credentials. If a URL needs authentication, it is not archived; at most it is
  a manual `visibility: private` artifact.
- **`robots.txt` is not gated.** A curated, single-shot, attributed, `noindex`'d
  fetch of a URL the site already cites is not crawling — it is the same
  operation a reader's browser performs on click. This matches Save-Page-Now
  and reference-manager norms. The load-bearing ethical commitment is the
  removal channel, advertised on `/archive/`, on every archive page, and inside
  the User-Agent string.

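The `noarchive` check from the first bullet can be sketched with the standard-library HTML parser. `forbids_archiving` and `_RobotsMeta` are hypothetical names. The sketch handles only the generic forms; a bot-scoped header such as `X-Robots-Tag: googlebot: noarchive` is deliberately not matched, which is an assumption about what the tool should honour.

```python
from html.parser import HTMLParser
from typing import Mapping, Set


class _RobotsMeta(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: Set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").strip().lower() == "robots":
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def forbids_archiving(headers: Mapping[str, str], body: str = "") -> bool:
    """True when the response header or the HTML body carries noarchive."""
    header_value = next(
        (value for key, value in headers.items() if key.lower() == "x-robots-tag"),
        "",
    )
    if "noarchive" in {token.strip().lower() for token in header_value.split(",")}:
        return True
    parser = _RobotsMeta()
    parser.feed(body)
    return "noarchive" in parser.directives
```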

---

## Text extraction & indexing

The "Full text" block is what makes an archived work *indexable* rather than an
opaque blob. Extraction:

- **PDF** → `pdftotext` (from `poppler`, already a build dependency for the
  `pdf-thumbs` Makefile target). Stored as `document.txt`.
- **HTML** → readable text pulled from the `monolith` snapshot with
  `BeautifulSoup` (already a dependency of `embed.py`). Headings are preserved.
  Stored as `snapshot.txt`.

Both `.txt` files are gitignored. `archive.py fetch` regenerates a `.txt`
whenever the artifact's current SHA-256 differs from the value stamped in the
adjacent `*.txt.sha256` sidecar (also gitignored), then re-stamps it. This
catches every way the committed artifact and the local — gitignored, not
`git pull`-ed — text could drift apart: a `refresh`, a `pdftotext` upgrade, a
truncated file. The indexed text is thus always in sync with the embedded
artifact.

Once the archive page renders this text into `_site/archive/{slug}/index.html`:

- **`embed.py`** walks `_site/**/*.html` *after* the Hakyll build. Archive pages
  are ordinary HTML files in that tree, so they are embedded with **no change to
  `embed.py`** — they automatically join both the page-level similarity corpus
  (`similar-links.json`) and the paragraph-level semantic index
  (`semantic-index.bin` / `semantic-meta.json`).
- **Pagefind** likewise indexes them automatically. Two filter tags on the
  archive template — `type: archive` and the link-rot `status` — let
  `search-filters.js` separate archive hits from native content and let a reader
  see (or exclude) `rotted`-citation archive pages.

The one requirement this imposes: the archived text **must** be in the rendered
DOM, not only inside the PDF or snapshot iframe. `embed.py`'s `BeautifulSoup`
pass and Pagefind both see DOM text only. Hence the "Full text" block (item 4
of the archive page) is non-optional.

---

## Backlinks integration — "Referenced by"

The goal: an archived paper's page shows every site page that cites it.

Today `Backlinks.hs` runs in two passes (see its module header). Pass 1
(`version "links"`) extracts links per content file; `isPageLink` **drops every
external URL**. Pass 2 inverts `target → [sources]`. The archive needs two
surgical changes, both driven by `data/archive-index.json`:

1. **Pass 1 — keep archived externals.** `isPageLink` is widened: an external
   URL is *kept* if it matches an entry in `archive-index.json`. Non-archived
   externals are still dropped, exactly as now.
2. **Pass 2 — canonicalize to the archive URL.** When inverting, an archived
   external URL is rewritten to its `/archive/{slug}/` key.

`backlinksField` then works unchanged: the archive page looks up its own route
and finds its citing pages. The archive template labels the section
**"Referenced by"** rather than "Backlinks" — semantically truer for a
third-party work — but the underlying field is the same.

This is purely additive: the *visible* link in the essay still points at the
original URL (reader expectation is preserved); only the backlink *relationship*
is recorded against the archive page. Archive pages do not need to be added to
`Patterns.allContent` — they only *receive* backlinks, and that needs a route,
not a `version "links"` pass.

**When `archive-index.json` is absent** — `.venv` not set up, or `archive.py`
has never run — it is treated as empty: `Backlinks.hs` and `Filters/Archive.hs`
silently no-op, and the build succeeds unchanged. For `Backlinks.hs` that means
every external URL is dropped exactly as today, with no canonicalization and no
error. This is a hard requirement, not a nicety: it preserves the established
`.venv`-gated silent-skip convention so a contributor without the Python
environment still gets a clean build.

### URL matching — the alias problem

A cited URL in the wild has many equivalent forms: `http://` vs `https://`,
trailing slash or not, `?utm_source=…` query junk, arXiv `abs` ↔ `pdf` ↔
versioned (`/abs/2403.12345`, `/abs/2403.12345v2`, `/pdf/2403.12345.pdf`). If
the index is keyed only by the manifest's canonical URL, a citation to any
variant misses, and **"Referenced by" silently under-counts** — a failure that
breaks nothing visibly and is miserable to debug.

So `archive.py` computes the equivalent-URL set per entry and stores it as
`aliases` in `archive-index.json`. The normalization is deliberately narrow:

- **Tracking parameters are stripped** — `utm_*`, `fbclid`, `gclid`, `mc_*`,
  `ref`, `igshid`, `_hsenc`, `_hsmi`, `mkt_tok`.
- **All other query parameters are preserved.** A `?v=…`, a `?id=…`, a Wayback
  timestamp is load-bearing; blanket query stripping would alias
  `…/article?id=42` to every other article on the host.
- `http`/`https` are folded, trailing slashes normalized, and known arXiv
  families (`abs` / `pdf` / versioned) expanded.

`Backlinks.hs` matches an incoming link against any alias before keying it to
the archive URL.

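The narrow normalization above can be illustrated with `urllib.parse`. Note the swap: the spec enumerates an alias set per entry, whereas this sketch folds each URL to one canonical form, which is simpler to show and yields the same matches for the cases listed. The names `normalize` and `canonical` are hypothetical, fragments are dropped here (the Phase 4 refinement would keep them), and only new-style arXiv identifiers such as `2403.12345` are recognized.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_EXACT = {"fbclid", "gclid", "ref", "igshid", "_hsenc", "_hsmi", "mkt_tok"}
TRACKING_PREFIXES = ("utm_", "mc_")

# /abs/2403.12345, /abs/2403.12345v2, /pdf/2403.12345, /pdf/2403.12345v2.pdf
_ARXIV_PATH = re.compile(r"^/(?:abs|pdf)/(\d{4}\.\d{4,5})(?:v\d+)?(?:\.pdf)?$")


def normalize(url: str) -> str:
    """Fold scheme to https, lowercase the host, trim the trailing slash,
    and drop tracking parameters while preserving every other one."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_EXACT and not key.startswith(TRACKING_PREFIXES)
    ]
    path = parts.path.rstrip("/")
    return urlunsplit(("https", parts.netloc.lower(), path, urlencode(kept), ""))


def canonical(url: str) -> str:
    """Normalize, then collapse the arXiv abs/pdf/versioned family."""
    norm = normalize(url)
    parts = urlsplit(norm)
    if parts.netloc in ("arxiv.org", "www.arxiv.org"):
        match = _ARXIV_PATH.match(parts.path)
        if match:
            return f"https://arxiv.org/abs/{match.group(1)}"
    return norm
```

Two citations then refer to the same archived work when `canonical(a) == canonical(b)`, while `…/article?id=42` and `…/article?id=43` stay distinct because `id` is not a tracking parameter.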
### Granular backlinks (Phase 4 refinement)

If a citation targets a fragment — `…/abs/2403.12345#section-4`, or a PDF page
`…/document.pdf#page=7` — the fragment is preserved through pass 2 instead of
being stripped by `normaliseUrl`. The archive page can then group "Referenced
by" entries by which section/page they cite: *"Section 4 — referenced by [Essay
A], [Essay B]."* This is the "indexed granularly, by section" behaviour, on the
backlinks side.

---

## Similar-pages integration — "Related"

This side is almost free. `embed.py` produces `data/similar-links.json` (page
similarity) from every file in `_site/`. Once archive pages render with their
full text (above), they are in the corpus:

- An **essay's** "Related" block can surface an archived paper.
- An **archive page's** "Related" block surfaces neighbouring archives and the
  site content nearest to it.

`SimilarLinks.hs` needs no change — `/archive/{slug}/` is just another URL key,
and `similarLinksField` resolves it like any page. Two small `embed.py` config
nudges: add `/archive/` to `EXCLUDE_URLS` (the index is a list page and would
otherwise dominate neighbours), and let individual archive pages through.

**Cost — a Phase 4 risk with a concrete trigger.** `embed.py` has a coarse
whole-run staleness skip but no per-document incrementality: when it *does* run,
it re-embeds the entire corpus. A serious archive (hundreds of entries, several
MB of extracted text each for long papers) materially extends every run that
executes. Phase 4 measures this and applies a fixed trigger: **once the archive
passes 50 entries, or `embed.py`'s runtime exceeds 60 seconds, add a
per-document embedding cache** keyed by content hash to `embed.py`. Below both
thresholds, the full-corpus re-embed is left alone — premature optimization
otherwise.

### Granular similar-pages (deferred)

`embed.py` *already* builds a **paragraph-level** index
(`semantic-index.bin` + `semantic-meta.json`, keyed `{url, title, heading,
excerpt}`). An archived HTML snapshot's preserved headings mean its sections get
distinct paragraph vectors automatically — the data for section-granular
"Related" exists the moment archive text is in the DOM. What does *not* yet
exist is a UI that consumes it per-section, for *any* content type. A
per-section "Related" block is deferred site-wide; the archive system *feeds*
the granular index regardless. For PDFs, section structure is unreliable
(`pdftotext` flattens it); per-*page* chunking is the realistic granularity —
see Open questions.

---

## Link annotation in content

When the author writes a link to a URL that is archived, the build appends a
small archive affordance — a superscript "[A]" / "archived" marker next to the
link — pointing at `/archive/{slug}/`. No per-link markup; entirely automatic.

Implementation: a Pandoc filter, `Filters/Archive.hs`, registered in
`Filters.hs`. For every `Link` whose URL matches `archive-index.json` (alias
set included), it appends the affordance inline.

**Filter ordering — pinned, then verified.** Per `/colophon`, the site's AST
chain is `markdown → pandoc → citations → wikilinks → preprocessing → sidenotes
→ smallcaps/dropcaps → links → images → math`. `Filters/Archive.hs` is pinned
**immediately after `smallcaps/dropcaps` and immediately before `links`** — not
merely "somewhere before `links`". The narrower window matters:
`smallcaps/dropcaps` rewrites the *text content* of nodes, so if `Archive.hs`
decorated first, the `[A]` affordance could be swept into a smallcaps run or
mistaken for an opening character by dropcap logic. Running it after
`smallcaps/dropcaps` appends the affordance to already-styled text that nothing
downstream re-touches; running it before `links` lets the link-decoration pass
(and any future popup hooks) act on the already-annotated tree. This chain is
transcribed from a published page; **Phase 3 confirms it against `Filters.hs`'s
actual registration order** before the position is pinned in code — a doc and
the implementation can drift.

**Confirmed (2026-05-22).** `Filters.hs`'s `applyAll` applies, innermost
first: `Images → SourceRefs → Code → Math → Dropcaps → Smallcaps → Links →
Typography → Sidenotes → Aftermatter`. The `/colophon` narrative is a loose
paraphrase — `Images` and `Math` run early, `Sidenotes` runs late — but
`Smallcaps` and `Links` *are* adjacent, so `Filters.Archive` is pinned between
them, exactly as specified above. (`/colophon` is prose, not authoritative for
filter order, and was left unchanged.)

**When `archive-index.json` is absent** — `.venv` not set up, or `archive.py`
has never run — it is treated as empty: `Backlinks.hs` and `Filters/Archive.hs`
silently no-op, and the build succeeds unchanged. For `Filters/Archive.hs` that
means every `Link` passes through un-annotated, no error raised.

**Bibliography — confirmed (2026-05-22): a separate context field.**
`Citations.hs` runs `applyCitations` *before* the `applyAll` filter chain; it
partitions the citeproc `refs` Div out of the document AST
(`extractBibliography`) and renders it to an HTML string via `writeHtml5String`
for the template's `$bibliography$` field. The body filter chain — and so
`Filters.Archive` — never sees the bibliography. Prose links get affordances;
bibliography reference links do not.

This does **not** put the broken popup layer on the critical path, as the
draft feared. `Citations.hs` already performs AST surgery on each bibliography
entry (`enhanceEntry` — it wraps `file:` PDF links and appends keyword strips),
so the realistic annotation hook is `enhanceEntry`, reusing `Filters.Archive`'s
index lookup — no popup dependency. That is **deferred to a Phase 3 follow-up**:
it first needs a check that `chicago-notes.csl` renders a cited work's
`url`/`doi` as a `Link` node (a CSL style that omits URLs would leave nothing
to match). Phase 3 ships prose-link annotation; bibliography annotation is
documented as in-scope and hookable via `enhanceEntry`, pending that check. A
future popup rewrite may *also* consult `archive-index.json`, but the archive
system depends on neither the current nor a future popup implementation.

---

## Link-rot detection & maintenance (Phase 5)

`tools/archive.py check` issues a `HEAD` (falling back to a ranged `GET`) to
every original URL in the manifest and updates `data/archive-state.json`.

**Hysteresis is asymmetric.** Rotting is slow; recovery is fast.

- *Rotting.* A failed probe increments `consecutive-failures` and sets
  `status: error`. Only after **3 consecutive failed scans spanning ≥ 14 days**
  does the status become `rotted`. A single transient failure — a Cloudflare
  challenge, a temporary 5xx, a DNS hiccup — therefore never flips a live
  citation.
- *Recovery.* A **single** successful probe resets `consecutive-failures` to 0
  and returns the status straight to `live`, from `error` or `rotted` alike.
  There is no cost to un-rotting eagerly — if the original is reachable again,
  the reader should go there — so recovery needs no hysteresis.

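The asymmetric hysteresis can be sketched as a pure state transition over one `archive-state.json` record. `update_state` is a hypothetical name. Two points are assumptions: `status-since` is read as the date the current status began (so, while in `error`, the date of the first failure in the run), and the `moved` status is left out because it depends on redirect details this sketch does not model.

```python
from datetime import date

ROT_MIN_FAILURES = 3
ROT_MIN_DAYS = 14


def update_state(state: dict, probe_ok: bool, today: date) -> dict:
    """Return the next state record after one probe; the input is not mutated."""
    new = dict(state)
    new["checked"] = today.isoformat()
    previous = state.get("status", "live")

    if probe_ok:
        # Recovery is immediate, from error or rotted alike.
        if previous != "live":
            new["status-since"] = today.isoformat()
        new["status"] = "live"
        new["consecutive-failures"] = 0
        return new

    failures = state.get("consecutive-failures", 0) + 1
    new["consecutive-failures"] = failures
    if previous == "rotted":
        return new  # already rotted: stays rotted until a success
    if previous != "error":
        # First failure of a run: start the clock.
        new["status"] = "error"
        new["status-since"] = today.isoformat()
        return new
    first_failure = date.fromisoformat(state["status-since"])
    if failures >= ROT_MIN_FAILURES and (today - first_failure).days >= ROT_MIN_DAYS:
        new["status"] = "rotted"
        new["status-since"] = today.isoformat()
    return new
```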

| `status` | Meaning | Rendering effect |
|----------|---------|------------------|
| `live` | Original reachable, unchanged | Normal: link to original, archive as backup |
| `moved` | 3xx to a new location | Banner notes the move; new URL recorded |
| `rotted` | Failed the hysteresis threshold (3 fails / ≥14 days) | Build flips the *primary* link to the archive copy; original shown struck-through as "(dead link)" |
| `error` | Transient / inconclusive — below the hysteresis threshold | No rendering change; retried next scan |

`paywalled` is deliberately **absent** from this table: a soft paywall returns
`200`, so an automated `HEAD`/`GET` cannot reliably detect it. Paywall status is
the manual `paywalled: true` manifest flag instead, and it drives only a banner
note — never a link flip.

The flip on `rotted` is the actual link-rot *cure*: a reader of a 2019 essay
clicks through to a working local snapshot instead of a 404, with no manual
intervention — and only after the rot is confirmed, not guessed.

`check` is a slow network job, not something every `make build` should pay for.
It runs on its own cadence — a periodic local `make archive-check`, or a
scheduled remote agent. It is decoupled from the main build: the build consumes
whatever `archive-state.json` exists.

---

## Build-pipeline integration

New steps slot into the `Makefile` `build` target, gated on `.venv` (silent
skip), consistent with `embed.py` and the photography extractors:

```
make build:
  git auto-commit content/                  (existing — archive/ NOT swept in)
  tools/convert-images.sh                   (existing)
  pdf-thumbs                                (existing)
  download-pdfjs.sh / download-leaflet.sh   (existing)
→ tools/archive.py fetch                    (NEW — fetch missing artifacts,
                                             extract text, write
                                             PROVENANCE.json +
                                             archive-index.json)
  extract-exif / palette / dimensions       (existing)
  cabal run site -- build                   (existing — now also routes archive/)
  pagefind --site _site                     (existing — now also indexes archive pages)
  tools/embed.py                            (existing — now also embeds archive pages)
  stamp-build-time.py / compress-assets.sh  (existing)
```

`tools/archive.py fetch` runs **before** `cabal run site -- build` so the
artifacts, `PROVENANCE.json` files, and `archive-index.json` all exist when
Hakyll routes the `archive/` tree and when `Backlinks.hs` loads the index.
`fetch` is incremental — a normal build with no new manifest entries does no
network I/O — but it still rewrites `archive-index.json` every run. Wayback
submission is **not** in this path. The `monolith` binary is committed
(`tools/bin/monolith`), so there is no download step.

**`make build` never deletes anything under `archive/`.** Artifact removal is
exclusively the job of the opt-in `make archive-gc` (see Eviction).

Standalone targets, none a dependency of `build`:

- `make archive-check` — link-rot scan.
- `make archive-wayback` — backfill outstanding Wayback captures.
- `make archive-suggest` — print the "cited but not archived" diff against
  `data/*.bib` (DOI-only entries resolved; `removed.yaml` entries excluded).
- `make archive-gc` — delete `archive/{slug}/` directories whose slug is
  recorded in `removed.yaml`; report (never delete) orphans that are not.
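The `archive-gc` contract in the list above (delete only what `removed.yaml` names, report everything else) can be sketched as a pure classification step. This is a minimal illustration under stated assumptions, not the code in `tools/archive.py`: the helper name `plan_gc` and the use of plain sets of slugs are hypothetical, and a real tool would read the YAML files and the directory tree first.

```python
from typing import List, NamedTuple, Set


class GcPlan(NamedTuple):
    delete: List[str]   # slugs recorded in removed.yaml: eligible for deletion
    orphans: List[str]  # on disk but in neither file: reported, never deleted


def plan_gc(on_disk: Set[str], manifest: Set[str], removed: Set[str]) -> GcPlan:
    """Classify archive/{slug}/ directories (hypothetical helper)."""
    return GcPlan(
        # Deletion is gated on removed.yaml alone.
        delete=sorted(on_disk & removed),
        # An orphan is a directory that neither file accounts for.
        orphans=sorted(on_disk - manifest - removed),
    )


plan = plan_gc(
    on_disk={"nist-fips-203", "old-paper", "stray-dir"},
    manifest={"nist-fips-203"},
    removed={"old-paper"},
)
print(plan.delete)   # ['old-paper']
print(plan.orphans)  # ['stray-dir']
```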
---

## Build module structure

New Haskell module:

- **`build/Archive.hs`** — patterns, routing rules, and contexts for the
  archive. Generates `/archive/` and every `/archive/{slug}/` page from
  `archive/manifest.yaml` + `PROVENANCE.json` + `data/archive-state.json`;
  routes the raw artifacts through unchanged. Pages and routed artifacts come
  only from current `manifest.yaml` entries, so an orphaned `archive/{slug}/`
  directory is inert (no page, not deployed). Integrity (SHA-256) verification
  is `tools/archive.py`'s job — it runs first and halts the build on a
  mismatch; `Archive.hs` trusts a present (provenance, artifact) pair and skips
  any entry lacking either. Separated from `Site.hs` for the same reason
  `Catalog.hs`, `Authors.hs`, and `Photography.hs` are — scoped concerns,
  isolated reasoning.

New Pandoc filter:

- **`build/Filters/Archive.hs`** — the link-annotation filter; registered in
  `Filters.hs` immediately after `smallcaps/dropcaps`, before the `links` pass.
  No-op when `archive-index.json` is absent.

Edits to existing modules:

- **`build/Patterns.hs`** — add `archivePattern` (artifact files) and
  `archiveManifest`. Add archive entries to `tagIndexable` so tagged archives
  reach the tag indexes. (Deliberately *not* added to `allContent`: archive
  pages receive backlinks but are not crawled for outbound links in v1.)
- **`build/Backlinks.hs`** — load `data/archive-index.json` (silent no-op if
  absent); widen `isPageLink` to keep archived externals; match incoming links
  against the alias set; canonicalize them to `/archive/{slug}/` in pass 2.
- **`build/Site.hs`** — wire the archive rules from `Archive.hs`; add the
  `/archive/` link to the footer / `colophon` routing.
- **`build/Stats.hs`** — contribute archive metrics to the `/build/` telemetry
  page: count; total bytes; median artifact age; counts by `snapshot-quality`,
  `status`, and `visibility`; `paywalled` count; and any orphan slugs
  (directories not in `manifest.yaml` and not in `removed.yaml` — they should
  not exist, so surface them where drift is visible).
- **`templates/partials/head.html`** — add a `noindex` context hook and a
  `$if(archive)$` link to `static/css/archive.css` (the archive pages'
  stylesheet — banner, provenance panel, artifact viewer, index list;
  scoped under `#markdownBody` to clear the prose rules in `typography.css`).

---
## Templates

New files under `templates/`:

| File | Role |
|------|------|
| `archive-index.html` | `/archive/` — the full text list, type/tag/status filters; includes `archive-removal-notice` |
| `archive.html` | `/archive/{slug}/` — banner, metadata, embedded artifact, full text, Referenced-by, Related; includes `archive-removal-notice` |

New partials:

| File | Role |
|------|------|
| `partials/archive-banner.html` | The "archived copy / view original" strip — reused by `archive.html` and any inline archive embed |
| `partials/archive-card.html` | Archive-entry card (text-only; no thumbnail in v1) for the index and for `/library.html` |
| `partials/archive-removal-notice.html` | The removal-request line (`ln@levineuwirth.org`); included directly by `archive.html` and `archive-index.html` |

Existing partials reused unchanged: `nav.html`, `head.html` (with the new
`noindex` flag), `footer.html`, `page-footer.html`. The removal notice is a
*new* partial precisely so `page-footer.html` stays untouched.

---
## Storage, repo size & `.gitignore`

Committed: the artifacts (`document.pdf`, `snapshot.html`), `PROVENANCE.json`,
`manifest.yaml`, `removed.yaml`, and the pinned `monolith` binary
(`tools/bin/monolith`). Gitignored: everything regenerable.

Append to `.gitignore`:

```
# Archive: generated text + its staleness stamp (recreated from the committed
# artifact on every build — deterministic, so committing them is churn).
archive/**/*.txt
archive/**/*.txt.sha256

# Archive: generated state (written by tools/archive.py).
# NOTE: archive/**/PROVENANCE.json is deliberately NOT ignored — it is the
# committed, immutable record of each archival event.
data/archive-state.json
data/archive-index.json
```

**Repo-size policy.** Archived artifacts are immutable once taken, so they add
no *history* bloat — but the working tree grows. v1 commits them: a preservation
guarantee that depends on an un-versioned side store is a weaker guarantee, and
`git clone` → `make build` must reproduce the whole site.

- **Per-artifact cap: 25 MB.** `archive.py fetch` warns and skips above it; a
  deliberately-oversize artifact is committed with `git add -f`. This stops a
  200 MB scan from being swept silently into a commit.
- **Migration tripwire.** If `archive/` exceeds **~5 GB**, or **doubles
  year-over-year**, evaluate moving the artifact store out of the main repo —
  to a separate `archive` repository or a content-addressed store the VPS
  rsyncs independently. `tools/archive.py` reads the store root from a single
  config value, so the move is a config change, not a redesign.
- **Never git LFS.** LFS smudges the property that makes this system worth
  having: with LFS, `git clone` no longer yields the artifacts unless the LFS
  server is up and authenticated. For a system whose value proposition is "this
  survives," that is a regression. If migration is needed, the destination is a
  separate repo or object store — not LFS in this one.

---
## Legal, ethical & SEO posture

Archiving third-party content touches copyright. The design's guardrails:

- **`noindex` on every archive page.** The archive preserves; it does not
  republish to search engines or compete with originals for ranking.
- **The original is the hero.** Every archive page links prominently to the
  source and is explicitly framed as a dated archived copy.
- **A real removal channel, everywhere.** A request to `ln@levineuwirth.org`
  gets the entry removed (see Eviction). The channel is advertised on
  `/archive/`, on **every individual archive page**, and inside the fetcher's
  User-Agent string. This is the load-bearing ethical commitment; `robots.txt`
  is only a proxy for it.
- **`noarchive` honoured.** Both `X-Robots-Tag: noarchive` (HTTP header) and
  `<meta name="robots" content="noarchive">` (HTML body) abort a fetch.
- **Authenticated content skipped.** The fetcher sends no credentials. Anything
  behind a login is not archived.
- **`visibility: private`** keeps a snapshot in-repo for the author's own
  reference without deploying the artifact to `_site/` — the appropriate
  setting for licensed material the author may read but should not redistribute.
  The archive *page* still exists (metadata + "held offline"), so link-rot
  tracking and the Wayback link survive.
- **Curated, not crawled.** The archive only ever contains works this site
  deliberately references — a fundamentally different posture from a scraper.
- **Attribution preserved.** Author, source title, source date, and original
  URL are surfaced on every archive page.

This is a personal-scale citation archive, consistent with long-standing
practice on research-oriented personal sites. It is not a content platform.

---
## Phased implementation

Each phase has explicit exit criteria. Do not start a phase until the previous
one passes.

### Phase 1 — Skeleton, PDF only

Bootstrap entry: **NIST FIPS 203 (ML-KEM)**, PDF at
`https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.203.pdf` — a stable, auth-free
PDF already cited in `data/simd-paper.bib`, so the test entry keeps its value
after Phase 1 ships.

- [x] Define `archive/manifest.yaml` and `archive/removed.yaml` schemas; create
      `manifest.yaml` with the bootstrap entry
- [x] `tools/archive.py fetch` — PDF download, size cap, `pdftotext`,
      `.txt.sha256` staleness stamp, write per-entry `PROVENANCE.json`; always
      rewrite `archive-index.json`; refuse a `(slug, url)` mismatch, and
      re-hash every committed artifact (non-zero exit on a SHA mismatch)
- [x] `build/Archive.hs` — routing for `/archive/`, `/archive/{slug}/`, and the
      raw `document.pdf`; orphaned directories produce no page (a pass-1
      refinement subsequently added a Haskell-side SHA-256 re-hash via
      `sha256sum`, so the integrity guarantee holds even when `archive.py`
      did not run first — direct `cabal` invocations, deploy hosts without
      `.venv`, etc.)
- [x] `templates/archive.html`, `templates/archive-index.html`,
      `partials/archive-banner.html`, `partials/archive-removal-notice.html`
- [x] PDF artifact embedded on the page (Phase 2 changed this to a raw,
      browser-native `<iframe>` embed — see the Display — PDF decision)
- [x] Extracted text rendered into the page DOM (collapsed `<details>`)
- [x] `noindex` hook in `head.html`; set on archive pages
- [x] **Eviction works** end-to-end — `make archive-gc`, `removed.yaml` gating,
      orphan reporting (see Eviction & removal)
- [x] Wire `tools/archive.py fetch` into the Makefile, `.venv`-gated
- [x] `.gitignore` additions (`PROVENANCE.json` explicitly *not* ignored)

**Exit criteria:** the FIPS 203 PDF renders at `/archive/{slug}/` with banner,
metadata, working PDF.js embed, visible extracted text, and a removal-request
notice; `/archive/` lists it; both carry `noindex`. The eviction procedure
(record in `removed.yaml` → drop the manifest line → `make archive-gc`) removes
the artifact; a manifest line deleted *without* a `removed.yaml` entry leaves
the artifact intact and emits a warning. **Running `make build` ten times in
succession with no manifest edits produces no changes under `archive/`** — no
deletions, no `PROVENANCE.json` rewrites, no artifact replacements.

**Met (2026-05-22).** FIPS 203 fetched (1.25 MB, 3601 lines of extracted
text); `/archive/nist-fips-203/` renders with banner, metadata, PDF.js iframe,
in-DOM full text, and removal notice; `/archive/` lists it; both carry
`noindex`. `gc` was verified on both paths — an orphan directory is reported
and left intact (exit 1); a `removed.yaml`-listed directory is deleted while
the manifest entry is untouched. `archive/` is byte-identical across repeated
fetch + build cycles. The PDF.js iframe is correctly wired; rendering the
viewer needs `static/pdfjs/`, which `make build` vendors via `download-pdfjs.sh`.
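The integrity step named in the Phase 1 checklist (re-hash every committed artifact and fail on a mismatch) could look roughly like the sketch below. It is an illustration only: the helper names `sha256_of` and `verify_entry` are hypothetical, and `"sha256"` is an assumed name for the hash field in `PROVENANCE.json`, whose real schema is defined elsewhere in this document.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so a large PDF is never read whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_entry(entry_dir: Path, artifact_name: str) -> bool:
    """True when the artifact matches the hash recorded in PROVENANCE.json.

    The "sha256" key is an assumed field name, used here for illustration.
    """
    provenance = json.loads((entry_dir / "PROVENANCE.json").read_text())
    return sha256_of(entry_dir / artifact_name) == provenance["sha256"]


# Self-contained demonstration against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    entry = Path(tmp)
    payload = b"%PDF-1.7 example bytes"
    (entry / "document.pdf").write_bytes(payload)
    recorded = hashlib.sha256(payload).hexdigest()
    (entry / "PROVENANCE.json").write_text(json.dumps({"sha256": recorded}))
    intact = verify_entry(entry, "document.pdf")

    (entry / "document.pdf").write_bytes(b"tampered")
    tampered = verify_entry(entry, "document.pdf")

print(intact, tampered)  # True False
```

A caller following the checklist would run this over every manifest entry and exit non-zero as soon as one returns `False`, so the build stops before Hakyll routes a corrupted artifact.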
### Phase 2 — HTML snapshots

Bootstrap entry: **`https://cr.yp.to/aes-speed.html`** (`slug: djb-aes-speed`)
— Bernstein's cache-timing-attacks page, cited in `data/simd-paper.bib`. A
stable, JavaScript-free static page, so its snapshot is reproducible and
classifies cleanly as `ok`; like FIPS 203 it keeps its value after the phase
ships.

- [x] Commit the pinned `monolith` binary at `tools/bin/monolith`; record
      version + sha256 in `tools/monolith-version.txt`
- [x] `tools/archive.py fetch` — HTML branch: `monolith --no-js`, CSP `<meta>`
      injection (`style-src` + `-elem` + `-attr`, `script-src`/`object-src`/
      `frame-src 'none'`), text extraction via `BeautifulSoup`, type detection
- [x] `snapshot-quality` classification (`ok` / `degraded` / `js-required`)
      written to `PROVENANCE.json`; degraded captures flagged on `/archive/`
- [x] Sandboxed `<iframe>` rendering (`referrerpolicy="no-referrer"`, no
      `allow-scripts`) in `archive.html`

**Exit criteria:** an HTML URL snapshots to a self-contained file with a CSP
`<meta>`, renders in a sandboxed no-referrer iframe with the original's styling
isolated, and shows extracted readable text in site chrome; the sandboxed
snapshot renders correctly under the CSP in **both Firefox and a Chromium-based
browser**; capture quality is classified and a `degraded` snapshot is visibly
flagged; the author has reviewed the rendered snapshot before committing it.

**Met (2026-05-22).** `monolith` 2.10.1 (`monolith-gnu-linux-x86_64`) is
vendored at `tools/bin/monolith` with its version + sha256 in
`tools/monolith-version.txt`; `archive.py fetch` locates it via `$MONOLITH_BIN`
→ `tools/bin/monolith` → `$PATH`, and warns-and-skips (build continues) when it
is absent. `cr.yp.to/aes-speed.html` snapshots to a 26 KB self-contained
`snapshot.html` with the archive CSP `<meta>` as the first `<head>` child;
`/archive/djb-aes-speed/` renders it in a `sandbox`ed, `no-referrer` iframe with
291 lines of extracted prose shown inline as `<p>` paragraphs; `snapshot-quality`
classifies `ok`, and a (synthetically forced) `degraded` entry shows the warning
note on the page and a flag on `/archive/`. `fetch` is idempotent — `archive/`
is byte-identical across re-runs. The committed artifact is `snapshot.html`;
`snapshot.txt` + `.sha256` are gitignored (the existing `archive/**/*.txt`
globs already cover them).

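The lookup order for the binary (`$MONOLITH_BIN`, then `tools/bin/monolith`, then `$PATH`) can be sketched as below. This is a hedged illustration rather than the actual `archive.py` code: `find_monolith` is a hypothetical name, and the choice to trust an explicit override without checking that it exists is an assumption made to keep the sketch short.

```python
import os
import shutil
from pathlib import Path
from typing import Optional


def find_monolith(repo_root: Path) -> Optional[Path]:
    """Resolve the monolith binary; None means warn, skip, and keep building."""
    # 1. An explicit override wins (assumed to be trusted as given).
    override = os.environ.get("MONOLITH_BIN")
    if override:
        return Path(override)
    # 2. The vendored, pinned binary committed to the repository.
    vendored = repo_root / "tools" / "bin" / "monolith"
    if vendored.is_file() and os.access(vendored, os.X_OK):
        return vendored
    # 3. Whatever the system provides on PATH, if anything.
    on_path = shutil.which("monolith")
    return Path(on_path) if on_path else None
```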
**Author-gated, by design (exit-criteria wording).** Two criteria are not
machine-checkable here and remain the author's: (1) the cross-browser CSP
render in Firefox *and* a Chromium browser; (2) the per-snapshot review before
committing `archive/`. The vendored `monolith` binary and the FIPS 203 / djb
artifacts are staged but **not committed** — committing `archive/` and
`tools/bin/monolith` is the deliberate author act the design specifies.

One real-world note from the bootstrap: `cr.yp.to` ships
`<meta name="robots" content="none">`. Per spec `none` ≡ `noindex, nofollow` —
it is *not* `noarchive`, so the snapshot proceeded correctly; only an explicit
`noarchive` (header or meta) aborts a fetch.
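The distinction in the note above (only an explicit `noarchive` aborts, `none` does not) can be made concrete with a small standard-library sketch. The names `forbids_archiving`, `_RobotsMeta`, and `_split` are hypothetical, and the sketch deliberately ignores user-agent-scoped header values such as `googlebot: noarchive`; how the real fetcher treats those is not stated here.

```python
from html.parser import HTMLParser
from typing import Dict, Set


def _split(value: str) -> Set[str]:
    """Split a comma-separated robots value into lowercase directives."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


class _RobotsMeta(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: Set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").strip().lower() == "robots":
            self.directives |= _split(attr.get("content", ""))


def forbids_archiving(headers: Dict[str, str], html: str) -> bool:
    """True only for an explicit `noarchive` in the header or a meta tag."""
    header_value = next(
        (v for k, v in headers.items() if k.lower() == "x-robots-tag"), ""
    )
    parser = _RobotsMeta()
    parser.feed(html)
    return "noarchive" in (_split(header_value) | parser.directives)


print(forbids_archiving({}, '<meta name="robots" content="none">'))   # False
print(forbids_archiving({"X-Robots-Tag": "noarchive"}, "<p>hi</p>"))  # True
```

The first call mirrors the `cr.yp.to` case: `none` is collected as a directive but never matches `noarchive`, so the fetch would proceed.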
### Phase 3 — Link annotation & Wayback

- [x] **Confirm `Filters.hs`'s actual filter registration order** matches the
      AST chain documented on `/colophon` before pinning the filter's position
- [x] **Confirm** whether the bibliography is rendered into the document AST or
      a separate context field — this decides whether bibliography annotation
      is in scope here or gated on the popup rewrite (see Link annotation)
- [x] `build/Filters/Archive.hs` — annotate body links to archived URLs;
      register in `Filters.hs` after `smallcaps/dropcaps`, before `links`;
      no-op when `archive-index.json` is absent
- [x] `archive.py wayback` + `make archive-wayback` — non-blocking submission,
      backfill `wayback` into `PROVENANCE.json`
- [x] `visibility: private` handling (artifact not routed to `_site/`)

**Exit criteria:** a prose link to an archived URL gets an automatic archive
affordance; a build without `.venv` (no `archive-index.json`) still succeeds
with links un-annotated; every entry has a recorded Wayback URL after `make
archive-wayback`; a `private` entry's page renders without deploying its
artifact; the bibliography-annotation path is documented as either in-scope or
popup-gated.

**Met (2026-05-22).** `build/Filters/Archive.hs` walks body `Link` nodes and,
for any URL in `data/archive-index.json` (canonical + alias set, fragment- and
trailing-slash-tolerant), appends a superscript `archive-affordance` link to
`/archive/<slug>/` — emitted as `RawInline` HTML so the downstream `Links`
pass leaves it alone. It is registered in `Filters.applyAll` between
`Smallcaps` and `Links`; the index loads once via an `unsafePerformIO` CAF and
an absent/empty index makes the filter the identity (verified: a prose link to
the archived `cr.yp.to/aes-speed.html` gains the affordance, a non-archived
link does not). `archive.py wayback` (+ `make archive-wayback`) submits each
entry lacking a `wayback` capture to the Wayback Machine and backfills
`PROVENANCE.json`; it always exits 0 and is never on a build's critical path.
`visibility: private` is a `manifest.yaml` field: a private entry's artifact is
never routed to `_site/` (artifacts are routed by an explicit public-only list,
which also stops an orphan directory's artifact deploying), and its page
renders provenance + a "held offline" panel with no embed and no extracted text
(verified: a private `_site/archive/<slug>/` contains only `index.html`).

Two items are deliberately scoped out of this pass, both documented above:
**bibliography annotation** (the bibliography is a separate `$bibliography$`
field; the hook is `Citations.hs`'s `enhanceEntry`, pending a CSL-URL check —
not popup-gated) and **pull-from-Wayback when the original is dead at fetch
time** (it belongs with Phase 5 link-rot detection, where a dead URL is the
central case and a Wayback-sourced artifact's provenance can be handled
properly). The live `make archive-wayback` run is author-initiated — it submits
public captures to a third-party service.
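The fragment- and trailing-slash-tolerant lookup that the Phase 3 "Met" note describes is implemented in Haskell; the Python sketch below only illustrates the matching idea. The index shape (slug mapped to a `url` plus an `aliases` list), the helper names, and the `http://` alias in the demo are all assumptions made for illustration, not the real `archive-index.json` schema.

```python
from typing import Dict, Optional
from urllib.parse import urlsplit, urlunsplit


def normalise(url: str) -> str:
    """Drop the fragment and any trailing slash so equivalent forms compare equal."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def build_lookup(index: Dict[str, dict]) -> Dict[str, str]:
    """Map every normalised canonical URL and alias to its slug."""
    lookup: Dict[str, str] = {}
    for slug, entry in index.items():
        for url in [entry["url"], *entry.get("aliases", [])]:
            lookup[normalise(url)] = slug
    return lookup


def archive_page_for(url: str, lookup: Dict[str, str]) -> Optional[str]:
    """Return the /archive/<slug>/ page for an archived URL, else None."""
    slug = lookup.get(normalise(url))
    return f"/archive/{slug}/" if slug else None


# Hypothetical index entry; the alias is invented for the demonstration.
index = {
    "djb-aes-speed": {
        "url": "https://cr.yp.to/aes-speed.html",
        "aliases": ["http://cr.yp.to/aes-speed.html"],
    }
}
lookup = build_lookup(index)
print(archive_page_for("https://cr.yp.to/aes-speed.html#top", lookup))  # /archive/djb-aes-speed/
print(archive_page_for("https://example.org/other", lookup))            # None
```

Building the lookup once and normalising both sides keeps each per-link check a single dictionary access, which matters when the filter runs over every body link on the site.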
### Phase 4 — Backlinks & similar-pages indexing

- [x] `Backlinks.hs` — load `archive-index.json` (silent no-op if absent);
      widen `isPageLink`; match the alias set; canonicalize archived externals
      to `/archive/{slug}/` in pass 2
- [x] "Referenced by" section on `archive.html`
- [x] `embed.py` — add `/archive/` to `EXCLUDE_URLS`; verify archive pages join
      `similar-links.json` and the paragraph index
- [x] **Measure `embed.py` runtime** against a populated archive; add a
      per-document embedding cache (keyed by content hash) once the archive
      passes 50 entries or `embed.py` exceeds 60 s
- [x] "Related" section on `archive.html`
- [x] Fragment-preserving backlinks → grouped "Referenced by" by section/page

**Exit criteria:** an archive page lists the essays that cite it under
"Referenced by", including citations that used an alias URL form; essays surface
relevant archived works under "Related"; a fragment-targeted citation appears
grouped under its section; `embed.py` runtime with the archive populated is
measured and either under the thresholds or the cache is in place.

**Met (2026-05-22).** A shared `build/ArchiveIndex.hs` loads
`data/archive-index.json` once (the `unsafePerformIO` CAF formerly private to
`Filters.Archive`); `Backlinks.hs` and `Filters.Archive` both consume it.
`Backlinks.isPageLink` keeps an archived external URL regardless of scheme or
extension; pass 2 (`targetKey`) canonicalises it to the archived work's
`/archive/<slug>/` page key — computed as the same string fed through
`normaliseUrl` that `backlinksField` uses for the page's own route, so the two
always agree. `archiveEntryCtx` gains `referencedByField` and
`similarLinksField`; `archive.html` renders `$if(referenced-by)$` /
`$if(similar-links)$` sections. `referencedByField` reuses the backlinks lookup
but groups sources by the fragment each citation targets — a `#page=12`
citation renders under a "Page 12" subheading, a bare citation in a flat list
above. `embed.py` excludes the `/archive/` index from the corpus (individual
entry pages stay in) and is measured at **~12 s** for the whole site (43 → 25
pages, 802 paragraphs) — far under the 60 s threshold and the 50-entry trigger,
so the per-document embedding cache is correctly *not* built (premature at this
scale; revisit at the threshold).

Verified end-to-end with a temporary citation in `content/about.md`: the
FIPS 203 page listed it under "Referenced by" with a flat entry *and* a grouped
"Page 12" entry; both archive pages surfaced the SIMD/PQC essay and each other
under "Related"; the `/archive/` index was absent from `similar-links.json`.

One pre-existing `embed.py` issue was surfaced and fixed: the `/source/`
repository code mirror was in the similarity corpus — a template file was
surfacing as a neighbour, titled with its unrendered `$title$` placeholder. An
`EXCLUDE_PREFIXES` rule now keeps `/source/` out, which also dropped 18 junk
pages from the site-wide corpus (43 → 25).
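The fragment grouping behind "Referenced by" (bare citations in a flat list, `#page=N` citations under a "Page N" subheading) lives in `referencedByField`; the Python sketch below only illustrates the grouping rule. The function names are hypothetical, the source label `"About"` is a stand-in for a citing page, and the `#fragment` label used for non-page fragments is an assumption, since the document specifies only the page case.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple
from urllib.parse import urlsplit

_PAGE = re.compile(r"^page=(\d+)$")


def fragment_label(target_url: str) -> Optional[str]:
    """None for a bare citation, "Page N" for #page=N, "#frag" otherwise (assumed)."""
    fragment = urlsplit(target_url).fragment
    if not fragment:
        return None
    match = _PAGE.match(fragment)
    return f"Page {int(match.group(1))}" if match else f"#{fragment}"


def group_referenced_by(
    citations: Iterable[Tuple[str, str]],
) -> Tuple[List[str], Dict[str, List[str]]]:
    """Split (source, target_url) pairs into a flat list and per-fragment groups."""
    flat: List[str] = []
    grouped: Dict[str, List[str]] = defaultdict(list)
    for source, target in citations:
        label = fragment_label(target)
        (flat if label is None else grouped[label]).append(source)
    return flat, dict(grouped)


pdf = "https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.203.pdf"
flat, grouped = group_referenced_by([("About", pdf), ("About", pdf + "#page=12")])
print(flat)     # ['About']
print(grouped)  # {'Page 12': ['About']}
```

The demo reproduces the shape of the end-to-end check described above: one citing page appears once in the flat list and once under "Page 12".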
### Phase 5 — Link-rot detection & maintenance

**Prerequisite — resolved 2026-05-22.** `/build/` had been serving a stale
cached page: its build-varying telemetry is gathered in `unsafeCompiler`, which
Hakyll does not dependency-track, so the page recompiled only when tracked
*content* changed. Fixed — `build/Main.hs` writes a per-build
`data/build-stamp.txt` that `Stats.hs` loads as a dependency, forcing `/build/`
and `/stats/` to recompile every build. The archive-metrics exit criterion
below is now measurable.

- [x] `tools/archive.py check` + `make archive-check` — HEAD/GET scan
- [x] Asymmetric hysteresis: `rotted` requires 3 consecutive failed scans over
      ≥ 14 days; a single success → `live`; `consecutive-failures` +
      `status-since` tracked in `archive-state.json`
- [x] Dead-link rendering: flip primary link to the archive on `rotted`
- [x] Pagefind `status` filter tag wired into `search-filters.js`
- [x] Archive metrics on `/build/` telemetry (`Stats.hs`)
- [x] `/archive/` index shows per-entry health

Test endpoint: reserve a controlled host — e.g. `archive-test.levineuwirth.org`,
a sub-host the author owns — that can be toggled to return 404 on demand, so the
rot-detection test flips without depending on a third party's uptime.

**Exit criteria:** the controlled test URL is detected as `rotted` only after
the hysteresis threshold is met, and the citing essay's link then flips to the
archived copy; a single transient failure does *not* flip it; restoring the URL
returns it to `live` on the next successful scan; the `/build/` page reports
archive coverage and health; search results can be filtered by archive `status`.

**Met (2026-05-22).** `tools/archive.py check` HEAD/GET-probes every manifest
URL (HEAD first, ranged GET on 403/405/501) and updates the gitignored
`data/archive-state.json`, which mirrors the manifest exactly (state for
dropped URLs is discarded). The asymmetric hysteresis in `next_state` is
unit-verified against synthetic scenarios — fail/fail/fail across 20 days flips
to `rotted`; three fast fails within 2 days stay at `error`; a single `ok` from
any non-live status recovers immediately to `live`. `ArchiveIndex.hs` exposes
the parsed status to consumers as `archiveStatusForSlug`. `Filters.Archive`
flips a `rotted` body link's href to `/archive/<slug>/` (adding an
`archive-rotted` class and a solid "archived" affordance marker) — verified
end-to-end with a hand-crafted `rotted` state file: a content link to the
djb URL was rewritten to the archive page; reverting the state restored the
original link. `archive.html` carries `data-pagefind-filter="type:archive,
status:$status$"`, a "Link status" row in the provenance panel, and a
status-note callout in the header for non-live states. The `/archive/` index
flags rotted entries with a solid "link rotted" chip. `Stats.hs` `/build/`
gains a "Link archive" section (count, total size, median age, by-status /
by-quality / by-visibility breakdowns, paywalled count, orphan directories) —
verified showing the test state's `error 1 · rotted 1` mix.

**Rendering staleness — by design.** Rot status is consumed at build time via
`unsafePerformIO` CAFs; archive entry pages and content pages don't have a
Hakyll dependency edge to `archive-state.json` (that would only fix half the
problem — the archive pages — while leaving content-link flips stale, since
`Filters.Archive` runs during content compilation and can't cheaply force
every content page to depend on the state). So after `make archive-check`,
an *incremental* build can leave both surfaces uniformly stale until a clean
build refreshes everything. `make deploy` always does `make clean`, which
makes the deployed site consistent. The `/build/` page is the one
always-fresh surface: it recompiles every build via the existing build-stamp
dependency, so its archive metrics always reflect the current scan.

**Test endpoint deferred.** Spinning up `archive-test.levineuwirth.org` and
running it through a 14-day-spanning fail streak is a multi-week real-world
verification the author runs (or a CI cron); the hysteresis logic itself is
unit-tested deterministically in `next_state`, and the rendering side is
verified by the hand-crafted `rotted` state file.

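The asymmetric hysteresis exercised by those synthetic scenarios can be sketched as a pure transition function. This is an illustration under assumptions, not the `next_state` in `tools/archive.py`: the `State` fields are hypothetical (`streak_start` stands in for whatever `status-since` records), the 14-day span is measured from the first failure of the streak, and the `moved` status is left out.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

FAIL_THRESHOLD = 3                  # consecutive failed scans required
MIN_FAIL_SPAN = timedelta(days=14)  # the streak must span at least this long


@dataclass(frozen=True)
class State:
    status: str                    # "live", "error", or "rotted"
    consecutive_failures: int
    streak_start: Optional[date]   # first failure of the current streak


def next_state(prev: State, ok: bool, today: date) -> State:
    """Advance one scan: slow to declare rot, immediate to recover."""
    if ok:
        # Recovery needs no hysteresis: one success clears the streak.
        return State("live", 0, None)
    failures = prev.consecutive_failures + 1
    start = prev.streak_start or today
    confirmed = failures >= FAIL_THRESHOLD and today - start >= MIN_FAIL_SPAN
    return State("rotted" if confirmed else "error", failures, start)


def run(days, start=date(2026, 1, 1)) -> State:
    """Apply one failed scan on each listed day offset, starting from live."""
    state = State("live", 0, None)
    for offset in days:
        state = next_state(state, False, start + timedelta(days=offset))
    return state


slow = run([0, 10, 20])  # three fails across 20 days
fast = run([0, 1, 2])    # three fails within 2 days
recovered = next_state(slow, True, date(2026, 2, 1))
print(slow.status, fast.status, recovered.status)  # rotted error live
```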
**Search-UI filter (`search-filters.js`) — partial.** The data-side is in
place: every archive page carries `data-pagefind-filter="type:archive,
status:$status$"`, so Pagefind's filter index now distinguishes archive hits
by rot status and (when `pagefind-ui` is configured to show filters) lists
them as a filterable facet. The remaining work — wiring a custom UI control
into `search-filters.js` — is a deliberate refinement, not done in Phase 5:
its existing `status` filter is reserved for *epistemic* status (working
model / drafting / etc.) sourced from `data/epistemic-meta.json`, so adding an
archive `status` dimension needs a name to avoid the collision plus new
filter-panel buttons. Search UX is best iterated with the live page in front
of the author.

---
## Open / deferred questions

Non-blocking, and now a short list — the draft's larger set was resolved into
Decisions during review.

- **JS-heavy / SPA pages.** `monolith` cannot execute JavaScript;
  `js-required` captures are degraded. A headless-browser fallback (SingleFile,
  Chromium capture) would handle them but adds a heavyweight dependency. Defer
  until a real entry needs it.
- **First-viewport thumbnails.** Dropped for v1 — `/archive/` is a text list. A
  visual grid does not earn its keep at small N; revisit past ~50 entries.
- **PDF section-granularity.** `pdftotext` flattens structure. Per-*page*
  chunking (`#page=N` anchors, per-page text) is the realistic granularity for
  PDF backlinks and semantic indexing. Defer.
- **Per-section "Related" UI.** The paragraph-level semantic index already
  receives archive text; a UI surfacing section-level "Related" does not exist
  for *any* content type yet. Out of scope here; a site-wide feature.
- **Snapshot versioning.** v1 snapshots are immutable per snapshot; `refresh`
  replaces in place but records `previous-sha256`. If a referenced work is
  meaningfully revised, should a new dated snapshot be kept *alongside* the old
  (`document-2027-01-01.pdf`) with a version switcher? `previous-sha256` is the
  seed — extend it to a list and the switcher reads it. Defer until needed.
- **Intra-archive link rewriting.** When archived page A links to a URL that is
  *also* archived, A's snapshot could be rewritten to point at the local copy
  of B — keeping the reader inside the preserved set. Gwern-style; defer.
- **Media beyond PDF/HTML.** EPUB, plain images, video. Out of scope for v1;
  `type` is an open enum so it can extend.

---
## References

- `WRITING.md` — authoring conventions; the link-annotation feature will be
  documented there once Phase 3 lands
- `PHOTOGRAPHY.md` — the closest precedent: authored-input/generated-sidecar
  split, phased build, `.venv`-gated tools, vendored binaries
- `build/Backlinks.hs` — two-pass backlinks; `isPageLink` is the integration
  point
- `build/SimilarLinks.hs` — "Related" block; consumes `embed.py` output
- `tools/embed.py` — embedding pipeline; archive pages join its corpus for free
- `build/Patterns.hs` — canonical content patterns
- `build/Tags.hs` — slash-hierarchy tags (reused for archive tags)
- `tools/download-leaflet.sh`, `tools/download-pdfjs.sh` — the sha256-pinning
  convention; `monolith` is committed directly rather than downloaded (a
  build-time executable, not a servable asset)
- `nginx/popup-proxy.conf` — the metadata proxy; related but distinct (caches
  previews, does not preserve documents)