From 77e31efdaedc5b75f5932ebbf12afc2e6e2b5335 Mon Sep 17 00:00:00 2001 From: Levi Neuwirth Date: Sat, 23 May 2026 10:06:33 -0400 Subject: [PATCH] Add link archive system: snapshots, backlinks, link-rot MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Preserve external works the site cites against link rot, host them at permanent /archive// URLs in site chrome, and treat them as first-class citizens of the backlinks and similar-pages indexes. Curated, not crawled: the author adds one line to archive/manifest.yaml and the build fetches, hashes, snapshots, and indexes the work. * archive/manifest.yaml + tools/archive.py (fetch / refresh / wayback / check / gc) — PDFs downloaded directly, HTML pages snapshotted with a vendored monolith (tools/bin/monolith @ 2.10.1) into a single self-contained file with the archive CSP and a noarchive robots meta injected. Per-entry PROVENANCE.json committed; gitignored .txt sidecars regenerated from the artifact's SHA-256. * build/Archive.hs + build/ArchiveIndex.hs + build/Filters/Archive.hs — Hakyll rules for /archive/ and /archive//, a body Pandoc filter that appends an archive affordance to live citations and flips dead ones to the local copy on archive.py check's asymmetric hysteresis (rotted needs 3 fails over >= 14 days; one ok recovers). * build/Backlinks.hs — keeps archived external URLs through pass 1 and canonicalises them to /archive// in pass 2, producing a "Referenced by" section grouped by the fragment each citation targets. build/Stats.hs gains a "Link archive" telemetry block on /build/ (count, total size, median age, by-status / by-quality / by-visibility, orphans). * Integrity: archive.py fetch and build/Archive.hs (via sha256sum) both re-hash every committed artifact, so a tampered file halts the build even with cabal invoked directly or no .venv present. refresh refuses to replace an uncommitted prior snapshot and rolls back atomically on any exit path. 
removed.yaml is honoured by fetch, wayback, and check using canonical-form (tracking-stripped, arXiv-canonicalised) comparison. * visibility: private keeps an entry in-repo but undeployed. nginx/archive.conf emits X-Robots-Tag: noindex, noarchive for raw artifacts that cannot carry meta directives. The full design, phase plan (1-5), and three refinement passes live in ARCHIVE.md. Co-Authored-By: Claude Opus 4.7 --- .gitignore | 10 + ARCHIVE.md | 1535 +++++++++++++++++ Makefile | 44 +- archive/djb-aes-speed/PROVENANCE.json | 14 + archive/djb-aes-speed/snapshot.html | 470 +++++ archive/manifest.yaml | 28 + archive/nist-fips-203/PROVENANCE.json | 14 + archive/nist-fips-203/document.pdf | Bin 0 -> 1252341 bytes archive/removed.yaml | 19 + build/Archive.hs | 579 +++++++ build/ArchiveIndex.hs | 255 +++ build/Backlinks.hs | 151 +- build/Filters.hs | 2 + build/Filters/Archive.hs | 82 + build/Main.hs | 22 +- build/Site.hs | 22 + build/Stats.hs | 31 + levineuwirth.cabal | 3 + nginx/archive.conf | 45 + nginx/vhost.conf.example | 6 + static/css/archive.css | 463 +++++ static/css/components.css | 47 + templates/archive-index.html | 23 + templates/archive.html | 109 ++ templates/partials/archive-banner.html | 5 + .../partials/archive-removal-notice.html | 5 + templates/partials/head.html | 2 + tools/archive.py | 1151 ++++++++++++ tools/bin/monolith | Bin 0 -> 12434488 bytes tools/embed.py | 13 +- tools/monolith-version.txt | 17 + 31 files changed, 5127 insertions(+), 40 deletions(-) create mode 100644 ARCHIVE.md create mode 100644 archive/djb-aes-speed/PROVENANCE.json create mode 100644 archive/djb-aes-speed/snapshot.html create mode 100644 archive/manifest.yaml create mode 100644 archive/nist-fips-203/PROVENANCE.json create mode 100644 archive/nist-fips-203/document.pdf create mode 100644 archive/removed.yaml create mode 100644 build/Archive.hs create mode 100644 build/ArchiveIndex.hs create mode 100644 build/Filters/Archive.hs create mode 100644 nginx/archive.conf create mode 
100644 static/css/archive.css create mode 100644 templates/archive-index.html create mode 100644 templates/archive.html create mode 100644 templates/partials/archive-banner.html create mode 100644 templates/partials/archive-removal-notice.html create mode 100644 tools/archive.py create mode 100755 tools/bin/monolith create mode 100644 tools/monolith-version.txt diff --git a/.gitignore b/.gitignore index 651615a..e7ca3f8 100644 --- a/.gitignore +++ b/.gitignore @@ -69,10 +69,20 @@ data/similar-links.json data/backlinks.json data/build-stats.json data/build-start.txt +data/build-stamp.txt data/last-build-seconds.txt data/semantic-index.bin data/semantic-meta.json +# Archive: generated text + its staleness stamp (recreated from the +# committed artifact on every build — deterministic, so committing them is +# churn). archive/**/PROVENANCE.json is deliberately NOT ignored — it is +# the committed, immutable record of each archival event. +archive/**/*.txt +archive/**/*.txt.sha256 +data/archive-index.json +data/archive-state.json + # IGNORE.txt is for the local build and need not be synced. IGNORE.txt diff --git a/ARCHIVE.md b/ARCHIVE.md new file mode 100644 index 0000000..0c1d546 --- /dev/null +++ b/ARCHIVE.md @@ -0,0 +1,1535 @@ +# Archive + +Design and implementation plan for the link-archiving system of levineuwirth.org. +This is the source of truth for how external references are preserved, hosted, +displayed, and indexed. It sits alongside `WRITING.md`, `PHOTOGRAPHY.md`, +`HOMEPAGE.md`, and `MARKS.md` as authoritative spec. + +## Status + +**Reviewed and ratified 2026-05-21, with revisions.** The original draft was +reviewed against the live site over three rounds; the decisions below +incorporate every round of deltas and are now locked. 
+ +**Phase 1 complete (2026-05-22).** PDF entries: `archive/manifest.yaml`, +`tools/archive.py` (`fetch` + `gc`), `build/Archive.hs`, the four templates, +and the Makefile / `head.html` / `.gitignore` wiring are built and verified — +`/archive/` and `/archive/nist-fips-203/` render. + +**Phase 2 complete (2026-05-22).** HTML snapshots: the pinned `monolith` +binary is vendored at `tools/bin/monolith`, `archive.py fetch` snapshots HTML +pages (CSP injected, text extracted, quality classified), and `archive.html` +renders them in a sandboxed iframe — `/archive/djb-aes-speed/` renders. The +cross-browser CSP check and the per-snapshot review remain author-gated by +design. + +**Archive pages styled (2026-05-22).** `static/css/archive.css` gives the +index and entry pages a framed treatment (banner callout, provenance panel, +artifact viewer); the PDF embed was changed to the raw `document.pdf` (browser- +native viewer), symmetric with HTML snapshots — see the Display — PDF decision. + +**Phase 3 complete (2026-05-22).** Link annotation + Wayback: `Filters/Archive.hs` +appends an archive affordance to body links whose target is archived; +`archive.py wayback` (+ `make archive-wayback`) backfills Wayback captures; +`visibility: private` keeps an entry's artifact in-repo but undeployed. +Bibliography annotation is documented as a `Citations.hs` follow-up. + +**Phase 4 complete (2026-05-22).** Backlinks + similar-pages: `Backlinks.hs` +keeps archived external links and canonicalises them to their `/archive//` +page, so an archived work lists every essay that cites it under "Referenced by" +(grouped by the fragment each citation targets); `archive.html` also carries a +"Related" block from the `embed.py` similarity corpus, which now indexes archive +pages and excludes the `/archive/` index. 
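As an illustration of the Phase 4 behaviour described above, the following is a minimal Python sketch of how archived external links could be canonicalised to their archive page and grouped by the fragment each citation targets. It is not the `Backlinks.hs` implementation, which is Haskell. The function name `referenced_by`, the in-memory shapes of `links` and `index`, and the example essay paths and fragment are assumptions made for this sketch; only the `url` → slug mapping of `data/archive-index.json` comes from the design.

```python
from collections import defaultdict
from urllib.parse import urldefrag


def referenced_by(links, index):
    """Group citing pages by archive page, then by targeted fragment.

    links: iterable of (source_page, target_url) pairs (hypothetical shape).
    index: mapping of archived URL -> slug, as in data/archive-index.json.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for source, target in links:
        # Split "https://host/page#frag" into the base URL and "frag".
        base, fragment = urldefrag(target)
        slug = index.get(base)
        if slug is None:
            continue  # not an archived work: the external link is dropped
        sources = grouped[f"/archive/{slug}/"][fragment]
        if source not in sources:  # list each citing page once per fragment
            sources.append(source)
    return {page: dict(frags) for page, frags in grouped.items()}


if __name__ == "__main__":
    # Hypothetical essays citing the archived djb page.
    example_index = {"https://cr.yp.to/aes-speed.html": "djb-aes-speed"}
    example_links = [
        ("/essays/a/", "https://cr.yp.to/aes-speed.html#sec2"),
        ("/essays/b/", "https://cr.yp.to/aes-speed.html"),
    ]
    print(referenced_by(example_links, example_index))
```

Citations without a fragment land under the empty-string key, which a template could render as references to the work as a whole.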
+ +**Phase 5 complete (2026-05-22).** Link-rot detection: `tools/archive.py check` +(+ `make archive-check`) HEAD/GET-probes every manifest URL and updates the +gitignored `data/archive-state.json` under asymmetric hysteresis (`rotted` +needs 3 fails over ≥14 days; a single success recovers immediately). +`Filters.Archive` flips a body link to the archive when its target is `rotted`; +each archive page surfaces its link status (provenance row, header note, +Pagefind `status` filter tag); `/archive/` flags rotted entries; `/build/` +gains a "Link archive" telemetry section. The search-UI `status` filter wiring +in `search-filters.js` is deliberately partial — see the Phase 5 Met note. + +**All five phases done.** Refinements next; see the Phase 5 Met note for the +documented deferrals (search-UI status filter; bibliography annotation from +Phase 3; pull-from-Wayback at fetch time). + +**Refinements (2026-05-22).** A code-review pass found and fixed several +correctness and posture issues across the system: + +- **Missing committed artifact no longer re-fetches silently.** `cmd_fetch` + used to skip its SHA guard when the artifact was absent and then download + fresh bytes whose hash differed from the recorded `sha256` — replacing the + recorded snapshot without surfacing it. The guard now also halts when + `PROVENANCE.json` is present but the artifact is missing, requiring the + author to restore the committed bytes before rebuilding. +- **`archive/removed.yaml` is now enforced in `fetch` and `check`.** It was + only read by `gc`. A removed URL re-added to the manifest now halts + `cmd_fetch` loudly; `cmd_check` skips removed URLs so the link-rot + scanner does not keep probing a deliberate takedown. 
+- **SHA verification closed the `.venv`-bypass hole.** The original + decision relied solely on `archive.py fetch` re-hashing, but that step is + `.venv`-gated — a contributor or deploy host without `.venv`, or a direct + `cabal run site -- build`, would publish a tampered artifact unchecked. + `build/Archive.hs` now also re-hashes via `sha256sum` from + `loadArchiveEntries` and halts the build on a mismatch, so the guarantee + holds independent of the Python step. +- **Raw artifacts are no longer publicly indexable.** Pass 1 added a + `robots.txt` `Disallow: /archive/`, which pass 2 then reverted (see + below — it was counter-productive). Pass 1's other change — injecting + `` into every new HTML + snapshot alongside the archive CSP — remains in place; the + deploy-side header for raw PDFs landed in pass 2 as `nginx/archive.conf`. +- **The documented `archive.py refresh {slug}` subcommand is implemented.** + It clears the slug's directory, re-fetches via `cmd_fetch`, and records + the prior `sha256` as `previous-sha256` in the new `PROVENANCE.json`. The + URL-changed error message in `cmd_fetch` now points at it instead of + asking the author to delete the directory by hand. +- **`url_aliases` widened** to the design's full equivalent-URL set: + tracking-parameter stripping (`utm_*`, `fbclid`, `gclid`, `mc_*`, `ref`, + `igshid`, `_hsenc`, `_hsmi`, `mkt_tok`) and arXiv abs / pdf / versioned / + `.pdf` form expansion. Phase 1 had deliberately kept these as a Phase 4 + deferral, but Phase 4 missed the follow-through. +- **`X-Robots-Tag: noarchive` is now honoured on both HEAD and GET.** Some + servers omit the header on HEAD but emit it on GET; HTML capture now + aborts if either response carries the directive. 
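The tracking-parameter stripping named in the `url_aliases` item above can be sketched as follows. This is a hedged illustration, not the code in `tools/archive.py`: the names `TRACKING_EXACT`, `TRACKING_PREFIXES`, `is_tracking_param`, and `strip_tracking` are hypothetical, and the sketch covers only the parameter list given above, not the arXiv form expansion.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

# Parameter names and prefixes copied from the list in the design notes.
# The constant and function names are assumptions made for this sketch.
TRACKING_EXACT = {"fbclid", "gclid", "ref", "igshid", "_hsenc", "_hsmi", "mkt_tok"}
TRACKING_PREFIXES = ("utm_", "mc_")


def is_tracking_param(name: str) -> bool:
    """True when a query parameter is one of the listed tracking parameters."""
    lowered = name.lower()
    return lowered in TRACKING_EXACT or lowered.startswith(TRACKING_PREFIXES)


def strip_tracking(url: str) -> str:
    """Return the URL with tracking parameters removed from its query string."""
    parts = urlparse(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not is_tracking_param(key)
    ]
    # urlencode re-encodes the surviving pairs, so unusual escaping in the
    # original query may come back in a different but equivalent form.
    return urlunparse(parts._replace(query=urlencode(kept)))


if __name__ == "__main__":
    print(strip_tracking("https://cr.yp.to/aes-speed.html?utm_source=mail"))
```

Comparing stripped forms on both sides of a lookup, rather than enumerating aliases, is what lets arbitrary combinations of tracking parameters resolve to the same entry.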
+ +Three smaller items remain documented and deferred: + +- **Archive tags joining the site-wide tag indexes.** `manifest.yaml`'s + `tags:` is authored but `Tags.hs`/`Patterns.tagIndexable` does not yet + ingest archive entries — it needs a Tags.hs-side integration with its + own design pass (archive pages aren't `match`ed Hakyll items in the + normal way). +- **`archive.py suggest`** (bibliography discovery — diff `.bib` URLs + against the manifest) is documented but not implemented. +- **The controlled-host end-to-end link-rot test** (reserve + `archive-test.levineuwirth.org`, run it through a 14-day-spanning fail + streak, watch the flip happen) is inherently a multi-week real-world + verification the author runs; the hysteresis logic is unit-tested + deterministically and the rendering side is verified by a hand-crafted + `rotted` state file. + +**Refinements pass 2 (2026-05-23).** A second code-review pass surfaced +correctness gaps the first pass missed: + +- **`refresh` is now atomic.** It used to delete the slug directory and + then call `cmd_fetch`; a failed re-fetch left the entry with no + snapshot at all, while `refresh` returned 0 (because `cmd_fetch` + reports per-entry skips, not a process failure). The slug directory is + now *renamed* to a `.refresh-backup` sibling; success removes the + backup, any failure restores it. Verified by hiding the `monolith` + binary and confirming the prior snapshot survives intact. +- **Invalid `visibility` values fail closed.** The `ManifestEntry` parser + used to accept any string and only treat the exact `"private"` as + private — a typo like `privte` would publish a work the author intended + to keep offline. The parser now rejects any value other than `public` + or `private`, and `readManifest` halts the build on any parse error of + a present file (instead of warning + returning an empty list — that + silent-skip was for `file absent`, not `file present but corrupt`). 
+- **Lookup-side URL normalisation.** Alias generation alone cannot cover + unbounded forms (arXiv versions, arbitrary tracking-parameter + combinations). `ArchiveIndex` now normalises both index keys and + lookup inputs through the same `normalizeUrl` (drop fragment, strip + tracking, fold http→https, arXiv-canonicalise, trim trailing slash). + Verified: `https://cr.yp.to/aes-speed.html`, + `https://cr.yp.to/aes-speed.html?utm_source=mail`, and + `http://cr.yp.to/aes-speed.html/` all match the same archived entry. +- **Raw-artifact indexing posture corrected.** The Phase-5 `robots.txt` + `Disallow: /archive/` was counter-productive: a URL blocked by + robots.txt can still appear in results when externally linked, and the + Disallow also prevents compliant crawlers from reading the wrapper + pages' ``. The Disallow is reverted; a new + `nginx/archive.conf` snippet emits `X-Robots-Tag: noindex, noarchive` + for the whole `/archive/` tree, which crawlers honour for any resource + (HTML and PDF alike). The deploy vhost should `include + snippets/archive.conf`. +- **`cmd_wayback` skips `removed.yaml`.** The eviction procedure says + record in `removed.yaml` *before* dropping the manifest line; `fetch` + and `check` now honour that ordering, but `wayback` did not. A removed + entry whose manifest line was still in place could be submitted to a + third-party archive after a takedown was recorded. +- **The shipped HTML snapshot was refreshed in the working tree** so it + carries the noarchive meta the Phase-5 inject promises. `archive.py + refresh djb-aes-speed` re-fetched cr.yp.to, applied + `inject_archive_metas`, and recorded the prior SHA as `previous-sha256`. + `archive/djb-aes-speed/{snapshot.html, PROVENANCE.json}` now reflect the + new bytes; matching SHA is verified by `Archive.hs`. *Caveat surfaced + in pass 3 (below): the prior snapshot was not committed at the moment + of this refresh, so its bytes are no longer recoverable via `git log + -S`. 
A pass-3 fix to `refresh` now refuses to replace an uncommitted + prior, but the historical artifact survives — `previous-sha256` + records a hash whose bytes this working tree cannot reproduce.* +- **The URL-changed error in `cmd_fetch`** now points at + `archive.py refresh {slug}` instead of asking the author to delete the + directory by hand. + +Tag integration remains the one deferred refinement (it needs a Tags.hs +design pass). + +**Refinements pass 3 (2026-05-23).** A third audit surfaced gaps the pass-2 +fixes didn't fully close: + +- **`refresh` refuses to replace an uncommitted prior snapshot.** Pass 2 + preserved a prior snapshot through *failed* re-fetches, but a *successful* + one happily discarded uncommitted bytes — `previous-sha256` then pointed + at a hash no `git log -S` could recover. Pass 3 shells out to `git + ls-files` + `git diff --quiet HEAD` and refuses the refresh unless both + the prior PROVENANCE.json and its artifact are tracked and clean. +- **`refresh` is atomic across *every* exit path.** Pass 2 handled the + ordinary `cmd_fetch returns 0 but the artifact wasn't produced` case but + not fatal `sys.exit`s (e.g. a `removed.yaml` conflict halting `cmd_fetch` + mid-refresh) nor mid-refresh exceptions, and it never rolled back the + `data/archive-index.json` rewrite. The work is now wrapped in + `try/finally` that restores both the slug directory and the index on any + exit path — normal failure, `SystemExit`, `KeyboardInterrupt`, or + exception. +- **Removal enforcement now uses the same equivalence as link matching.** + Pass 2 introduced `normalizeUrl` for incoming citations but compared + removals as literal URL strings, so a tracking-laden manifest URL could + bypass a takedown. Python gains `normalize_url` mirroring the Haskell + helper, and `fetch` / `check` / `wayback` compare normalised forms. + `cmd_fetch` additionally rejects two manifest entries whose canonical + forms collide — that would otherwise route both under one slug. 
+- **`fetch_html` honours `X-Robots-Tag: noarchive` on the captured GET too.** + Pass 1 added HEAD + ranged-GET probes, but a server can emit the header + only on the full document response. The Python tool now downloads that + response itself, checks its header and body directives, then passes those + exact bytes to `monolith --base-url ... -` so the saved snapshot is not + obtained through a second unobservable document request. +- **`nginx/archive.conf` is wired into the deploy template** and + re-`include`s `security-headers.conf` inside its `location` block. + `nginx/vhost.conf.example` now includes `archive.conf`; the snippet + itself re-emits the baseline headers because nginx's `add_header` chain + is inherited from a parent only when the current context declares *no* + `add_header` directives — without the re-include, /archive/ would lose + HSTS, CSP, etc. +- **Contract doc cleanups.** The Phase-5 paragraph claiming `robots.txt` + disallows `/archive/` is reworded to acknowledge the pass-2 reversal; + the Phase-1 checkbox claiming `Archive.hs` does not re-hash is updated + to point at `verifyArtifactSha`; the pass-2 note about the refreshed + djb snapshot now carries the caveat that its prior bytes were + uncommitted and are therefore unrecoverable. + +The historical `previous-sha256` value in `archive/djb-aes-speed/ +PROVENANCE.json` is left in place: it is a truthful record that *a* prior +snapshot existed and what its hash was. It just is not recoverable from +git in this working tree — the pass-3 `refresh` precondition exists so +that property is never broken again. + +**Refinements pass 4 (2026-05-23).** A fourth audit completed the +failure-closed paths: + +- **Direct Hakyll builds now enforce removals and missing-artifact failures.** + `Archive.hs` reads `removed.yaml`, rejects normalized manifest conflicts + and duplicate archive targets, and aborts if provenance exists without its + artifact. 
`ArchiveIndex.hs` filters the generated index through the live + manifest minus normalized removals, so a stale ignored index cannot retain + archive affordances after a takedown when `archive.py` was skipped. +- **`refresh` verifies the prior bytes before replacing them.** A prior + snapshot must now be present, tracked, clean, and match its recorded + SHA-256 before its hash can be written into `previous-sha256`. +- **Failed refresh restores an originally-absent index state.** If + `data/archive-index.json` did not exist before a failed refresh, any index + created by the attempted fetch is deleted during rollback. + +The genuinely-open questions that remain are collected at the end — the list is +short. + +--- + +## Motivation + +The site cites external work — papers, articles, blog posts, documentation. +Three things go wrong with a plain hyperlink over time: + +1. **Link rot.** The target moves, paywalls, or vanishes. A 2019 essay's + citations decay silently; nobody notices until a reader clicks. +2. **Content drift.** The target stays up but changes. The sentence you quoted + is no longer the sentence at that URL. +3. **Opacity to the site's own machinery.** An external link is invisible to + `Backlinks.hs` (`isPageLink` drops every `http(s)://` URL) and to + `embed.py` (it indexes only `_site/**/*.html`). The site knows nothing about + the things it most often points at. A paper cited by six essays has no page, + no backlinks list, no place in any "Related" set. + +The archive fixes all three by keeping a **local, hosted, immutable snapshot** +of each referenced work, giving it a stable URL on this domain, and making that +URL a first-class citizen of the existing backlinks and similar-pages systems. + +This is deliberately *not* a general web crawler. It archives a curated set: +the things this site references. The author adds a URL to a manifest; the build +does the rest. 
+ +### Relationship to existing pieces + +| Existing piece | What it does | Why the archive is different | +|----------------|--------------|------------------------------| +| `static/papers/` | Hosts Levi's **own** typeset PDFs (`preprint:`, `{{pdf:}}`) | The archive holds **third-party** works. Distinct directory, distinct purpose. Never conflate the two. | +| nginx `popup-proxy.conf` | Caches **metadata** (title/abstract) from arXiv / archive.org / PubMed for hover previews | Caches structured metadata, not documents. A preview accelerator, not preservation. | +| `Backlinks.hs` | Inverts **internal** links into a "who links here" map | Indexes site content only; external URLs are dropped. The archive makes referenced works internal enough to index. | +| `embed.py` / `SimilarLinks.hs` | Semantic "Related" block from `_site/**/*.html` embeddings | Only sees site pages. Archived works become site pages, so they enter the embedding corpus for free. | + +--- + +## Goals + +- **Preservation.** Every referenced work the author chooses to archive has a + byte-for-byte local snapshot that survives the original going dark. +- **Stable hosting.** Each snapshot is reachable at a permanent + `/archive/{slug}/` URL on levineuwirth.org, rendered in site chrome. +- **Hyperlink-able.** Archive URLs are ordinary internal links: usable in + prose, wikilinks, citations, and `further-reading`. +- **Indexed.** Archived works appear in the **backlinks** ("Referenced by") and + **similar-pages** ("Related") systems exactly as native content does — and, + where the source structure allows, granularly by section. +- **Curated, low-friction.** Adding an archive is one line in one manifest. + Everything else — fetch, text extraction, page generation, indexing — is + automatic and build-time. +- **Static-friendly.** Every archive page renders at build time; JS is layered + on, never required. Matches the rest of the site's contract. 
+- **Honest.** Archive pages never impersonate the original. They are framed as + archived copies, link prominently to the source, are kept out of search + engines, and carry a real, advertised removal channel on every page. +- **Safe by default.** No build step ever deletes or overwrites a committed + artifact; destruction and replacement are always explicit, opt-in acts. + +--- + +## Decisions (locked) + +| Topic | Decision | Rationale | +|-------|----------|-----------| +| Trigger | Curated manifest, not auto-crawl | Archives what the site *references*, not the web. Legally and operationally sane. | +| Authored input | One hand-edited file: `archive/manifest.yaml` | One line per archived link. Mirrors `data/commonplace.yaml`'s authoring model. | +| Bibliography seeding | **Rejected** as auto-seeding. `make archive-suggest` prints a "cited but not archived" diff; the author copies lines by hand. | Keeps the manifest the *identity* of the archive, not a cache of the `.bib` files. | +| Per-entry provenance | `archive/{slug}/PROVENANCE.json`, committed — immutable for the current snapshot | An immutability claim that isn't in version control isn't immutable. | +| Mutable state | `data/archive-state.json`, gitignored — link-rot status only | Strict split: immutable facts committed, volatile status disposable. | +| Hakyll input | `data/archive-index.json` — `url` + aliases → slug, written by the tool | Minimal stable shape for the Haskell side; treated like `data/annotations.json`. | +| Missing-index behaviour | `Backlinks.hs` and `Filters/Archive.hs` silently no-op when `archive-index.json` is absent | Preserves the established `.venv`-gated silent-skip convention. The archive degrades to invisible, never to an error. | +| `fetch` idempotence | `fetch` is keyed on `(slug, url)` together; a slug whose recorded URL has changed is refused, not overwritten. `fetch` always rewrites `archive-index.json` to mirror the manifest. 
| A committed artifact is replaced only by an explicit `refresh`, never as a `fetch` side effect. | +| Artifact storage | `archive/{slug}/` at repo root, **committed to git** | A preservation guarantee that depends on an un-versioned store is weaker. Repo stays reproducible. | +| Per-artifact size cap | 25 MB; `archive.py fetch` warns and skips above it; `git add -f` to override deliberately | A 200 MB scan must never land in an auto-commit silently. | +| Storage migration | If `archive/` exceeds ~5 GB or doubles year-over-year, evaluate a separate archive repo / object store. **Never git LFS.** | LFS breaks `git clone → make build` reproducibility — a regression for a preservation system. | +| HTML snapshots | `monolith -j` → one self-contained HTML file; the pinned `monolith` binary is committed at `tools/bin/monolith` | Single static binary, no headless browser. Strips JS. Committing it (vs downloading) removes a network dependency and keeps the build reproducible from a bare clone. | +| PDF snapshots | Direct download via `requests` | Papers are usually clean PDF URLs (arXiv etc.). | +| Display — PDF | The raw `document.pdf` in an ` + $endif$ + $if(is-html)$ + + $endif$ + + $endif$ + + $if(fulltext)$ + $if(is-pdf)$ +
+ Full text (extracted) + $fulltext$ +
+ $endif$ + $if(is-html)$ +
+

Readable text (extracted)

+ $fulltext$ +
+ $endif$ + $endif$ + + $if(referenced-by)$ + + $endif$ + + $if(similar-links)$ + + $endif$ + + $partial("templates/partials/archive-removal-notice.html")$ + + + diff --git a/templates/partials/archive-banner.html b/templates/partials/archive-banner.html new file mode 100644 index 0000000..d08b3ec --- /dev/null +++ b/templates/partials/archive-banner.html @@ -0,0 +1,5 @@ +
+

Archived copy

+

A local preservation snapshot taken $archived$ — this page is not the original.

+ View the original ↗ +
diff --git a/templates/partials/archive-removal-notice.html b/templates/partials/archive-removal-notice.html new file mode 100644 index 0000000..c34a0f5 --- /dev/null +++ b/templates/partials/archive-removal-notice.html @@ -0,0 +1,5 @@ +

+ This is an archived copy, preserved so that a work cited across the site + survives the original going dark. To request removal, email + ln@levineuwirth.org. +

diff --git a/templates/partials/head.html b/templates/partials/head.html index c9eb017..beec3a0 100644 --- a/templates/partials/head.html +++ b/templates/partials/head.html @@ -2,6 +2,7 @@ $if(home)$Levi Neuwirth$else$$if(title)$$title$ — Levi Neuwirth$else$Levi Neuwirth$endif$$endif$ $if(description)$$endif$ +$if(noindex)$$endif$ @@ -49,6 +50,7 @@ $if(build)$$endif$ $if(reading)$$endif$ $if(composition)$$endif$ $if(photography)$$endif$ +$if(archive)$$endif$ $if(photography-map)$$endif$ $if(photography-map)$$endif$ $if(photography-map)$$endif$ diff --git a/tools/archive.py b/tools/archive.py new file mode 100644 index 0000000..2aacb0e --- /dev/null +++ b/tools/archive.py @@ -0,0 +1,1151 @@ +#!/usr/bin/env python3 +""" +archive.py — Build-time link-archiving tool for levineuwirth.org. + +Reads archive/manifest.yaml, fetches any manifest URL that has no local +artifact yet, stores it under archive//, extracts readable text, +writes the per-entry archive//PROVENANCE.json, and (re)writes the +Hakyll input data/archive-index.json. + +Two artifact types: + * pdf — downloaded directly, stored as document.pdf, text via pdftotext. + * html — snapshotted with `monolith` into a single self-contained + snapshot.html (JavaScript stripped, assets inlined as data + URIs), a restrictive Content-Security-Policy injected, + text extracted with BeautifulSoup. 
+ +Subcommands: + fetch download missing artifacts, (re)generate sidecars + index + refresh deliberately re-snapshot a single entry, recording the prior + SHA in the new PROVENANCE.json's `previous-sha256` + wayback submit archived URLs to the Wayback Machine as a second, + independent copy; backfill the capture URL into PROVENANCE.json + check HEAD/GET-probe every manifest URL for link rot, updating + data/archive-state.json with asymmetric hysteresis + gc delete archive// directories listed in archive/removed.yaml + +Failure policy: + * Integrity errors — a committed artifact whose SHA-256 no longer + matches PROVENANCE.json, or a slug whose manifest URL has changed — + print loudly and exit non-zero, halting `make build`. + * Transient errors — a network failure, an over-cap download, a missing + `monolith` binary, a manifest entry missing its `url:` — print a + warning, skip that entry, and exit zero so the build proceeds (the + entry is retried on the next build). + +See ARCHIVE.md for the full design. + +Gated on .venv by the Makefile (same convention as embed.py). Non-stdlib +dependencies: PyYAML and beautifulsoup4, both already in pyproject.toml. +External tools: `pdftotext` (poppler) for PDF text, and the `monolith` +binary — vendored at tools/bin/monolith, see tools/monolith-version.txt. 
+""" + +from __future__ import annotations + +import datetime +import hashlib +import json +import os +import re +import shutil +import subprocess +import sys +import urllib.error +import urllib.request +from pathlib import Path +from urllib.parse import parse_qsl, quote, urlencode, urlparse, urlunparse + +import yaml + +# --------------------------------------------------------------------------- +# Configuration +# --------------------------------------------------------------------------- + +REPO_ROOT = Path(__file__).resolve().parent.parent +ARCHIVE_DIR = REPO_ROOT / "archive" +MANIFEST = ARCHIVE_DIR / "manifest.yaml" +REMOVED = ARCHIVE_DIR / "removed.yaml" +INDEX_OUT = REPO_ROOT / "data" / "archive-index.json" +STATE_OUT = REPO_ROOT / "data" / "archive-state.json" + +ROT_FAILS = 3 # consecutive failed scans before `rotted` is considered +ROT_DAYS = 14 # ... and the streak must also span at least this many days + +SIZE_CAP = 25 * 1024 * 1024 # 25 MB per-artifact cap +TIMEOUT = 60 # seconds, per network request +WAYBACK_TIMEOUT = 120 # seconds — Save Page Now is slow +USER_AGENT = ("levineuwirth.org/archive " + "(ln@levineuwirth.org; removal requests honored)") + +# Per-type on-disk names. The artifact is committed; the .txt is generated +# (gitignored) and regenerated whenever the artifact's SHA-256 changes. +ARTIFACT = {"pdf": "document.pdf", "html": "snapshot.html"} +TEXTFILE = {"pdf": "document.txt", "html": "snapshot.txt"} + +# Injected into every HTML snapshot's . Permits exactly what a +# faithful monolith capture needs — inlined images/fonts as data URIs and +# inline styles (as