diff --git a/.gitignore b/.gitignore index 651615a..e7ca3f8 100644 --- a/.gitignore +++ b/.gitignore @@ -69,10 +69,20 @@ data/similar-links.json data/backlinks.json data/build-stats.json data/build-start.txt +data/build-stamp.txt data/last-build-seconds.txt data/semantic-index.bin data/semantic-meta.json +# Archive: generated text + its staleness stamp (recreated from the +# committed artifact on every build — deterministic, so committing them is +# churn). archive/**/PROVENANCE.json is deliberately NOT ignored — it is +# the committed, immutable record of each archival event. +archive/**/*.txt +archive/**/*.txt.sha256 +data/archive-index.json +data/archive-state.json + # IGNORE.txt is for the local build and need not be synced. IGNORE.txt diff --git a/ARCHIVE.md b/ARCHIVE.md new file mode 100644 index 0000000..0c1d546 --- /dev/null +++ b/ARCHIVE.md @@ -0,0 +1,1535 @@ +# Archive + +Design and implementation plan for the link-archiving system of levineuwirth.org. +This is the source of truth for how external references are preserved, hosted, +displayed, and indexed. It sits alongside `WRITING.md`, `PHOTOGRAPHY.md`, +`HOMEPAGE.md`, and `MARKS.md` as authoritative spec. + +## Status + +**Reviewed and ratified 2026-05-21, with revisions.** The original draft was +reviewed against the live site over three rounds; the decisions below +incorporate every round of deltas and are now locked. + +**Phase 1 complete (2026-05-22).** PDF entries: `archive/manifest.yaml`, +`tools/archive.py` (`fetch` + `gc`), `build/Archive.hs`, the four templates, +and the Makefile / `head.html` / `.gitignore` wiring are built and verified — +`/archive/` and `/archive/nist-fips-203/` render. + +**Phase 2 complete (2026-05-22).** HTML snapshots: the pinned `monolith` +binary is vendored at `tools/bin/monolith`, `archive.py fetch` snapshots HTML +pages (CSP injected, text extracted, quality classified), and `archive.html` +renders them in a sandboxed iframe — `/archive/djb-aes-speed/` renders. 
The +cross-browser CSP check and the per-snapshot review remain author-gated by +design. + +**Archive pages styled (2026-05-22).** `static/css/archive.css` gives the +index and entry pages a framed treatment (banner callout, provenance panel, +artifact viewer); the PDF embed was changed to the raw `document.pdf` (browser- +native viewer), symmetric with HTML snapshots — see the Display — PDF decision. + +**Phase 3 complete (2026-05-22).** Link annotation + Wayback: `Filters/Archive.hs` +appends an archive affordance to body links whose target is archived; +`archive.py wayback` (+ `make archive-wayback`) backfills Wayback captures; +`visibility: private` keeps an entry's artifact in-repo but undeployed. +Bibliography annotation is documented as a `Citations.hs` follow-up. + +**Phase 4 complete (2026-05-22).** Backlinks + similar-pages: `Backlinks.hs` +keeps archived external links and canonicalises them to their `/archive//` +page, so an archived work lists every essay that cites it under "Referenced by" +(grouped by the fragment each citation targets); `archive.html` also carries a +"Related" block from the `embed.py` similarity corpus, which now indexes archive +pages and excludes the `/archive/` index. + +**Phase 5 complete (2026-05-22).** Link-rot detection: `tools/archive.py check` +(+ `make archive-check`) HEAD/GET-probes every manifest URL and updates the +gitignored `data/archive-state.json` under asymmetric hysteresis (`rotted` +needs 3 fails over ≥14 days; a single success recovers immediately). +`Filters.Archive` flips a body link to the archive when its target is `rotted`; +each archive page surfaces its link status (provenance row, header note, +Pagefind `status` filter tag); `/archive/` flags rotted entries; `/build/` +gains a "Link archive" telemetry section. The search-UI `status` filter wiring +in `search-filters.js` is deliberately partial — see the Phase 5 Met note. 
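The asymmetric hysteresis from Phase 5 can be sketched as below. This is a minimal illustration, not the shipped `cmd_check` logic: the function name `update_state`, the state keys (`fail_streak`, `first_fail`), and the intermediate `"ok"` / `"failing"` status names are assumptions. Only the thresholds (3 fails, a span of at least 14 days, immediate recovery on one success) and the `rotted` status come from this document.

```python
import datetime

ROT_FAILS = 3   # consecutive failed probes required before `rotted`
ROT_DAYS = 14   # ... and the streak must span at least this many days


def update_state(state: dict, ok: bool, today: datetime.date) -> dict:
    """Return the next per-URL state after one probe (hypothetical shape)."""
    if ok:
        # Recovery is immediate: a single success clears the whole streak.
        return {"status": "ok", "fail_streak": 0, "first_fail": None}

    streak = state.get("fail_streak", 0) + 1
    # Remember when the current streak began so its span can be measured.
    first_fail = state.get("first_fail") or today.isoformat()
    span = (today - datetime.date.fromisoformat(first_fail)).days

    # Both conditions must hold; a burst of failures in one day is not rot.
    rotted = streak >= ROT_FAILS and span >= ROT_DAYS
    return {
        "status": "rotted" if rotted else "failing",
        "fail_streak": streak,
        "first_fail": first_fail,
    }
```

Requiring both a count and a span means a short outage probed many times never flips a link, while a single successful probe is enough to flip it back.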
+ +**All five phases done.** Refinements next; see the Phase 5 Met note for the +documented deferrals (search-UI status filter; bibliography annotation from +Phase 3; pull-from-Wayback at fetch time). + +**Refinements (2026-05-22).** A code-review pass found and fixed several +correctness and posture issues across the system: + +- **Missing committed artifact no longer re-fetches silently.** `cmd_fetch` + used to skip its SHA guard when the artifact was absent and then download + fresh bytes whose hash differed from the recorded `sha256` — replacing the + recorded snapshot without surfacing it. The guard now also halts when + `PROVENANCE.json` is present but the artifact is missing, requiring the + author to restore the committed bytes before rebuilding. +- **`archive/removed.yaml` is now enforced in `fetch` and `check`.** It was + only read by `gc`. A removed URL re-added to the manifest now halts + `cmd_fetch` loudly; `cmd_check` skips removed URLs so the link-rot + scanner does not keep probing a deliberate takedown. +- **SHA verification closed the `.venv`-bypass hole.** The original + decision relied solely on `archive.py fetch` re-hashing, but that step is + `.venv`-gated — a contributor or deploy host without `.venv`, or a direct + `cabal run site -- build`, would publish a tampered artifact unchecked. + `build/Archive.hs` now also re-hashes via `sha256sum` from + `loadArchiveEntries` and halts the build on a mismatch, so the guarantee + holds independent of the Python step. +- **Raw artifacts are no longer publicly indexable.** Pass 1 added a + `robots.txt` `Disallow: /archive/`, which pass 2 then reverted (see + below — it was counter-productive). Pass 1's other change — injecting + `` into every new HTML + snapshot alongside the archive CSP — remains in place; the + deploy-side header for raw PDFs landed in pass 2 as `nginx/archive.conf`. 
+- **The documented `archive.py refresh {slug}` subcommand is implemented.** + It clears the slug's directory, re-fetches via `cmd_fetch`, and records + the prior `sha256` as `previous-sha256` in the new `PROVENANCE.json`. The + URL-changed error message in `cmd_fetch` now points at it instead of + asking the author to delete the directory by hand. +- **`url_aliases` widened** to the design's full equivalent-URL set: + tracking-parameter stripping (`utm_*`, `fbclid`, `gclid`, `mc_*`, `ref`, + `igshid`, `_hsenc`, `_hsmi`, `mkt_tok`) and arXiv abs / pdf / versioned / + `.pdf` form expansion. Phase 1 had deliberately kept these as a Phase 4 + deferral, but Phase 4 missed the follow-through. +- **`X-Robots-Tag: noarchive` is now honoured on both HEAD and GET.** Some + servers omit the header on HEAD but emit it on GET; HTML capture now + aborts if either response carries the directive. + +Three smaller items remain documented and deferred: + +- **Archive tags joining the site-wide tag indexes.** `manifest.yaml`'s + `tags:` is authored but `Tags.hs`/`Patterns.tagIndexable` does not yet + ingest archive entries — it needs a Tags.hs-side integration with its + own design pass (archive pages aren't `match`ed Hakyll items in the + normal way). +- **`archive.py suggest`** (bibliography discovery — diff `.bib` URLs + against the manifest) is documented but not implemented. +- **The controlled-host end-to-end link-rot test** (reserve + `archive-test.levineuwirth.org`, run it through a 14-day-spanning fail + streak, watch the flip happen) is inherently a multi-week real-world + verification the author runs; the hysteresis logic is unit-tested + deterministically and the rendering side is verified by a hand-crafted + `rotted` state file. 
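The tracking-parameter stripping listed under `url_aliases` above might look roughly like this. It is a sketch under stated assumptions: the helper name `strip_tracking` is hypothetical, and the matching rule (prefix match for the `utm_*` and `mc_*` forms, exact match for the rest) is a reading of the list, not confirmed behaviour of `archive.py`.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

# Names taken from the list above; how they are matched is an assumption.
TRACKING_EXACT = {"fbclid", "gclid", "ref", "igshid", "_hsenc", "_hsmi", "mkt_tok"}
TRACKING_PREFIXES = ("utm_", "mc_")


def strip_tracking(url: str) -> str:
    """Drop tracking query parameters, keeping the others in their order."""
    parts = urlparse(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_EXACT and not key.startswith(TRACKING_PREFIXES)
    ]
    # An empty query is dropped by urlunparse, so no trailing "?" remains.
    return urlunparse(parts._replace(query=urlencode(kept)))
```

For example, `strip_tracking("https://example.org/p?id=7&fbclid=abc&mc_cid=1")` keeps only the `id` parameter.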
+ +**Refinements pass 2 (2026-05-23).** A second code-review pass surfaced +correctness gaps the first pass missed: + +- **`refresh` is now atomic.** It used to delete the slug directory and + then call `cmd_fetch`; a failed re-fetch left the entry with no + snapshot at all, while `refresh` returned 0 (because `cmd_fetch` + reports per-entry skips, not a process failure). The slug directory is + now *renamed* to a `.refresh-backup` sibling; success removes the + backup, any failure restores it. Verified by hiding the `monolith` + binary and confirming the prior snapshot survives intact. +- **Invalid `visibility` values fail closed.** The `ManifestEntry` parser + used to accept any string and only treat the exact `"private"` as + private — a typo like `privte` would publish a work the author intended + to keep offline. The parser now rejects any value other than `public` + or `private`, and `readManifest` halts the build on any parse error of + a present file (instead of warning + returning an empty list — that + silent-skip was for `file absent`, not `file present but corrupt`). +- **Lookup-side URL normalisation.** Alias generation alone cannot cover + unbounded forms (arXiv versions, arbitrary tracking-parameter + combinations). `ArchiveIndex` now normalises both index keys and + lookup inputs through the same `normalizeUrl` (drop fragment, strip + tracking, fold http→https, arXiv-canonicalise, trim trailing slash). + Verified: `https://cr.yp.to/aes-speed.html`, + `https://cr.yp.to/aes-speed.html?utm_source=mail`, and + `http://cr.yp.to/aes-speed.html/` all match the same archived entry. +- **Raw-artifact indexing posture corrected.** The Phase-5 `robots.txt` + `Disallow: /archive/` was counter-productive: a URL blocked by + robots.txt can still appear in results when externally linked, and the + Disallow also prevents compliant crawlers from reading the wrapper + pages' ``. 
The Disallow is reverted; a new + `nginx/archive.conf` snippet emits `X-Robots-Tag: noindex, noarchive` + for the whole `/archive/` tree, which crawlers honour for any resource + (HTML and PDF alike). The deploy vhost should `include + snippets/archive.conf`. +- **`cmd_wayback` skips `removed.yaml`.** The eviction procedure says + record in `removed.yaml` *before* dropping the manifest line; `fetch` + and `check` now honour that ordering, but `wayback` did not. A removed + entry whose manifest line was still in place could be submitted to a + third-party archive after a takedown was recorded. +- **The shipped HTML snapshot was refreshed in the working tree** so it + carries the noarchive meta the Phase-5 inject promises. `archive.py + refresh djb-aes-speed` re-fetched cr.yp.to, applied + `inject_archive_metas`, and recorded the prior SHA as `previous-sha256`. + `archive/djb-aes-speed/{snapshot.html, PROVENANCE.json}` now reflect the + new bytes; matching SHA is verified by `Archive.hs`. *Caveat surfaced + in pass 3 (below): the prior snapshot was not committed at the moment + of this refresh, so its bytes are no longer recoverable via `git log + -S`. A pass-3 fix to `refresh` now refuses to replace an uncommitted + prior, but the historical artifact survives — `previous-sha256` + records a hash whose bytes this working tree cannot reproduce.* +- **The URL-changed error in `cmd_fetch`** now points at + `archive.py refresh {slug}` instead of asking the author to delete the + directory by hand. + +Tag integration remains the one deferred refinement (it needs a Tags.hs +design pass). + +**Refinements pass 3 (2026-05-23).** A third audit surfaced gaps the pass-2 +fixes didn't fully close: + +- **`refresh` refuses to replace an uncommitted prior snapshot.** Pass 2 + preserved a prior snapshot through *failed* re-fetches, but a *successful* + one happily discarded uncommitted bytes — `previous-sha256` then pointed + at a hash no `git log -S` could recover. 
Pass 3 shells out to `git + ls-files` + `git diff --quiet HEAD` and refuses the refresh unless both + the prior PROVENANCE.json and its artifact are tracked and clean. +- **`refresh` is atomic across *every* exit path.** Pass 2 handled the + ordinary `cmd_fetch returns 0 but the artifact wasn't produced` case but + not fatal `sys.exit`s (e.g. a `removed.yaml` conflict halting `cmd_fetch` + mid-refresh) nor mid-refresh exceptions, and it never rolled back the + `data/archive-index.json` rewrite. The work is now wrapped in + `try/finally` that restores both the slug directory and the index on any + exit path — normal failure, `SystemExit`, `KeyboardInterrupt`, or + exception. +- **Removal enforcement now uses the same equivalence as link matching.** + Pass 2 introduced `normalizeUrl` for incoming citations but compared + removals as literal URL strings, so a tracking-laden manifest URL could + bypass a takedown. Python gains `normalize_url` mirroring the Haskell + helper, and `fetch` / `check` / `wayback` compare normalised forms. + `cmd_fetch` additionally rejects two manifest entries whose canonical + forms collide — that would otherwise route both under one slug. +- **`fetch_html` honours `X-Robots-Tag: noarchive` on the captured GET too.** + Pass 1 added HEAD + ranged-GET probes, but a server can emit the header + only on the full document response. The Python tool now downloads that + response itself, checks its header and body directives, then passes those + exact bytes to `monolith --base-url ... -` so the saved snapshot is not + obtained through a second unobservable document request. +- **`nginx/archive.conf` is wired into the deploy template** and + re-`include`s `security-headers.conf` inside its `location` block. 
+ `nginx/vhost.conf.example` now includes `archive.conf`; the snippet + itself re-emits the baseline headers because nginx's `add_header` chain + is inherited from a parent only when the current context declares *no* + `add_header` directives — without the re-include, /archive/ would lose + HSTS, CSP, etc. +- **Contract doc cleanups.** The Phase-5 paragraph claiming `robots.txt` + disallows `/archive/` is reworded to acknowledge the pass-2 reversal; + the Phase-1 checkbox claiming `Archive.hs` does not re-hash is updated + to point at `verifyArtifactSha`; the pass-2 note about the refreshed + djb snapshot now carries the caveat that its prior bytes were + uncommitted and are therefore unrecoverable. + +The historical `previous-sha256` value in `archive/djb-aes-speed/ +PROVENANCE.json` is left in place: it is a truthful record that *a* prior +snapshot existed and what its hash was. It just is not recoverable from +git in this working tree — the pass-3 `refresh` precondition exists so +that property is never broken again. + +**Refinements pass 4 (2026-05-23).** A fourth audit completed the +failure-closed paths: + +- **Direct Hakyll builds now enforce removals and missing-artifact failures.** + `Archive.hs` reads `removed.yaml`, rejects normalized manifest conflicts + and duplicate archive targets, and aborts if provenance exists without its + artifact. `ArchiveIndex.hs` filters the generated index through the live + manifest minus normalized removals, so a stale ignored index cannot retain + archive affordances after a takedown when `archive.py` was skipped. +- **`refresh` verifies the prior bytes before replacing them.** A prior + snapshot must now be present, tracked, clean, and match its recorded + SHA-256 before its hash can be written into `previous-sha256`. +- **Failed refresh restores an originally-absent index state.** If + `data/archive-index.json` did not exist before a failed refresh, any index + created by the attempted fetch is deleted during rollback. 
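The rename-then-restore shape that passes 2 through 4 converge on for `refresh` can be sketched as follows. This is an illustration only: `refresh_atomically`, its parameters, and the exact backup naming are assumptions, the `refetch` callable stands in for the real `cmd_fetch` call, and the git-tracked and SHA preconditions described above are omitted.

```python
import shutil
from pathlib import Path
from typing import Callable


def refresh_atomically(slug_dir: Path, index: Path,
                       refetch: Callable[[], bool]) -> bool:
    """Re-snapshot slug_dir, restoring the prior state on any failure."""
    # Assumed naming for the `.refresh-backup` sibling.
    backup = slug_dir.with_name(slug_dir.name + ".refresh-backup")
    # None records that the index did not exist before the refresh.
    prior_index = index.read_bytes() if index.exists() else None
    slug_dir.rename(backup)
    ok = False
    try:
        # Success means the fetch reported success AND produced the directory.
        ok = refetch() and slug_dir.exists()
        return ok
    finally:
        if ok:
            shutil.rmtree(backup)
        else:
            # Reached on a False return, an exception, SystemExit,
            # or KeyboardInterrupt alike.
            shutil.rmtree(slug_dir, ignore_errors=True)
            backup.rename(slug_dir)
            if prior_index is None:
                index.unlink(missing_ok=True)  # index was originally absent
            else:
                index.write_bytes(prior_index)
```

Putting the rollback in `finally` rather than in an `except` clause is what covers `SystemExit` and `KeyboardInterrupt`, which a plain `except Exception` would miss.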
+ +The genuinely-open questions that remain are collected at the end — the list is +short. + +--- + +## Motivation + +The site cites external work — papers, articles, blog posts, documentation. +Three things go wrong with a plain hyperlink over time: + +1. **Link rot.** The target moves, paywalls, or vanishes. A 2019 essay's + citations decay silently; nobody notices until a reader clicks. +2. **Content drift.** The target stays up but changes. The sentence you quoted + is no longer the sentence at that URL. +3. **Opacity to the site's own machinery.** An external link is invisible to + `Backlinks.hs` (`isPageLink` drops every `http(s)://` URL) and to + `embed.py` (it indexes only `_site/**/*.html`). The site knows nothing about + the things it most often points at. A paper cited by six essays has no page, + no backlinks list, no place in any "Related" set. + +The archive fixes all three by keeping a **local, hosted, immutable snapshot** +of each referenced work, giving it a stable URL on this domain, and making that +URL a first-class citizen of the existing backlinks and similar-pages systems. + +This is deliberately *not* a general web crawler. It archives a curated set: +the things this site references. The author adds a URL to a manifest; the build +does the rest. + +### Relationship to existing pieces + +| Existing piece | What it does | Why the archive is different | +|----------------|--------------|------------------------------| +| `static/papers/` | Hosts Levi's **own** typeset PDFs (`preprint:`, `{{pdf:}}`) | The archive holds **third-party** works. Distinct directory, distinct purpose. Never conflate the two. | +| nginx `popup-proxy.conf` | Caches **metadata** (title/abstract) from arXiv / archive.org / PubMed for hover previews | Caches structured metadata, not documents. A preview accelerator, not preservation. | +| `Backlinks.hs` | Inverts **internal** links into a "who links here" map | Indexes site content only; external URLs are dropped. 
The archive makes referenced works internal enough to index. | +| `embed.py` / `SimilarLinks.hs` | Semantic "Related" block from `_site/**/*.html` embeddings | Only sees site pages. Archived works become site pages, so they enter the embedding corpus for free. | + +--- + +## Goals + +- **Preservation.** Every referenced work the author chooses to archive has a + byte-for-byte local snapshot that survives the original going dark. +- **Stable hosting.** Each snapshot is reachable at a permanent + `/archive/{slug}/` URL on levineuwirth.org, rendered in site chrome. +- **Hyperlink-able.** Archive URLs are ordinary internal links: usable in + prose, wikilinks, citations, and `further-reading`. +- **Indexed.** Archived works appear in the **backlinks** ("Referenced by") and + **similar-pages** ("Related") systems exactly as native content does — and, + where the source structure allows, granularly by section. +- **Curated, low-friction.** Adding an archive is one line in one manifest. + Everything else — fetch, text extraction, page generation, indexing — is + automatic and build-time. +- **Static-friendly.** Every archive page renders at build time; JS is layered + on, never required. Matches the rest of the site's contract. +- **Honest.** Archive pages never impersonate the original. They are framed as + archived copies, link prominently to the source, are kept out of search + engines, and carry a real, advertised removal channel on every page. +- **Safe by default.** No build step ever deletes or overwrites a committed + artifact; destruction and replacement are always explicit, opt-in acts. + +--- + +## Decisions (locked) + +| Topic | Decision | Rationale | +|-------|----------|-----------| +| Trigger | Curated manifest, not auto-crawl | Archives what the site *references*, not the web. Legally and operationally sane. | +| Authored input | One hand-edited file: `archive/manifest.yaml` | One line per archived link. Mirrors `data/commonplace.yaml`'s authoring model. 
| +| Bibliography seeding | **Rejected** as auto-seeding. `make archive-suggest` prints a "cited but not archived" diff; the author copies lines by hand. | Keeps the manifest the *identity* of the archive, not a cache of the `.bib` files. | +| Per-entry provenance | `archive/{slug}/PROVENANCE.json`, committed — immutable for the current snapshot | An immutability claim that isn't in version control isn't immutable. | +| Mutable state | `data/archive-state.json`, gitignored — link-rot status only | Strict split: immutable facts committed, volatile status disposable. | +| Hakyll input | `data/archive-index.json` — `url` + aliases → slug, written by the tool | Minimal stable shape for the Haskell side; treated like `data/annotations.json`. | +| Missing-index behaviour | `Backlinks.hs` and `Filters/Archive.hs` silently no-op when `archive-index.json` is absent | Preserves the established `.venv`-gated silent-skip convention. The archive degrades to invisible, never to an error. | +| `fetch` idempotence | `fetch` is keyed on `(slug, url)` together; a slug whose recorded URL has changed is refused, not overwritten. `fetch` always rewrites `archive-index.json` to mirror the manifest. | A committed artifact is replaced only by an explicit `refresh`, never as a `fetch` side effect. | +| Artifact storage | `archive/{slug}/` at repo root, **committed to git** | A preservation guarantee that depends on an un-versioned store is weaker. Repo stays reproducible. | +| Per-artifact size cap | 25 MB; `archive.py fetch` warns and skips above it; `git add -f` to override deliberately | A 200 MB scan must never land in an auto-commit silently. | +| Storage migration | If `archive/` exceeds ~5 GB or doubles year-over-year, evaluate a separate archive repo / object store. **Never git LFS.** | LFS breaks `git clone → make build` reproducibility — a regression for a preservation system. 
| +| HTML snapshots | `monolith -j` → one self-contained HTML file; the pinned `monolith` binary is committed at `tools/bin/monolith` | Single static binary, no headless browser. Strips JS. Committing it (vs downloading) removes a network dependency and keeps the build reproducible from a bare clone. | +| PDF snapshots | Direct download via `requests` | Papers are usually clean PDF URLs (arXiv etc.). | +| Display — PDF | The raw `document.pdf` in an ` + $endif$ + $if(is-html)$ + + $endif$ + + $endif$ + + $if(fulltext)$ + $if(is-pdf)$ +
+ Full text (extracted) + $fulltext$ +
+ $endif$ + $if(is-html)$ +
+

Readable text (extracted)

+ $fulltext$ +
+ $endif$ + $endif$ + + $if(referenced-by)$ + + $endif$ + + $if(similar-links)$ + + $endif$ + + $partial("templates/partials/archive-removal-notice.html")$ + + + diff --git a/templates/partials/archive-banner.html b/templates/partials/archive-banner.html new file mode 100644 index 0000000..d08b3ec --- /dev/null +++ b/templates/partials/archive-banner.html @@ -0,0 +1,5 @@ +
+

Archived copy

+

A local preservation snapshot taken $archived$ — this page is not the original.

+ View the original ↗ +
diff --git a/templates/partials/archive-removal-notice.html b/templates/partials/archive-removal-notice.html new file mode 100644 index 0000000..c34a0f5 --- /dev/null +++ b/templates/partials/archive-removal-notice.html @@ -0,0 +1,5 @@ +

+ This is an archived copy, preserved so that a work cited across the site + survives the original going dark. To request removal, email + ln@levineuwirth.org. +

diff --git a/templates/partials/head.html b/templates/partials/head.html index c9eb017..beec3a0 100644 --- a/templates/partials/head.html +++ b/templates/partials/head.html @@ -2,6 +2,7 @@ $if(home)$Levi Neuwirth$else$$if(title)$$title$ — Levi Neuwirth$else$Levi Neuwirth$endif$$endif$ $if(description)$$endif$ +$if(noindex)$$endif$ @@ -49,6 +50,7 @@ $if(build)$$endif$ $if(reading)$$endif$ $if(composition)$$endif$ $if(photography)$$endif$ +$if(archive)$$endif$ $if(photography-map)$$endif$ $if(photography-map)$$endif$ $if(photography-map)$$endif$ diff --git a/tools/archive.py b/tools/archive.py new file mode 100644 index 0000000..2aacb0e --- /dev/null +++ b/tools/archive.py @@ -0,0 +1,1151 @@ +#!/usr/bin/env python3 +""" +archive.py — Build-time link-archiving tool for levineuwirth.org. + +Reads archive/manifest.yaml, fetches any manifest URL that has no local +artifact yet, stores it under archive//, extracts readable text, +writes the per-entry archive//PROVENANCE.json, and (re)writes the +Hakyll input data/archive-index.json. + +Two artifact types: + * pdf — downloaded directly, stored as document.pdf, text via pdftotext. + * html — snapshotted with `monolith` into a single self-contained + snapshot.html (JavaScript stripped, assets inlined as data + URIs), a restrictive Content-Security-Policy injected, + text extracted with BeautifulSoup. 
+ +Subcommands: + fetch download missing artifacts, (re)generate sidecars + index + refresh deliberately re-snapshot a single entry, recording the prior + SHA in the new PROVENANCE.json's `previous-sha256` + wayback submit archived URLs to the Wayback Machine as a second, + independent copy; backfill the capture URL into PROVENANCE.json + check HEAD/GET-probe every manifest URL for link rot, updating + data/archive-state.json with asymmetric hysteresis + gc delete archive// directories listed in archive/removed.yaml + +Failure policy: + * Integrity errors — a committed artifact whose SHA-256 no longer + matches PROVENANCE.json, or a slug whose manifest URL has changed — + print loudly and exit non-zero, halting `make build`. + * Transient errors — a network failure, an over-cap download, a missing + `monolith` binary, a manifest entry missing its `url:` — print a + warning, skip that entry, and exit zero so the build proceeds (the + entry is retried on the next build). + +See ARCHIVE.md for the full design. + +Gated on .venv by the Makefile (same convention as embed.py). Non-stdlib +dependencies: PyYAML and beautifulsoup4, both already in pyproject.toml. +External tools: `pdftotext` (poppler) for PDF text, and the `monolith` +binary — vendored at tools/bin/monolith, see tools/monolith-version.txt. 
+""" + +from __future__ import annotations + +import datetime +import hashlib +import json +import os +import re +import shutil +import subprocess +import sys +import urllib.error +import urllib.request +from pathlib import Path +from urllib.parse import parse_qsl, quote, urlencode, urlparse, urlunparse + +import yaml + +# --------------------------------------------------------------------------- +# Configuration +# --------------------------------------------------------------------------- + +REPO_ROOT = Path(__file__).resolve().parent.parent +ARCHIVE_DIR = REPO_ROOT / "archive" +MANIFEST = ARCHIVE_DIR / "manifest.yaml" +REMOVED = ARCHIVE_DIR / "removed.yaml" +INDEX_OUT = REPO_ROOT / "data" / "archive-index.json" +STATE_OUT = REPO_ROOT / "data" / "archive-state.json" + +ROT_FAILS = 3 # consecutive failed scans before `rotted` is considered +ROT_DAYS = 14 # ... and the streak must also span at least this many days + +SIZE_CAP = 25 * 1024 * 1024 # 25 MB per-artifact cap +TIMEOUT = 60 # seconds, per network request +WAYBACK_TIMEOUT = 120 # seconds — Save Page Now is slow +USER_AGENT = ("levineuwirth.org/archive " + "(ln@levineuwirth.org; removal requests honored)") + +# Per-type on-disk names. The artifact is committed; the .txt is generated +# (gitignored) and regenerated whenever the artifact's SHA-256 changes. +ARTIFACT = {"pdf": "document.pdf", "html": "snapshot.html"} +TEXTFILE = {"pdf": "document.txt", "html": "snapshot.txt"} + +# Injected into every HTML snapshot's . Permits exactly what a +# faithful monolith capture needs — inlined images/fonts as data URIs and +# inline styles (as